Create and apply patch from a github pull request

You have a local clone of a GitHub repository, somebody created a pull request against this repository, and you would like to apply it to your clone before the maintainer of the origin actually merges it.

How to do that?

It is actually surprisingly neat. Look at the URL of a GitHub PR:

https://github.com/torvalds/linux/pull/42

You can just add ‘.patch’ at the end of the URL to get a nicely formatted email patch:

https://github.com/torvalds/linux/pull/42.patch

From there on, you have a few options. If you download the patch (say, as pr.patch) at the root of your clone, you can apply it, commits included:

git am ./pr.patch

If you want to apply the code changes without actually applying the commits, you can use your trusty old patch command:

patch -p1 < ./pr.patch

If you are lazy (as my director of studies always said, ‘laziness drives progress’), you can do it all in one line:

wget -q -O - 'https://github.com/torvalds/linux/pull/42.patch' | git am
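
A variation with curl, if you want a dry run first. This is just a sketch using the same example PR as above; git apply --check only verifies that the patch would apply cleanly, without changing anything:

# dry run: make sure the PR patch applies cleanly on top of your current checkout
curl -sL 'https://github.com/torvalds/linux/pull/42.patch' | git apply --check
# if that is silent, apply it for real, commits included
curl -sL 'https://github.com/torvalds/linux/pull/42.patch' | git am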

Vertica upgrade or install: Default shell must be set to bash

When upgrading Vertica, the installer checks whether bash is the default shell for the dbadmin user, and complains loudly if this is not the case:

Error: Default shell on the following nodes are not bash. Default shell must be set to bash.
10.0.0.1 for dbadmin:
10.0.0.2 for dbadmin:
Exiting...
Installation FAILED with errors.

Of course, if the shell is indeed not bash, you can fix it by running

chsh -s /bin/bash

on the relevant nodes as the dbadmin user.
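
To double check what the system actually reports as the login shell of dbadmin, something like this helps (assuming a standard passwd/getent setup):

# print the login shell recorded for the dbadmin user
getent passwd dbadmin | cut -d: -f7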

In some cases bash is indeed the shell, but due to ssh or sudo configuration the installer does not see it. In that case, edit the /etc/sudoers file with visudo and add the following lines:

Defaults:dbadmin !requiretty
dbadmin ALL=(ALL) NOPASSWD:ALL

Those lines are only needed at install time, and can be reverted afterwards.
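
You can check that the sudo rules are in place with sudo -l, which lists the privileges and defaults applying to a user:

# run as root: list the sudo privileges granted to dbadmin
sudo -l -U dbadmin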

Replace a failed node in Vertica

Just a little semantic reminder before we dive in:

  • A host is a server on which Vertica is set up, but not necessarily used by a database. You can add hosts to a cluster to make them available to a database.
  • A node is a host that is part of a database.

If one of your nodes goes down but the server is still up (I am thinking of a data disk failure), the Vertica documentation says it is possible to replace it with another node with another IP address. This case never presented itself to me, so I will trust the documentation on that.

I had the issue of a dead host, though. In that case, the documentation is not enough. As part of the process of replacing a node, you need to add a new host to the cluster. While doing this via the update_vertica utility, Vertica will check connectivity between all hosts of the cluster. As one host is down, the installation will fail.

In that case the solution is not trivial, but it is quite straightforward, and the goal of this post is to explain it step by step.

  1. System preparation
  2. Update existing node info in the catalog via vsql
  3. Update admintools.conf
  4. Install Vertica on the new server
  5. Configure new node
  6. Restart new node

1 – System preparation

On the replacement server, make sure that Vertica is not installed and that /opt/vertica does not exist:

yum remove vertica
rm -rf /opt/vertica

2 – Update existing node info in the catalog via vsql

Run the following in vsql on any UP node, where:

  • <failed_node_name> is the name of the node you want to replace, taken from the node_name column of v_catalog.nodes
  • <newip> is the IP address of the replacement host

-- change the node IP
ALTER NODE <failed_node_name> HOSTNAME '<newip>';
-- change the node's spread/control IP
ALTER NODE <failed_node_name> CONTROL HOSTNAME '<newip>';
-- re-write spread.conf with the new IP address and reload the running config
-- (the database should remain UP)
SELECT RELOAD_SPREAD(true);
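
For instance, with the example used in the rest of this post (node v_spil_dwh_node0002 moving from 10.0.0.2 to 10.0.0.42), this could be run from the shell of an UP node as follows. A sketch only; adapt the node name to what v_catalog.nodes actually shows for you:

# run as dbadmin on any UP node; node name and IPs are the example values of this post
vsql -c "ALTER NODE v_spil_dwh_node0002 HOSTNAME '10.0.0.42';"
vsql -c "ALTER NODE v_spil_dwh_node0002 CONTROL HOSTNAME '10.0.0.42';"
vsql -c "SELECT RELOAD_SPREAD(true);"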

3 – Update admintools.conf

This must be done on an UP node. Any node will do, and in this post we will call it <source_host>.

Edit the file /opt/vertica/config/admintools.conf, replacing all instances of the old IP address with the new one. This means that there are 3 lines to update:

  • [Cluster] > hosts
  • [Nodes]: 2 lines, the one starting with the node number (node0002 in the example below) and the one starting with the full node name (v_spil_dwh_node0002).

For instance, assume we are replacing node2 of a 3-node cluster, moving from IP 10.0.0.2 to IP 10.0.0.42.

Before, the relevant lines of admintools.conf were:

[Cluster]
hosts = 10.0.0.1,10.0.0.2,10.0.0.3

[Nodes]
node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin

After, notice the new IP address in each of the three lines:

[Cluster]
hosts = 10.0.0.1,10.0.0.42,10.0.0.3

[Nodes]
node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin
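
If you prefer not to edit the file by hand, a GNU sed one-liner along these lines does it; the escaped dots and the word boundaries are there so that 10.0.0.2 does not also match, say, 10.0.0.20 or 110.0.0.2:

# on <source_host>: swap the old IP for the new one, keeping a .bak backup
sed -i.bak 's/\b10\.0\.0\.2\b/10.0.0.42/g' /opt/vertica/config/admintools.conf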

4 – Install Vertica on the new server

On the same host used in the previous step, <source_host>, use update_vertica to add the new host to the cluster. The mandatory options are --rpm and the options you used at install time (which you can find in /opt/vertica/config/admintools.conf), or the path to your config file (--config-file/-z) if you used one.

Do NOT use the -S/--control-network, -A/--add-hosts or -R/--remove-hosts switches. You most likely will use -u/--dba-user, -g/--dba-group, -p/--dba-user-password and maybe a few more.

sudo /opt/vertica/sbin/update_vertica --rpm <complete path of RPM> -u <user> -g <group> ...

This script will verify all hosts, and will install the rpm on the new one.

5 – Configure new node

Log in as dbadmin (or whichever user is your database administrator) on the new node, and recreate the base and data directories as they were on the failed node. Assuming that:

  • the failed node was node2,
  • your database is named $dwh,
  • your base directory is /home/dbadmin,

Then create the following:

mkdir /home/dbadmin/$dwh
mkdir /home/dbadmin/$dwh/v_${dwh}_node0002_data
mkdir /home/dbadmin/$dwh/v_${dwh}_node0002_catalog

You can look at any UP node for an example of the hierarchy.
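
For instance, to compare with the layout on the first node (IP taken from the example above; this assumes passwordless ssh between nodes for dbadmin, which a Vertica cluster needs anyway):

# list the database directory of an UP node, to mirror its layout on the new node
ssh dbadmin@10.0.0.1 "ls -l /home/dbadmin/$dwh"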

Still from the node where you edited the config file, <source_host>, distribute the configuration files via admintools:

  • run as dbadmin user /opt/vertica/bin/admintools
  • go to Configuration Menu > Distribute Config Files
  • select Database Configuration and Admintools Meta-Data.

If you cannot find spread.conf under /home/dbadmin/$dwh/v_${dwh}_node0002_catalog, copy it over from <source_host>. You can check that spread.conf now has the IP 10.0.0.42 instead of 10.0.0.2.

Finally, as a last sanity check, have a look at /opt/vertica/config/admintools.conf and make sure that the new IP appears instead of the old one.
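
As a quick way to run both checks from the new node (paths and IPs being the example values of this post):

# the new IP should show up both in spread.conf and in admintools.conf
grep '10.0.0.42' /home/dbadmin/$dwh/v_${dwh}_node0002_catalog/spread.conf
grep '10.0.0.42' /opt/vertica/config/admintools.conf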

6 – Restart new node

Use admintools (/opt/vertica/bin/admintools) and select “Restart Vertica on Host”. The node will then start the recovery process. All missing data will be copied to it, and once this is done it will rejoin the cluster, which will be complete again.

Vertica: move_partitions_to_table()

If you have a load in multiple steps (say, data is first loaded into a staging schema, then moved to another one), using MOVE_PARTITIONS_TO_TABLE() is the best option. It is very fast, as you do not need to actually issue INSERT or DELETE statements or deal with delete vectors. Vertica just moves a block of data in one go, and you are done.

When you call this function, Vertica will first issue a moveout to empty the WOS before moving data. This makes sense: if data is still loaded in memory, you want to make sure it is moved along with the rest of the partition already on disk.

Since at least Vertica 7.0.1-1, using this function might give you the error message:

select move_partitions_to_table ('stg.A', 0, 24, 'vault.A')
ERROR: A Moveout operation is already in progress on projection vault.B_node0002

As you can see, a moveout on table A fails because of a moveout on table B, which does not seem to make sense.

It has been confirmed that it indeed does not make sense: it is a bug in Vertica 7.0.1-1, which should in theory be fixed in the next service pack.

In the meantime, you can work around the bug by issuing INSERTs/DELETEs instead, or by having your code check whether a moveout is already running before issuing the move_partitions_to_table() statement.
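
A sketch of such a check, assuming the v_monitor.tuple_mover_operations table exposes an is_executing flag as it does in the 7.x releases I have used:

# wait until no moveout is running any more, then move the partitions
while [ "$(vsql -At -c "SELECT count(*) FROM v_monitor.tuple_mover_operations WHERE operation_name = 'Moveout' AND is_executing;")" != "0" ]; do
    sleep 10
done
vsql -c "SELECT move_partitions_to_table('stg.A', 0, 24, 'vault.A');"

There is of course still a small window for a race between the check and the call, but in practice it shrinks the problem considerably.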

EDIT: The hotfix 7.1.1-1 (11/2014) fixes the bug.

Avro end to end in hdfs – part 4: problems and solutions

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided into a few posts; more will come if relevant.

  1. Why avro?
  2. How to set up avro in flume
  3. How to use avro with hive
  4. Problems and solutions (This post)

Invalid/non standard schemas

The avro tools available for different languages are not all exactly equivalent. The default one for java used in Hadoop, for instance, has issues when some fields can be set to null. Nested arrays are another issue in a lot of cases: the default avro parser for java cannot handle them properly. Furthermore, if you do find a way to generate avro files with nested arrays, some tools will not be able to read them. Hive will be fine, but Impala (as of version 1.2) is not able to read them.

I can only urge you to use simple schemas; this will make your life a lot easier.

Hive partitions and schema changes

If you use Hive partitions (and you should), all data in one specific partition must have the same schema. We used to have one partition per hour when loading some logs, but now we are adding the avro schema version to the partition path. That way, data encoded with a new schema ends up in a different partition even if it relates to the same hour.

Faster encoding and flexibility

We started loading data the standard way, via flume. This created a lot of issues as explained earlier (nested arrays mostly), and flume was using a lot of resources. We ended up using the json2avro C tool, which is very fast and can handle nested arrays (but this bit us later because of Impala). This tool generates avro files which we load into hdfs via an hdfs fuse mount point. This improved performance drastically. Since we have been using this fuse mount point, we have had no data loading issues or delays, whereas we had trouble every other week with flume.

Default values

We started by writing a schema with default values, expecting missing fields in the incoming JSON to be filled in automatically. Sadly, we ended up noticing that this is not how it works: the JSON encoding is only a convenient representation of data, useful for debugging, and is not the main purpose of avro.

This means that a null source field can be represented this way in avro's JSON encoding:

{"valid": {"boolean": true}, "source": null}

but a JSON document actually missing this field is not valid, default value or not.
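
You can see this for yourself with avro-tools, the command line jar shipped with avro (the file names here are just placeholders):

# convert a JSON document to an avro file according to a schema; this fails if a
# field declared in the schema is simply absent from the JSON, default value or not
java -jar avro-tools.jar fromjson --schema-file event.avsc event.json > event.avro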

Build a rpm for pentaho pdi (kettle)

To nicely deploy pdi at $work, I wanted to have it in our yum repositories. For this I used the fantastic fpm, the Effing Package Manager, which enables you to build rpms without having to deal with complex spec files. In short, you tell it that you want to build an rpm from a directory, no other options are mandatory, and it just works (although a few options are nice to tweak).

If you use my startup script, you can even add it to the rpm. The final command is something like:

fpm -s dir -t rpm \
  --name pentaho-pdi \
  --version 4.3.0 \
  --depends jdk \
  --vendor 'me@thisdwhguy.com' \
  --url 'https://github.com/pentaho/pentaho-kettle' \
  --description 'Pentaho pdi kettle' \
  --maintainer 'Me <me@thisdataguy.com>' \
  --license 'Apache 2.0' \
  --epoch 1 \
  --directories /opt/pentaho_pdi \
  --rpm-user pentaho \
  --rpm-group pentaho \
  --architecture all \
  --after-install=after-install.sh \
  ./pentaho_pdi=/opt \
  ./carte=/etc/init.d/carte

It will probably not be exactly what you want regarding paths and user. Furthermore, the after-install script needs to be generated (it just sets up ownership and rights of /etc/init.d/carte).

To make it easier, I created a small bash script with a few configuration variables and a few extra checks (mysql and vertica jars) which makes building very easy. You can just get this script, remove the checks if they are irrelevant to you, and you should be good to go. The script will even install fpm for you if needed.

#!/bin/bash

if [ "x$1" == "x" ]; then
  echo "Need one parameter: the pentaho version string (eg 5.2.0.1)"
  exit 1;
else
  PDIVERSION=$1
fi

# name of the directory where pdi will be installed
PDIDIR=pentaho_pdi
# user to own the pdi files
PDIUSER=pentaho
# root of where pdi will be installed
PDIROOT=/opt


if ! which fpm 1>/dev/null 2>/dev/null; then
    echo "fpm is not installed. I will try to do it myself"
    echo "Installing relevant rpms..."
    sudo yum install -y ruby-devel gcc
    echo "Installing the fpm gem..."
    sudo gem install fpm
    if ! which fpm 1>/dev/null 2>/dev/null; then
        echo "failed installing fpm, please do it yourself: https://github.com/jordansissel/fpm"
        exit 1
    fi
else
    echo "fpm installed, good."
fi

if [ ! -d "$PDIDIR" ]; then
    echo "I expect a directory called $PDIDIR."
    echo "It is the 'dist' directory built from source renamed as $PDIDIR."
    echo "Look at https://github.com/pentaho/pentaho-kettle"
    exit 1
else
    echo "$PDI_DIR directory exists, good."
fi

ERRORS=0

find $PDIDIR -name \*mysql\*.jar | grep -qE '.*'
if [[ $? -ne 0 ]]; then
    echo  "Download the mysql jar from http://dev.mysql.com/downloads/connector/j/ and put it in the libext/JDBC (&lt;5.0) or lib (&gt;= 5.0) subdirectory of $PDIDIR."
    ERRORS=1
else
    echo "Mysql jar present in $PDIDIR, good."
fi

find $PDIDIR -name \*vertica\*.jar | grep -qE '.*'
if [[ $? -ne 0 ]]; then
    echo  "Get the vertica jar from /opt/vertica and put it in the libext/JDBC (&lt;5.0) or lib (&gt;= 5.0) subdirectory of $PDIDIR."
    ERRORS=1
else
    echo "Vertica jar present in $PDIDIR, good."
fi

if [[ $ERRORS -eq 1 ]]; then
    exit 1
fi

# the init.d script will be installed as $PDIUSER, whereas it should be root

# Check that carte init script exists, if yes add it to the options
if [ -f ./carte ]; then
(cat << EOC
#!/usr/bin/env sh
chown root:root /etc/init.d/carte
chmod 744 /etc/init.d/carte
chkconfig --add carte
EOC
) > ./after-install.sh
    echo "After install script for carte generated at after-install.sh"
    CARTEOPTIONS="--after-install=after-install.sh ./carte=/etc/init.d/carte"
else
    CARTEOPTIONS=""
    echo "No Carte init.d script present."
fi


# All good, let's build
echo "Build the effing rpm, removing existing rpms first..."
rm -f pentaho-pdi*rpm
fpm -s dir -t rpm \
  --name pentaho-pdi \
  --version $PDIVERSION \
  --depends jdk \
  --vendor 'me@thisdataguy.com' \
  --url 'https://github.com/pentaho/pentaho-kettle' \
  --description 'Pentaho pdi kettle' \
  --maintainer 'me@thisdataguy.com' \
  --license 'Apache 2.0' \
  --epoch 1 \
  --directories $PDIROOT/$PDIDIR \
  --rpm-user $PDIUSER \
  --rpm-group $PDIUSER \
  --architecture all $CARTEOPTIONS \
  ./$PDIDIR=$PDIROOT

rm -f after-install.sh

This will create a pentaho-pdi-${PDIVERSION}.noarch.rpm which you can just yum install or push to your yum repositories.
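
Usage then boils down to something like this (build_pdi_rpm.sh being whatever name you saved the script under, and the version string just an example):

# build the rpm for a given pdi version, then install it locally for a quick test
./build_pdi_rpm.sh 5.2.0.1
sudo yum localinstall ./pentaho-pdi-*.rpm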

pentaho pdi (kettle) carte init.d script

The following script is an LSB compliant init.d script allowing the carte webserver of pdi (kettle) to start at boot time.

Once the script is copied to /etc/init.d/carte, you can use it the usual way:

service carte start
service carte status
service carte stop
service carte restart

To actually have carte started at boot time, register the script with chkconfig on redhat flavours:

chkconfig --add carte

or with update-rc.d on debian flavours:

update-rc.d carte defaults

Note that under Redhat it uses the ‘/etc/rc.d/init.d/functions’ helper script, which gives nice coloured output for [OK] or [FAILED]. If this script does not exist, a fallback is present, a bit less nice but working just as well. This means that in theory this script should work under all flavours of Linux.

There are a few configuration variables near the top of the script (user to run carte under, path to carte and port to listen to), but that is about it. You can find the script on github as well.

#!/bin/bash

# Start the carte server as a daemon, and help manage it in the normal
# (service carte start/stop/status) way.

# Licence: FreeBSD, do what you want with it but do not hold me responsible.

### BEGIN INIT INFO
# Provides: pdi
# Required-Start: $network $syslog $remote_fs
# Required-Stop: $network $syslog
# Default-Start: 3 5
# Default-Stop: 0 1 2 4 6
# Short-Description: Pentaho Carte Server
# Description: Pentaho Carte Server
#
### END INIT INFO

## configuration directives
# Which user does carte run under?
PDIUSER=pentaho
# On which port should it listen?
CARTEPORT=80
# Where is pdi installed?
PDIROOT=/opt/pentaho_pdi
# Normal output log
LOGOUT=/var/log/pentaho_pdi.out.log
# Error output log
LOGERR=/var/log/pentaho_pdi.err.log

## script start here

# Note: The functions script is RH only. It is only used here for sexy (colored)
# output of [OK] or [FAILED] via echo_failure and echo_success.
#
# To make this script work under other flavors of linux, the 2 echo_ functions
# are first defined in a portable (but unsexy) way. If the RH functions script
# exists, its definition will override the portable way.
function echo_failure() { echo -en "\n[FAILED]"; }
function echo_success() { echo -en "\n[OK]"; }
[ -f /etc/rc.d/init.d/functions ] && source /etc/rc.d/init.d/functions

# Very useful for debugging
#set -x

# Find PID of the newest (-n) process owned by $PDIUSER (-u) with carte.sh on
# the full (-f) command, arguments included.
# => this should yield the pid of 'sh ./carte.sh' on STDOUT, with a status of 0
# if there is such a process, 1 otherwise
FINDPID="pgrep -u $PDIUSER -n -f carte.sh";
function _is_running() {
    $FINDPID 1>/dev/null
    return $?
}

function stop_carte() {
    _is_running
    if [ $? -ne 0 ]; then
        echo -n "$0 is not running, cannot stop."
        echo_failure
        echo
        return 1
    else
        echo -n "Stopping $0..."
        # Finding the pid of carte.sh from $FINDPID. Killing it would leave its
        # child, the actual java process, running.
        # Find this java process via ps and kill it.
        $FINDPID | xargs ps h -o pid --ppid | xargs kill
        sleep 1
        _is_running
        if [ $? -eq 0 ]; then
            echo_failure
            echo
            return 1
        else
            echo_success
            echo
            return 0
        fi
    fi

}

function status() {
    _is_running
    if [ $? -eq 0 ]; then
        echo -n "$0 is running."
        echo_success
        echo
        return 0
    else
        echo -n "$0 does not run."
        echo_failure
        echo
        return 1
    fi
}

function start_carte() {
    _is_running
    if [ $? -eq 0 ]; then
        echo -n "$0 already running."
        echo_failure
        echo
        return 1
    else
        echo -n "Starting $0..."
        # Make sure log files exist and are writable by $PDIUSER first
        touch $LOGOUT $LOGERR
        chown $PDIUSER:$PDIUSER $LOGOUT $LOGERR
        su - $PDIUSER -c "cd $PDIROOT && (nohup sh ./carte.sh $(hostname -i) $CARTEPORT 0<&- 1>>$LOGOUT 2>>$LOGERR &)"
        sleep 1
        _is_running
        if [ $? -eq 0 ]; then
            echo_success
            echo
            return 0
        else
            echo_failure
            echo
            return 1
        fi
    fi
}

case "$1" in
    start)
        start_carte
        exit $?
        ;;
    stop)
        stop_carte
        exit $?
        ;;
    reload|force-reload|restart|force-restart)
        stop_carte
        if [ $? -eq 0 ]; then
            start_carte
            exit $?
        else
            exit 1
        fi
        ;;
    status)
       status
       exit $?
       ;;
    *)
        echo "Usage: $0 {start|stop|restart|status}"
        exit 2
esac
exit 0

Vertica: rename multiple tables in one go

I have a use case where I need to regularly drop and fully recreate a table in Vertica. To keep the period without data to a minimum, I load the data into an intermediate table, then rename the old table out of the way and rename the intermediate table to its final name. It turns out that Vertica allows doing this in one command, thus hopefully avoiding race conditions.

After loading, before renaming:

vertica=> select * from important;
 v
-----
 old
(1 row)

vertica=> select * from important_intermediate ;
 v
-----
 new
(1 row)

Multiple renaming:

ALTER TABLE important, important_intermediate RENAME TO important_old, important;

After:

vertica=> select * from important;
 v
-----
 new
(1 row)
vertica=> select * from important_old;
 v
-----
 old
(1 row)

You will probably want to DROP important_old now.
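
For instance, once you are sure the old data is no longer needed (a sketch, to be run as the table owner or a dbadmin):

# get rid of the renamed-out table once the swap has been verified
vsql -c "DROP TABLE important_old;"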

Avro end to end in hdfs – part 2: Flume setup

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided into a few posts; more will come if relevant.

  1. Why avro?
  2. How to set up avro in flume (this post)
  3. How to use avro with hive
  4. Problems and solutions

Set up flume

Believe it or not, this is the easy part.

On the source, there is nothing specific to add, you can carry on as usual.

On the sink, here is a sample with comments:

agent.sinks.hdfs.type=hdfs
# Very important, *DO NOT* use CompressedStream. Avro itself will do the compression
agent.sinks.hdfs.hdfs.fileType=DataStream
# *MUST* be set to .avro for Hive to work
agent.sinks.hdfs.hdfs.fileSuffix=.avro
# Of course choose your own path
agent.sinks.hdfs.hdfs.path=hdfs://namenode/datain/logs/key=%{some_partition}
agent.sinks.hdfs.hdfs.writeFormat=Text
# The magic happens here:
agent.sinks.hdfs.serializer=avro_event
agent.sinks.hdfs.serializer.compressionCodec=snappy

Note the hdfs.path: some_partition might be a timestamp, for instance, which would create a new directory every hour. This will be used later in Hive.

This configuration uses the default Avro schema, which you can find defined in the flume source:

{
 "type": "record",
 "name": "Event",
 "fields": [{
   "name": "headers",
   "type": {
     "type": "map",
     "values": "string"
   }
 }, {
   "name": "body",
   "type": "bytes"
 }]
}

If you want to use your own custom schema, you need to extend AbstractAvroEventSerializer. This is not very complex, and the default avro event serializer actually extends it already, hardcoding a schema; it is a good example to start from. You would typically put the schema in a place reachable by the sink, either hdfs itself or a URL. The path can be hardcoded in your class if you have one schema only, or passed as a flume header.

If, as in the example, you are using snappy, first make sure that snappy is installed:

# RedHat world:
yum install snappy
# Debian world:
apt-get install libsnappy1

And that’s really it, there is nothing more to do to use the default schema.

Avro end to end in hdfs – part 1: why avro?

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided into a few posts; more will come if relevant.

  1. Why avro? (This post)
  2. How to set up avro in flume
  3. How to use avro with hive
  4. Problems and solutions

What is avro?

Avro, an apache project, is a data serialisation system. From the avro wiki, Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Why is it better than raw data?

After all, a bit of json or xml would work just as well, right? You could indeed do with json or xml whatever you do with avro, but it would be a lot more painful for many reasons. One of the main avro goals is to have self-contained data structures: storing the schema with the data file means that once you get a file, you have all the information you need to process it. You can even automatically generate code from the data file itself to further process the data, in a very simple and quick way.

Furthermore, since the schema is stored in the data file itself as a preamble, there is no need to duplicate it for each data line as json or xml would. This results in a much smaller file for the same data than the json or xml equivalent. For the same reason, data can be stored in a highly efficient binary format instead of plain text, which once more results in smaller size and less time spent parsing strings.

Another goal of avro is to support schema evolution. If your data structure changes over time, you do not want to have to update your whole processing flow in sync with the schema. With some restrictions, you can update an Avro schema on writers or readers without having to keep them aligned. Avro itself has a set of resolution rules which, on well written schemas, will provide good defaults or ignore unknown values.

Avro is splittable. This means that it is very easy for HDFS to cut an Avro file into pieces matching HDFS block size boundaries, and to have one process running per block. This in turn improves disk usage and processing speed. With a non-splittable format, Hadoop could only allocate one process to deal with the whole file, instead of one per block.

It is worth mentioning as well that Avro is well documented, is well integrated in the Hadoop ecosystem and has bindings in many languages.

Which types of compression are available?

Avro on its own will already result in smaller files than raw data, as explained earlier. You can go further by using compressed avro. Avro supports 2 compression codecs: deflate (typically an implementation of zlib) and snappy.

A short comparison would be:

                    Snappy     Deflate
Compression speed   faster     slower
Compression ratio   smaller    higher
Splittable          yes        no
Licence             New BSD    Old BSD

Even if the compression ratio of snappy is smaller, its goal is to be ‘good enough’. Add to this that snappy is splittable (you can make deflate splittable, but not without extra post-processing), and it is easy to see why snappy is the codec usually used.
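
To check which codec an existing avro file was written with, avro-tools (the standard command line jar) can tell you; the file name is of course just a placeholder:

# print the metadata of an avro container file, including avro.schema and avro.codec
java -jar avro-tools.jar getmeta event.avro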

What are the other options?

Keeping data in raw format has already been discussed earlier. The other logical options are sequence files, thrift and protocol buffer.

Sequence files do not really compare: they are not language independent as Avro is (it is all java), and schema versioning is not possible.

Thrift (originated at Facebook, Apache 2.0) and Protocol Buffers (originated at Google, BSD) are the main competitors. They all support schema evolution, are splittable, compress data to some extent and make processing faster. Usually Protocol Buffers and Avro are close in size and processing time, with thrift being somewhat bigger and slower.

The main pros of Avro are that the schemas do not need to be compiled (you can just use the json schema wherever you need it) and that it is well integrated into Hadoop. Some benchmarks with better numbers can be found around the web.

Further readings and documentation

Schema evolution in Avro, ProtocolBuffer and thrift, with a good technical explanation of how data is stored.

Slides and talk from Igor Anishchenko at Java Tech Talk #1: protocol buffer vs. avro vs. thrift.