This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided in a few posts, more will be coming if relevant.

What is avro?

Avro, an apache project, is a data serialisation system. From the avro wiki, Avro provides:

Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages

Why is it better than raw data?

After all, a bit of json or xml would work just as well, right? You could indeed do whatever you do avro with json or xml, but it would be a lot more painful for many reasons. One of the main avro goal is to have self contained data structure. Storing the schema with the data file means that once you get a file, you have all information you need to process it. You can even automatically generate code from the data file itself to further process the data, in a very simple and quick way.

Furthermore, the schema being stored in the data file itself as a preamble, it means that it is not needed to duplicate it for each data line as json or xml would. This results in a file which is a lot smaller for the same data as the json or xml equivalent. For the same reason, data can be stored in an highly efficient binary format instead of plain text, which once more results in smaller size and less time spent in string parsing.

Another goal of avro is to support schema evolution. If your data structure changes over time, you do not want to have to update all your process flow in synch with the schema. With some restrictions, you can update an Avro schema on writers or readers without having to keep them aligned. Avro itself has a set of resolution rules which on well written schemas will provide good defaults or ignore unknown values.

Avro is splittable. This means that it is very easy for HDFS to cut an Avro files in pieces to match HDFS block size boundaries, and have a process running per block. This in turn improves disk usage and processing speed. In a non splittable format, Hadoop could only allocate one process to deal with the whole file, instead of one per block.

It is worth mentioning as well that Avro is well documented, is well integrated in the Hadoop ecosystem and has bindings in many languages.

Which types of compression are available?

Avro on its own will already result in smaller size files than raw data, as explained earlier. You can go further by using compressed avro. Avro supports 2 compression formats, deflate (typically an implementation of zlib) and snappy.

A short comparison would be:

	Snappy	Deflate
Compression speed	faster	slower
Compression ratio	smaller	higher
Splitable	yes	no
Licence	New BSD	Old BSD

Even is the compression ratio of snappy is smaller, its goal is to be ‘good enough’. Adding to this that snappy is splitable (you can make deflate splitable but not without extra post processing) and that is the codec which is usually used

What are the other options?

Keeping data in raw format has already been discussed earlier. The other logical options are sequence files, thrift and protocol buffer.

Sequence files do not really compare. It is not language independent as Avro is (it is all java), and schema versioning is not possible.

Thrift (Originated at Facebook, Apache 2.0) and Protocol buffer (originated at Google, BSD) are the main competitors. They all have schema evolution, are splittable, somehow compress data, make processing faster. Usually ProtocolBuffer and Avro are close in size and processing time, with thrift being somewhat bigger and slower.

The main pros of Avro is that the schemas do not need to be compiled (you can just use the json schema wherever you need it) and it is well integrated into Hadoop. Some benchmarks can be found around with better numbers.

5 thoughts on “Avro end to end in hdfs – part 1: why avro?”

Pingback: Avro end to end in hdfs – part 2: Flume setup | This DWH guy
Pingback: Avro end to end in hdfs – part 3: Hive | This DWH guy
Pingback: Avro end to end in hdfs – part 4: problems and solutions | This DWH guy
Mounika Chirukuri on January 25, 2017 at 13:28 said:

Very well explained

Reply ↓
Gerard on June 11, 2017 at 11:19 said:

Just to be clear. For a non splittable file situation, I am pretty sure that there will be multiple processes possible if there are multiple such files.

Reply ↓

This Data Guy

Journey in a world of big(ger) data

Avro end to end in hdfs – part 1: why avro?

What is avro?

Why is it better than raw data?

Which types of compression are available?

What are the other options?

Further readings and documentation

5 thoughts on “Avro end to end in hdfs – part 1: why avro?”

Leave a comment Cancel reply

What is avro?

Why is it better than raw data?

Which types of compression are available?

What are the other options?

Further readings and documentation

Share this:

Related

5 thoughts on “Avro end to end in hdfs – part 1: why avro?”

Leave a comment Cancel reply