Avro end to end in hdfs – part 4: problems and solutions

This post is part of a series explaining how and why to set up compressed avro in hdfs. The series is split into a few posts, and more will follow if relevant.

  1. Why avro?
  2. How to set up avro in flume
  3. How to use avro with hive
  4. Problems and solutions (This post)

Invalid/non standard schemas

The avro tools available for different languages are not all exactly equivalent. The default java one used in Hadoop, for instance, has issues when some fields can be set to null. Nested arrays are another issue in a lot of cases: the default avro parser from java cannot handle them properly. Furthermore, even if you find a way to generate avro files with nested arrays, some tools will not be able to read them. Hive will be fine, but Impala (as of version 1.2) cannot read them.

I can only urge you to use simple schemas. This will make your life a lot easier.
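One way to catch this early is to check a schema for nested arrays before any data is written with it. The following is only a minimal sketch using the Python standard library. The function names and the sample schema are made up for illustration, and it walks only records, arrays, maps and unions of a schema that has already been parsed from JSON.

```python
def _is_array(node):
    """True if node is an array type, or a union with an array branch."""
    if isinstance(node, dict):
        return node.get("type") == "array"
    if isinstance(node, list):
        return any(_is_array(branch) for branch in node)
    return False


def find_nested_arrays(schema, path="$"):
    """Return the paths of arrays whose items are themselves arrays."""
    found = []
    if isinstance(schema, list):  # union: inspect every branch
        for i, branch in enumerate(schema):
            found += find_nested_arrays(branch, f"{path}[{i}]")
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "array":
            items = schema.get("items")
            if _is_array(items):
                found.append(path)
            found += find_nested_arrays(items, path + ".items")
        elif kind == "record":
            for field in schema.get("fields", []):
                found += find_nested_arrays(
                    field["type"], f"{path}.{field['name']}"
                )
        elif kind == "map":
            found += find_nested_arrays(schema.get("values"), path + ".values")
        elif isinstance(kind, (dict, list)):  # {"type": {...}} wrapper
            found += find_nested_arrays(kind, path)
    return found


# Hypothetical schema: "tags" is a flat array, "matrix" is an array of arrays.
SAMPLE_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {
            "name": "matrix",
            "type": {"type": "array", "items": {"type": "array", "items": "int"}},
        },
    ],
}

print(find_nested_arrays(SAMPLE_SCHEMA))
```

A schema for which this returns an empty list has no array directly inside another array, which is the case that caused us trouble.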

Hive partitions and schema changes

If you use Hive partitions (and you should), all data in one specific partition must have the same schema. We used to have one partition per hour when loading some logs, but we now also add the avro schema version to the partition path. That way, data encoded with a new schema ends up in a different partition, even if it belongs to the same hour.
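As an illustration, a partition path that carries the schema version could be built as follows. This is a sketch only: the base directory, the partition key names (schema_version, dt, hour) and their order are assumptions for the example, not our actual layout.

```python
from datetime import datetime


def partition_path(base, schema_version, ts):
    """Build a Hive-style partition path that includes the schema version.

    The key names and their order are hypothetical.
    """
    return (
        f"{base.rstrip('/')}"
        f"/schema_version={schema_version}"
        f"/dt={ts:%Y-%m-%d}"
        f"/hour={ts:%H}"
    )


# Two files for the same hour, written with different schema versions,
# land in different partitions.
hour = datetime(2014, 1, 15, 9, 30)
print(partition_path("/data/logs", 3, hour))
print(partition_path("/data/logs", 4, hour))
```

Because the version is part of the path, a schema change never mixes two schemas inside one partition.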

Faster encoding and flexibility

We started loading data the standard way, via flume. This created a lot of issues, as explained earlier (nested arrays mostly), and flume was actually using a lot of resources. We ended up using the json2avro C tool, which is very fast and can handle nested arrays (but this bit us later because of Impala). This tool generates avro files, which we load in hdfs via an hdfs fuse mount point. This improved performance drastically. Since we started using this fuse mount point, we have had no data loading issues or delays, whereas we had trouble every other week while using flume.

Default values

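We started by writing a schema with default values, hoping they would fill in fields missing from our JSON input. Sadly, we ended up noticing that JSON is only a convenient representation of avro data, useful for debugging, and is not the main purpose of avro. Default values are used when reading, to resolve a schema against data written without the field. They are not applied when encoding a document that lacks the field.

This means that a source field without a value can be represented in avro's JSON encoding this way: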

{"valid": {"boolean": true}, "source": null}

but a JSON document that actually omits this field is not valid.
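If your source JSON does omit fields, one workaround is to normalise each document before encoding it. The sketch below is an illustration using only the Python standard library, not the way json2avro or any avro library works. The schema, the type mapping and the function name are hypothetical, and only flat records with primitive or nullable-union fields are handled.

```python
import json

# Hypothetical schema: both fields are nullable unions.
SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "valid", "type": ["null", "boolean"]},
        {"name": "source", "type": ["null", "string"]},
    ],
}

# Simplified mapping from Python types to avro union branch names.
PY_TO_AVRO = {bool: "boolean", str: "string", int: "long", float: "double"}


def to_avro_json(doc, schema):
    """Turn a plain JSON document into avro's JSON encoding.

    Missing nullable fields become an explicit null, and other union
    values are wrapped as {"branch_name": value}.
    """
    out = {}
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        value = doc.get(name)
        if isinstance(ftype, list):  # union
            if value is None:
                if "null" not in ftype:
                    raise ValueError(f"field {name!r} is not nullable")
                out[name] = None
            else:
                branch = PY_TO_AVRO.get(type(value))
                if branch not in ftype:
                    raise ValueError(f"no union branch for field {name!r}")
                out[name] = {branch: value}
        else:  # plain (non-union) field: must be present
            if value is None:
                raise ValueError(f"field {name!r} is required")
            out[name] = value
    return out


# The "source" field is missing from the input document.
print(json.dumps(to_avro_json({"valid": True}, SCHEMA)))
```

The point of the sketch is that the explicit null has to be produced before encoding, since the schema's default will not do it for you.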

