This is a series of posts explaining how and why to set up compressed Avro in HDFS. It is divided into a few posts, and more will follow if relevant.
- Why Avro?
- How to set up Avro in Flume
- How to use Avro with Hive
- Problems and solutions (this post)
Invalid/non-standard schemas
The Avro tools available for the different languages are not all exactly equivalent. The default Java implementation used in Hadoop, for instance, has issues when some fields can be set to null. Nested arrays are another issue in a lot of cases: the default Java Avro parser cannot handle them properly. Furthermore, even if you find a way to generate Avro files with nested arrays, some tools will still not be able to read them. Hive will be fine, but Impala (as of version 1.2) is not able to read them.
I can only urge you to use simple schemas; this will make your life a lot easier.
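As a rough sketch of what "simple" means in practice (the record name, namespace and fields below are made up), aim for a flat record where nullable fields are declared as a union with null and nothing is nested more than one level deep:

{
  "type": "record",
  "name": "LogEvent",
  "namespace": "com.example.logs",
  "fields": [
    {"name": "ts", "type": "long"},
    {"name": "valid", "type": "boolean"},
    {"name": "source", "type": ["null", "string"], "default": null}
  ]
}

A flat schema like this should be readable by the Java tools, Hive and Impala alike.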
Hive partitions and schema changes
If you use Hive partitions (and you should), all data in one specific partition must have the same schema. We used to partition per hour when loading some logs, but now we also add the Avro schema version to the partition path. That way, data encoded with a new schema ends up in a different partition, even if it relates to the same hour.
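Concretely, the layout looks something like this (the paths and version numbers are just an illustration):

/user/hive/warehouse/logs/schema_version=2/dt=2015-06-01/hour=13/...
/user/hive/warehouse/logs/schema_version=3/dt=2015-06-01/hour=13/...

The same hour exists under both versions, but each partition is internally consistent. Since every Avro file also embeds the schema it was written with, the Hive Avro SerDe can still reconcile older partitions with the table's current schema, as long as the changes are backward compatible.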
Faster encoding and flexibility
We started loading data the standard way, via Flume. This created a lot of issues, as explained earlier (nested arrays mostly), and Flume was using a lot of resources. We ended up using the json2avro C tool, which is very fast and can handle nested arrays (though this bit us later because of Impala). This tool generates Avro files, which we load into HDFS via an HDFS fuse mount point. This improved performance drastically. Since switching to the fuse mount point we have had no data loading issues or delays, whereas we ran into trouble every other week with Flume.
Default values
We started by writing a schema with default values, hoping that fields missing from incoming JSON would simply fall back to those defaults. Sadly, that is not how Avro works: default values only apply when reading data written with an older schema that lacked the field, and JSON is merely a convenient representation of the data, useful for debugging, but not the main purpose of Avro.
This means that a record whose source field is null has to be spelled out explicitly in Avro's JSON encoding, with non-null union values tagged by their type:
{"valid": {"boolean": true}, "source": null}
whereas a JSON document that simply omits the field is not valid.
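For completeness, a schema matching the record above might look like the following sketch (the record name is hypothetical); the declared defaults do not make the fields optional in the JSON encoding:

{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "valid", "type": ["null", "boolean"], "default": null},
    {"name": "source", "type": ["null", "string"], "default": null}
  ]
}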