This Data Guy

Avro end to end in hdfs – part 2: Flume setup

July 28, 2014 08:33

This is part of a series of posts explaining how and why to set up compressed Avro in HDFS. The series is split into a few posts, and more will follow if relevant.

Why avro?
How to set up avro in flume (this post)
How to use avro with hive
Problems and solutions

Set up flume

Believe it or not, this is the easy part.

On the source side, there is nothing specific to add; you can carry on as usual.

On the sink side, here is a sample with comments:

agent.sinks.hdfs.type=hdfs
# Very important, *DO NOT* use CompressedStream. Avro itself will do the compression
agent.sinks.hdfs.hdfs.fileType=DataStream
# *MUST* be set to .avro for Hive to work
agent.sinks.hdfs.hdfs.fileSuffix=.avro
# Of course choose your own path
agent.sinks.hdfs.hdfs.path=hdfs://namenode/datain/logs/key=%{some_partition}
agent.sinks.hdfs.hdfs.writeFormat=Text
# The magic happens here:
agent.sinks.hdfs.serializer=avro_event
agent.sinks.hdfs.serializer.compressionCodec=snappy

Note the hdfs.path. The “some_partition” header might be a timestamp, for instance, which could create a new directory every hour. This will be used later in Hive.
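The hourly directories mentioned above can also be produced with the HDFS sink's time escape sequences instead of a custom header. The fragment below is a minimal sketch, not the configuration used in this series: the source name "src" and the partition names "dt" and "hour" are assumptions chosen for illustration. Time escapes need a "timestamp" header on every event, which the timestamp interceptor adds.

```properties
# Hypothetical source name "src": add a "timestamp" header to each event,
# which the HDFS sink needs in order to resolve escapes such as %Y or %H.
agent.sources.src.interceptors=ts
agent.sources.src.interceptors.ts.type=timestamp
# One directory per hour; the partition names dt and hour are illustrative only.
agent.sinks.hdfs.hdfs.path=hdfs://namenode/datain/logs/dt=%Y-%m-%d/hour=%H
```

If the events carry no timestamp header, setting agent.sinks.hdfs.hdfs.useLocalTimeStamp=true makes the sink use its own clock instead.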

This configuration uses the default Avro schema, which is defined in the Flume source code:

{
 "type": "record",
 "name": "Event",
 "fields": [{
   "name": "headers",
   "type": {
     "type": "map",
     "values": "string"
   }
 }, {
   "name": "body",
   "type": "bytes"
 }]
}

If you want to use your own custom schema, you need to extend AbstractAvroEventSerializer. This is not very complex, and the default Avro event serializer already extends it, hardcoding a schema, which makes it a good example to follow. You would typically put the schema somewhere reachable by the sink, either HDFS itself or a URL. The path could be hardcoded in your class if you have only one schema, or passed as a Flume header.
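Once such a class exists, the sink has to point at it instead of the avro_event alias. The fragment below is a sketch under assumptions: com.example.flume.CustomAvroEventSerializer is a hypothetical class name, and it is assumed to expose a nested Builder implementing EventSerializer.Builder, the way the default serializer does. Flume expects the fully qualified name of that Builder.

```properties
# Hypothetical class: replace with your own AbstractAvroEventSerializer subclass.
# The value must name the nested Builder, hence the $Builder suffix.
agent.sinks.hdfs.serializer=com.example.flume.CustomAvroEventSerializer$Builder
# Still honoured, provided the Builder calls configure(context) on the serializer.
agent.sinks.hdfs.serializer.compressionCodec=snappy
```

The rest of the sink configuration (fileType, fileSuffix, path) should not need to change.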

If, as in the example, you are using snappy, first make sure that snappy is installed:

# RedHat world:
yum install snappy
# Debian world:
apt-get install libsnappy1

And that’s really it: there is nothing more to do to use the default schema.

Posted by This data guy

Categories: avro, flume, hadoop, Tech

Tags: avro, flume, hadoop
