This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided in a few posts, more will be coming if relevant.
- Why avro?
- How to set up avro in flume
- How to use avro with hive (this post)
- Problems and solutions
Use avro in Hive
Once your table is created, and data is loaded, there is nothing extra to do, you can just query it as you would any other table.
Create the table
Creating the table can be done as follow, with some comments:
-- table name
CREATE EXTERNAL TABLE IF NOT EXISTS table_name
-- Partition according to the end of the path you set in the flume sink (hdfs.path option).
-- Following the example form previous post, we would have
PARTITIONED BY (key STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
-- Matches the first part of hdfs.path set up in the flume sink
-- Following the example of the previous post, we would have
-- Other options here are to hardcode the schema or use a file on your local filesystem instead.
More information can be found on the cloudera documentation about hive and avro.
Load the snappy jar
To load data, you need to tell Hive that the data files will be compressed, and Hive needs to know how to decompress. For this, you need to add the snappy jar to the list of extra jars loaded by Hive. This is done by adding the path to the snappy jar to the value to the hive.aux.jars.path property of your hive-site.xml. For instance:
Actually load data
You need to tell hive to use snappy, which is done the following way:
Then loading data means creating a new partition when a new directory is created with another key. Run this after having told hive to use snappy:
ALTER TABLE table_name
ADD IF NOT EXISTS PARTITION /datain/logs/key=some_new_key
Using data with the default schema
If you use a custom schema, tailored to your data, you can then enjoy the full speed of Hive, as not much parsing will be needed by Hive to access your data.
If you use the default schema, then Hive does not know (yet) about the columns in your table. This can be fixed by the decode() function. For instance,
hour, decode(body,'UTF-8') as body