Easy import from Hive to Vertica

Properly setup, Vertica can connect to Hcatalog, or read hdfs files. This does require some DBA work, though.

If you want to easily get data fro Hive to Vertica, you can use the COPY statement with the LOCAL STDIN modifier and pipe the output of Hive to the input of Vertica. Once you add a dd in the middle to prevent the stream to just stop after a while, this works perfectly. I am not so sure why dd is needed, but I suppose it buffers data and makes the magic happen.

hive -e "select whatever FROM wherever" | \
dd bs=1M | \
/opt/vertica/bin/vsql -U $V_USERNAME -w $V_PASSWORD -h $HOST $DB -c \
"COPY schema.table FROM LOCAL STDIN DELIMITER E'\t' NULL 'NULL' DIRECT"

Of course, the previous statement needs to be amended to use your own user, password and database.

The performance are quite good with this, although I cannot give a good benchmark as in our case the hive statement was not trivial.

One thing to really take care of is where you run this statement. You can run it from everywhere as long as hive and Vertica are accessible, but be aware that data will flow from hive to your server to Vertica. Running this statement on a Vertica node or your hive server will reduce the network traffic and might speed up things.

This post is based on my answer to a question on stackoverflow.

Advertisements

Avro end to end in hdfs – part 3: Hive

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided in a few posts, more will be coming if relevant.

  1. Why avro?
  2. How to set up avro in flume
  3. How to use avro with hive (this post)
  4. Problems and solutions

Use avro in Hive

Once your table is created, and data is loaded, there is nothing extra to do, you can just query it as you would any other table.

Create the table

Creating the table can be done as follow, with some comments:

-- table name
CREATE EXTERNAL TABLE IF NOT EXISTS table_name

-- Partition according to the end of the path you set in the flume sink (hdfs.path option).
-- Following the example form previous post, we would have
PARTITIONED BY (key STRING)

-- Avro!
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

-- Matches the first part of hdfs.path set up in the flume sink
-- Following the example of the previous post, we would have
LOCATION '/datain/logs'

-- Other options here are to hardcode the schema or use a file on your local filesystem instead.
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/schema.avsc');

More information can be found on the cloudera documentation about hive and avro.

Load the snappy jar

To load data, you need to tell Hive that the data files will be compressed, and Hive needs to know how to decompress. For this, you need to add the snappy jar to the list of extra jars loaded by Hive. This is done by adding the path to the snappy jar to the value to the hive.aux.jars.path property of your hive-site.xml. For instance:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:////usr/lib/hive/lib/hive-contrib.jar,...,file:////usr/lib/hive/lib/auxlib/snappy-java-1.0.4.1.jar</value>;
</property>

Actually load data

You need to tell hive to use snappy, which is done the following way:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Then loading data means creating a new partition when a new directory is created with another key. Run this after having told hive to use snappy:

ALTER TABLE table_name
ADD IF NOT EXISTS PARTITION /datain/logs/key=some_new_key
LOCATION '/datain/logs';

Using data with the default schema

If you use a custom schema, tailored to your data, you can then enjoy the full speed of Hive, as not much parsing will be needed by Hive to access your data.

If you use the default schema, then Hive does not know (yet) about the columns in your table. This can be fixed by the decode() function. For instance,

SELECT
hour, decode(body,'UTF-8') as body
FROM my_table

Accessing Hive via JDBC with DbVisualiser

Hive has a SQL interface via jdbc, you can thus connect to it with your usual tools (DBVis, Squirrel). I will explain here how to connect via DBVis, but the ideas are very similar when connecting with any other tool.

Create a new driver

From Tool > Driver manager, click the top button with a green ‘+’ to add a driver as shown in the following screenshot.

Fill up then the ‘Name’ (this is for you only, the value can be different than the one shown below) and ‘URL format’ fields as shown.

Note that in the picture the port is 9000, usually the default hive port is 10000.

New Driver

Add the relevant jars

You need to load a set of jars to be able to connect to Hive. Make sure that you use exactly the same version for hive-jdbc and hive-service as the one on your hive server. I usually take a copy of all the jars and put them in one directory, making my life easier. You can get the jars from your hive’s installation directory:

  • On a Mac with homebrew, after having installed hive you can find them under /usr/local/Cellar.
  • On a rpm-based linux, from the cloudera rpms (install hive) you can find them somewhere under somewhere under /usr/lib
  • Directly in your downloaded folders if you get the binary version of hive from cloudera.

Note that as version number might slightly differ I do not write them in the jar names below.

  • You will need:
    • commons-logging.jar
    • hive-jdbc.jar
    • hive-service.jar
    • libthrift.jar
    • slf4j-api.jar
    • slf4j-log4j12.jar

Click then the ‘open icon’ as shown as in the previous screenshot, and select all the jars you uncompressed in the previous step. This will fill the ‘Driver Class’ dropdown box, you need to choose ‘org.apache.hive.jdbc.HiveDriver’.

driverclass

Create a new database connection

Right click on Connection, select ‘New connection’ (see 1 in the picture below.)
Fill up then the details pointed by 2 in the picture. The name is for your reference only, select the Driver name you just set up in the previous phase, and fill up the database URL as shown.

Click then connect (or reconnect if you are trying again), pointed by 3 in the picture. You should see some output as displayed in the connection message.

Database Connection

You are now good to go!