Hive self merge

I had an ORC table with duplicated rows that I wanted to remove. The query looks quite simple:

merge into click as dst using (
    select
      -- For all unique clicks...
        client_name
      , contact_id
      , ts_utc
      -- ... find the duplicates (cnt>1 in having) ...
      , count(*) cnt
      -- ... remember the first one loaded ...
      , min(load_ts) as first_load
      from
        click
      group by
        1, 2, 3
      having cnt > 1
)
as src
-- ... once the first occurrence of the duplicates
-- is found find all the duplicates ...
on
        dst.client_name=src.client_name
    and dst.contact_id=src.contact_id
    and dst.ts_utc=src.ts_utc
-- ... and if it is not the first one loaded ...
when matched and src.first_load != dst.load_ts
-- .. delete it.
then delete
;

Trivial, right? Well, it looks like you cannot do such a ‘self merge’ in Hive. I ended up with this error:

java.lang.InterruptedException
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
 at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
 at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
 at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
 [...]

Once you accept that a self merge does not work, the solution is obvious: use a temporary table. Splitting my merge statement in two did the trick.

create temporary table clickdups stored as orc as select
        client_name
      , contact_id
      , ts_utc
      , count(*) cnt
      , min(load_ts) as first_load
      from
        click
      group by
        1, 2, 3
      having cnt > 1
;

merge into click as dst using clickdups
as src
on
        dst.client_name=src.client_name
    and dst.contact_id=src.contact_id
    and dst.ts_utc=src.ts_utc
when matched and src.first_load != dst.load_ts
then delete
;

On a side note, I had to tweak the self merge a lot to prevent out-of-memory errors. These did not happen at all with the two-step solution.
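To make the logic of the two steps easier to follow, here is a minimal Python sketch of what they compute. It is only an illustration, not how Hive executes the merge; the function name rows_to_delete and the sample rows are made up for the example.

```python
from collections import defaultdict

def rows_to_delete(rows):
    """Return the rows the two-step merge would delete.

    Each row is a dict with client_name, contact_id, ts_utc and load_ts.
    """
    # Step 1 (the temporary table): group by key, keep only keys seen more
    # than once, and remember the earliest load_ts for each of them.
    loads_by_key = defaultdict(list)
    for row in rows:
        key = (row["client_name"], row["contact_id"], row["ts_utc"])
        loads_by_key[key].append(row["load_ts"])
    first_load = {k: min(v) for k, v in loads_by_key.items() if len(v) > 1}

    # Step 2 (the merge): a row is deleted when its key is in the duplicate
    # list and its load_ts is not the earliest one.
    doomed = []
    for row in rows:
        key = (row["client_name"], row["contact_id"], row["ts_utc"])
        if key in first_load and row["load_ts"] != first_load[key]:
            doomed.append(row)
    return doomed

# Hypothetical sample data: one duplicated click and one unique click.
clicks = [
    {"client_name": "acme", "contact_id": 1, "ts_utc": "2017-01-01 10:00:00", "load_ts": 100},
    {"client_name": "acme", "contact_id": 1, "ts_utc": "2017-01-01 10:00:00", "load_ts": 200},
    {"client_name": "acme", "contact_id": 2, "ts_utc": "2017-01-01 11:00:00", "load_ts": 100},
]
print(rows_to_delete(clicks))
```

One thing the sketch makes visible: duplicates that share the same load_ts as the first load are not deleted, because the condition only compares load_ts values.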


Create a time dimension table in pure hive SQL

Without further ado, here is the full SQL to create a table with one row per day, with the date, year, month, day, number and name of the day of the week, and day of the year. If you want the hours as well, look at the bottom of this post.

set hivevar:start_day=2010-01-01;
set hivevar:end_day=2050-12-31;
set hivevar:timeDimTable=default.timeDim;

create table if not exists ${timeDimTable} as
with dates as (
select date_add("${start_day}", a.pos) as d
from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a
)
select
    d as d
  , year(d) as year
  , month(d) as month
  , day(d) as day
  , date_format(d, 'u') as daynumber_of_week
  , date_format(d, 'EEEE') as dayname_of_week
  , date_format(d, 'D') as daynumber_of_year

from dates
sort by d
;

Note that I use d as the name of the date column because date is a reserved keyword.

The biggest issue is to generate one row per day. The trick here is to use a clever combination of posexplode, split and repeat. This is what the first CTE does:

-- just 10 days for the example
set hivevar:start_day=2010-01-01;
set hivevar:end_day=2010-01-10;
select date_add("${start_day}", a.pos) as d
from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a

We can break it down into a few parts:

select datediff("${end_day}", "${start_day}");
-- output: 9

This just computes the difference in days between the start day and the end day.

select repeat("o", 9);
-- output: ooooooooo

This outputs a string of 9 ‘o’ characters. The actual character does not matter at all.

select split("ooooooooo", "o");
-- output:  ["","","","","","","","","",""]

This creates a Hive array of 10 empty strings: the 9 separators delimit 10 pieces.

select posexplode(split("ooooooooo", "o"));
-- output:
-- +------+------+--+
-- | pos | val |
-- +------+------+--+
-- | 0 | |
-- | 1 | |
-- | 2 | |
-- | 3 | |
-- | 4 | |
-- | 5 | |
-- | 6 | |
-- | 7 | |
-- | 8 | |
-- | 9 | |
-- +------+------+--+

This actually creates one row per array element, with the index (0 to 9) and the value (nothing) of each element.

That was the tricky part, the rest is easy. The first CTE creates one row per date by adding the array index (in days) to the start_day:

with dates as (
select date_add("${start_day}", a.pos) as d
from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a)
select * from dates;
-- +-------------+--+
-- | dates.d |
-- +-------------+--+
-- | 2010-01-01 |
-- | 2010-01-02 |
-- | 2010-01-03 |
-- | 2010-01-04 |
-- | 2010-01-05 |
-- | 2010-01-06 |
-- | 2010-01-07 |
-- | 2010-01-08 |
-- | 2010-01-09 |
-- | 2010-01-10 |
-- +-------------+--+
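The same chain of steps can be mimicked in Python, which is handy to convince yourself about the off-by-one (9 separators, 10 rows). This is only a sketch; date_range is a name made up for the example, and it relies on Python's str.split keeping empty strings the same way the Hive output above does.

```python
from datetime import date, timedelta

def date_range(start_day, end_day):
    """One date per array position, like the first CTE."""
    n_days = (end_day - start_day).days        # datediff(end_day, start_day)
    placeholders = ("o" * n_days).split("o")   # repeat + split: n_days + 1 empty strings
    # enumerate plays the role of posexplode; timedelta the role of date_add
    return [start_day + timedelta(days=pos) for pos, _val in enumerate(placeholders)]

days = date_range(date(2010, 1, 1), date(2010, 1, 10))
print(len(days), days[0], days[-1])
```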

From there on, you can just create whatever column you feel like. Quarter column? ceil(month(d)/3) as quarter. Long name of the day of the week? date_format(d, 'EEEE') as dayname_of_week_long.
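When adding derived columns, it helps to check the expressions on a few known dates first. Below is a small Python sketch of the same columns, useful as a reference to compare against. The function name time_dim_row is made up, the quarter is computed as ceil(month / 3), and the day number assumes the 'u' pattern follows Java's SimpleDateFormat convention (1 = Monday, 7 = Sunday), which may differ between Hive versions.

```python
import math
from datetime import date

def time_dim_row(d):
    """Reference values for one row of the time dimension."""
    return {
        "d": d.isoformat(),
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "daynumber_of_week": d.isoweekday(),        # assumed: 1 = Monday ... 7 = Sunday
        "dayname_of_week": d.strftime("%A"),        # depends on the locale
        "daynumber_of_year": d.timetuple().tm_yday,
        "quarter": math.ceil(d.month / 3),          # months 1-3 -> 1, 4-6 -> 2, ...
    }

print(time_dim_row(date(2010, 1, 1)))
```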

As a bonus, I give you the same table but with hours added. The principles are exactly the same, with a cartesian join between dates and hours:

set hivevar:start_day=2010-01-01;
set hivevar:end_day=2010-01-02;
set hivevar:timeDimTable=default.timeDim;

create table if not exists ${timeDimTable} as
with dates as (
  select date_add("${start_day}", a.pos) as d
  from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a
),
hours as (
  select a.pos as h
  from (select posexplode(split(repeat("o", 23), "o"))) a
)
select
    from_unixtime(unix_timestamp(cast(d as timestamp)) + (h * 3600)) as dt
  , d as d
  , year(d) as year
  , month(d) as month
  , day(d) as day
  , h as hour
  , date_format(d, 'u') as daynumber_of_week
  , date_format(d, 'EEEE') as dayname_of_week
  , date_format(d, 'D') as daynumber_of_year

from dates
join hours
sort by dt
;
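The cartesian join simply pairs every day with every hour from 0 to 23. As a rough illustration in Python (hourly_rows is a made-up name, and the day count is passed directly instead of being derived from an end day):

```python
from datetime import datetime, timedelta
from itertools import product

def hourly_rows(start_day, n_days):
    """Return (dt, d, hour) tuples: every day crossed with every hour."""
    days = [start_day + timedelta(days=i) for i in range(n_days)]
    hours = range(24)                       # same as positions 0..23 of the hours CTE
    rows = [(d + timedelta(hours=h), d.date(), h) for d, h in product(days, hours)]
    return sorted(rows)                     # sort by dt

rows = hourly_rows(datetime(2010, 1, 1), 2)
print(len(rows), rows[0][0], rows[-1][0])
```

With the two days of the example above, this gives 2 x 24 = 48 rows.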

Alter location of a Hive table

Long story short: the location of a Hive managed table is just metadata. If you update it, Hive will not find its data anymore; you need to physically move the data on HDFS yourself.

Short story long:

You can decide where on HDFS the data of a table lives, even for a managed table:

create table if not exists tstloc (id bigint)
clustered by (id) into 4 buckets
stored as orc
location 'hdfs:///tmp/ttslocorig'
tblproperties ("transactional"="true");
insert into tstloc values(1);
select * from tstloc;

Now if you want to move this table to another location for any reason, you might run the following statement:

alter table tstloc set location 'hdfs:///tmp/ttslocnew';

But then the table is empty!

select * from tstloc;

will return an empty set. The reason is that the location property is only metadata, telling Hive where to look without any effect on said location (except at creation time, when the location is created if it does not exist, for managed tables). If nothing happens to be there, Hive will not return anything. Conversely, if something happens to be there, Hive will return it.

To get your data back, you just need to physically move the data on HDFS to the expected location:

hdfs dfs -mv /tmp/ttslocorig /tmp/ttslocnew
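The behaviour is easy to reproduce with a toy model: a dictionary standing in for the metastore, and local directories standing in for HDFS. Everything in this sketch (catalog, select_all, the file name) is invented for the illustration; it ignores buckets, transactions and everything else Hive really does.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
orig = os.path.join(root, "ttslocorig")
new = os.path.join(root, "ttslocnew")

# "create table ... location" followed by an insert: the directory exists and holds data.
os.makedirs(orig)
with open(os.path.join(orig, "bucket_00000"), "w") as f:
    f.write("1\n")

catalog = {"tstloc": orig}           # the metastore only remembers a path

def select_all(table):
    """Read whatever files sit at the table's registered location."""
    location = catalog[table]
    if not os.path.isdir(location):
        return []
    rows = []
    for name in sorted(os.listdir(location)):
        with open(os.path.join(location, name)) as f:
            rows.extend(line.strip() for line in f)
    return rows

before = select_all("tstloc")        # data is found at the original location
catalog["tstloc"] = new              # "alter table ... set location": metadata only
after_alter = select_all("tstloc")   # nothing lives at the new path yet
shutil.move(orig, new)               # "hdfs dfs -mv": move the files yourself
after_move = select_all("tstloc")    # data is visible again
shutil.rmtree(root)

print(before, after_alter, after_move)
```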

 

 

Compression of ORC tables in Hive

I only use ORC tables in Hive, and while trying to understand some performance issues I wanted to make sure my tables were properly compressed. This is easy, just run

desc extended table;

and search the output for the string

compressed:true

Well, it turned out that it was false for all my tables although I was pretty sure I had set up everything correctly, so I dug and experimented a bit. I generated an easy-to-compress data set, and loaded it into a few different tables with different options.

# create 1 csv, 500MB of easy to compress data
yes '1,longish string which will compress really well' | head -n 10000000 > /tmp/source.csv

# Copy this file in hdfs
hdfs dfs -mkdir /tmp/compressiontest
hdfs dfs -copyFromLocal /tmp/source.csv /tmp/compressiontest/source.csv

Then I loaded this data into 2 tables, one meant to be compressed and one uncompressed, controlled by the setting hive.exec.compress.output.


CREATE EXTERNAL TABLE sourcedata (id INT, s STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/compressiontest'
;
MSCK REPAIR TABLE sourcedata;

CREATE TABLE shouldbecompressed ( id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldbecompressed';

CREATE TABLE shouldbeuncompressed (id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldbeuncompressed';

set hive.exec.compress.output=true;
INSERT INTO shouldbecompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldbecompressed;

set hive.exec.compress.output=false;
INSERT INTO shouldbeuncompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldbeuncompressed;

I still have compressed:false, but what happens on disk?

hdfs dfs -du -s -h /tmp/should\*

42.5 K /tmp/shouldbecompressed
39.8 K /tmp/shouldbeuncompressed

Hum, apparently both tables are compressed? It turned out that I forgot about an orc parameter (orc.compress), set by default to ZLIB for me. The other valid values are SNAPPY or NONE. So let’s try again:

CREATE TABLE shouldreallybecompressed ( id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldreallybecompressed'
TBLPROPERTIES ("orc.compress"="ZLIB")
;

CREATE TABLE shouldreallybeuncompressed ( id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldreallybeuncompressed'
TBLPROPERTIES ("orc.compress"="NONE")
;

set hive.exec.compress.output=true;
INSERT INTO shouldreallybecompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldreallybecompressed;

set hive.exec.compress.output=false;
INSERT INTO shouldreallybeuncompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldreallybeuncompressed;

hdfs dfs -du -s -h /tmp/should\*

42.5 K /tmp/shouldbecompressed
39.8 K /tmp/shouldbeuncompressed
38.8 K /tmp/shouldreallybecompressed
3.8 M /tmp/shouldreallybeuncompressed

So indeed, the uncompressed table is less compressed, but is still a far cry from the 500MB I expected.

Long story short, ORC does some compression on its own, and the parameter orc.compress is just the cherry on top. On a side note, using SNAPPY instead of ZLIB the data size was 197k instead of 44k.

To look even deeper, hive on the command line has an option --orcfiledump, which will give some metadata about an ORC file. So looking at a compressed file:
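To get a feeling for why the data shrinks even with orc.compress set to NONE: ORC stores columns with lightweight encodings (run-length and dictionary encoding, for instance) before any general-purpose codec is applied. The Python sketch below is a toy run-length encoder, not ORC's actual format, and all names in it are made up; it only shows that encoding alone can remove most of the size of a repetitive column, while it does nothing for random data.

```python
import random
import string
import zlib
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

def serialised_size(obj):
    """Bytes needed for a naive text serialisation of obj."""
    return len(repr(obj).encode("utf-8"))

# Very repetitive column, in the spirit of the generated csv.
repetitive = ["longish string which will compress really well"] * 100_000
raw_size = sum(len(v) for v in repetitive)
encoded_size = serialised_size(run_length_encode(repetitive))

# Random alphanumeric column: run-length encoding has nothing to collapse.
rng = random.Random(0)
alphabet = string.ascii_letters + string.digits
noisy = ["".join(rng.choices(alphabet, k=80)) for _ in range(2_000)]
noisy_raw = "\n".join(noisy).encode("utf-8")
noisy_encoded_size = serialised_size(run_length_encode(noisy))
noisy_zlib_size = len(zlib.compress(noisy_raw))

print("repetitive:", raw_size, "->", encoded_size)
print("random:", len(noisy_raw), "-> rle", noisy_encoded_size, "/ zlib", noisy_zlib_size)
```

On random alphanumeric data, only the general-purpose codec helps, and only a little, since each character carries close to 6 bits of information out of the 8 it occupies.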

hive --orcfiledump /tmp/shouldbecompressed/000007_0

We can see, among other lines:

# yes, compressed!
Compression: ZLIB

# This is the buffer size, nothing to do with actual data size
Compression size: 262144

File length: 5459 bytes

For an uncompressed file:

hive --orcfiledump /tmp/shouldreallybeuncompressed/000000_0

Compression: NONE
File length: 136741 bytes

Long story short, the output of desc extended regarding compression is useless. And all my tables are indeed compressed.

This example was a bit artificial as the source file was very compressible. With another, more random source file, generated as follows:


cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold | head -c 500000k | awk '{print "1," $0}'> source.csv

Then the size on disk becomes:

370.4 M /tmp/shouldbecompressed
370.4 M /tmp/shouldbeuncompressed
370.4 M /tmp/shouldreallybecompressed
490.0 M /tmp/shouldreallybeuncompressed

And just because I am nice, here are the lines to clean up your droppings:

drop table shouldbecompressed;
drop table shouldbeuncompressed;
drop table shouldreallybeuncompressed;
drop table shouldreallybecompressed;
drop table sourcedata;

Environment variables in Hive

I will here explain how to set and use variables in hive.

How to set a variable

Just use the keyword set

set foo=bar;
set system:foo=bar;

Alternatively, for the hiveconf namespace you can set the variable on the command line:

beeline --hiveconf foo=bar

How to use a variable

Wherever you want to use a value, use this syntax instead: ${namespace:variable_name}. For instance:

select '${hiveconf:foo}', '${system:foo}', '${env:CLASSPATH}';

Note that variables will be replaced before anything else happens. This means that this is perfectly valid:

set t=employees;
set verb=desc;
${hiveconf:verb} ${hiveconf:t};

But this will not do what you expect (hint: you will end up with 4 quotes in your select statement):

set s='Hello world';
select '${hiveconf:s}';

 

Furthermore, it means that you need to take care of your data types. Selecting a bare string is not valid, so the following code is invalid as well:

set v=astring;
select ${hiveconf:v};

You will get:

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'astring': (possible column names are: ) (state=42000,code=10004)

In our case, you just need to quote the variable.

Note that this would work as-is with an int as a bare int is valid in the select statement.

Another caveat is to make sure the variable exists. Otherwise you will either get the variable literal for quoted variables:

select '${hiveconf:donotexists}';
+--------------------------+--+
| _c0 |
+--------------------------+--+
| ${hiveconf:donotexists} |
+--------------------------+--+

or an unhelpful message for unquoted variables:

> select ${hiveconf:doesnotexist};
Error: Error while compiling statement: FAILED: ParseException line 1:7 cannot recognize input near '$' '{' 'hiveconf' in select clause (state=42000,code=40000)
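All these caveats follow from the substitution being purely textual. A minimal Python sketch of such a substitution reproduces them; this is an illustration of the idea, not Hive's implementation, and substitute and the variables dictionary are names made up for the example.

```python
import re

# ${namespace:name}, for the namespaces Hive knows about
VARIABLE = re.compile(r"\$\{(hivevar|hiveconf|system|env):([^}]+)\}")

def substitute(statement, variables):
    """Replace ${namespace:name} textually; unknown variables are left untouched."""
    def replace(match):
        key = (match.group(1), match.group(2))
        return variables.get(key, match.group(0))
    return VARIABLE.sub(replace, statement)

variables = {
    ("hiveconf", "s"): "'Hello world'",   # set s='Hello world';
    ("hiveconf", "v"): "astring",         # set v=astring;
}

print(substitute("select '${hiveconf:s}';", variables))            # four quotes
print(substitute("select ${hiveconf:v};", variables))              # bare word, not a string
print(substitute("select '${hiveconf:donotexists}';", variables))  # left as a literal
```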

If you only want to see the value of a variable, you can just use set as well:

set hiveconf:foo;

How to list variables

Just use SET;, but this will output a massive unreadable list. You are better off redirecting this output to a file, e.g.

beeline -e 'SET;' | sed 's/\s\+/ /g'> set.out

Note that I squash the spaces here. As the columns are aligned and some values are very long strings, squashing makes reading much easier.

Then if you want to see a specific set of variables, you can just run:

# system variables
grep '| system:' set.out

# Env variables
grep '| env:' set.out

# other variables
cat set.out | grep -v '| env:' | grep -v '| system:'
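If you prefer to slice the file in one go, the same grouping can be sketched in Python. This assumes each line of set.out looks like "| name=value |", which is what the grep patterns above rely on; group_by_namespace and the sample lines are invented for the example.

```python
from collections import defaultdict

def group_by_namespace(lines):
    """Group 'name=value' rows under 'system', 'env' or 'other'."""
    groups = defaultdict(list)
    for line in lines:
        text = line.strip().strip("|").strip()
        if "=" not in text:
            continue                        # table borders, header row, blank lines
        name, _, value = text.partition("=")
        if name.startswith(("system:", "env:")):
            namespace = name.split(":", 1)[0]
        else:
            namespace = "other"
        groups[namespace].append((name, value))
    return dict(groups)

# Hypothetical sample of what set.out could contain.
sample = [
    "+-----+--+",
    "| set |",
    "| env:HOME=/home/hive |",
    "| system:java.version=1.8.0 |",
    "| hive.exec.compress.output=false |",
]
for namespace, entries in group_by_namespace(sample).items():
    print(namespace, entries)
```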

Namespaces

Hive has 4 namespaces for variables: hivevar, hiveconf, system and env. This post covers hiveconf, env and system.

Hiveconf

Hiveconf is the namespace used when you use set without a namespace, or when you give a variable on the command line with --hiveconf foo=bar. Note that you can set those without specifying the namespace, but you always need to specify the namespace when using them.

set foo=bar;
select "${hiveconf:foo}";

env

This is the namespace of the shell environment variables. You can easily get them with the env: prefix:

SELECT "${env:hostname}";

I specifically chose this variable. If you run this query yourself, you will see that it is the environment of the Hive server which is used, not the environment of your client. This greatly limits the usefulness of environment variables.

Note that environment variables cannot be set.

system

Those will contain for instance jvm settings, logfile destinations and more.

 

Extracting queries from Hive logs

Hive logs are very verbose, and I personally find it a pain to wade through them when I try to understand which queries my ETL tool decided to generate.

To help with this, I created a small Python script which looks at Hive log files and outputs the SQL queries, and only the queries, with some information about them if known: time started, duration, success.

Usage:

./hqe.py --help 
usage: hqe.py [-h] [--since SINCE] [--to TO] [--logdir LOGDIR]
              [--glob LOGFILE_GLOB]
              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Displays queries ran on Hive.

optional arguments:
  -h, --help            show this help message and exit
  --since SINCE         how far to look back. (default: 15m)
  --to TO               How far to look forward. (default: now)
  --logdir LOGDIR       Directory of hive log files. (default: /var/log/hive)
  --glob LOGFILE_GLOB   Shell pattern of hive logfiles inside their logdir.
                        (default: hiveserver2.log*)
  --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level. (default: warn)

Sample output:

Started at 2017-06-22 05:30:58 for 12.788000s by hive on ip-10-0-0-10.eu-west-2.compute.internal (Probably success). (Thread id: 79733, query id: hive_20170622053058_676612af-7bb8-4c4b-8fce-51bd1ae7be71, txn id: 0):
SELECT  
 id,
 count(*)
FROM 
 raw.event
GROUP BY
 1 
ORDER BY -- required for next step
 sys_partition

Started at 2017-06-22 05:31:25 for 0.018000s by Unknown on Unknown (Probably success). (Thread id: 79770, query id: hive_20170622053125_7d8e644a-5c23-4ca8-ab0f-20becdd65c3b, txn id: Unknown):
use events

Started at 2017-06-22 05:31:25 for Unknowns by Unknown on Unknown (FAILED). (Thread id: handler-46, query id: Unknown, txn id: Unknown):
MERGE INTO mart.click dst
USING (
 SELECT
 [big sql...]
 ) as r

FROM
 raw.click
 WHERE
 ${SEQ_CHECKER_SQL}
) src
ON

 [big sql...]

WHEN NOT MATCHED THEN INSERT VALUES (
  [more sql]
)
Error: ParseException line 36:4 cannot recognize input near '$' '{' 'SEQ_CHECKER_SQL' in expression specification

As you can see:

  • if user, hostname and duration are known, they are displayed,
  • the query is displayed with the same formatting as it was sent, including comments,
  • the error (if any) is shown. In my case, a variable was not expanded by the ETL tool.

You can find the source on github.
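To give an idea of how a value like the default 15m of --since can be turned into a time window, here is a short Python sketch. It is not taken from the script itself: the function names and the set of accepted unit letters (s, m, h, d) are assumptions made for the illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed unit letters; only "m" appears in the help text above.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_lookback(text):
    """Turn a string such as '15m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"cannot parse duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{UNITS[unit]: int(amount)})

def window(since, now):
    """Return the (start, end) datetimes of the period to inspect."""
    return now - parse_lookback(since), now

start, end = window("15m", datetime(2017, 6, 22, 5, 45))
print(start, end)
```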

 

Avro end to end in hdfs – part 3: Hive

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided into a few posts, and more will come if relevant.

  1. Why avro?
  2. How to set up avro in flume
  3. How to use avro with hive (this post)
  4. Problems and solutions

Use avro in Hive

Once your table is created, and data is loaded, there is nothing extra to do, you can just query it as you would any other table.

Create the table

Creating the table can be done as follows, with some comments:

-- table name
CREATE EXTERNAL TABLE IF NOT EXISTS table_name

-- Partition according to the end of the path you set in the flume sink (hdfs.path option).
-- Following the example from the previous post, we would have
PARTITIONED BY (key STRING)

-- Avro!
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

-- Matches the first part of hdfs.path set up in the flume sink
-- Following the example of the previous post, we would have
LOCATION '/datain/logs'

-- Other options here are to hardcode the schema or use a file on your local filesystem instead.
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/schema.avsc');

More information can be found on the cloudera documentation about hive and avro.

Load the snappy jar

To load data, you need to tell Hive that the data files will be compressed, and Hive needs to know how to decompress them. For this, you need to add the snappy jar to the list of extra jars loaded by Hive. This is done by adding the path to the snappy jar to the value of the hive.aux.jars.path property in your hive-site.xml. For instance:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:////usr/lib/hive/lib/hive-contrib.jar,...,file:////usr/lib/hive/lib/auxlib/snappy-java-1.0.4.1.jar</value>
</property>

Actually load data

You need to tell hive to use snappy, which is done the following way:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Then loading data means creating a new partition when a new directory is created with another key. Run this after having told hive to use snappy:

ALTER TABLE table_name
ADD IF NOT EXISTS PARTITION (key='some_new_key')
LOCATION '/datain/logs/key=some_new_key';
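If new keys show up regularly, generating these statements is easy to script. The Python sketch below builds one statement per key, assuming the directories are named key=<value> under the table location; the function name add_partition_statement is made up, and the values are not escaped, so only use it with keys you trust.

```python
def add_partition_statement(table, base_location, key):
    """Build the ALTER TABLE statement registering one key directory."""
    partition_dir = f"{base_location.rstrip('/')}/key={key}"
    return (
        f"ALTER TABLE {table} "
        f"ADD IF NOT EXISTS PARTITION (key='{key}') "
        f"LOCATION '{partition_dir}';"
    )

for key in ["some_new_key", "another_key"]:
    print(add_partition_statement("table_name", "/datain/logs", key))
```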

Using data with the default schema

If you use a custom schema, tailored to your data, you can then enjoy the full speed of Hive, as not much parsing will be needed by Hive to access your data.

If you use the default schema, then Hive does not know (yet) about the columns in your table. This can be fixed by the decode() function. For instance,

SELECT
hour, decode(body,'UTF-8') as body
FROM my_table;