I only use ORC tables in Hive, and while trying to understand some performance issues I wanted to make sure my tables where properly compressed. This is easy, just run
desc extended table;
and search the output for the string
compressed:true
Well, it turned out that it was false for all my tables although I was pretty sure I set up everything correctly, so I dug and experimented a bit. I generated an easy to compress data set, and load it in a few different tables with different options.
# create 1 csv, 500MB of easy to compress data yes '1,longish string which will compress really well' | head -n 10000000 > /tmp/source.csv # Copy this file in hdfs hdfs dfs -mkdir /tmp/compressiontest hdfs dfs -copyFromLocal /tmp/source.csv /tmp/compressiontest/source.csv
Then I loaded this data in 2 tables, compressed and uncompressed, directed with the setting hive.exec.compress.output.
CREATE EXTERNAL TABLE sourcedata (id INT, s STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tmp/compressiontest' ; MSCK REPAIR TABLE sourcedata; CREATE TABLE shouldbecompressed ( id INT, s STRING) STORED AS ORC LOCATION '/tmp/shouldbecompressed'; CREATE TABLE shouldbeuncompressed (id INT, s STRING) STORED AS ORC LOCATION '/tmp/shouldbeuncompressed'; set hive.exec.compress.output=true; INSERT INTO shouldbecompressed SELECT * FROM sourcedata; SELECT COUNT(*) FROM shouldbecompressed; set hive.exec.compress.output=false; INSERT INTO shouldbeuncompressed SELECT * FROM sourcedata; SELECT COUNT(*) FROM shouldbeuncompressed;
I still have compressed:false, but what happens on disk?
hdfs dfs -du -s -h /tmp/should\*
42.5 K /tmp/shouldbecompressed
39.8 K /tmp/shouldbeuncompressed
Hum, apparently both tables are compressed? It turned out that I forgot about an orc parameter (orc.compress), set by default to ZLIB for me. The other valid values are SNAPPY or NONE. So let’s try again:
CREATE TABLE shouldreallybecompressed ( id INT, s STRING) STORED AS ORC LOCATION '/tmp/shouldreallybecompressed' TBLPROPERTIES ("orc.compress"="ZLIB") ; CREATE TABLE shouldreallybeuncompressed ( id INT, s STRING) STORED AS ORC LOCATION '/tmp/shouldreallybeuncompressed' TBLPROPERTIES ("orc.compress"="NONE") ; set hive.exec.compress.output=true; INSERT INTO shouldreallybecompressed SELECT * FROM sourcedata; SELECT COUNT(*) FROM shouldreallybecompressed; set hive.exec.compress.output=false; INSERT INTO shouldreallybeuncompressed SELECT * FROM sourcedata; SELECT COUNT(*) FROM shouldreallybeuncompressed;
hdfs dfs -du -s -h /tmp/should\*
42.5 K /tmp/shouldbecompressed
39.8 K /tmp/shouldbeuncompressed
38.8 K /tmp/shouldreallybecompressed
3.8 M /tmp/shouldreallybeuncompressed
So indeed, the uncompressed table is less compressed, but is still a far cry from the 500MB I expected.
Long story short, ORC does some compression on its own, and the parameter orc.compress is just a cherry on top. on a side note, using SNAPPY instead of ZLIB the data size was 197k instead of 44k.
To look even deeper, hive on the command line has an option –orcfiledump, which will give some metadata about an orc file. So looking at a compressed file:
hive --orcfiledump /tmp/shouldbecompressed/000007_0
We can see, among other lines:
# yes, compressed!
Compression: ZLIB# This is the buffer size, nothing to do with actual data size
Compression size: 262144File length: 5459 bytes
For an uncompressed file:
hive --orcfiledump /tmp/shouldreallybeuncompressed/000000_0
Compression: NONE
File length: 136741 bytes
Long story short, the output of desc extended regarding compression is useless. And all my tables are indeed compressed.
This example was a bit artificial as the source file was very compressible. With another source file more random, generated as follow:
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold | head -c 500000k | awk '{print "1," $0}'> source.csv
Then the size on disk becomes:
370.4 M /tmp/shouldbecompressed 370.4 M /tmp/shouldbeuncompressed 370.4 M /tmp/shouldreallybecompressed 490.0 M /tmp/shouldreallybeuncompressed
And just because I am nice, here are the lines to clean up your droppings:
drop table shouldbecompressed; drop table shouldbeuncompressed; drop table shouldreallybeuncompressed; drop table shouldreallybecompressed; drop table sourcedata;