Compression of ORC tables in Hive

I only use ORC tables in Hive, and while trying to understand some performance issues I wanted to make sure my tables were properly compressed. This is easy: just run

desc extended table;

and search the output for the string compressed:true.

Well, it turned out that it was false for all my tables although I was pretty sure I had set up everything correctly, so I dug and experimented a bit. I generated an easy-to-compress data set and loaded it into a few different tables with different options.

# create 1 csv, 500MB of easy to compress data
yes '1,longish string which will compress really well' | head -n 10000000 > /tmp/source.csv

# Copy this file in hdfs
hdfs dfs -mkdir /tmp/compressiontest
hdfs dfs -copyFromLocal /tmp/source.csv /tmp/compressiontest/source.csv

Then I loaded this data into 2 ORC tables, compressed and uncompressed, controlled by the setting hive.exec.compress.output. The source csv itself is exposed as an external table, sourcedata.

CREATE EXTERNAL TABLE sourcedata (id INT, s STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/compressiontest';

CREATE TABLE shouldbecompressed (id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldbecompressed';

CREATE TABLE shouldbeuncompressed (id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldbeuncompressed';

set hive.exec.compress.output=true;
INSERT INTO shouldbecompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldbecompressed;

set hive.exec.compress.output=false;
INSERT INTO shouldbeuncompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldbeuncompressed;

I still have compressed:false, but what happens on disk?

hdfs dfs -du -s -h /tmp/should\*

42.5 K /tmp/shouldbecompressed
39.8 K /tmp/shouldbeuncompressed

Hmm, apparently both tables are compressed? It turned out that I had forgotten about an ORC parameter (orc.compress), which defaults to ZLIB for me. The other valid values are SNAPPY and NONE. So let’s try again:

CREATE TABLE shouldreallybecompressed (id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldreallybecompressed'
TBLPROPERTIES ("orc.compress"="ZLIB");

CREATE TABLE shouldreallybeuncompressed (id INT, s STRING)
STORED AS ORC
LOCATION '/tmp/shouldreallybeuncompressed'
TBLPROPERTIES ("orc.compress"="NONE");

set hive.exec.compress.output=true;
INSERT INTO shouldreallybecompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldreallybecompressed;

set hive.exec.compress.output=false;
INSERT INTO shouldreallybeuncompressed SELECT * FROM sourcedata;
SELECT COUNT(*) FROM shouldreallybeuncompressed;

hdfs dfs -du -s -h /tmp/should\*

42.5 K /tmp/shouldbecompressed
39.8 K /tmp/shouldbeuncompressed
38.8 K /tmp/shouldreallybecompressed
3.8 M /tmp/shouldreallybeuncompressed

So indeed, the uncompressed table takes more space, but it is still a far cry from the 500MB I expected.

Long story short, ORC does some compression on its own, and the parameter orc.compress is just a cherry on top. On a side note, with SNAPPY instead of ZLIB the data size was 197k instead of 44k.
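If you want to change the codec of an existing table, you can do it via the table properties; here is a quick sketch through beeline (table name taken from the example above), keeping in mind that files already on disk keep their old codec until they are rewritten:

# switch the ORC codec of an existing table; only files written afterwards are affected
beeline -e 'ALTER TABLE shouldbecompressed SET TBLPROPERTIES ("orc.compress"="SNAPPY");'

# rewrite the existing data so it picks up the new codec
beeline -e 'INSERT OVERWRITE TABLE shouldbecompressed SELECT * FROM shouldbecompressed;'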

To look even deeper, hive on the command line has an option --orcfiledump, which will give some metadata about an ORC file. So looking at a compressed file:

hive --orcfiledump /tmp/shouldbecompressed/000007_0

We can see, among other lines:

# yes, compressed!
Compression: ZLIB

# This is the buffer size, nothing to do with actual data size
Compression size: 262144

File length: 5459 bytes

For an uncompressed file:

hive --orcfiledump /tmp/shouldreallybeuncompressed/000000_0

Compression: NONE
File length: 136741 bytes

Long story short, the output of desc extended regarding compression is useless. And all my tables are indeed compressed.
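If you do not trust desc extended, a quick way to check a whole table is to loop orcfiledump over its files; a rough sketch (adapt the path to your own table):

# print the compression codec of every file of a table
for f in $(hdfs dfs -ls /tmp/shouldbecompressed | awk '{print $8}'); do
  echo -n "$f: "
  hive --orcfiledump "$f" | grep '^Compression:'
done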

This example was a bit artificial as the source file was very compressible. With a more random source file, generated as follows:

cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold | head -c 500000k | awk '{print "1," $0}'> source.csv

Then the size on disk becomes:

370.4 M /tmp/shouldbecompressed
370.4 M /tmp/shouldbeuncompressed
370.4 M /tmp/shouldreallybecompressed
490.0 M /tmp/shouldreallybeuncompressed

And just because I am nice, here are the lines to clean up your droppings:

drop table shouldbecompressed;
drop table shouldbeuncompressed;
drop table shouldreallybeuncompressed;
drop table shouldreallybecompressed;
drop table sourcedata;

Environment variables in Hive

I will explain here how to set and use variables in Hive.

How to set a variable

Just use the keyword set

set foo=bar;
set system:foo=bar;

Alternatively, for the hiveconf namespace you can set the variable on the command line:

beeline --hiveconf foo=bar

How to use a variable

Wherever you want to use a value, use this syntax instead: ${namespace:variable_name}. For instance:

select '${hiveconf:foo}', '${system:foo}', '${env:CLASSPATH}';

Note that variables will be replaced before anything else happens. This means that this is perfectly valid:

set t=employees;
set verb=desc;
${hiveconf:verb} ${hiveconf:t};

But this will not do what you expect (hint: you will end up with 4 quotes in your select statement):

set s='Hello world';
select '${hiveconf:s}';
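What you probably wanted is to leave the value unquoted in set and quote it only where it is used; the same example through beeline, for instance:

# define the value without quotes, quote only at the point of use
beeline -e "set s=Hello world; select '\${hiveconf:s}';"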


Furthermore, it means that you need to take care of data types. Selecting a bare string is not valid, so the following code is invalid as well:

set v=astring;
select ${hiveconf:v};

You will get:

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'astring': (possible column names are: ) (state=42000,code=10004)

In our case, you just need to quote the variable.

Note that this would work as-is with an int, as a bare int is valid in a select statement.
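To illustrate both cases through beeline (variable names are just examples):

# quoted string variable: valid, returns astring
beeline -e "set v=astring; select '\${hiveconf:v}';"

# bare int variable: also valid, returns 42
beeline -e "set n=42; select \${hiveconf:n};"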

Another caveat: make sure the variable exists, otherwise you will either get the literal text for quoted variables:

select '${hiveconf:donotexists}';
| _c0 |
| ${hiveconf:donotexists} |

or an unhelpful message for unquoted variables:

> select ${hiveconf:doesnotexist};
Error: Error while compiling statement: FAILED: ParseException line 1:7 cannot recognize input near '$' '{' 'hiveconf' in select clause (state=42000,code=40000)

If you only want to see the value of a variable, you can just use set as well:

set hiveconf:foo;

How to list variables

Just use SET;, but this will output a massive unreadable list. You are better off redirecting this output to a file, e.g.

beeline -e 'SET;' | sed 's/\s\+/ /g'> set.out

Note that I squash the spaces here. As the columns are aligned and some values are very long strings, squashing makes reading much easier.

Then if you want to see a specific set of variables, you can just run:

# system variables
grep '| system:' set.out

# Env variables
grep '| env:' set.out

# other variables
cat set.out | grep -v '| env:' | grep -v '| system:'


Hive has 3 namespaces for variables: hiveconf, system and env.


Hiveconf

Hiveconf is the namespace used when you use set or when you give a variable on the command line with --hiveconf foo=bar. Note that you can set those variables without specifying the namespace, but you always need to specify the namespace when using them.

set foo=bar;
select "${hiveconf:foo}";


Env

This is the namespace of the shell environment variables. You can easily read them with the ${env:...} prefix:

SELECT "${env:hostname}";

I specifically chose this variable. If you run this query yourself, you will see that it is the environment of the Hive server which is used, not the environment of your client. This greatly limits the usefulness of environment variables.

Note that environment variables cannot be set.


System

This namespace holds the Java system properties of the Hive server JVM. Those will contain for instance JVM settings, logfile destinations and more.
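For example, you can read standard JVM properties this way (user.name and java.version are regular Java system properties); as with env, the values come from the Hive server, not from your client:

# read a couple of system properties from the Hive server JVM
beeline -e "select '\${system:user.name}', '\${system:java.version}';"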


Count lines of multiple files in hdfs

This is a situation which pops up here and there, and which I had recently: given a bunch of files on hdfs, how do you get the total line count?

It turns out that there are a lot of options, which I will explain here. The goal is to count the lines of all csv files in a specific directory, $d.

After all the solutions I will show some benchmarks.


The sysadmin way

Set up a fuse mountpoint to hdfs at $mountpoint (I always do that; it is very stable, uses almost no resources and helps with many things).

wc -l $mountpoint$d/*.csv | grep total

Easy and sweet, but has the drawback that all data is downloaded locally before being processed.
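For reference, this is roughly how such a mountpoint can be created, assuming the hadoop-fuse-dfs helper shipped with your distribution (the binary name and the namenode address will vary):

# mount HDFS under $mountpoint via FUSE
sudo mkdir -p "$mountpoint"
sudo hadoop-fuse-dfs dfs://namenode.example.com:8020 "$mountpoint"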

The dutiful way

We are using Hadoop, so let’s use the Hadoop command line.

hdfs dfs -ls $d/\*.csv | awk '{print $8}' | xargs hdfs dfs -cat | wc -l

Note that this uses xargs. If you have too many files you might need to loop instead, but invoking the hdfs command many times will incur a lot of JVM startup penalty.

The issue here is as well that all data is downloaded locally.

The old school way

Let’s just use a streaming job:

hadoop jar $HADOOP_MAPRED_HOME/hadoop-streaming.jar \
     -Dmapred.reduce.tasks=1 \
     -input $d/\*.csv \
     -output /tmp/streamout \
     -mapper "bash -c 'paste <(echo count) <(wc -l)'" \
     -reducer "bash -c 'cut -f2 | paste -sd+ | bc'"

The only issue with a streaming job is that the output is written to a file on hdfs (the -output parameter), not to stdout.
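Once the job is finished you can fetch the result from that directory, and you will need to remove it before re-running, as streaming refuses to overwrite an existing output directory:

# the single reducer wrote its result in a part file
hdfs dfs -cat /tmp/streamout/part-*

# clean up before running the job again
hdfs dfs -rm -r /tmp/streamout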

The tormented soul

Streaming jobs are nice but this bash is ugly and python is cool for data, right?

# mapper: print the number of lines of its input split
import sys
print(sum(1 for l in sys.stdin))

# reducer: sum the counts printed by the mappers
import sys
print(sum(int(l) for l in sys.stdin))

Save these as mapper.py and reducer.py (the names are of course up to you), and then just run:

hadoop jar $HADOOP_MAPRED_HOME/hadoop-streaming.jar \
   -Dmapred.reduce.tasks=1 \
   -files mapper.py,reducer.py \
   -input $d/\*.csv \
   -output /tmp/streamout \
   -mapper "/usr/bin/python3 mapper.py" \
   -reducer "/usr/bin/python3 reducer.py"


The connoisseur

Hey, do not forget about pig!

csvs = LOAD '$d/*.csv';
csvgroup = GROUP csvs ALL;
cnt = FOREACH csvgroup GENERATE COUNT(csvs);
dump cnt;

Save this as cnt.pig and run it with (passing the directory as a parameter):

pig -param d="$d" cnt.pig


The hipster

Pyspark is all the rage. In a pyspark shell (sc is already defined there, and $d stands for your directory as before):

cnt = sc.textFile("$d/*.csv").count()
print(cnt)


Ok this was fun, but what works best?

I ran 2 sets of tests: one on a directory with a few (~5) big files, one on a directory with many (250) tiny files. I ran the count 10 times for each test; the results are in seconds and lower is better.

             Few big files           Many small files
             Mean   Median  Stdev    Mean   Median  Stdev
Sysadmin     49.6   46.5    12.17    1      1       0
Dutiful      49.8   46.5    11.71    5.1    5       0.32
Connoisseur  25.7   25      2.31     17.2   17      0.42
Old school   21.5   21.5    0.53     47.7   48      1.06
Tormented    28.3   28      0.82     39.9   39.5    1.1
Hipster      20.2   20      1.03     9.4    9       0.52

A few things are apparent:

  • Different use cases will call for different options,
  • When data is downloaded (sysadmin via the mountpoint or dutiful via hdfs -cat) the variance is higher. Weirdly enough, in my tests they both started at about 37 seconds and increased by 5 seconds at every run, before stabilising at 63 seconds,
  • pyspark is the clear winner.

Hadoop metrics in graphite

I will not present graphite here; if you end up reading this I assume you already have a graphite instance up and running. If not, it is a matter of less than an hour to get a usable instance.

Hadoop uses metrics2 which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.

Those metrics can be very easily exported to graphite to then be sliced and diced to your heart’s content.

You only need to modify the file hadoop-metrics2.properties by adding the following snippet. The class and property names below are the standard metrics2 GraphiteSink ones; the host, port and prefixes are of course to be adapted to your setup:

# Sampling period in seconds
*.period=60

# Graphite sink class
*.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink

# Location of your graphite instance
*.sink.graphite.server_host=graphite.example.com
*.sink.graphite.server_port=2003

# Define for each metric group (* in *.prefix) how it should be named
# in graphite (part after the =)
namenode.sink.graphite.metrics_prefix=hadoop.namenode
datanode.sink.graphite.metrics_prefix=hadoop.datanode
resourcemanager.sink.graphite.metrics_prefix=hadoop.resourcemanager

In Ambari, just go to HDFS > Config > Advanced; the location for other distributions should be trivial to find.

After that restart hdfs and all relevant services you asked to monitor (if you asked to monitor resourcemanager, restart the resource managers and so on).

That’s it, you’re set.
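To check that metrics actually arrive, you can query the find endpoint of graphite-web; a sketch, assuming the host and the hadoop.* prefixes from the snippet above:

# list the hadoop metric trees graphite has received so far
curl "http://graphite.example.com/metrics/find?query=hadoop.*"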

If you are on HDP, you can go a bit further. HDP actually ships with a grafana instance (if you installed Ambari metrics) which can use graphite as a data source. The data will be the same, the display will be a tad prettier.

This uses graphite-web (port 80 by default), which needs CORS enabled. You can do it in Apache (the default graphite-web http server) by adding this line in your graphite vhost:

Header set Access-Control-Allow-Origin "*"


Extracting queries from Hive logs

Hive logs are very verbose, and I personally find it a pain to wade through them when I try to understand which queries my ETL tool decided to generate.

To help with this, I created a small python script which looks at hive log files and outputs the SQL queries, and only the queries, with some information about them when known: start time, duration, success.


./ --help 
usage: [-h] [--since SINCE] [--to TO] [--logdir LOGDIR]
              [--glob LOGFILE_GLOB]
              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Displays queries ran on Hive.

optional arguments:
 -h, --help show this help message and exit
 --since SINCE how far to look back. (default: 15m)
 --to TO How far to look forward. (default: now)
 --logdir LOGDIR Directory of hive log files. (default: /var/log/hive)
 --glob LOGFILE_GLOB Shell pattern of hive logfiles inside their logdir.
 (default: hiveserver2.log*)
 --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}
 Log level. (default: warn)

Sample output:

Started at 2017-06-22 05:30:58 for 12.788000s by hive on (Probably success). (Thread id: 79733, query id: hive_20170622053058_676612af-7bb8-4c4b-8fce-51bd1ae7be71, txn id: 0):
ORDER BY -- required for next step

Started at 2017-06-22 05:31:25 for 0.018000s by Unknown on Unknown (Probably success). (Thread id: 79770, query id: hive_20170622053125_7d8e644a-5c23-4ca8-ab0f-20becdd65c3b, txn id: Unknown):
use events

Started at 2017-06-22 05:31:25 for Unknowns by Unknown on Unknown (FAILED). (Thread id: handler-46, query id: Unknown, txn id: Unknown):
 [big sql...]
 ) as r

) src

 [big sql...]

  [more sql]
Error: ParseException line 36:4 cannot recognize input near '$' '{' 'SEQ_CHECKER_SQL' in expression specification

As you can see:

  • If user, hostname and duration are known, they are displayed,
  • the query is displayed with the same formatting as it was sent, including comments,
  • the error (if any) is shown. In my case, a variable was not expanded by the ETL tool.

You can find the source on github.


Hortonworks, Cloudera or MapR?

This is one of the big questions when you start your first Hadoop project. Hadoop is Hadoop, right? So it should not matter which distribution you use? There is some truth in there, but there are still quite a few differences between these vendors which are worth knowing about. After all, Linux is Linux, right? Debian or Redhat should not matter? You can jump straight to the quick answer, or carry on reading for more details.


If you want to know more about Hadoop itself, you can check out the official Apache site, or just the wikipedia page for history and so on.

There are 3 big Hadoop distributions. Apache Hadoop itself, the root of them all, is not a distribution per se, as you can download each component individually, but a lot of elbow grease is needed to tie everything together. The 3 main vendors bundle Apache Hadoop with other tools, open source as well as their own proprietary bricks, to create distributions. Those are Cloudera, MapR and Hortonworks. There are other vendors as well, Microsoft (HDInsight, cloud only), Pivotal (Pivotal HD) and others I forget, but I concentrate on the big 3 here.

Quick answer

Use MapR if:

  • Performance is paramount,
  • You are a big company with strong audit requirements,
  • You know you will pay a licence for support.

Use Hortonworks if:

  • Open source is very important to you,
  • You do not want to pay for a licence but still want to do as much as possible (including security, authorisation),
  • You already have a data warehouse (Teradata, Oracle, Vertica…) that you plan to carry on using but would like to offload, or which does not allow all the processing you plan to do.

Use Cloudera if:

  • You need to be PCI compliant,
  • You want as much as possible automated for you, at the potential cost of a licence.

Longer answer and description

A generic comment first. If you already plan to use some specific tools or Linux distributions, make sure that they are compatible with your version. For instance Tez does not run on Cloudera, Impala would have problems on Hortonworks, and MapR does not support Debian (but does support Ubuntu).


MapR

MapR's biggest differentiators are its filesystem and its database, said to improve the overall performance a lot because they are highly optimised and skip the JVM and ext4 layers, while still being compatible with the HDFS and HBase APIs. Their filesystem is a real filesystem, not append-only as HDFS is, and can be mounted via NFS, which makes some administration tasks much easier.

MapR strives to support the whole Hadoop ecosystem (for instance Tez, Impala, Spark…) which on paper means that more tools should be supported by MapR than by the other distributions.

MapR is the only one to support volumes, which can give you very strong security and multi-tenancy, as you can control with a very fine grain who can access which volume.

On the bad side, MapR is pretty limited in its free version. HA for instance is only available with a licence. (EDIT: see the comment from Anoop Dawar below, failover is now part of M3, the free version.)

As a nice starting point, you can spawn AWS instances configured for MapR, where the cost includes licence and support, without having to commit for a year. Usually the AWS images lag about 2 months behind the main MapR release due to extra testing and procedures.


Cloudera

Cloudera is the oldest Hadoop distribution. Their vision is to fully replace the data warehouse by creating an Enterprise Data Hub, and to help the user a lot along the way.

The biggest strength of Cloudera is their automation. Cloudera Manager and Navigator are amazing tools doing a lot for you, and are said to be superior to the equivalents of the other distributions. That said, they are closed source, and although the Manager is available for free, the Navigator (security, governance) is not.

Another very strong point of Cloudera is Impala, a very fast open-source in-memory SQL query engine.

Cloudera is the only PCI-compliant distribution.

Cloudera claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.


Hortonworks

Hortonworks' vision is not to fully replace the data warehouse, but to use the existing warehouse for offloading or new processes, thanks to integration with multiple partners.

Hortonworks is a fully open source distribution. There is no licence to pay, only support if you so wish. The definition of open source for Hortonworks is very strict. For them, open source means managed by a committee, to avoid 'dictatorial' open source, where a project is technically open source but only one company can accept (and usually refuses) contributions.

Ambari is the management tool of Hortonworks. Although it is quite new and does not yet have all the features you would want from a manager, it is improving at great speed and is supported by multiple organisations, thanks to being open source.

Hortonworks supports Debian, but with an extra 1-month delay due to extra tests needed in comparison with the standard Redhat/CentOS version.

Hortonworks claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.

Price comparison

This is always a big question, isn’t it? Here are a few prices I could gather. Those are just ballpark figures, and could of course be negotiated.

MapR support (24/7) is around 4k$/server/year. This goes up to 6k if you want to include MapRDB as well. This includes licence and support.

Cloudera support (24/7) is around 6.5k€/server/year. This includes support and licence. Note that Cloudera has multiple options, where you can elect to have full support (Enterprise), support for only one element (Flex) or support for only the core Hadoop, i.e. HDFS, Hive and the like (Basic). Flex and Enterprise provide the Navigator, but Basic is very cheap (500€/server/year).

Hortonworks does not provide a licence as it is fully opensource, but support (24/7) is priced at about 3.5k€/server/year.

Vendor lockin

This is usually a big concern, especially when talking about non open-source tools. I would claim that it is a non-problem.

Your data is always available via standard tools, and that is what matters the most. You will always be able to retrieve or export it in multiple ways. The rest (administration, basically) is tied to your distribution anyway. Whether you do everything with vanilla Apache and Puppet, or use Ambari, Cloudera Manager or the MapR manager, it is not transferable to the other tools. In short, you are locked in anyway, administration-wise.

Tutorial: Install CDH 5 for testing on one machine

This is a tutorial, based on my own experience, on installing CDH 5.4 via the Cloudera Manager on one machine only, for test purposes. It was done on a Mint machine (based on Ubuntu/Debian). Commands will thus be given with apt-get; you can probably just replace apt-get with yum if you are trying to do this on a Redhat-based server.



Install ssh server on your machine:

apt-get install openssh-server

Make sure you can connect as root if you do not want everything to run under one user; this is a question asked during the installation process (screen 3). Running all under one user is nice for a one-machine test, but I believe you might run into issues if you later want to extend your cluster. For this reason I chose the normal, multi-user (hdfs, hadoop and so on) installation. Cloudera actually gives a warning for the single user installation:

The major benefit of this option is that the Agent does not run as root. However, this mode complicates installation, which is described fully in the documentation. Most notably, directories which in the regular mode are created automatically by the Agent, must be created manually on every host with appropriate permissions, and sudo (or equivalent) access must be set up for the configured user.

On my machine, for instance, I needed to update /etc/ssh/sshd_config to have the line:

PermitRootLogin yes
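Then restart the ssh service so the change is taken into account (the service name may differ on your system):

sudo service ssh restart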

Other packages

For the heartbeat, you need supervisor and the command ntpdc:

apt-get install supervisor ntp

Supported platforms

Officially, Cloudera can be installed on some versions of Debian or Ubuntu. If you use a derivative (like Mint), it might work (YMMV), but the Cloudera installer will refuse to run. You can fool the installer by changing the lsb-release file:

sudo mv /etc/lsb-release /etc/lsb-release.orig
sudo ln -s /etc/upstream-release/lsb-release /etc/lsb-release
# After installation you can revert with:
sudo rm /etc/lsb-release
sudo mv /etc/lsb-release.orig /etc/lsb-release


Download cloudera-manager-installer.bin as described in the documentation from Cloudera, then:

chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

Note that it will install the Oracle JDK (1.7 for CDH 5.4.0) and PostgreSQL. At the end your browser should open and connect you to http://localhost:7180. Do not panic if the connection cannot be established at first; try again in a minute or two, to give the servers enough time to properly start up. Note that if your machine is not very powerful, it can take 2 minutes. The username and password there are admin/admin.
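If the page really does not come up after a few minutes, the Cloudera Manager server log usually tells you why (path valid for a default installation):

sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log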


IP address

Click continue a few times, and you will be asked to enter an IP address. As you are only testing on your machine, type yours, which you can find via hostname -I in your terminal. Make sure to use your real IP, not 127.0.0.1. The reason is that if you later extend your cluster with another node, and this node number 2 (n2) wants to access node number 1 (n1), it would try to access n1 via 127.0.0.1, which would of course point to n2 itself. This is good general practice.

As a host will be added to the Cloudera manager if it heartbeats, a partial installation might make a ghost host (localhost) appear in ‘Currently Managed Host’. In that case, make sure it is not selected before carrying on.

Acquiring installation lock

If you are blocked on ‘Acquiring installation lock’, click ‘Abort’, then:

rm -rf /tmp/scm_prepare*
rm -f /tmp/.scm_prepare_node.lock
# if above is not enough:
service cloudera-scm-agent restart
service cloudera-scm-server-db restart
service cloudera-scm-server restart

and ‘retry failed host’

Full restart

If like me you screwed everything up, you can always uninstall everything (make sure to say yes when asked to delete the database files). Cloudera explains (parts of) what to do, but the violent and complete way is as follows, to be done as root:


# kill any PID listed by this ps below:
ps aux | grep cloudera
# this command does it automatically
kill $(ps ax --format pid,command | grep cloudera | sed -r 's/^\s*([0-9]+).*$/\1/')
# purge all cloudera packages
apt-get purge cloudera-manager-server-db-2 cloudera-manager-server cloudera-manager-daemons cloudera-manager-agent 
# I am not so sure when this one is installed or not:
apt-get purge cloudera-manager-repository
# your choice, would clean up orphaned packages (postgres)
apt-get autoremove
# purge all droppings
rm -rf /etc/cloudera*
rm -rf /tmp/scm_prepare*
rm -f /tmp/.scm_prepare_node.lock
rm -rf /var/lib/cloudera*
rm -rf /var/log/cloudera*
rm -rf /usr/share/cmf
rm -rf /var/cache/yum/cloudera*
rm -rf /usr/lib/cmf

Could not connect to host monitor

After all is done with success everywhere, you go back to the home page and you see a lot of sad empty graphs with ‘query error’. This means that the management services are not running.

You can easily fix this by clicking on the top left ‘Add Cloudera Management Service’, and following the wizard from there.