ORC benchmarking

I need to use transactional tables in Hive, so I naturally use the ORC format. One limitation of those transactional tables is that they have to be bucketed (in Hive 2 at least. In Hive3 my understanding is that bucketing is not mandatory, but the default is equivalent to having 1 bucket).

It is tricky to find a good answer on how to use buckets: how many? on which columns? It usually boils down to:

Buckets are something that should be done for a concrete problem not just because you think you should have them. Normally I would not use them.

Which is not helpful when you actually have to use them.

So how to use buckets?

I did some benchmarking to try to get a valid answer. My dataset was as follow:

Data:
- 1B rows table worth 3 months of data, with a compound (logical) key: user id, and 4-level hierarchical identifier. This is about 15GB (unreplicated) on disk.
- 9 merges of 1 hour of data
Benchmarks:
- select count(*)
- select count(*) where hierarchy_level_1=something
- select 1 specific row
- big select with join on 2 other tables, resulting in 400k rows.

All benchmarks were run 5 times, doing one iteration for all tables before starting the next iteration to prevent caching.

I tested a few variations:

transactional table or not,
partitioned or not (partition was the highest level of my hierarchical key. About 200 hundred partitions, somewhat skewed),
1 to 32 buckets,
bloom filters on levels #1 and/or #2 my hierarchical key and/or on the user id.
I added as extra test some non-transactional tables, sorted on different fields.

It is tricky to properly display the data here so I show it as an image (click to enlarge), but you can find it as a pdf: ORC Benchmark – results or see it on google sheets.

What gives?

I had some expectations:

More buckets would mean more files so faster load and faster queries up to a point.
Bucket + partitions would create way too many files, at the cost of performance.
Proper bloom filters would make everything faster.

Well, looking at the results, here are the takeaways:

Bloom filters have basically no impact at all,
no partition usually helps,
1 bucket is horrible (but that I did expect),
too many buckets and partition together is bad,
the sweet spot (for me) is partition + 4 buckets.

I am very disappointed because I had a lot of hopes for bloom filters (I guess there is something I am doing wrong). The rest is more a confirmation of my expectations, with numbers to prove them.

The worst of all? My sweet spot is the one I am already using, so after all those tests (which ran almost 3 days straight) I have no new information to speed up my queries. Well, the silver lining is that at least I have data to show that I can pat myself on the back.

Cluster on which I tested: hdp 2.6, hive 2, 8 data nodes (8 CPU, 32GB) on AWS, doing nothing else apart from running these tests.

This Data Guy

Journey in a world of big(ger) data

ORC benchmarking

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply