ORC benchmarking

I need to use transactional tables in Hive, so I naturally use the ORC format. One limitation of those transactional tables is that they have to be bucketed (in Hive 2 at least. In Hive3 my understanding is that bucketing is not mandatory, but the default is equivalent to having 1 bucket).

It is tricky to find a good answer on how to use buckets: how many? on which columns? It usually boils down to:

Buckets are something that should be done for a concrete problem not just because you think you should have them. Normally I would not use them.

Which is not helpful when you actually have to use them.

So how to use buckets?

I did some benchmarking to try to get a valid answer. My dataset was as follow:

  • Data:
    • 1B rows table worth 3 months of data, with a compound (logical) key: user id, and 4-level hierarchical identifier. This is about 15GB (unreplicated) on disk.
    • 9 merges of 1 hour of data
  • Benchmarks:
    • select count(*)
    • select count(*) where  hierarchy_level_1=something
    • select 1 specific row
    • big select with join on 2 other tables, resulting in 400k rows.

All benchmarks were run 5 times, doing one iteration for all tables before starting the next iteration to prevent caching.

I tested a few variations:

  • transactional table or not,
  • partitioned or not (partition was the highest level of my hierarchical key. About 200 hundred partitions, somewhat skewed),
  • 1 to 32 buckets,
  • bloom filters on levels #1 and/or #2  my hierarchical key and/or on the user id.
  • I added as extra test some non-transactional tables, sorted on different fields.

It is tricky to properly display the data here so I show it as an image (click to enlarge), but you can find it as a pdf: ORC Benchmark – results or see it on google sheets.


orcbench

What gives?

I had some expectations:

  • More buckets would mean more files so faster load and faster queries up to a point.
  • Bucket + partitions would create way too many files, at the cost of performance.
  • Proper bloom filters would make everything faster.

Well, looking at the results, here are the takeaways:

  • Bloom filters have basically no impact at all,
  • no partition usually helps,
  • 1 bucket is horrible (but that I did expect),
  • too many buckets and partition together is bad,
  • the sweet spot (for me) is partition +  4 buckets.

I am very disappointed because I had a lot of hopes for bloom filters (I guess there is something I am doing wrong). The rest is more a confirmation of my expectations, with numbers to prove them.

The worst of all? My sweet spot is the one I am already using, so after all those tests (which ran almost 3 days straight) I have no new information to speed up my queries. Well, the silver lining is that at least I have data to show that I can pat myself on the back.

Cluster on which I tested: hdp 2.6, hive 2, 8 data nodes (8 CPU, 32GB) on AWS, doing nothing else apart from running these tests.