I need to use transactional tables in Hive, so I naturally use the ORC format. One limitation of those transactional tables is that they have to be bucketed (in Hive 2 at least. In Hive3 my understanding is that bucketing is not mandatory, but the default is equivalent to having 1 bucket).
It is tricky to find a good answer on how to use buckets: how many? on which columns? It usually boils down to:
Buckets are something that should be done for a concrete problem not just because you think you should have them. Normally I would not use them.
Which is not helpful when you actually have to use them.
So how to use buckets?
I did some benchmarking to try to get a valid answer. My dataset was as follow:
- Data:
- 1B rows table worth 3 months of data, with a compound (logical) key: user id, and 4-level hierarchical identifier. This is about 15GB (unreplicated) on disk.
- 9 merges of 1 hour of data
- Benchmarks:
- select count(*)
- select count(*) where hierarchy_level_1=something
- select 1 specific row
- big select with join on 2 other tables, resulting in 400k rows.
All benchmarks were run 5 times, doing one iteration for all tables before starting the next iteration to prevent caching.
I tested a few variations:
- transactional table or not,
- partitioned or not (partition was the highest level of my hierarchical key. About 200 hundred partitions, somewhat skewed),
- 1 to 32 buckets,
- bloom filters on levels #1 and/or #2 my hierarchical key and/or on the user id.
- I added as extra test some non-transactional tables, sorted on different fields.
It is tricky to properly display the data here so I show it as an image (click to enlarge), but you can find it as a pdf: ORC Benchmark – results or see it on google sheets.
What gives?
I had some expectations:
- More buckets would mean more files so faster load and faster queries up to a point.
- Bucket + partitions would create way too many files, at the cost of performance.
- Proper bloom filters would make everything faster.
Well, looking at the results, here are the takeaways:
- Bloom filters have basically no impact at all,
- no partition usually helps,
- 1 bucket is horrible (but that I did expect),
- too many buckets and partition together is bad,
- the sweet spot (for me) is partition + 4 buckets.
I am very disappointed because I had a lot of hopes for bloom filters (I guess there is something I am doing wrong). The rest is more a confirmation of my expectations, with numbers to prove them.
The worst of all? My sweet spot is the one I am already using, so after all those tests (which ran almost 3 days straight) I have no new information to speed up my queries. Well, the silver lining is that at least I have data to show that I can pat myself on the back.
Cluster on which I tested: hdp 2.6, hive 2, 8 data nodes (8 CPU, 32GB) on AWS, doing nothing else apart from running these tests.