The cost of ACID with ORC tables

ACID introduction

ACID transactions (update, merge) in Hive are awesome. The merge statement especially is incredibly useful.

Of course, not all tables are ACID. You need to store the table as ORC and mark it as transactional, but those are easy steps:

create table something (id bigint) stored as orc tblproperties("transactional"="true");
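
Once the table is transactional, row-level statements work as you would expect. A minimal sketch with made-up values (assuming a Hive 3 style setup where, as in the create table above, ACID tables do not require bucketing):

insert into something values (1), (2), (3);
-- none of these statements rewrites existing files in place, they only add new deltas
update something set id = 42 where id = 2;
delete from something where id = 3;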

Of course, in HDFS you cannot change a file once it is created. The standard way (not Hadoop specific) to handle changes to immutable files is to write deltas. Each ACID table will consist of a few directories (sketched below):

  • the base directory: the data at creation time,
  • one or more delta directories: containing the rows added, updated or deleted since then.
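
On disk this looks roughly like the sketch below (illustrative only: the exact warehouse path and the transaction ids encoded in the directory names depend on your Hive version and configuration):

something/
    base_0000001/               the data at creation time
    delta_0000002_0000002/      rows written by a later transaction
    delta_0000003_0000003/      rows written by yet another transaction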

Every hive.compactor.check.interval seconds the compactor checks whether a compaction is needed and, if so, triggers one. The compactor merges the deltas and the base directory into a single new base directory, with all the deltas applied to the original base data.
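
You do not have to rely only on the background compactor: Hive also lets you inspect and trigger compactions by hand, for instance:

-- list compaction requests and their current state
show compactions;
-- explicitly request a major compaction of the table created earlier
alter table something compact 'major';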

The reason compaction matters is that when you read an ACID table with many deltas, there is a lot more to read than for a base directory alone, as Hive has to go through each and every delta. This has IO and CPU costs, which disappear after compaction.

Naive ACID use

Every day I build a summary table gathering all the data that changed in the last 24 hours, as well as some related data. Many events are aggregated together. Think for instance about sending an email: I would get send data, open data, maybe click data, bounce data and a few others. I started building the summary following the temporal flow of the events:


create table summary (id bigint, number_sent bigint, number_open bigint...) stored as orc tblproperties("transactional"="true");

insert into summary select .... from sent;

merge into summary select ... from open;

merge into summary select ... from click;

...
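
For the record, each of those shorthand merge statements looks roughly like the one below for the open step. Everything here is a sketch: I assume a hypothetical open event table keyed by user_id, and I pretend summary only has the three columns shown in the create table:

merge into summary t
using (
  -- hypothetical source: one row per user with the number of opens
  select user_id as id, count(*) as nb_open
  from open
  group by user_id
) s
on t.id = s.id
when matched then update set number_open = s.nb_open
when not matched then insert values (s.id, 0, s.nb_open);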

Overall a few billion rows will be read. The final summary table will have about 100 million rows.

What is interesting here is that I am inserting the biggest data first. The table below sums up the reads and writes per event while building the whole summary, which ran for about 4 hours:

Event        Bytes read (GB)    Bytes written (GB)
Total        516.5              104.1
Sent         16.2               87.1
Open         88.8               14.2
Click        101.5              1.7
Conversion   102.9              0.01
Bounce       103                1
Spam         104                0.11

Seeing 500 GB read scared me a little, so instead of following the naive temporal flow, I started with the smallest event and finished with the biggest:

Event        Bytes read (GB)    Bytes written (GB)
Total        31.5               99.1
Conversion   0                  0
Spam         0                  0
Click        0.3                1.5
Bounce       1.7                1
Open         4.4                13.3
Sent         25.1               83.4

That’s much better already! The total number of bytes written does not change much (quite logical I suppose as the final data is the same) but the number of bytes read is only 6% of the original! Furthermore, it ran in 2h40 instead of 4 hours.
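
Concretely, the improved flow is just the same shorthand as before, issued with the smallest event first (a sketch; whichever event is processed first can keep the plain insert):

insert into summary select ... from conversion;
merge into summary select ... from spam;
merge into summary select ... from click;
merge into summary select ... from bounce;
merge into summary select ... from open;
merge into summary select ... from sent;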

I added one last step. This summary data was written at user level, and I actually needed one extra level of aggregation. I was worried about joining against the user table at every step, as the user table is quite big and joins are expensive. Still, I experimented with doing that aggregation at each step instead of doing one big aggregation at the end:

Event        Bytes read (GB)    Bytes written (GB)
Total        20.5               8.6
Conversion   0.2                0
Spam         1.2                0
Click        1.4                0.2
Bounce       1.5                0.2
Open         3.5                1.7
Sent         12.7               6.4

Total run time: 1.5 hours!
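
To give an idea of what aggregating at each step could look like, here is a sketch for the open step. Everything is hypothetical: I assume a users table carrying a user_id and an account_id to aggregate on, and an aggregated summary keyed by account_id with only the columns shown:

create table summary_by_account (account_id bigint, number_sent bigint, number_open bigint)
stored as orc tblproperties("transactional"="true");

merge into summary_by_account t
using (
  -- join against users and aggregate before merging, instead of at the very end
  select u.account_id, count(*) as nb_open
  from open o
  join users u on u.user_id = o.user_id
  group by u.account_id
) s
on t.account_id = s.account_id
when matched then update set number_open = s.nb_open
when not matched then insert values (s.account_id, 0, s.nb_open);

The table being merged into is now aggregated, hence much smaller, at every step, which is presumably where most of the drop in bytes read and written comes from.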

TL;DR

When using ACID, deltas are expensive. When using HDFS, writes are expensive. Order your processing to generate as little of both as possible.
