The cost of ACID with ORC tables

ACID introduction

ACID transactions (update, merge) in Hive are awesome. The merge statement especially is incredibly useful.

Of course, not all tables are ACID. You need to store the table as ORC and mark it as transactional, but those are easy steps:

create table something (id bigint) stored as orc tblproperties("transactional"="true");
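
Once the table is transactional, row-level statements work as you would expect. A minimal sketch with made-up values (assuming a Hive 3 style setup where, as in the create table above, ACID tables do not require bucketing):

insert into something values (1), (2), (3);
-- none of these statements rewrites existing files in place, they only add new deltas
update something set id = 42 where id = 2;
delete from something where id = 3;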

Of course, in HDFS you cannot change a file once it is created. The standard way (not Hadoop specific) to handle changes to immutable files is to write deltas. Each ACID table will consist of a few directories (sketched below):

  • the base directory: the data at creation time,
  • one or more delta directories: containing the rows added, updated or deleted since then.
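
On disk this looks roughly like the sketch below (illustrative only: the exact warehouse path and the transaction ids encoded in the directory names depend on your Hive version and configuration):

something/
    base_0000001/               the data at creation time
    delta_0000002_0000002/      rows written by a later transaction
    delta_0000003_0000003/      rows written by yet another transaction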

Every hive.compactor.check.interval seconds the compactor checks whether a compaction is needed and, if so, triggers one. The compactor merges the deltas and the base directory into a single new base directory, with all the deltas applied to the original base data.
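
You do not have to rely only on the background compactor: Hive also lets you inspect and trigger compactions by hand, for instance:

-- list compaction requests and their current state
show compactions;
-- explicitly request a major compaction of the table created earlier
alter table something compact 'major';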

The reason compaction matters is that when you read an ACID table with many deltas, there is a lot more to read than for a base directory alone, as Hive has to go through each and every delta. This has IO and CPU costs, which disappear after compaction.

Naive ACID use

Every day I build a summary table gathering all the data that changed in the last 24 hours, as well as some related data. Many events are aggregated together. Think for instance about sending an email: I would get send data, open data, maybe click data, bounce data and a few others. I started building the summary following the temporal flow of the events:


create table summary (id bigint, number_sent bigint, number_open bigint...) stored as orc tblproperties("transactional"="true");

insert into summary select .... from sent;

merge into summary select ... from open;

merge into summary select ... from click;

...
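
For the record, each of those shorthand merge statements looks roughly like the one below for the open step. Everything here is a sketch: I assume a hypothetical open event table keyed by user_id, and I pretend summary only has the three columns shown in the create table:

merge into summary t
using (
  -- hypothetical source: one row per user with the number of opens
  select user_id as id, count(*) as nb_open
  from open
  group by user_id
) s
on t.id = s.id
when matched then update set number_open = s.nb_open
when not matched then insert values (s.id, 0, s.nb_open);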

Overall a few billion rows will be read. The final summary table will have about 100 million rows.

What is interesting here is that I am inserting the biggest data first. The table below sums up the reads and writes per event while building the whole summary, which ran for about 4 hours:

Event        Bytes read (GB)    Bytes written (GB)
Total        516.5              104.1
Sent         16.2               87.1
Open         88.8               14.2
Click        101.5              1.7
Conversion   102.9              0.01
Bounce       103                1
Spam         104                0.11

Seeing 500 GB read scared me a little, so instead of following the naive temporal flow, I started with the smallest event and finished with the biggest:

Event        Bytes read (GB)    Bytes written (GB)
Total        31.5               99.1
Conversion   0                  0
Spam         0                  0
Click        0.3                1.5
Bounce       1.7                1
Open         4.4                13.3
Sent         25.1               83.4

That’s much better already! The total number of bytes written does not change much (quite logical I suppose as the final data is the same) but the number of bytes read is only 6% of the original! Furthermore, it ran in 2h40 instead of 4 hours.
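
Concretely, the improved flow is just the same shorthand as before, issued with the smallest event first (a sketch; whichever event is processed first can keep the plain insert):

insert into summary select ... from conversion;
merge into summary select ... from spam;
merge into summary select ... from click;
merge into summary select ... from bounce;
merge into summary select ... from open;
merge into summary select ... from sent;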

I added one last step. This summary data was written at user level, and I actually needed one extra level of aggregation. I was worried about joining against the user table at every step, as the user table is quite big and joins are expensive. Still, I experimented with doing that aggregation at each step instead of doing one big aggregation at the end:

Event        Bytes read (GB)    Bytes written (GB)
Total        20.5               8.6
Conversion   0.2                0
Spam         1.2                0
Click        1.4                0.2
Bounce       1.5                0.2
Open         3.5                1.7
Sent         12.7               6.4

Total run time: 1.5 hours!
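
To give an idea of what aggregating at each step could look like, here is a sketch for the open step. Everything is hypothetical: I assume a users table carrying a user_id and an account_id to aggregate on, and an aggregated summary keyed by account_id with only the columns shown:

create table summary_by_account (account_id bigint, number_sent bigint, number_open bigint)
stored as orc tblproperties("transactional"="true");

merge into summary_by_account t
using (
  -- join against users and aggregate before merging, instead of at the very end
  select u.account_id, count(*) as nb_open
  from open o
  join users u on u.user_id = o.user_id
  group by u.account_id
) s
on t.account_id = s.account_id
when matched then update set number_open = s.nb_open
when not matched then insert values (s.account_id, 0, s.nb_open);

The table being merged into is now aggregated, hence much smaller, at every step, which is presumably where most of the drop in bytes read and written comes from.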

TL;DR

When using ACID, deltas are expensive. When using HDFS, writes are expensive. Order your processing to generate as little of both as possible.
