Why is my Hive MERGE statement slow?

In my ETL flow, I need to merge a source table into a destination table, in Hive. This turned out to be much slower than expected, so I had to dig around a lot; here is what I found.

Context

Some data is coming from Kafka and is written as Avro files on HDFS. These Avro files are used to create an external table, which is then merged every day into the final ORC table. The external data files are then moved out of the way, meaning that the next ETL run will have a brand new external table to be fully merged into the destination table.
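
For a concrete picture, the source side looks roughly like the sketch below; the table name, location and schema URL are made up for illustration, not the real ones:

-- Hypothetical sketch: an external table over the Avro files landed
-- from Kafka. Name, location and schema URL are illustrative only.
CREATE EXTERNAL TABLE external_table
STORED AS AVRO
LOCATION '/data/landing/contact'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/contact.avsc');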

SQL


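-- The next setting disables the SQL-standard check that each destination
-- row is matched by at most one source row; presumably needed here because
-- the source can contain several rows per key (hence the r = 1 filters below).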
set hive.merge.cardinality.check=false;
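-- Regex capturing the part of an email address after the last @,
-- used below to compute the domain column.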
set domainregexp='.*@(.*?)$';
MERGE INTO contact dst
USING (
  SELECT

    -- DISTINCT (key) fields
      client -- partition column
    , user_id AS id
    , ct.cid AS cid
    -- other fields
    , email
    , lang
    -- note: domain would go here, but it is computed from email. I compute
    -- it only when needed, to prevent useless processing.
    , CAST(timestamp_ms_utc AS TIMESTAMP) AS ts_utc

    , ROW_NUMBER() OVER (
      PARTITION BY client
        , ct.cid
        , user_id
      ORDER BY timestamp_ms_utc DESC
     ) as r

  FROM
    external_table
  -- campaign_id is a stupid struct<long:bigint,array:array<bigint>>.
  -- Let's sanitise it.
  LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
  dst.client = src.client
  AND dst.campaign_id = src.cid
  AND dst.id = src.id

-- On match: keep latest loaded
WHEN MATCHED
 AND dst.updated_on_utc < src.ts_utc
 AND src.r = 1
THEN UPDATE SET
  -- other fields
    email = src.email
  , domain = regexp_extract(src.email, ${hiveconf:domainregexp}, 1)
  , lang = src.lang
  , updated_on_utc = src.ts_utc

WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
   src.id
 , src.cid

 , src.email
 , regexp_extract(src.email, ${hiveconf:domainregexp}, 1)
 , src.lang

 , src.ts_utc -- insert_date
 , src.ts_utc -- update_date

 , src.client -- partition column
)
;

This statement:

  • reads the source table,
  • explodes an array (campaign_id; see the toy example after this list),
  • orders the rows within the same ‘unique’ key (ROW_NUMBER()),
  • updates or inserts the first unique row.
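
Here is a tiny, self-contained illustration of the explode and ROW_NUMBER steps, on made-up values; per (user_id, cid) pair, only the most recent row survives:

-- Toy data: one user with two successive events.
SELECT user_id, cid, ts
FROM (
  SELECT
      t.user_id
    , ct.cid
    , t.ts
    , ROW_NUMBER() OVER (
        PARTITION BY t.user_id, ct.cid
        ORDER BY t.ts DESC
      ) AS r
  FROM (
    SELECT 42 AS user_id, array(1001, 1002) AS campaign_ids, 1 AS ts
    UNION ALL
    SELECT 42, array(1001), 2
  ) t
  LATERAL VIEW explode(t.campaign_ids) ct AS cid
) ranked
WHERE r = 1;
-- Returns (42, 1001, 2) and (42, 1002, 1): explode produces one row per
-- campaign id, and ROW_NUMBER() keeps only the latest row per key.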

Problem

Merging takes exponentially longer with each run. Merging the first day into the (empty) destination table takes about 30 minutes. The second day takes about 1.5 hours. The third day takes 4 hours. I stopped there.

What could go wrong?

Many things, as it turned out.

(Attempted) Solutions

SQL tweaking

My first guess was that my SQL was not great. Here is what I tried:

  • Removing the regex. No impact.
  • Creating a temporary table without duplicates and merging that one (sketched after this list). Negative impact (4x longer).
  • Executing the merge per partition, one by one. Very negative impact.
  • Replacing the source table with a subquery that keeps only the r = 1 rows before the merge. Negative impact (20% longer).
  • Creating a temporary table with exactly the same structure as the destination table and merging that one. Negative impact (30% longer).
  • Pre-exploding the lateral view earlier in the process. Negative impact (25% longer).
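
For reference, here is a sketch of the deduplicated temporary table from the second bullet, reusing the column names of the MERGE statement above:

-- Sketch of the "deduplicate into a temporary table first" attempt.
-- Same source columns as in the MERGE statement above.
CREATE TEMPORARY TABLE contact_dedup AS
SELECT client, id, cid, email, lang, ts_utc
FROM (
  SELECT
      client
    , user_id AS id
    , ct.cid AS cid
    , email
    , lang
    , CAST(timestamp_ms_utc AS TIMESTAMP) AS ts_utc
    , ROW_NUMBER() OVER (
        PARTITION BY client, ct.cid, user_id
        ORDER BY timestamp_ms_utc DESC
      ) AS r
  FROM external_table
  LATERAL VIEW explode(campaign_id) ct AS cid
) ranked
WHERE r = 1;
-- The MERGE then runs against contact_dedup instead of the subquery.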

Apparently my SQL was quite good, so I had to look elsewhere.

Java heap

It turned out that many of my services were under-configured. I increased the DataNode heap, the NameNode heap and the Hive metastore heap, and this already made a big difference in speed, but it was not enough.
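
For illustration, these heaps live in the usual service environment files; the variable names below are the standard ones, but the sizes are hypothetical and the right values depend entirely on your cluster:

# hadoop-env.sh: heaps for the HDFS daemons (sizes are hypothetical,
# tune them to your cluster).
export HADOOP_NAMENODE_OPTS="-Xmx4g ${HADOOP_NAMENODE_OPTS}"
export HADOOP_DATANODE_OPTS="-Xmx4g ${HADOOP_DATANODE_OPTS}"

# hive-env.sh: heap (in MB) picked up by the Hive services,
# including the metastore.
export HADOOP_HEAPSIZE=4096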

Small files

This was a massive issue.

I had 3 source files per table per minute. On top of this, I had some aggressive partitioning and bucketing (buckets are mandatory for ACID tables, and ACID tables are mandatory for MERGE).

Updating this to 3 source files per hour and having only 4 buckets per table instead of 64 gave me great performance. I am still not fully clear about the impact of bucketing, but that will be a question for later if I notice other performance problems. I have enough on my plate to not indulge in premature optimisation.
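
Here is a hedged sketch of the destination table after the change; the column list and the bucketing column are inferred from the MERGE statement above (treat them as assumptions), and the point is the bucket count, ORC storage and the transactional property:

-- Sketch only: columns inferred from the MERGE statement above, and
-- CLUSTERED BY (id) is an assumption. The point: 4 buckets instead of
-- 64, ORC storage, and the ACID property required by MERGE.
CREATE TABLE contact (
    id              BIGINT
  , campaign_id     BIGINT
  , email           STRING
  , domain          STRING
  , lang            STRING
  , inserted_on_utc TIMESTAMP
  , updated_on_utc  TIMESTAMP
)
PARTITIONED BY (client STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');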

Final solution

In four words: bigger heap, fewer files.

My initial 30-minute merge into an empty table is now done in about 8 minutes, against a table holding 145M rows into which 35M rows are merged daily.


4 thoughts on “Why is my Hive MERGE statement slow?”

  1. So was “reduce the small files” related to the bucketing on the source tables, since the merge only really needs ACID enabled on the target tables? I’m just starting to play with the SQL MERGE, with millions and billions of rows in the target tables. So far the source tables are external text tables, and I am wondering if I need the overhead of converting them to Avro, or if that will have a marginal impact.

    Good article, thanks for sharing the experiences.

    • If you read your external tables once to merge them into a read-optimised table (stored as ORC, probably), I doubt that the Avro step is needed. Hive will need to read the tables once and will store them in its internal representation before processing anyway, so converting to Avro (a row format as well) and then merging would, in my opinion, be a waste of resources.
      Of course, the ideal option would be to do some benchmarking, but I’m pretty sure of the results.
