Hive self merge

I had a (ORC) table with duplicated rows, which I wanted to remove. The query is quite simple:

merge into click as dst using (
    select
      -- For all unique clicks...
        client_name
      , contact_id
      , ts_utc
      -- ... find the duplicates (cnt>1 in having) ...
      , count(*) cnt
      -- ... remember the first one loaded ...
      , min(load_ts) as first_load
      from
        click
      group by
        1, 2, 3
      having cnt > 1
)
as src
-- ... once the first occurrence of the duplicates
-- is found find all the duplicates ...
on
        dst.client_name=src.client_name
    and dst.contact_id=src.contact_id
    and dst.ts_utc=src.ts_utc
-- ... and if it is not the first one loaded ...
when matched and src.first_load != dst.load_ts
-- .. delete it.
then delete
;

Trivial, right? Well it looks like you cannot do such a ‘self merge’ in hive. I ended up with this error:

java.lang.InterruptedException
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
 at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
 at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
 at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
 [...]

The solution, once understood that a self merge is not allowed, is of course obvious: use a temporary table. Splitting my merge statement in 2 did the trick.

create temporary table clickdups stored as orc as select
        client_name
      , contact_id
      , ts_utc
      , count(*) cnt
      , min(load_ts) as first_load
      from
        click
      group by
        1, 2, 3
      having cnt > 1
;

merge into click as dst using clickdups
as src
on
        dst.client_name=src.client_name
    and dst.contact_id=src.contact_id
    and dst.ts_utc=src.ts_utc
when matched and src.first_load != dst.load_ts
then delete
;

On a side note I needed to tweak a lot the self-merge to prevent out of memory error. Those did not happen at all using the 2 steps solution.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s