I had a (ORC) table with duplicated rows, which I wanted to remove. The query is quite simple:
merge into click as dst using ( select -- For all unique clicks... client_name , contact_id , ts_utc -- ... find the duplicates (cnt>1 in having) ... , count(*) cnt -- ... remember the first one loaded ... , min(load_ts) as first_load from click group by 1, 2, 3 having cnt > 1 ) as src -- ... once the first occurrence of the duplicates -- is found find all the duplicates ... on dst.client_name=src.client_name and dst.contact_id=src.contact_id and dst.ts_utc=src.ts_utc -- ... and if it is not the first one loaded ... when matched and src.first_load != dst.load_ts -- .. delete it. then delete ;
Trivial, right? Well it looks like you cannot do such a ‘self merge’ in hive. I ended up with this error:
java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120) at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90) at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116) [...]
The solution, once understood that a self merge is not allowed, is of course obvious: use a temporary table. Splitting my merge statement in 2 did the trick.
create temporary table clickdups stored as orc as select client_name , contact_id , ts_utc , count(*) cnt , min(load_ts) as first_load from click group by 1, 2, 3 having cnt > 1 ; merge into click as dst using clickdups as src on dst.client_name=src.client_name and dst.contact_id=src.contact_id and dst.ts_utc=src.ts_utc when matched and src.first_load != dst.load_ts then delete ;
On a side note I needed to tweak a lot the self-merge to prevent out of memory error. Those did not happen at all using the 2 steps solution.