I had a (ORC) table with duplicated rows, which I wanted to remove. The query is quite simple:
merge into click as dst using (
select
-- For all unique clicks...
client_name
, contact_id
, ts_utc
-- ... find the duplicates (cnt>1 in having) ...
, count(*) cnt
-- ... remember the first one loaded ...
, min(load_ts) as first_load
from
click
group by
1, 2, 3
having cnt > 1
)
as src
-- ... once the first occurrence of the duplicates
-- is found find all the duplicates ...
on
dst.client_name=src.client_name
and dst.contact_id=src.contact_id
and dst.ts_utc=src.ts_utc
-- ... and if it is not the first one loaded ...
when matched and src.first_load != dst.load_ts
-- .. delete it.
then delete
;
Trivial, right? Well it looks like you cannot do such a ‘self merge’ in hive. I ended up with this error:
java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120) at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90) at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116) [...]
The solution, once understood that a self merge is not allowed, is of course obvious: use a temporary table. Splitting my merge statement in 2 did the trick.
create temporary table clickdups stored as orc as select
client_name
, contact_id
, ts_utc
, count(*) cnt
, min(load_ts) as first_load
from
click
group by
1, 2, 3
having cnt > 1
;
merge into click as dst using clickdups
as src
on
dst.client_name=src.client_name
and dst.contact_id=src.contact_id
and dst.ts_utc=src.ts_utc
when matched and src.first_load != dst.load_ts
then delete
;
On a side note I needed to tweak a lot the self-merge to prevent out of memory error. Those did not happen at all using the 2 steps solution.