Checking for null values in a map column in Hive (1.2.1, Hortonworks) interestingly returns a null pointer exception:
create table npe (m map&lt;bigint, bigint&gt;);
select count(*) from npe where m is null;
Error: Error while compiling statement: FAILED: NullPointerException null (state=42000,code=40000)
The error is thrown at query compilation time, when Hive tries to estimate the data size of the column. From hiveserver2.log:
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.stats.StatsUtils.getSizeOfMap(StatsUtils.java:1096)
Checking for not null, on the other hand, is fine:
select count(*) from npe where m is not null; -- returns 0
If you think like me, you will think ‘haha! not not null should work!’
select count(*) from npe where not m is not null; -- does not work
If you are smarter than me, you will have guessed before trying that Hive optimises the double negation away, and gives another NPE.
But going in this direction, we can still trick Hive by casting the boolean to int:
select count(*) from npe where int(m is not null)=0; -- works
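Presumably anything that hides the boolean from the optimiser will do; the explicit cast syntax, of which int() is just the shorthand, should behave the same (an assumption on my part, I only tested the int() form):

select count(*) from npe where cast(m is not null as int)=0; -- should work as well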
This happens both with no data at all and when there are real NULLs in the table. By real NULL I mean a value that a SELECT would display as NULL, which only happens when you add a column to an existing table. Indeed, you cannot insert a NULL into a complex column yourself:
with a as (select null) insert into npe select * from a;
Error: Error while compiling statement: FAILED: SemanticException [Error 10044]: Line 1:36 Cannot insert into target table because column number/types are different 'npe': Cannot convert column 0 from void to map&lt;bigint,bigint&gt;. (state=42000,code=10044)
You have to create an empty map object:
with a as (select map(cast(null as bigint), cast(null as bigint))) insert into npe select * from a;
Then, of course, the empty map object is not a real NULL, and if you want to look for nulls you have to fudge a bit, for instance by looking at the size of the map:
select m, size(m), isnull(m) from npe;
+----+-----+--------+--+
| m  | _c1 |  _c2   |
+----+-----+--------+--+
| {} | 0   | false  |
+----+-----+--------+--+
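Handily, size() is documented to return -1 when its input is NULL, so a single query can tell the real NULLs and the empty maps apart (a sketch against the npe table above):

select count(if(size(m) = -1, 1, null)) as real_nulls, count(if(size(m) = 0, 1, null)) as empty_maps from npe; -- count() skips the NULLs produced by if()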