Find a timezone offset in pure SQL in hive

Timezones are a pain. This is not new and every time you deviate from UTC this will bite you. That said sometimes you have to deviate from UTC, specially for the final display of a date if you want to show it in the local timezone from the reader. In that case adding an offset to be explicit will save some questions and uncertainty down the line.

There is no function get_offset_from_tz() in Hive, sadly. Using reflect() does not work either as the method call is to complex for reflect. Writing a UDF would be possible but feels overkill.

The solution I give here works in Hive and should probably work in all SQL variants as well apart from the variables.

The algorithm to find the offset is easy:

  • get the time in utc,
  • get the same in another timezone,
  • substract one from the other to get the offset,
  • format the offset in a standard way.

The main issue is that you cannot assign results to variables in SQL, meaning that many computations need to be duplicated. They will be optimised away, of course, but they make for an ugly code.

In hive, luckily, you can use variables. They cannot store results but are used as-is, a bit like macros, where the variable name is just replaced by its content which can be some piece of code.

This sets up the date to find the offset for as well as a few TZ for test.

-- Date to display. If you use this from a table you can
-- put here the column that would be used, eg. t.logdate.
set hivevar:D='2018-06-01 01:02:02';

-- A few tests:
-- positive offset +02:00 (in summer)
set hivevar:DISPLAY_TZ='Europe/Amsterdam';

-- negative offset -04:00 (in summer)
set hivevar:DISPLAY_TZ='America/New_York';

-- 0 offset
set hivevar:DISPLAY_TZ='UTC';

-- Non integer offset: +09:30
set hivevar:DISPLAY_TZ='Australia/Adelaide';

Those are the macros


-- Date displayed in the right TZ
set hivevar:dateintz=DATE_FORMAT(FROM_UTC_TIMESTAMP(${D}, ${DISPLAY_TZ}),"yyyy-MM-dd HH:mm:ss");
-- Offset in interval type
set hivevar:delta=cast(${dateintz} as timestamp) - cast(${D} as timestamp);

And the code itself, tiny and readable once variables are used:

select
  concat(
    -- date in TZ
    ${dateintz}

    -- sign
    , if(${delta} < interval '0' minute, '-', '+')

    -- hour
    , lpad(abs(hour(${delta})), 2, 0)

    , ':'

    -- minute
    ,lpad(minute(${delta}), 2, 0)
) as dtwithoffset
;

et voila

Advertisements

Hive, map, NULL and NPE

Checking for null values in a map column in Hive (1.2.1, Hortonworks) interestingly returns a null pointer exception:

create table npe (m map<bigint, bigint>);
select count(*) from npe where m is null;
Error: Error while compiling statement: FAILED: NullPointerException null (state=42000,code=40000)

The error happens at parsing time, when Hive tries to estimate data size. From hiveserver2.log:

Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.stats.StatsUtils.getSizeOfMap(StatsUtils.java:1096)

Interestingly, getting not null is fine:

select count(*) from npe where m is not null; -- returns 0

If you think like me, you will think ‘haha! not not null should work!’

select count(*) from npe where not m is not null; -- does not work

If you are smarter than me, you will have guessed before trying that Hive optimises the double negation away, and gives another NPE.

But going in this direction, we can still trick Hive by casting the boolean to int:

select count(*) from npe where int(m is not null)=0; -- works

This happens either without data either when there are real NULL in the table. By real NULL I mean that a SELECT would show NULL, which  happens only in the case where you add a column to an existing table. Indeed, you cannot yourself insert NULL into a complex column:

with a as (select null) insert into npe select * from a;
Error: Error while compiling statement: FAILED: SemanticException [Error 10044]: Line 1:36 Cannot insert into target table because column number/types are different 'npe': Cannot convert column 0 from void to map. (state=42000,code=10044)

You have to create an empty map object:

with a as (select map(cast(null as bigint), cast(null as bigint))) insert into npe select * from a;

Then of course the empty map object is not (a real) NULL and if you want to look for null you have to fudge a bit, looking at the size of the map for instance:

select m, size(m), isnull(m) from npe;
+----+-----+--------+--+
| m  | _c1 | _c2    |
+----+-----+--------+--+
| {} |  0  | false  |
+----+-----+--------+--+