Newlines in Hive

I have a nice Hive warehouse, running data exports for some clients. All was working smoothly when a coworker came to me because a client complained that there were NULLs in an export, whereas they should all be sanitised via a lot of coalesce.

The reason ended up being newlines which I thought were properly handled, but as it turned out were not in all cases. This post explains what the issues are and how to fix them.

This is an example of what happened (simplified query to show the problem). A json string with newlines, exploded with a lateral view. Newlines are escaped so all should have worked.

SET system:disable.quoting.for.sv=false;
with things as (
    select
        42 as id
      , '{   "item": "cheese"
           , "good": "camembert\\ncomté"
           , "tasty": "yes"
         }' as custom_fields
)
select
    c.id
  , c.custom_fields
  , cv.good
  , cv.item
  , cv.tasty
from
    (select * from things order by id) c
lateral view
    json_tuple(c.custom_fields, "good", "item", "tasty") cv
    as good, item, tasty
;

If you run this code you might (depending on your setup) see a nice \n in custom_fields but an actual line breaks in good which is not what you want.

Note that if you replace the from clause by just from things c the issue will not appear.

The problem is that in my example an intermediary map step is executed (because of the order by). Data in this step was stored by default as textfile, where the row delimiter is, you can guess, \n.

With the simplified from clause, there is no intermediary map step and the problem does not happen.

What we would like is to have the intermediary data stored as some sort of binary format, where \n is not a separator. Luckily, there’s a setting for that: hive.query.result.fileformat. The default (up to Hive 2.0) was TextFile, which as you can guess will use newlines as row separator. From Hive 2.1 onward, the default became SequenceFile which is a binary format, compressible, which does not have the newlines issue.

You can either set this setting globally in hive-site, or add it before your query with set hive.query.result.fileformat=SequenceFile;.

With Hive 2, this fixes the newlines, but the \n completely disappear from the output. This is better than previously but not ideal.
With Hive 3, if you are using beeline, you can add the command line option --escapeCRLF=true which will give you exactly the output you want.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s