I have a nice Hive warehouse, running data exports for some clients. All was working smoothly when a coworker came to me because a client complained that there were NULL
s in an export, whereas they should all be sanitised via a lot of coalesce
.
The reason ended up being newlines which I thought were properly handled, but as it turned out were not in all cases. This post explains what the issues are and how to fix them.
This is an example of what happened (simplified query to show the problem). A json string with newlines, exploded with a lateral view. Newlines are escaped so all should have worked.
SET system:disable.quoting.for.sv=false; with things as ( select 42 as id , '{ "item": "cheese" , "good": "camembert\\ncomté" , "tasty": "yes" }' as custom_fields ) select c.id , c.custom_fields , cv.good , cv.item , cv.tasty from (select * from things order by id) c lateral view json_tuple(c.custom_fields, "good", "item", "tasty") cv as good, item, tasty ;
If you run this code you might (depending on your setup) see a nice \n
in custom_fields
but an actual line breaks in good
which is not what you want.
Note that if you replace the from
clause by just from things c
the issue will not appear.
The problem is that in my example an intermediary map step is executed (because of the order by
). Data in this step was stored by default as textfile, where the row delimiter is, you can guess, \n
.
With the simplified from clause, there is no intermediary map step and the problem does not happen.
What we would like is to have the intermediary data stored as some sort of binary format, where \n
is not a separator. Luckily, there’s a setting for that: hive.query.result.fileformat. The default (up to Hive 2.0) was TextFile, which as you can guess will use newlines as row separator. From Hive 2.1 onward, the default became SequenceFile which is a binary format, compressible, which does not have the newlines issue.
You can either set this setting globally in hive-site, or add it before your query with set hive.query.result.fileformat=SequenceFile;
.
With Hive 2, this fixes the newlines, but the \n
completely disappear from the output. This is better than previously but not ideal.
With Hive 3, if you are using beeline, you can add the command line option --escapeCRLF=true
which will give you exactly the output you want.