Easy test data with Hive

Testing a query on a small dataset, especially if you need to carefully check your joins is usually made by creating a few temporary tables with hand-crafted data. This is a true and tested method, but it has a few disadvantages:

  • Requires some work if you need to change your data,
  • If the table is not temporary you need to not forget to drop it,
  • If your table is temporary it needs to be recreated after a reconnection,
  • If you don’t save the initialisation statements your test data is gone,
  • Especially with Hive, handling small tables has a lot of overhead.

It all works, but there is a nicer alternative: CTE + UDTF. Expanded, it means Common Table Expression with User Defined Table-generating Function.

Without further ado, here is an example, with the usual employees and departement:

with employee as(
  select inline(array(
      struct('Alice', '2017-03-04', 1)
    , struct('Bob', '2017-04-12', 1)
    , struct('Carol', '2018-12-24', 2)
  ))  as (name, start_date, dpt_id)
)
, department as (
  select inline(array(
      struct('IT', 1)
    , struct('Finance', 2)
  ))  as (name, id)
)
select
    e.name
  , e.start_date
  , d.name
from
  employee e
join
  department d
on
  e.dpt_id=d.id
;

And the result:

+---------+---------------+----------+
| e.name  | e.start_date  |  d.name  |
+---------+---------------+----------+
| Alice   | 2017-03-04    | IT       |
| Bob     | 2017-04-12    | IT       |
| Carol   | 2018-12-24    | Finance  |
+---------+---------------+----------+

So, what do we have here?

I define 2 common table expressions (with .. as () statement), which is a sort of run-time table. They can be used in any following CTE or queries. This table is defined by just giving the data we want in it (surrounded by inline(array(…)) as). Changing, adding, removing data is thus trivial and all is nicely bundled in one place.

Another nice thing is that these CTEs actually shadow real tables with the same name. This means that once you’re done testing, you just comment out the CTE definitions and the query will run with real data. This has the added benefit that you can always keep your test data with your actual query. You just need to uncomment the CTEs to use them.

Many other RDBMs (Mysql, Postgres, Oracle…) have CTEs. The UDTF (inline function) is less common, unfortunately.

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s