Permanently remove a Kudu data disk

I needed to permanently remove a data disk from Kudu. In my case, this disk had way too many IOs and I needed to have Kudu not writing to it anymore. This post explains how to do this, safely.

Sanity checks

First, you need to make sure that there are no tables with replication factor 1. If by bad luck some tablet of this table are on the disk you will remove, then the table would become unavailable. Note that the user running this command must be in the superuser_acl  list of Kudu (replace of course ${kudu_master_host} with the real hostname).

kudu cluster ksck ${kudu_master_host} | grep '| 1 |' | cut -f2 ' '

If there are tables there, you need to

  • either DROP them
  • either recreate them with a higher replication factor. You cannot change the replication factor of an existing table.

Technically, there are other options, but they are trickier:

  • I could kudu tablet change_config move_replica tablets for all tables with RF 1 from eg. server 1 to server 2, then remove the directory for server 1, rebalance, then rinse and repeat from server 2 to 3 and so on. Note that you can only move tablet between servers, not disks, so if can take a while if you have many servers.
  • I could move the data directory from one disk to another disk as not whole disks are used by Kudu but only subdirectories. As all other disks already had Kudu data directories in my case, this would have meant that a disk would receive twice as many IOs.

Start a rebalance. After this the data will be properly spread, and more importantly we know that rebalance can happen.

kudu cluster rebalance ${kudu_master_host}

Stop kudu.

Remove a disk

Note: do this node per node! It should be possible to do 2 at a time, but I haven’t tested it. If you use Cloudera manager, you need to use config groups.

Remove the path to directory you want to remove from fs_data_dirs.
While kudu is still stopped, tell kudu on the tablet server which configuration you just changed, that there is now 1 less disk:

sudo -u kudu kudu fs update_dirs --force --fs_wal_dir=[your wal directory] --fs_data_dirs=[comma separated list of remaining directories]

Restart kudu. Data will be automatically rebalanced.

Congrats, go to your next node once all tablets are happy (kudu cluster ksck ${kudu_master_host} does not return any error).

Thousand separators in Hue

Hue is a very handy SQL assistant for Hadoop, where you can easily run Hive or Impala query.

I was asked if it was possible to have thousands separator in the display of the query results. There is no option in Hue, I thought I could get away with Django environment variables but either it’s not possible, or I got it wrong.

In any case, it did not feel great. You could argue that Hue is a display tool so it’s OK to format the output, but it would be for all users, and they might not all want that…

Long story short, I removed my Hadoop administrator cap and put on my dirty creative one. Once the results are loaded, it’s a trivial javascript manipulation to format them. Furthermore, Hue uses jquery which makes it even easier. So I came up with this little bookmarklet. Put it in the URL field of a bookmark, including the ‘javascript:’ prefix. If you want to format your output, just click on the bookmarklet et voilà:

    $('table.resultTable td').each(
        if (!isNaN($(this).text())) {
          $(this).text($(this).text().toString().replace(/(\d)(?=(\d{3})+(?!\d))/g, '$1,'))

Where exactly is this block on HDFS?

This post will show you how to find out where a specific hdfs block is: on which server and on which disk of this server.


I needed to decommission a directory from hdfs (updating This is not a big deal because the default replication factor is 3. Removing a disk would just trigger a rebalance.

Safety checks

Just for safety, I first wanted to check if all blocks were properly replicated. This is easy to check with the following command:

 hdfs fsck / -files -blocks -locations | grep repl=1 -B1

What does it do?

  • hdfs fsck /
    • run hdfs checks from the root
  •  -files -blocks -locations
    • Display file names, block names and location
  • | grep repl=1
    • show only blocks with replication 1
  • -B1
    • But please display the previous line as well to get the actual file name

If you’re good (all files are properly replicated) you would get an empty output. Otherwise, you get a bunch of those lines in the output:

/a/dir/a/file 2564 bytes, replicated: replication=1, 1 block(s): OK
0. BP-1438592571- len=2564 Live_repl=1 [DatanodeInfoWithStorage[,DS-f935a126-2226-4ef8-99a6-20d700f06110,DISK]]
/another/dir/another/file 2952 bytes, replicated: replication=1, 1 block(s): OK
0. BP-1438592571- len=2952 Live_repl=1 [DatanodeInfoWithStorage[,DS-1d065d48-f887-4ed5-be89-5e9c79633519,DISK]]

Technically, for me this was an error, which I could fix by forcing the replication to 3:

hdfs dfs -setrep 3 /a/dir/a/file

Where are my blocks?

In other words, are there unreplicated blocks on the disk I am about to remove?

There might be good reasons to have a replication factor of 1, and you then want to be sure that none of the blocks are on the disk you will remove.  How can you do that?

Looking at the output of the previous command, specifically the DatanodeInfoWithStorage bit, you can find out some interesting information already:,DS-1d065d48-f887-4ed5-be89-5e9c79633519,DISK
  • this is the server where the block is, 9866 is the default datanode port,
  • DISK: good, the data is stored on disk,
  • DS-1d065d48-f887-4ed5-be89-5e9c79633519: this looks like a disk ID. What does it mean?

Looking at the source on github does not help much: this is a string, named storageID. What now?

It turns out that this storage ID is in a text file on every directory listed in Look at one of those, you will find the file current/VERSION, which looks like:

#Tue Apr 07 13:49:10 CEST 2020

And there you are, there is the storageID, which matches what was displayed via the hdfs command.

This was the missing link to exactly know on which disk you block was.

Hive and integer overflows

I was playing around in Hive (3, from HDP) with big integers when I noticed something weird:

select 9223372036854775807 * 9223372036854775807;
-- 1

select 9223372036854775807 + 1;
-- -9223372036854775808

It turns out that Hive silently overflows integers. This comes from java, which does the same.

It’s lucky I noticed, it could have been very painful for me down the line. The workaround is to use big decimal. As the doc says:

Integral literals larger than BIGINT must be handled with Decimal(38,0). The Postfix BD is required.

For instance:

select 9223372036854775807BD * 9223372036854775807BD;
-- 85070591730234615847396907784232501249

select 9223372036854775807BD + 1;
-- 9223372036854775808

But it is still somewhat weird. Overflows with big decimals won’t error out but will return null:

select 9223372036854775807BD
  * 9223372036854775807BD
  * 9223372036854775807BD

Furthermore, if the precision is not 0 some behaviours are not consistent:

create temporary table dec_precision(
  d decimal(38, 18)
insert into dec_precision values
  , (98765432109876543210.12345)

select sum(d) from dec_precision;
-- NULL (but why?)
select sum(98765432109876543210.12345BD) from dec_precision;
-- 197530864219753086420.24690 (as expected)
select 98765432109876543210.12345BD + 98765432109876543210.12345BD;
-- 197530864219753086420.24690 (as expected)

Conversely, Mysql

select 9223372036854775807 * 9223372036854775807;
-- ERROR 1690 (22003): BIGINT value is out of range in '9223372036854775807 * 9223372036854775807'

or Postgres

select 2147483647 * 2147483647;
ERROR:  integer out of range

are a lot safer and friendlier in that regard.

Easy test data with Hive

Testing a query on a small dataset, especially if you need to carefully check your joins is usually made by creating a few temporary tables with hand-crafted data. This is a true and tested method, but it has a few disadvantages:

  • Requires some work if you need to change your data,
  • If the table is not temporary you need to not forget to drop it,
  • If your table is temporary it needs to be recreated after a reconnection,
  • If you don’t save the initialisation statements your test data is gone,
  • Especially with Hive, handling small tables has a lot of overhead.

It all works, but there is a nicer alternative: CTE + UDTF. Expanded, it means Common Table Expression with User Defined Table-generating Function.

Without further ado, here is an example, with the usual employees and departement:

with employee as(
  select inline(array(
      struct('Alice', '2017-03-04', 1)
    , struct('Bob', '2017-04-12', 1)
    , struct('Carol', '2018-12-24', 2)
  ))  as (name, start_date, dpt_id)
, department as (
  select inline(array(
      struct('IT', 1)
    , struct('Finance', 2)
  ))  as (name, id)
  , e.start_date
  employee e
  department d

And the result:

|  | e.start_date  |  |
| Alice   | 2017-03-04    | IT       |
| Bob     | 2017-04-12    | IT       |
| Carol   | 2018-12-24    | Finance  |

So, what do we have here?

I define 2 common table expressions (with .. as () statement), which is a sort of run-time table. They can be used in any following CTE or queries. This table is defined by just giving the data we want in it (surrounded by inline(array(…)) as). Changing, adding, removing data is thus trivial and all is nicely bundled in one place.

Another nice thing is that these CTEs actually shadow real tables with the same name. This means that once you’re done testing, you just comment out the CTE definitions and the query will run with real data. This has the added benefit that you can always keep your test data with your actual query. You just need to uncomment the CTEs to use them.

Many other RDBMs (Mysql, Postgres, Oracle…) have CTEs. The UDTF (inline function) is less common, unfortunately.

ATS server does not start

The newer versions of Hadoop, including HDP3, use HBase as the backend for the timeline service. You can either use an external HBase or have a system HBase running on Yarn (the default).

When using the system HBase, you could end up with the timeline server up and running, but with an alert (in Ambari) saying:

ATSv2 HBase Application The HBase application reported a ‘STARTED’ state. Check took 2.125s

The direct impact will be that Oozie jobs (among others) will take forever to run, as each step will wait for a timeout from the ATS (Application Timeline Server) before carrying on.

The solution I found to fix this is as follow:

    1. Check your yarn logs (/var/log/hadoop-yarn/yarn/ on hdp) for anything clear to spot, for instance, not enough yarn memory (and then fix it if relevant),
    2. Clean up hdfs ATS data as described on the HDP docs,
    3. Clean up zookeeper ATS data (the example here is for insecure clusters, you will probably have another znode for kerberised clusters): zookeeper-client rmr /atsv2-hbase-unsecure
    4. Restart *all* YARN services,
    5. Restart ambari server (we had a case where it looked like the alert was wrongly cached).
    6. Restart all services on the host where the ATS server lives.

The steps cleaning hdfs and zookeeper will make you lose your ATS history (ie. job names, timing, logs…), but your actual data is perfectly safe, nothing else will be lost.

Reaching Hive from pyspark on HDP3

There is a lot to find about talking to hive from Spark on the net. Sadly most of it refers to Spark before version 2 or are not valid for hdp3. You need to use the Hive Warehouse Connector, bundled in HDP3.

This is an example of a minimalistic connection from pyspark to hive on hdp3.

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Yes, llap even if you do not use it.
from pyspark_llap import HiveWarehouseSession

settings = [

conf = SparkConf().setAppName("Pyspark and Hive!").setAll(settings)
# Spark 2: use SparkSession instead of SparkContext.
spark = (
    # There is no HiveContext anymore either.

# This is mandatory. Just using spark.sql will not be enough.
hive = HiveWarehouseSession.session(spark).build()

hive.execute("select 2 group by 1 order by 1").show()

You then can run this with the following command:

SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
--jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly- \
--py-files /usr/hdp/current/hive_warehouse_connector/ \


  • HDP_VERSION is needed when you use python 3. If not set, HDP uses a script (/usr/bin/hdp-select) which is python 2 only (although fixing it is trivial).
  • PYSPARK_PYTHON is optional, it will default to just python otherwise (which might or might not be python 3 on your server)
  • without HADOOP_USER_NAME the script will run as your current user. Alternatively, you could sudo first.
  • without SPARK_HOME some jars would not be found and you would end up with an error like py4j.protocol.Py4JJavaError: An error occurred while calling
    : java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
  • –jars and –py-files as you can see, there is the hdp version in file names. Make sure you are using the proper one for you.
  • there is no –master option, this is handled in the script while building the SparkSession.

There is some doc from Hortonworks you can follow to go further: Integrating Apache Hive with Spark and BI.

Just before I posted this article, a new write-up appeared on to describe some use cases for the Hive-Warehouse-Connector.

ORC benchmarking

I need to use transactional tables in Hive, so I naturally use the ORC format. One limitation of those transactional tables is that they have to be bucketed (in Hive 2 at least. In Hive3 my understanding is that bucketing is not mandatory, but the default is equivalent to having 1 bucket).

It is tricky to find a good answer on how to use buckets: how many? on which columns? It usually boils down to:

Buckets are something that should be done for a concrete problem not just because you think you should have them. Normally I would not use them.

Which is not helpful when you actually have to use them.

So how to use buckets?

I did some benchmarking to try to get a valid answer. My dataset was as follow:

  • Data:
    • 1B rows table worth 3 months of data, with a compound (logical) key: user id, and 4-level hierarchical identifier. This is about 15GB (unreplicated) on disk.
    • 9 merges of 1 hour of data
  • Benchmarks:
    • select count(*)
    • select count(*) where  hierarchy_level_1=something
    • select 1 specific row
    • big select with join on 2 other tables, resulting in 400k rows.

All benchmarks were run 5 times, doing one iteration for all tables before starting the next iteration to prevent caching.

I tested a few variations:

  • transactional table or not,
  • partitioned or not (partition was the highest level of my hierarchical key. About 200 hundred partitions, somewhat skewed),
  • 1 to 32 buckets,
  • bloom filters on levels #1 and/or #2  my hierarchical key and/or on the user id.
  • I added as extra test some non-transactional tables, sorted on different fields.

It is tricky to properly display the data here so I show it as an image (click to enlarge), but you can find it as a pdf: ORC Benchmark – results or see it on google sheets.


What gives?

I had some expectations:

  • More buckets would mean more files so faster load and faster queries up to a point.
  • Bucket + partitions would create way too many files, at the cost of performance.
  • Proper bloom filters would make everything faster.

Well, looking at the results, here are the takeaways:

  • Bloom filters have basically no impact at all,
  • no partition usually helps,
  • 1 bucket is horrible (but that I did expect),
  • too many buckets and partition together is bad,
  • the sweet spot (for me) is partition +  4 buckets.

I am very disappointed because I had a lot of hopes for bloom filters (I guess there is something I am doing wrong). The rest is more a confirmation of my expectations, with numbers to prove them.

The worst of all? My sweet spot is the one I am already using, so after all those tests (which ran almost 3 days straight) I have no new information to speed up my queries. Well, the silver lining is that at least I have data to show that I can pat myself on the back.

Cluster on which I tested: hdp 2.6, hive 2, 8 data nodes (8 CPU, 32GB) on AWS, doing nothing else apart from running these tests.

Oops, I dropped my Hive Metastore database

The Hive Metastore (HMS) is backed up by a database (MySQL for us), which stores information about HDFS files, stats and more. Without this database, you have no Hive.

A few days ago, a DROP has been issued on the wrong server. Bye Bye metastore. What do you do then?

First things first, I looked at the backups. We had one from 3 hours before, so still quite recent. After it was restored I noticed that Hive worked, but not all data was present visible.

For context, our tables are ORC and transactional. As HDFS files cannot be updated, the way for ORC (and other DBs I know of) to manage transactions is to have a base directory as well as some deltas holding new changes. Reading such a table thus means reading the base directory and all the deltas to apply potential updates. As reading the deltas can become expensive, they are eventually compacted, in 2 possible ways. Quite often the deltas are squashed together (minor compaction for ORC) and once in a while, the base directory is fully rewritten to apply all the deltas (major compaction for ORC).

In my case, between the backup and the restore some new data had been added (new delta directories appeared) but no compaction happened.

This means that the HMS was completely ignoring all the new delta files, but was happily using all the files it knew about. I should add that I have a way to replay data of the last hours/days if needed. Once this was understood and confirmed, the fix was easy:


  1. Run a major compaction on all tables
    • All tables ended up consisting of one base directory (used and known by HMS) and a few delta directories ignored by HMS.
  2. Delete all remaining deltas
    • They were not used anyway.
  3. Replay my data
    • Hive was up to date again.

Of course, my tables are partitioned and compaction happens per partition so there was a bit of bash-fu to explicitly compact all partitions, then double check that all remaining deltas were created after the backup. Nothing too complex.

This worked because no compactions ran between restore and backup. A compaction would completely change the files on disk, and the HMS metadata would thus be completely out of sync with the actual files. I have no idea how I would have recovered that one.

On a side note, Hive hanged a few times with this error message in the log:

” (“NL_ID”, “EVENT_ID”, “EVENT_TIME”, “EVENT_TYPE”, “DB_NAME”, “TBL_NAME”, “MESSAGE”, “MESSAGE_FORMAT”) values(774946,774869,1543829236,’OPEN_TXN’,’null’,’ ‘,'{“txnIds”:null,”timestamp”:1543829236,”fromTxnId”

metastore.RetryingHMSHandler ( – MetaException(message:Unable to execute direct SQL java.sql.SQLInte
grityConstraintViolationException: Duplicate entry ‘774946’ for key ‘PRIMARY’

The fix was quite easy. Log into the metastore DB, remove the offending row from NOTIFICATION_LOG, and update the sequence (NOTIFICICATION_LOG_SEQUENCE table) to the maximum value of NL_ID in NOTIFICATION_LOG + 1. The NOTIFCATION_LOG table has something to do with compactions, and as you compacted all tables anyway old values do not matter much.

What is the takeaway? Compactions are run dynamically by Hive. It would be smart to keep an eye on them, and when some happened to run a backup just after. At least you would still be able to partially recover consistent recent enough data.

Newlines in Hive

I have a nice Hive warehouse, running data exports for some clients. All was working smoothly when a coworker came to me because a client complained that there were NULLs in an export, whereas they should all be sanitised via a lot of coalesce.

The reason ended up being newlines which I thought were properly handled, but as it turned out were not in all cases. This post explains what the issues are and how to fix them.

This is an example of what happened (simplified query to show the problem). A json string with newlines, exploded with a lateral view. Newlines are escaped so all should have worked.

with things as (
        42 as id
      , '{   "item": "cheese"
           , "good": "camembert\\ncomté"
           , "tasty": "yes"
         }' as custom_fields
  , c.custom_fields
  , cv.good
  , cv.item
  , cv.tasty
    (select * from things order by id) c
lateral view
    json_tuple(c.custom_fields, "good", "item", "tasty") cv
    as good, item, tasty

If you run this code you might (depending on your setup) see a nice \n in custom_fields but an actual line breaks in good which is not what you want.

Note that if you replace the from clause by just from things c the issue will not appear.

The problem is that in my example an intermediary map step is executed (because of the order by). Data in this step was stored by default as textfile, where the row delimiter is, you can guess, \n.

With the simplified from clause, there is no intermediary map step and the problem does not happen.

What we would like is to have the intermediary data stored as some sort of binary format, where \n is not a separator. Luckily, there’s a setting for that: hive.query.result.fileformat. The default (up to Hive 2.0) was TextFile, which as you can guess will use newlines as row separator. From Hive 2.1 onward, the default became SequenceFile which is a binary format, compressible, which does not have the newlines issue.

You can either set this setting globally in hive-site, or add it before your query with set hive.query.result.fileformat=SequenceFile;.

With Hive 2, this fixes the newlines, but the \n completely disappear from the output. This is better than previously but not ideal.
With Hive 3, if you are using beeline, you can add the command line option --escapeCRLF=true which will give you exactly the output you want.