This post shows how to find out where a specific HDFS block lives: on which server, and on which disk of that server.
I needed to remove a data directory from HDFS (by updating dfs.datanode.data.dir). This is not a big deal because the default replication factor is 3: removing a disk just causes the blocks it held to be re-replicated from the remaining copies.
Just for safety, I first wanted to check if all blocks were properly replicated. This is easy to check with the following command:
hdfs fsck / -files -blocks -locations | grep repl=1 -B1
What does it do?
hdfs fsck /
- run hdfs checks from the root
-files -blocks -locations
- Display file names, block names and location
| grep repl=1
- show only blocks with replication 1
-B1
- but also display the previous line, to get the actual file name
If all files are properly replicated, the output is empty. Otherwise, you get a bunch of lines like these:
/a/dir/a/file 2564 bytes, replicated: replication=1, 1 block(s): OK
0. BP-1438592571-10.88.112.28-1502096897275:blk_1077829561_4348908 len=2564 Live_repl=1 [DatanodeInfoWithStorage[10.1.2.3:9866,DS-f935a126-2226-4ef8-99a6-20d700f06110,DISK]]
--
/another/dir/another/file 2952 bytes, replicated: replication=1, 1 block(s): OK
0. BP-1438592571-10.88.112.28-1502096897275:blk_1077845856_4366930 len=2952 Live_repl=1 [DatanodeInfoWithStorage[10.2.3.4:9866,DS-1d065d48-f887-4ed5-be89-5e9c79633519,DISK]]
In my case this was a mistake, which I could fix by forcing the replication factor to 3:
hdfs dfs -setrep 3 /a/dir/a/file
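To avoid copying file names by hand, the paths can be pulled out of the fsck output. The following is a minimal sketch rather than a tested tool: the function name single_replica_files is made up for this example, and it assumes that each file line starts with the path followed by "<n> bytes," and that the block lines carry Live_repl=<n>, as in the output above.

```shell
# Hypothetical helper: read `hdfs fsck ... -files -blocks -locations` output
# on stdin and print the path of every file that has a block with Live_repl=1.
single_replica_files() {
  awk '
    # File lines start with "/": remember the path, dropping " <n> bytes, ...".
    /^\// { path = $0; sub(/ [0-9]+ bytes, .*$/, "", path) }
    # Block lines: match Live_repl=1 exactly (not Live_repl=10, 11, ...).
    /Live_repl=1([^0-9]|$)/ {
      if (path != last) { print path; last = path }  # one line per file
    }
  '
}
```

Piping the fsck command shown earlier (without the grep) into single_replica_files should then print one path per line, and each path can be passed to hdfs dfs -setrep.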
Where are my blocks?
In other words, are there unreplicated blocks on the disk I am about to remove?
There might be good reasons to have a replication factor of 1, and you then want to be sure that none of the blocks are on the disk you will remove. How can you do that?
Looking at the output of the previous command, specifically the DatanodeInfoWithStorage bit, you can find out some interesting information already:
- 10.2.3.4:9866: this is the server where the block is (9866 is the default datanode port),
- DISK: good, the data is stored on disk,
- DS-1d065d48-f887-4ed5-be89-5e9c79633519: this looks like a disk ID. What does it mean?
Looking at the source on GitHub does not help much: this is a string, named storageID. What now?
It turns out that this storage ID is stored in a text file in every directory listed in dfs.datanode.data.dir. Look at one of those directories and you will find the file current/VERSION, which looks like:
#Tue Apr 07 13:49:10 CEST 2020
And there you are, there is the storageID, which matches what was displayed via the hdfs command.
This was the missing link to know exactly which disk your block is on.
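The whole lookup can be sketched as two small shell functions. This is an illustration under assumptions, not a definitive tool: the names replica_locations and find_storage_dir are invented here, the VERSION file is assumed to contain a line of the form storageID=<id>, and the data directories are assumed to be passed as plain local paths (the value of dfs.datanode.data.dir may need converting first). It also relies on grep -o, which GNU and BSD grep provide.

```shell
# Hypothetical helper: read fsck output on stdin and print
# "host:port storageID storageType" for every replica it mentions.
replica_locations() {
  grep -o 'DatanodeInfoWithStorage\[[^]]*\]' |
    sed -e 's/^DatanodeInfoWithStorage\[//' -e 's/\]$//' -e 's/,/ /g'
}

# Hypothetical helper: print the first data directory whose current/VERSION
# holds the given storage ID. Run it on the datanode itself.
# Usage: find_storage_dir <storageID> <data dir>...
find_storage_dir() {
  fsd_id=$1
  shift
  for fsd_dir in "$@"; do
    # Assumption: VERSION contains a line that is exactly storageID=<id>.
    if grep -qxF -- "storageID=${fsd_id}" "${fsd_dir}/current/VERSION" 2>/dev/null; then
      printf '%s\n' "$fsd_dir"
      return 0
    fi
  done
  return 1  # no directory matched
}
```

Feed the fsck output to replica_locations to get the server and storage ID of each replica, then log in to that server and call find_storage_dir with the storage ID and its data directories. If the directory printed is the one you are about to remove, fix the replication first.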