
How does someone fix an HDFS filesystem that's corrupt? I looked on the Apache Hadoop website and it says to use the fsck command, which doesn't actually fix anything. Hopefully someone who has run into this problem before can tell me how to fix it.

Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally NameNode automatically corrects most of the recoverable failures.

When I ran bin/hadoop fsck / -delete, it listed the files that were corrupt or had missing blocks. How do I make the filesystem not corrupt? This is on a practice machine so I COULD blow everything away, but when we go live I won't be able to "fix" things by blowing everything away, so I'm trying to figure it out now.


4 Answers

You can use

  hdfs fsck /

to determine which files have problems. Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is really verbose, especially on a large HDFS filesystem, so I normally get down to the meaningful output with

  hdfs fsck / | egrep -v '^\.+$' | grep -v eplica

which ignores lines with nothing but dots and lines talking about replication.
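On a sample of fsck-style output, the filter behaves like this (the lines below are illustrative, not captured from a real cluster; the exact fsck format varies by Hadoop version):

```shell
# Illustrative fsck-style output; real "hdfs fsck /" lines vary by version.
sample='/user/app/part-00000: CORRUPT blockpool BP-1 block blk_1073741825
..........
/user/app/part-00001: MISSING 1 blocks of total size 134217728 B
/user/app/ok-file:  Under replicated BP-1:blk_1073741826_1002. Target Replicas is 3 but found 2 replica(s).'

# Drop the dot-progress lines and anything about replication; only the
# CORRUPT and MISSING lines survive.
echo "$sample" | egrep -v '^\.+$' | grep -v eplica
```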

Once you find a file that is corrupt

  hdfs fsck /path/to/corrupt/file -locations -blocks -files

Use that output to determine where blocks might live. If the file is larger than your block size it might have multiple blocks.
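If you just need the bare block IDs to search logs with, you can pull them out of that output; the sample line below mimics the -blocks output format, which varies across Hadoop versions:

```shell
# A line mimicking "hdfs fsck <file> -files -blocks -locations" output (illustrative).
fsck_line='0. BP-929597290-127.0.0.1-1381033612997:blk_1073741825_1001 len=134217728 repl=1 [127.0.0.1:50010]'

# Extract the block ID (the part before the generation-stamp suffix),
# which is what you would grep the namenode/datanode logs for.
echo "$fsck_line" | grep -o 'blk_[0-9]*' | head -1
# prints: blk_1073741825
```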

You can use the reported block numbers to go around to the datanode and namenode logs, searching for the machine or machines on which the blocks lived. Try looking for filesystem errors on those machines: missing mount points, a datanode not running, a filesystem reformatted or reprovisioned. If you can find a problem that way and bring the block back online, the file will be healthy again.
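To check whether a block replica still exists on a datanode's local disks, a find over the data directories works. The sketch below mocks up the directory layout because the real path comes from dfs.datanode.data.dir in hdfs-site.xml; on a real datanode you would point find at that configured directory instead of a temp dir:

```shell
# DATA_DIR stands in for dfs.datanode.data.dir (e.g. /hadoop/hdfs/data);
# here we build a mock layout imitating the real datanode directory tree.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/current/BP-929597290-127.0.0.1-1381033612997/current/finalized"
touch "$DATA_DIR/current/BP-929597290-127.0.0.1-1381033612997/current/finalized/blk_1073741825"

# The actual search: if this prints a path, the replica is still on disk.
find "$DATA_DIR" -name 'blk_1073741825*'
```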

Lather, rinse, and repeat until all files are healthy or you exhaust all alternatives looking for the blocks.

Once you determine what happened and you cannot recover any more blocks, just use the

  hdfs dfs -rm /path/to/file/with/permanently/missing/blocks

command to get your HDFS filesystem back to healthy so you can start tracking new errors as they occur.

  • Thx for your reply. I'll try your suggestion the next time HDFS has issues. Somehow it fixed itself when I ran bin/hadoop fsck / -delete. After that, HDFS was no longer corrupted and some files ended up in /lost+found. It didn't do that before, when I stopped HDFS and restarted several times. I upvoted and accepted your answer =) Thx again.
    – Classified
    Commented Oct 14, 2013 at 20:19
  • But if a file is replicated 3 times in the cluster, can't I just get it back from another node? I know I had some data loss on one machine, but isn't the whole point of HDFS that this shouldn't matter?
    Commented Aug 5, 2014 at 19:44
  • Having had a problem with only one node (it crashed and lost some of its files), the easiest solution was the one suggested by @Classified: simply execute hadoop fsck / -delete
    – sofia
    Commented Jul 21, 2016 at 8:43
  • Wouldn't deleting the missing blocks cause data loss? hdfs fs -rm /path/to/file/with/permanently/missing/blocks @mobileAgent
    Commented Apr 18, 2018 at 22:13
  • Applications often write intermediate data that is temporary, can easily be re-generated on failure, and is therefore stored with a replication factor of 1. If these applications crash for any reason and do not clean up, they will leave this data behind. If at some point in the future the DataNode holding the single replica crashes, you will see corrupt blocks. This happens every so often and isn't a big deal; the data can safely be removed to restore the health of the cluster.
    – davidemm
    Commented Jul 2, 2020 at 20:25

If you just want to get your HDFS back to a normal state and don't worry much about the data:

This will list the files with corrupt HDFS blocks:

hdfs fsck / -list-corruptfileblocks

This will delete the corrupted files:

hdfs fsck / -delete

Note that you might have to run these commands as the HDFS superuser (typically by prefixing them with sudo -u hdfs) if the account you are logged in as doesn't have HDFS superuser privileges.


The solution here worked for me: https://community.hortonworks.com/articles/4427/fix-under-replicated-blocks-in-hdfs-manually.html

su - <$hdfs_user>

hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files

for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :"; hadoop fs -setrep 3 $hdfsfile; done
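The grep/awk extraction step can be sanity-checked offline on a sample line (the fsck output format below is illustrative; the hadoop fs -setrep call itself of course needs a live cluster):

```shell
# An under-replicated line, mimicking "hdfs fsck /" output (illustrative).
line='/user/app/part-00000:  Under replicated BP-1:blk_1073741826_1002. Target Replicas is 3 but found 2 replica(s).'

# Keep only the HDFS path before the first colon; this is the value that
# would then be fed to "hadoop fs -setrep 3 <path>".
echo "$line" | grep 'Under replicated' | awk -F':' '{print $1}'
# prints: /user/app/part-00000
```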
  • I also had to fail over my primary namenode before I ran the above commands, because it had entered safe mode. The failover made the standby node become active, and I could then run the above commands and get rid of the corrupt blocks :)
    – abc123
    Commented Jul 19, 2018 at 21:42

Start all daemons and run the command "hadoop namenode -recover -force", then stop the daemons and start them again. Wait some time for the data to be recovered.
