Data is usually recoverable from a disk showing such symptoms.
To explain the freeze on read, we need to look at what happens at each level:
At the drive level
The most likely cause of such a freeze is the drive having trouble reading data from the disk surface. A modern drive is programmed to try to recover the data from problematic sectors and not to give up easily.
Here is an example of an error recovery procedure:
If the drive detects an error which cannot be corrected on-the-fly, it retries data field read operations in the following sequence. The default retry algorithm repeats eight times, for a total of 128 retries, or until the data is recovered. This takes time.
1. Initial read
2. First retry
3. Read retry with data threshold offset +1
4. Read retry with data threshold offset -1
5. Read retry with data window offset +1
6. Read retry with data window offset -1
7. Write Splash
8. Read retry with data threshold offset +2
9. Read retry with data threshold offset -2
10. Read retry with data window offset +2
11. Read retry with data window offset -2
12. Normal read retry
13. Read retry with servo offset +8%
14. Read retry with servo offset -8%
15. Normal read retry
16. Software (2-burst) EDAC correction attempt
source: http://www.hddoracle.com/viewtopic.php?f=18&t=1133
If the drive is successful in reading the sector data at some point, it may decide to reallocate the sector. This also takes time.
At the OS level
Once the drive has gone through the above error recovery, and assuming it was unsuccessful, Windows will retry reading the data up to 9 times. This means the above process (drive-level recovery) will repeat itself 9 times.
source: https://www.deepspar.com/blog/Software-Recovery-Attempts-2.html
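To get a feel for why a single bad sector can freeze a system for many seconds, we can multiply the figures quoted above. The per-attempt latency used here is an assumption for illustration only (roughly one revolution of a 5400 RPM drive); real values vary per model:

```python
# Rough worst-case arithmetic for one bad sector, using the figures above.
# The ~11 ms per attempt is an illustrative assumption, not a measured value.
STEPS_PER_SEQUENCE = 16  # steps 1-16 in the retry list above
SEQUENCE_REPEATS = 8     # "repeats eight times"
OS_RETRIES = 9           # Windows-level retries

drive_attempts = STEPS_PER_SEQUENCE * SEQUENCE_REPEATS   # 128 retries
total_attempts = drive_attempts * OS_RETRIES             # 1152 attempts

stall_seconds = total_attempts * 0.011                   # ~12.7 s
print(total_attempts, round(stall_seconds, 1))
```

Over a thousand internal read attempts for a single sector, before the OS even gives up.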
It can also happen that Windows decides it has been waiting too long and 'drops the drive' (it will no longer be displayed in Disk Management).
The symptoms described are most likely due to surface problems, and possibly firmware issues. Swapping the drive's PCB is useless in this case.
Best approach to data recovery
- If possible, reduce the number of abstraction layers. For example, if you're dealing with an external USB hard drive, it may be possible to take the drive out of its enclosure and attach it to a SATA port directly. The reason is that many USB bridges handle errors very poorly; the more directly you can talk to the drive, the better. If the drive is native USB, a data recovery specialist can convert the drive to SATA and hook it up to specialized hardware (for example a PC3000).
- Initially avoid problematic areas and copy the easy-to-get data first. A data recovery specialist will use specialized hardware imagers. These allow, for example, interrupting the drive-level recovery procedure described above, which may be desirable during a first pass. Such tools can also automatically detect that a drive has stopped responding and power-cycle it. HDDSuperClone is an open source tool that works on this principle using a multi-pass strategy, and it can even be configured to control a relay for automatic power-cycling. Another, less advanced tool is ddrescue. While you may be tempted to approach the drive at the file system level (copy file by file), this may cause additional stress to the drive, which is to be avoided when dealing with a potentially unstable drive. Try to make every read count.
- Extract the data from the disk image or clone produced by HDDSuperClone or ddrescue. Depending on the number of bad sectors and the damage to file system structures, you may need a file recovery tool to virtually reconstruct the file system.
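The first-pass idea described above (read with a timeout, skip ahead on trouble, log skipped ranges for later passes) can be sketched as follows. This is a simplified illustration, not HDDSuperClone's or ddrescue's actual implementation; `read_sector` is a hypothetical stand-in for real device I/O:

```python
# Simplified first-pass imaging sketch: read each sector with a timeout;
# on a slow or failed read, skip ahead and record the gap for a later pass.
# read_sector(sector) -> (data or None, elapsed_ms) is a hypothetical
# stand-in for real device I/O.

def first_pass(read_sector, total_sectors, skip=1000, timeout_ms=200):
    image = {}   # sector -> data for everything read cleanly
    todo = []    # (start, end) ranges skipped, to revisit in later passes
    sector = 0
    while sector < total_sectors:
        data, elapsed_ms = read_sector(sector)
        if data is not None and elapsed_ms <= timeout_ms:
            image[sector] = data
            sector += 1
        else:
            # Problem area: don't hammer the drive. Skip ahead and
            # remember the range so a next pass can come back to it.
            end = min(sector + skip, total_sectors)
            todo.append((sector, end))
            sector = end
    return image, todo
```

On a simulated drive with a bad cluster at sectors 500-519, a call like `first_pass(fake_read, 2000, skip=100)` would copy everything outside the trouble area and log `(500, 600)` for a later pass, rather than stalling on each bad sector.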
In addition to the above, a data recovery specialist can manipulate the drive's firmware: it may, for example, be desirable to disable the drive's automatic reallocation algorithms.
The general idea behind this approach to data recovery
A data recovery specialist will assume the worst, and also that from this point on the drive will further deteriorate. In other words, we have a limited number of reads left, so we have to make each read count.
All the error recovery steps described above can be regarded as reads, while it is uncertain whether the drive will eventually be able to read the data from the troublesome sector. So the fundamental idea is to read the good data first while avoiding potentially unproductive reads as much as possible. This is one of the reasons I advise against a tool like SpinRite: it will try up to 1000 times to read a bad sector. That's 1000 times the drive's internal error recovery procedure!
Since a drive is largely a black box, even if we have access to advanced data recovery hardware like AceLab's PC3000 or DeepSpar's tools, we have to rely on actual drive behavior we can observe and measure.
For example, a drive reports various states such as BUSY or READY, and it will tell us if an error occurred during the last command we sent. So if we read a sector (in practice we hardly ever read just one sector, but let's keep things simple) we can observe:
- The time it took for the drive to respond with either data or an error
- In case of an error, the type of error the drive reveals
If the drive stops responding, it may be doing so on purpose, or its firmware may simply have 'crashed'.
- If on purpose, the drive may set the device fault bit
- If it stops responding and the fault bit wasn't set, we know the firmware crashed
On other occasions the drive may take a long time to respond and yet give us the data without reporting anomalies:
- Drive stays BUSY for long time
- But eventually delivers data without reporting errors
In this case it's likely the firmware deems it has more important things to handle than giving us the data from the sector we're trying to read. An example of this is the so-called 'WD slow response bug', where we read a bad sector at some point and the drive keeps trying to reallocate it even when we want it to do other things.
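The observations above can be condensed into a small decision function. This is only an illustration of the reasoning, not real ATA status-register handling; all names and return strings are made up:

```python
def classify_drive_response(responded, error, fault_bit, elapsed_ms,
                            timeout_ms=200):
    """Toy classifier for the drive behaviours described above."""
    if not responded:
        # No response at all: deliberate if the device fault bit was set,
        # otherwise the firmware most likely crashed.
        return "deliberate fault" if fault_bit else "firmware crashed"
    if error:
        return "read error"   # the drive may also reveal the error type
    if elapsed_ms > timeout_ms:
        return "slow but ok"  # e.g. the 'WD slow response bug'
    return "ok"
```

A specialized imager makes essentially this decision after every read, and picks its next action (continue, skip, reset, or power-cycle) accordingly.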
Now, for argument's sake, I'll assume we do not have tools to modify the firmware to, for example, make it stop trying to reallocate sectors.
Then what can we do to prevent the drive from doing excessive re-reads, reallocations and the like, which all nibble away at our precious remaining reads? The answer is resets and power-cycles. Power-cycles are only a last resort, as they themselves 'stress' the drive (in the case of mechanical drives).
- Since 'bad areas' or bad sectors often occur in clusters (for example because a head crashed into the drive's surface), we skip a number n of sectors after each unreadable sector, and we keep track of the unreadable and skipped sectors so we can address them during later passes. So in pass 1 we try to avoid any problem areas.
- To prevent the drive from excessive retries, we reset the drive after m ms, skip n sectors and try again. To determine a good timeout value, we examine how long it takes on average to read a good sector from the drive. For example, in the case of this (flash) drive I took the time it takes to read a good sector, added some margin, and settled on a read timeout of 200 ms.
- If the drive fails to respond to resets, we power-cycle the drive
- If we have determined that the drive's firmware, rather than the sectors we're trying to read, is the issue, we can attempt a reset / power-cycle and immediately try reading the same sector again. Often this way we can read a stretch of sectors before the firmware gets preoccupied with its background tasks again. Here's an example of that:
Of course, during imaging we won't wait 40 seconds for the drive to disconnect; we send our reset / power-cycle much quicker.
With a multi-pass strategy we can retry skipped sectors during the next pass, and after that run a pass that tries to recover data from confirmed bad sectors (those that resulted in an error).
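Deriving the read timeout from measured good-sector reads, as described above, can be sketched like this. The sample times and the margin factor are made up for illustration; the point is simply "typical good read time plus margin":

```python
def choose_timeout_ms(good_read_times_ms, margin_factor=4, floor_ms=50):
    # Take the slowest observed good-sector read and apply a safety margin,
    # never going below a sane floor. All parameters are illustrative.
    slowest = max(good_read_times_ms)
    return max(int(slowest * margin_factor), floor_ms)

# Hypothetical drive where good sectors read in ~40-50 ms:
print(choose_timeout_ms([42, 45, 50, 48]))  # 200
```

Any read exceeding this timeout is then treated as a problem area: reset, skip ahead, and move on.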
Don't!
- run chkdsk
- run disk repair tools such as SpinRite or HDD Regenerator
- freeze the drive