
This was an unusual failure that did not quite match the typical cases. It started when a simple folder rename on the disk completely froze the responsible process, and afterwards the symptoms were stable:

  • When the HDD was connected directly via SATA:
    • BIOS delayed boot for about 10 seconds and the disk was no longer visible, i.e. its existence was not recognized.

Since this was originally an external HDD connected via USB, I had several options for connecting it:
1) directly via SATA
2) the standard intended way: SATA -> internal case adapter to Micro-B USB 3.0 -> USB 3.0
3) with the case removed, a direct SATA -> USB 3.0 adapter (seems to behave the same as 2)

  • When connected via USB, i.e. 2) or 3):
    • if connected during boot, the PC freezes on boot (until the disk is unplugged);
      if connected during shutdown (only after a hang state?), the PC does not fully turn off (until the disk is unplugged)
    • if connected the standard way, i.e. as an external USB disk on a running Windows:
      Windows shows a "disk needs repair" notification and the NTFS partitions are recognized.
      But chkdsk freezes almost immediately, and programs that access the disk can sometimes see the root folders but freeze when reading their contents.

After the first freeze, the case indicator shows that the disk becomes completely unresponsive until it is unplugged and plugged in again.
All freezes are recoverable by unplugging and do not hang the PC completely, only the process involved.
The disk defragmenter seemingly showed 100% fragmentation at one point before the failure, but that may just be a defrag bug.
The failure was stable and reproducible on other PCs.

What is the optimal recovery path?
In my specific case the spin-down solution (posted below) made recovery possible, but what is the underlying cause of such a freeze on read?
For example, could the next recovery step be moving the disk to another case or replacing the controller board?

2 Answers

2

Disk data is usually recoverable with such symptoms.
To explain the freeze on read we need to see what happens:

At drive level

The most likely cause of such a freeze is the drive having trouble reading data from the disk surface. A modern drive is programmed to try to recover the data from problematic sectors and not to give up easily.

Here is an example of an error recovery procedure:

The drive retries data field read operations in the following sequence if it detects an error which cannot be corrected on-the-fly. The default retry algorithm repeats eight times for a total of 128 retries or until the data is recovered. - This takes time.

1. Initial read
2. First retry
3. Read retry with data threshold offset +1
4. Read retry with data threshold offset -1
5. Read retry with data window offset +1
6. Read retry with data window offset -1
7. Write Spash
8. Read retry with data threshold offset +2
9. Read retry with data threshold offset -2
10. Read retry with data window offset +2
11. Read retry with data window offset -2
12. Normal read retry
13. Read retry with servo offset +8%
14. Read retry with servo offset -8%
15. Normal read retry
16. Software (2-burst) EDAC correction attempt

source: http://www.hddoracle.com/viewtopic.php?f=18&t=1133

If the drive is successful in reading the sector data at some point, it may decide to reallocate the sector. - This takes time.

At OS level

Once the drive has gone through the above error recovery, and assuming it was unsuccessful, Windows will retry reading the data as many as 9 times. This means the above process (drive level recovery) will repeat itself 9 times.

source: https://www.deepspar.com/blog/Software-Recovery-Attempts-2.html
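To get a feel for how these two retry layers multiply, here is a minimal Python sketch of the arithmetic. It only uses the figures quoted above (16 steps repeated eight times at drive level, up to 9 retries by Windows); the function and constant names are made up for illustration, and real drives and OS versions may use different counts.

```python
# Worst-case read attempts for ONE unreadable sector, using only the
# figures quoted in this answer. Names are illustrative assumptions.
DRIVE_STEPS = 16    # steps in the quoted drive-level retry sequence
DRIVE_REPEATS = 8   # "repeats eight times" -> 16 * 8 = 128 retries
OS_RETRIES = 9      # Windows retries, per the DeepSpar article


def worst_case_attempts(os_retries=OS_RETRIES):
    """Drive-level attempts multiplied by the number of OS-level retries."""
    drive_attempts = DRIVE_STEPS * DRIVE_REPEATS
    return drive_attempts * os_retries


print(worst_case_attempts())  # 1152
```

So a single bad sector can cost on the order of a thousand physical read attempts before the application even sees an error, which is why the process appears frozen.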

It can also happen that Windows decides it's waiting too long and decides to 'drop the drive' (it will no longer be displayed in Disk Management).

The symptoms described are most likely due to surface problems and possibly firmware. Swapping the drive's PCB is useless in this case.

Best approach data recovery

  • If possible, reduce the number of abstraction layers. For example, if you're dealing with an external USB hard drive, it may be possible to take the drive out of its enclosure and attach it to a SATA port directly, because many USB bridges handle errors very poorly. The more directly you can talk to the drive, the better. If the drive is native USB, a data recovery specialist can convert the drive to SATA and hook it up to specialized hardware (for example PC3000).
  • Initially avoid problematic areas and copy the easy-to-get data first. A data recovery specialist will use specialized hardware imagers. These allow, for example, interrupting the drive level recovery procedure described above, which may be desirable during a first pass. Such tools can also automatically detect that a drive has stopped responding and power-cycle it. HDDSuperClone is an open source tool that works on this principle using a multi-pass strategy, and it can even be configured to control a relay for automatic power-cycling. Another, less advanced tool is ddrescue. While you may be tempted to approach the drive at the file system level (copying file by file), this may cause additional stress to the drive, which is to be avoided when dealing with a potentially unstable drive. Try to make every read count.
  • Extract data from the disk image or clone produced by HDDSuperClone or ddrescue. Depending on the number of bad sectors and the damage to file system data, you may need to use a file recovery tool to virtually reconstruct the file system.

In addition to the above, a data recovery specialist can manipulate the drive's firmware. For example, it may be desirable to disable the drive's automatic reallocation algorithms.

The general idea behind best approach data recovery

A data recovery specialist will assume the worst, and also that from this point on the drive will deteriorate further. In other words, we have a limited number of reads left, so we have to make each read count.

All the error recovery steps described above can be regarded as reads, while it is uncertain whether the drive will eventually be able to read data from the troubled sector. This is fundamental to the idea of reading good data first while avoiding potentially non-productive reads as much as possible. It is one of the reasons I advise against a tool like SpinRite: it will try 1000 times to read a bad sector. That's 1000 * the drive's internal error recovery procedure!

Since a drive is largely a black box, even if we have access to advanced data recovery hardware like AceLab's PC3000 or DeepSpar's tools, we have to rely on the actual drive behavior we can observe and measure.

For example, a drive reports various states such as BUSY or READY, and it will tell us if an error occurred during the last command we sent. So if we read a sector (we hardly ever read just one sector, but to keep things simple) we can observe:

  • Time it took for the drive to respond with either data or an error
  • In case we get an error the drive may reveal what type of error

If the drive stops responding, it may be on purpose, or the drive's firmware may simply have 'crashed'.

  • If it is on purpose, the drive may set the device fault bit.
  • If it stops responding and the fault bit wasn't set, we know it crashed.
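As a rough illustration of the states mentioned above, here is a small Python sketch that interprets an ATA status byte. The bit masks (BSY, DRDY, DF, DRQ, ERR) follow the standard ATA status register layout; the `classify` function, its `timed_out` parameter and the returned labels are hypothetical and only mirror the reasoning in this answer, not the logic of any particular tool.

```python
# Standard ATA status register bits.
BSY = 0x80   # drive is busy
DRDY = 0x40  # drive is ready to accept commands
DF = 0x20    # device fault
DRQ = 0x08   # data request (data is ready to be transferred)
ERR = 0x01   # the last command ended with an error


def classify(status, timed_out):
    """Map a status byte plus a host-side timeout flag to a rough diagnosis.

    `timed_out` is an assumption: it means our own timer expired while
    waiting for the drive to finish the command.
    """
    if timed_out:
        if status & DF:
            return "device fault (drive stopped on purpose)"
        return "no response, firmware crash suspected"
    if status & BSY:
        return "busy"
    if status & ERR:
        return "error reported"
    if status & DF:
        return "device fault (drive stopped on purpose)"
    if status & DRDY:
        return "ready"
    return "unknown"


print(classify(0x50, timed_out=False))  # ready
print(classify(0x51, timed_out=False))  # error reported
print(classify(0x80, timed_out=True))   # no response, firmware crash suspected
```

When ERR is set, the separate error register would tell us what type of error occurred; that part is left out here to keep the sketch short.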

On other occasions the drive may take a long time to respond and yet give us the data without reporting anomalies.

  • Drive stays BUSY for long time
  • But eventually delivers data without reporting errors

In this case it's likely that the firmware considers it has more important things to handle than giving us the data from the sector we're trying to read. An example of this is the so-called 'WD slow response bug', where we read a bad sector at some point and the drive keeps trying to reallocate it even when we want it to do other things.

Now for argument's sake I'll assume we do not have tools to modify the firmware to for example make it stop trying to reallocate sectors.

Then what can we do to prevent the drive from doing excessive re-reads, reallocations and the like, which all nibble away at our precious remaining reads? The answer is resets and power-cycles. Power-cycles are only a last resort, as they themselves 'stress' the drive (in the case of mechanical drives).

  • Since 'bad areas' or bad sectors often occur in clusters (for example because a head crashed into the drive's surface), we skip n sectors after each unreadable sector, and we keep track of the unreadable and skipped sectors so we can address them during later passes. So in pass 1 we try to avoid any problem areas.
  • To prevent the drive from retrying excessively, we reset the drive after m ms, skip n sectors and try again. To determine a good time-out value we examine how long it takes on average to read a good sector from the drive. For example, in the case of this (flash) drive I took the time it takes to read a good sector, added some margin and settled on a read time-out of 200 ms. [screenshot]
  • If the drive fails to respond to resets, we power-cycle the drive.
  • If we suspect and have determined that the drive's firmware, rather than the sectors we're trying to read, is the issue, we can attempt a reset / power-cycle and immediately try reading the same sector again. Often we can read a stretch of sectors this way before the firmware gets preoccupied with its background tasks again. Here's an example of that: [screenshot] Of course, during imaging we won't wait 40 seconds for the drive to disconnect; we send our reset / power-cycle much sooner.
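Choosing the time-out can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample timings are invented, the 25% margin is the figure suggested in the comments on this answer, and I use the median rather than the mean so that one slow outlier does not inflate the result.

```python
from statistics import median


def read_timeout_ms(good_read_times_ms, margin=0.25):
    """Typical good-sector read time plus a safety margin, in milliseconds.

    `good_read_times_ms` are timings measured on sectors known to read fine.
    `margin` is an assumed fraction added on top (0.25 means +25%).
    """
    if not good_read_times_ms:
        raise ValueError("need at least one timing sample")
    return median(good_read_times_ms) * (1 + margin)


# Hypothetical timings from a few good reads on a slow drive.
samples = [100, 160, 200]
print(read_timeout_ms(samples))  # 200.0
```

Any read that takes longer than this value is a candidate for a reset, with the sector saved for a later pass.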

In case of a multi-pass strategy we can try skipped sectors during a next pass, and after that a pass to try to recover data from confirmed bad sectors (those that resulted in an error).
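The multi-pass idea can be sketched against a simulated drive. This is not how HDDSuperClone or ddrescue are implemented; it is a minimal Python illustration in which `image_drive`, `make_fake_drive` and the `skip` parameter are hypothetical names, and `read_sector` is assumed to enforce the time-out and reset itself, raising `OSError` (or `TimeoutError`) on failure.

```python
def image_drive(read_sector, total_sectors, skip=8):
    """Three passes: skip past errors, revisit skipped, then retry bad ones."""
    image = {}      # lba -> sector data recovered so far
    skipped = set()  # sectors jumped over without being tried
    bad = set()      # sectors that returned an error at least once

    # Pass 1: read forward; after each failure jump ahead `skip` sectors.
    lba = 0
    while lba < total_sectors:
        try:
            image[lba] = read_sector(lba)
            lba += 1
        except OSError:  # TimeoutError is a subclass of OSError
            bad.add(lba)
            resume = min(lba + 1 + skip, total_sectors)
            skipped.update(range(lba + 1, resume))
            lba = resume

    # Pass 2: try the sectors we jumped over.
    for lba in sorted(skipped):
        try:
            image[lba] = read_sector(lba)
        except OSError:
            bad.add(lba)

    # Pass 3: one more attempt at the confirmed bad sectors.
    for lba in sorted(bad):
        try:
            image[lba] = read_sector(lba)
        except OSError:
            pass

    still_bad = {lba for lba in bad if lba not in image}
    return image, still_bad


def make_fake_drive(hard_bad, flaky):
    """Simulated drive: `hard_bad` always fail, `flaky` fail on first try only."""
    attempts = {}

    def read_sector(lba):
        attempts[lba] = attempts.get(lba, 0) + 1
        if lba in hard_bad:
            raise OSError(f"unreadable sector {lba}")
        if lba in flaky and attempts[lba] == 1:
            raise TimeoutError(f"timeout on sector {lba}")
        return bytes([lba % 256]) * 512

    return read_sector


image, still_bad = image_drive(make_fake_drive({10, 11}, {20}), 32, skip=4)
print(len(image), sorted(still_bad))  # 30 [10, 11]
```

In this run the first pass gets everything outside the two problem areas, the second pass fills in the skipped neighbours, and the third pass recovers the sector that only failed once.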

Don't!

  • run chkdsk
  • run disk repair tools such as SpinRite or HDD Regenerator
  • freeze the drive
4
  • The answer gives an example of an error recovery procedure with steps 1. Initial read ... to ... 16. Software (2-burst) EDAC correction attempt. Is it possible to see these steps happening while the disk is connected, or to see (for example, somewhere in SMART) that these steps happened on a previous plug-in, and in what order?
    – halt9k
    Commented May 18, 2023 at 22:18
  • This happens internally, and the drive will just return the BUSY state for the duration of the process. As I mentioned, it may be desirable to interrupt the process with either a RESET, if possible, or a power-cycle. To decide whether to do this and how soon, you take the typical time required for reading a good sector (you test this) + say 25%. If a read takes longer, you try a reset and move on while saving the current sector for a second pass. Is this what you mean? If so I can add it to the answer. Commented May 18, 2023 at 22:29
  • Yes, it would be good to mention the BUSY state. The point is that you want to distinguish between a dead freeze of the firmware and a repair procedure in progress. For example, I don't remember any sound from the HDD at all after the freezes, while during a repair procedure the usual head movement "clicks" are probably expected?
    – halt9k
    Commented May 18, 2023 at 22:36
  • Hard drives and also SSDs are largely black boxes. Error recovery procedures are in essence repeated reads, so they sound like reads too. I will add some info. Commented May 18, 2023 at 22:50
0

For the data recovery part of the question: this may or may not help everyone with the same symptoms, but either

  • viewing SMART with the disk manufacturer's tools
  • using the disk spin-down option

helped me to read all data normally without getting the same freeze again. In my case it was a Seagate drive and I used their free tools, option Advanced -> spin down.

Additional notes:
  • I had no critical data on that disk, and an irrecoverable failure was acceptable.
  • Originally, before the freeze, this was an external USB HDD. After the freeze I disassembled the case and attached the disk via a direct SATA -> USB cable, not with the original internal adapter.
    This would probably be an unrecoverable case if no alternatives to SATA were available. I.e. this USB approach may also help when the disk is originally an internal SATA disk that is simply not identified by the BIOS.
