I am running CentOS 7.7.1908 on a server with a software RAID5 array (mdadm). The array consists of four 4 TB drives. I recently replaced some of the drives with brand-new WD Red drives. All was well for a little over a week, until I woke up one morning to a "fail" event: one of the new drives (/dev/sda) had been marked as failed and dropped out of the array.
I ran a short SMART self-test and the drive checked out OK. There were no errors noted in the drive's SMART log, so I added it back into the array. The array resynced successfully and everything appears fine. But since nothing obvious precipitated the fail event, I'm concerned there may be something wrong with the drive.
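For reference, the check-and-re-add sequence looks roughly like the following. This is a sketch, not an exact transcript: it assumes smartmontools and mdadm are installed, root privileges, and the device/array names used elsewhere in this post (/dev/sda, /dev/sda1, /dev/md127). The command portion is guarded so the sketch is a no-op on machines without the tools.

```shell
# Sketch of the short-test-then-re-add sequence (assumes smartmontools and
# mdadm, run as root; device and array names are the ones from this post).
drive=/dev/sda
member=/dev/sda1
array=/dev/md127
if command -v smartctl >/dev/null 2>&1 && command -v mdadm >/dev/null 2>&1; then
    smartctl -t short "$drive"      # short self-test (~2 min per the polling time below)
    sleep 150                       # wait a little longer than the estimate
    smartctl -l selftest "$drive"   # look for "Completed without error"
    smartctl -l error "$drive"      # should report "No Errors Logged"
    mdadm --manage "$array" --re-add "$member"   # re-add the member; resync follows
fi
```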
Below are the syslog messages from the time the drive was "failed" out of the array:
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 9 03:34:11 server kernel: blk_update_request: I/O error, dev sda, sector 2056
Apr 9 03:34:11 server kernel: md: super_written gets error=-5, uptodate=0
Apr 9 03:34:11 server kernel: md/raid:md127: Disk failure on sda1, disabling device.#012md/raid:md127: Operation continuing on 3 devices.
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 9 03:38:50 server kernel: blk_update_request: I/O error, dev sda, sector 0
Apr 9 03:38:51 server kernel: mpt2sas_cm0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
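To watch for recurrences of this failure signature, the relevant kernel messages can be grepped out of the log. Here's a sketch run against a saved excerpt built from the lines above (filename and sample are illustrative); on a live CentOS 7 box you would grep /var/log/messages or the output of journalctl -k instead:

```shell
# Count I/O-error and md-failure events in a saved log excerpt.
# (Sample lines are taken from the syslog output above; on the live system,
# grep /var/log/messages or `journalctl -k` instead of this temp file.)
cat > /tmp/raidlog.txt <<'EOF'
Apr  9 03:34:11 server kernel: blk_update_request: I/O error, dev sda, sector 2056
Apr  9 03:34:11 server kernel: md: super_written gets error=-5, uptodate=0
Apr  9 03:34:11 server kernel: md/raid:md127: Disk failure on sda1, disabling device.
Apr  9 03:38:50 server kernel: blk_update_request: I/O error, dev sda, sector 0
EOF
io_errors=$(grep -c 'blk_update_request: I/O error' /tmp/raidlog.txt)
md_failures=$(grep -c 'md/raid:md127: Disk failure' /tmp/raidlog.txt)
echo "io_errors=$io_errors md_failures=$md_failures"
```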
It's difficult to figure out exactly what happened since the error states there is "no additional sense information." However, after the resync completed I decided to run an extended SMART test on the drive. I started it yesterday afternoon and it was progressing fine...until I woke up this morning.
It has apparently been sitting at "10% of test remaining" all night, so I assume something isn't working properly. I also noted that the SMART information for this drive indicates that the "extended self-test routine recommended polling time" is 497 mins, while the time for the other drives in the array -- which are identical in make and model -- is approx. 205 mins.
So...perhaps this is a defective drive that has errors SMART doesn't record? Or is there something else that might be going on? Has anyone seen something like this before? Any help would be appreciated. Thanks!
Update: More Info
As requested, here is the output of smartctl for the drive in question:
[user@localhost]~% sudo smartctl -a /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1062.18.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD40EFAX-68JH4N0
Serial Number: WD-XXXXXXXXXXXX
LU WWN Device Id: 5 0014ee 2bce22f9d
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Apr 10 11:02:15 2020 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: (23324) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 497) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3039) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 4
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 205
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 4
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 2
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 19
194 Temperature_Celsius 0x0022 114 107 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 177 -
# 2 Extended offline Interrupted (host reset) 10% 108 -
# 3 Short offline Completed without error 00% 0 -
# 4 Conveyance offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Update: Even More Info
Per the next suggestions from @dirkt, I tried to read from the sectors noted in the original syslog errors:
[user@localhost]~% sudo dd bs=512 if=/dev/sda1 of=./sector0-sda1.txt skip=0 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00244528 s, 209 kB/s
[user@localhost]~% sudo dd bs=512 if=/dev/sda1 of=./sector2056-sda1.txt skip=2056 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00281374 s, 182 kB/s
This is not something I'm terribly familiar with, but I assume it means the reads were successful? The file for sector 0 appears empty (all zero bytes), and the file for sector 2056 contains some binary gibberish. Should I attempt to write to those sectors? Edit: I should probably add that the SMART info remains the same after the reads. No errors logged, and the extended test is still at "10% remaining."
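One detail worth noting for anyone repeating this: dd's skip is counted in units of bs, so with bs=512 the command above reads the 512-byte sector starting at byte offset sector * 512. A quick sanity check of that arithmetic:

```shell
# dd's `skip` counts blocks of size `bs`, so skip=2056 with bs=512 reads the
# 512-byte sector starting at byte 2056 * 512.
sector=2056
bs=512
offset=$((sector * bs))
echo "sector $sector starts at byte $offset"   # byte 1052672
# To eyeball what was read (the "gibberish"), hexdump the saved file, e.g.:
#   hexdump -C ./sector2056-sda1.txt | head
```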
Update #3
Since I can read those sectors, it seems they are OK. After reading them (per above), there were no new entries in the SMART attributes or error log:
[user@localhost]~% sudo smartctl -a /dev/sda
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 4
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 252
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 4
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 2
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 19
194 Temperature_Celsius 0x0022 111 107 000 Old_age Always - 36
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
So I added the drive back into the array. The resync was successful, and the error has not recurred yet. So maybe it's ok?
[user@localhost]~% cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdb1[7] sdc1[5] sdd1[4] sda1[6]
11721047040 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
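A quick way to spot a degraded array in that output: the member-status field reads [UUUU] when all members are up, and a failed member shows as "_" there (with "(F)" next to the device name). This is a sketch demonstrated against a saved sample so it runs anywhere; on the server itself you would grep /proc/mdstat directly:

```shell
# Detect a degraded md array from mdstat-style output: any "_" inside the
# [....] member-status field means a member is missing or failed.
# Demonstrated against a saved sample; live, grep /proc/mdstat instead.
cat > /tmp/mdstat.sample <<'EOF'
md127 : active raid5 sdb1[7] sdc1[5] sdd1[4] sda1[6](F)
      11721047040 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
EOF
if grep -q '\[.*_.*\]' /tmp/mdstat.sample; then
    degraded=yes
else
    degraded=no
fi
echo "degraded=$degraded"
```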
One new thing I noticed: per my note below regarding extended self-tests, I tried running selective self-tests by specifying smartctl -t select,0-max /dev/sdX. Per the workaround below, this should mimic a long test but provide a more detailed progress indicator. I ran this selective test on every drive, since the long tests were stuck at 10% remaining on every drive for days. On the three "good" drives in the array, the selective test completed without error in a reasonable amount of time (several hours, but less than a day). The selective test on the "questionable" drive (/dev/sda) is taking much longer. It still says 10% remaining like before, but the progress indicator is more useful:
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 7814037167 Self_test_in_progress [10% left] (5010947864-5011013399)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
It has been running for approx. 12 hours at this point. It's going very slowly (especially compared to the other drives) but still seems to be progressing. I will post an update when it finishes (if it finishes)... Edit: the selective self-test has finally finished, and it completed without error. So I guess that means all is well?
Update #4: The Return
Everything was working fine for the past week. Unfortunately this afternoon the same drive dropped out of the array again. The same errors popped up in the syslog:
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 14 18:07:38 xenon kernel: blk_update_request: I/O error, dev sda, sector 2056
Apr 14 18:07:38 xenon kernel: md: super_written gets error=-5, uptodate=0
Apr 14 18:07:38 xenon kernel: md/raid:md127: Disk failure on sda1, disabling device.#012md/raid:md127: Operation continuing on 3 devices.
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 14 18:08:50 xenon kernel: blk_update_request: I/O error, dev sda, sector 0
Apr 14 18:08:51 xenon kernel: mpt2sas_cm0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
After these errors, mdadm sent its failure notification, and /proc/mdstat confirms the drive was failed out of the array:
[user@localhost]/var/log# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdb1[7] sdc1[5] sdd1[4] sda1[6](F)
11721047040 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
unused devices: <none>
I have started a selective SMART test on /dev/sda, but since none of the previous tests turned up anything wrong, I'm not optimistic. Is there any way to tell whether this is a bad drive or a drive controller going bad? Since the same drive dropped both times, I'm inclined to think it's the drive, but does anyone know how to decode the errors in the logs? Happy to provide more info. Thanks!
Update #5: The Saga Continues
For those who are following things, here is the latest:
- Since I had an old enclosure, I put my original 2 TB drives in it and quickly created a new "spare" array.
- I copied the contents of the 4 TB drive array to the spare array.
- I deleted the original array and created a new RAID10 array using the 4 TB drives (based on a variety of searches, it seems that RAID5 with large drives, particularly 4 or more, does not actually offer great performance or redundancy).
- The new array initialized successfully. I copied the original data from the 2 TB drive spare array to the new 4 TB drive RAID10 array.
- Based on the discussions with @dirkt below (who is awesome, btw), I have disabled NCQ on each of the 4 TB drives via echo 1 > /sys/block/sdX/device/queue_depth. This is an effort to reduce the complexity/parallelism of the array, and also because there are some discussions indicating that NCQ may actually be bad for RAID performance. I am letting the array run with this temporary fix to see if it resolves the issue.
- Based on a tip from Mike Uchima at the Ars Technica comment boards (original post here), I also set the noatime mount option for the array's file system (it is not set by default on ext4 file systems). Per the comment-board discussion, updating last-access times might overwhelm the SMR logic in the drive, ultimately leading to a drive being dropped.
- I will post updates if the "faulty" drive (or another drive) drops out of the array again.
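The NCQ change has to be applied per drive, and it does not survive a reboot unless scripted. Here is a sketch of the loop, demonstrated against a mock sysfs tree under /tmp so it is safe to run anywhere; on the real server you would drop the $SYSROOT prefix (writing to /sys/block/$d/device/queue_depth directly) and run it as root:

```shell
# Disable NCQ by forcing queue_depth to 1 on each array member.
# A mock sysfs tree under /tmp keeps this sketch harmless; on the real
# server, remove the $SYSROOT prefix and run as root.
SYSROOT=/tmp/mock-sys
for d in sda sdb sdc sdd; do
    mkdir -p "$SYSROOT/block/$d/device"
    echo 32 > "$SYSROOT/block/$d/device/queue_depth"   # typical default depth
done
for d in sda sdb sdc sdd; do
    echo 1 > "$SYSROOT/block/$d/device/queue_depth"    # depth 1 = NCQ off
done
# Verify every member now reports a depth of 1:
result=$(cat "$SYSROOT"/block/sd?/device/queue_depth | sort -u)
echo "queue_depth now: $result"
```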
In addition, a number of media outlets have started reporting on some deceptive marketing practices by major hard drive manufacturers, including Western Digital (an example is linked here). It appears they are using Shingled Magnetic Recording (SMR) in several models of their Red drives without labeling or advertising them as such, even though SMR has been known to cause problems with NAS and RAID configurations (ironically, some of the issues with SMR are mentioned here in WD's own materials, where they point out that drive-managed SMR is bad for parallel operations...like RAID). This is obviously a problem since the Red drives are specifically marketed for NAS and RAID purposes.
The model of 4 TB drive I purchased is suspected of being among the models that use SMR (model WD40EFAX). Based on the news articles, EFAX models with 256 MB of cache (like mine) are likely to use SMR. Using hdparm -I shows that my drives support TRIM, which is apparently another indicator that a drive uses SMR:
[user@localhost]~% sudo hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: WDC WD40EFAX-68JH4N0
...
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* Data Set Management TRIM supported (limit 10 blocks)
* Deterministic read ZEROs after TRIM
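A spinning drive that advertises TRIM support is a strong SMR hint, and that check is easy to script. This sketch runs against a saved excerpt of the hdparm output above (filename illustrative); on the live system you would pipe sudo hdparm -I /dev/sdX into the same grep:

```shell
# Flag a probable drive-managed SMR disk: rotational media that advertises
# TRIM. Checked against a saved hdparm -I excerpt; on a live system, pipe
# `sudo hdparm -I /dev/sdX` into the grep instead.
cat > /tmp/hdparm.sample <<'EOF'
	   *	Data Set Management TRIM supported (limit 10 blocks)
	   *	Deterministic read ZEROs after TRIM
EOF
if grep -q 'TRIM supported' /tmp/hdparm.sample; then
    trim=yes
else
    trim=no
fi
echo "TRIM advertised: $trim"
```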
I am now suspicious/nervous that the cause of my issues may be SMR, which is obviously not something that can be fixed. I submitted a support ticket to Western Digital and provided all of this information, and asked if they would be willing to swap the "faulty" drive with a version that uses CMR instead of SMR (supposedly the WD40EFRX model uses CMR). I will post updates here either way just so there is one more case study out there.
A note about the never-ending extended test
Some Google searches indicate that the extended/long SMART test that never finishes (stuck at 90% complete / 10% remaining) is apparently a common issue, even for drives that are good. I started running a long test on one of the other drives in my array and it has also been stuck at 10% remaining for quite some time. There are many theories out there about why this happens, but not much about fixes. I did find a possible workaround (link below) that I will try, but otherwise this may just be a frustrating bug.
Suggestions from the comments (from @dirkt):

- Run smartctl -a /dev/sdX -- this will show the actual SMART values and tell us about the general state of the drive. If you have a single bad sector somewhere, it can happen that the test gets stuck. Writing that sector will allocate a spare, if the drive is otherwise healthy.
- Try to read the suspect sectors with dd; if the read fails, write them (e.g. from /dev/zero), then try to read them again, and check the SMART values to see whether they have been reallocated. If that works, run badblocks for good measure.
- The SMART values look fine (in particular Reallocated_Sector_Ct), so I guess the next step would be to bring the drive back into the RAID and see if the error happens again...
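For the record, the write-to-force-reallocation step described in the comments looks roughly like this. It destroys the sector's contents, so the sketch below targets a scratch file under /tmp; against a real disk the of= target would be the device node, the seek value the sector number from the kernel log, and you absolutely want backups first:

```shell
# Force-rewrite a single 512-byte sector with zeros. conv=notrunc keeps the
# rest of the target intact. Demonstrated on a scratch file; on a real drive
# the of= target would be /dev/sdX and this DESTROYS that sector's data.
img=/tmp/fakedisk.img
dd if=/dev/zero of=$img bs=512 count=16 2>/dev/null          # 16-sector scratch "disk"
printf 'OLDDATA' | dd of=$img bs=512 seek=5 conv=notrunc 2>/dev/null   # dirty sector 5
dd if=/dev/zero of=$img bs=512 seek=5 count=1 conv=notrunc 2>/dev/null # the rewrite
# Read sector 5 back and confirm it is now all zero bytes:
nonzero=$(dd if=$img bs=512 skip=5 count=1 2>/dev/null | tr -d '\0' | wc -c)
echo "nonzero bytes in sector 5: $nonzero"
```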