I am running CentOS 7.7.1908 on a server with a software RAID5 array (mdadm). The array consists of four 4 TB drives. I recently replaced some of the drives with brand-new WD Red drives. All was well for a little over a week, until I woke up one morning to a "fail" event: one of the new drives (/dev/sda) had been marked as failed and dropped out of the array.
I ran a short SMART self-test and the drive checked out OK. There were no errors noted in the drive's SMART log, so I added it back into the array. The array resynced successfully and everything appears fine. But since nothing obvious precipitated the fail event, I'm concerned there may be something wrong with the drive.
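For reference, the check-and-re-add sequence looks roughly like the following. This is a sketch, not an exact transcript: it assumes smartmontools and mdadm are installed, root privileges, and the device/array names used elsewhere in this post (/dev/sda, /dev/sda1, /dev/md127). The command portion is guarded so the sketch is a no-op on machines without the tools.

```shell
# Sketch of the short-test-then-re-add sequence (assumes smartmontools and
# mdadm, run as root; device and array names are the ones from this post).
drive=/dev/sda
member=/dev/sda1
array=/dev/md127
if command -v smartctl >/dev/null 2>&1 && command -v mdadm >/dev/null 2>&1; then
    smartctl -t short "$drive"      # short self-test (~2 min per the polling time below)
    sleep 150                       # wait a little longer than the estimate
    smartctl -l selftest "$drive"   # look for "Completed without error"
    smartctl -l error "$drive"      # should report "No Errors Logged"
    mdadm --manage "$array" --re-add "$member"   # re-add the member; resync follows
fi
```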
Below are the syslog messages from the time the drive was "failed" out of the array:
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 9 03:34:11 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 9 03:34:11 server kernel: blk_update_request: I/O error, dev sda, sector 2056
Apr 9 03:34:11 server kernel: md: super_written gets error=-5, uptodate=0
Apr 9 03:34:11 server kernel: md/raid:md127: Disk failure on sda1, disabling device.#012md/raid:md127: Operation continuing on 3 devices.
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 9 03:38:50 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 9 03:38:50 server kernel: blk_update_request: I/O error, dev sda, sector 0
Apr 9 03:38:51 server kernel: mpt2sas_cm0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
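To watch for recurrences of this failure signature, the relevant kernel messages can be grepped out of the log. Here's a sketch run against a saved excerpt built from the lines above (filename and sample are illustrative); on a live CentOS 7 box you would grep /var/log/messages or the output of journalctl -k instead:

```shell
# Count I/O-error and md-failure events in a saved log excerpt.
# (Sample lines are taken from the syslog output above; on the live system,
# grep /var/log/messages or `journalctl -k` instead of this temp file.)
cat > /tmp/raidlog.txt <<'EOF'
Apr  9 03:34:11 server kernel: blk_update_request: I/O error, dev sda, sector 2056
Apr  9 03:34:11 server kernel: md: super_written gets error=-5, uptodate=0
Apr  9 03:34:11 server kernel: md/raid:md127: Disk failure on sda1, disabling device.
Apr  9 03:38:50 server kernel: blk_update_request: I/O error, dev sda, sector 0
EOF
io_errors=$(grep -c 'blk_update_request: I/O error' /tmp/raidlog.txt)
md_failures=$(grep -c 'md/raid:md127: Disk failure' /tmp/raidlog.txt)
echo "io_errors=$io_errors md_failures=$md_failures"
```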
It's difficult to figure out exactly what happened since the error states there is "no additional sense information." However, after the resync completed I decided to run an extended SMART test on the drive. I started it yesterday afternoon and it was progressing fine...until I woke up this morning.
It has apparently been sitting at "10% of test remaining" all night, so I assume something isn't working properly. I also noted that the SMART information for this drive indicates that the "extended self-test routine recommended polling time" is 497 mins, while the time for the other drives in the array -- which are identical in make and model -- is approx. 205 mins.
So...perhaps this is a defective drive that has errors SMART doesn't record? Or is there something else that might be going on? Has anyone seen something like this before? Any help would be appreciated. Thanks!
Update: More Info
As requested, here is the output of smartctl for the drive in question:
[user@localhost]~% sudo smartctl -a /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1062.18.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD40EFAX-68JH4N0
Serial Number: WD-XXXXXXXXXXXX
LU WWN Device Id: 5 0014ee 2bce22f9d
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Apr 10 11:02:15 2020 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: (23324) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 497) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3039) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 4
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 205
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 4
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 2
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 19
194 Temperature_Celsius 0x0022 114 107 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 177 -
# 2 Extended offline Interrupted (host reset) 10% 108 -
# 3 Short offline Completed without error 00% 0 -
# 4 Conveyance offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Update: Even More Info
Per the next suggestions from @dirkt, I tried to read from the sectors noted in the original syslog errors:
[user@localhost]~% sudo dd bs=512 if=/dev/sda1 of=./sector0-sda1.txt skip=0 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00244528 s, 209 kB/s
[user@localhost]~% sudo dd bs=512 if=/dev/sda1 of=./sector2056-sda1.txt skip=2056 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00281374 s, 182 kB/s
This is not something I'm terribly familiar with, but I assume it means the reads were successful? The file for sector 0 appears empty (all zero bytes), and the file for sector 2056 contains some binary gibberish. Should I attempt to write to those sectors? Edit: I should probably add that the SMART info remains the same after the reads. No errors logged, and the extended test is still at "10% remaining."
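One detail worth noting for anyone repeating this: dd's skip is counted in units of bs, so with bs=512 the command above reads the 512-byte sector starting at byte offset sector * 512. A quick sanity check of that arithmetic:

```shell
# dd's `skip` counts blocks of size `bs`, so skip=2056 with bs=512 reads the
# 512-byte sector starting at byte 2056 * 512.
sector=2056
bs=512
offset=$((sector * bs))
echo "sector $sector starts at byte $offset"   # byte 1052672
# To eyeball what was read (the "gibberish"), hexdump the saved file, e.g.:
#   hexdump -C ./sector2056-sda1.txt | head
```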
Update #3
Since I can read those sectors, it seems they are OK. After reading them (per above), there were no new entries in the SMART attributes or error log:
[user@localhost]~% sudo smartctl -a /dev/sda
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 4
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 252
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 4
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 2
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 19
194 Temperature_Celsius 0x0022 111 107 000 Old_age Always - 36
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
So I added the drive back into the array. The resync was successful, and the error has not recurred yet. So maybe it's ok?
[user@localhost]~% cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdb1[7] sdc1[5] sdd1[4] sda1[6]
11721047040 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
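A quick way to spot a degraded array in that output: the member-status field reads [UUUU] when all members are up, and a failed member shows as "_" there (with "(F)" next to the device name). This is a sketch demonstrated against a saved sample so it runs anywhere; on the server itself you would grep /proc/mdstat directly:

```shell
# Detect a degraded md array from mdstat-style output: any "_" inside the
# [....] member-status field means a member is missing or failed.
# Demonstrated against a saved sample; live, grep /proc/mdstat instead.
cat > /tmp/mdstat.sample <<'EOF'
md127 : active raid5 sdb1[7] sdc1[5] sdd1[4] sda1[6](F)
      11721047040 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
EOF
if grep -q '\[.*_.*\]' /tmp/mdstat.sample; then
    degraded=yes
else
    degraded=no
fi
echo "degraded=$degraded"
```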
One new thing I noticed: per my note below regarding extended self-tests, I tried running selective self-tests by specifying smartctl -t select,0-max /dev/sdX. Per the workaround below, this should mimic a long test but provide a more detailed progress indicator. I ran this selective test on every drive, since the long tests were stuck at 10% remaining on every drive for days. On the three "good" drives in the array, the selective test completed without error in a reasonable amount of time (several hours, but less than a day). The selective test on the "questionable" drive (/dev/sda) is taking much longer. It still says 10% remaining like before, but the progress indicator is more useful:
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 7814037167 Self_test_in_progress [10% left] (5010947864-5011013399)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
It has been running for approx. 12 hours at this point. It's going very slowly (especially compared to the other drives) but still seems to be progressing. I will post an update when it finishes (if it finishes)... Edit: the selective self-test has finally finished, and it completed without error. So I guess that means all is well?
Update #4: The Return
Everything was working fine for the past week. Unfortunately this afternoon the same drive dropped out of the array again. The same errors popped up in the syslog:
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 14 18:07:38 xenon kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 14 18:07:38 xenon kernel: blk_update_request: I/O error, dev sda, sector 2056
Apr 14 18:07:38 xenon kernel: md: super_written gets error=-5, uptodate=0
Apr 14 18:07:38 xenon kernel: md/raid:md127: Disk failure on sda1, disabling device.#012md/raid:md127: Operation continuing on 3 devices.
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 14 18:08:50 xenon kernel: sd 0:0:0:0: [sda] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Apr 14 18:08:50 xenon kernel: blk_update_request: I/O error, dev sda, sector 0
Apr 14 18:08:51 xenon kernel: mpt2sas_cm0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
After these errors, mdadm sent its failure notification, and /proc/mdstat confirms the drive was failed out of the array:
[user@localhost]/var/log# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdb1[7] sdc1[5] sdd1[4] sda1[6](F)
11721047040 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
unused devices: <none>
I have started a selective SMART test on /dev/sda, but since none of the previous tests turned up anything wrong, I'm not optimistic. Is there any way to tell whether this is a bad drive or a drive controller going bad? Since the same drive dropped both times, I'm inclined to think it's the drive, but does anyone know how to decode the errors in the logs? Happy to provide more info. Thanks!
Update #5: The Saga Continues
For those who are following things, here is the latest:
- Since I had an old enclosure, I put my original 2 TB drives in it and quickly created a new "spare" array.
- I copied the contents of the 4 TB drive array to the spare array.
- I deleted the original array and created a new RAID10 array using the 4 TB drives (based on a variety of searches, it seems that RAID5 with large drives, particularly 4 or more, does not actually offer great performance or redundancy).
- The new array initialized successfully. I copied the original data from the 2 TB drive spare array to the new 4 TB drive RAID10 array.
- Based on the discussions with @dirkt below (who is awesome, btw), I have disabled NCQ on each of the 4 TB drives via echo 1 > /sys/block/sdX/device/queue_depth. This is an effort to reduce the complexity/parallelism of the array, and also because there are some discussions indicating that NCQ may actually be bad for RAID performance. I am letting the array run with this temporary fix to see if it resolves the issue.
- Based on a tip from Mike Uchima at the Ars Technica comment boards (original post here), I also set the noatime mount option for the array's file system (it is not set by default on ext4 file systems). Per the comment-board discussion, updating last-access times might overwhelm the SMR logic in the drive, ultimately leading to a drive being dropped.
- I will post updates if the "faulty" drive (or another drive) drops out of the array again.
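The NCQ change has to be applied per drive, and it does not survive a reboot unless scripted. Here is a sketch of the loop, demonstrated against a mock sysfs tree under /tmp so it is safe to run anywhere; on the real server you would drop the $SYSROOT prefix (writing to /sys/block/$d/device/queue_depth directly) and run it as root:

```shell
# Disable NCQ by forcing queue_depth to 1 on each array member.
# A mock sysfs tree under /tmp keeps this sketch harmless; on the real
# server, remove the $SYSROOT prefix and run as root.
SYSROOT=/tmp/mock-sys
for d in sda sdb sdc sdd; do
    mkdir -p "$SYSROOT/block/$d/device"
    echo 32 > "$SYSROOT/block/$d/device/queue_depth"   # typical default depth
done
for d in sda sdb sdc sdd; do
    echo 1 > "$SYSROOT/block/$d/device/queue_depth"    # depth 1 = NCQ off
done
# Verify every member now reports a depth of 1:
result=$(cat "$SYSROOT"/block/sd?/device/queue_depth | sort -u)
echo "queue_depth now: $result"
```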
In addition, a number of media outlets have started reporting on some deceptive marketing practices by major hard drive manufacturers, including Western Digital (an example is linked here). It appears they are using Shingled Magnetic Recording (SMR) in several models of their Red drives without labeling or advertising them as such, even though SMR has been known to cause problems with NAS and RAID configurations (ironically, some of the issues with SMR are mentioned here in WD's own materials, where they point out that drive-managed SMR is bad for parallel operations...like RAID). This is obviously a problem since the Red drives are specifically marketed for NAS and RAID purposes.
The model of 4 TB drive I purchased is suspected of being among the models that use SMR (model WD40EFAX). Based on the news articles, EFAX models with 256 MB of cache (like mine) are likely to use SMR. Using hdparm -I shows that my drives support TRIM, which is apparently another indicator that a drive uses SMR:
[user@localhost]~% sudo hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: WDC WD40EFAX-68JH4N0
...
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* Data Set Management TRIM supported (limit 10 blocks)
* Deterministic read ZEROs after TRIM
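A spinning drive that advertises TRIM support is a strong SMR hint, and that check is easy to script. This sketch runs against a saved excerpt of the hdparm output above (filename illustrative); on the live system you would pipe sudo hdparm -I /dev/sdX into the same grep:

```shell
# Flag a probable drive-managed SMR disk: rotational media that advertises
# TRIM. Checked against a saved hdparm -I excerpt; on a live system, pipe
# `sudo hdparm -I /dev/sdX` into the grep instead.
cat > /tmp/hdparm.sample <<'EOF'
	   *	Data Set Management TRIM supported (limit 10 blocks)
	   *	Deterministic read ZEROs after TRIM
EOF
if grep -q 'TRIM supported' /tmp/hdparm.sample; then
    trim=yes
else
    trim=no
fi
echo "TRIM advertised: $trim"
```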
I am now suspicious/nervous that the cause of my issues may be SMR, which is obviously not something that can be fixed. I submitted a support ticket to Western Digital and provided all of this information, and asked if they would be willing to swap the "faulty" drive with a version that uses CMR instead of SMR (supposedly the WD40EFRX model uses CMR). I will post updates here either way just so there is one more case study out there.
A note about the never-ending extended test
Some Google searches indicate that the extended/long SMART test that never finishes (stuck at 90% complete / 10% remaining) is apparently a common issue, even for drives that are good. I started running a long test on one of the other drives in my array and it has also been stuck at 10% remaining for quite some time. There are many theories out there about why this happens, but not much about fixes. I did find a possible workaround (link below) that I will try, but otherwise this may just be a frustrating bug.
Suggestions from the comments (from @dirkt):

- Run smartctl -a /dev/sdX -- this will show the actual SMART values and tell us about the general state of the drive. If you have a single bad sector somewhere, it can happen that the test gets stuck. Writing that sector will allocate a spare, if the drive is otherwise healthy.
- Try to read the suspect sectors with dd; if the read fails, write them (e.g. from /dev/zero), then try to read them again, and check the SMART values to see whether they have been reallocated. If that works, run badblocks for good measure.
- The SMART values look fine (in particular Reallocated_Sector_Ct), so I guess the next step would be to bring the drive back into the RAID and see if the error happens again...
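For the record, the write-to-force-reallocation step described in the comments looks roughly like this. It destroys the sector's contents, so the sketch below targets a scratch file under /tmp; against a real disk the of= target would be the device node, the seek value the sector number from the kernel log, and you absolutely want backups first:

```shell
# Force-rewrite a single 512-byte sector with zeros. conv=notrunc keeps the
# rest of the target intact. Demonstrated on a scratch file; on a real drive
# the of= target would be /dev/sdX and this DESTROYS that sector's data.
img=/tmp/fakedisk.img
dd if=/dev/zero of=$img bs=512 count=16 2>/dev/null          # 16-sector scratch "disk"
printf 'OLDDATA' | dd of=$img bs=512 seek=5 conv=notrunc 2>/dev/null   # dirty sector 5
dd if=/dev/zero of=$img bs=512 seek=5 count=1 conv=notrunc 2>/dev/null # the rewrite
# Read sector 5 back and confirm it is now all zero bytes:
nonzero=$(dd if=$img bs=512 skip=5 count=1 2>/dev/null | tr -d '\0' | wc -c)
echo "nonzero bytes in sector 5: $nonzero"
```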