
I had a degraded disk in a ZFS volume on my FreeNAS server [build 9.10.2-U1 (86c7ef5)], and before trying to replace it I rebooted the server.

What does the following mean, and do I have an issue with that disk?

  • At startup, I get the following alert even though all disks show as back online in the volume status: [screenshot: Alert]

  • During the scrub operation, a new alert showed the disk in a degraded state with a checksum error count of 670 (unsure what that means): [screenshots: Degraded disk, New Alert]

  • Scrub results:
    The scrub operation is now finished. Here are the final results:
    
         state: DEGRADED
        status: One or more devices has experienced an unrecoverable error.  An
                attempt was made to correct the error.  Applications are unaffected.
    
        action: Determine if the device needs to be replaced, and clear the errors
                using 'zpool clear' or replace the device with 'zpool replace'.
    
           see: http://illumos.org/msg/ZFS-8000-9P
    
          scan: scrub repaired 66.7M in 16h55m with 0 errors on Sat Jan  2 13:32:13 2021
    
        config:
          NAME                                            STATE     READ WRITE CKSUM
          storage                                         DEGRADED     0     0     0
            raidz1-0                                      DEGRADED     0     0     0
              gptid/e0ef3f08-70b6-11e6-b8eb-1c98ec0f2cd4  ONLINE       0     0     0
              gptid/e1b21671-70b6-11e6-b8eb-1c98ec0f2cd4  DEGRADED     0     0 1.29K  too many errors
              gptid/e2841c02-70b6-11e6-b8eb-1c98ec0f2cd4  ONLINE       0     0     0
              gptid/e3717f0c-70b6-11e6-b8eb-1c98ec0f2cd4  ONLINE       0     0     0
    
        errors: No known data errors
    

  • smartctl -a:
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed: read failure       90%     39365         172825824
    # 2  Extended offline    Completed: read failure       90%     39365         172825825
    # 3  Short offline       Completed without error       00%     39364         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    
  • The SMART data is going to be critical here, and running a scrub was probably a good first instinct.
    – Karu
    Commented Jan 1, 2021 at 20:53
  • I have 4x3TB disks, used approximately at 60%. Which SMART test should I run? Long Self-Test, Short Self-Test, Conveyance Self-Test, Offline Immediate Test?
    – fharreau
    Commented Jan 1, 2021 at 21:00
  • There's probably no need for the actual tests, but you should update this question with what the current SMART data shows for the drive. I'm particularly interested in whether it shows any pending/reallocated sectors.
    – Karu
    Commented Jan 1, 2021 at 21:02
  • What do you mean by "smart data"?
    – fharreau
    Commented Jan 1, 2021 at 21:08
  • You should be able to go into a shell and run smartctl -a against the device; I don't think this is exposed in the FreeNAS GUI anywhere. This will show the drive's own health-monitoring statistics (see the example commands after these comments).
    – Karu
    Commented Jan 1, 2021 at 21:10
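
For reference, a minimal way to pull that SMART data from the FreeNAS shell, assuming the suspect disk maps to /dev/ada1 (the device name is a placeholder - match the gptid shown in zpool status to the real device node first):

    # map the gptid shown in `zpool status` to a device node
    glabel status | grep e1b21671

    # dump the drive's full SMART data (attributes, error log, self-test log)
    smartctl -a /dev/ada1

    # optionally queue a long self-test; check `smartctl -a` again once it finishes
    smartctl -t long /dev/ada1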

1 Answer


As the output from smartctl -a shows, the drive is reporting read errors from its own onboard testing. That rules out your RAID controller or a software problem as the cause.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     39365         172825824
# 2  Extended offline    Completed: read failure       90%     39365         172825825

This is very bad. Source a new drive and replace it ASAP. The error probably only appeared to be transient because it is confined to roughly the same physical location on the disk - FreeNAS/ZFS likely didn't touch that exact spot again until you ran a scrub and forced it to read the whole volume, which is why the drive came back online after the reboot.
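
As a rough sketch of the replacement from the shell, assuming the new disk shows up as /dev/ada4 (a placeholder; the pool name and gptid are taken from the zpool status output above). On FreeNAS the GUI's Volume Status > Replace is usually preferred because it also recreates the swap partition and GPT layout, so treat this only as an illustration of the underlying ZFS operations:

    # take the failing member offline (gptid copied from zpool status)
    zpool offline storage gptid/e1b21671-70b6-11e6-b8eb-1c98ec0f2cd4

    # after physically swapping the disk, resilver onto the new device
    # (/dev/ada4 is a placeholder for whatever node the new disk gets)
    zpool replace storage gptid/e1b21671-70b6-11e6-b8eb-1c98ec0f2cd4 /dev/ada4

    # watch the resilver progress
    zpool status storage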

  • Just an FYI, there is no RAID controller with ZFS, as it's software RAID.
    – JW0914
    Commented Jan 4, 2021 at 12:30
  • @JW0914 There is a disk controller, though, whether or not it's a RAID controller. And there very well could be a RAID controller in use here. Commented Jan 7, 2021 at 19:46
  • There is no hardware RAID controller involved when using ZFS... you don't run ZFS on top of hardware RAID - it's one or the other (there are reasons for this: it's pointless, and it degrades performance; for a complete explanation, see the TrueNAS forums). A disk controller is not the same as a hardware RAID controller... they're two completely different things.
    – JW0914
    Commented Jan 8, 2021 at 11:58
