
I had a degraded disk in a ZFS volume on my FreeNAS server [build 9.10.2-U1 (86c7ef5)], and before trying to replace it I rebooted the server.

What does the following mean, and do I have an issue with that disk?

  • At startup, I get the following alert even though all disks show as back online in the volume status: [screenshot: Alert]

  • During the scrub operation, a new alert showed the disk in a degraded state with a checksum error count of 670 (unsure what that means): [screenshots: Degraded disk, New Alert]

  • Scrub results:
    The scrub operation is now finished. Here are the final results:
    
         state: DEGRADED
        status: One or more devices has experienced an unrecoverable error.  An
                attempt was made to correct the error.  Applications are unaffected.
    
        action: Determine if the device needs to be replaced, and clear the errors
                using 'zpool clear' or replace the device with 'zpool replace'.
    
           see: http://illumos.org/msg/ZFS-8000-9P
    
          scan: scrub repaired 66.7M in 16h55m with 0 errors on Sat Jan  2 13:32:13 2021
    
        config:
          NAME                                            STATE     READ WRITE CKSUM
          storage                                         DEGRADED     0     0     0
            raidz1-0                                      DEGRADED     0     0     0
              gptid/e0ef3f08-70b6-11e6-b8eb-1c98ec0f2cd4  ONLINE       0     0     0
              gptid/e1b21671-70b6-11e6-b8eb-1c98ec0f2cd4  DEGRADED     0     0 1.29K  too many errors
              gptid/e2841c02-70b6-11e6-b8eb-1c98ec0f2cd4  ONLINE       0     0     0
              gptid/e3717f0c-70b6-11e6-b8eb-1c98ec0f2cd4  ONLINE       0     0     0
    
        errors: No known data errors
    

  • smartctl -a:
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed: read failure       90%     39365         172825824
    # 2  Extended offline    Completed: read failure       90%     39365         172825825
    # 3  Short offline       Completed without error       00%     39364         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    
  • The SMART data is going to be critical here, and running a scrub was probably a good first instinct.
    – Karu
    Commented Jan 1, 2021 at 20:53
  • I have 4x3TB disks, used approximately at 60%. Which SMART test should I run? Long Self-Test, Short Self-Test, Conveyance Self-Test, Offline Immediate Test?
    – fharreau
    Commented Jan 1, 2021 at 21:00
  • There's probably no need for the actual tests, but you should update this question with what the current SMART data shows for the drive. I'm particularly interested in whether it shows any pending/reallocated sectors.
    – Karu
    Commented Jan 1, 2021 at 21:02
  • What do you mean by "smart data"?
    – fharreau
    Commented Jan 1, 2021 at 21:08
  • You should be able to go into a shell and run smartctl -a against the device; I don't think this is exposed in the FreeNAS GUI anywhere. This will show the drive's own health-monitoring statistics (see the example commands after these comments).
    – Karu
    Commented Jan 1, 2021 at 21:10
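
For reference, a minimal way to pull that SMART data from the FreeNAS shell, assuming the suspect disk maps to /dev/ada1 (the device name is a placeholder - match the gptid shown in zpool status to the real device node first):

    # map the gptid shown in `zpool status` to a device node
    glabel status | grep e1b21671

    # dump the drive's full SMART data (attributes, error log, self-test log)
    smartctl -a /dev/ada1

    # optionally queue a long self-test; check `smartctl -a` again once it finishes
    smartctl -t long /dev/ada1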

1 Answer


As the output from smartctl -a shows, the drive is reporting read errors from its own onboard testing. That rules out your RAID controller or a software problem as the cause.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     39365         172825824
# 2  Extended offline    Completed: read failure       90%     39365         172825825

This is very bad. Source a new drive and replace it ASAP. The error probably only appeared to be transient because it is confined to roughly the same physical location on the disk - FreeNAS/ZFS likely didn't touch that exact spot again until you ran a scrub and forced it to read the whole volume, which is why the drive came back online after the reboot.
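
As a rough sketch of the replacement from the shell, assuming the new disk shows up as /dev/ada4 (a placeholder; the pool name and gptid are taken from the zpool status output above). On FreeNAS the GUI's Volume Status > Replace is usually preferred because it also recreates the swap partition and GPT layout, so treat this only as an illustration of the underlying ZFS operations:

    # take the failing member offline (gptid copied from zpool status)
    zpool offline storage gptid/e1b21671-70b6-11e6-b8eb-1c98ec0f2cd4

    # after physically swapping the disk, resilver onto the new device
    # (/dev/ada4 is a placeholder for whatever node the new disk gets)
    zpool replace storage gptid/e1b21671-70b6-11e6-b8eb-1c98ec0f2cd4 /dev/ada4

    # watch the resilver progress
    zpool status storage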

  • Just an FYI, there is no RAID controller with ZFS, as it's software RAID.
    – JW0914
    Commented Jan 4, 2021 at 12:30
  • @JW0914 There is a disk controller, though, whether or not it's a RAID controller. And there very well could be a RAID controller in use here. Commented Jan 7, 2021 at 19:46
  • There is no hardware RAID controller involved when using ZFS... you don't run ZFS on top of hardware RAID - it's one or the other (there are reasons for this: it's pointless, and it degrades performance; for a complete explanation, see the TrueNAS forums). A disk controller is not the same as a hardware RAID controller... they're two completely different things.
    – JW0914
    Commented Jan 8, 2021 at 11:58
