
I have two separate systems experiencing what appears to be the same issue:

  1. "Desktop" - i7-7700K, ASUS Prime Z270-A, Ubuntu 22.04. Kernel 5.15.0-94-generic.
  2. "Server" - NUC8i3BEH, Ubuntu 18.04. Kernel 4.15.0-213-generic.

These units were originally running 250GB Samsung 970 EVO Plus NVMe drives as their system drives for over four years, with no issues encountered. Server has accumulated multiple months of continuous uptime in the past.

In December 2023, owing to capacity problems, both machines were upgraded to 4TB Samsung 990 PRO with Heatsink drives (model MZ-V9P4T0GW). Both drives have a production date of 02/11/2023.

I used dd (edit: dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=128M status=progress) to clone the old 250GB drives directly onto the 4TB drives, with neither drive mounted, then expanded the system partitions on the new drives to fill the extra space.
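For clarity, the process in each case was roughly along these lines, run from a live USB with neither drive mounted (the partition number and the exact resize tool below are illustrative, as I don't recall them precisely):

# clone the old 250GB drive onto the new 4TB drive
dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=128M status=progress
# grow the system partition to fill the new drive, then resize the filesystem
growpart /dev/nvme1n1 2          # or the equivalent resize in gparted/parted
e2fsck -f /dev/nvme1n1p2         # resize2fs wants a clean check first when run offline
resize2fs /dev/nvme1n1p2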

The 4TB drives went into their respective systems, which booted normally. However, since then, both machines have regularly been dropping into a read-only filesystem state; the interval varies, but it is typically after 2-3 days of uptime. After a reboot, each machine functions normally again until the next occurrence.


Both drives are running the latest Samsung firmware and report no SMART errors — smartctl output from Desktop this morning:
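(Gathered with something along the lines of sudo smartctl -a /dev/nvme0n1; the exact device node is from memory.)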

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-94-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Serial Number:                      S7HRNJ0WB00087A
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            561,545,682,944 [561 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4b31404e5a
Local Time is:                      Fri Feb  9 10:45:39 2024 GMT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    64,414,898 [32.9 TB]
Data Units Written:                 4,803,602 [2.45 TB]
Host Read Commands:                 497,378,636
Host Write Commands:                49,098,349
Controller Busy Time:               965
Power Cycles:                       286
Power On Hours:                     425
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               31 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

An extended SMART test and Full LBA Scan in Samsung Magician with the drive installed in a Windows system reported no errors.

The filesystem appears to be going read-only before any explanatory system logs are written, so I've so far been unable to determine what prompts the issue.
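One thing I haven't tried yet, which I understand should survive the crash even when nothing makes it into the logs, is reading the error record that ext4 keeps in the superblock, plus grabbing the kernel ring buffer before rebooting. Roughly (partition name illustrative):

# error count/time/function recorded in the superblock when the kernel flags errors
sudo tune2fs -l /dev/nvme0n1p2 | grep -i error
# whatever is still in the ring buffer at the console, before rebooting
# (viewed on screen or copied to a USB stick, since the root filesystem is read-only)
sudo dmesg | grep -iE 'ext4|nvme'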

Last week I booted into a live environment and ran e2fsck -fv against each drive's system filesystem. This found and fixed "inode extent tree could be narrower" errors. The issue has continued to recur since then.
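(Specifically, with the filesystems unmounted in the live environment, something like sudo e2fsck -fv /dev/nvme0n1p2 on each machine; the partition name is from memory.)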

This morning I experienced the problem again on Desktop. I then booted a live USB, ran e2fsck -fv again, and found that it reported and fixed more "inode extent tree could be narrower" issues:

[screenshot: e2fsck -fv output showing the fixed inode extent tree errors]

Afterwards, I rebooted back into the system and it functioned normally; however, when I returned to the machine about 10 minutes later, having stepped away, it had failed again (the shortest uptime yet). The console was filled with these errors:

__ext4_find_entry:1682: inode #2 (<process name>): reading lblock 0

I returned to the live USB environment, ran e2fsck -fv again, and found that this time it reported no errors and made no changes:

[screenshot: e2fsck -fv output reporting no errors]

I rebooted and the system was operating normally (and has subsequently stayed up for the remainder of the day). Meanwhile, Server has accumulated just over 2 days of uptime since it last experienced the problem.


I'm not sure how to proceed at this point. Could I ask for suggestions on where to take this investigation next? My thoughts so far:

  1. I first thought there might be a physical hardware issue. However, every test so far reports the drives as healthy, and it is now feeling more like a filesystem issue.
  2. It is notable, I think, that everything about these two systems is different, except that they use the same model of drive and have been through the same migration process.
  3. I wonder whether the dd clone itself caused corruption, although no errors were reported at the time. I'm not (yet) sure how to investigate or resolve this; the little I've come up with is sketched after this list.
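For point 3, the checks I've come up with so far are below (device name is illustrative; comparing checksums against the original 250GB drives no longer seems meaningful, since the clones have been resized and written to since the copy):

# verify the GPT on the cloned 4TB drive (only relevant if the disks are GPT-partitioned);
# a straight byte copy from a smaller disk leaves the backup GPT header at the old 250GB
# position until something like "sgdisk -e" relocates it, so the table seems worth checking
sudo sgdisk -v /dev/nvme0n1

# a full read pass over the new drive, independent of the filesystem
sudo dd if=/dev/nvme0n1 of=/dev/null bs=128M status=progress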
  • Considering it's trivial to back up: would backing up again, doing a fresh install, moving over the required files (this process works for me), and seeing whether it happens again be an option? That would rule out the third point, or maybe the second. Since both systems have the same issue, try the less critical one first.
    – Journeyman Geek
    Commented Feb 9 at 11:12
  • "I used dd to clone the old 250GB drives directly onto the 4TB drives" - did you clone the system from itself, so that the source disk was running a live filesystem? If so, unfortunately this shows exactly why that's such a bad idea. Never ever clone a live filesystem Commented Feb 9 at 12:42
  • @ChrisDavies no, the source disks weren't live during the clone; source (250GB) and destination (4TB) drives were installed unmounted in a system booted from live USB. dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=128M status=progress is the command that was used in each case.
    – ilmiont
    Commented Feb 9 at 22:24
  • Reinstalling and restoring from backup is an option. However, then I won't know what has actually happened. I would like to fully investigate, but I don't feel as though I have enough knowledge of filesystems to put together a plan myself. I've put the drive from my Desktop machine into a Windows system tonight and completed a full block scan + extended SMART test in Samsung Magician - no errors found; all signs suggest the hardware is OK and something is wrong with the filesystem.
    – ilmiont
    Commented Feb 9 at 22:28
  • (I have tidied up this question that was originally hastily written in working hours — i.e. the title is now phrased as a question (!), some sections have been clarified, and redundant asides have been removed.)
    – ilmiont
    Commented Feb 10 at 10:23
