I have two separate systems experiencing what I believe is the same issue:

1. **My personal machine** (henceforth Desktop): i7-7700K, ASUS Prime Z270-A, Ubuntu 22.04, kernel 5.15.0-94-generic.
2. **A server machine** (henceforth Server): NUC8i3BEH, Ubuntu 18.04, kernel 4.15.0-213-generic.

Both machines originally ran 250GB Samsung 970 EVO Plus NVMe drives as their system drives for over four years with no issues; Server has accumulated multiple months of uptime between reboots in the past.

In December 2023, owing to capacity problems, both machines were upgraded to **4TB Samsung 990 PRO with Heatsink** drives (model MZ-V9P4T0GW). Both drives have a production date of 02/11/2023. I used `dd` (edit: `dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=128M status=progress`) to clone the old 250GB drives directly onto the 4TB drives (the drives were not mounted), then expanded the system partition to fill the extra space. The 4TB drives went into their respective systems, which booted normally.

However, since then, both machines have been regularly dropping into a read-only filesystem. The interval varies, but it is typically after 2-3 days of uptime. After a reboot, each machine functions normally again until the issue next occurs.

---

Both drives are running the latest Samsung firmware and report no SMART errors. The filesystem is ext4.
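One check I am considering, in case the `dd` step itself is suspect: re-verify a clone by hashing only the region covered by the source drive (the target is larger, so a whole-device compare would not match). This is just a sketch; `verify_clone` is a helper name I have made up, and the device names are the ones from my `dd` command above.

```shell
# verify_clone SRC DST: hash the first <size-of-SRC> bytes of each
# device (or file) and report whether the clone is byte-identical.
verify_clone() {
  src=$1; dst=$2
  # blockdev works on block devices; fall back to stat for plain files
  bytes=$(blockdev --getsize64 "$src" 2>/dev/null || stat -c %s "$src")
  a=$(head -c "$bytes" "$src" | sha256sum | awk '{print $1}')
  b=$(head -c "$bytes" "$dst" | sha256sum | awk '{print $1}')
  if [ "$a" = "$b" ]; then echo MATCH; else echo MISMATCH; fi
}

# Example (my device names; run from a live environment, nothing mounted):
# verify_clone /dev/nvme0n1 /dev/nvme1n1
```

Of course, a mismatch now would not distinguish a bad clone from corruption that has accumulated since, so this is only useful as a sanity check on a freshly repeated clone.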
`smartctl` output from Desktop this morning:

```
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-94-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Serial Number:                      S7HRNJ0WB00087A
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            561,545,682,944 [561 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4b31404e5a
Local Time is:                      Fri Feb 9 10:45:39 2024 GMT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    64,414,898 [32.9 TB]
Data Units Written:                 4,803,602 [2.45 TB]
Host Read Commands:                 497,378,636
Host Write Commands:                49,098,349
Controller Busy Time:               965
Power Cycles:                       286
Power On Hours:                     425
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               31 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
```

The filesystem appears to be going read-only before any logs are written, so I have so far been unable to determine the precise cause of the issue.

Last week I booted into a live environment and ran `e2fsck -fv` on each drive. This found and fixed `inode extent tree could be narrower` issues. The issue has continued to recur since then.

This morning I experienced the problem again on Desktop. I booted a live USB again, ran `e2fsck -fv` again, and it again reported inode extent tree issues:

[![enter image description here][1]][1]

I booted back into Linux, which functioned normally; however, 10 minutes later, after returning to the PC, it had died again (the shortest uptime yet).
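Since the filesystem flips read-only before anything lands on disk, my next idea is to enable persistent journald storage (`sudo mkdir -p /var/log/journal`, then restart journald), wait for the next occurrence, and pull the *previous* boot's kernel log for storage-related errors, assuming at least some messages are persisted before the remount. A sketch of the filtering step (the `filter_storage_errors` helper name is my own):

```shell
# Filter a kernel log stream for ext4/NVMe-related error lines.
# Intended usage after the next failure and reboot:
#   journalctl -k -b -1 --no-pager | filter_storage_errors
# (requires persistent journald storage, e.g. /var/log/journal exists)
filter_storage_errors() {
  grep -iE 'ext4|nvme|i/o error|read-only|blk_update_request'
}
```

If journald has nothing, `netconsole` (streaming kernel messages to another machine over UDP) would be the fallback, since it does not depend on the local disk at all.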
The console was filled with these errors:

```
__ext4_find_entry:1682: inode #2 (<process name>): reading lblock 0
```

I returned to the live USB environment and ran `e2fsck -v` again; this time it reported **no** errors and changed nothing:

[![enter image description here][2]][2]

I rebooted back into Linux, which I am now using to write this. Meanwhile, Server has accumulated just over 2 days of uptime since it last experienced the problem.

---

I'm not sure how to proceed at this point. Can I please ask for suggestions on where to take this investigation next?

Thoughts and speculations:

1. I am concerned there is a physical hardware issue. However, the *drives* report as healthy, and this looks to me like a filesystem issue.
2. It seems notable that everything about these two systems is different, except that they use the same model of drive and went through the same migration process.
3. I wonder whether the `dd` clone caused corruption, although no errors were reported at the time. I don't know how to investigate or rule this out.

[1]: https://i.sstatic.net/TM2hF.jpg
[2]: https://i.sstatic.net/LKIWe.jpg
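Regarding point 2: since the one common factor is the drive model, one experiment I am considering is disabling NVMe APST (autonomous power-state transitions), which is a known trigger for controller dropouts on some drives. This is a hypothetical mitigation, not something I have confirmed applies to the 990 PRO; the `add_nvme_param` helper below is my own sketch for adding the kernel parameter to GRUB.

```shell
# Sketch: append nvme_core.default_ps_max_latency_us=0 (disables APST)
# to the GRUB_CMDLINE_LINUX_DEFAULT line of the given file
# (normally /etc/default/grub).
add_nvme_param() {
  sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 nvme_core.default_ps_max_latency_us=0"/' "$1"
}

# After editing the real file: sudo update-grub, reboot, then confirm with:
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

If the read-only drops stop with APST disabled on one machine but continue on the other, that would at least separate a power-management cause from a filesystem/clone cause.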