
I have a home server running Ubuntu Server that I use for Home Assistant. Everything was fine until one day the server stopped responding. I plugged in a monitor and saw that it required a manual filesystem check on the root partition (which uses LVM). I ran fsck and everything was fine again. Today the system doesn't boot at all, and it asks for a manual fsck on every boot. My first thought was a bad SSD, so I put it in my main PC and ran smartctl. Here is the output:

=== START OF INFORMATION SECTION ===
Device Model:     SPCC Solid State Disk
Serial Number:    AA230111S3051234838
LU WWN Device Id: 0 000000 000000000
Firmware Version: HPS1104J
User Capacity:    512.110.190.592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May 27 17:31:09 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   4) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       8
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       9483
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       113
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       202
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       226541
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1640
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       154
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       858193939
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       90814572
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       28417544
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       754974721
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       67
194 Temperature_Celsius     0x0032   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       20922
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       11
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   050    Old_age   Always       -       138948
242 Total_LBAs_Read         0x0032   100   100   050    Old_age   Always       -       93039
249 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       4562399

SMART Error Log Version: 0
No Errors Logged

dmesg also shows multiple I/O errors:

[  528.586830] I/O error, dev sdd, sector 6400000 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  528.586859] I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[  528.586882] I/O error, dev sdd, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[  528.586900] I/O error, dev sdd, sector 2203648 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[  528.586929] I/O error, dev sdd, sector 6397952 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[  528.587118] device offline error, dev sdd, sector 6400000 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  528.587143] device offline error, dev sdd, sector 6400000 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  528.587147] Buffer I/O error on dev dm-2, logical block 0, async page read
[  528.587161] device offline error, dev sdd, sector 6400000 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  528.587164] Buffer I/O error on dev dm-2, logical block 0, async page read
[  528.587371] device offline error, dev sdd, sector 216115200 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  528.587382] device offline error, dev sdd, sector 216115200 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  528.587385] Buffer I/O error on dev dm-3, logical block 0, async page read
[  528.587395] Buffer I/O error on dev dm-3, logical block 0, async page read
[  590.357934] Buffer I/O error on dev dm-2, logical block 0, async page read
[  590.357940] Buffer I/O error on dev dm-2, logical block 0, async page read
[  590.358065] Buffer I/O error on dev dm-3, logical block 0, async page read
[  590.358069] Buffer I/O error on dev dm-3, logical block 0, async page read

So the drive looks bad to me. Is it indeed failing? Do you have any ideas or suggestions?

  • You are asking for opinions :) My advice: keep going, but always prepare for the worst and have a spare at hand. If it is a production server that others depend on, you need a fallback server if you want 99.999+% uptime.
    – Rinzwind
    Commented May 27 at 15:59
  • When you ran smartctl, did you check sda or sdd?
    – user10489
    Commented May 27 at 16:46
  • @user10489 I checked sdd.
    Commented May 27 at 18:15
  • In my experience these things go downhill pretty quickly. Maybe it starts with 1 bad sector, then you get another one a month later, then suddenly another 4, and then another 100. If you care about the data on the drive, and considering that performance will most likely start degrading anyway, I advise you to just replace it. If it still performs OK, you can try to squeeze the remaining life out of it by keeping a very frequently updated backup (and I mean at least daily), but even then there's a chance it will become unreadable overnight.
    – kos
    Commented May 27 at 18:31
  • The typical failure mode for SSDs is to become read-only when they run out of replacement blocks. But they can also become unreadable suddenly.
    – user10489
    Commented May 27 at 19:10

1 Answer


If the SSD contains critical, important or irreplaceable data, you should replace it as soon as possible.

Actually, this SSD should have been replaced a long, long time ago. Here are some important S.M.A.R.T. values you should know about:

Reallocated_Sector_Ct

Your disk has a few bad sectors that have already been reallocated (8, to be precise), as shown by Reallocated_Sector_Ct:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
5   Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       8
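
For anyone who wants to read these rows programmatically, here is a minimal Python sketch that splits one attribute row into named fields. It assumes the ten-column layout shown in the header above; the names SmartAttribute and parse_attribute_row are made up for this illustration and are not part of smartmontools. The raw value is kept as a string because some drives append extra text to it.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    """One row of the smartctl attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    flag: str
    value: int
    worst: int
    thresh: int
    attr_type: str
    updated: str
    when_failed: str
    raw_value: str


def parse_attribute_row(line: str) -> SmartAttribute:
    """Split a row into its ten columns; everything after column 9 is the raw value."""
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"not a SMART attribute row: {line!r}")
    return SmartAttribute(
        attr_id=int(parts[0]),
        name=parts[1],
        flag=parts[2],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        attr_type=parts[6],
        updated=parts[7],
        when_failed=parts[8],
        raw_value=parts[9].strip(),
    )


row = "  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       8"
attr = parse_attribute_row(row)
print(attr.name, attr.raw_value)  # prints "Reallocated_Sector_Ct 8"
```

Note that the normalized VALUE (100) is still far above THRESH (50) here, so only the raw value hints at a problem.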

Current_Pending_Sector

You also have 11 sectors waiting to be reallocated (Current_Pending_Sector). These are sectors that the drive has flagged as potentially problematic.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       11

When the drive encounters a sector that it cannot read, it marks it as "pending". The drive will attempt to read these sectors again in the future and:

  • If a future read attempt is successful, the sector is removed from the pending list, and the count of pending sectors decreases.

  • If a future read attempt fails, the sector is reallocated to a spare sector, and the Reallocated_Sector_Ct attribute increases, while the Current_Pending_Sector count decreases.
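
The bookkeeping described in the two points above can be sketched as a toy Python model. This is only an illustration of how the two counters move relative to each other, not how any real firmware works; the class and method names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class SectorCounters:
    """Toy model of the two SMART counters discussed above (hypothetical)."""
    current_pending: int = 0
    reallocated: int = 0

    def mark_unreadable(self) -> None:
        # A read failed: the sector joins the pending list.
        self.current_pending += 1

    def retry(self, succeeded: bool) -> None:
        # A later attempt on one pending sector either clears it or remaps it.
        if self.current_pending == 0:
            raise ValueError("no pending sectors to retry")
        self.current_pending -= 1
        if not succeeded:
            self.reallocated += 1


counters = SectorCounters()
for _ in range(3):
    counters.mark_unreadable()
counters.retry(succeeded=True)   # sector turned out fine, just leaves the list
counters.retry(succeeded=False)  # sector is remapped to a spare
print(counters)  # prints "SectorCounters(current_pending=1, reallocated=1)"
```

In this model a pending count that stays above zero, like the 11 reported here, means the drive still has sectors it could neither recover nor remap.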

Reallocated_Event_Count

Your drive has logged 20922 reallocation events. Reallocated_Event_Count shows the total count of reallocation events. Each event corresponds to a problematic sector being moved (reallocated) to a reserved spare area on the drive.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       20922

Available_Reservd_Space

You are out of spare sectors: Available_Reservd_Space shows 0. This is the amount of spare space on the SSD that is available for use when bad sectors are detected. It is used to replace sectors that have failed or are in the process of failing.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       0
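
If you want to check these four attributes again later (for example on the replacement drive), a rough Python sketch like the one below can scan the attribute table, such as the one printed by smartctl -A, and flag them. The rules are assumptions that simply mirror the reading given in this answer: a non-zero raw value for IDs 5, 196 and 197 is flagged, and a raw value of 0 for ID 232 is read as "no spare space left". Since this drive is not in the smartctl database, the raw values may be vendor-specific, so treat the result as a hint rather than a verdict. The function name find_warnings is made up for the example.

```python
from typing import List

# Attribute IDs where any non-zero raw value is treated as a warning sign
# (an assumption that follows the interpretation used in this answer).
NONZERO_IS_BAD = {5, 196, 197}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       8
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       20922
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       11
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       0
"""


def find_warnings(smartctl_text: str) -> List[str]:
    """Return one message per attribute that looks suspicious under the rules above."""
    warnings = []
    for line in smartctl_text.splitlines():
        parts = line.split()
        # Skip the header and anything that is not a ten-column attribute row.
        if len(parts) < 10 or not parts[0].isdigit() or not parts[9].isdigit():
            continue
        attr_id, name, raw_value = int(parts[0]), parts[1], int(parts[9])
        if attr_id in NONZERO_IS_BAD and raw_value > 0:
            warnings.append(f"{name}: raw value {raw_value} is non-zero")
        elif attr_id == 232 and raw_value == 0:
            warnings.append(f"{name}: raw value 0, read here as no spare space left")
    return warnings


for message in find_warnings(SAMPLE):
    print(message)
```

With the four rows quoted in this answer, every one of them is flagged.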
  • I ran some tests and it has already failed. I guess I'll contact the manufacturer, since it's still within its warranty period. To be fair, this has never happened to me before.
    Commented May 27 at 18:17
