The Error Information Log Entries value shown by smartctl -a /dev/nvme0n1 for my NVMe drive is growing fast, by about 1 per second. Is this indicative of a faulty driver?

At the same time, Media and Data Integrity Errors is currently showing a value of 0.

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SKC3000D4096G
Serial Number:                      xxxxx
Firmware Version:                   EIFK31.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 4,096,805,658,624 [4.09 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 282b2ba6c5
Local Time is:                      Fri Mar 24 01:33:14 2023 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0620W       -        -    4  4  4  4     2500    7500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        55 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    213,006,510 [109 TB]
Data Units Written:                 549,370,112 [281 TB]
Host Read Commands:                 11,210,192,197
Host Write Commands:                20,687,602,229
Controller Busy Time:               14,055
Power Cycles:                       39
Power On Hours:                     4,204
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,479,242
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               75 Celsius
Thermal Temp. 1 Total Time:         58745

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0    1479242     0  0x2015  0x4004 0x102c            0     0     -
  1    1479241     0  0x2014  0x4004 0x102c            0     0     -
  2    1479240     0  0xd010  0x4004 0x102c            0     0     -
  3    1479239     0  0xc013  0x4004 0x102c            0     0     -
  4    1479238     0  0xb011  0x4004 0x102c            0     0     -
  5    1479237     0  0x8009  0x4004 0x102c            0     0     -
  6    1479236     0  0x0015  0x4004 0x102c            0     0     -
  7    1479235     0  0x0014  0x4004 0x102c            0     0     -
  8    1479234     0  0xa011  0x4004 0x102c            0     0     -
  9    1479233     0  0xa010  0x4004 0x102c            0     0     -
 10    1479232     0  0x9012  0x4004 0x102c            0     0     -
 11    1479231     0  0x9011  0x4004 0x102c            0     0     -
 12    1479230     0  0x6000  0x4004 0x102c            0     0     -
 13    1479229     0  0x5003  0x4004 0x102c            0     0     -
 14    1479228     0  0x4001  0x4004 0x102c            0     0     -
 15    1479227     0  0x4000  0x4004 0x102c            0     0     -
... (47 entries not read)

I uploaded the output of nvme error-log /dev/nvme0n1 too: https://pastebin.com/SQJM7KhV
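The Status and PELoc columns in the smartctl error table above can be unpacked into their bit fields. The following is a minimal sketch, assuming the 16-bit layout I understand the NVMe specification to use for Error Information log entries (bit 0 phase tag, bits 8:1 status code, bits 11:9 status code type, bits 13:12 command retry delay, bit 14 "more", bit 15 "do not retry"; for PELoc, bits 7:0 byte offset and bits 10:8 bit offset). The function names and the small lookup table are hypothetical helpers, not part of smartctl or nvme-cli.

```python
# Assumed names for a few generic (status code type 0) status codes.
GENERIC_STATUS_NAMES = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
}


def decode_status(raw: int) -> dict:
    """Split a raw 16-bit status field into its assumed bit fields."""
    return {
        "phase_tag": raw & 0x1,
        "status_code": (raw >> 1) & 0xFF,
        "status_code_type": (raw >> 9) & 0x7,
        "command_retry_delay": (raw >> 12) & 0x3,
        "more": bool((raw >> 14) & 0x1),
        "do_not_retry": bool((raw >> 15) & 0x1),
    }


def decode_peloc(raw: int) -> dict:
    """Split a Parameter Error Location value into byte and bit offsets."""
    return {"byte": raw & 0xFF, "bit": (raw >> 8) & 0x7}


status = decode_status(0x4004)   # Status column from the table
peloc = decode_peloc(0x102C)     # PELoc column from the table

print(status)
print(GENERIC_STATUS_NAMES.get(status["status_code"], "unknown"))
print(peloc)
```

Under these assumptions, 0x4004 decodes to generic status code 0x02 ("Invalid Field in Command") with the "more" bit set, and PELoc points at byte 44 of the submitted command, which would fall in command dword 11 if the submission entry is laid out as sixteen 4-byte dwords. The extra 0x1000 bit in PELoc lies outside the two fields decoded here and is ignored by this sketch.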

  • Do errors 'stabilize' once you stop sending SMART queries? Commented Mar 24, 2023 at 12:05
  • No, they keep growing at a rate of ~1 per second.
    – Gotenks
    Commented Mar 24, 2023 at 14:39
  • Planned obsolescence? ~31 M err/yr ;-) Commented Mar 24, 2023 at 15:55
  • No idea, I can't find anything either. If I had to guess, it concerns a command not related to 'transport', so no data goes from one place to another. LBA and namespace both being zero could mean it's not trying to read or write anything. It sounds as if something is issuing a command the drive does not recognize. Commented Mar 25, 2023 at 0:56
  • I emailed Kingston support. We'll see what they say.
    – Gotenks
    Commented Mar 25, 2023 at 8:53

1 Answer

In my case, it was caused by Node Exporter (Prometheus).

After stopping the process, the Error Information Log Entries counter stopped increasing. It is probably making queries that are not supported by the NVMe driver (I will have to dig deeper).

UPDATE: I edited the hwmon collector code to exclude the faulty sensor: https://github.com/prometheus/node_exporter/issues/2643
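To check whether stopping a given process really halts the counter, one can sample Error Information Log Entries twice and compare. This is a minimal sketch, assuming smartctl prints the counter in the same format as in the question (a label followed by a number with comma thousands separators); the function names are hypothetical, and `read_smartctl` needs root and a smartctl binary on the PATH.

```python
import re
import subprocess


def error_log_entries(smartctl_text: str) -> int:
    """Extract the Error Information Log Entries counter from smartctl -a output."""
    match = re.search(
        r"^Error Information Log Entries:\s+([\d,]+)",
        smartctl_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("counter not found in smartctl output")
    return int(match.group(1).replace(",", ""))


def growth_per_second(first: int, second: int, interval_s: float) -> float:
    """Average counter growth between two samples taken interval_s seconds apart."""
    return (second - first) / interval_s


def read_smartctl(device: str) -> str:
    """Run `smartctl -a DEVICE` and return its text output (not called below)."""
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True
    )
    return result.stdout


# Demonstration on the line from the question, plus a made-up second
# sample taken 60 seconds later (the second value is illustrative only).
before = error_log_entries("Error Information Log Entries:      1,479,242\n")
after = error_log_entries("Error Information Log Entries:      1,479,302\n")
rate = growth_per_second(before, after, 60.0)
print(rate)                      # errors per second
print(rate * 86400 * 365)        # extrapolated errors per year
```

A rate of 1 per second extrapolates to 31,536,000 entries per year, which matches the "~31 M err/yr" estimate in the comments. A rate of 0 after stopping the suspected process would point at that process as the source.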

  • Oh nice you solved it! Commented Mar 26, 2023 at 15:58
