Looking for smartctl NVMe log error explanation (0xa013 0x8004 and 0x9016 0x8004)

Question

For my 1TB NVMe drive, smartctl -a /dev/nvme0n1 first reported this:

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        129     0  0xa013  0x8004  0x000            0     0     -

Then I ran a long self-test, and the error entry changed to:

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        130     0  0x9016  0x8004  0x000            0     0     -

I have no idea what this means, or whether it's even valid in the first place, because nvme self-test-log /dev/nvme0 looks fine:

Device Self Test Log for NVME device:nvme0
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x16
  Vendor Specific              : 0 0
Self Test Result[1]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x16
  Vendor Specific              : 0 0
Self Test Result[2]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x8
  Vendor Specific              : 0 0
Self Test Result[3]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x7
  Vendor Specific              : 0 0
Self Test Result[4]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x7
  Vendor Specific              : 0 0
Self Test Result[5]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x6
  Vendor Specific              : 0 0
Self Test Result[6]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x5
  Vendor Specific              : 0 0
Self Test Result[7]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x5
  Vendor Specific              : 0 0
Self Test Result[8]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x4
  Vendor Specific              : 0 0
Self Test Result[9]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x3
  Vendor Specific              : 0 0
Self Test Result[10]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x3
  Vendor Specific              : 0 0
Self Test Result[11]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x2
  Vendor Specific              : 0 0
Self Test Result[12]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x1
  Vendor Specific              : 0 0
Self Test Result[13]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x1
  Vendor Specific              : 0 0
Self Test Result[14]:
  Operation Result             : 0xf
Self Test Result[15]:
  Operation Result             : 0xf
Self Test Result[16]:
  Operation Result             : 0xf
Self Test Result[17]:
  Operation Result             : 0xf
Self Test Result[18]:
  Operation Result             : 0xf
Self Test Result[19]:
  Operation Result             : 0xf

Any NVMe/S.M.A.R.T. professionals here? Google is dead silent (just two results with no explanations).

Using smartctl 7.3 2022-02-28 r5338 and nvme-2.4.

Joep van Steen · Accepted Answer · 2023-10-15 22:21:46Z

I haven't figured this out completely yet. The information is hard to find and hard to interpret (for my limited brain at least), but let's start from this:

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        129     0  0xa013  0x8004  0x000            0     0     -

At this point I am interested in the status code. I found this: https://github.com/linux-nvme/nvme-cli/issues/800, and I am looking at the answer from "birkelund" (NVMe Software Engineer). Someone asks about status code 0xc502, and he explains that you decode it like so:

If you are asking how that error code is encoded in 0xC502, then it's 0xC502 >> 1 to get rid of the Phase Tag. That leaves us with 0x6281. Then apply a mask of 0x7ff to extract the lower 11 bits (3 for the Status Code Type and 8 for the Status Code), ending up with 0x281. 0x2xx are "Media and Data Integrity Errors" and the 0x81 status code is "Unrecovered Read Error".

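In more human language, the one-bit right shift is an integer division by 2 (0xC502 / 2 = 0x6281). Applying the mask 0x7ff keeps the lower 11 bits, which here amounts to keeping the three right-most hex digits (0x6281 -> 0x281). Strictly speaking, the mask also clears the top bit of the third digit, so it is slightly less than three full nibbles. I think this makes decoding a tad easier.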
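The shift-and-mask steps above can be sketched in a few lines of Python. This is only an illustration, assuming the Status value carries the phase tag in bit 0 as the quoted explanation describes; the function name split_status is made up for this example.

```python
def split_status(raw):
    """Split a raw Status value into (status code type, status code)."""
    no_phase = raw >> 1        # drop the phase tag in bit 0
    field = no_phase & 0x7FF   # keep the lower 11 bits
    sct = (field >> 8) & 0x7   # upper 3 of those bits: Status Code Type
    sc = field & 0xFF          # lower 8 bits: Status Code
    return sct, sc


sct, sc = split_status(0xC502)
print(f"SCT={sct:#x} SC={sc:#04x}")  # SCT=0x2 SC=0x81
```

The two returned numbers are what you then look up in the Status Code Type and Status Code tables.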

To see how he gets Status Code Type and Status Code I did some more research:

Lookup for code types:

NVME_SCT_GENERIC        = 0x0,
NVME_SCT_COMMAND_SPECIFIC   = 0x1,
NVME_SCT_MEDIA_ERROR        = 0x2,
/* 0x3-0x6 - reserved */
NVME_SCT_VENDOR_SPECIFIC    = 0x7,

Lookup for "Media and Data Integrity Errors":

NVME_SC_WRITE_FAULTS            = 0x80,
NVME_SC_UNRECOVERED_READ_ERROR      = 0x81,
NVME_SC_GUARD_CHECK_ERROR       = 0x82,
NVME_SC_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_SC_REFERENCE_TAG_CHECK_ERROR   = 0x84,
NVME_SC_COMPARE_FAILURE         = 0x85,
NVME_SC_ACCESS_DENIED           = 0x86,

So we can now see where he gets:

0x2xx are "Media and Data Integrity Errors" and the 0x81 status code is "Unrecovered Read Error"

If I apply the same method to status 0x8004, I get the following:

Shift right: shifting 0x8004 one bit to the right (0x8004 >> 1) gives 0x4002. Mask: applying 0x7ff extracts the lower 11 bits of 0x4002, yielding 0x002.

So the Status Code Type is 0, which is NVME_SCT_GENERIC (see the table above), and the Status Code is looked up in the Generic Status Codes:

NVME_SC_SUCCESS             = 0x00,
NVME_SC_INVALID_OPCODE          = 0x01,
NVME_SC_INVALID_FIELD           = 0x02,
NVME_SC_COMMAND_ID_CONFLICT     = 0x03,
NVME_SC_DATA_TRANSFER_ERROR     = 0x04,
NVME_SC_ABORTED_POWER_LOSS      = 0x05,
NVME_SC_INTERNAL_DEVICE_ERROR       = 0x06,
NVME_SC_ABORTED_BY_REQUEST      = 0x07,
NVME_SC_ABORTED_SQ_DELETION     = 0x08,
NVME_SC_ABORTED_FAILED_FUSED        = 0x09,
NVME_SC_ABORTED_MISSING_FUSED       = 0x0a,
NVME_SC_INVALID_NAMESPACE_OR_FORMAT = 0x0b,
NVME_SC_COMMAND_SEQUENCE_ERROR      = 0x0c,

NVME_SC_LBA_OUT_OF_RANGE        = 0x80,
NVME_SC_CAPACITY_EXCEEDED       = 0x81,
NVME_SC_NAMESPACE_NOT_READY     = 0x82,

So now we have Status Code Type 0 (NVME_SCT_GENERIC = 0x0) and Status Code 0x02 (NVME_SC_INVALID_FIELD = 0x02). I have no idea what it means in practice, but to me it does not sound like an issue with the NVMe drive itself. If SMART is otherwise 'clean', I think you don't have much to worry about.
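For anyone who wants to repeat the lookup, here is a small Python sketch of the same decoding. It assumes the phase tag sits in bit 0 of the Status value, as in the nvme-cli explanation quoted above. The dictionaries hold only a subset of the names from the tables in this answer, and the function name describe is made up for illustration.

```python
# Subset of the Status Code Type names listed above.
SCT_NAMES = {
    0x0: "NVME_SCT_GENERIC",
    0x1: "NVME_SCT_COMMAND_SPECIFIC",
    0x2: "NVME_SCT_MEDIA_ERROR",
    0x7: "NVME_SCT_VENDOR_SPECIFIC",
}

# Subset of the Status Code names listed above, keyed by Status Code Type.
SC_NAMES = {
    0x0: {
        0x00: "NVME_SC_SUCCESS",
        0x01: "NVME_SC_INVALID_OPCODE",
        0x02: "NVME_SC_INVALID_FIELD",
        0x04: "NVME_SC_DATA_TRANSFER_ERROR",
    },
    0x2: {
        0x81: "NVME_SC_UNRECOVERED_READ_ERROR",
    },
}


def describe(raw):
    """Return 'type name / code name' for a raw Status value."""
    field = (raw >> 1) & 0x7FF          # drop phase tag, keep 11 bits
    sct, sc = field >> 8, field & 0xFF  # 3-bit type, 8-bit code
    sct_name = SCT_NAMES.get(sct, "reserved or unknown type")
    sc_name = SC_NAMES.get(sct, {}).get(sc, "code not in this subset")
    return f"{sct_name} / {sc_name}"


print(describe(0x8004))  # NVME_SCT_GENERIC / NVME_SC_INVALID_FIELD
```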

Looking for a further explanation I found:

NVME_SC_INVALID_FIELD - Invalid Field in Command: A reserved coded value or an unsupported value in a defined field.

Also, as far as I can tell CmdId (0xa013 in this case) is not an actual command but an ID for a 'structure' or packet that contains an actual command and parameters that you can pass on to the queue. So in itself CmdId 0xa013 tells us nothing about the actual command the host was sending to the drive.

Disclaimer: Math, bit-shifting and all that is not my strong point, so I may have made an error or a typo in the calculator. You should check it before relying on it.

I have a drive also reporting 0x8004. My version of smartctl (7.4) is able to decode it and indeed it is "Invalid Field in Command". — Arnavion, Commented Oct 28, 2023 at 1:09

harrymc · Accepted Answer · 2023-07-31 14:30:27Z

I assume that all the SMART attributes are healthy. If you suspect they are not healthy, please add their screenshot to your post.

I have tried chiefly to understand the status code of 0x8004 that you received. I have found this code in the file nvmecmds.cpp which is part of the smartmontools package at line 294:

// Return flagged error message for NVMe status SCT/SC fields or nullptr if unknown.
// If message starts with '-', the status indicates an invalid command (EINVAL).
static const char * nvme_status_to_flagged_str(uint16_t status)
{
  // Section 3.3.3.2.1 of NVM Express Base Specification Revision 2.0c, October 4, 2022
  uint8_t sc = (uint8_t)status;
  switch ((status >> 8) & 0x7) {
    case 0x0: // Generic Command Status
      if (sc < 0x80) switch (sc) {
        case 0x00: return "Successful Completion";
        case 0x01: return "-Invalid Command Opcode";
        case 0x02: return "-Invalid Field in Command";
        case 0x03: return "Command ID Conflict";
        case 0x04: return "Data Transfer Error";

Following the above code, if status is 0x8004, then the variable sc contains the lower 8 bits, which is 4. The switch expression discards the lower 8 bits and takes the next 3 bits, giving 0. In the case for 0x0, since sc (4) is less than 0x80, the result is case 0x04, which says "Data Transfer Error".

I can't say if this error is serious or not. Only an examination of the SMART attributes may give some indication about the health of this disk.

This is wrong. 0x8004 is "Invalid Field in Command" as superuser.com/a/1803785/814745 says, not "Data Transfer Error". My version of smartctl (7.4) decodes it and agrees: Status: 0x8004, Message: "Invalid Field in Command" — Arnavion, Commented Oct 28, 2023 at 0:31
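One way to see where the two readings diverge is to run the same lookup on the raw value and on the value shifted right by one bit. The Python below is a hypothetical restatement of the five cases visible in the quoted switch, not smartmontools code, and it drops the '-' flag prefix from the messages. It assumes the Status column still carries the phase tag in bit 0, as the nvme-cli explanation in the other answer describes.

```python
# The five generic messages visible in the quoted excerpt.
MESSAGES = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
    0x03: "Command ID Conflict",
    0x04: "Data Transfer Error",
}


def generic_message(status):
    """Mirror the quoted switch: SCT in bits 10:8, SC in bits 7:0."""
    sct = (status >> 8) & 0x7
    sc = status & 0xFF
    if sct != 0x0:
        return None  # only the generic type is covered here
    return MESSAGES.get(sc)


raw = 0x8004
print(generic_message(raw))       # Data Transfer Error
print(generic_message(raw >> 1))  # Invalid Field in Command
```

The second result matches what smartctl 7.4 prints for this status, which is consistent with the phase tag being shifted out before the lookup.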

Artem S. Tashkinov · Accepted Answer · 2023-08-09 16:47:47Z

I've downloaded msecli, the Micron SMART utility for this NVMe device, and it says that the NVMe log is not supported.

This could mean that smartctl is trying to read something which is not there and reports it as a bogus error. I'll have to email the company and find out what exactly is happening.

Edit: I've contacted Micron, and they told me that since the drive is OEM-branded, they cannot provide support for it. They did not answer any of my questions.

Smartctl 7.4 now shows the Self-test Log (NVMe Log 0x06) and it has no errors, but the other log still contains errors:

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0        143     0  0x001c  0x8004  0x000            0     0     -  Invalid Field in Command

Maybe this error is bogus, considering the entry keeps changing: the error count and CmdId differ each time, although the status stays 0x8004.

Maybe someone else will have better luck dealing with Micron and we'll find out what these errors are about.

scrat.squirrel · Accepted Answer · 2023-12-17 23:14:07Z

A more complete explanation, including the codes, is given in an answer to a different Super User question:

Unable to identify SMART errors/issues of my NVMe disk
