The Complete Debugging Guide to Stop 0x124 – Part 2

We looked at the error packets in the first part; now we need to investigate the structure of error records and understand how to gather useful information from them to assist in our debugging efforts. There are primarily two forms of error record you will see: one relates to processor errors and the other to PCIe errors. Each has different error record sections, which we will need to examine.

The following diagram illustrates the general structure of an error record:

We’ll start at the top of the error record, and then describe each section individually.

The general error record is described by the WHEA_ERROR_RECORD structure, which can be displayed in WinDbg:

2: kd> dt nt!_WHEA_ERROR_RECORD
   +0x000 Header           : _WHEA_ERROR_RECORD_HEADER
   +0x080 SectionDescriptor : [1] _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR

The SectionDescriptor field is an array of WHEA_ERROR_RECORD_SECTION_DESCRIPTOR structures, each of which describes one error section. There must be at least one error section in an error record, which is why one element is always present within the array when we dump the structure in an arbitrary context.

Error Record Header:

The error record header is described by the WHEA_ERROR_RECORD_HEADER structure as seen below:

   +0x000 Signature        : Uint4B
   +0x004 Revision         : _WHEA_REVISION
   +0x006 SignatureEnd     : Uint4B
   +0x00a SectionCount     : Uint2B
   +0x00c Severity         : _WHEA_ERROR_SEVERITY
   +0x010 ValidBits        : _WHEA_ERROR_RECORD_HEADER_VALIDBITS
   +0x014 Length           : Uint4B
   +0x018 Timestamp        : _WHEA_TIMESTAMP
   +0x020 PlatformId       : _GUID
   +0x030 PartitionId      : _GUID
   +0x040 CreatorId        : _GUID
   +0x050 NotifyType       : _GUID
   +0x060 RecordId         : Uint8B
   +0x068 Flags            : _WHEA_ERROR_RECORD_HEADER_FLAGS
   +0x06c PersistenceInfo  : _WHEA_PERSISTENCE_INFO
   +0x074 Reserved         : [12] UChar

There isn’t too much to mention about the Error Record Header; the most useful fields to discuss are Signature, SectionCount and Severity. The Signature field will contain the value of REPC, which might be useful to consider if you were looking at raw memory dumps and noticed that signature. Because the value is stored little-endian, the same four bytes read as CPER when viewed as ASCII in memory. The SectionCount is the number of error sections within the error record; there must be at least one error section.
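As a rough illustration of the header layout, the following Python sketch pulls the fields discussed above out of a raw error record using the offsets shown in the dt output. The function name parse_whea_header and the idea of holding the record in a bytes buffer are assumptions made for this example; they are not part of WinDbg or WHEA.

```python
import struct

HEADER_SIZE = 0x80  # SectionDescriptor begins at +0x080, so the header is 0x80 bytes


def parse_whea_header(record):
    """Extract Signature, SectionCount, Severity and Length from a raw record.

    Hypothetical helper: offsets are taken from the dt output above.
    """
    if len(record) < HEADER_SIZE:
        raise ValueError("buffer is smaller than the error record header")
    signature = bytes(record[0x000:0x004])                      # raw bytes as stored
    (section_count,) = struct.unpack_from("<H", record, 0x00A)  # Uint2B
    (severity,) = struct.unpack_from("<I", record, 0x00C)       # 4-byte enumeration
    (length,) = struct.unpack_from("<I", record, 0x014)         # Uint4B
    return {
        "Signature": signature,
        "SectionCount": section_count,
        "Severity": severity,
        "Length": length,
    }


# Synthetic record built purely for demonstration; not data from a real dump.
demo = bytearray(HEADER_SIZE)
demo[0:4] = b"CPER"
struct.pack_into("<H", demo, 0x00A, 2)
struct.pack_into("<I", demo, 0x00C, 1)
struct.pack_into("<I", demo, 0x014, 0x200)
print(parse_whea_header(bytes(demo)))
```

Note that SignatureEnd sits at +0x006, which is not 4-byte aligned, so the structure is packed; reading each field at its explicit offset avoids having to reason about padding.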

The Severity field of the error record contains a value from the _WHEA_ERROR_SEVERITY enumeration, which describes the severity of the hardware error which has occurred.

   WheaErrSevRecoverable = 0n0
   WheaErrSevFatal = 0n1
   WheaErrSevCorrected = 0n2
   WheaErrSevInformational = 0n3

The most common values are Fatal and Recoverable. They give an indication of how severe the error condition is, and which action should be taken. The meanings of the values can be found on the MSDN website.
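If you are post-processing records in a script, the enumeration above maps naturally onto a small lookup. This is only a sketch: the class name WheaErrorSeverity and the helper describe_severity are hypothetical names chosen for the example, while the numeric values are the ones listed above.

```python
from enum import IntEnum


class WheaErrorSeverity(IntEnum):
    # Values copied from the _WHEA_ERROR_SEVERITY listing above.
    RECOVERABLE = 0
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


def describe_severity(value):
    """Return a readable name for a severity value, tolerating unknown values."""
    try:
        return WheaErrorSeverity(value).name.title()
    except ValueError:
        return "Unknown ({})".format(value)


print(describe_severity(1))  # prints "Fatal"
```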

Error Record Section Descriptor:

   +0x000 SectionOffset    : Uint4B
   +0x004 SectionLength    : Uint4B
   +0x008 Revision         : _WHEA_REVISION
   +0x00b Reserved         : UChar
   +0x010 SectionType      : _GUID
   +0x020 FRUId            : _GUID
   +0x030 SectionSeverity  : _WHEA_ERROR_SEVERITY
   +0x034 FRUText          : [20] Char

The SectionType field contains the GUID for the error sections, and can take the following values:

  • Hardware Error Packet
  • Generic Processor Error
  • x86/x64 Processor Error
  • Itanium Processor Error
  • Itanium Processor Firmware Error Record Reference
  • Platform Memory Error
  • Nonmaskable Interrupt
  • PCI Express Error
  • PCI/PCI-X Bus Error
  • PCI/PCI-X Device Error

Before I begin to explain the meaning of the error sections listed, let’s examine briefly the error sections which you will most commonly see when examining error records. I will separate the error records into two types: processor related errors and PCI/PCIe related errors.

Processor Related Error Records:

The Generic Processor Error section is described by the below structure:

   +0x008 ProcessorType    : UChar
   +0x009 InstructionSet   : UChar
   +0x00a ErrorType        : UChar
   +0x00b Operation        : UChar
   +0x00c Flags            : UChar
   +0x00d Level            : UChar
   +0x00e Reserved         : Uint2B
   +0x010 CPUVersion       : Uint8B
   +0x018 CPUBrandString   : [128] UChar
   +0x098 ProcessorId      : Uint8B
   +0x0a0 TargetAddress    : Uint8B
   +0x0a8 RequesterId      : Uint8B
   +0x0b0 ResponderId      : Uint8B
   +0x0b8 InstructionPointer : Uint8B

The fields which are the most useful for debugging purposes are ProcessorType, InstructionSet, ErrorType, Level, CPUVersion and ProcessorId, although it is quite interesting to learn the meaning of the other fields anyway. This section is primarily used to provide information which is applicable across different processor architectures.

ProcessorType describes the processor architecture; this field currently takes the value of GENPROC_PROCTYPE_XPF (x86/x64) or GENPROC_PROCTYPE_IPF (Itanium). The InstructionSet field describes the instruction set which was in use at the time of the crash; the current values are GENPROC_PROCISA_X86 or GENPROC_PROCISA_X64 for x86/x64 systems, and GENPROC_PROCISA_IPF for Itanium systems.

ErrorType gives an indication of the type of error which has occurred; this may be a cache error, a TLB error or a bus error, for example. The field can take the following values: GENPROC_PROCERRTYPE_UNKNOWN; GENPROC_PROCERRTYPE_CACHE; GENPROC_PROCERRTYPE_TLB (Translation Lookaside Buffer); GENPROC_PROCERRTYPE_BUS; GENPROC_PROCERRTYPE_MAE (Microarchitecture error).

The Level field corresponds to the cache level at which the error has occurred. The CPUVersion field is a union called WHEA_PROCESSOR_FAMILY_INFO, which describes the stepping, family and model of the processor. There is a far easier method to obtain this information though, and I will show the extension in Part 3. The ProcessorId is simply the logical processor number where the error was reported.

Please note that for the fields described above to be valid and present within the error record section, the corresponding bits must be set within the following union called WHEA_PROCESSOR_GENERIC_ERROR_SECTION_VALIDBITS:

   +0x000 ProcessorType    : Pos 0, 1 Bit
   +0x000 InstructionSet   : Pos 1, 1 Bit
   +0x000 ErrorType        : Pos 2, 1 Bit
   +0x000 Operation        : Pos 3, 1 Bit
   +0x000 Flags            : Pos 4, 1 Bit
   +0x000 Level            : Pos 5, 1 Bit
   +0x000 CPUVersion       : Pos 6, 1 Bit
   +0x000 CPUBrandString   : Pos 7, 1 Bit
   +0x000 ProcessorId      : Pos 8, 1 Bit
   +0x000 TargetAddress    : Pos 9, 1 Bit
   +0x000 RequesterId      : Pos 10, 1 Bit
   +0x000 ResponderId      : Pos 11, 1 Bit
   +0x000 InstructionPointer : Pos 12, 1 Bit
   +0x000 Reserved         : Pos 13, 51 Bits
   +0x000 ValidBits        : Uint8B
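The valid bits are easy to check by hand, but a short script makes the mapping explicit. The sketch below takes the 64-bit ValidBits value and returns the names of the fields whose bits are set, using the bit positions from the listing above. The function name valid_generic_fields is a hypothetical name for this example.

```python
# Bit positions 0..12, in the order shown in the ValidBits listing above.
GENERIC_VALID_FIELDS = [
    "ProcessorType", "InstructionSet", "ErrorType", "Operation", "Flags",
    "Level", "CPUVersion", "CPUBrandString", "ProcessorId", "TargetAddress",
    "RequesterId", "ResponderId", "InstructionPointer",
]


def valid_generic_fields(valid_bits):
    """List the generic processor error fields marked as valid."""
    return [name for position, name in enumerate(GENERIC_VALID_FIELDS)
            if valid_bits & (1 << position)]


# Bits 0 and 2 set: only ProcessorType and ErrorType carry meaningful data.
print(valid_generic_fields(0b101))
```

Any field whose bit is clear should be ignored, even if the bytes at its offset happen to be non-zero.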

The x86/x64 Processor Error section is described by the structure below:

   +0x008 LocalAPICId      : Uint8B
   +0x010 CpuId            : [48] UChar
   +0x040 VariableInfo     : [1] UChar

This section holds information which is specific to that particular processor architecture. The most interesting and useful field is CpuId (CPUID), which contains the stepping, model and family numbers for the processor. The following is an example of the output when the !errrec extension has been used:

CPU Id        : e5 06 01 00 00 08 10 00 - fd e3 98 00 ff fb eb bf
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

The Stepping is 5, the Model is 1e (30) and the Family is 6. We could have additionally used the !cpuid extension to obtain the same information:

0: kd> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,30,5  GenuineIntel    1729

If you are wondering why the Model is given as 30 and not 1e, it is because !cpuid displays the value in decimal, and 1e in hexadecimal is 30 in decimal.

0: kd> ? 1e
Evaluate expression: 30 = 00000000`0000001e
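The stepping, model and family values can be recovered from the CPU Id bytes by hand. The sketch below assumes that the first four bytes of the CpuId field hold the little-endian EAX value returned by CPUID leaf 1, which is consistent with the example output above, and it applies the usual rule that the extended model bits are combined with the base model for families 6 and 15. The function name decode_cpuid_signature is hypothetical.

```python
def decode_cpuid_signature(cpu_id):
    """Return (family, model, stepping) from the first four CpuId bytes."""
    eax = int.from_bytes(cpu_id[0:4], "little")
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    extended_model = (eax >> 16) & 0xF
    extended_family = (eax >> 20) & 0xFF
    if family in (0x6, 0xF):
        model |= extended_model << 4   # extended model supplies the high nibble
    if family == 0xF:
        family += extended_family      # extended family only applies to family 15
    return family, model, stepping


# First four bytes of the CPU Id line shown in the !errrec output above.
print(decode_cpuid_signature(bytes.fromhex("e5060100")))  # prints (6, 30, 5)
```

Here e5 06 01 00 is read as 0x000106e5: the low nibble gives stepping 5, the next nibble gives base model e, the extended model 1 turns that into 1e (30), and the family nibble is 6, matching the !cpuid output.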

PCI/PCIe Related Error Records:

The PCI Express error section is described by the _WHEA_PCIEXPRESS_ERROR_SECTION structure. The structure does contain a number of interesting fields; however, exploring all of these sub-structures would be outside the scope of this tutorial and would require specialist knowledge of PCIe technology.

   +0x008 PortType         : _WHEA_PCIEXPRESS_DEVICE_TYPE
   +0x00c Version          : _WHEA_PCIEXPRESS_VERSION
   +0x010 CommandStatus    : _WHEA_PCIEXPRESS_COMMAND_STATUS
   +0x014 Reserved         : Uint4B
   +0x018 DeviceId         : _WHEA_PCIEXPRESS_DEVICE_ID
   +0x028 DeviceSerialNumber : Uint8B
   +0x034 ExpressCapability : [60] UChar
   +0x070 AerInfo          : [96] UChar

The most important fields are PortType, DeviceId and AerInfo, which are also parsed automatically by WinDbg when using the !errrec extension. The PortType describes the type of PCIe port where the error occurred; it is an enumeration of the following values:

   WheaPciExpressEndpoint = 0n0
   WheaPciExpressLegacyEndpoint = 0n1
   WheaPciExpressRootPort = 0n4
   WheaPciExpressUpstreamSwitchPort = 0n5
   WheaPciExpressDownstreamSwitchPort = 0n6
   WheaPciExpressToPciXBridge = 0n7
   WheaPciXToExpressBridge = 0n8
   WheaPciExpressRootComplexIntegratedEndpoint = 0n9
   WheaPciExpressRootComplexEventCollector = 0n10

A port can have two different definitions, depending upon whether you look at the PCIe interface from a logical or a physical standpoint. Physically, a port is a collection of transmitters and receivers which create a link, whereas logically, a port is an interface between a component and a link. A link is simply a path of communication between two different devices.

The main PortType you’ll see mentioned in the PCIe dumps is the Root Port. The following definition is an extract from the PCI Express Base 1.1 Specification:

“A PCI Express Port on a Root Complex that maps a portion of the Hierarchy through an associated virtual PCI-PCI Bridge”

Hierarchy here simply refers to the tree of components which makes up the PCIe topology.

The next field is DeviceId, which is much simpler to understand than the previous field. This structure describes the Vendor ID and Device ID of the PCI/PCIe device which may be experiencing problems.

   +0x000 VendorID         : Uint2B
   +0x002 DeviceID         : Uint2B
   +0x004 ClassCode        : Pos 0, 24 Bits
   +0x004 FunctionNumber   : Pos 24, 8 Bits
   +0x008 DeviceNumber     : Pos 0, 8 Bits
   +0x008 Segment          : Pos 8, 16 Bits
   +0x008 PrimaryBusNumber : Pos 24, 8 Bits
   +0x00c SecondaryBusNumber : Pos 0, 8 Bits
   +0x00c Reserved1        : Pos 8, 3 Bits
   +0x00c SlotNumber       : Pos 11, 13 Bits
   +0x00c Reserved2        : Pos 24, 8 Bits

You can enter the Vendor ID and Device ID into a PCI Database and it should give you the device name.
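Since most of the DeviceId structure is made up of bitfields, a short sketch may help to show how the raw 16 bytes break down. The layout below follows the offsets and bit positions in the listing above; the function name decode_pcie_device_id and the sample values are hypothetical and do not correspond to a real device.

```python
import struct


def decode_pcie_device_id(raw):
    """Split the 16-byte DeviceId structure into its named fields."""
    vendor_id, device_id, dword1, dword2, dword3 = struct.unpack_from("<HHIII", raw, 0)
    return {
        "VendorID": vendor_id,
        "DeviceID": device_id,
        "ClassCode": dword1 & 0xFFFFFF,            # +0x004, bits 0-23
        "FunctionNumber": (dword1 >> 24) & 0xFF,   # +0x004, bits 24-31
        "DeviceNumber": dword2 & 0xFF,             # +0x008, bits 0-7
        "Segment": (dword2 >> 8) & 0xFFFF,         # +0x008, bits 8-23
        "PrimaryBusNumber": (dword2 >> 24) & 0xFF, # +0x008, bits 24-31
        "SecondaryBusNumber": dword3 & 0xFF,       # +0x00c, bits 0-7
        "SlotNumber": (dword3 >> 11) & 0x1FFF,     # +0x00c, bits 11-23
    }


# Made-up sample values, packed in the same layout for demonstration only.
sample = struct.pack("<HHIII", 0x1234, 0x5678,
                     (2 << 24) | 0x060400,
                     (3 << 24) | 0x1C,
                     (5 << 11) | 0x04)
fields = decode_pcie_device_id(sample)
print("Vendor {:04x} Device {:04x}".format(fields["VendorID"], fields["DeviceID"]))
```

The Vendor ID and Device ID printed in hexadecimal are the values you would then look up in a PCI database.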

The last important field for debugging is the AerInfo, which is actually represented by the following structure:

   +0x004 UncorrectableErrorStatus : _PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS
   +0x008 UncorrectableErrorMask : _PCI_EXPRESS_UNCORRECTABLE_ERROR_MASK
   +0x00c UncorrectableErrorSeverity : _PCI_EXPRESS_UNCORRECTABLE_ERROR_SEVERITY
   +0x010 CorrectableErrorStatus : _PCI_EXPRESS_CORRECTABLE_ERROR_STATUS
   +0x014 CorrectableErrorMask : _PCI_EXPRESS_CORRECTABLE_ERROR_MASK
   +0x018 CapabilitiesAndControl : _PCI_EXPRESS_AER_CAPABILITIES
   +0x01c HeaderLog        : [4] Uint4B
   +0x02c SecUncorrectableErrorStatus : _PCI_EXPRESS_SEC_UNCORRECTABLE_ERROR_STATUS
   +0x030 SecUncorrectableErrorMask : _PCI_EXPRESS_SEC_UNCORRECTABLE_ERROR_MASK
   +0x034 SecUncorrectableErrorSeverity : _PCI_EXPRESS_SEC_UNCORRECTABLE_ERROR_SEVERITY
   +0x038 SecCapabilitiesAndControl : _PCI_EXPRESS_SEC_AER_CAPABILITIES
   +0x03c SecHeaderLog     : [4] Uint4B

I’ve highlighted the two fields which are the most important when this structure is parsed by WinDbg. The two structures indicate the types of errors which have been reported on that device. The errors are:

  • UR – Unsupported Request Error
  • MTLP – Malformed TLP
  • SD – Surprise Down
  • ROF – Receiver Overflow
  • UC – Unexpected Completion
  • CT – Completion Timeout
  • DLP – Data Link Protocol Error
  • PTLP – Poisoned TLP
  • FCP – Flow Control Protocol Error
  • CA – Completer Abort
  • ECRC – End-to-End Cyclic Redundancy Check Error

I’ve given a description of these errors before in another tutorial series on my blog called Debugging Stop 0x124 PCIe Errors Part 1-3. The capitalisation of the letters indicates the type of error which has occurred; more information about this will be given in Part 3 of this debugging tutorial series.
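For reference, the abbreviations above can be recovered from a raw UncorrectableErrorStatus value with a small lookup. The bit positions in this sketch are not taken from this article; they are the ones assigned to the Uncorrectable Error Status register in the PCI Express Base Specification, so treat them as an assumption to verify against the specification. The function name decode_uncorrectable_status is hypothetical.

```python
# Bit position -> abbreviation, per the PCIe AER Uncorrectable Error Status
# register layout (assumed from the PCI Express Base Specification).
UNCORRECTABLE_STATUS_BITS = {
    4: "DLP",    # Data Link Protocol Error
    5: "SD",     # Surprise Down
    12: "PTLP",  # Poisoned TLP
    13: "FCP",   # Flow Control Protocol Error
    14: "CT",    # Completion Timeout
    15: "CA",    # Completer Abort
    16: "UC",    # Unexpected Completion
    17: "ROF",   # Receiver Overflow
    18: "MTLP",  # Malformed TLP
    19: "ECRC",  # ECRC Error
    20: "UR",    # Unsupported Request Error
}


def decode_uncorrectable_status(status):
    """Return the abbreviations of every error bit set in the status value."""
    return [name for bit, name in sorted(UNCORRECTABLE_STATUS_BITS.items())
            if status & (1 << bit)]


# A status value with only bit 20 set reports an Unsupported Request Error.
print(decode_uncorrectable_status(0x00100000))  # prints ['UR']
```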

Please note that Root Port type errors use the PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure instead.

In Part 3, we’ll begin to look at the debugging methodology involved for both types of bugchecks.


About 0x14c

I'm currently a Software Developer. My primary interests are Graph Theory, Number Theory, Programming Language Theory, Logic and Windows Debugging.