The Complete Debugging Guide to Stop 0x124 – Part 3

In the previous two parts we examined error packets and error records. Now we will discuss the debugging methodology for a Stop 0x124 bugcheck, and how to gather useful debugging information from the dump file using WinDbg. I’ve split this final part into two sections, processor type errors and PCIe type errors, since these are the two most common kinds of error you’ll encounter when debugging a Stop 0x124 bugcheck.

Processor Type Errors:

Upon loading the dump file into WinDbg, you will be greeted with the following parameters:

BugCheck 124, {0, fffffa80080d2028, f6000d80, 40150}

Probably caused by : GenuineIntel

The first parameter is the type of error source, as discussed in Part 1. It is stored within the enumeration called _WHEA_ERROR_SOURCE_TYPE, and from the value of the parameter we know that the error source type was a Machine Check Exception (MCE). An MCE is a mechanism used by the processor to report hardware errors to the operating system. It can be used to report a wide variety of errors, including cache errors, bus errors and memory errors. In my experience, cache errors are the most commonly reported.

The second parameter is the address of the error record, which, as explained in Part 2, is represented by _WHEA_ERROR_RECORD. This is the most important parameter for both of the error types covered here. We will be using the !errrec extension to dump this structure and examine the sections which were also discussed in Part 2.

The third and fourth parameters are the high and low 32 bits of the MCi_STATUS register. They add little debugging value beyond what the error record already provides, apart from satisfying curiosity about the CPU architecture. If you wish, you can dump the layout of the register in WinDbg using the following:

0: kd> dt hal!_MCi_STATUS
   +0x000 McaErrorCode     : Uint2B
   +0x002 ModelErrorCode   : Uint2B
   +0x004 OtherInformation : Pos 0, 23 Bits
   +0x004 ActionRequired   : Pos 23, 1 Bit
   +0x004 Signalling       : Pos 24, 1 Bit
   +0x004 ContextCorrupt   : Pos 25, 1 Bit
   +0x004 AddressValid     : Pos 26, 1 Bit
   +0x004 MiscValid        : Pos 27, 1 Bit
   +0x004 ErrorEnabled     : Pos 28, 1 Bit
   +0x004 UncorrectedError : Pos 29, 1 Bit
   +0x004 StatusOverFlow   : Pos 30, 1 Bit
   +0x004 Valid            : Pos 31, 1 Bit
   +0x000 QuadPart         : Uint8B

To read the individual fields, you’ll have to expand the two parameters with the .formats command and then compare the bit values against the MCi_STATUS structure.
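As a rough illustration of that comparison, the sketch below (Python, purely illustrative and not part of WinDbg) joins the third and fourth bugcheck parameters into the 64-bit status value and extracts each bitfield using the positions from the hal!_MCi_STATUS dump above. The helper name decode_mci_status is my own invention.

```python
# Bit positions are copied from the hal!_MCi_STATUS layout shown above.
# decode_mci_status is a hypothetical helper, not a WinDbg or Windows API.

# Single-bit flags that live in the high 32 bits (offset +0x004).
HIGH_DWORD_FLAGS = {
    "ActionRequired": 23,
    "Signalling": 24,
    "ContextCorrupt": 25,
    "AddressValid": 26,
    "MiscValid": 27,
    "ErrorEnabled": 28,
    "UncorrectedError": 29,
    "StatusOverFlow": 30,
    "Valid": 31,
}


def decode_mci_status(param3, param4):
    """Decode bugcheck parameter 3 (high 32 bits) and 4 (low 32 bits)."""
    fields = {
        "QuadPart": (param3 << 32) | param4,
        "McaErrorCode": param4 & 0xFFFF,            # +0x000, Uint2B
        "ModelErrorCode": (param4 >> 16) & 0xFFFF,  # +0x002, Uint2B
        "OtherInformation": param3 & 0x7FFFFF,      # Pos 0, 23 Bits
    }
    for name, bit in HIGH_DWORD_FLAGS.items():
        fields[name] = (param3 >> bit) & 1
    return fields


if __name__ == "__main__":
    # Parameters 3 and 4 from the example bugcheck line.
    for name, value in decode_mci_status(0xF6000D80, 0x00040150).items():
        print(f"{name:18} {value:#x}")
```

For the example dump this gives a McaErrorCode of 0x150 with the Valid, UncorrectedError and AddressValid bits set.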

I’ve highlighted the GenuineIntel string since some users make the mistake of automatically assuming that the processor is at fault, which is simply not true. The string is only the processor’s vendor identification string, and indicates that the system is using an Intel processor. For informational purposes, the string can be found in the following structure:

0: kd> dt nt!_KPRCB -y VendorString
   +0x4bb8 VendorString : [13] UChar

In this particular example, we can find the address of the PRCB by using the !prcb extension and then passing that address to the dt command shown above.

0: kd> !prcb
PRCB for Processor 0 at fffff780ffff0000:
Current IRQL -- 15
Threads--  Current fffffa8007618b50 Next 0000000000000000 Idle fffff8000345fcc0
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 000b8afa
Times -- Dpc    00000100 Interrupt 00000023 
         Kernel 0002724c User      00000698
0: kd> dt nt!_KPRCB -y VendorString fffff780ffff0000
   +0x4bb8 VendorString : [13]  "GenuineIntel"

Okay, we have now established the meaning of the parameters, and have discovered that the error was reported by the processor through an MCE. We will now need to dump the error record.

0: kd> !errrec fffffa80080d2028
===============================================================================
Common Platform Error Record @ fffffa80080d2028
-------------------------------------------------------------------------------
Record Id     : 01d090ac66cc4cb7
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 5/17/2015 15:10:51 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa80080d20a8
Section       @ fffffa80080d2180
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Instruction Execute
Flags         : 0x00
Level         : 0
CPU Version   : 0x00000000000106e5
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa80080d20f0
Section       @ fffffa80080d2240
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : e5 06 01 00 00 08 10 00 - fd e3 98 00 ff fb eb bf
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa80080d2240

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa80080d2138
Section       @ fffffa80080d22c0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : ICACHEL0_IRD_ERR (Proc 0 Bank 2)
  Status      : 0xf6000d8000040150
  Address     : 0x00000000004b6990
  Misc.       : 0x0000000000000000

The most important section is Section 2: x86/x64 MCA, since it contains the data which the MCE reported to WHEA. We can see that the error severity was fatal, which is what led to the bugcheck in the first place. I’ll go back to Section 0 in a moment. In Section 2, the Error field contains a mnemonic for the type of error shown in Section 0. The mnemonic can be deciphered using the Intel processor documentation, from page 2352 onwards.

There are four different error classifications if you don’t consider the generic processor error type. Together, these classifications form what is known as a compound error code. In our example I’ve highlighted the parts of the mnemonic which can vary and take different values depending upon the situation.

The current compound error classifications are:

Type                          Interpretation
Generic Cache Hierarchy       Generic Cache Hierarchy Error
TLB Errors                    {TT}TLB{LL}_ERR
Memory Controller Errors      {MMM}_CHANNEL_{CCCC}_ERR
Cache Hierarchy Errors        {TT}CACHE{LL}_{RRRR}_ERR
Bus and Interconnect Errors   BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR

I’ve taken Table 15-9 (Section 15.9.2) from the Intel documentation and cut it down into two sections; for our example we will be looking at the Cache Hierarchy Errors. Just to clarify, when you dump the error record and look at the CPU mnemonic, work out which error interpretation applies so that you’re able to investigate the cause of the error in greater depth. The following mnemonics have been copied from the Intel documentation, and I’ve added the section numbers if you wish to check them yourself.

The {TT} variable is known as the Transaction Type (Section 15.9.2.2) and can take the following values:

  • I = Instruction
  • D = Data
  • G = Generic

In our particular example, the transaction type was Instruction (I), so we know that the error was based around the fetching or execution of some instruction.

The {LL} variable is known as the Level (Level of the Memory Hierarchy, Section 15.9.2.3) and points to the level of cache which has experienced the error condition. It can take the following values:

  • L0 = Level 0
  • L1 = Level 1
  • L2 = Level 2
  • LG = Level Generic

The {RRRR} is known as the Request Type (Section 15.9.2.4) field and indicates the type of instruction or action which was being carried out at the time of the error. The variable can take the following values:

  • ERR = Generic Error
  • RD = Generic Read
  • WR = Generic Write
  • DRD = Data Read
  • DWR = Data Write
  • IRD = Instruction Fetch
  • PREFETCH = Prefetch
  • EVICT = Eviction
  • SNOOP = Snoop
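To show how the three fields fit together, here is a minimal Python sketch that rebuilds the mnemonic from the McaErrorCode value (the low 16 bits of MCi_STATUS). It assumes the cache hierarchy encoding given in the Intel documentation, 000F 0001 RRRR TTLL, and it ignores the F bit. The function name and the "?" fallback for reserved encodings are my own choices.

```python
# Hypothetical helper: rebuilds {TT}CACHE{LL}_{RRRR}_ERR from McaErrorCode.
# Assumed encoding (Intel SDM, cache hierarchy errors): 000F 0001 RRRR TTLL.

TRANSACTION_TYPE = {0b00: "I", 0b01: "D", 0b10: "G"}
LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}
REQUEST = {
    0b0000: "ERR", 0b0001: "RD", 0b0010: "WR",
    0b0011: "DRD", 0b0100: "DWR", 0b0101: "IRD",
    0b0110: "PREFETCH", 0b0111: "EVICT", 0b1000: "SNOOP",
}


def decode_cache_error(mca_error_code):
    """Return the cache hierarchy mnemonic, or None for other error classes."""
    # Mask out bit 12 (the F bit) and check that bits 15:8 read 0000 0001.
    if (mca_error_code & 0xEF00) != 0x0100:
        return None
    rrrr = (mca_error_code >> 4) & 0xF
    tt = (mca_error_code >> 2) & 0x3
    ll = mca_error_code & 0x3
    return "{}CACHE{}_{}_ERR".format(
        TRANSACTION_TYPE.get(tt, "?"), LEVEL[ll], REQUEST.get(rrrr, "?")
    )


if __name__ == "__main__":
    # 0x0150 is the low word of the Status value in the example record.
    print(decode_cache_error(0x0150))
```

With the example status, the low word 0x0150 decodes to the same ICACHEL0_IRD_ERR mnemonic that !errrec printed.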

I will quickly provide the meanings of the other sub-fields to save having to trawl through the Intel documentation. The {MMM} and {CCCC} fields primarily apply to Memory Controller errors. {MMM} is a 3-bit field called the Memory Transaction Type, whereas, {CCCC} is a 4-bit field for Channels. The memory controller error mnemonics can be found in Section 15.9.2.5.

The {MMM} field has the meanings of:

  • GEN = Generic undefined request
  • RD = Memory Read Error
  • WR = Memory Write Error
  • AC = Address/Command Error
  • MS = Memory Scrubbing Error

The {CCCC} field has one meaning: CHN, which corresponds to the Channel Number.

Bus and Interconnect errors have three additional fields called {PP} for Participation; {T} for Timeout and {II} for I/O or Memory. The bus and interconnect error mnemonics can be found in Section 15.9.2.6.

{PP} defines how the processor participated within the request, and thus:

  • SRC = Local processor originated request
  • RES = Local processor responded to the request
  • OBS = Local processor observed the error as a third party

{T} indicates whether or not the request timed out:

  • TIMEOUT = Request timed out
  • NOTIMEOUT = Request didn’t time out

{II} defines whether the transaction on the processor bus was a memory access or an I/O access:

  • M = Memory Access
  • I/O = IO

We have examined the error condition and have a good understanding of what the error pertains to. However, we still need to gather some general hardware information, either to check for patches or to provide greater troubleshooting information to the hardware manufacturer. There are several WinDbg extensions which enable us to achieve this.

2: kd> !sysinfo machineid
Machine ID Information [From Smbios 2.7, DMIVersion 39, Size=3456]
BiosMajorRelease = 4
BiosMinorRelease = 6
BiosVendor = American Megatrends Inc.
BiosVersion = 1005
BiosReleaseDate = 10/11/2012
SystemManufacturer = System manufacturer
SystemProductName = System Product Name
SystemFamily = To be filled by O.E.M.
SystemVersion = System Version
SystemSKU = SKU
BaseBoardManufacturer = ASUSTeK COMPUTER INC.
BaseBoardProduct = P8H77-M LE
BaseBoardVersion = Rev X.0x

The !sysinfo machineid extension gives motherboard and BIOS information. From here, I would be able to check the motherboard documentation to ensure that the hardware is compatible, and to see if any patches have been released to resolve the issues the user may be experiencing.

2: kd> !sysinfo cpuspeed
CPUID:        "Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz"
MaxSpeed:     3200
CurrentSpeed: 3192

The !sysinfo cpuspeed extension enables us to investigate the clock speed of the processor and whether the user has been overclocking it. As commonly stated, overclocking can cause system instability and excessive heat, which could be affecting the normal operation of the system.

2: kd> !sysinfo cpuinfo
[CPU Information]
~MHz = REG_DWORD 3192
Component Information = REG_BINARY 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Configuration Data = REG_FULL_RESOURCE_DESCRIPTOR ff,ff,ff,ff,ff,ff,ff,ff,0,0,0,0,0,0,0,0
Identifier = REG_SZ Intel64 Family 6 Model 58 Stepping 9
ProcessorNameString = REG_SZ Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
Update Signature = REG_BINARY 0,0,0,0,12,0,0,0
Update Status = REG_DWORD 2
VendorIdentifier = REG_SZ GenuineIntel
MSR8B = REG_QWORD 1200000000

The !sysinfo cpuinfo extension provides greater depth on the processor family and model, which is great when reporting a bug to Intel or AMD, since they will be able to investigate the issue further and say whether the error is specific to a certain processor type. You may also have noticed that I’m using a different dump file from the previous part of this tutorial!

The !sysinfo cpumicrocode extension provides similar information to the previous extension:

2: kd> !sysinfo cpumicrocode
Initial Microcode Version: 00000012:00000000
 Cached Microcode Version: 00000012:00000000
         Processor Family: 06
          Processor Model: 3a
       Processor Stepping: 09
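The family, model and stepping values reported here are packed into the processor’s CPUID signature, and the CPU Version field of the error record appears to hold that same signature. As a sketch, assuming the standard CPUID leaf 1 EAX layout, the following Python unpacks the 0x106e5 value from the first error record; the function name is hypothetical.

```python
# Hypothetical helper that unpacks a CPUID signature (leaf 1, EAX).
# Assumed layout: stepping [3:0], model [7:4], family [11:8],
# extended model [19:16], extended family [27:20].


def decode_cpu_signature(signature):
    """Return (family, model, stepping) as displayed by tools like !cpuinfo."""
    stepping = signature & 0xF
    model = (signature >> 4) & 0xF
    family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    # The extended model only counts for family 6 and family 15 parts.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family is only added when the base family is 15.
    if family == 0xF:
        family += ext_family
    return family, model, stepping


if __name__ == "__main__":
    # CPU Version from Section 0 of the processor error record.
    family, model, stepping = decode_cpu_signature(0x106E5)
    print(f"Family {family} Model {model} Stepping {stepping}")
```

For 0x106e5 this yields family 6, model 30 (0x1e) and stepping 5, written in the same F/M/S form that the extensions use.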

Again, simply as a matter of personal preference, you may wish to dump the model information for all the processors on the system with !cpuinfo:

2: kd> !cpuinfo
CP  F/M/S Manufacturer  MHz PRCB Signature    MSR 8B Signature Features
 0  6,58,9 GenuineIntel 3192 0000001200000000                   21193ffe
 1  6,58,9 GenuineIntel 3192 0000001200000000                   21193ffe
 2  6,58,9 GenuineIntel 3192 0000001200000000                   21193ffe
 3  6,58,9 GenuineIntel 3192 0000001200000000                   21193ffe
                      Cached Update Signature 0000001200000000
                     Initial Update Signature 0000001200000000

Alternatively, you could use !cpuid, which provides a shorter summary of the same information:

2: kd> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,58,9  GenuineIntel    3192
 1  6,58,9  GenuineIntel    3192
 2  6,58,9  GenuineIntel    3192
 3  6,58,9  GenuineIntel    3192

You can gather temperature information about the system through the use of the !tz and !tzinfo extensions. I won’t discuss the purpose of thermal zones and how they work in this tutorial, since it would needlessly go out of scope and produce another page of writing. You can find more information about thermal zones in the ACPI documentation or through this discussion thread created by myself and Patrick when we first discovered the extensions.

2: kd> !tz
0 - ThermalZone @ 0xfffffa8004073310
  State:         Read                Flags:              0x00000002 Initialized
  Mode:          Active              PendingMode:        Active  
  ActivePoint:   0x00000002          PendingActivePoint: 0x00000002
  Throttle:      0x00000064
  SampleRate:    0x00000000          ThrottleReasons:    0
  LastTime:      0x0000000000000000  LastTemp:           0x00000000 (0.0K)
  PassiveTimer:  0xfffffa8004073340
  PassiveDpc:    0xfffffa8004073380
  OverThrottled: 0xfffffa80040733c0
  Irp:           0xfffffa8004680c80
  Device:        0x00000000
  Thermal Info:  0xfffffa80040733e0
1 - ThermalZone @ 0xfffffa8003679310
  State:         Read                Flags:              0x00000002 Initialized
  Mode:          Active              PendingMode:        Active  
  ActivePoint:   0x00000000          PendingActivePoint: 0x00000000
  Throttle:      0x00000064
  SampleRate:    0x00000000          ThrottleReasons:    0
  LastTime:      0x0000000000000000  LastTemp:           0x00000000 (0.0K)
  PassiveTimer:  0xfffffa8003679340
  PassiveDpc:    0xfffffa8003679380
  OverThrottled: 0xfffffa80036793c0
  Irp:           0xfffffa8004074310
  Device:        0x00000000
  Thermal Info:  0xfffffa80036793e0

The !tzinfo extension provides information about a specific thermal zone:

2: kd> !tzinfo 0xfffffa80036793e0
ThermalInfo @ 0xfffffa80036793e0
  Stamp:         0x00000007  Constant1:  0x00000001  Constant2:   0x00000005
  Period:        0x0000000a  ActiveCnt:  0x00000000  AffinityEx:  0xfffffa80036793f0
  Current Temperature:                   0x00000bd6 (303.0K)
  Passive TripPoint Temperature:         0x00000ed0 (379.2K)
  Hibernate TripPoint Temperature:       0x00000000 (0.0K)
  Critical TripPoint Temperature:        0x00000ed0 (379.2K)
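Judging by the Kelvin values WinDbg prints next to each raw number, the temperatures are stored in tenths of a kelvin, which matches the unit ACPI uses for thermal zones. A small Python sketch of the conversion, with function names of my own choosing:

```python
# Hypothetical helpers for the raw temperature values shown by !tzinfo.
# Assumption: the raw value is in tenths of a kelvin (0x0bd6 -> 303.0K).


def tz_raw_to_kelvin(raw):
    """Convert a raw thermal zone value to kelvin."""
    return raw / 10.0


def tz_raw_to_celsius(raw):
    """Convert a raw thermal zone value to degrees Celsius."""
    return tz_raw_to_kelvin(raw) - 273.15


if __name__ == "__main__":
    for label, raw in (("Current", 0x0BD6), ("Critical trip point", 0x0ED0)):
        print(f"{label}: {tz_raw_to_kelvin(raw):.1f}K = {tz_raw_to_celsius(raw):.1f}C")
```

In this dump the current temperature works out to roughly 30 degrees Celsius, well below the critical trip point of roughly 106 degrees Celsius.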

PCIe Type Errors:

As shown before, we are going to dump the error record and then examine the relevant sections. The sections displayed with PCIe crashes are generally more lengthy and complex to understand. I would advise the full use of the PCIe documentation if available.

 

3: kd> !errrec 869348d4
===============================================================================
Common Platform Error Record @ 869348d4
-------------------------------------------------------------------------------
Record Id     : 01cd07d8bce4740f
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 3/22/2012 3:06:44 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ 86934954
Section       @ 869349e4
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Recoverable

Port Type     : Root Port
Version       : 1.1
Command/Status: 0x4010/0x0507
Device Id     :
  VenId:DevId : 8086:340a
  Class code  : 030400
  Function No : 0x00
  Device No   : 0x03
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ 86934a18
  Device Caps : 00008021 Role-Based Error Reporting: 1
  Device Ctl  : 0107 ur FE NF CE
  Dev Status  : 0003 ur fe NF CE
   Root Ctl   : 0008 fs nfs cs

AER Information @ ffffffff86934a54
  Uncorrectable Error Status    : 00000020 ur ecrc mtlp rof uc ca cto fcp ptlp SD dlp und
  Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
  Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
  Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
  Caps & Control                : 00000005 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
  Header Log                    : 00000000 00000000 00000000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00

===============================================================================
Section 1     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ 8693499c
Section       @ 86934ab4
Offset        : 480
Length        : 192
Flags         : 0x00000000
Severity      : Informational

Proc. Type    : x86/x64
Instr. Set    : x86
CPU Version   : 0x00000000000106a5
Processor ID  : 0x0000000000000006

Section 0 is the only important section of the error record with this form of bugcheck. I’ll start by identifying the device involved in the bugcheck using the Vendor ID and Device ID fields. The record indicates that the error occurred at the Root Port, and from the PCI Database, we know that the device was the Intel I/O Hub PCIe Root Port. Unfortunately, this information is far too generic and doesn’t point to the exact cause of the bugcheck. From here, we would need to begin checking the PCIe devices which are connected to the system and how they interact with the other components.

To gather more information regarding the error, we need to investigate the meanings of the error codes shown in the AER information; it is important to remember that we will be using the _PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure. The Uncorrectable Error Status is represented by the _PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS structure, which contains the bitfields for the error codes shown in the register. If we dump this structure, we can see that the capitalised letters in the !errrec output correspond to the bitfields which have been set to true.

3: kd> dt pci!_PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS
   +0x000 Undefined        : Pos 0, 1 Bit
   +0x000 Reserved1        : Pos 1, 3 Bits
   +0x000 DataLinkProtocolError : Pos 4, 1 Bit
   +0x000 SurpriseDownError : Pos 5, 1 Bit
   +0x000 Reserved2        : Pos 6, 6 Bits
   +0x000 PoisonedTLP      : Pos 12, 1 Bit
   +0x000 FlowControlProtocolError : Pos 13, 1 Bit
   +0x000 CompletionTimeout : Pos 14, 1 Bit
   +0x000 CompleterAbort   : Pos 15, 1 Bit
   +0x000 UnexpectedCompletion : Pos 16, 1 Bit
   +0x000 ReceiverOverflow : Pos 17, 1 Bit
   +0x000 MalformedTLP     : Pos 18, 1 Bit
   +0x000 ECRCError        : Pos 19, 1 Bit
   +0x000 UnsupportedRequestError : Pos 20, 1 Bit
   +0x000 Reserved3        : Pos 21, 11 Bits
   +0x000 AsULONG          : Uint4B
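To make the mapping between the register value and the capitalised abbreviations concrete, here is a short Python sketch that tests each bit position from the structure above against a raw register value. The helper name is hypothetical; the bit positions are copied from the dt output.

```python
# Bit positions copied from pci!_PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS.
# decode_uncorrectable is a hypothetical helper, not a debugger command.

UNCORRECTABLE_BITS = {
    0: "Undefined",
    4: "DataLinkProtocolError",
    5: "SurpriseDownError",
    12: "PoisonedTLP",
    13: "FlowControlProtocolError",
    14: "CompletionTimeout",
    15: "CompleterAbort",
    16: "UnexpectedCompletion",
    17: "ReceiverOverflow",
    18: "MalformedTLP",
    19: "ECRCError",
    20: "UnsupportedRequestError",
}


def decode_uncorrectable(value):
    """Return the names of the bitfields set in an AER uncorrectable register."""
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    print("Status:  ", decode_uncorrectable(0x00000020))
    print("Severity:", decode_uncorrectable(0x00062010))
```

The same table decodes the Uncorrectable Error Severity register, since it shares the bit layout: 0x00062010 sets the DLP, FCP, ROF and MTLP bits, matching the capitalised entries on that line of the !errrec output.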

Since the Surprise Down (SD) error bitfield is the only one which has been set, we can investigate further into what exactly a Surprise Down error is. In short, it indicates a loss of connection between two devices, although I will give a slightly more detailed definition with the use of the PCIe documentation. I’ve added the section numbers for reference.

A Surprise Down error occurs when a TLP (Transaction Layer Packet) request is sent numerous times to a device across a link, and the device doesn’t respond positively. TLPs are similar to IRPs and belong to the Transaction Layer (Section 2) of the PCIe architecture, which is responsible for issuing and responding to TLPs.

For those experienced with debugging, you can imagine this situation as a Stop 0x9F: an IRP is sent but becomes stuck for some unknown reason. Once a threshold is met, the link is considered to be inactive or malfunctioning, and a bugcheck is raised to alert the operating system to the error. The best approach for this type of error is to investigate the connections of the devices on the motherboard; check for any loosely seated cards and for dust which may have built up inside the slots.

Moreover, simply as a matter of interest, the Bus Number, Device Number and Function Number are used to map a device into the PCI Configuration Space. We can use the !pci extension to view such information, but please note that you will require a live debugging session with an x86 computer.

lkd> !pci
PCI Segment 0 Bus 0
00:0  1022:1510.00  Cmd[0006:.mb...]  Sts[0220:.6...]  AMD Host Bridge  SubID:1022:1510
01:0  1002:9806.00  Cmd[0407:imb...]  Sts[0010:c....]  ATI VGA Compatible Controller  SubID:103c:3387
01:1  1002:1314.00  Cmd[0006:.mb...]  Sts[0010:c....]  ATI Class:4:3:0  SubID:103c:3387
04:0  1022:1512.00  Cmd[0004:..b...]  Sts[0010:c....]  AMD PCI-PCI Bridge 0->0x1-0x1
11:0  1002:4394.00  Cmd[0007:imb...]  Sts[0230:c6...]  ATI Class:1:6:1  SubID:103c:3387
12:0  1002:4397.00  Cmd[0016:.mb...]  Sts[02a0:.6...]  ATI USB Controller  SubID:103c:3387
12:2  1002:4396.00  Cmd[0016:.mb...]  Sts[02b0:c6...]  ATI USB2 Controller  SubID:103c:3387
13:0  1002:4397.00  Cmd[0016:.mb...]  Sts[02a0:.6...]  ATI USB Controller  SubID:103c:3387
13:2  1002:4396.00  Cmd[0016:.mb...]  Sts[02b0:c6...]  ATI USB2 Controller  SubID:103c:3387
14:0  1002:4385.42  Cmd[0403:im....]  Sts[0220:.6...]  ATI SMBus Controller  SubID:103c:3387
14:2  1002:4383.40  Cmd[0006:.mb...]  Sts[0410:c....]  ATI Class:4:3:0  SubID:103c:3387
14:3  1002:439d.40  Cmd[000f:imb...]  Sts[0220:.6...]  ATI ISA Bridge  SubID:103c:3387
14:4  1002:4384.40  Cmd[0407:imb...]  Sts[02a0:.6...]  ATI PCI-PCI Bridge 0->0x2-0x2
15:0  1002:43a0.00  Cmd[0007:imb...]  Sts[0810:c..A.]  ATI PCI-PCI Bridge 0->0x3-0x6
15:1  1002:43a1.00  Cmd[0007:imb...]  Sts[0010:c....]  ATI PCI-PCI Bridge 0->0x7-0x7
16:0  1002:4397.00  Cmd[0016:.mb...]  Sts[02a0:.6...]  ATI USB Controller  SubID:103c:3387
16:2  1002:4396.00  Cmd[0016:.mb...]  Sts[02b0:c6...]  ATI USB2 Controller  SubID:103c:3387
18:0  1022:1700.43  Cmd[0000:......]  Sts[0010:c....]  AMD Host Bridge
18:1  1022:1701.00  Cmd[0000:......]  Sts[0000:.....]  AMD Host Bridge
18:2  1022:1702.00  Cmd[0000:......]  Sts[0000:.....]  AMD Host Bridge
18:3  1022:1703.00  Cmd[0000:......]  Sts[0010:c....]  AMD Host Bridge
18:4  1022:1704.00  Cmd[0000:......]  Sts[0000:.....]  AMD Host Bridge
18:5  1022:1718.00  Cmd[0000:......]  Sts[0000:.....]  AMD Host Bridge
18:6  1022:1716.00  Cmd[0000:......]  Sts[0000:.....]  AMD Host Bridge
18:7  1022:1719.00  Cmd[0000:......]  Sts[0000:.....]  AMD Host Bridge

The first column gives the device number followed by the function number of that particular device, separated by a colon. We can gather even further information using the !pcitree extension, which shows all the devices which have been enumerated on the bus:

 

lkd> !pcitree
Bus 0x0 (FDO Ext 86688ae0)
  (d=0,  f=0) 10221510 devext 0x8665cc10 devstack 0x8665cb58 0600 Bridge/HOST to PCI
  (d=1,  f=0) 10029806 devext 0x8665c738 devstack 0x8665c680 0300 Display Controller/VGA
  (d=1,  f=1) 10021314 devext 0x866720e8 devstack 0x86672030 0403 Multimedia Device/Unknown Sub Class
  (d=4,  f=0) 10221512 devext 0x86672c10 devstack 0x86672b58 0604 Bridge/PCI to PCI
  Bus 0x1 (FDO Ext 86680d18)
    No devices have been enumerated on this bus.
  (d=11, f=0) 10024394 devext 0x86672738 devstack 0x86672680 0106 Mass Storage Controller/Unknown Sub Class
  (d=12, f=0) 10024397 devext 0x866730e8 devstack 0x86673030 0c03 Serial Bus Controller/USB
  (d=12, f=2) 10024396 devext 0x86673c10 devstack 0x86673b58 0c03 Serial Bus Controller/USB
  (d=13, f=0) 10024397 devext 0x86673738 devstack 0x86673680 0c03 Serial Bus Controller/USB
  (d=13, f=2) 10024396 devext 0x866740e8 devstack 0x86674030 0c03 Serial Bus Controller/USB
  (d=14, f=0) 10024385 devext 0x86674c10 devstack 0x86674b58 0c05 Serial Bus Controller/Unknown Sub Class
  (d=14, f=2) 10024383 devext 0x86674738 devstack 0x86674680 0403 Multimedia Device/Unknown Sub Class
  (d=14, f=3) 1002439d devext 0x866750e8 devstack 0x86675030 0601 Bridge/PCI to ISA
  (d=14, f=4) 10024384 devext 0x86675c10 devstack 0x86675b58 0604 Bridge/PCI to PCI
  Bus 0x2 (FDO Ext 86685888)
    No devices have been enumerated on this bus.
  (d=15, f=0) 100243a0 devext 0x86675738 devstack 0x86675680 0604 Bridge/PCI to PCI
  Bus 0x3 (FDO Ext 866853e0)
    (d=0,  f=0) 14e44727 devext 0x86a787c8 devstack 0x86a78710 0280 Network Controller/'Other'
  (d=15, f=1) 100243a1 devext 0x8667c0e8 devstack 0x8667c030 0604 Bridge/PCI to PCI
  Bus 0x7 (FDO Ext 8668fea8)
    (d=0,  f=0) 10ec8168 devext 0x86a7dc10 devstack 0x86a7db58 0200 Network Controller/Ethernet
  (d=16, f=0) 10024397 devext 0x8667cc10 devstack 0x8667cb58 0c03 Serial Bus Controller/USB
  (d=16, f=2) 10024396 devext 0x8667c738 devstack 0x8667c680 0c03 Serial Bus Controller/USB
  (d=18, f=0) 10221700 devext 0x8667d0e8 devstack 0x8667d030 0600 Bridge/HOST to PCI
  (d=18, f=1) 10221701 devext 0x8667dc10 devstack 0x8667db58 0600 Bridge/HOST to PCI
  (d=18, f=2) 10221702 devext 0x8667d738 devstack 0x8667d680 0600 Bridge/HOST to PCI
  (d=18, f=3) 10221703 devext 0x8667e0e8 devstack 0x8667e030 0600 Bridge/HOST to PCI
  (d=18, f=4) 10221704 devext 0x8667ec10 devstack 0x8667eb58 0600 Bridge/HOST to PCI
  (d=18, f=5) 10221718 devext 0x8667e738 devstack 0x8667e680 0600 Bridge/HOST to PCI
  (d=18, f=6) 10221716 devext 0x8667f0e8 devstack 0x8667f030 0600 Bridge/HOST to PCI
  (d=18, f=7) 10221719 devext 0x8667fc10 devstack 0x8667fb58 0600 Bridge/HOST to PCI
Total PCI Root busses processed = 1
Total PCI Segments processed = 1

The D represents the device number and the F represents the function number. In the eight-character block which follows, the first four highlighted characters are the Vendor ID and the subsequent four characters are the Device ID of the device. We can confirm this by passing the devext address of the first entry to !devext:
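If you want to pull these fields out of a saved !pcitree log, a regular expression is enough. The sketch below is only an illustration: the function name is mine, and it treats the eight-character block as Vendor ID followed by Device ID, which agrees with the !devext output for the 10221510 entry (Vendor ID 1022, Device ID 1510).

```python
import re

# Hypothetical parser for device lines in saved !pcitree output.
# Assumption: the 8 hex digits are the Vendor ID followed by the Device ID.
DEVICE_LINE = re.compile(
    r"\(d=(?P<device>[0-9a-fA-F]+),\s*f=(?P<function>[0-9a-fA-F]+)\)\s+"
    r"(?P<vendor>[0-9a-fA-F]{4})(?P<devid>[0-9a-fA-F]{4})\s+devext"
)


def parse_pcitree_line(line):
    """Return a dict of fields for a device line, or None for any other line."""
    match = DEVICE_LINE.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),      # kept as printed by the debugger
        "function": match.group("function"),
        "vendor_id": int(match.group("vendor"), 16),
        "device_id": int(match.group("devid"), 16),
    }


if __name__ == "__main__":
    sample = ("  (d=0,  f=0) 10221510 devext 0x8665cc10 "
              "devstack 0x8665cb58 0600 Bridge/HOST to PCI")
    info = parse_pcitree_line(sample)
    print(f"{info['vendor_id']:04x}:{info['device_id']:04x}")
```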

lkd> !devext 0x8665cc10
PDO Extension, Bus 0x0, Device 0, Function 0.
  DevObj 0x8665cb58  Parent FDO DevExt 0x86688ae0
  Device State = PciStarted
  Vendor ID 1022 (ADVANCED MICRO DEVICES)  Device ID 1510
  Subsystem Vendor ID 1022 (ADVANCED MICRO DEVICES)  Subsystem ID 1510
  Header Type 0, Class Base/Sub 06/00  (Bridge/HOST to PCI)
  Programming Interface: 00, Revision: 00, IntPin: 00, RawLine 00
  Possible Decodes ((cmd & 7) = 7): BMI   Capabilities: Ptr = <none>
  Logical Device Power State: D0
  Device Wake Level:          Unspecified
  WaitWakeIrp:                <none>
  Device Requirements structure has changed size.  Update extension.
  Device Resources structure has changed size.  Update extension.
  Interrupt Requirement: <none>
  Interrupt Resource: <none>

The !devext extension can provide some additional information about the device, which is useful for debugging purposes. There is another field within the error record which I had forgotten to mention, and that is the Class Code register.

The register is very useful for identifying the type of device which could be causing the problem. The register is divided into three different parts: Class, Sub-Class and Prog. I/F. From our example, we can see that the class is 0x3, the sub-class is 0x4 and the Prog. I/F is 0x0. If we were to check the meanings of these values, we would reach the conclusion that a display controller had reported the error to the operating system.
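Splitting the register is just a matter of taking one byte per part. A minimal Python sketch follows; the function name is hypothetical, and the small name table only lists the base classes that appear in the !pcitree output earlier in this post.

```python
# Hypothetical helper that splits a 24-bit PCI class code into its parts.

# Only the base classes seen in the !pcitree output above are listed here.
BASE_CLASS_NAMES = {
    0x01: "Mass Storage Controller",
    0x02: "Network Controller",
    0x03: "Display Controller",
    0x04: "Multimedia Device",
    0x06: "Bridge",
    0x0C: "Serial Bus Controller",
}


def split_class_code(class_code):
    """Return (class, sub_class, prog_if) from a 24-bit class code."""
    base_class = (class_code >> 16) & 0xFF
    sub_class = (class_code >> 8) & 0xFF
    prog_if = class_code & 0xFF
    return base_class, sub_class, prog_if


if __name__ == "__main__":
    # Class code 030400 from Section 0 of the PCI Express error record.
    base_class, sub_class, prog_if = split_class_code(0x030400)
    name = BASE_CLASS_NAMES.get(base_class, "Unknown")
    print(f"class {base_class:#x} ({name}), sub-class {sub_class:#x}, "
          f"prog. I/F {prog_if:#x}")
```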

0x3 is the class number for display controllers, and the sub-class points to a general category of display controllers. This makes sense in the context of this dump since the issue lay with a TV tuner card (the dump was previously debugged by Vir Gnarus). A complete list of PCI Class codes can be found here – http://wiki.xomb.org/index.php?title=PCI_Class_Codes

I hope this tutorial series has given an in-depth insight into the internals of a Stop 0x124 and some of the debugging methodologies we can use to debug such bugchecks. I didn’t wish to delve too deeply into the technical details of the PCIe and x86/x64 architectures, since that would leave the scope of this tutorial and generate too much ‘fluff’ and ‘filler’.

I hope you enjoyed this tutorial, and if you wish to suggest any amendments or corrections then please comment/post below. Moreover, please note that I’ve created a list of reference material regarding PCI-e and CPU architecture in this thread – Hardware Architecture Documentation Links


The Complete Debugging Guide to Stop 0x124 – Part 2

We looked at the error packets in the first part, but now we need to investigate the structure of error records, and understand how to gather useful information from these error records to assist us in our debugging efforts. There are primarily two forms of error record you will see: one relates to processor type errors and the other corresponds to PCIe errors. Both have different error record sections which we will need to examine.

The following diagram illustrates the general structure of an error record:

[Figure: general structure of an error record]
We’ll start at the top of the error record, and then describe each section individually.

The general error record is described by the WHEA_ERROR_RECORD structure, which can be dumped in WinDbg:

2: kd> dt nt!_WHEA_ERROR_RECORD
   +0x000 Header           : _WHEA_ERROR_RECORD_HEADER
   +0x080 SectionDescriptor : [1] _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR

The SectionDescriptor field is an array of WHEA_ERROR_RECORD_SECTION_DESCRIPTOR structures which describe each error section. There must be at least one error section in an error record, which is why one element is always present within the array when we dump the structure without a specific record.

Error Record Header:

The error record header is described by the WHEA_ERROR_RECORD_HEADER structure as seen below:

2: kd> dt nt!_WHEA_ERROR_RECORD_HEADER
   +0x000 Signature        : Uint4B
   +0x004 Revision         : _WHEA_REVISION
   +0x006 SignatureEnd     : Uint4B
   +0x00a SectionCount     : Uint2B
   +0x00c Severity         : _WHEA_ERROR_SEVERITY
   +0x010 ValidBits        : _WHEA_ERROR_RECORD_HEADER_VALIDBITS
   +0x014 Length           : Uint4B
   +0x018 Timestamp        : _WHEA_TIMESTAMP
   +0x020 PlatformId       : _GUID
   +0x030 PartitionId      : _GUID
   +0x040 CreatorId        : _GUID
   +0x050 NotifyType       : _GUID
   +0x060 RecordId         : Uint8B
   +0x068 Flags            : _WHEA_ERROR_RECORD_HEADER_FLAGS
   +0x06c PersistenceInfo  : _WHEA_PERSISTENCE_INFO
   +0x074 Reserved         : [12] UChar

There isn’t too much to mention about the Error Record Header, so I’ve highlighted only the most useful fields for discussion. The Signature field will contain the value of REPC, which might be useful to consider if you were looking at raw memory dumps and noticed that signature. The SectionCount is the number of error sections within the error record; there must be at least one error section.
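One thing to watch for when hunting the signature in raw memory is byte order. Assuming the Signature value is the four characters R, E, P, C packed into a 32-bit integer with R in the most significant byte, a little-endian machine stores the bytes as the ASCII text CPER. The Python sketch below illustrates this; the constant and function names are hypothetical, and the byte order is my assumption rather than something shown in the dump.

```python
import struct

# Assumption: 'REPC' packed with 'R' in the most significant byte.
SIGNATURE_VALUE = 0x52455043

# Little-endian storage reverses the characters as they appear in memory.
SIGNATURE_BYTES = struct.pack("<I", SIGNATURE_VALUE)


def find_signatures(blob):
    """Return every offset in a bytes object where the signature bytes occur."""
    offsets = []
    start = blob.find(SIGNATURE_BYTES)
    while start != -1:
        offsets.append(start)
        start = blob.find(SIGNATURE_BYTES, start + 1)
    return offsets


if __name__ == "__main__":
    print(SIGNATURE_BYTES)
    # Made-up buffer purely to demonstrate the search.
    print(find_signatures(b"\x00\x00CPER\x10\x00CPER"))
```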

The Severity field of the error record contains a value from the _WHEA_ERROR_SEVERITY enumeration, which describes the severity of the hardware error which has occurred.

0: kd> dt nt!_WHEA_ERROR_SEVERITY
   WheaErrSevRecoverable = 0n0
   WheaErrSevFatal = 0n1
   WheaErrSevCorrected = 0n2
   WheaErrSevInformational = 0n3

The most common values are Fatal and Recoverable. They give an indication of how severe the error condition is, and which action should be taken. The meanings of the values can be found on the MSDN website.
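To make the header layout more concrete, here is a minimal Python sketch which decodes the first few header fields from raw bytes, using only the offsets shown in the dt output above. The function and class names are my own, and the bytes you feed it are assumed to have been copied out of a dump by some other means; it is an illustration rather than a replacement for !errrec.

```python
import struct
from enum import IntEnum


class WheaErrorSeverity(IntEnum):
    """Mirrors the _WHEA_ERROR_SEVERITY values listed above."""
    RECOVERABLE = 0
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


def parse_record_header(raw: bytes) -> dict:
    """Decode the leading fields of a _WHEA_ERROR_RECORD_HEADER.

    Offsets follow the dt output: Signature +0x000, Revision +0x004,
    SignatureEnd +0x006, SectionCount +0x00a, Severity +0x00c.
    "<" selects little-endian with no padding, matching the packed layout.
    """
    signature, revision, signature_end, section_count, severity = struct.unpack_from(
        "<4sHIHI", raw, 0
    )
    return {
        "Signature": signature,          # raw bytes as they sit in memory
        "Revision": revision,
        "SignatureEnd": signature_end,
        "SectionCount": section_count,
        "Severity": WheaErrorSeverity(severity),
    }
```

Passing the first 16 bytes (or more) of an error record returns a dictionary whose Severity entry prints as, for example, WheaErrorSeverity.FATAL.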

Error Record Section Descriptor:

0: kd> dt nt!_WHEA_ERROR_RECORD_SECTION_DESCRIPTOR
   +0x000 SectionOffset    : Uint4B
   +0x004 SectionLength    : Uint4B
   +0x008 Revision         : _WHEA_REVISION
   +0x00a ValidBits        : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_VALIDBITS
   +0x00b Reserved         : UChar
   +0x00c Flags            : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_FLAGS
   +0x010 SectionType      : _GUID
   +0x020 FRUId            : _GUID
   +0x030 SectionSeverity  : _WHEA_ERROR_SEVERITY
   +0x034 FRUText          : [20] Char

The SectionType field contains a GUID which identifies the type of the error section. The GUID corresponds to one of the following section types:

  • Hardware Error Packet
  • Generic Processor Error
  • x86/x64 Processor Error
  • Itanium Processor Error
  • Itanium Processor Firmware Error Record Reference
  • Platform Memory Error
  • Nonmaskable Interrupt
  • PCI Express Error
  • PCI/PCI-X Bus Error
  • PCI/PCI-X Device Error

Before I begin to explain the meaning of the error sections listed, let’s examine briefly the error sections which you will most commonly see when examining error records. I will separate the error records into two types: processor related errors and PCI/PCIe related errors.

Processor Related Error Records:

The Generic Processor Error section is described by the below structure:

0: kd> dt hal!_WHEA_PROCESSOR_GENERIC_ERROR_SECTION
   +0x000 ValidBits        : _WHEA_PROCESSOR_GENERIC_ERROR_SECTION_VALIDBITS
   +0x008 ProcessorType    : UChar
   +0x009 InstructionSet   : UChar
   +0x00a ErrorType        : UChar
   +0x00b Operation        : UChar
   +0x00c Flags            : UChar
   +0x00d Level            : UChar
   +0x00e Reserved         : Uint2B
   +0x010 CPUVersion       : Uint8B
   +0x018 CPUBrandString   : [128] UChar
   +0x098 ProcessorId      : Uint8B
   +0x0a0 TargetAddress    : Uint8B
   +0x0a8 RequesterId      : Uint8B
   +0x0b0 ResponderId      : Uint8B
   +0x0b8 InstructionPointer : Uint8B

As usual, I’ve highlighted the fields which are the most useful for debugging purposes, although it is quite interesting to learn the meaning of the other fields anyway. This section is primarily used to provide information which is applicable across different processor architectures.

ProcessorType describes the processor architecture; this field currently takes the value of GENPROC_PROCTYPE_XPF (x86/x64) or GENPROC_PROCTYPE_IPF (Itanium). The InstructionSet field describes the instruction set which was in use at the time of the crash; the current values are GENPROC_PROCISA_X86 or GENPROC_PROCISA_X64 for x86/x64 systems, and GENPROC_PROCISA_IPF for Itanium systems.

ErrorType gives an indication of the type of error which has occurred; this may be a cache error, a TLB error, a bus error or a microarchitecture error. The field can take the following values: GENPROC_PROCERRTYPE_UNKNOWN; GENPROC_PROCERRTYPE_CACHE; GENPROC_PROCERRTYPE_TLB (Translation Lookaside Buffer); GENPROC_PROCERRTYPE_BUS; GENPROC_PROCERRTYPE_MAE (Microarchitecture error).

The Level field gives the cache level at which the error occurred. The CPUVersion is a union called WHEA_PROCESSOR_FAMILY_INFO which describes the stepping, family and model for the processor. There is a far easier method of obtaining this information though, and I will show the extension in Part 3. The ProcessorId is simply the logical processor number where the error was reported.

Please note that for the fields described above to be valid and present within the error record section, the corresponding bits must be set within the following union called WHEA_PROCESSOR_GENERIC_ERROR_SECTION_VALIDBITS:

0: kd> dt hal!_WHEA_PROCESSOR_GENERIC_ERROR_SECTION_VALIDBITS
   +0x000 ProcessorType    : Pos 0, 1 Bit
   +0x000 InstructionSet   : Pos 1, 1 Bit
   +0x000 ErrorType        : Pos 2, 1 Bit
   +0x000 Operation        : Pos 3, 1 Bit
   +0x000 Flags            : Pos 4, 1 Bit
   +0x000 Level            : Pos 5, 1 Bit
   +0x000 CPUVersion       : Pos 6, 1 Bit
   +0x000 CPUBrandString   : Pos 7, 1 Bit
   +0x000 ProcessorId      : Pos 8, 1 Bit
   +0x000 TargetAddress    : Pos 9, 1 Bit
   +0x000 RequesterId      : Pos 10, 1 Bit
   +0x000 ResponderId      : Pos 11, 1 Bit
   +0x000 InstructionPointer : Pos 12, 1 Bit
   +0x000 Reserved         : Pos 13, 51 Bits
   +0x000 ValidBits        : Uint8B
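The bit positions in the dt output above translate directly into a small lookup. The following Python is a minimal sketch (the function name is my own) which takes the 64-bit ValidBits value and returns the names of the fields which are marked as valid:

```python
# Field names in bit order, copied from the dt output (Pos 0 to Pos 12).
GENERIC_VALID_BITS = [
    "ProcessorType", "InstructionSet", "ErrorType", "Operation", "Flags",
    "Level", "CPUVersion", "CPUBrandString", "ProcessorId", "TargetAddress",
    "RequesterId", "ResponderId", "InstructionPointer",
]


def valid_fields(valid_bits: int) -> list:
    """Return the names of the fields whose valid bit is set."""
    return [
        name
        for position, name in enumerate(GENERIC_VALID_BITS)
        if (valid_bits >> position) & 1
    ]


print(valid_fields(0x105))  # ['ProcessorType', 'ErrorType', 'ProcessorId']
```

Any field which is missing from the returned list should be ignored when reading the section, even if the bytes at its offset happen to be non-zero.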

The x86/x64 Processor Error section is described by the structure below:

0: kd> dt hal!_WHEA_XPF_PROCESSOR_ERROR_SECTION
   +0x000 ValidBits        : _WHEA_XPF_PROCESSOR_ERROR_SECTION_VALIDBITS
   +0x008 LocalAPICId      : Uint8B
   +0x010 CpuId            : [48] UChar
   +0x040 VariableInfo     : [1] UChar

This section is used to hold any information which is specific to that particular processor architecture. The most interesting and useful field is CpuId (CPUID), which contains the stepping, model and family version numbers for the processor. The following illustrates an example output when the !errrec extension has been used:

CPU Id        : e5 06 01 00 00 08 10 00 - fd e3 98 00 ff fb eb bf
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

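The first four bytes, e5 06 01 00, form the little-endian value 0x000106e5, which is the processor signature returned by CPUID. The Stepping is 5, the Model is 1e (30) and the Family is 6. We could have additionally used the !cpuid extension to obtain the same information: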

0: kd> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,30,5  GenuineIntel    1729

If you are wondering why the Model is given as 30 and not 1e, it is because 1e is 30 in decimal.

0: kd> ? 1e
Evaluate expression: 30 = 00000000`0000001e
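The decoding of the CPU Id bytes can be sketched in a few lines of Python. This follows Intel's documented rule for the CPUID signature, where the extended model bits are combined with the base model for family 6 and family 15 processors; the function name is my own, and other vendors may apply the extended fields slightly differently.

```python
def decode_cpuid_signature(eax: int) -> tuple:
    """Return (family, model, stepping) from the CPUID signature value."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    extended_model = (eax >> 16) & 0xF
    extended_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + extended_family if base_family == 0xF else base_family
    # The extended model forms the high nibble for families 6 and 15.
    if base_family in (0x6, 0xF):
        model = (extended_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# First four bytes of the CPU Id dump above, read as little-endian.
eax = int.from_bytes(bytes.fromhex("e5060100"), "little")
print(decode_cpuid_signature(eax))  # (6, 30, 5)
```

The result matches the F/M/S column reported by !cpuid: family 6, model 30 (0x1e) and stepping 5.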

PCI/PCIe Related Error Records:

The PCI Express error section is described by the _WHEA_PCIEXPRESS_ERROR_SECTION structure. The structure does contain a number of interesting fields; however, exploring all of these sub-structures would be outside the scope of this tutorial and would require specialist knowledge of PCIe technology.

3: kd> dt PSHED!_WHEA_PCIEXPRESS_ERROR_SECTION
   +0x000 ValidBits        : _WHEA_PCIEXPRESS_ERROR_SECTION_VALIDBITS
   +0x008 PortType         : _WHEA_PCIEXPRESS_DEVICE_TYPE
   +0x00c Version          : _WHEA_PCIEXPRESS_VERSION
   +0x010 CommandStatus    : _WHEA_PCIEXPRESS_COMMAND_STATUS
   +0x014 Reserved         : Uint4B
   +0x018 DeviceId         : _WHEA_PCIEXPRESS_DEVICE_ID
   +0x028 DeviceSerialNumber : Uint8B
   +0x030 BridgeControlStatus : _WHEA_PCIEXPRESS_BRIDGE_CONTROL_STATUS
   +0x034 ExpressCapability : [60] UChar
   +0x070 AerInfo          : [96] UChar

I’ve highlighted the most important fields, which are also automatically parsed by WinDbg when using the !errrec extension. The PortType describes the type of PCIe port where the error occurred; it is an enumeration of the following values:

3: kd> dt PSHED!_WHEA_PCIEXPRESS_DEVICE_TYPE
   WheaPciExpressEndpoint = 0n0
   WheaPciExpressLegacyEndpoint = 0n1
   WheaPciExpressRootPort = 0n4
   WheaPciExpressUpstreamSwitchPort = 0n5
   WheaPciExpressDownstreamSwitchPort = 0n6
   WheaPciExpressToPciXBridge = 0n7
   WheaPciXToExpressBridge = 0n8
   WheaPciExpressRootComplexIntegratedEndpoint = 0n9
   WheaPciExpressRootComplexEventCollector = 0n10

A port can have two different definitions, depending upon whether you look at the PCIe interface from a logical or physical standpoint: physically, a port is a group of transmitters and receivers which define a link, whereas logically, a port is an interface between a component and a link. A link is simply a path of communication between two different devices.

The main PortType you’ll see mentioned in the PCIe dumps is the Root Port, the following definition is an extract from PCI Express Base 1.1 Specification:

“A PCI Express Port on a Root Complex that maps a portion of the Hierarchy through an associated virtual PCI-PCI Bridge”

Hierarchy simply means any component within the tree which represents the different components and layers of PCIe.

The next field is DeviceId, which is much simpler to understand than the previous field; this structure describes the Vendor ID and Device ID of the PCI/PCIe device which may be experiencing problems.

3: kd> dt PSHED!_WHEA_PCIEXPRESS_DEVICE_ID
   +0x000 VendorID         : Uint2B
   +0x002 DeviceID         : Uint2B
   +0x004 ClassCode        : Pos 0, 24 Bits
   +0x004 FunctionNumber   : Pos 24, 8 Bits
   +0x008 DeviceNumber     : Pos 0, 8 Bits
   +0x008 Segment          : Pos 8, 16 Bits
   +0x008 PrimaryBusNumber : Pos 24, 8 Bits
   +0x00c SecondaryBusNumber : Pos 0, 8 Bits
   +0x00c Reserved1        : Pos 8, 3 Bits
   +0x00c SlotNumber       : Pos 11, 13 Bits
   +0x00c Reserved2        : Pos 24, 8 Bits

You can enter the Vendor ID and Device ID into a PCI Database and it should give you the device name.
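If you ever need to pull these values out of the raw 16 bytes yourself, the bit fields from the dt output can be unpacked as shown in this Python sketch. The function name is hypothetical and the layout is taken purely from the offsets and bit positions listed above.

```python
import struct


def decode_pcie_device_id(raw: bytes) -> dict:
    """Unpack a 16-byte _WHEA_PCIEXPRESS_DEVICE_ID into its named fields."""
    vendor_id, device_id, dword1, dword2, dword3 = struct.unpack_from(
        "<HHIII", raw, 0
    )
    return {
        "VendorID": vendor_id,                       # +0x000
        "DeviceID": device_id,                       # +0x002
        "ClassCode": dword1 & 0xFFFFFF,              # +0x004, bits 0-23
        "FunctionNumber": (dword1 >> 24) & 0xFF,     # +0x004, bits 24-31
        "DeviceNumber": dword2 & 0xFF,               # +0x008, bits 0-7
        "Segment": (dword2 >> 8) & 0xFFFF,           # +0x008, bits 8-23
        "PrimaryBusNumber": (dword2 >> 24) & 0xFF,   # +0x008, bits 24-31
        "SecondaryBusNumber": dword3 & 0xFF,         # +0x00c, bits 0-7
        "SlotNumber": (dword3 >> 11) & 0x1FFF,       # +0x00c, bits 11-23
    }
```

Formatting VendorID and DeviceID with format(value, "04x") gives the four-digit hexadecimal strings which PCI databases expect.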

The last important field for debugging is the AerInfo, which is actually represented by the following structure:

3: kd> dt pci!_PCI_EXPRESS_AER_CAPABILITY
   +0x000 Header           : _PCI_EXPRESS_ENHANCED_CAPABILITY_HEADER
   +0x004 UncorrectableErrorStatus : _PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS
   +0x008 UncorrectableErrorMask : _PCI_EXPRESS_UNCORRECTABLE_ERROR_MASK
   +0x00c UncorrectableErrorSeverity : _PCI_EXPRESS_UNCORRECTABLE_ERROR_SEVERITY
   +0x010 CorrectableErrorStatus : _PCI_EXPRESS_CORRECTABLE_ERROR_STATUS
   +0x014 CorrectableErrorMask : _PCI_EXPRESS_CORRECTABLE_ERROR_MASK
   +0x018 CapabilitiesAndControl : _PCI_EXPRESS_AER_CAPABILITIES
   +0x01c HeaderLog        : [4] Uint4B
   +0x02c SecUncorrectableErrorStatus : _PCI_EXPRESS_SEC_UNCORRECTABLE_ERROR_STATUS
   +0x030 SecUncorrectableErrorMask : _PCI_EXPRESS_SEC_UNCORRECTABLE_ERROR_MASK
   +0x034 SecUncorrectableErrorSeverity : _PCI_EXPRESS_SEC_UNCORRECTABLE_ERROR_SEVERITY
   +0x038 SecCapabilitiesAndControl : _PCI_EXPRESS_SEC_AER_CAPABILITIES
   +0x03c SecHeaderLog     : [4] Uint4B

I’ve highlighted the two fields which are the most important when this structure is parsed by WinDbg. The two structures indicate the types of errors which have been reported on that device. The errors are:

  • UR – Unsupported Request Error
  • MTLP – Malformed TLP
  • SD – Surprise Down
  • ROF – Receiver Overflow
  • UC – Unexpected Completion
  • CT – Completion Timeout
  • DLP – Data Link Protocol Error
  • PTLP – Poisoned TLP
  • FCP – Flow Control Protocol Error
  • CA – Completer Abort
  • ECRC – End-to-End Cyclic Redundancy Check Error

I’ve given a description of these errors before in another tutorial series on my blog called Debugging Stop 0x124 PCIe Errors Part 1-3. The capitalisation of the letters indicates the type of error which has occurred; more information about this will be given in Part 3 of this debugging tutorial series.
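For reference, the abbreviations above correspond to individual bits in the AER Uncorrectable Error Status register. The Python sketch below uses the bit positions defined by the PCI Express Base Specification for that register; the function name is my own, and the status value would normally be read from the UncorrectableErrorStatus field of the structure shown earlier.

```python
# Bit positions in the AER Uncorrectable Error Status register.
UNCORRECTABLE_STATUS_BITS = {
    4: "DLP",    # Data Link Protocol Error
    5: "SD",     # Surprise Down
    12: "PTLP",  # Poisoned TLP
    13: "FCP",   # Flow Control Protocol Error
    14: "CT",    # Completion Timeout
    15: "CA",    # Completer Abort
    16: "UC",    # Unexpected Completion
    17: "ROF",   # Receiver Overflow
    18: "MTLP",  # Malformed TLP
    19: "ECRC",  # ECRC Error
    20: "UR",    # Unsupported Request Error
}


def decode_uncorrectable_status(status: int) -> list:
    """Return the abbreviations of every error bit set in the register."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_STATUS_BITS.items())
        if status & (1 << bit)
    ]


print(decode_uncorrectable_status(0x00104000))  # ['CT', 'UR']
```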

Please note that Root Port type errors use the PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure instead.

In Part 3, we’ll begin to look at the debugging methodology involved for both types of bugchecks.


The Complete Debugging Guide to Stop 0x124 – Part 1

Introduction:

The Stop 0x124 is mostly caused by hardware, and in some exceptional cases, can be potentially caused by buggy device drivers. There isn’t much of a debugging methodology to debugging a Stop 0x124, but there is plenty of background information which would be useful for understanding some of the terminology witnessed within a Stop 0x124 bugcheck.

If a Stop 0x124 fails to be created successfully, a Stop 0x122 is usually produced instead. A debugging tutorial for Stop 0x122 can be found here – Debugging Stop 0x122 – WHEA_INTERNAL_ERROR

Background:

WHEA (Windows Hardware Error Architecture) was introduced with Windows Vista and Windows Server 2008 to provide an effective error reporting system which would make debugging more effective, and to take precedence over the MCA (Machine Check Architecture) as the primary error reporting architecture for hardware devices. MCA and MCE do still exist on Windows Vista and later operating systems, but are delivered through WHEA instead.

Structure of WHEA:

WHEA consists of a number of different components, the main concepts are LLHEHs (Low-Level Hardware Error Handler), PSHEDs (Platform-Specific Hardware Error Driver) and WHEA error records. The following diagram obtained from the Microsoft documentation provides an overview of how these components interact with the rest of the operating system:

[Diagram: overview of the WHEA components, from the Microsoft documentation]

The LLHEH is the first component to handle the error discovered by the error source. Error sources will be discussed later in this guide, but for now, I will simply mention that the error source is the hardware component which discovered the hardware error, which is not necessarily where the error originated. The following flow diagram will hopefully help to illustrate the entire WHEA process.

Hardware Error -> Error Source Alerts OS -> LLHEH for corresponding error source is invoked -> Error Packet is created -> Error Packet is processed into an Error Record -> Error Record is processed by PSHED -> Bugcheck is produced

It is important to note that the above flow diagram is rather crude and doesn’t necessarily show the details of each process involved in the WHEA bugchecking process. Please note it also only illustrates what happens with a fatal hardware error, something which will only lead to a bugcheck.

I will now begin to discuss Error Sources, and their purpose within a WHEA bugcheck. To begin, we need to understand and identify that the first parameter of the Stop 0x124 is the value of the error source.

2: kd> .bugcheck
Bugcheck code 00000124
Arguments 00000000`00000000 fffffa80`04ba6028 00000000`be000000 00000000`00800400

All error sources are stored within an enumeration called WHEA_ERROR_SOURCE_TYPE, which can be used to find the name of the error source. The enumeration currently lists 12 error sources, followed by WheaErrSrcTypeMax, which marks the end of the enumeration rather than a source of its own. The most common are MCE (0x0) and PCIe (0x4).

2: kd> dt nt!_WHEA_ERROR_SOURCE_TYPE
   WheaErrSrcTypeMCE = 0n0
   WheaErrSrcTypeCMC = 0n1
   WheaErrSrcTypeCPE = 0n2
   WheaErrSrcTypeNMI = 0n3
   WheaErrSrcTypePCIe = 0n4
   WheaErrSrcTypeGeneric = 0n5
   WheaErrSrcTypeINIT = 0n6
   WheaErrSrcTypeBOOT = 0n7
   WheaErrSrcTypeSCIGeneric = 0n8
   WheaErrSrcTypeIPFMCA = 0n9
   WheaErrSrcTypeIPFCMC = 0n10
   WheaErrSrcTypeIPFCPE = 0n11
   WheaErrSrcTypeMax = 0n12
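As a small illustration, the lookup from the first bugcheck parameter to the error source name can be written as the following Python sketch. The enumeration values are copied from the dt output above, the helper names are my own, and the parser simply assumes the "Arguments" line format printed by .bugcheck.

```python
from enum import IntEnum


class WheaErrorSourceType(IntEnum):
    """Values copied from nt!_WHEA_ERROR_SOURCE_TYPE (Max omitted)."""
    MCE = 0
    CMC = 1
    CPE = 2
    NMI = 3
    PCIe = 4
    Generic = 5
    INIT = 6
    BOOT = 7
    SCIGeneric = 8
    IPFMCA = 9
    IPFCMC = 10
    IPFCPE = 11


def parse_bugcheck_arguments(line: str) -> list:
    """Turn the 'Arguments ...' line from .bugcheck into four integers."""
    return [int(token.replace("`", ""), 16) for token in line.split()[1:]]


def error_source_name(parameter1: int) -> str:
    """Name the error source given the first Stop 0x124 parameter."""
    try:
        return WheaErrorSourceType(parameter1).name
    except ValueError:
        return "Unknown"


arguments = parse_bugcheck_arguments(
    "Arguments 00000000`00000000 fffffa80`04ba6028 00000000`be000000 00000000`00800400"
)
print(error_source_name(arguments[0]))  # MCE
```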

Our current error source type is the Machine Check Exception. The error source alerts the operating system to a hardware error, and when it does so, the corresponding LLHEH is run to handle that error condition. The LLHEH isn’t actually a separate entity which exists; it is simply a category of handlers, and thus an LLHEH can be any of a range of handlers, including interrupt handlers, exception handlers or callback functions. The LLHEH will process the error condition into an error packet, and then alert the operating system to the hardware error condition.

2: kd> .frame /r 3
03 fffff880`02f6db00 fffff800`02c26052 hal!HalpMcaReportError+0x4c
rax=0000000000000000 rbx=fffffa8004c17ea0 rcx=0000000000000124
rdx=0000000000000000 rsi=fffff88002f6de00 rdi=fffffa8004c17ef0
rip=fffff80002c26700 rsp=fffff88002f6db00 rbp=fffff88002f6de30
 r8=fffffa8004ba6028  r9=00000000be000000 r10=0000000000800400
r11=0000000000000002 r12=00000000ffffff02 r13=0000000000000000
r14=0000000000000000 r15=0000000000000001
iopl=0         ov up ei pl nz na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000a06
hal!HalpMcaReportError+0x4c:
fffff800`02c26700 488b8c2430010000 mov     rcx,qword ptr [rsp+130h] ss:0018:fffff880`02f6dc30=ffff00906cfd8774

hal!HalpMcaReportError+0x4c is the LLHEH for this bugcheck. Notice that the bugcheck code and its parameters are held in the registers for this stack frame: rcx holds 0x124, and rdx, r8, r9 and r10 hold the four parameters shown by .bugcheck.

As mentioned previously, an LLHEH will produce an error packet, which in turn can be investigated by the debugger.

Each error packet is represented by the WHEA_ERROR_PACKET macro, and there are currently two different versions: WHEA_ERROR_PACKET_V1 and WHEA_ERROR_PACKET_V2. The V1 type is supported by Windows Vista SP1 and Windows Server 2008; V2 is supported from Windows 7 onwards.

Version 2:

The only difference between the two structures is the Signature member. The Signature member takes the value of WHEA_ERROR_PACKET_V2_SIGNATURE for Version 2 or WHEA_ERROR_PACKET_V1_SIGNATURE for Version 1. Since Windows Vista systems are pretty much obsolete now, there isn’t any real reason to bother examining the Version 1 structure.

2: kd> dt _WHEA_ERROR_PACKET_V2
nt!_WHEA_ERROR_PACKET_V2
   +0x000 Signature        : Uint4B
   +0x004 Version          : Uint4B
   +0x008 Length           : Uint4B
   +0x00c Flags            : _WHEA_ERROR_PACKET_FLAGS
   +0x010 ErrorType        : _WHEA_ERROR_TYPE
   +0x014 ErrorSeverity    : _WHEA_ERROR_SEVERITY
   +0x018 ErrorSourceId    : Uint4B
   +0x01c ErrorSourceType  : _WHEA_ERROR_SOURCE_TYPE
   +0x020 NotifyType       : _GUID
   +0x030 Context          : Uint8B
   +0x038 DataFormat       : _WHEA_ERROR_PACKET_DATA_FORMAT
   +0x03c Reserved1        : Uint4B
   +0x040 DataOffset       : Uint4B
   +0x044 DataLength       : Uint4B
   +0x048 PshedDataOffset  : Uint4B
   +0x04c PshedDataLength  : Uint4B

The most important members of the data structure are: ErrorType, ErrorSourceType and NotifyType.

The ErrorType field contains a value from the WHEA_ERROR_TYPE enumeration, which describes the type of hardware which reported the error.

2: kd> dt nt!_WHEA_ERROR_TYPE
   WheaErrTypeProcessor = 0n0
   WheaErrTypeMemory = 0n1
   WheaErrTypePCIExpress = 0n2
   WheaErrTypeNMI = 0n3
   WheaErrTypePCIXBus = 0n4
   WheaErrTypePCIXDevice = 0n5
   WheaErrTypeGeneric = 0n6

The ErrorSourceType has been explained earlier in this post. The NotifyType is the type of mechanism which reports the error to the operating system, for example MCE or BOOT. The GUID can take the following values:

  • CMC_NOTIFY_TYPE_GUID
  • CPE_NOTIFY_TYPE_GUID
  • MCE_NOTIFY_TYPE_GUID
  • PCIe_NOTIFY_TYPE_GUID
  • INIT_NOTIFY_TYPE_GUID
  • NMI_NOTIFY_TYPE_GUID
  • BOOT_NOTIFY_TYPE_GUID

We can examine WHEA Error Packets using the !errpkt extension, but unfortunately that requires a WHEA Error Record with the Error Record Section named Error Packet/Hardware Error Packet. I started debugging in 2012, and I still haven’t seen a BSOD where !errpkt has worked.


Blog Title Change – BSOD Tutorials to Machines Can Think

I apologise for the lack of posting recently; there have been several reasons for the lack of writing on this blog, which I will outline below. Also, in case I haven’t explained this already, a few months ago (when I was active) I changed the title of this blog from BSOD Tutorials to Machines Can Think; I was also going to change the URL, but decided against it for traffic and continuity reasons.

The main reason for the title change was that the content of the blog has changed and diversified due to the range of topics which interest me and which I therefore wish to write about. The number of interests has since increased, and I am now considering creating another blog for current affairs, but I’m still deciding whether I should create it or not.

Anyhow, the reasons why I haven’t written in a while are: most of the writing projects I have in draft are quite large and time-consuming; working full-time and studying full-time gives me very limited leisure time; and difficult family circumstances have taken priority over my writing.

I must say, to avoid any suggestion of narcissism, that I don’t consider myself to be proficient in writing at all. When I use the word “writing”, I’m not trying to put any particular emphasis on the quality of my writing or proclaim that I’m great at writing tutorials and blog posts; I simply wish to state that the activity, which I enjoy, can be rather time-consuming and difficult sometimes. I always apply the same principle when discussing topics like science, mathematics or philosophy: I read about these topics regularly, but I don’t like to take the stance that I’m well versed in them, even if others might consider me to be. I think it comes from a particular gripe I have with people proclaiming themselves to be an expert in an area. They may very well be, but I feel it can come across as bigoted and vain; it’s best to let others say that about you. On the other hand, the word “considered” can be deemed acceptable, especially when used with “by others”.

I will strive to have something written this week, most likely related to WinDbg and additionally a post about Maths.

Superfish – There’s Nothing Super About It

Lenovo has recently been given some bad press about the bundled software (more commonly called bloatware) which is being shipped with Lenovo systems. The software causing the most concern is Superfish, an add-on which is supposedly designed to enhance our online shopping experience and provide suggestions about products and services which we don’t want to purchase to begin with. In short, it may seem like another form of typical adware which is bundled with most OEM-released computers. However, there is a rather nasty twist to Superfish: it actually conducts a man-in-the-middle attack by creating its own security certificates for connections encrypted over the HTTPS protocol.

Basically, Superfish is able to decrypt our encrypted connections and gain any information it wants, as well as bombard us with advertisements about absolute rubbish!

How Does Superfish Work?

Since most websites will establish a secure connection using HTTPS and SSL, the website will need to obtain a security certificate and establish that it is who it claims to be. Before I delve deeper into the details of Superfish, it’s important to explain what HTTPS, SSL and certificates are for those who do not know.

HTTPS (Hypertext Transfer Protocol Secure) is an amalgamation of the standard HTTP protocol and the SSL/TLS protocol to provide encrypted communication over a network. Ironically, the main purpose of HTTPS is to prevent man-in-the-middle attacks, but from the Superfish perspective, this is being largely ignored for the sake of being able to add more ads to secure connections.

HTTPS requires that websites provide a valid security certificate, and that the certificate be signed by a trustworthy certificate authority. We can view the certificate authorities which issue the digital certificates through Certificate Manager (certmgr.msc).

[Screenshot: Certificate Manager (certmgr)]

The digital certificate is used to establish the ownership of a public key within a public-private key pair, which can be used to establish a secure and encrypted connection between a server and the user. It is vitally important that the Certificate Authority (CA) is trustworthy, since it is responsible for validating that the credentials in the digital certificate it issues match the details of the website.

To ensure that the public key genuinely belongs to the website and isn’t being presented by a malicious party, a digital certificate binds the key to the website’s identity, and the CA signs this certificate to vouch that it is correct. A root CA sits at the top of this chain, so it signs its own certificate with its own key. This is known as a self-signed certificate, and once it is installed in the trusted store it is a Trusted Root Certificate. When the chain has been verified, the website’s public key can be used, together with the private key held by the server, to set up the encrypted connection.

Superfish creates its own Trusted Root Certificate, and then uses that certificate to issue digital certificates for websites. From here, it is able to control and sign these digital certificates to decrypt the HTTPS connection and show advertisements to the user. This then leaves the user open to packet sniffing and man-in-the-middle attacks, where an attacker (Superfish) will be able to obtain private information such as banking details.

[Diagram: man-in-the-middle attack]

Additionally, it’s important to remember that your web browser relies on the trustworthiness of these CAs when issuing certificates.

Where and When Does Superfish Install Its Own Certificate?

Superfish will install itself as a CA in the Trusted Root Certification Authorities folder of Certificate Manager, as shown earlier. Superfish will then issue digital certificates which impersonate the website you’re visiting, and sign those certificates itself to gain access to your encrypted connection. Superfish applies this mechanism to all websites which you visit.
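One way to see this for yourself is to look at who issued the certificate which a website presents. The Python sketch below uses the standard ssl module to fetch the certificate and then pulls a field out of its issuer. The helper names are my own, and matching on the text "Superfish" is an assumption about how the injected certificate names itself, so treat this as an illustration rather than a reliable detector.

```python
import socket
import ssl


def fetch_certificate(host: str, port: int = 443) -> dict:
    """Return the certificate presented for host, as parsed by the ssl module."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def issuer_field(cert: dict, field: str = "organizationName"):
    """Look up one field of the issuer, or None if it is absent."""
    # getpeercert() reports the issuer as a tuple of tuples of (key, value) pairs.
    for relative_name in cert.get("issuer", ()):
        for key, value in relative_name:
            if key == field:
                return value
    return None


def issued_by(cert: dict, needle: str) -> bool:
    """True if the issuer's organisation or common name contains needle."""
    names = (issuer_field(cert, "organizationName"), issuer_field(cert, "commonName"))
    return any(name and needle.lower() in name.lower() for name in names)
```

For example, issued_by(fetch_certificate("wordpress.com"), "Superfish") should be False on a clean system, since the genuine certificate is issued by a public CA.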

[Image: WordPress Certificate]

The above image shows a genuine certificate issued by GoDaddy for WordPress.

Removing Superfish

Simply removing the Superfish program will not remove the certificate of the issuing authority. You will need to remove the Superfish add-on, and then run several malware/adware removal tools to clean up any remnants of the add-on.

Here’s a removal guide which illustrates this: Malware Tips – Superfish Window Shopper Removal Guide

Affected Lenovo Models

“G Series: G410, G510, G710, G40-70, G50-70, G40-30, G50-30, G40-45, G50-45
U Series: U330P, U430P, U330Touch, U430Touch, U530Touch
Y Series: Y430P, Y40-70, Y50-70
Z Series: Z40-75, Z50-75, Z40-70, Z50-70
S Series: S310, S410, S40-70, S415, S415Touch, S20-30, S20-30Touch
Flex Series: Flex2 14D, Flex2 15D, Flex2 14, Flex2 15, Flex2 14(BTM), Flex2 15(BTM), Flex 10
MIIX Series: MIIX2-8, MIIX2-10, MIIX2-11
YOGA Series: YOGA2Pro-13, YOGA2-13, YOGA2-11BTM, YOGA2-11HSW
E Series: E10-30”

Additional Reading:

What You Need to Know About Superfish, The Man-in-the-Middle Adware Installed on Lenovo PCs

Lenovo PCs ship with man-in-the-middle adware that breaks HTTPS connections

Some Interesting Theorems and Conjectures

Following on from a discussion on the Mathematics and Science forum called Function Space, I thought I would collect some of the interesting topics being discussed there in this article. There are many interesting conjectures and theorems in Mathematics; however, my favorites are always the ones with the strange names.

I’ve simply copied and pasted some of the theorems.

The ideas which will be discussed are the following:

  • Sausage Conjecture
  • McNugget Numbers*
  • Hairy Ball Theorem
  • Infinite Monkey Theorem
  • Ham Sandwich Theorem

Sausage Conjecture:

The Sausage Conjecture states that, in any dimension n greater than or equal to 5, the arrangement of hyperspheres whose convex hull has minimal content always resembles a sausage. This is the best method for packing hyperspheres, or so the conjecture states.

The Sausage Conjecture is derived from the Penny Packing Problem, which asks: what is the optimal packing method for x non-overlapping n-dimensional spheres? The Sausage Conjecture answers with the sausage shape, in which the spheres are arranged in a long line.

McNugget Numbers*:

I suggest reading my previous post, or reading the references section.

Hairy Ball Theorem:

Given a 2-sphere, which is the standard sphere that we’re well acquainted with from our Euclidean Geometry classes, the theorem states that a hairy ball can never have all of its hairs combed flat.

Theorem: For the ordinary sphere, or 2‑sphere, if f is a continuous function that assigns a vector in R^3 to every point p on a sphere such that f(p) is always tangent to the sphere at p, then there is at least one p such that f(p) = 0.

Infinite Monkey Theorem:

The theorem states that if an immortal monkey is given an infinite amount of time and is able to randomly type characters on a keyboard, then it will almost surely type any given text eventually. A popular choice of text is the complete works of William Shakespeare. The probability of success on any single attempt is obviously very small, but that doesn’t mean that it’s impossible.

The proof for the theorem relies upon statistically independent events, and the product of the probabilities of those events.

For instance, a standard keyboard has 104 keys available to randomly select, and thus there is a 1/104 chance of pressing any given key. If our given text is something like “Computer”, then the probability of typing the word “Computer” is (1/104)^8, which is a very small probability but is not zero.
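The arithmetic can be checked with a short Python sketch. It assumes, as the example above does, that every key press is independent and that each of the 104 keys is equally likely; the function names are my own, and treating each attempt as a separate block of keystrokes is a simplification.

```python
import math
from fractions import Fraction


def typing_probability(text: str, keys: int = 104) -> Fraction:
    """Exact probability of typing text in one attempt of len(text) presses."""
    # Independent events: multiply 1/keys once per character.
    return Fraction(1, keys) ** len(text)


def probability_within(text: str, attempts: int, keys: int = 104) -> float:
    """Probability of at least one success in a number of independent attempts."""
    p = float(typing_probability(text, keys))
    # 1 - (1 - p)^attempts, computed in a way that stays accurate for tiny p.
    return -math.expm1(attempts * math.log1p(-p))


print(typing_probability("Computer"))
print(probability_within("Computer", 10**10))
print(probability_within("Computer", 10**20))
```

The chance of failing every attempt is the product of the individual failure probabilities, which shrinks towards zero as the number of attempts grows; this is why the chance of eventual success approaches 1.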

[Image: monkey typing]

Ham Sandwich Theorem:

Theorem: Given n measurable “objects” in n-dimensional space, it is possible to divide all of them in half (with respect to their measure, i.e. volume) with a single (n − 1)-dimensional hyperplane.


References:

Unusual Terms in Mathematics – FunctionSpace

McNugget Numbers

McNugget Numbers are any integer n which can be written as the linear combination 6a + 9b + 20c, where a, b and c are non-negative integers. Although it is known that all positive integers (with some exceptions*) are McNugget Numbers, it is still interesting to see how many possible linear combinations can be used to satisfy some n.

[Image: Chicken McNuggets x20]

I’ve designed this small program to test if a number is potentially a McNugget Number. It will quickly sieve through whether the number can be satisfied with a x6 McNugget Box, a x9 McNugget Box or a x20 McNugget Box. Note that the program doesn’t consider all the possible ways of satisfying some integer n.

For example, 36 can be satisfied with {6,0,0}, {0,4,0} and {3,2,0}.

*As said earlier, the exception set of integers is the following: {1, 2, 3, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 22, 23, 25, 28, 31, 34, 37, 43}, if we use the original box size set of {6, 9, 20}. I’ve ignored the newer {4, 6, 9, 20} set of coefficients for the linear combination, since this drastically reduces the non-McNugget numbers to the set {1, 2, 3, 5, 7, 11}.
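For readers who want to experiment without downloading the source file, here is a separate minimal sketch in Python. It is not the C++ program mentioned in this post; it is a brute-force search, with function names of my own choosing, which lists every combination of boxes for a given n.

```python
def mcnugget_combinations(n: int, boxes=(6, 9, 20)) -> list:
    """Return every tuple of box counts whose total is exactly n."""
    results = []

    def search(index: int, remaining: int, counts: list) -> None:
        if index == len(boxes):
            if remaining == 0:
                results.append(tuple(counts))
            return
        # Try every possible number of the current box size.
        for count in range(remaining // boxes[index] + 1):
            search(index + 1, remaining - count * boxes[index], counts + [count])

    search(0, n, [])
    return results


def is_mcnugget(n: int, boxes=(6, 9, 20)) -> bool:
    """True if n can be made from the given box sizes."""
    return bool(mcnugget_combinations(n, boxes))


print(mcnugget_combinations(36))  # [(0, 4, 0), (3, 2, 0), (6, 0, 0)]
print([n for n in range(1, 101) if not is_mcnugget(n)])
```

The second line prints the exception set for the original {6, 9, 20} boxes; passing boxes=(4, 6, 9, 20) gives the smaller set for the newer box sizes.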

The code is written in C++. The source file can be downloaded directly from here.

References:

McNugget Number – from Wolfram MathWorld