Reliability, Availability, and Serviceability (RAS) on...

33
Reliability, Availability, and Serviceability (RAS) on AArch64 Fu Wei (Linaro LEG) Supreeth Venkatesh (ARM)

Transcript of Reliability, Availability, and Serviceability (RAS) on...

Reliability, Availability, and Serviceability (RAS) on AArch64

Fu Wei (Linaro LEG) Supreeth Venkatesh (ARM)

ENGINEERS AND DEVICESWORKING TOGETHER

AGENDA1. Brief introduction of RAS

○ Definition, Importance, History

2. RAS on AArch64○ Overview

■ Hardware support

● RAS Extension■ Software Architecture

● ARM-Trusted-Firmware, UEFI, APEI tables● SDEI

○ Prototype Solution for Firmware First Error Handling■ MM Secure Partition, Secure Partition Manager■ Uncorrected error -- HEST & MM ■ Demo time

3. Status and Future Plans

Brief introduction of RAS● What is RAS?● Why do we need RAS?● History of RAS

ENGINEERS AND DEVICESWORKING TOGETHER

What is RAS? -- Definition

ReliabilityContinuity, Computation needs be correct and reliable.

Availability Readiness, System needs to remain available as long as possible.

ServiceabilityAbility to undergo modifications and repairs,System should provide information to administrator to aid in system servicing.

The RAS architecture primarily cares about ERRORs produced from HARDWARE .

ENGINEERS AND DEVICESWORKING TOGETHER

Why do we need RAS? -- ImportanceImpacts

Continuity, Computation needs be correct and reliable.

InevitabilityAlthough faults are rare, enterprise systems can be very large. So failures are inevitable.

So we have to maintain system very well, and Operating Expense (OPEX) for maintenance is inevitable.

OPEX for maintenance is reduced by 1. replacing only failed parts2. scheduled maintenance (is

cheaper than unscheduled service outages)

ENGINEERS AND DEVICESWORKING TOGETHER

Why do we need RAS? -- Importance

Inevitability

<DRAM Errors in the Wild: A Large-Scale Field Study>by Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber

Important Conclusion:● The incidence of memory errors and the range of

error rates across different DIMMs to be much higher than previously reported.

● Memory errors are strongly correlated.● The incidence of CEs increases with age, while

the incidence of UEs decreases with age (due to re-placements).

● Error rates are unlikely to be dominated by soft errors.

Benefit from ECC in DIMM

● Single-bit error --> CE● Avoid (multi-bit errors)UEs from

beginning (Single-bit error, CEs)● The statistical data of CEs/UEs could

be a reference for maintenance to reduce the cost of unscheduled service outage.

ENGINEERS AND DEVICESWORKING TOGETHER

Server without RASHow to avoid "Inevitability" ?

To Be Successful in Business, You Need a Little Luck. --Richard Branson/* _ooOoo_ o8888888o 88" . "88 (| -_- |) O\ = /O ____/`---'\____ .' \\| |// `. / \\||| : |||// \ / _||||| -:- |||||- \ | | \\\ - /// | | | \_| ''\---/'' | | \ .-\__ `-` ___/-. / ___`. .' /--.--\ `. . __ ."" '< `.___\_<|>_/___.' >'"". | | : `- \`.;`\ _ /`;.`/ - ` : | | \ \ `-. \_ __\ /__ _/ .-` / / ======`-.____`-.___\_____/___.-`____.-'====== `=---=' ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Buddha blessed No Error Forever */

Emperor YongzhengDefeat Ba A Ge (Bug)

ENGINEERS AND DEVICESWORKING TOGETHER

HistoryECC in memory controllers and I/O RAMsMachine Check Architecture (MCA)

● A mechanism in which the CPU reports hardware errors to the OS

○ model-specific registers (MSRs)■ set up machine checking■ record detected errors■ the info they contain is CPU

specific○ Machine Check Exception (MCE)

■ signals the detection of an uncorrected machine-check error

■ handler collect information about error from MSRs

○ Utility: mcelog

PCI-E: Advanced Error Reporting (AER)

Linux kernel● EDAC (Error Detection and Correction)

○ designed to report and possibly act on hardware errors

○ inspect the hardware directly (system-specific handling and decoding.)

○ only support memory controller and PCI/AGP errors

Firmware (first) FF● APEI● UEFI

RAS on AArch64● Overview of Hardware & Software● Prototype Solution for Firmware First Error Handling

ENGINEERS AND DEVICESWORKING TOGETHER

Hardware support for RAS● CPU

○ ARMv8-A architecture (a mandatory extension to ARMv8.2)

○ EL2, EL3, or both○ Virtualization extension or Security

extensions or both

● GICv3○ Interrupt routing modes○ Private and shared interrupts (PPI/SPI)○ Ability to set an interrupt pending event

signaling and delegationInterrupt groups/priority

RAS Extension● ESB (Error Synchronization Barrier) instructions● RAS Extension registers● Corrupted data poisoning

ENGINEERS AND DEVICESWORKING TOGETHER

RAS ExtensionESB instruction

● ESB (Error Synchronization Barrier) can be used to isolate Unrecoverable errors.

● Software can determine that:○ The error was reported as

Unrecoverable.○ The preferred return address of the SEI

is an ESB instruction. ○ The software between that ESB and

the previous ESB can be isolated. ○ ESB might update

DISR_EL1 / DISR (Deferred Interrupt Status Register) and VDISR_EL2 / VDISR (Virtual Deferred Interrupt Status Register)

RAS Extension registers:● Feature Register/Component ID Register● Error Record Register

○ Feature○ Control○ Record Primary Syndrome○ Record Address Register○ Record Miscellaneous Registers

● Hypervisor Configuration Register● Virtual SError Exception Syndrome Register● Secure Configuration Register

Or● Interrupt Register for Fault-Handling and

Recovery● Device Affinity and Architecture Register

ENGINEERS AND DEVICESWORKING TOGETHER

RAS Extension -- gather HW error info for FWESB instruction

Help to locate Error

RAS Extension registers● Provide the error info to FW● Control the Interrupt by FW

ARMv8-A RAS extensions standardize the interface between HW and FW

ENGINEERS AND DEVICESWORKING TOGETHER

Software ArchitectureFirmware First error handling requires standard interfaces between multiple SW components.

ENGINEERS AND DEVICESWORKING TOGETHER

Softw

are Architecture

ENGINEERS AND DEVICESWORKING TOGETHER

FirmwareARM Trusted Firmware● Reference EL3 Runtime (BL31)

○ Standard power control (PSCI)○ Optional Trusted OS integration

● Trusted boot firmware○ Optional○ Compatible with other firmware (like

EDK2)● Applicable to all segments● Open Source at GitHub with

BSD-3-clause license

UEFIUnified Extensible Firmware Interface. Firmware interface between the platform and the operating system. Predominate interfaces are in the boot services (BS) or pre-OS. Few runtime (RT) services.On AArch64, it (tianocore EDK2) works with ARM TF as BL33 in EL2

ENGINEERS AND DEVICESWORKING TOGETHER

APEI (ACPI Platform Error Interfaces)

APEI

EINJ ERST

BERT HEST For runtimeFor last crash

For TestingFor Storage Provides a standard way to convey error infofrom Firmware to OS

ENGINEERS AND DEVICESWORKING TOGETHER

APEI tables HEST (Hardware Error Source Table)Key info: HOW to get trigger WHERE are the error records HOW to release records’ mem

For ARM64 : GHES v2HOW to get trigger: Notification StructureWHERE are the error records:

Error Status Address(GAS : Generic Address Structure)

HOW to release records’ mem:

Read Ack Register

For IA-32 : MCE/CMC/NMI

For PCI: AER Root Port/Endpoint/Bridge

For generic hardware: GHES (Generic Hardware Error Source) V1/V2

ENGINEERS AND DEVICESWORKING TOGETHER

APEI tablesBERT: Boot Error Record TableRecord fatal errors, then report it in the second boot

CPER (in the Appendix of UEFI spec)Common Platform Error Record, with this help, OS can get all kinds of error we could think of.

ENGINEERS AND DEVICESWORKING TOGETHER

APEI tables -- ERST & EINJ● ERST: Error Record Serialization Table

○ Operation abstract, provides details necessary to communicate with on-board persistent storage for error recording

● EINJ: Error Injection Table○ Operation abstract, provides a generic

interface which OSPM can inject hardware errors to the platform without requiring platform specific software.

ENGINEERS AND DEVICESWORKING TOGETHER

SDEI usage in RASSoftware Delegated Exception Interface:An interface between FW & OS, for registering, notifying and servicing system events using SMC/HVC. SDEI Specification (ARM DEN0054A)

ENGINEERS AND DEVICESWORKING TOGETHER

Prototype Solution for Error Handling

MM Secure Partition, Secure Partition ManagerUncorrected error -- HEST & MM

ENGINEERS AND DEVICESWORKING TOGETHER

What Are We Doing● Define standard interfaces to enable FF handling of AP RAS errors● Demonstrate use of RAS extensions● Demonstrate interfaces with reference software and platforms

○ uncorrected DIMM & CPU errors

ENGINEERS AND DEVICESWORKING TOGETHER

MM Secure Partition● MM Secure Partition implements

management functions , runs in S-EL0 to achieve isolation from S-EL1 & EL3

○ Leverages existing firmware code based on EDK2: Standalone MM

● Partition communicates with ARM TF through a standard interface: MM_COMMUNICATE SMC

● Partition is managed by ARM TF● ARM TF BL31 stage owns EL3

and S-EL1● Secure partition resources are

described in BL31 platform port● Minimise code in EL3 and

delegate RAS error handling

ENGINEERS AND DEVICESWORKING TOGETHER

Secure Partition Manager (SPM)● Secure Partition Manager in BL31 exports standard ABI to

○ Initialize the partition○ Delegate SMC requests to the partition

ENGINEERS AND DEVICESWORKING TOGETHER

Uncorrected

Error

-- HE

ST &

MM

ENGINEERS AND DEVICESWORKING TOGETHER

Uncorrected Error -- HEST & MM1. System boot: BootROM-->BL2-->BL3x

a. BL31 initializes SPM (includes MM dispatcher) and SDEI dispatcher.

b. UEFI (BL33), DXE, UEFI Platform Driver:i. query SPI (Secure Partition

Image, BL32) for error source infoii. SPI return error source info back

to UEFIiii. UEFI map in and mark error

record region as Runtime Services Data Region

iv. Update/add error source info in HEST

2. OS starts running: HEST driver scan HEST table and register error handlers by SDEI

3. UE occurred, the event will be routed to EL3 (SPM)

4. SPM routes the event to RAS error handler in S-EL0 (MM Foundation)

5. MM Foundation creates the CPER blobs by the info from RAS Extension

6. SPM notifies SDEI to call the corresponding OS registered handler

7. OS gets the CPER blobs by Error Status Address block, process the error, try to recovery.

8. report the error event by RAS event9. rasdaemon log error info from RAS event to

recorder

ENGINEERS AND DEVICESWORKING TOGETHER

Demo Time● Prototype RAS solution on FVP

○ arm-trusted-firmware (bl1, bl2, bl31)

○ tianocore edk2 (bl32, bl33) ○ Linux kernel, Shell command

Status and Future Plans● Current development status● Ongoing development● TODO list for Reference Solution

ENGINEERS AND DEVICESWORKING TOGETHER

Current development status● Hardware

○ ARM engineers are working FVP, LEG team is developing on QEMU○ RAS spec has released (ARM DDI 0587A)

● Firmware○ SDEI

■ SDEI Specification released (ARM DEN0054A)■ SDEI added as hardware error notification type in ACPI 6.2■ Linux SDEI client implementation v3 patchset has been posted on

kvmarm and devicetree mailing list.■ ACPICA support for SDEI up-streamed■ SDEI DT bindings acked■ ARM TF support posted to github and includes

● SDEI Dispatcher ● Framework for managing interrupts handled in EL3

● OS (Linux): ○ APEI on ARM64 can be enabled in kernel.○ Memory failure support merged

ENGINEERS AND DEVICESWORKING TOGETHER

Ongoing development● ARM TF

○ Simplify error interrupt handling for platform ports

○ Framework for handling External aborts (EA) in design

○ RAS Extensions support in design■ ESB■ RAS Error Record driver

● EDK2○ Driver for creating APEI HEST under development

○ Library for creating APEI CPERs under development

○ Prototyping use of Standalone MM partition to create error records on QEMU

● OS(Linux)○ KVM changes for virtualizing SDEI under development

ENGINEERS AND DEVICESWORKING TOGETHER

TODO list● Hardware

○ Test on a real hardware (ARMv8.2, including RAS extension) ● Firmware

○ ARM-TF■ Support for Double fault handling■ Support for v8.4 RAS Extensions

○ EDK2:■ Support for BERT■ ERST and EINJ implementation

ENGINEERS AND DEVICES

WORKING TOGETHER

Acknowledgments● Achin Gupta (ARM)● John Feeney (Red Hat)● Leif Lindholm (ARM)● Supreeth Venkatesh (ARM)

Thank You

#SFO17BUD17 keynotes and videos on: connect.linaro.orgFor further information: www.linaro.org