Intel® Itanium® Processor Family in Mission Critical Computing
1D02
Andrey Semin, Technical Marketing Manager
Intel GmbH
HP User Society IT-Symposium 2007
Nürnberg, April 17th, 2007
Agenda
• Introduction
• Intel commitment to reliability
• Processor data integrity
• Integrated platform error handling
• Advanced platform RAS capabilities
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2007, Intel Corporation. 2
Soft Errors
• Ambient radiation can result in transient errors
– Alpha particles or neutron strikes result in charge events that can disrupt logic
– Soft errors do not cause permanent damage and do not indicate that the processor is degraded
• Intel architectural analysis, circuit analysis, process design rules, microarchitecture, and machine check architecture address soft error rates
– Some susceptibility remains, since microarchitectural coverage can be expensive
– Soft errors become increasingly difficult to address with smaller feature sizes
– Lockstep operation, available on Montecito, provides highest level of reliability
Hard Errors
• Wear-out phenomena can result in functional failure and sometimes permanent defects
– Electromigration
– Hot carrier effect
– Silicide/poly cracking
• Intel circuit, layout, process rules, and manufacturing ensure that products meet 7-year warranted lifetime
– Intel designs for reliability correctness
– Manufacturing defects can reduce product life
– Pellston Technology addresses hard errors in Montecito processor’s L3 cache
RISC/ Mainframe Class Reliability
• Data integrity
– Error coverage across key elements of the processor – caches, TLB, FSB
– Montecito extends coverage to registers and hard errors in L3 cache, and adds lockstep support
• High availability
– Seamless error detection, correction, and recovery enabled by advanced machine check architecture
– Integrated platform error handling through hardware-firmware-OS partnership
• Comprehensive platform RAS
– Enterprise leaders deliver comprehensive RAS capabilities for Itanium® platforms
– Hard partitions, memory scrubbing, memory sparing, redundancy, hot swapping, and more
Extensive Error Coverage (1)
L1 cache/ tags    Parity
L2 cache/ tags    ECC
L3 cache/ tags    ECC
Registers         Parity
TLB               Parity
FSB               ECC
Integrated Error Handling
Applications
Operating System  Logs errors, initiates recovery
Firmware          Seamlessly handles errors
Hardware          ECC coverage & parity protection
Comprehensive reliability, availability, and serviceability on Itanium platforms
(1) Error coverage shown is for Montecito. Register coverage includes integer, floating point, and branch prediction.
Intel Commitment to Reliability
• Thousands of Intel engineers working on reliability analysis, design, and verification
– Transient errors, wear-out phenomena, and post-silicon reliability analysis
• Extensive validation on Itanium® architecture
• Demonstrated capability of >500 years of continuous uptime
– On system using Itanium 2 processor 6M
Itanium architecture has had the most extensive error handling validation of any Intel processor
Itanium® 2 Processor 9M Error Protection
• Correction for most arrays via processor or PAL
• Detection for other exposed structures
• Mix of ECC, parity, multi-hit, and single-bit transition approaches adapted to structure
Error coverage across key elements of the processor
Montecito Error Protection
• Builds on previous generation protection
• Extends protection and correction capabilities to registers
• Adds coverage of hard errors in L3 cache with Pellston Technology
• Added protection for performance structures aids lockstep support
Montecito further advances data integrity with extended coverage
Montecito Error Coverage Summary
Structure        Protection       Action (1)
L1 cache data    Parity           PAL-correctable
L1 cache tags    Parity           PAL-correctable
L2 cache data    ECC              HW-correctable
L2 cache tags    ECC              HW-correctable
L3 cache data    ECC, Pellston    HW-correctable; cache line disabled for hard errors
L3 cache tags    ECC              HW-correctable
Registers        Parity           OS-recoverable
TLB              Parity           OS-recoverable
FSB              ECC              HW-correctable 1-bit errors; OS-recoverable 2-bit errors
(1) PAL-correctable and recoverable errors are dependent upon microarchitectural state
Error coverage across all major data structures
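The difference between parity (detect only) and ECC (correct 1-bit, detect 2-bit) in the table above can be illustrated with a toy example. This is a minimal sketch, assuming even parity and a textbook (8,4) SECDED Hamming code. The function names are made up for illustration, and the actual codes used in the processor's caches and FSB are not described in these slides.

```python
# Illustrative only: toy even parity and a toy (8,4) SECDED code.
# Real cache and bus ECC use wider codes; this is not Intel's design.

def parity_bit(bits):
    """Even parity: XOR of all bits."""
    p = 0
    for b in bits:
        p ^= b
    return p

def parity_ok(bits, stored_parity):
    """Parity detects an odd number of flipped bits but cannot locate them."""
    return parity_bit(bits) == stored_parity

def secded_encode(data):
    """Encode 4 data bits as Hamming(7,4) plus one overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]   # code positions 1..7
    return word + [parity_bit(word)]       # position 8: overall parity

def secded_decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(word[:7])
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos                # XOR of positions holding a 1
    overall_ok = parity_bit(word) == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single-bit error and repair it.
        if syndrome:
            w[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with clean overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [w[2], w[4], w[5], w[6]]
```

In this toy model a single flipped bit comes back as "corrected" with the original data, while two flipped bits are reported as "uncorrectable", which is the kind of case the table hands to the OS for recovery.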
Advanced Cache Reliability
Intel® Cache Safe Technology
• Benefits
– Automatically disables cache lines in the event of hard cache memory error
– Reduces likelihood of 2-bit ECC errors in L3 cache lines that have single-bit hard failures
– Allows processor and system to continue normal operation
• How it works
– Cache line access with error detected
– Cache line is tested for hard error
– If hard error is detected, cache line is disabled while processor and system continue normal operation
Intel® Cache Safe Technology (codename Pellston Technology) further advances Itanium® architecture’s reliability
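The "how it works" steps can be sketched as a small simulation. This is a toy model under stated assumptions: the `CacheLine` class, the stuck-at-0 fault, and the write/read-back test are all hypothetical stand-ins, since the slides do not describe the actual Pellston test algorithm.

```python
# Toy model of the detect / test / disable flow. All names are hypothetical
# and the read-back test is an assumed stand-in for the real algorithm.

class CacheLine:
    def __init__(self, width=8, stuck_bit=None):
        self.width = width
        self.stuck_bit = stuck_bit       # index of a bit stuck at 0, or None
        self.bits = [0] * width
        self.enabled = True

    def write(self, bits):
        self.bits = list(bits)
        if self.stuck_bit is not None:
            self.bits[self.stuck_bit] = 0    # hard fault: bit cannot hold a 1

    def read(self):
        return list(self.bits)

def has_hard_error(line):
    """Write known patterns and read them back; a mismatch means a hard fault."""
    for pattern in ([1] * line.width, [0] * line.width):
        line.write(pattern)
        if line.read() != pattern:
            return True
    return False

def handle_detected_error(line):
    """Run after an access to `line` reported an error."""
    if has_hard_error(line):
        line.enabled = False             # take the line out of service
        return "line disabled"
    line.write([0] * line.width)         # soft error: line stays in service
    return "line invalidated"
```

A line built with `CacheLine(stuck_bit=3)` ends up disabled, while a healthy line that only saw a transient error stays enabled. In both cases the caller carries on, which matches the "processor and system continue normal operation" step.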
Montvale Processor Implementation of Intel® Cache Safe Technology
– Automatically disables L3 cache lines exhibiting hard failures
• 1: Cache line access with error detected
• 2: PAL test for hard error in cache line, using Pellston Algorithm
• 3: If hard error is detected, cache line is disabled
• 4: Processor and system continue normal operation
[Diagram labels: L3 cache, Error Check, Corrected Path, Error detected, PAL, Data is Consumed by Core]
Cache Safe Technology Improves Cache Reliability
Lockstep Technology
• Processor socket-level redundancy for the highest level of data integrity, supported by Montecito
• Compare system interfaces of two processors
– Deterministic system interface
– In-line error correction prevents common, non-fatal losses of lockstep
– IERR indicates detected and recoverable error
[Diagram labels: Montecito (two processors), System Interface, IERR, Checker, Fatal, Recoverable]
Lockstep technology enables reliability for the most mission critical applications
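The idea of comparing the outputs of two redundant processors can be sketched in a few lines. This is a minimal sketch, assuming each processor is modelled as a plain deterministic function. `run_lockstep` and its arguments are hypothetical names, and real lockstep compares system interface signals rather than Python values.

```python
# Minimal sketch of output comparison between two lockstepped units.
# `step_a` and `step_b` are hypothetical stand-ins for the two processors.

def run_lockstep(step_a, step_b, inputs):
    """Feed identical inputs to both units and compare outputs every cycle.

    Returns the list of agreed outputs, or raises on the first divergence.
    """
    outputs = []
    for cycle, value in enumerate(inputs):
        out_a = step_a(value)
        out_b = step_b(value)
        if out_a != out_b:
            # The checker has no way to tell which unit is wrong.
            raise RuntimeError(f"lockstep mismatch at cycle {cycle}")
        outputs.append(out_a)
    return outputs
```

The comparison only works if both units are deterministic for the same inputs, which is why the slide lists a deterministic system interface as a requirement.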
Advanced Platform Reliability
Advanced Machine Check Architecture
Application Layer
Operating System      Logs errors & initiates recovery
Firmware (PAL & SAL)  Seamlessly handles errors
Hardware              CPU & chipset offer extensive ECC coverage & parity protection
• Extensive hardware error detection and correction
• Well-defined flow for coordinated platform error handling
• OS interacts with firmware to correct and recover from complex platform errors
Advanced machine check architecture provides smooth error handling for high reliability and availability
Reliability Partnership
Smooth, integrated platform error handling
Advanced Error Handling Process
Rows are listed in order of increasing error severity.
Category          Error Handling                            Example
Corrected         Hardware corrected: Execution continues   Most 1-bit errors
Corrected         Firmware corrected: Execution continues   1-bit error in write-through cache; hard error in L3 cache with Pellston Technology (1)
Corrected         OS corrected: Execution continues         Translation register error
Recoverable       OS recoverable: System available          2-bit error in application
Non-recoverable   System reset                              2-bit error in kernel
(1) Supported starting with Montecito processor.
Hardware corrects most errors
Multi-level error handling extends availability
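The severity levels on this slide map naturally onto a small lookup. The sketch below is illustrative only: the enum, the string labels, and the `outcome` function are hypothetical, and the pairings simply restate the examples given on the slide.

```python
# Illustrative lookup of the slide's examples; names and labels are hypothetical.
from enum import Enum

class Handling(Enum):
    HARDWARE_CORRECTED = 1
    FIRMWARE_CORRECTED = 2
    OS_CORRECTED = 3
    OS_RECOVERABLE = 4
    SYSTEM_RESET = 5

# Example errors from the slide, keyed by a free-text label.
EXAMPLES = {
    "most 1-bit errors": Handling.HARDWARE_CORRECTED,
    "1-bit error in write-through cache": Handling.FIRMWARE_CORRECTED,
    "hard error in L3 cache": Handling.FIRMWARE_CORRECTED,
    "translation register error": Handling.OS_CORRECTED,
    "2-bit error in application": Handling.OS_RECOVERABLE,
    "2-bit error in kernel": Handling.SYSTEM_RESET,
}

def outcome(handling):
    """Return what the system does at each handling level."""
    if handling in (Handling.HARDWARE_CORRECTED,
                    Handling.FIRMWARE_CORRECTED,
                    Handling.OS_CORRECTED):
        return "execution continues"
    if handling is Handling.OS_RECOVERABLE:
        return "system available"
    return "system reset"

def most_severe(labels):
    """Pick the highest handling level among several reported errors."""
    return max((EXAMPLES[label] for label in labels), key=lambda h: h.value)
```

Only the last level takes the system down. Every level before it either keeps execution running or keeps the system available.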
PAL & SAL Error Handling Benefits
• PAL provides programmatic interface for processor error data
– Abstraction from processor MSRs
– Improved handling of multiple errors
– Detailed communication on corrected, recoverable, and fatal errors
• PAL and Pellston Technology
– PAL scrubs cache line to identify type of errors
– For soft errors PAL invalidates line and resumes execution
– For hard errors PAL disables line and resumes execution
• SAL specification provides testable flows and interfaces
– Defined flows for corrected machine checks, recoverable errors, and error logging
– Programmatic interface for processor and platform error data
– Multiprocessor error handling flows enable correction and recovery from global errors
Advanced firmware is integral to strong RAS capabilities
OS Error Recovery
• Extends error handling capability from firmware to operating system
• Flow defined by Itanium® 2 processor’s advanced machine check architecture
– Processor, platform, PAL, SAL, and OS responsibilities defined
• OS can correct errors if redundant data is available
– e.g., translation registers
• OS can terminate applications to contain errors and maintain system availability
– e.g., double-bit memory errors
Itanium architecture’s advanced machine check architecture supports OS error recovery
Advanced Platform RAS
Feature (1), compared across: Itanium® 2 Processor, Intel® Xeon® Processor MP, Intel® Xeon® Processor, IBM Power*, Sun SPARC*
Processor redundancy check (lockstep) ✓ Montecito
Advanced machine check architecture ✓ ✓ ✓
L3 cache hard error recovery (Pellston) ✓ Montecito ✓ Tulsa ✓
Hard partitioning ✓ ✓ ✓
Cache ECC or parity protection ✓ ✓ ✓ ✓ ✓
System bus ECC ✓ ✓ ✓ ✓
I/O error recovery ✓ ✓ ✓ ✓ ✓
Memory single device data correction ✓ ✓ ✓ ✓ ✓
Memory scrubbing ✓ ✓ ✓ ✓ ✓
Processor & memory hot swap ✓ ✓ ✓ ✓
Component (2) redundancy/ hot swap ✓ ✓ ✓ ✓ ✓
Itanium platforms support high end RAS for RISC and mainframe replacement
(1) Some features may not be supported on all available hardware systems or operating systems.
(2) e.g. I/O cards, fans, power supplies.
Enabling Reliability
[Chart: downtime per year, axis in minutes from 0 to 6]
HP Integrity NonStop: 99.99999% availability, 3 seconds downtime per year
IBM zSeries: 99.999% availability, 5 minutes 15 seconds downtime per year
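The downtime figures follow directly from the availability percentages. A minimal sketch of the arithmetic, assuming a 365-day year; the function names are chosen here for illustration.

```python
# Convert an availability percentage into downtime per year.
# Assumes a 365-day year; function names are illustrative.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def downtime_seconds_per_year(availability_percent):
    """Seconds per year during which the system is unavailable."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)

def as_minutes_seconds(seconds):
    """Round to whole seconds and split into (minutes, seconds)."""
    return divmod(round(seconds), 60)
```

With these definitions, 99.999% works out to about 315 seconds (5 minutes 15 seconds) and 99.99999% to about 3 seconds, matching the two values shown in the chart.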
Summary
• Itanium® 2 processors are designed to meet the most demanding enterprise reliability requirements
– RISC/ mainframe class RAS
– Subject to Intel’s most extensive error handling validation
• Error coverage across the processor ensures data integrity
– ECC or parity for caches, TLB, and FSB
– Montecito extends coverage to registers and hard errors in L3 cache, and adds lockstep support
• Advanced machine check architecture supports high availability
– Smooth error detection, correction, and recovery enabled by integrated multi-level platform error handling
• Enterprise leaders deliver comprehensive RAS capabilities for Itanium platforms
– Architected APIs and error handling flows enable reliability innovation
– Available RAS features include hard partitions, memory scrubbing, memory sparing, redundancy, hot swapping, and more
For technical documents which include more details on Itanium architecture reliability see:
developer.intel.com/design/itanium2/documentation.htm
Backup
Firmware Enables Reliability
Architected APIs and error handling flows for hardware, PAL, SAL, and OS enable reliability innovation
[Diagram: firmware stack, top to bottom]
Operating System Software
Extensible Firmware Interface (EFI)
System Abstraction Layer (SAL)
Processor Abstraction Layer (PAL)
Processor (Hardware)
Platform (Hardware)
Machine Check Architecture
Defines processor, chipset, firmware, and operating system responsibilities for advanced error handling
RAS Feature Summary
Error Detection / Error Correction: Processor Cache Parity/ECC; SPS Snoop Filter ECC; ECC/parity on buses; Memory ECC; Memory scrubbing; Control/operational errors; Timeout Detection
Error Containment: Correction/Data poisoning; Transaction error response; Thermal Sensors/Management
Error Status/Signaling: Error typing; Error masking; Advanced Machine Check Architecture; First error / Next error status
Error Logging: Error logs (control/data); Multi-node Error Trail
Serviceability: Memory Device Failure Recovery; PCI hotplug; Node hotplug
Multinode Features: Multi-pathing