1D02 - Itanium in mission critical computing V2Intel® Itanium® Processor Family in Mission...

Intel® Itanium® Processor Family in Mission Critical Computing. 1D02. Andrey Semin, Technical Marketing Manager, Intel GmbH. HP User Society IT-Symposium 2007, Nürnberg, April 17th, 2007


Page 1:

Intel® Itanium® Processor Family in Mission Critical Computing

1D02

Andrey Semin, Technical Marketing Manager

Intel GmbH

HP User Society IT-Symposium 2007

Nürnberg, April 17th, 2007

Page 2:

Agenda

• Introduction

• Intel commitment to reliability

• Processor data integrity

• Integrated platform error handling

• Advanced platform RAS capabilities

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2007, Intel Corporation. 2

Page 3:

Soft Errors

• Ambient radiation can result in transient errors

– Alpha particle or neutron strikes result in charge events that can disrupt logic

– Soft errors do not cause permanent damage and do not indicate that the processor is degraded

• Intel architectural analysis, circuit analysis, process design rules, microarchitecture, and machine check architecture address soft error rates

– Some susceptibility remains, since microarchitectural coverage can be expensive

– Soft errors become increasingly difficult to address with smaller feature sizes

– Lockstep operation, available on Montecito, provides the highest level of reliability

Page 4:

Hard Errors

• Wear-out phenomena can result in functional failure and sometimes permanent defects

– Electromigration

– Hot carrier effect

– Silicide/poly cracking

• Intel circuit, layout, process rules, and manufacturing ensure that products meet the 7-year warranted lifetime

– Intel designs for reliability correctness

– Manufacturing defects can reduce product life

– Pellston Technology addresses hard errors in Montecito processor’s L3 cache

Page 5:

RISC/Mainframe Class Reliability

• Data integrity

– Error coverage across key elements of the processor: caches, TLB, FSB

– Montecito extends coverage to registers and hard errors in L3 cache, and adds lockstep support

• High availability

– Seamless error detection, correction, and recovery enabled by advanced machine check architecture

– Integrated platform error handling through hardware-firmware-OS partnership

• Comprehensive platform RAS

– Enterprise leaders deliver comprehensive RAS capabilities for Itanium® platforms

– Hard partitions, memory scrubbing, memory sparing, redundancy, hot swapping, and more

Extensive Error Coverage (1)

L1 cache/tags    Parity
L2 cache/tags    ECC
L3 cache/tags    ECC
Registers        Parity
TLB              Parity
FSB              ECC

Integrated Error Handling

Applications
Operating System    Logs errors, initiates recovery
Firmware            Seamlessly handles errors
Hardware            ECC coverage & parity protection

Comprehensive reliability, availability, and serviceability on Itanium platforms

1 Error coverage shown is for Montecito. Register coverage includes integer, floating point, and branch prediction.

Page 6:

Intel Commitment to Reliability

• Thousands of Intel engineers working on reliability analysis, design, and verification

– Transient errors, wear-out phenomena, and post-silicon reliability analysis

• Extensive validation on Itanium® architecture

• Demonstrated capability of >500 years of continuous uptime

– On system using Itanium 2 processor 6M

Itanium architecture has had the most extensive error handling validation of any Intel processor

Page 7:

Itanium® 2 Processor 9M Error Protection

• Correction for most arrays via processor or PAL

• Detection for other exposed structures

• Mix of ECC, parity, multi-hit, and single-bit transition approaches adapted to structure

Error coverage across key elements of the processor

Page 8:

Montecito Error Protection

• Builds on previous generation protection

• Extends protection and correction capabilities to registers

• Adds coverage of hard errors in L3 cache with Pellston Technology

• Added protection for performance structures aids lockstep support

Montecito further advances data integrity with extended coverage

Page 9:

Montecito Error Coverage Summary

Structure        Protection       Action (1)
L1 cache data    Parity           PAL-correctable
L1 cache tags    Parity           PAL-correctable
L2 cache data    ECC              HW-correctable
L2 cache tags    ECC              HW-correctable
L3 cache data    ECC, Pellston    HW-correctable; cache line disabled for hard errors
L3 cache tags    ECC              HW-correctable
Registers        Parity           OS-recoverable
TLB              Parity           OS-recoverable
FSB              ECC              HW-correctable 1-bit errors; OS-recoverable 2-bit errors

1 PAL-correctable and recoverable errors are dependent upon microarchitectural state

Error coverage across all major data structures
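
The FSB row separates 1-bit errors (corrected in hardware) from 2-bit errors (detected and left to OS recovery). The sketch below shows how a single-error-correct, double-error-detect (SECDED) code produces that split. It uses an extended Hamming(8,4) code chosen purely for illustration; the slides do not state which codes or word widths the processor uses, and the names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds an overall parity bit; indices 1..7 are the classic
    Hamming(7,4) layout with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # make total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of positions holding a 1
    odd_parity = sum(code) % 2 == 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Single-bit error: the syndrome is the flipped position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                            # inject a 1-bit error
    print(decode(word))
    word[2] ^= 1                            # inject a second error
    print(decode(word))
```

Parity alone, as listed for the L1 cache, registers, and TLB, corresponds to keeping only the overall parity bit: it detects a single flipped bit but cannot locate it, which is consistent with those rows relying on PAL or the OS for correction or recovery.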

Page 10:

Advanced Cache Reliability

Intel® Cache Safe Technology

• Benefits

– Automatically disables cache lines in the event of hard cache memory error

– Reduces likelihood of 2-bit ECC errors in L3 cache lines that have single-bit hard failures

– Allows processor and system to continue normal operation

• How it works

– Cache line access with error detected

– Cache line is tested for hard error

– If hard error is detected, cache line is disabled while processor and system continue normal operation

Intel® Cache Safe Technology (codename Pellston Technology) further advances Itanium® architecture’s reliability

Page 11:

Montvale Processor Implementation of Intel® Cache Safe Technology

– Automatically disables L3 cache lines exhibiting hard failures

• Cache line access with error detected

• PAL test for hard error in cache line

• If hard error is detected, cache line is disabled

• Processor and system continue normal operation

[Diagram labels: L3 cache; PAL; Error Check Corrected Path; Data is Consumed by Core; (1) Error detected; (2) Tests for hard error in cache using Pellston Algorithm; (3) If hard error detected, cache line disabled]

Cache Safe Technology Improves Cache Reliability

Page 12:

Lockstep Technology

• Processor socket-level redundancy for the highest level of data integrity, supported by Montecito

• Compare system interfaces of two processors

– Deterministic system interface

– In-line error correction prevents common, non-fatal losses of lockstep

– IERR indicates detected and recoverable error

[Diagram labels: Montecito (two processors); System Interface (one per processor); IERR (one per processor); Checker; Fatal; Recoverable]

Lockstep technology enables reliability for the most mission critical applications
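
A minimal sketch of the comparison the slide describes, assuming a checker that sees each processor's system-interface output plus its IERR signal once per cycle. The slide only names the signals and the Fatal and Recoverable outcomes; the decision rule here (a self-reported IERR is recoverable, a silent mismatch is fatal) and the names `InterfaceOutput` and `check` are assumptions for illustration, not the documented hardware behavior.

```python
from dataclasses import dataclass


@dataclass
class InterfaceOutput:
    value: int   # what the processor drove on its system interface this cycle
    ierr: bool   # processor reports a detected, recoverable internal error


def check(a: InterfaceOutput, b: InterfaceOutput) -> str:
    """Compare two lockstepped processors for one cycle.

    Assumed rule: if either processor raised IERR, the faulty side is
    known, so the event is treated as recoverable. If the outputs differ
    and neither side noticed a problem, the checker cannot tell which
    one is right, so the event is treated as fatal.
    """
    if a.ierr or b.ierr:
        return "recoverable"
    if a.value != b.value:
        return "fatal"
    return "ok"


if __name__ == "__main__":
    print(check(InterfaceOutput(0x2A, False), InterfaceOutput(0x2A, False)))
    print(check(InterfaceOutput(0x2A, True), InterfaceOutput(0x2A, False)))
    print(check(InterfaceOutput(0x2A, False), InterfaceOutput(0x2B, False)))
```

The comparison only makes sense because the system interface is deterministic: two healthy processors given the same inputs must drive identical outputs on the same cycle.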

Page 13:

Advanced Platform Reliability

Advanced Machine Check Architecture

Application Layer

Operating System: Logs errors & initiates recovery

Firmware (PAL & SAL): Seamlessly handles errors

Hardware: CPU & chipset offer extensive ECC coverage & parity protection

• Extensive hardware error detection and correction

• Well-defined flow for coordinated platform error handling

• OS interacts with firmware to correct and recover from complex platform errors

Advanced machine check architecture provides smooth error handling for high reliability and availability

Page 14:

Reliability Partnership

Smooth, integrated platform error handling

Page 15:

Advanced Error Handling Process

Rows are listed in order of increasing error severity.

Category           Error Handling                             Example error
Corrected          Hardware corrected: execution continues    Most 1-bit errors
Corrected          Firmware corrected: execution continues    1-bit error in write-through cache; hard error in L3 cache with Pellston Technology (1)
Corrected          OS corrected: execution continues          Translation register error
Recoverable        OS recoverable: system available           2-bit error in application
Non-recoverable    System reset                               2-bit error in kernel

1 Supported starting with Montecito processor.

Hardware corrects most errors

Multi-level error handling extends availability
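
The severity ladder above can be read as a lookup from error type to handling level. A minimal sketch of that reading follows; the enum, the table, and the function name `outcome` are hypothetical, and the error strings are simply the examples given on the slide rather than identifiers from any real machine check interface.

```python
from enum import IntEnum


class Level(IntEnum):
    """Handling levels in order of increasing severity."""
    HW_CORRECTED = 1
    FW_CORRECTED = 2
    OS_CORRECTED = 3
    OS_RECOVERABLE = 4
    SYSTEM_RESET = 5


# Category and effect for each level, as listed on the slide.
HANDLING = {
    Level.HW_CORRECTED: ("Corrected", "execution continues"),
    Level.FW_CORRECTED: ("Corrected", "execution continues"),
    Level.OS_CORRECTED: ("Corrected", "execution continues"),
    Level.OS_RECOVERABLE: ("Recoverable", "system available"),
    Level.SYSTEM_RESET: ("Non-recoverable", "system reset"),
}

# The slide's example errors, mapped to the level that handles them.
EXAMPLES = {
    "1-bit error": Level.HW_CORRECTED,
    "1-bit error in write-through cache": Level.FW_CORRECTED,
    "hard error in L3 cache": Level.FW_CORRECTED,   # Montecito onward
    "translation register error": Level.OS_CORRECTED,
    "2-bit error in application": Level.OS_RECOVERABLE,
    "2-bit error in kernel": Level.SYSTEM_RESET,
}


def outcome(error: str):
    """Return (level, category, effect) for one of the example errors."""
    level = EXAMPLES[error]
    category, effect = HANDLING[level]
    return level, category, effect


if __name__ == "__main__":
    for name in EXAMPLES:
        level, category, effect = outcome(name)
        print(f"{name}: {level.name} -> {category}, {effect}")
```

Only the last level interrupts the whole system; every level below it keeps either the running program or at least the system available.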

Page 16:

PAL & SAL Error Handling Benefits

• PAL provides programmatic interface for processor error data

– Abstraction from processor MSRs

– Improved handling of multiple errors

– Detailed communication on corrected, recoverable, and fatal errors

• PAL and Pellston Technology

– PAL scrubs cache line to identify type of errors

– For soft errors PAL invalidates line and resumes execution

– For hard errors PAL disables line and resumes execution

• SAL specification provides testable flows and interfaces

– Defined flows for corrected machine checks, recoverable errors, and error logging

– Programmatic interface for processor and platform error data

– Multiprocessor error handling flows enable correction and recovery from global errors

Advanced firmware is integral to strong RAS capabilities
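
The PAL and Pellston Technology bullets describe a scrub that separates soft from hard errors. The slides do not spell out the Pellston algorithm itself, so the following is only a sketch under assumptions: a cache line is modeled with optional stuck-at bits, and the scrub writes test patterns and reads them back. The class `CacheLine`, the function `scrub`, and the choice of patterns are all hypothetical.

```python
class CacheLine:
    """Toy cache line; stuck-at masks model a hard (permanent) fault."""

    def __init__(self, width=8, stuck_at_0=0, stuck_at_1=0):
        self.width = width
        self.stuck_at_0 = stuck_at_0   # bits that always read as 0
        self.stuck_at_1 = stuck_at_1   # bits that always read as 1
        self.data = 0
        self.valid = True
        self.disabled = False

    def write(self, value):
        self.data = (value & ~self.stuck_at_0) | self.stuck_at_1

    def read(self):
        return self.data


def scrub(line):
    """Classify an error on `line` after ECC has reported it.

    Writes all-zeros and all-ones and reads each back. A mismatch means
    the fault persists across rewrites, so it is treated as hard and the
    line is disabled. Otherwise the error was transient (soft) and the
    line is only invalidated so the next access refetches clean data.
    """
    all_ones = (1 << line.width) - 1
    for pattern in (0, all_ones):
        line.write(pattern)
        if line.read() != pattern:
            line.valid = False
            line.disabled = True
            return "hard"
    line.valid = False
    return "soft"


if __name__ == "__main__":
    healthy = CacheLine()
    faulty = CacheLine(stuck_at_0=0b0000_0100)
    print(scrub(healthy), healthy.disabled)
    print(scrub(faulty), faulty.disabled)
```

In both cases execution resumes afterwards; the difference is whether the line remains available for future use.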

Page 17:

OS Error Recovery

• Extends error handling capability from firmware to operating system

• Flow defined by Itanium® 2 processor’s advanced machine check architecture

– Processor, platform, PAL, SAL, and OS responsibilities defined

• OS can correct errors if redundant data is available

– e.g., translation registers

• OS can terminate applications to contain errors and maintain system availability

– e.g., double-bit memory errors

Itanium architecture’s advanced machine check architecture supports OS error recovery

Page 18:

Advanced Platform RAS

Columns: Feature (1) / Itanium® 2 Processor / Intel® Xeon® Processor MP / Intel® Xeon® Processor / IBM Power* / Sun SPARC*

Processor redundancy check (lockstep): ✓ Montecito

Advanced machine check architecture: ✓ ✓ ✓

L3 cache hard error recovery (Pellston): ✓ Montecito, ✓ Tulsa, ✓

Hard partitioning: ✓ ✓ ✓

Cache ECC or parity protection: ✓ ✓ ✓ ✓ ✓

System bus ECC: ✓ ✓ ✓ ✓

I/O error recovery: ✓ ✓ ✓ ✓ ✓

Memory single device data correction: ✓ ✓ ✓ ✓ ✓

Memory scrubbing: ✓ ✓ ✓ ✓ ✓

Processor & memory hot swap: ✓ ✓ ✓ ✓

Component (2) redundancy/hot swap: ✓ ✓ ✓ ✓ ✓

Itanium platforms support high end RAS for RISC and mainframe replacement

1 Some features may not be supported on all available hardware systems or operating systems.

2 e.g. I/O cards, fans, power supplies.

Page 19:

Enabling Reliability

[Chart: downtime per year, axis in minutes]

HP Integrity NonStop: 99.99999%, 3 seconds per year

IBM zSeries: 99.999%, 5 minutes 15 seconds per year
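
The downtime figures on this chart follow directly from the availability percentages. A short worked calculation, assuming a 365-day year; the function name `downtime_seconds` is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 seconds


def downtime_seconds(availability_percent):
    """Expected downtime per year, in seconds, for a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for availability in (99.999, 99.99999):
        seconds = downtime_seconds(availability)
        minutes, rest = divmod(round(seconds), 60)
        print(f"{availability}% -> {seconds:.2f} s ({minutes} min {rest} s)")
```

Five nines works out to about 315 seconds, which is 5 minutes 15 seconds, and seven nines to about 3 seconds, matching the two labels on the chart.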

Page 20:

Summary

• Itanium® 2 processors are designed to meet the most demanding enterprise reliability requirements

– RISC/mainframe class RAS

– Subject to Intel’s most extensive error handling validation

• Error coverage across the processor ensures data integrity

– ECC or parity for caches, TLB, and FSB

– Montecito extends coverage to registers and hard errors in L3 cache, and adds lockstep support

• Advanced machine check architecture supports high availability

– Smooth error detection, correction, and recovery enabled by integrated multi-level platform error handling

• Enterprise leaders deliver comprehensive RAS capabilities for Itanium platforms

– Architected APIs and error handling flows enable reliability innovation

– Available RAS features include hard partitions, memory scrubbing, memory sparing, redundancy, hot swapping, and more

For technical documents which include more details on Itanium architecture reliability see:

developer.intel.com/design/itanium2/documentation.htm

Page 21:

Backup

Page 22:

Firmware Enables Reliability

Architected APIs and error handling flows for hardware, PAL, SAL, and OS enable reliability innovation

[Diagram layers: Operating System Software; Extensible Firmware Interface (EFI); System Abstraction Layer (SAL); Processor Abstraction Layer (PAL); Processor (Hardware); Platform (Hardware)]

Page 23:

Machine Check Architecture

Defines processor, chipset, firmware, and operating system responsibilities for advanced error handling

RAS Feature Summary

Error Detection, Error Correction: Processor cache parity/ECC; SPS snoop filter ECC; ECC/parity on buses; memory ECC; memory scrubbing; control/operational errors; timeout detection

Error Containment: Correction/data poisoning; transaction error response; thermal sensors/management

Error Status/Signaling: Error typing; error masking; first error / next error status

Error Logging: Error logs (control/data); multi-node error trail

Serviceability: Memory device failure recovery; PCI hotplug; node hotplug

Multinode Features: Multi-pathing

Advanced Machine Check Architecture