Longhorn: Designed and Built for Reliability Mario Garzia, Director of Development Björn Levidow,...


Page 1: Longhorn: Designed and Built for Reliability Mario Garzia, Director of Development Björn Levidow, Group Program Manager Windows Reliability Microsoft Corporation.

Longhorn: Designed and Built for Reliability

Mario Garzia, Director of Development
Björn Levidow, Group Program Manager

Windows Reliability
Microsoft Corporation

Page 2: Agenda

• What is Reliability
• Windows Approach to Reliability Engineering
• How is Windows Doing
• Reliability during the Development Process
• The Digital Feedback Loop
• A Look at Longhorn Reliability Features
• Summary

Page 3: What is Reliability?

Page 4: Reliability Delights Customers

Extensive analysis of customer feedback…

Annual marketing surveys covering all MS products

• Approximately 33K customers

• Comments from 3,000 very satisfied and very dissatisfied customers

• Product quality one of several categories tested

…shows poor reliability decreases satisfaction…

Data shows reliability is the greatest source of product dissatisfaction

More than 50% of dissatisfied customers blame reliability

…while great reliability is a customer delighter

Many customers cite reliability as their reason for product satisfaction

Reliability features among those with highest customer interest and most likely to lead to upgrade

Longhorn is built to delight our customers!

Page 5: Customer’s Perspective on Reliability

Reliability involves more than just crashes and hangs. It involves:
• Bugs, errors, faults
• Application crashes & blue screens
• Patches, hotfixes & service packs
• Reliable, stable, predictable, …
• Freezes, hangs & lock-ups
• Restarts, reboots, downtime periods
• Readiness for RTM, finished, complete, mature, …
• Multiple installs, rebuilds, reformat … to fix problems
• Being able to run needed software
• Data loss!

Pages 6-12: Microsoft’s Customer-Focused Reliability Attributes

Attribute / Definition / Examples

• Resilient: The system continues to provide service in the face of internal or external disruptions. Examples: crashes, hangs, …
• Recoverable: After disruption the system is easily restored to a previously known state with no data loss. Examples: data corruption
• Controlled: Provides timely and expected service whenever needed. Examples: degraded response
• Undisruptable: Required changes and upgrades do not impact the service. Examples: update disruptions
• Production Ready: At release the system contains a minimum number of bugs, requiring a limited number of predictable patches/fixes. Examples: patch size, frequency
• Predictable: It works as advertised; what worked before works now. Examples: compatibility failures

Page 13: Windows Approach to Reliability Engineering

Page 14: Engineering Software for Reliability

Reliability improvement throughout the product’s lifecycle

Design for Reliability
• Minimize user disruptions
• Architectural reliability improvements
• Enhanced recovery & resiliency

Product Testing
• Reliability-focused testing
• Reliability release criteria

Customer Feedback
• Data-driven, realistic assessment
• Problem prioritization
• Input from many sources

Product Updates & Best Practices
• Problem prevention and product fixes
• Best practices

[Diagram: Windows Reliability Engineering cycle linking Design for Reliability, Product Testing, Customer Feedback, and Product Updates]

Page 15: Assuring Product Reliability

Reliability built into all stages of product development

Stages: Design and development, Testing, Release, Post-release

Activities across the stages
• Reliability planning and monitoring
• Reliability testing
• Reliability release criteria
• Production environment tracking
• Products for reliability tracking in the field

Practices and tools
• Reliability goals and objectives
• Reliability engineering practices
• Windows LongHaul reliability tracking
• Static and dynamic tools for code quality, e.g., PREfast, PREfix
• Production environment reliability validation prior to release
• Reliability tracking at the component level
• Watson/OCA
• Service Quality Monitoring
• Microsoft Reliability Analysis Service

Metrics tracked
• Reliability: MTTShutdown, MTTCrash
• Availability: MTTRestore, Uptime, Downtime
• Causes of disruptions: Crashes, Reboots, Hangs

Page 16: Ongoing Customer Assessment: Customer Measurement Program

• Customer Measurement Program initiated in Windows NT 4.0

• Provides detailed information for improving subsequent versions of Windows

• Longhorn reliability improvements are the direct results of these measurements

Customer Measurement Program Details

Production server measurements

• 4,600 NT4 servers at 15 sites
• More than 10,000 Windows 2000 servers at more than 30 sites
• Windows Server 2003 at more than 50 customer sites
• Pre-release reliability assessment of JDP customers
• Ship criteria based on meeting reliability objectives

Client corporate and consumer measurements

• Crash and hang failure information from millions of customers worldwide
• More than 10,000 consumer systems
• Thousands of corporate desktops/laptops measured

Page 17: How is Windows Doing?

Page 18: Improved Reliability and Availability

Fewer Reboots

[Figure: three pie charts of reboot causes for NT4, Windows 2000, and Windows Server 2003, each split into planned and unplanned reboots. Planned/unplanned splits shown: 76%/24%, 65%/35%, and 86%/14% (the last labeled Windows Server 2003). Slice categories include OS upgrade/SP/hotfix, application install/maintenance, hardware install/config/maintenance, OS config, preventive reboots, security, application failure, OS & driver/adaptor failure, system component failure, network connectivity/power, system or application unresponsive, and other.]

[Figure: Server Availability Improvement. Bar chart for NT, Windows 2000, and Windows Server 2003 (RC1 and RTM); availability axis from 99.80% to 100.00%.]

Page 19: Reducing Failures in Windows

Fewer Crashes

[Figure: three pie charts of crash causes, titled NT4, Windows 2000, and Windows XP RTM. Slice values recovered from the slide, one chart per line in source order:]
• 3rd party device driver 33%, 3rd party filter driver 27%, Networking 14%, Win32k 10%, File Systems 7%, Hardware 4%, Other 3%, HAL 2%
• Networking, Win32k, file system 43%, In-box device drivers 16%, 3rd party device drivers 16%, Hardware 13%, 3rd party filter drivers 12%
• 3rd party device driver 89%, Win32k 3%, Kernel 2%, Networking 2%, Registry 2%, Disk 1%, USB Core 1%, File Systems <1%, 3rd party filter driver <1%

Crashes < 1% of reboots on XP

Source: OTG and crashes sent to Online Crash Analysis

Page 20: Reliability: Where are We?

Then (Win 9X & NT4.0)
• Frequent reboots on NT4 and its early SPs
• Reboots common for Win9X

Now (Windows XP & 2003)

Results
• Rock-solid kernel: mean time between crashes 3-10 years
• Eliminated 80% of reboots/server/year in Windows 2003
• XP has much higher reliability than Win 9X

Single Server Availability
• Customers can achieve 99.99% availability on Server 2003
• Cinergy achieved 99.99% server availability
• MSNBC achieved 99.98% during the Winter Olympics

Fault Tolerant Offerings
• Stratus FTServer with guaranteed 100% availability
• HP offers 99.99% uptime guarantee (10X better than NT4)
• Unisys offers 99.99% and higher availability

Multi Server FT Availability
• Madrid Stock Exchange achieving 99.999% availability
• MS.com has top-rated availability as reported by Keynote
• Continuous availability for City of San Diego 911 service

Downtime Per Year
• 99.999% = 5 min
• 99.99% = 52 min
• 99.9% = 8.8 hrs
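The downtime-per-year figures on this slide follow directly from the availability percentages. As a minimal sketch (the function name is illustrative and a 365-day year is assumed), the arithmetic can be checked like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.9):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

The unrounded values are about 5.26 minutes, 52.56 minutes, and 8.76 hours, which the slide rounds to 5 min, 52 min, and 8.8 hrs.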

Page 21: Raising the Reliability Bar on Longhorn

Page 22: Longhorn Reliability Objectives

Theme: No loss of work, time, data or control

Motto: No Hangs, No Crashes, No Reboots

Our focus is on reducing user disruptions
• Reducing failures, crashes and hangs
• Reducing required shutdowns for software installations

How we raised the bar on Longhorn reliability
• New processes to minimize bugs and design issues
• Enhanced feedback for identifying product problems during development
• New reliability features

Page 23: Reliability During the Development Process

Page 24: Longhorn Development Process Reliability Goals

New LH components designed for reliability

Make all LH components measurably more reliable
• No Hang, No Crash, No Reboot
• Focus on data-driven initiatives

Communicate LH reliability priorities across Windows
• Instill a sense of reliability ownership on all teams
• Track progress of teams against reliability deliverables

Ensure component teams focus on measurable customer improvements

Page 25: Longhorn Reliability Process

Set development milestone exit criteria
• Design issues resolved, bugs fixed
• Bar gets higher with each milestone

Use tools to track progress towards exit criteria
• E.g., detect if DLLs are unloaded by services in shared processes

Use standard release management process for enforcement
• Reliability issues get specific focus if not resolved

Milestone is not exited unless all reliability criteria are met

Page 26: Longhorn Bug Process (LBP)

LBP uses real pre-release system data and turns observed Watson user crashes and hangs into actionable bugs for Windows feature teams.

Process
• Longhorn Watson problems identified
• Bugs automatically created and assigned
• Bug status / resolution tracked by Release Management
• Fixes validated to ensure quality
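The "bugs automatically created and assigned" step can be pictured as grouping failure reports by signature and filing one bug per group. The sketch below is purely illustrative: the slide does not describe how LBP is implemented, and the `Report` and `Bug` records, the owner table, and the `min_hits` threshold are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    kind: str       # "crash" or "hang"
    module: str     # module blamed for the failure (hypothetical field)
    signature: str  # hypothetical failure signature


@dataclass
class Bug:
    signature: str
    kind: str
    owner: str
    hits: int
    status: str = "open"


def file_bugs(reports, owners, min_hits=1):
    """Group reports by (kind, module, signature) and create one bug per group.

    `owners` maps a module name to the team that owns it; modules without an
    owner are routed to a "triage" queue. Bugs are returned most-hit first.
    """
    counts = Counter((r.kind, r.module, r.signature) for r in reports)
    bugs = []
    for (kind, module, signature), hits in counts.most_common():
        if hits < min_hits:
            continue  # too rare to act on under this (assumed) threshold
        bugs.append(Bug(signature, kind, owners.get(module, "triage"), hits))
    return bugs
```

Ordering bugs by hit count reflects the data-driven prioritization the deck describes; tracking status on each bug corresponds to the release-management step.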

Page 27: Windows Reliability Release Criteria

[Figure: OS availability across pre-release milestones (B3, RC1, RC2, RC3, RTM), with variability bands for Windows 2000 and Windows Server 2003]

• Product reliability readiness assessment in customer environments prior to shipping software
• Reliability target established based on expected improvement over prior OS versions
• Reliability assessment of production servers
• Reliability assessment of customer desktops/laptops
• Release readiness based on user reliability attributes

Page 28: The Digital Feedback Loop

Page 29: The Digital Feedback Loop: Windows Error Reporting

• Errors (crashes, hangs) are reported to Microsoft in real time by customer choice
• Automatic analysis and signature matching to known issues
• Dedicated Microsoft resources for finding solutions to common system and application crashes and hangs in Windows and 3rd party software
• Known fixes provided to customers in real time
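Signature matching to known issues amounts to reducing each report to a key and looking that key up in a table of known fixes. The following is a minimal sketch under stated assumptions: the report fields, the signature layout, and the fix table are hypothetical and do not describe the actual Windows Error Reporting format.

```python
# Hypothetical table mapping a failure signature to a known fix.
KNOWN_FIXES = {
    ("crash", "app.exe", "plugin.dll", 0x1A2B): "Update plugin.dll to the patched build",
}


def make_signature(report):
    """Reduce a report to the fields assumed to identify a failure."""
    return (report["kind"], report["process"], report["module"], report["offset"])


def respond(report, known_fixes, unmatched):
    """Return the known fix for a report, or queue its signature for analysis."""
    sig = make_signature(report)
    fix = known_fixes.get(sig)
    if fix is None:
        unmatched.append(sig)  # no known issue yet; left for later investigation
    return fix
```

A report that matches returns its fix immediately, which models the real-time response on the slide; unmatched signatures accumulate for the people investigating common failures.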

Page 30: The Digital Feedback Loop: New LH Reliability Analysis Component

[Diagram labels: Consumer Client, Corporate Client, Server, Server Client; ERC, WDI; RAC - Consumer, RAC - Corporate, RAC - Backend; MOM; via SQM, Watson, OCA; Internet; MS; Feedback; Reliability Analysis Aggregate Component. Outputs: Release Criteria, Product Improvement, Identification of New Problem Areas.]

RAC is a system reliability analysis & reporting component
• Analyzes, aggregates, and correlates user disruptions for the OS and applications
• Provides user disruption and cause tracking
• Exposes reliability metrics and results to:
  - Users
  - Health monitoring applications
  - MS Product Feedback
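To make "analyzes, aggregates, and correlates user disruptions" concrete, the sketch below tallies disruption events per source and derives a simple mean-time-between-disruptions figure. It is an illustration only: the event shape, the function name, and the choice of metric are assumptions, not a description of how RAC works.

```python
from collections import defaultdict


def summarize(events, observed_hours):
    """Aggregate disruption events per source.

    `events` is an iterable of (source, kind) tuples, e.g. ("os", "crash").
    `observed_hours` is the length of the observation window in hours.
    Returns, per source, the count by kind, the total, and the mean number
    of hours between disruptions over the window.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for source, kind in events:
        counts[source][kind] += 1

    summary = {}
    for source, by_kind in counts.items():
        total = sum(by_kind.values())
        summary[source] = {
            "by_kind": dict(by_kind),
            "total": total,
            "mean_hours_between_disruptions": observed_hours / total,
        }
    return summary
```

A summary in this shape could be shown to a user, consumed by a health monitoring application, or sent back as product feedback, which are the three audiences the slide lists.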

Page 31: Longhorn Reliability Features

Page 32: Longhorn Reliability Features

Longhorn reduces the frequency and impact of user disruptions
• Auto-diagnosis and auto-correction
• Protect user data when failures occur
• Provide fixes for known crashes and hangs
• Minimize reboots when installing software

Enhanced instrumentation enables improved reliability
• Tracking and analysis of OS and application health
• Feedback to Microsoft for product improvement, driving both near term fixes and new features

Page 33: Enhanced Instrumentation

Foundation for all diagnosis and repair of user disruptions

Better tracking of OS and component state changes
• Installations
  - App, drivers, OS patches
• Runtime changes
  - OS, applications, services, drivers

More complete and detailed root cause information for many types of user disruptions
• E.g., hangs, reboots, disk failure, memory failure, resource exhaustion, non-bootable systems
• Specific failure information for top in-box experiences
  - IE, Shell, Windows Media Player, etc.

Page 34: Auto-diagnosis and Auto-correction

Longhorn can diagnose and (where appropriate) repair common failures
• Start-up Repair (unbootable systems)
• Hardware diagnostics
  - Disk failure
  - RAM
• Resource exhaustion

Page 35: Startup Repair

Provide automatic diagnosis and recovery for unbootable systems
• Empower end users with the ability to automatically recover from ≥ 80% of known causes for unbootable systems
• Minimize end-user impact (data loss, downtime) when fixing unbootable systems
• Provide support organizations with diagnostics to facilitate user recovery and reduce call times

Page 36: Hardware Diagnostics

Disk Failure Diagnostics Goals
• Minimize data loss from catastrophic failure
• Turn unplanned failures into planned maintenance
• Main focus on client

Memory Diagnostics Goals
• Prevent recurring crashes due to bad RAM
  - OS avoids using the bad physical pages of memory on next boot
• Integrate diagnostics into OS
• Make results easy to understand for end user and IT Pro

Page 37: Resource Exhaustion Diagnosis

Give users control of their system by allowing them to take action before a low resource condition impacts them
• Automatic detection and diagnosis of near-exhaustion of commit limit and memory leaks
• Provide options for manual and automatic resolution to avoid exhaustion

Collect data on the exhaustion of the resources for future product improvement
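Detecting near-exhaustion of the commit limit and flagging likely leaks can be approximated with two simple checks. This is a hedged sketch rather than the Longhorn mechanism: the 90% threshold, the "never decreases" leak heuristic, and all names are assumptions made for illustration.

```python
def near_exhaustion(commit_charge, commit_limit, threshold=0.9):
    """Return True when committed memory reaches the given fraction of the limit."""
    return commit_charge / commit_limit >= threshold


def suspected_leaks(samples, min_growth):
    """Flag processes whose committed memory looks like a leak.

    `samples` maps a process name to a list of commit sizes sampled over time.
    A process is flagged when its usage never decreases between samples and
    grows by at least `min_growth` over the window (an assumed heuristic).
    """
    flagged = []
    for name, series in samples.items():
        if len(series) < 2:
            continue  # not enough data to judge a trend
        non_decreasing = all(b >= a for a, b in zip(series, series[1:]))
        if non_decreasing and series[-1] - series[0] >= min_growth:
            flagged.append(name)
    return flagged
```

When `near_exhaustion` fires, the flagged processes are natural candidates to present to the user for manual resolution, or to act on automatically, before the low-resource condition affects them.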

Page 38: Leveraging Feedback to Solve Customer Problems in Real Time

• Fixes provided automatically for known OS and application crashes when a crash occurs
• New to Longhorn, fixes also provided for known application hangs
• Customer hangs and crashes prevented through regular Auto Updates based on detected problems

Page 39: Minimize Reboots when Installing Software

Provide functionality required by installers and Windows Update to minimize reboots
• Shut down only required applications and services
• Automatically detect and shut down services in shared processes with a file in use
• Prevent the need for a machine restart after apps or services have been shut down
• Group application, service and machine restarts when possible
• Leverage app “freeze-dry” functionality to return user to the state they were in before the restart

Users experience minimum disruption for application and patch installs
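The idea of shutting down only the applications and services that hold a file in use can be sketched as a small planning step. Everything here is hypothetical: the inputs, the notion of a "critical" process that cannot be restarted on its own, and the function name are illustrative assumptions, not the Longhorn design.

```python
def plan_install(files_to_replace, open_files, critical):
    """Decide what must be restarted to replace a set of files.

    `files_to_replace` lists the files an installer needs to update.
    `open_files` maps a process name to the files it currently has open.
    `critical` is the set of processes that cannot be restarted individually.
    Returns (processes_to_restart, reboot_required).
    """
    targets = set(files_to_replace)
    holders = sorted(p for p, files in open_files.items() if targets & set(files))
    reboot_required = any(p in critical for p in holders)
    to_restart = [p for p in holders if p not in critical]
    return to_restart, reboot_required
```

Under this sketch a machine restart is needed only when a file is held by a process that cannot be restarted by itself; otherwise the affected processes are restarted together as one group and everything else keeps running.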

Page 40: Summary

Longhorn is built for reliability
• Development process
• Pre-release reliability assessment
• Auto-diagnosis of problems
• Problem prevention and real-time fixes

No more waiting for SP3: Longhorn is reliable out of the gate

More info at http://reliability/public

Page 41:

© 2002 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.