Longhorn: Designed and Built for Reliability
Mario Garzia, Director of Development
Björn Levidow, Group Program Manager
Windows Reliability
Microsoft Corporation
Agenda
• What is Reliability
• Windows Approach to Reliability Engineering
• How is Windows Doing
• Reliability during the Development Process
• The Digital Feedback Loop
• A Look at Longhorn Reliability Features
• Summary
What is Reliability?
Reliability Delights Customers
Extensive analysis of customer feedback…
Annual marketing surveys covering all MS products
• Approximately 33K customers
• Comments from 3,000 very satisfied and very dissatisfied customers
• Product quality one of several categories tested
…shows poor reliability decreases satisfaction…
Data shows reliability is the greatest source of product dissatisfaction
More than 50% of dissatisfied customers blame reliability
…while great reliability is a customer delighter
Many customers cite reliability as their reason for product satisfaction
Reliability features among those with highest customer interest and most likely to lead to upgrade
Longhorn is built to delight our customers!
Customer’s Perspective on Reliability
Reliability involves more than just crashes and hangs. It involves:
• Bugs, errors, faults
• Application crashes & blue screens
• Patches, hotfixes & service packs
• Reliable, stable, predictable, …
• Freezes, hangs & lock-ups
• Restarts, reboots, downtime periods
• Readiness for RTM, finished, complete, mature, …
• Multiple installs, rebuilds, reformat … to fix problems
• Being able to run needed software
• Data loss!
Microsoft’s Customer-Focused Reliability Attributes
Attribute | Definition | Examples
Resilient | The system continues to provide service in the face of internal or external disruptions | crashes, hangs …
Recoverable | After disruption the system is easily restored to a previously known state with no data loss | data corruption
Controlled | Provides timely and expected service whenever needed | degraded response
Undisruptable | Required changes and upgrades do not impact the service | update disruptions
Production Ready | At release the system contains a minimum number of bugs, requiring a limited number of predictable patches/fixes | patch size, frequency
Predictable | It works as advertised, what worked before works now | compatibility failures
Windows Approach to Reliability Engineering
Engineering Software for Reliability
Reliability improvement throughout the product’s lifecycle
• Design for Reliability
  • Minimize user disruptions
  • Architectural reliability improvements
  • Enhanced recovery & resiliency
• Product Testing
  • Reliability-focused testing
  • Reliability release criteria
• Customer Feedback
  • Data-driven, realistic assessment
  • Problem prioritization
  • Input from many sources
• Product Updates & Best Practices
  • Problem prevention and product fixes
  • Best practices
[Figure: Windows Reliability Engineering cycle linking Design for Reliability, Product Testing, Customer Feedback, and Product Updates.]
Assuring Product Reliability
Reliability Built into all Stages of Product Development
Stages: Design and development, Testing, Release, Post-release
• Reliability planning and monitoring
• Reliability testing
• Reliability release criteria
• Production environment tracking
• Products for reliability tracking in the field
• Reliability goals and objectives
• Reliability engineering practices
• Windows LongHaul reliability tracking
• Static and dynamic tools for code quality, e.g., Prefast, Prefix
• Production environment reliability validation prior to release
• Reliability tracking at the component level
• Reliability: MTTShutdown, MTTCrash
• Availability: MTTRestore, Uptime, Downtime
• Causes of disruptions: Crashes, Reboots, Hangs
• Watson/OCA
• Service Quality Monitoring
• Microsoft Reliability Analysis Service
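The metrics named on this slide (MTTCrash, MTTRestore, uptime, downtime) can be illustrated with a small calculation. The sketch below is not the tooling described in the deck; the `Outage` record, the `summarize` function, and the exact metric definitions (uptime divided by crash count, downtime divided by outage count) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One disruption (hypothetical record layout)."""
    cause: str             # e.g. "crash", "reboot", "hang"
    downtime_hours: float  # time until service was restored


def summarize(observed_hours: float, outages: list[Outage]) -> dict[str, float]:
    """Compute simple reliability and availability figures for one machine."""
    downtime = sum(o.downtime_hours for o in outages)
    uptime = observed_hours - downtime
    crashes = [o for o in outages if o.cause == "crash"]
    return {
        "uptime_hours": uptime,
        "downtime_hours": downtime,
        "availability": uptime / observed_hours,
        # Assumed definition: uptime divided by the number of crashes.
        "mtt_crash_hours": uptime / len(crashes) if crashes else float("inf"),
        # Assumed definition: average downtime per outage.
        "mtt_restore_hours": downtime / len(outages) if outages else 0.0,
    }


# Made-up observation window of 1000 hours with three outages.
log = [Outage("crash", 0.5), Outage("reboot", 0.25), Outage("crash", 0.25)]
stats = summarize(1000.0, log)
print(stats)
```

Keeping the cause on each outage record is what allows the same log to feed both the availability figures and a breakdown of disruption causes.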
Ongoing Customer Assessment
Customer Measurement Program
• Customer Measurement Program initiated in Windows NT 4.0
• Provides detailed information for improving subsequent versions of Windows
• Longhorn reliability improvements are the direct results of these measurements
Customer Measurement Program Details
Production server measurements
• 4,600 NT4 servers at 15 sites
• More than 10,000 Windows 2000 servers at more than 30 sites
• Windows Server 2003 at more than 50 customer sites
• Pre-release reliability assessment of JDP customers
• Ship criteria based on meeting reliability objectives
Client corporate and consumer measurements
• Crash and hang failure information from millions of customers worldwide
• More than 10,000 consumer systems
• Thousands of corporate desktops/laptops measured
How is Windows Doing?
Improved Reliability and Availability
Fewer Reboots
[Figure: pie charts of reboot causes for NT4, Windows 2000, and Windows Server 2003, each split into planned and unplanned reboots. Categories include OS upgrade/SP/hotfix, OS config, app install/maintenance, app failure, app unresponsive/unstable, hardware install/config, hardware install/maintenance, OS & driver/adaptor failure, system component failure, system unresponsive, network connectivity/power, security, preventive reboots, and other. Planned/unplanned splits shown: 76%/24%, 65%/35%, and 86%/14% (Windows Server 2003).]
[Figure: Server Availability Improvement. Availability axis from 99.80% to 100.00% for NT, Windows 2000, and Windows Server 2003, with RC1 and RTM series.]
Reducing Failures in Windows
Fewer Crashes
[Figure: three pie charts of crash causes, for NT4, Windows 2000, and Windows XP RTM. Breakdowns shown:
• 3rd party device driver 33%, 3rd party filter driver 27%, networking 14%, Win32k 10%, file systems 7%, hardware 4%, other 3%, HAL 2%
• Networking, Win32k, file system 43%; in-box device drivers 16%; 3rd party device drivers 16%; hardware 13%; 3rd party filter drivers 12%
• 3rd party device driver 89%, Win32k 3%, kernel 2%, networking 2%, registry 2%, disk 1%, USB core 1%, file systems <1%, 3rd party filter driver <1%]
• Crashes < 1% of reboots on XP
Source: OTG and crashes sent to Online Crash Analysis
Reliability: Where are We?
Then (Win 9X & NT4.0)
• Frequent reboots on NT4 and its early SPs
• Reboots common for Win9X
Now (Windows XP & 2003)
Results
• Rock-solid kernel: Mean Time Between Crashes 3-10 years
• Eliminated 80% of reboots/server/year in Windows 2003
• XP has much higher reliability than Win 9X
Single Server Availability
• Customers can achieve 99.99% availability on Server 2003
• Cinergy achieved 99.99% server availability
• MSNBC achieved 99.98% during the Winter Olympics
Multi Server FT Availability
• Madrid Stock Exchange achieving 99.999% availability
• MS.com has top-rated availability as reported by Keynote
• Continuous availability for City of San Diego 911 service
Fault Tolerant Offerings
• Stratus FTServer with guaranteed 100% availability
• HP offers 99.99% uptime guarantee (10X better than NT4)
• Unisys offers 99.99% and higher availability
Downtime Per Year
• 99.999% = 5 Min
• 99.99% = 52 Min
• 99.9% = 8.8 Hrs
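The "Downtime Per Year" figures on this slide follow from the availability percentage by simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for pct in (99.999, 99.99, 99.9):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

This gives roughly 5.3 minutes, 52.6 minutes, and 8.8 hours, in line with the slide's rounded figures. Each additional "nine" cuts the allowed downtime by a factor of ten.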
Raising the Reliability Bar on Longhorn
Longhorn Reliability Objectives
• Theme: No loss of work, time, data or control
• Motto: No Hangs, No Crashes, No Reboots
• Our focus is on reducing user disruptions
  • Reducing failures, crashes and hangs
  • Reducing required shutdowns for software installations
• How we raised the bar on Longhorn reliability
  • New processes to minimize bugs and design issues
  • Enhanced feedback for identifying product problems during development
  • New reliability features
Reliability During the Development Process
Longhorn Development Process Reliability Goals
• New LH components designed for reliability
• Make all LH components measurably more reliable
  • No Hang, No Crash, No Reboot
  • Focus on data-driven initiatives
• Communicate LH reliability priorities across Windows
  • Instill sense of reliability ownership on all teams
  • Track progress of teams against reliability deliverables
• Ensure component teams focus on measurable customer improvements
Longhorn Reliability Process
• Set development milestone exit criteria
  • Design issues resolved, bugs fixed
  • Bar gets higher with each milestone
• Use tools to track progress towards exit criteria
  • e.g., Detect if DLLs are unloaded by services in shared processes
• Use standard release management process for enforcement
  • Reliability issues get specific focus if not resolved
• Milestone is not exited unless all reliability criteria are met
Longhorn Bug Process (LBP)
LBP uses real pre-release system data and turns observed Watson user crashes and hangs into actionable bugs for Windows feature teams.
Process
• Longhorn Watson problems identified
• Bugs automatically created and assigned
• Bug status/resolution tracked by Release Management
• Fixes validated to ensure quality
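The "bugs automatically created and assigned" step can be pictured as grouping failure reports and filing one bug per group. The sketch below is only an illustration of that idea, not the actual LBP tooling: the `Report` and `Bug` records, the `OWNERS` table, and the `min_hits` threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Report:
    """A crash or hang report (hypothetical fields)."""
    kind: str       # "crash" or "hang"
    component: str  # component the failure was attributed to
    signature: str  # identifier shared by reports of the same failure


@dataclass
class Bug:
    title: str
    assigned_to: str
    hits: int
    status: str = "Active"


# Hypothetical component-to-team ownership table.
OWNERS = {"shell": "Shell team", "ie": "IE team"}


def file_bugs(reports: list[Report], min_hits: int = 2) -> list[Bug]:
    """Create one bug per failure signature seen at least min_hits times."""
    counts = Counter((r.kind, r.component, r.signature) for r in reports)
    bugs = []
    # most_common() yields the most frequently hit failures first.
    for (kind, component, signature), hits in counts.most_common():
        if hits < min_hits:
            continue
        owner = OWNERS.get(component, "Triage")
        bugs.append(Bug(f"{kind} in {component}: {signature}", owner, hits))
    return bugs


reports = [
    Report("crash", "shell", "sig-a"),
    Report("crash", "shell", "sig-a"),
    Report("hang", "ie", "sig-b"),
    Report("hang", "ie", "sig-b"),
    Report("hang", "ie", "sig-b"),
    Report("crash", "other", "sig-c"),
]
bugs = file_bugs(reports)
for bug in bugs:
    print(bug)
```

Ordering by hit count means the failures that affect the most pre-release users reach the owning team first.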
Windows Reliability Release Criteria
[Figure: OS availability across pre-release milestones (B3, RC1, RC2, RC3, RTM) for Windows 2000 and Windows Server 2003, with W2K and Windows Server 2003 variability bands.]
• Product reliability readiness assessment in customer environments prior to shipping software
  • Reliability target established based on expected improvement over prior OS versions
  • Reliability assessment of production servers
  • Reliability assessment of customer desktops/laptops
  • Release readiness based on user reliability attributes
The Digital Feedback Loop
The Digital Feedback Loop: Windows Error Reporting
• Errors are reported to Microsoft in real-time by customer choice (crashes, hangs)
• Automatic analysis and signature matching to known issues
• Dedicated Microsoft resources for finding solutions to common system and application crashes and hangs in Windows and 3rd party software
• Known fixes provided to customers in real-time
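"Signature matching to known issues" can be illustrated as deriving a stable key from a report and looking it up in a table of fixes. This is a minimal sketch of the general idea only; the fields used (application, module, offset), the hashing scheme, and all names are assumptions and do not describe how Windows Error Reporting actually builds its signatures.

```python
import hashlib
from typing import Optional

# Hypothetical table mapping failure signatures to known fixes.
KNOWN_FIXES: dict[str, str] = {}


def make_signature(app: str, module: str, offset: int) -> str:
    """Derive a stable signature from a few fields of an error report."""
    # Lower-casing makes the match insensitive to file name case.
    key = f"{app.lower()}|{module.lower()}|{offset:08x}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def register_fix(app: str, module: str, offset: int, fix: str) -> None:
    """Record a fix for the failure identified by these fields."""
    KNOWN_FIXES[make_signature(app, module, offset)] = fix


def lookup_fix(app: str, module: str, offset: int) -> Optional[str]:
    """Return the known fix for a report, or None if the issue is new."""
    return KNOWN_FIXES.get(make_signature(app, module, offset))


register_fix("editor.exe", "render.dll", 0x1A2B, "Install renderer update")
known = lookup_fix("Editor.exe", "RENDER.DLL", 0x1A2B)
unknown = lookup_fix("editor.exe", "render.dll", 0x9999)
print(known, unknown)
```

A report that matches returns its fix immediately; one that does not match is a candidate new issue for analysis.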
The Digital Feedback Loop: New LH Reliability Analysis Component
[Figure: RAC architecture diagram. Labels: Consumer Client, Corporate Client, Server Client, Server, RAC - Consumer, RAC - Corporate, RAC - Backend, ERCWDI, Via SQM, Watson, OCA, MOM, Internet, MS, Feedback, Release Criteria, Product Improvement, Identification of New Problem Areas.]
Reliability Analysis Aggregate Component
• RAC is a system reliability analysis & reporting component
• Analyzes, aggregates, and correlates user disruptions for the OS and applications
• Provides user disruption and cause tracking
• Exposes reliability metrics and results to:
  • Users
  • Health monitoring applications
  • MS Product Feedback
Longhorn Reliability Features
• Longhorn reduces the frequency and impact of user disruptions
  • Auto-diagnosis and auto-correction
  • Protect user data when failures occur
  • Provide fixes for known crashes and hangs
  • Minimize reboots when installing software
• Enhanced instrumentation enables improved reliability
  • Tracking and analysis of OS and application health
  • Feedback to Microsoft for product improvement, driving both near term fixes and new features
Enhanced Instrumentation
• Foundation for all diagnosis and repair of user disruptions
• Better tracking of OS and component state changes
  • Installations: app, drivers, OS patches
  • Runtime changes: OS, applications, services, drivers
• More complete and detailed root cause information for many types of user disruptions
  • E.g., hangs, reboots, disk failure, memory failure, resource exhaustion, non-bootable systems
  • Specific failure information for top in-box experiences: IE, Shell, Windows Media Player, etc.
Auto-diagnosis and Auto-correction
• Longhorn can diagnose and (where appropriate) repair common failures
  • Start-up Repair (unbootable systems)
  • Hardware diagnostics: disk failure, RAM
  • Resource exhaustion
Startup Repair
• Provide automatic diagnosis and recovery for unbootable systems
  • Empower end users with the ability to automatically recover from ≥ 80% of known causes for unbootable systems
  • Minimize end-user impact (data loss, downtime) when fixing unbootable systems
• Provide support organizations with diagnostics to facilitate user recovery and reduce call times
Hardware Diagnostics
• Disk Failure Diagnostics Goals
  • Minimize data loss from catastrophic failure
  • Turn unplanned failures into planned maintenance
  • Main focus on client
• Memory Diagnostics Goals
  • Prevents recurring crashes due to bad RAM
  • OS avoids using the bad physical pages of memory on next boot
  • Integrate diagnostics into OS
  • Make results easy to understand for end user and IT Pro
Resource Exhaustion Diagnosis
• Give users control of their system by allowing them to take action before a low resource condition impacts them
  • Automatic detection and diagnosis of near-exhaustion of commit limit and memory leaks
  • Provide options for manual and automatic resolution to avoid exhaustion
• Collect data on the exhaustion of the resources for future product improvement
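Detection of near-exhaustion and of likely memory leaks can be sketched with two simple checks. The code below is a crude illustrative heuristic, not the Longhorn implementation: the 90% threshold, the "strictly growing across samples" leak test, and all function names are assumptions.

```python
def near_commit_limit(committed_mb: float, limit_mb: float,
                      threshold: float = 0.9) -> bool:
    """True when committed memory has reached the given fraction of the limit."""
    return committed_mb >= threshold * limit_mb


def looks_like_leak(samples_mb: list[float], min_samples: int = 5) -> bool:
    """Crude heuristic: memory use grew at every one of the recent samples."""
    if len(samples_mb) < min_samples:
        return False
    return all(later > earlier
               for earlier, later in zip(samples_mb, samples_mb[1:]))


def diagnose(committed_mb: float, limit_mb: float,
             per_process: dict[str, list[float]]) -> list[str]:
    """Name suspect processes, but only when the system is near exhaustion."""
    if not near_commit_limit(committed_mb, limit_mb):
        return []
    return sorted(name for name, samples in per_process.items()
                  if looks_like_leak(samples))


# Made-up memory samples (in MB) for two hypothetical processes.
usage = {
    "a.exe": [100, 120, 150, 190, 240],
    "b.exe": [300, 290, 310, 305, 300],
}
suspects = diagnose(3800, 4000, usage)
print(suspects)
```

Running the leak check only once the system nears its limit keeps the warning tied to a condition the user can act on, which matches the slide's goal of acting before a low resource condition has an impact.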
Leveraging Feedback to Solve Customer Problems in Real Time
• Fixes provided for known OS and application crashes automatically when a crash occurs
• New to Longhorn, fixes also provided for known application hangs
• Customer hangs and crashes prevented through regular Auto Updates based on detected problems
Minimize Reboots when Installing Software
• Provide functionality required by installers and Windows Update to minimize reboots
  • Shut down only required applications and services
  • Automatically detect and shut down services in shared processes with a file in use
  • Prevent the need for a machine restart after apps or services have been shut down
  • Group application, service and machine restarts when possible
  • Leverage app “freeze-dry” functionality to return user to the state they were in before the restart
• Users experience minimum disruption for application and patch installs
Summary
• Longhorn is built for reliability
  • Development process
  • Pre-release reliability assessment
  • Auto-diagnosis of problems
  • Problem prevention and real-time fixes
• No more waiting for SP3: Longhorn is reliable out of the gate
• More info at http://reliability/public
© 2002 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.