Strategies for Fault-Tolerant Computing For Windows Server 2003 Mehmet Altan AÇIKGÖZ Ercan SARAÇ.


Transcript of Strategies for Fault-Tolerant Computing For Windows Server 2003 Mehmet Altan AÇIKGÖZ Ercan SARAÇ.

Page 1

Strategies for Fault-Tolerant Computing

For Windows Server 2003

Mehmet Altan AÇIKGÖZ

Ercan SARAÇ

Page 2

Agenda

Introduction
Fault-tolerant servers
Fault tolerance on Windows
Unique Benefits
Clustering
Comparison of Clusters & Stratus ftServer
ftServer Software Availability
Summary

Page 3

Introduction

Availability: the percentage of time that a system is capable of serving its intended function.

Correlation Between Availability and Annual Downtime

Availability    Annual Downtime
99%             87.6 hours
99.9%           8.76 hours
99.99%          52.6 minutes
99.999%         5.26 minutes
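
The downtime figures follow from simple arithmetic: annual downtime is the unavailable fraction of one year. The snippet below is a minimal sketch of that calculation, assuming a 365-day year (8,760 hours); the function name `annual_downtime_hours` is illustrative and does not come from the original slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected hours of downtime per year for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        # Report small values in minutes so they are easier to read.
        if hours >= 1.0:
            print(f"{pct}% -> {hours:.2f} hours")
        else:
            print(f"{pct}% -> {hours * 60:.2f} minutes")
```

Each additional "nine" of availability divides the permitted downtime by ten, which is why the step from 99.9% to 99.999% moves the budget from hours to minutes.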

Page 4

Importance of Availability

Availability of mission-critical information systems is often tied directly to business performance or revenue.

Average Cost of Unplanned Downtime for Various Industries

Industry Sector                      Hourly Cost of Downtime
Manufacturing                        $28,000
Transportation                       $90,000
Retail, Catalog Sales                $90,000
Retail, Home Shopping                $113,000
Media, Pay Per View                  $1,100,000
Banking Data Center                  $2,500,000
Financial, Credit Card Processing    $2,600,000
Brokerage                            $6,500,000

Page 5

The Availability Equation: People, Process, and Technology

Page 6

Fault-Tolerant Servers

One way to minimize causes of downtime is through the use of fault-tolerant servers, combined with software that supports them.

If a primary component fails, the secondary component takes over in a process that is seamless to the application running on the server.

Page 7

Fault-Tolerant Servers

Most high-end servers employ at least some redundant components to eliminate common points of failure, but they will still fail when a nonredundant component such as a microprocessor or memory controller fails.

True fault-tolerant servers, however, employ complete redundancy across all system components, ensuring that no single point of failure can compromise system availability.

Page 8

Traditional Barriers to Adoption

Extremely high hardware costs
The typical cost for an entry-level fault-tolerant server running a proprietary operating system was $250,000 prior to 2000.

Complexity and expense of writing software
Writing programs for these systems required a deep understanding of transactional semantics and manual “checkpointing” at the application level.

Page 9

Fault Tolerance on Windows

Microsoft designed the Windows Server 2003, Enterprise Edition, operating system to fully support fault-tolerant servers.

Specific enhancements in Windows Server 2003 that apply to fault-tolerant servers include the following:

1. Memory mirroring
2. Multipath I/O
3. Improvements in load balancing and failover for miniport drivers
4. Hot-plug PCI support
5. Hot-add memory support

Page 10

Adoption of Fault-Tolerance on Windows

To increase availability for traditional Windows–based solutions
As Windows–based solutions continue to become more mission-critical, some companies are improving their application availability by moving these applications to fault-tolerant servers.

Page 11

Adoption of Fault-Tolerance on Windows

As a cost-effective alternative to proprietary platforms
Companies are realizing lower costs by deploying fault-tolerant servers running Windows for solutions that have traditionally resided on clustered UNIX servers, mainframes, or proprietary fault-tolerant systems.

Page 12

Fault-Tolerance on Windows

According to Stratus Technologies, which first began shipping fault-tolerant servers running a proprietary operating system in 1982 and added a UNIX-based offering in 1995, the company’s three-year-old Windows–based ftServer product line has resulted in more than 500 new customers in 2003 alone. NEC Corporation, which began shipping mainframes running proprietary operating systems in 1965, reports similar findings since the introduction of its FT Series fault-tolerant servers for Windows in early 2001.

Page 13

Complete Solutions

Downtime for Windows Server 2003–based solutions is typically due to hardware failures, bad device drivers, user error, poor change control processes, and so on, with a very small percentage attributable to the core operating system.

Several fault-tolerant system vendors go a step further in delivering availability-related services through continuous server monitoring. As an example, every Stratus server continually monitors itself for component and operating system failure, and can be set to immediately call into the company’s customer assistance center to report a failure or other important event. NEC offers similar services.

Page 14

Unique Benefits

Reduced Time to Market

Solutions intended to run on Windows–based fault-tolerant servers can be developed and deployed as rapidly as any other Windows–based application.

Companies can take advantage of the rich functionality provided in the .NET Framework and the highly productive Microsoft Visual Studio® .NET integrated development system to rapidly develop custom solutions, or they can choose from the full range of off-the-shelf Windows applications.

Page 15

Unique Benefits

Ease of Integration

With native support for industry standards such as XML Web services, the Microsoft platform and .NET technologies make it easy to integrate Windows–based solutions running on fault-tolerant servers with other systems. Microsoft BizTalk® Server extends these capabilities even further, with more than 300 plug-in BizTalk Adapters available to simplify enterprise application integration and enable companies to comply with industry-specific electronic transaction formats such as HIPAA or EDI.

Page 16

Unique Benefits

Ease of Management

Windows–based solutions running on fault-tolerant servers can be administered easily using the comprehensive management tools provided in the Microsoft platform. For example, Microsoft Operations Monitor enables companies to subject applications running on Windows–based servers to granular real-time monitoring, enabling administrators to detect many problems before they can affect system availability.

Page 17

Unique Benefits

Lower Hardware Costs

Fault-tolerant servers for Windows are available starting at under $20,000, a fraction of the typical $200,000-plus starting price for proprietary fault-tolerant platforms. Combined with the superior cost-effectiveness of the Microsoft platform, this order-of-magnitude decrease in hardware costs makes fault-tolerance on Windows economically justifiable in a far broader range of situations than fault-tolerance on proprietary platforms.

Page 18

Clustering

A cluster connects two or more servers together so that they appear as a single computer to clients. Connecting servers in a cluster allows for workload sharing, enables a single point of operation/management, and provides a path for scaling to meet increased demand. Thus, clustering gives you the ability to produce high-availability applications.

Page 19

Three Technologies for Clustering

Microsoft servers provide three technologies to support clustering:

1. Network Load Balancing (NLB),
2. Component Load Balancing (CLB), and
3. Microsoft Cluster Service (MSCS).

Page 20

Network Load Balancing

Network Load Balancing acts as a front-end cluster, distributing incoming IP traffic across a cluster of servers.
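
The idea of spreading incoming IP traffic across cluster hosts can be illustrated with a simple hash-based mapping. This is only a sketch of the general technique and makes no claim about how Microsoft's NLB is actually implemented; the host names and the `pick_host` function are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_host(client_ip: str, hosts: Sequence[str]) -> str:
    """Map a client IP address to one cluster host deterministically."""
    if not hosts:
        raise ValueError("cluster has no available hosts")
    # Hash the client address so that traffic spreads evenly across hosts
    # and the same client keeps landing on the same host.
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]


if __name__ == "__main__":
    hosts = ["web1", "web2", "web3"]  # hypothetical host names
    for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
        print(ip, "->", pick_host(ip, hosts))
```

If a host is removed from the list, the remaining hosts absorb its clients, which is the front-end behavior the slide describes.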

Page 21

Component Load Balancing

Component Load Balancing distributes workload across multiple servers running a site's business logic. It provides for dynamic balancing of COM+ components across a set of up to eight identical servers.

COM+ is both an object-oriented programming architecture and a set of operating system services.

Page 22

Cluster Service

Cluster Service acts as a back-end cluster; it provides high availability for applications such as databases, messaging, and file and print services. MSCS attempts to minimize the effect of failure on the system when any node (a server in the cluster) fails or is taken offline.

Page 23

Failover Capability Through Microsoft Cluster Service

MSCS failover capability is achieved through redundancy across the multiple connected machines in the cluster, each with independent failure states.

Redundancy requires that applications be installed on multiple servers within the cluster. However, an application is online on only one node at any point in time. If that application fails, or that server is taken down, the application is restarted on another node.

Windows Server 2003, Datacenter Edition, supports up to 8 nodes in a cluster.

Page 24

Each node has its own memory, system disk, operating system and subset of the cluster's resources.

If a node fails, the other node takes ownership of the failed node's resources (this process is known as "failover").

Microsoft Cluster Service then registers the network address for the resource on the new node so that client traffic is routed to the system that is available and now owns the resource. When the failed resource is later brought back online, MSCS can be configured to redistribute resources and client requests appropriately.
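
The failover and failback behavior described above can be modeled in a few lines. This is a toy sketch, not the MSCS API: the `Cluster` class, its method names, and the rule of moving resources to the alphabetically first surviving node are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Cluster:
    """Toy model: each resource group is online on exactly one node at a time."""
    nodes: Set[str]
    owner: Dict[str, str] = field(default_factory=dict)      # group -> current node
    preferred: Dict[str, str] = field(default_factory=dict)  # group -> preferred node

    def bring_online(self, group: str, node: str) -> None:
        if node not in self.nodes:
            raise ValueError(f"unknown node: {node}")
        self.owner[group] = node
        self.preferred.setdefault(group, node)

    def node_failed(self, node: str) -> None:
        """Failover: surviving nodes take ownership of the failed node's groups."""
        self.nodes.discard(node)
        if not self.nodes:
            raise RuntimeError("no surviving node to fail over to")
        target = sorted(self.nodes)[0]  # assumption: simplest possible placement rule
        for group, current in list(self.owner.items()):
            if current == node:
                self.owner[group] = target

    def node_recovered(self, node: str, failback: bool = True) -> None:
        """Optional failback: return groups to their preferred node."""
        self.nodes.add(node)
        if failback:
            for group, pref in self.preferred.items():
                if pref == node:
                    self.owner[group] = node


if __name__ == "__main__":
    cluster = Cluster(nodes={"node-a", "node-b"})
    cluster.bring_online("database", "node-a")
    cluster.node_failed("node-a")
    print(cluster.owner)   # the group is now owned by node-b
    cluster.node_recovered("node-a")
    print(cluster.owner)   # the group is back on node-a
```

The `failback` flag mirrors the point that redistribution after recovery is a configuration choice rather than automatic behavior.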

Page 25

Cluster Service Architecture

The Cluster Service

The Cluster Service is the core component and runs as a high-priority system service.

The Cluster Service controls cluster activities and performs such tasks as coordinating event notification, facilitating communication between cluster components, handling failover operations and managing the configuration.

Each cluster node runs its own Cluster Service.

Page 26

The Resource Monitor

The Resource Monitor is an interface between the Cluster Service and the cluster resources, and runs as an independent process. The Cluster Service uses the Resource Monitor to communicate with the resource DLLs.

Page 27

The Resource DLL

Every resource uses a resource DLL. The Resource Monitor calls the entry point functions of the resource DLL to check the status of the resource and to bring the resource online and offline.

The resource DLL is responsible for communicating with its resource through any convenient IPC mechanism to implement these methods.
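
The division of labor between the Resource Monitor and a resource DLL can be sketched as an interface plus a poller. Python classes stand in for the DLL entry points here; the names `ResourceDll`, `InMemoryService`, and `ResourceMonitor.poll`, as well as the single-restart policy, are hypothetical and are not the real Cluster API.

```python
from abc import ABC, abstractmethod


class ResourceDll(ABC):
    """Stand-in for a resource DLL: the entry points a monitor may call."""

    @abstractmethod
    def online(self) -> None:
        """Bring the resource online."""

    @abstractmethod
    def offline(self) -> None:
        """Take the resource offline."""

    @abstractmethod
    def is_alive(self) -> bool:
        """Report whether the resource is currently healthy."""


class InMemoryService(ResourceDll):
    """Hypothetical resource whose whole state is one flag."""

    def __init__(self) -> None:
        self.running = False

    def online(self) -> None:
        self.running = True

    def offline(self) -> None:
        self.running = False

    def is_alive(self) -> bool:
        return self.running


class ResourceMonitor:
    """Calls the entry points on behalf of the cluster service."""

    def __init__(self, resource: ResourceDll) -> None:
        self.resource = resource

    def poll(self) -> str:
        # Assumption: one restart attempt before reporting failure.
        if self.resource.is_alive():
            return "online"
        self.resource.online()
        return "restarted" if self.resource.is_alive() else "failed"
```

A real resource DLL would talk to its resource over whatever IPC mechanism suits it; the in-memory flag above only keeps the sketch self-contained.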

Page 28

Comparison: High-Availability Clusters and the Stratus ftServer Family

Clusters are still the most frequent choice for meeting defined targets for availability in Windows server environments.

High-availability clusters and Stratus ftServer systems start out with one common characteristic: Both use redundant hardware to eliminate single points of failure. However, similarities between the two approaches end there.

Page 29

The Stratus ftServer family offers a choice of Dual Modular Redundant (DMR) and Triple Modular Redundant (TMR) configurations. DMR models include two lockstepped CPU/memory units; TMR systems contain a total of three CPU/memory units.

In both systems, duplicate or triplicate motherboards execute all instructions in lockstep. If proprietary on-board error detection circuitry identifies a fault, that motherboard is immediately isolated from the system and removed from service.

Page 30

A second level of error detection compares outputs from each CPU/memory unit on each I/O operation.

In a DMR system, if a comparison error occurs with no on-board error indication, software algorithms based on motherboard history are used to determine which board to remove from service.

In a TMR system, "odd-man-out" voting logic is used to identify and isolate additional faults. In either event, processing continues on the remaining motherboards without interruption or performance degradation.

The entire error detection and isolation process occurs in just milliseconds, without any interruption to system operation.
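
"Odd-man-out" voting across three units can be expressed as a majority vote that also reports which unit disagreed. The sketch below illustrates the general technique in software only; the actual Stratus logic is implemented in hardware and is not described in these slides, so the function name `vote` and its return shape are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def vote(outputs: Sequence[int]) -> Tuple[int, Optional[int]]:
    """Return (majority value, index of the disagreeing unit or None)."""
    if len(outputs) != 3:
        raise ValueError("TMR voting expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None  # all units agree, nothing to isolate
    if count == 2:
        # Exactly one unit disagrees with the other two: that is the odd man out.
        odd = next(i for i, out in enumerate(outputs) if out != value)
        return value, odd
    raise RuntimeError("no majority: all three units disagree")


if __name__ == "__main__":
    print(vote([7, 7, 7]))
    print(vote([7, 9, 7]))
```

With only two units, as in a DMR system, a mismatch shows that something is wrong but not which unit is at fault, which is why the slide mentions history-based software algorithms for that case.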

Page 31

In contrast, a high-availability cluster initiates its failover process only after a failed node does not send a "heartbeat" message. Crucial seconds can elapse before the working node even begins the failover routine, which results in downtime even under the best of circumstances.

After failover initiation, the new cluster is formed, the database is recovered, and applications are restarted. This recovery sequence can span many minutes depending on the complexity and sophistication of the application environment and the cluster configuration.

Page 32

In turn, the absence of a heartbeat may not always be a reliable indicator of system error or availability, for example in the case of slow performance. In these cases, failover sequences may be initiated unnecessarily.
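
Both weaknesses of heartbeat-based detection, the built-in detection delay and the risk of a false positive on a slow node, show up in even a minimal timeout monitor. The class below is an illustrative sketch; `HeartbeatMonitor`, its method names, and the 5-second timeout are assumptions, and timestamps are passed in explicitly so the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Declares a peer failed once no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        self.last_seen = now

    def heartbeat(self, now: float) -> None:
        """Record that a heartbeat message arrived at time `now`."""
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        """True once the silence has lasted longer than the timeout."""
        return (now - self.last_seen) > self.timeout


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=5.0)
    monitor.heartbeat(now=1.0)
    print(monitor.peer_failed(now=5.9))  # still within the timeout window
    print(monitor.peer_failed(now=6.1))  # failover would start only now
    # A slow but healthy node: its heartbeat shows up late, after the
    # monitor has already declared it failed.
    monitor.heartbeat(now=7.0)
    print(monitor.peer_failed(now=7.5))
```

A shorter timeout detects real failures sooner but makes unnecessary failovers more likely; a longer one does the opposite.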

Page 33

Software Availability

A frequently mentioned strength of high-availability clusters is their ability to provide a superior degree of software availability compared to traditional hardware fault-tolerant offerings.

Stratus recognizes the need to enhance total system availability and delivers a number of unique features that speed recovery in the event of a software outage:

1. Hardened device drivers
2. ftServer Software Availability Manager
3. ftServer Active Upgrade™ technology
4. Quick Dump

Page 34

Hardened device drivers

Specifically, hardened drivers detect and stop adapter card writes beyond the physical memory allocated; monitor the mean time between failures (MTBF) of the PCI card and remove it from service if thresholds are crossed; and support visual indicators that communicate the device state.

Page 35

ftServer Software Availability Manager

In addition to Microsoft Windows operating system monitoring, the ftServer Software Availability Manager tracks the activity of CPU, memory, and disk resources against thresholds specified by the system administrator.
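
Checking resource activity against administrator-defined thresholds amounts to a simple comparison. The sketch below illustrates that idea only and is not the Stratus product's interface; the function name `exceeded_thresholds`, the resource names, and the percentage values are hypothetical.

```python
from typing import Dict, List


def exceeded_thresholds(usage: Dict[str, float],
                        thresholds: Dict[str, float]) -> List[str]:
    """Return the names of resources whose usage is above its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if usage.get(name, 0.0) > limit  # a missing reading counts as zero usage
    )


if __name__ == "__main__":
    thresholds = {"cpu": 90.0, "memory": 80.0, "disk": 90.0}  # percent, set by the admin
    usage = {"cpu": 95.0, "memory": 60.0, "disk": 91.0}       # percent, sampled readings
    print(exceeded_thresholds(usage, thresholds))
```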

Page 36

ftServer Active Upgrade™ technology

Active Upgrade technology adds a new availability dimension beyond the field-proven 99.999% uptime protection for which Stratus servers are known. Unlike cluster online upgrade, the complexities associated with upgrading multiple systems, revision synchronization between systems, and recovery from issues encountered during an upgrade are vastly reduced because the Active Upgrade process takes place on a single fault-tolerant ftServer system.

Page 37

Quick Dump

This facility allows the server to be restarted rapidly after an operating system outage, without sacrificing information needed to analyze the cause.

Page 38

Conclusion

Fault-tolerance and software-based clustering provide two powerful options for achieving mission-critical availability.

Windows–based fault-tolerant solutions carry far lower costs than proprietary solutions, enabling companies in all industries to achieve a positive return on investment in a reasonable timeframe across a much broader range of scenarios.

As a result, Windows–based fault-tolerant solutions are very capable of high availability with true fault-tolerant server technologies like Stratus ftServer systems.
