Clustering: Next Wave In PC Computing
2PP150299.ppt
Cluster Concepts 101
This section is about clusters in general; we’ll get to Microsoft’s Wolfpack cluster implementation in the next section.
Why Learn About Clusters
Today clusters are a niche Unix market, but Microsoft will bring clusters to the masses. Last October, Microsoft announced NT clusters; SCO announced UnixWare clusters; Sun announced Solaris/Intel clusters; Novell announced Wolf Mountain clusters.
In 1998, 2M Intel servers will ship, 100K in clusters. In 2001, 3M Intel servers will ship, 1M in clusters (IDC’s forecast).
Clusters will be a huge market, and RAID is essential to clusters.
What Are Clusters?
A group of independent systems that function as a single system, appear to users as a single system, and are managed as a single system.
Clusters are “virtual servers”.
Why Clusters
#1. Clusters Improve System Availability -- the primary value in Wolfpack-I clusters
#2. Clusters Enable Application Scaling
#3. Clusters Simplify System Management
#4. Clusters (with Intel servers) Are Cheap
Why Clusters - #1
#1. Clusters Improve System Availability
When a networked server fails, the service it provided is down. When a clustered server fails, the service it provided “fails over” and downtime is avoided.
[Diagram: standalone networked Mail and Internet servers vs. clustered servers hosting Mail & Internet]
Why Clusters - #2
#2. Clusters Enable Application Scaling
With networked SMP servers, application scaling is limited to a single server. With clusters, applications scale across multiple SMP servers (typically up to 16 servers).
Why Clusters - #3
#3. Clusters Simplify System Management
Clusters present a Single System Image: the cluster looks like a single server to management applications. Hence, clusters reduce system management costs.
[Diagram: three management domains collapsing into one management domain]
Why Clusters - #4
#4. Clusters (with Intel servers) Are Cheap
Essentially no additional hardware costs. Microsoft charges an extra $3K per node: Windows NT Server is $1,000; Windows NT Server, Enterprise Edition is $4,000.
Note: proprietary Unix cluster software costs $10K to $25K per node.
An Analogy to RAID
RAID makes disks fault tolerant; clusters make servers fault tolerant.
RAID increases I/O performance; clusters increase compute performance.
RAID makes disks easier to manage; clusters make servers easier to manage.
Two Flavors of Clusters
#1. High Availability Clusters -- Microsoft’s Wolfpack 1, Compaq’s Recovery Server
#2. Load Balancing Clusters (a.k.a. Parallel Application Clusters) -- Microsoft’s Wolfpack 2, Digital’s VAXClusters
Note: load balancing clusters are a superset of high availability clusters.
High Availability Clusters
Two node clusters (node = server). During normal operations, both servers do useful work.
Failover: when a node fails, applications fail over to the surviving node, which assumes the workload of both nodes.
[Diagram: Mail and Web nodes consolidating to Mail & Web on the survivor]
High Availability Clusters (cont’d)
Failback: when the failed node is returned to service, the applications fail back.
[Diagram: Mail & Web splitting back into Mail on one node and Web on the other]
Load Balancing Clusters
Multi-node clusters (two or more nodes). Load balancing clusters typically run a single application (e.g. a database) distributed across all nodes.
Cluster capacity is increased by adding nodes, but, as with SMP servers, scaling is less than linear.
[Diagram: cluster throughput growing from 3,000 TPM to 3,600 TPM as a node is added]
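The slide’s throughput numbers can be turned into a quick arithmetic check. A minimal sketch, assuming the diagram shows a two-node cluster at 3,000 TPM growing to three nodes at 3,600 TPM (the efficiency formula and function name are illustrative, not from the deck):

```python
# Hypothetical sketch: less-than-linear cluster scaling.
# A 2-node cluster at 3,000 TPM grows to 3,600 TPM with a third
# node; perfectly linear scaling would have yielded
# 3,000 * 3/2 = 4,500 TPM.

def scaling_efficiency(old_tpm: float, new_tpm: float,
                       old_nodes: int, new_nodes: int) -> float:
    """Ratio of achieved throughput to ideal (linear) throughput."""
    ideal = old_tpm * new_nodes / old_nodes
    return new_tpm / ideal

eff = scaling_efficiency(3000, 3600, 2, 3)
print(f"scaling efficiency: {eff:.0%}")  # 3600/4500 = 80%
```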
Load Balancing Clusters (cont’d)
The cluster rebalances the workload when a node dies. If different apps are running on each server, they fail over to the least busy server, or as directed by predefined failover policies.
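The failover-target rule above can be sketched as follows; the function and policy names are hypothetical, not MSCS APIs:

```python
# Hypothetical sketch of a failover-target policy: prefer a
# predefined failover node if one is configured, otherwise send
# the app to the least busy surviving node.

def pick_failover_target(app, surviving_nodes, policies, load):
    """Return the node that should host `app` after its node dies."""
    preferred = policies.get(app)            # predefined failover policy
    if preferred in surviving_nodes:
        return preferred
    # fall back to the least busy survivor
    return min(surviving_nodes, key=lambda n: load[n])

nodes = ["B", "C"]
load = {"B": 0.7, "C": 0.3}
print(pick_failover_target("mail", nodes, {}, load))           # C (least busy)
print(pick_failover_target("web", nodes, {"web": "B"}, load))  # B (policy)
```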
Two Cluster Models
#1. The “Shared Nothing” model -- Microsoft’s Wolfpack cluster
#2. The “Shared Disk” model -- VAXClusters
#1. “Shared Nothing” Model
At any moment in time, each disk is owned and addressable by only one server.
The “shared nothing” terminology is confusing: access to the disks is shared -- they sit on the same bus -- but at any moment in time, the disks themselves are not shared.
[Diagram: two servers on a shared bus with a RAID array]
#1. “Shared Nothing” Model (cont’d)
When a server fails, the disks that it owns “fail over” to the surviving server, transparently to the clients.
[Diagram: disk ownership moving to the surviving server]
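The ownership rule can be made concrete with a toy model. This is an illustrative sketch of the shared-nothing discipline, not Microsoft’s implementation:

```python
# Toy model of the "shared nothing" rule: every disk has exactly
# one owning node at any moment, and on node failure its disks
# fail over to the surviving node. All names are illustrative.

class SharedNothingCluster:
    def __init__(self, nodes):
        self.owner = {}          # disk -> owning node
        self.nodes = set(nodes)

    def assign(self, disk, node):
        self.owner[disk] = node  # only the owner may address the disk

    def can_access(self, node, disk):
        return self.owner.get(disk) == node

    def fail_node(self, dead):
        self.nodes.discard(dead)
        survivor = next(iter(self.nodes))
        for disk, node in self.owner.items():
            if node == dead:     # disks "fail over" to the survivor
                self.owner[disk] = survivor

cluster = SharedNothingCluster(["A", "B"])
cluster.assign("data1", "A")
cluster.fail_node("A")
print(cluster.can_access("B", "data1"))  # True
```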
#2. “Shared Disk” Model
Disks are not owned by individual servers but shared by all servers; at any moment in time, any server can access any disk.
A Distributed Lock Manager arbitrates disk access so apps on different servers don’t step on one another (corrupt data).
[Diagram: both servers accessing a shared RAID array]
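A minimal sketch of what the Distributed Lock Manager does, using an in-process lock as a stand-in for a cluster-wide lock service (all names are illustrative):

```python
# Hedged sketch of a Distributed Lock Manager's role in the
# shared-disk model: any node may access any disk, but it must
# hold the lock first so writers on different nodes don't
# corrupt each other's data.
import threading

class DistributedLockManager:
    """Toy in-process stand-in for a cluster-wide lock service."""
    def __init__(self):
        self.locks = {}              # disk -> threading.Lock
        self.guard = threading.Lock()

    def acquire(self, node, disk):
        # `node` is kept for clarity; a real DLM tracks the holder.
        with self.guard:
            lock = self.locks.setdefault(disk, threading.Lock())
        lock.acquire()               # blocks while another node holds it
        return lock

    def release(self, lock):
        lock.release()

dlm = DistributedLockManager()
lock = dlm.acquire("node1", "disk0")   # node1 may now write disk0
# ... node2 calling dlm.acquire("node2", "disk0") would block here ...
dlm.release(lock)
```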
Cluster Interconnect
This section is about how servers are tied together and how disks are physically connected to the cluster.
Cluster Interconnect (cont’d)
Clustered servers always have a client network interconnect, typically Ethernet, to talk to users, and at least one cluster interconnect to talk to other nodes and to disks.
[Diagram: client network on top; cluster interconnect with HBAs down to a RAID array]
Cluster Interconnects (cont’d)
Or they can have two cluster interconnects: one for nodes to talk to each other -- the “Heartbeat Interconnect”, typically Ethernet -- and one for nodes to talk to disks -- the “Shared Disk Interconnect”, typically SCSI or Fibre Channel.
[Diagram: NICs on the heartbeat interconnect; HBAs on the shared disk interconnect to a RAID array]
Microsoft’s Wolfpack Clusters
Clusters Are Not New
Clusters have been around since 1985, and most Unix systems are clustered.
What’s new is Microsoft clusters: code named “Wolfpack”, officially named Microsoft Cluster Server (MSCS) -- the software that provides clustering. MSCS is part of Windows NT, Enterprise Server.
Microsoft Cluster Rollout
Wolfpack-I: in Windows NT, Enterprise Server 4.0 (NT/E 4.0), which also includes Transaction Server and Reliable Message Queue. Two node “failover cluster”. Shipped October 1997.
Wolfpack-II: in Windows NT, Enterprise Server 5.0 (NT/E 5.0). “N” node (probably up to 16) “load balancing cluster”. Beta in 1998; ships in 1999.
MSCS (NT/E 4.0) Overview
Two node “failover” cluster using the “shared nothing” model: at any moment in time, each disk is owned and addressable by only one server.
Two cluster interconnects: a “heartbeat” cluster interconnect (Ethernet) and a shared disk interconnect -- SCSI (any flavor) or Fibre Channel (SCSI protocol over Fibre Channel).
Each node has a “private system disk” (boot disk).
MSCS (NT/E 4.0) Topologies
#1. Host-based (PCI) RAID arrays
#2. External RAID arrays
NT Cluster with Host-Based RAID Array
Each node has: an Ethernet NIC (heartbeat), a private system disk (generally on an HBA), and a PCI-based RAID controller -- SCSI or Fibre.
Nodes share access to data disks but do not share data.
[Diagram: two nodes with NICs on the “Heartbeat” interconnect and RAID controllers on the shared disk interconnect]
NT Cluster with SCSI External RAID Array
Each node has an Ethernet NIC (heartbeat); multi-channel HBAs connect the boot disk and the external array.
A shared external RAID controller sits on the SCSI bus -- the DAC SX.
[Diagram: two nodes with NICs on the “Heartbeat” interconnect and HBAs on the shared disk interconnect to the external RAID array]
NT Cluster with Fibre External RAID Array
DAC SF or DAC FL (SCSI to the disks); DAC FF (Fibre to the disks).
[Diagram: two nodes with NICs on the “Heartbeat” interconnect and HBAs on a Fibre Channel shared disk interconnect to the external RAID array]
MSCS -- A Few of the Details
Cluster Interconnect & Heartbeats
Cluster interconnect: a private Ethernet between nodes, used to transmit “I’m alive” heartbeat messages.
Heartbeat messages: when a node stops getting heartbeats, it assumes the other node has died and initiates failover. In some failure modes both nodes stop getting heartbeats (a NIC dies, or someone trips over the cluster cable). Both nodes are still alive, but each thinks the other is dead -- “split brain” syndrome -- and both nodes initiate failover. Who wins?
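Timeout-based heartbeat detection, and why it alone cannot distinguish a dead peer from a dead cable, can be sketched as follows (the timeout value and function names are assumptions for illustration):

```python
# Sketch of heartbeat-based failure detection with a simple
# timeout rule. It also shows why split brain happens: if the
# heartbeat link itself dies, BOTH nodes see a timeout and both
# would initiate failover -- the quorum disk breaks that tie.

HEARTBEAT_TIMEOUT = 5.0  # seconds without "I'm alive" messages

def peer_looks_dead(last_heartbeat: float, now: float) -> bool:
    return now - last_heartbeat > HEARTBEAT_TIMEOUT

# Normal failure: node B crashed; node A correctly initiates failover.
print(peer_looks_dead(last_heartbeat=100.0, now=110.0))  # True

# Split brain: the cable is unplugged at t=100; both nodes are alive,
# yet each one's view of the other times out.
a_sees_b_dead = peer_looks_dead(100.0, 110.0)
b_sees_a_dead = peer_looks_dead(100.0, 110.0)
print(a_sees_b_dead and b_sees_a_dead)  # True -- both would fail over
```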
Quorum Disk
A special cluster resource that stores the cluster log. When a node joins a cluster, it attempts to reserve the quorum disk.
If the quorum disk does not have an owner, the node takes ownership and forms a cluster. If the quorum disk has an owner, the node joins the cluster.
[Diagram: two nodes and the quorum disk on the shared disk interconnect]
Quorum Disk (cont’d)
If the nodes cannot communicate (no heartbeats), then only one is allowed to continue operating. They use the quorum disk to decide which one lives:
Each node waits, then tries to reserve the quorum disk.
The last owner waits the shortest time and, if it’s still alive, takes ownership of the quorum disk.
When the other node attempts to reserve the quorum disk, it finds that it’s already owned.
The node that doesn’t own the quorum disk then fails over.
This is called the Challenge/Defense protocol.
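The Challenge/Defense steps above can be sketched as a toy arbitration; the reserve call stands in for a SCSI reserve, and all names are illustrative:

```python
# Hedged sketch of the Challenge/Defense protocol: on loss of
# heartbeats each node waits, then tries to reserve the quorum
# disk. The last owner waits the shortest time, so if it is still
# alive it re-reserves first and the challenger backs down.

class QuorumDisk:
    def __init__(self):
        self.owner = None

    def try_reserve(self, node):
        if self.owner in (None, node):
            self.owner = node
            return True
        return False             # already reserved by the other node

def arbitration_order(nodes, last_owner):
    # The last owner waits the shortest time, so it challenges first.
    return sorted(nodes, key=lambda n: 0 if n == last_owner else 1)

quorum = QuorumDisk()
survivors = []
for node in arbitration_order(["A", "B"], last_owner="B"):
    if quorum.try_reserve(node):
        survivors.append(node)   # defender keeps the cluster running
    # else: this node loses arbitration and fails over

print(survivors)  # ['B'] -- the last owner wins the quorum disk
```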
Microsoft Cluster Server (MSCS)
MSCS objects: there are lots of MSCS objects, but only two we care about -- resources and groups.
Resources: applications, data files, disks, IP addresses, ...
Groups: an application and its related resources, such as data on disks.
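The resource/group model, with the group as the unit of failover, can be sketched with illustrative classes (not the MSCS API):

```python
# Illustrative model of the two MSCS objects the slides focus on:
# Resources (apps, data files, disks, IP addresses, ...) collected
# into Groups, where the Group is the unit of failover and failback.

class Group:
    def __init__(self, name, resources, node):
        self.name = name
        self.resources = resources   # e.g. ["mail app", "data disk", "IP"]
        self.node = node             # current hosting node

def fail_over(groups, dead_node, survivor):
    for g in groups:
        if g.node == dead_node:      # the whole group moves, disks included
            g.node = survivor

mail = Group("Mail", ["mail app", "data disk", "IP address"], node="A")
web = Group("Web", ["web app", "data disk", "IP address"], node="B")

fail_over([mail, web], dead_node="A", survivor="B")
print(mail.node, web.node)  # B B
```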
Microsoft Cluster Server (MSCS) (cont’d)
When a server dies, groups fail over. When a server is repaired and returned to service, groups fail back. Since data on disks is included in groups, disks fail over and fail back.
[Diagram: Mail and Web groups, each a box of resources, hosted across the two nodes]
Groups Failover
Groups are the entities that fail over, and they take their disks with them.
[Diagram: the Mail groups failing over to join the Web groups on the surviving node]
Microsoft Cluster Certification
Two levels of certification:
Cluster component certification -- HBAs and RAID controllers must be certified. When they pass, they’re listed on the Microsoft web site (www.microsoft.com/hwtest/hcl/) and eligible for inclusion in cluster system certification.
Cluster system certification -- a complete two node cluster. When it passes, it’s listed on the Microsoft web site and supported by Microsoft.
Each certification takes 30 - 60 days.
Mylex NT Cluster Solutions
Internal vs External RAID Positioning
Internal RAID: lower cost solution; higher performance in read-intensive applications; proven TPC-C performance enhances cluster performance.
External RAID: higher performance in write-intensive applications (write-back cache is turned off in PCI-RAID controllers); higher connectivity (attach more disk drives); greater footprint flexibility (until PCI-RAID implements Fibre).
42PP150299.ppt
Why We’re Better -- External RAID
Robust active-active Fibre implementation: shipping active-active for over a year; it works in NT (certified) and Unix environments; Fibre on the back end soon.
Mirrored cache architecture: without a mirrored cache, data is inaccessible or dropped on the floor when a controller fails -- unless you turn off the write-back cache, which degrades write performance by 5x to 30x.
Four to six disk channels: I/O bandwidth and capacity scaling.
Dual Fibre host ports: NT expects to access data over pre-configured paths; if it doesn’t find the data over the expected path, I/Os don’t complete and applications fail.
43PP150299.ppt
SX Active/Active Duplex
[Diagram: two nodes with HBAs on the cluster interconnect; dual DAC SX controllers on the Ultra SCSI disk interconnect]
44PP150299.ppt
SF (or FL) Active/Active Duplex
[Diagram: two nodes with FC HBAs on a single FC array interconnect to dual DAC SF controllers]
45PP150299.ppt
SF (or FL) Active/Active Duplex (cont’d)
[Diagram: dual FC array interconnects -- FC HBAs on each node, an FC disk interconnect, and dual DAC SF controllers]
46PP150299.ppt
FF Active/Active Duplex
[Diagram: single FC array interconnect -- FC HBAs on each node to dual DAC FF controllers]
47PP150299.ppt
FF Active/Active Duplex (cont’d)
[Diagram: dual FC array interconnects -- FC HBAs on each node to dual DAC FF controllers]
48PP150299.ppt
Why We’ll Be Better -- Internal RAID
Deliver auto-rebuild.
Deliver RAID expansion: MORE-I adds logical units on-line; MORE-II adds or expands logical units on-line.
Deliver RAID level migration: 0 -> 1, 1 -> 0, 0 -> 5, 5 -> 0, 1 -> 5, 5 -> 1.
And (of course) award winning performance.
49PP150299.ppt
NT Cluster with Host-Based RAID Array
Nodes have: Ethernet NIC (heartbeat), private system disks (HBA), and a PCI-based RAID controller.
[Diagram: two nodes with NICs on the “Heartbeat” interconnect and eXtremeRAID controllers on the shared disk interconnect]
50PP150299.ppt
Why eXtremeRAID & DAC960PJ Clusters
Typically four or fewer processors.
Offers a less expensive, integrated RAID solution.
Can combine clustered and non-clustered applications in the same enclosure.
Uses today’s readily available hardware.
51PP150299.ppt
TPC-C Performance for Clusters
DAC960PJ: two external Ultra channels at 40 MB/sec; three internal Ultra channels at 40 MB/sec.
32-bit PCI bus between the controller and the server, providing burst data transfer rates up to 132 MB/sec.
66 MHz i960 processor off-loads RAID management from the host CPU.
52PP150299.ppt
eXtremeRAID™: Blazing Clusters
eXtremeRAID™ achieves a breakthrough in RAID technology, eliminates storage bottlenecks, and delivers scalable performance for NT clusters.
[Board diagram: 233 MHz RISC CPU, NVRAM, SCSI-PCI bridge, BASS, DAC memory module with BBU, serial port, LEDs; SCSI channels Ch 0 (bottom), Ch 1, and Ch 2 (top), each at 80 MB/sec]
64-bit PCI bus doubles data bandwidth between the controller and the server, providing burst data transfer rates up to 266 MB/sec.
3 Ultra2 SCSI LVD channels for up to 42 shared storage devices, with connectivity up to 12 meters.
233 MHz StrongARM RISC processor off-loads RAID management from the host CPU.
Mylex’s new firmware is optimized for performance and manageability.
eXtremeRAID™ supports up to 42 drives per cluster, as much as 810 GB of capacity per controller. Performance increases as you add drives.
53PP150299.ppt
eXtremeRAID™ 1100 NT Clusters
Nodes have: Ethernet NIC (heartbeat), private system disks (HBA), and a PCI-based RAID controller. Nodes share access to data disks but do not share data.
[Diagram: two nodes with NICs on the “Heartbeat” interconnect and eXtremeRAID controllers on three shared Ultra2 interconnects]
54PP150299.ppt
Cluster Support Plans
Internal RAID: Windows NT 4.0 -- 1998; Windows NT 5.0 -- 1999; Novell Orion -- Q4 98; SCO -- TBD; Sun -- TBD.
External RAID: Windows NT 4.0 -- 1998; Windows NT 5.0 -- 1999; Novell Orion -- TBD; SCO -- TBD.
55PP150299.ppt
Plans For NT Cluster Certification
Microsoft clustering (submission dates):
DAC SX -- completed (simplex); DAC SF -- completed (simplex); DAC SX -- July (duplex); DAC SF -- July (duplex); DAC FL -- August (simplex); DAC FL -- August (duplex); DAC960PJ -- Q4 ’99; eXtremeRAID™ 1164 -- Q4 ’99; AcceleRAID™ -- Q4 ’99.
56PP150299.ppt
What RAID Arrays Are Right for Clusters
Internal: eXtremeRAID™ 1100, AcceleRAID™ 200, AcceleRAID™ 250
External: DAC SF, DAC FL, DAC FF