
Three Talks
• Scalability Terminology – Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this – Laing
• The M$ PetaByte (as time allows) – Gray


Terminology for Scalability
Bill Devlin, Jim Gray, Bill Laing, George Spix
Paper at: ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc

• Farms of servers:
  – Clones: identical
    • Scalability + availability
  – Partitions:
    • Scalability
  – Packs:
    • Partition availability via fail-over
• GeoPlex – for disaster tolerance.

[Diagram: the taxonomy tree — a Farm contains Clones (Shared-Nothing or Shared-Disk) and Partitions grouped into Packs (Shared-Nothing, Active-Active, or Active-Passive); the whole farm is duplicated in a GeoPlex.]


Unpredictable Growth
• The TerraServer story:
  – Expected 5M hits per day
  – Got 50M hits on day 1
  – Peaked at 20M hits per day on a "hot" day
  – Averaged 5M hits per day over the last two years
• Most of us cannot predict demand:
  – Must be able to deal with NO demand
  – Must be able to deal with HUGE demand


Web Services Requirements
• Scalability: need to be able to add capacity
  – New processing
  – New storage
  – New networking
• Availability: need continuous service
  – Online change of all components (hardware and software)
  – Multiple service sites
  – Multiple network providers
• Agility: need great tools
  – Manage the system
  – Change the application several times per year
  – Add new services several times per year


Premise: Each Site is a Farm
• Buy computing by the slice (brick):
  – Rack of servers + disks
  – Functionally specialized servers
• Grow by adding slices:
  – Spread data and computation to new slices
• Two styles:
  – Clones: anonymous servers
  – Parts+Packs: partitions fail over within a pack
• In both cases, a GeoPlex remote farm for disaster recovery.

[Diagram: The Microsoft.Com site, circa 1998 — switched Ethernet and FDDI rings (MIS1–MIS4) behind primary and secondary Gigaswitches and routers; Internet feeds of 13 DS3 (45 Mb/s each), 2 OC3 (100 Mb/s each), and 2 Ethernet (100 Mb/s each); functionally specialized clone groups (www, home, search, register, support, premium, msid, activex, cdm, FTP/HTTP download) replicated across MOSWest, Building 11, and the European and Japan Data Centers, plus live and staging SQL Servers, SQL consolidators and reporting, download replication, and log processing. Per-group average configurations run 4xP5/P6, 256 MB–1 GB RAM, 12–180 GB disk, at $24K–$128K per server.]

Microsoft.com, late 2000 (Canyon Park) server counts:
FTP 6, Build Servers 32, IIS 210, Application 2, Exchange 24, Network/Monitoring 12, SQL 120, Search 2, NetShow 3, NNTP 16, SMTP 6, Stagers 26 — total 459.


Scalable Systems
• Scale UP: grow by adding components to a single system.
• Scale OUT: grow by adding more systems.


Scale UP and Scale OUT
• Everyone does both.
• Choices:
  – Size of a brick
  – Clones or partitions
  – Size of a pack
• Whose software? Scale-up and scale-out both have a large software component.
• 1 M$/slice: IBM S390? Sun E10000?
• 100 K$/slice: Wintel 8x
• 10 K$/slice: Wintel 4x
• 1 K$/slice: Wintel 1x


Clones: Availability + Scalability
• Some applications are:
  – Read-mostly
  – Low consistency requirements
  – Modest storage requirement (less than 1 TB)
• Examples:
  – HTML web servers (IP sprayer/sieve + replication)
  – LDAP servers (replication via gossip)
• Replicate the app at all nodes (clones).
• Load balance:
  – Spray & sieve: requests across nodes
  – Route: requests across nodes
• Grow by adding clones.
• Fault tolerance: stop sending to a failed clone (a minimal sketch follows).
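The spray/sieve routing loop is simple enough to sketch. A minimal Python illustration (the class and method names are invented for this sketch; real IP sprayers such as NLB do this in the network stack):

```python
import itertools

class ClonePool:
    """Spray requests across identical clones; sieve out clones that stop
    responding. Names are illustrative, not a real load-balancer API."""

    def __init__(self, clones):
        self.clones = list(clones)
        self.healthy = set(self.clones)
        self._rr = itertools.cycle(self.clones)

    def mark_down(self, clone):
        # Fault tolerance for clones: simply stop sending to that clone.
        self.healthy.discard(clone)

    def mark_up(self, clone):
        # A rebooted or replaced clone rejoins the spray.
        self.healthy.add(clone)

    def route(self, request):
        # Round-robin "spray", skipping unhealthy nodes ("sieve").
        for _ in range(len(self.clones)):
            clone = next(self._rr)
            if clone in self.healthy:
                return clone, request
        raise RuntimeError("no healthy clones")
```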


Two Clone Geometries
• Shared-Nothing: exact replicas.
• Shared-Disk: state stored in a shared server.
[Diagram: Shared-Nothing clones vs. Shared-Disk clones.]
If clones have any state, make it disposable. Manage clones by reboot; failing that, replace. One person can manage thousands of clones.


Clone Requirements
• Automatic replication (if they have any state):
  – Applications (and system software)
  – Data
• Automatic request routing:
  – Spray or sieve
• Management:
  – Who is up? (a heartbeat sketch follows this list)
  – Update management & propagation
  – Application monitoring
• Clones are very easy to manage:
  – Rule of thumb: 100s of clones per admin.
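The "who is up?" question is typically answered with heartbeats. A hypothetical sketch of clone membership tracking (interval and threshold values are illustrative; this is not the NLB/MSCS membership protocol):

```python
import time

HEARTBEAT_INTERVAL = 2.0   # seconds between pings (illustrative)
SUSPECT_AFTER = 3          # missed beats before a clone is declared down

class Membership:
    """Answer 'who is up?' from periodic heartbeats."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, clone_id):
        # Each clone calls this (e.g. over UDP) every HEARTBEAT_INTERVAL.
        self.last_seen[clone_id] = time.monotonic()

    def up(self):
        deadline = time.monotonic() - HEARTBEAT_INTERVAL * SUSPECT_AFTER
        return {c for c, t in self.last_seen.items() if t >= deadline}
```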


Partitions for Scalability
• Clones are not appropriate for some apps:
  – Stateful apps do not replicate well
  – High update rates do not replicate well
• Examples:
  – Email
  – Databases
  – Read/write file servers
  – Cache managers
  – Chat
• Partition state among servers.
• Partitioning:
  – Must be transparent to the client
  – Split & merge partitions online (see the sketch below)
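A minimal sketch of transparent partition routing with an online split, assuming hash-range partitioning (all names are invented for the sketch; real systems also migrate the state when a range moves):

```python
import bisect
import hashlib

class PartitionMap:
    """Range-partition a hashed key space across servers. Clients call
    route() and never see which partition serves them; split() rebalances."""

    def __init__(self, servers, space=2**32):
        n = len(servers)
        # bounds[i] is the exclusive upper bound of partition i.
        self.bounds = [space * (i + 1) // n for i in range(n)]
        self.owners = list(servers)

    def _hash(self, key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def route(self, key):
        # Transparent routing: find the partition whose range covers the hash.
        return self.owners[bisect.bisect_right(self.bounds, self._hash(key))]

    def split(self, i, new_server):
        # Split partition i in half; the upper half moves to new_server.
        lo = self.bounds[i - 1] if i > 0 else 0
        mid = (lo + self.bounds[i]) // 2
        self.bounds.insert(i, mid)
        self.owners.insert(i + 1, new_server)
```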


Packs for Availability
• Each partition may fail (independently of the others).
• Partitions migrate to a new node via fail-over:
  – Fail-over in seconds
• Pack: the nodes supporting a partition
  – VMS Cluster, Tandem, SP2 HACMP, …
  – IBM Sysplex™
  – WinNT MSCS (Wolfpack)
• Partitions typically grow in packs.
• Active-Active: all nodes provide service (see the fail-over sketch below).
• Active-Passive: hot standby is idle.
• Cluster-in-a-box is now a commodity.
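A toy model of a pack, showing active-active versus active-passive fail-over (hypothetical names; not the MSCS API):

```python
class Pack:
    """Nodes that jointly support one partition; on failure the partition
    migrates to a surviving member in seconds."""

    def __init__(self, partition, nodes, active_active=True):
        self.partition = partition
        self.nodes = list(nodes)  # members share (or mirror) the state
        # Active-active: every node serves; active-passive: only nodes[0].
        self.active = list(nodes) if active_active else [nodes[0]]

    def fail(self, node):
        self.nodes.remove(node)
        if node in self.active:
            self.active.remove(node)
        if not self.active and self.nodes:
            # Active-passive fail-over: promote the hot standby.
            self.active.append(self.nodes[0])
        if not self.nodes:
            raise RuntimeError(f"partition {self.partition} lost its pack")

    def serve(self, request):
        # Clients are unaware which pack member actually serves them.
        return self.active[hash(request) % len(self.active)]
```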


Partitions and Packs
• Partitions: scalability.
• Packed partitions: scalability + availability.


Parts+Packs Requirements
• Automatic partitioning (in DBMS, mail, files, …):
  – Location transparent
  – Partition split/merge
  – Grow without limits (100 x 10 TB)
  – Application-centric request routing
• Simple fail-over model:
  – Partition migration is transparent
  – MSCS-like model for services
• Management:
  – Automatic partition management (split/merge)
  – Who is up?
  – Application monitoring


GeoPlex: Farm Pairs
• Two farms (or more).
• State (your mailbox, bank account) stored at both farms.
• Changes from one sent to the other (see the sketch below).
• When one farm fails, the other provides service.
• Masks:
  – Hardware/software faults
  – Operations tasks (reorganize, upgrade, move)
  – Environmental faults (power fail, earthquake, fire)
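Change shipping between farm pairs can be sketched as a local commit plus an asynchronous log queue (assuming a remote.apply() RPC exists; ordering and retry are elided):

```python
import queue
import threading

class GeoPlexedStore:
    """Apply each change locally, then ship it asynchronously to the remote
    farm. Illustrative sketch of change (log) shipping, not a real product."""

    def __init__(self, remote):
        self.state = {}
        self.log = queue.Queue()
        threading.Thread(target=self._ship, args=(remote,), daemon=True).start()

    def write(self, key, value):
        self.state[key] = value      # the local farm commits first
        self.log.put((key, value))   # change queued for the other farm

    def _ship(self, remote):
        while True:
            change = self.log.get()
            remote.apply(change)     # assumed RPC; retries/ordering elided
```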


Directory, Fail-Over & Load Balancing
• Routes each request to the right farm:
  – A farm can be a clone or a partition
• At the farm, routes the request to the right service.
• At the service, routes the request to:
  – Any clone
  – The correct partition
• Routes around failures (a routing sketch follows).
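The three routing levels compose naturally; a hedged sketch (the directory, farm, and service objects are hypothetical):

```python
def route(request, directory):
    """Three-level routing: farm -> service -> clone or partition."""
    for farm in directory.farms_for(request.domain):      # nearest farm first
        if not farm.healthy:
            continue                                      # route around failures
        service = farm.service(request.service_name)
        if service.is_cloned:
            return service.any_healthy_clone()            # spray across clones
        return service.partition_for(request.key)         # correct partition
    raise RuntimeError("all farms unavailable")
```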


Availability
[Chart: the availability ladder — each step adds nines (9, 99, 999, …); a worked example follows:]
• Well-managed nodes: mask some hardware failures.
• Well-managed packs & clones: mask hardware failures, operations tasks (e.g. software upgrades), and some software failures.
• Well-managed GeoPlex: masks site failures (power, network, fire, move, …) and some operations failures.
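The nines on the ladder follow from the standard availability ratio; a worked example in LaTeX (the MTTF/MTTR figures are illustrative):

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \qquad
\text{nines} = -\log_{10}(1 - A)
% Example: MTTF = 1000 h, MTTR = 1 h gives
% A = 1000/1001 \approx 0.999, i.e. three nines (~8.8 h downtime/year).
```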


Cluster Scale-Out Scenarios
[Diagram: "The FARM: Clones and Packs of Partitions" — Web clients reach cloned front ends (firewall, sprayer, web server) through load balancing; behind them sit cloned/packed file servers (SQL temp state; Web file stores A and B, with replication between them) and a packed, partitioned SQL database (partitions 1–3) providing database transparency.]


Some Examples
• TerraServer:
  – 6 IIS clone front-ends (WLBS)
  – 3-partition 4-pack backend: 3 active, 1 passive
  – Partitioned by theme and geography (longitude)
  – 1/3 sysadmin
• Hotmail:
  – 1000 IIS clone HTTP login
  – 3400 IIS clone HTTP front door
  – +1000 clones for ad rotator, in/outbound, …
  – 115-partition backend (partitioned by mailbox)
  – Cisco LocalDirector for load balancing
  – 50 sysadmins
• Google (Inktomi is similar but smaller):
  – 700 clone spider
  – 300 clone indexer
  – 5-node GeoPlex (full replica)
  – 1,000 clones/farm do search
  – 100 clones/farm for HTTP
  – 10 sysadmins
See "Challenges to Building Scalable Services: A Survey of Microsoft's Internet Services," Steven Levi and Galen Hunt, http://big/megasurvey/megasurvey.doc.


Acronyms
• RACS: Reliable Arrays of Cloned Servers
• RAPS: Reliable Arrays of Partitioned and Packed Servers (the first P is silent).


Emissaries and Fiefdoms
• Emissaries are (nearly) stateless; emissaries are easy to clone.
• Fiefdoms are stateful; fiefdoms get partitioned.


Summary
• Terminology for scalability.
• Farms of servers:
  – Clones: identical
    • Scalability + availability
  – Partitions:
    • Scalability
  – Packs:
    • Partition availability via fail-over
• GeoPlex for disaster tolerance.

References:
• "Architectural Blueprint for Large eSites," Bill Laing, http://msdn.microsoft.com/msdn-online/start/features/DNAblueprint.asp
• "Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS," Bill Devlin, Jim Gray, Bill Laing, George Spix, MS-TR-99-85, ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc

[Diagram: the Farm / Clone (Shared-Nothing, Shared-Disk) / Partition / Pack (Shared-Nothing, Active-Active, Active-Passive) / GeoPlex taxonomy tree, repeated from the terminology slide.]


Three Talks
• Scalability Terminology – Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this – Laing
• The M$ PetaByte (as time allows) – Gray


What Windows is Doing
• Continued architecture and analysis work.
• AppCenter, BizTalk, SQL, SQL Service Broker, ISA, … are all key to clones/partitions.
• Exchange is an archetype:
  – Front ends, directory, partitioned, packs, transparent mobility.
• NLB (clones) and MSCS (packs).
• High-performance technical computing.
• Appliances and hardware trends.
• Management of these kinds of systems.
• Still need good ideas on …


Architecture and Design Work
• Produced an architectural blueprint for large eSites, published on MSDN:
  – http://msdn.microsoft.com/msdn-online/start/features/DNAblueprint.asp
• Creating and testing instances of the architecture:
  – Team led by Per Vonge Nielsen
  – Actually building and testing examples of the architecture with partners (sometimes known as MICE)
• Built a scalability "Megalab" run by Robert Barnes:
  – 1000-node cyber wall, 315 1U Compaq DL360s, 32 8-ways, 7000 disks


Clones and Packs aka Clustering
• Integrated the NLB and MSCS teams:
  – Both focused on scalability and availability
  – NLB for clones
  – MSCS for partitions/packs
• Vision: a single communications and group-membership infrastructure and a set of management tools for clones, partitions, and packs.
• Unify management for clones/partitions at BOTH the OS and app level (e.g. IIS, BizTalk, AppCenter, Yukon, Exchange, …).


Clustering in Whistler Server
• Microsoft Cluster Server:
  – Much improved setup and installation
  – 4-node support in Advanced Server
  – Kerberos support for virtual servers
  – Password change without restarting the cluster service
  – 8-node support in Datacenter
  – SAN enhancements (device reset, not bus reset, for disk arbitration; shared disk and boot disk on the same bus)
  – Quorum of nodes supported (no shared disk needed)
• Network Load Balancer:
  – New NLB manager
  – Bi-directional affinity for ISA as a proxy/firewall
  – Virtual cluster support (different port rules for each IP address)
  – Dual-NIC support


Geoclusters
• AKA geographically dispersed packs:
  – Essentially the nodes and storage are replicated at two sites; disks are remotely mirrored.
• Being deployed today; helping vendors get certified; we still need better tools.
• Working with:
  – EMC, Compaq, NSI Software, StorageApps
• Log shipping (SQL) and extended VLANs (IIS) are also solutions.


High Performance Computing

Last year (CY2000):
• This work is part of the server scale-out efforts (BLaing).
• Web site and HPC Tech Preview CD late last year:
  – A W2000 "Beowulf" equivalent with 3rd-party tools
• Better than the competition:
  – 10–25% faster than Linux on SMPs (2-, 4- & 8-ways)
  – More reliable than SP2 (!)
  – Better performance & integration with IBM peripherals (!)
• But it lacks an MPP debugger, tools, evangelism, reputation.
• See ../windows2000/hpc
• Also \\jcbach\public\cornell*

This year (CY2001):
• Partner with Cornell/MPI-Soft/+:
  – Unix-to-W2000 projects
  – Evangelism of commercial HPC (starting with financial services)
  – Showcase environment & apps (EBC support)
  – First Itanium FP "play-offs"
  – BIG tools integration / beta
• Dell & Compaq offer a web HPC buy-and-support experience (buy capacity by the slice).
• Beowulf-on-W2000 book by Tom Sterling (author of Beowulf on Linux).
• Gain on Sun in the www.top500.org list.
• Address the win-by-default assumption for Linux in HPC.

No vendor has succeeded in bringing MPP to non-sci/eng venues & $$$ … we will.


Appliances and Hardware Trends
• The appliances team under TomPh is focused on dramatically simplifying the user experience of installing these kinds of devices:
  – Working with OEMs to adopt WindowsXP
• Ultradense servers are on the horizon:
  – 100s of servers per rack
  – Manage the rack as one
• Infiniband and 10 Gbps Ethernet change things.


Operations and Management
• Great research work done in MSR on this topic:
  – The mega-services paper by Levi and Hunt
  – The follow-on BIG project developed the ideas of:
    • Scale-invariant service descriptions, with
    • automated monitoring and
    • deployment of servers
• Building on that work in the Windows Server group.
• AppCenter is doing similar things at the app level.


Still Need Good Ideas on…

• Automatic partitioning

• Stateful load balancing

• Unified management of clones/partitions at both app and OS level


Three Talks
• Scalability Terminology – Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this – Laing
• The M$ PetaByte (as time allows) – Gray


We're Building Petabyte Stores
• Soon everything can be recorded and indexed.
• Hotmail: 100 TB now.
• MSN: 100 TB now.
• List price is 800 M$/PB (including FC switches & brains).
• Must GeoPlex it.
• Can we get it for 1 M$/PB?
• Personal 1 TB stores for 1 k$.

[Chart: the storage ladder from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta, to Yotta, annotated with examples — a book and a photo (megabytes), a movie (gigabytes), all LoC books as words (terabytes), all books multimedia (petabytes), everything recorded (exabytes and beyond). Footnote, going down the scale: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli.]

[Chart: storage cost, EMC vs. Dell/3ware, scale 0–500.]

Building a Petabyte Store
• EMC: ~500 k$/TB = 500 M$/PB; plus FC switches plus … ≈ 800 M$/PB
• TPC-C SANs (Dell 18 GB/…): 62 M$/PB
• Dell local SCSI, 3ware: 20 M$/PB
• Do it yourself: 5 M$/PB
• "A billion here, a billion there, soon you're talking about real money!" (A sanity check of the arithmetic follows.)
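A quick check of the cost arithmetic above (figures from the slide; 1 PB = 1,000 TB):

```python
# Sanity check: $/TB -> M$/PB.
cost_per_tb = {
    "EMC (list)":            500_000,
    "TPC-C SAN (Dell)":       62_000,
    "Dell local SCSI/3ware":  20_000,
    "Do it yourself":          5_000,
}
for name, per_tb in cost_per_tb.items():
    print(f"{name:>24}: {per_tb * 1_000 / 1e6:5.0f} M$/PB")
# EMC at 500 M$/PB plus FC switches and "brains" is how the slide
# reaches the ~800 M$/PB list price.
```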


320 GB, 2 k$ (now): 6 M$/PB
• 4 x 80 GB IDE (2 hot-pluggable): 1,000$
• SCSI-IDE bridge: 200$
• Box (500 MHz CPU, 256 MB SRAM, fan, power, Enet): 500$
• Ethernet switch: 150$/port
• Or 8 disks/box: 640 GB for ~3 k$ (or 300 GB RAID)


Hot-Swap Drives for Archive or Data Interchange
• 25 MBps write (so can write N x 80 GB in 3 hours).
• 80 GB overnight = ~N x 2 MB/second, at 19.95$/night.
• Compare to 1$/GB via the Internet. (The back-of-envelope follows.)
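The back-of-envelope behind those numbers (assuming "overnight" is about 11 hours):

```python
# Shipping drives vs. the network.
drive_gb = 80
overnight_s = 11 * 3600
print(f"{drive_gb * 1000 / overnight_s:.1f} MB/s per shipped drive")  # ~2 MB/s
print(f"{19.95 / drive_gb:.2f} $/GB shipped")                         # ~0.25 $/GB
# Versus ~1 $/GB via the Internet: shipping N drives overnight delivers
# about N x 2 MB/s at a quarter of the network price.
```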


A Storage Brick
• 2 x 80 GB disks
• 500 MHz CPU (Intel / AMD / ARM)
• 256 MB RAM
• 2 eNet RJ45
• Fan(s)
• Current disk form factor
• 30 watts
• 600$ (?) per brick; rack of 48U – 3U/module – 16 units/U
Per rack: 400 disks, 200 Whistler nodes, 32 TB, 100 billion instructions per second, 120 k$/rack, 4 M$/PB.
Per petabyte (33 racks): 4 M$, 3 TeraOps (6,600 nodes), 13k disk arms (1/2 TBps IO). (The math is checked below.)
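The rack and petabyte figures check out; a short verification (the 33-rack count includes headroom over the bare ceiling of 1000/32 = 32 racks):

```python
# Verify the rack and petabyte figures from the slide.
nodes_per_rack = 200
disks_per_node, gb_per_disk = 2, 80
node_mhz, node_cost = 500, 600

disks_per_rack = nodes_per_rack * disks_per_node      # 400 disks
tb_per_rack = disks_per_rack * gb_per_disk / 1000     # 32 TB
bips_per_rack = nodes_per_rack * node_mhz / 1000      # ~100 BIPS
rack_k = nodes_per_rack * node_cost / 1000            # 120 k$

racks = 33   # slide allows a spare rack over ceil(1000 TB / 32 TB) = 32
print(disks_per_rack, tb_per_rack, bips_per_rack, rack_k)
print(f"{racks} racks: {racks * rack_k / 1000:.1f} M$/PB, "
      f"{racks * nodes_per_rack} nodes, {racks * disks_per_rack} disk arms")
# -> ~4.0 M$/PB, 6,600 nodes, 13,200 (~13k) disk arms
```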


What Software Do the Bricks Run?
• Each node has an OS.
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other:
  – COM+, SOAP, BizTalk
• Huge leverage in high-level interfaces.
• Same old distributed-system story. (A toy RPC sketch follows.)

[Diagram: two brick nodes, each running applications over a CLR with streams, datagrams, and RPC(?), connected by Infiniband / Gbps Ethernet.]
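A toy illustration of bricks as RPC peers, using Python's standard-library XML-RPC in place of COM+/SOAP/BizTalk (the get/put interface is invented for the sketch):

```python
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def serve_brick(port, store):
    # Each brick exposes its local resources behind a high-level interface;
    # peers are not fully trusted, so only named methods are reachable.
    server = SimpleXMLRPCServer(("0.0.0.0", port), allow_none=True)
    server.register_function(store.get, "get")
    server.register_function(store.put, "put")
    server.serve_forever()

# A peer brick talks to another node purely through RPC:
#   peer = ServerProxy("http://brick17:8000")
#   peer.put("key", "value"); print(peer.get("key"))
```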


Storage Rack in 2 Years?
• 300 arms
• 50 TB (160 GB/arm)
• 24 racks, 48 storage processors, 2x6+1 in rack
• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports = 500 MBps IO
• My suggestion: move the processors into the storage racks.


Auto-Manage Storage
• 1980 rule of thumb:
  – A DataAdmin per 10 GB; a SysAdmin per MIPS
• 2000 rule of thumb:
  – A DataAdmin per 5 TB
  – A SysAdmin per 100 clones (varies with app)
• Problem:
  – 5 TB is 60 k$ today, 10 k$ in a few years
  – Admin cost >> storage cost??? (see the check below)
• Challenge:
  – Automate ALL storage admin tasks
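A rough check of the "admin cost >> storage cost" worry (the salary figure is an assumption, not from the slide):

```python
# Admin cost vs. the storage that one admin manages.
admin_salary = 100_000                        # $/year, fully loaded (assumed)
storage_now, storage_soon = 60_000, 10_000    # $ per 5 TB, from the slide
print(f"today: admin is {admin_salary / storage_now:.1f}x the storage cost")
print(f"soon:  admin is {admin_salary / storage_soon:.0f}x the storage cost")
```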


It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?): a GeoPlex.
• Scrub it continuously (look for errors).
• On failure:
  – Use the other copy until the failure is repaired
  – Refresh the lost copy from the safe copy
• Can organize the two copies differently (e.g., one by time, one by space). A scrubbing sketch follows.
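One scrub pass over a mirrored pair can be sketched as follows (the block-store interface is hypothetical; a real scrubber would also throttle and checkpoint):

```python
import hashlib

def scrub(primary, mirror):
    """Compare checksums of two online copies; refresh any block whose
    copies disagree, using whichever copy matches its stored checksum."""
    for block_id in primary.blocks():
        a, b = primary.read(block_id), mirror.read(block_id)
        if hashlib.sha256(a).digest() == hashlib.sha256(b).digest():
            continue                      # copies agree; keep scanning
        # Decide which copy is safe, then refresh the other from it.
        if hashlib.sha256(a).digest() == primary.expected_hash(block_id):
            mirror.write(block_id, a)
        else:
            primary.write(block_id, b)
```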


Call To Action
• Let's work together to make storage bricks:
  – Low cost
  – High function
• NAS (network-attached storage), not SAN (storage area network).
• Ship NT8/CLR/IIS/SQL/Exchange/… with every disk drive.


Three Talks
• Scalability Terminology – Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this – Laing
• The M$ PetaByte (as time allows) – Gray


Cheap Storage
• Disks are getting cheap: 3 k$/TB (12 x 80 GB disks @ 250$ each).

[Charts: disk price vs. raw capacity, two snapshots, with IDE and SCSI fit lines ($ vs. GB). Earlier snapshot (0–60 GB): y = 5.7156x + 47.857 and y = 15.895x + 13.446; later snapshot (0–80 GB): y = 3.0635x + 40.542 and y = 13.322x − 1.4332. Companion panels plot raw k$/TB vs. disk size on a 0–40 scale. The fits are evaluated in the sketch below.]
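Evaluating those fits shows the marginal price gap; a small sketch (the IDE/SCSI assignment of the fit lines is inferred from the slopes):

```python
# Evaluate the later price fits ($ vs. GB); the slope is the marginal $/GB.
def ide(gb):  return 3.0635 * gb + 40.542
def scsi(gb): return 13.322 * gb - 1.4332

for gb in (20, 40, 80):
    print(f"{gb:3d} GB: IDE ~{ide(gb):4.0f}$ ({ide(gb) / gb:4.1f} k$/TB), "
          f"SCSI ~{scsi(gb):5.0f}$ ({scsi(gb) / gb:4.1f} k$/TB)")
# IDE's ~3 $/GB marginal price (~3.6 k$/TB at 80 GB) is the slide's
# "disks are getting cheap" point; SCSI runs ~13 $/GB.
```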


Tera Byte Backplane
All device controllers will be super-computers.
• TODAY:
  – Disk controller is a 10 MIPS RISC engine with 2 MB DRAM
  – NIC is similar power
• SOON:
  – Will become 100 MIPS systems with 100 MB DRAM
• They are nodes in a federation (could run Oracle on NT in the disk controller).
• Advantages:
  – Uniform programming model
  – Great tools
  – Security
  – Economics (CyberBricks)
  – Move computation to data (minimize traffic)
[Diagram: central processor & memory joined by a terabyte backplane to smart device controllers.]