Jim Gray Microsoft Research Gray@Microsoft research.Microsoft/~Gray 415 778 8222


Computers are Free, Now What?

Premise:
You're a Fortune 1,000 CIO
I’m a DB+OS guy selling CyberBricks

What can I say in an hour that you do not know?
How can I help you plan for CyberBricks?

Jim Gray

Microsoft Research

[email protected]

http://research.Microsoft.com/~Gray

415 778 8222


Outline

• Why cost per transaction dropped 100,000x in 10 years.

• How does that change things?

• What next (technology trends)

• Clusters of Hardware and Software CyberBricks


Systems 30 Years Ago

• MegaBuck per Mega Instruction Per Second (mips)

• MegaBuck per MegaByte

• Sys Admin & Data Admin per MegaBuck


Disks of 30 Years Ago

• 10 MB

• Failed every few weeks


1988: IBM DB2 + CICS Mainframe, 65 tps

• IBM 4391
• Simulated network of 800 clients
• 2 M$ computer
• Staff of 6 to do benchmark

[Figure labels: 2 x 3725 network controllers; 16 GB disk farm (4 x 8 x .5 GB); refrigerator-sized CPU]


1987: Tandem Mini @ 256 tps

• 14 M$ computer (Tandem)
• A dozen people (1.8 M$/y)
• False floor, 2 rooms of machines

[Figure labels: simulate 25,600 clients; 32 node processor array; 40 GB disk array (80 drives). Staff: OS expert, network expert, DB expert, performance expert, hardware experts, admin expert, auditor, manager]


1997: 9 years later, 1 Person and 1 box = 1250 tps

• 1 Breadbox ~ 5x 1987 machine room
• 23 GB is hand-held
• One person does all the work
• Cost/tps is 100,000x less: 5 micro-dollars per transaction

[Figure labels: 4 x 200 MHz cpu, 1/2 GB DRAM, 12 x 4 GB disk; 3 x 7 x 4 GB disk arrays. One person is the hardware expert, OS expert, net expert, DB expert, and app expert]


Cost Per Transaction

• Industry uses $/tps (or $/tpm): 5 year cost of hardware and software to get 1 tps.
• There are about 1 Million seconds in 3 years
• So, if $/tps is 1$, $/t is 1 micro-dollar.
• 1988: mini: 50 K$/tps, mainframe: 150 K$/tps
– 5 cents to 15 cents per transaction
• 1998: micro: 30 $/tpmC = 50¢/tpsC
– 5 micro-dollars per transaction
– Note: it is actually 6x less than this, since tpcC is 6x tpcA
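The conversion from $/tps to dollars per transaction can be sketched as a one-line division. This is only an illustration of the slide's rule of thumb: the divisor of one million transactions per tps of capacity is the slide's own round figure, kept as a parameter so a different amortization window can be substituted, and the function name is hypothetical.

```python
# Assumption: the slide's round figure, under which 1 $/tps corresponds to
# 1 micro-dollar per transaction (about a million transactions per tps bought).
SLIDE_TRANSACTIONS_PER_TPS = 1_000_000


def dollars_per_transaction(dollars_per_tps,
                            transactions_per_tps=SLIDE_TRANSACTIONS_PER_TPS):
    """Spread the price of 1 tps of capacity over the transactions it runs."""
    return dollars_per_tps / transactions_per_tps


# 1988 figures quoted on the slide.
mini_1988 = dollars_per_transaction(50_000)        # dollars per transaction
mainframe_1988 = dollars_per_transaction(150_000)  # dollars per transaction

print(f"mini: {mini_1988 * 100:.0f} cents per transaction")
print(f"mainframe: {mainframe_1988 * 100:.0f} cents per transaction")
```

Under that rule the 1988 mini and mainframe prices land at the 5 cents and 15 cents quoted above.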


UNIX vs WindowsNT

• Solaris on SPARC ranges from 11,559 tpmC @ 57 $/tpmC (Sybase) to 51,871 tpmC @ 135 $/tpmC (Oracle)
• SQL on NT/Compaq ranges from 11,748 tpmC @ 27 $/tpmC to 18,129 tpmC @ 27 $/tpmC
• NT price per transaction is 2x to 4x less; peak performance per node is 3x less.
• Markup is in Oracle and SPARC (disk and DRAM prices OK).
• Note: current NT prices are 27 $/tpmC, not 33 $/tpmC, so 23% lower than shown
• UNIX is 5x less than MVS according to David Matthews, “Large Server TCO: The UNIX advantage”, Unix Review Feb 1998 Reseller Supplement, pp 3-11

[Chart: TPC Price/tpmC, broken down into processor, disk, software, net, and total/10, for Sun Oracle (52 k tpmC @ 134 $/tpmC) and HP + NT4 + SQL Server (16.2 k tpmC @ 33 $/tpmC)]


What Happened? Where did the 100,000x come from?

• Moore’s law: 100X (at most)
• Software improvements: 10X (at most)
• Commodity Pricing: 100X (at least)
• Total 100,000X
• 100x from commodity
– DBMS was 100 K$ to start: now 1 k$ to start
– IBM 390 MIPS is 7.5 K$ today
– Intel MIPS is 10$ today
– Commodity disk is 50 $/GB vs 1,500 $/GB
– ...

[Figure: price vs. time curves for mainframe, mini, and micro]
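As a quick check of the decomposition, the three factors multiply out to the headline number, and the commodity examples can be turned into ratios. This is a small sketch using only the figures on the slide; the dictionary names are illustrative.

```python
import math

# Improvement factors claimed on the slide.
factors = {
    "Moore's law": 100,
    "software improvements": 10,
    "commodity pricing": 100,
}
total = math.prod(factors.values())

# Price ratios behind the "100x from commodity" claim (old price / commodity price).
commodity_ratios = {
    "DBMS entry price": 100_000 / 1_000,
    "MIPS (IBM 390 vs Intel)": 7_500 / 10,
    "disk $/GB": 1_500 / 50,
}

print(f"total improvement: {total:,}x")
for item, ratio in commodity_ratios.items():
    print(f"{item}: {ratio:.0f}x")
```

The individual ratios range from 30x (disk) to 750x (MIPS), so "100X (at least)" reads as a rough midpoint rather than a measured value.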


Outline

• Why cost per transaction has dropped 100,000x in 10 years.

• How does that change things?

• What next (technology trends)

• Clusters of Hardware and Software CyberBricks


What does 1 μ$/t Mean?

• Human Attention is the precious resource.
• Content is the precious resource
• Impressions (eyeballs) sell for 10,000 μ$ to 100,000 μ$
• All cost (and value) is in content and admin.
• Aside: this month, the TerraServer got 400 M hits, 40 M impressions: a 2 M$/mo asset (for satellite photos).
• That’s why everyone is hot on portals.


Administration Costs

• Vendor Rule of thumb (1970s mainframe)
– one systems programmer per MIPS
– one data admin per 10 GB
• DataCenter Rule of thumb:
– Hardware & Facilities is 40%
– Labor is 60%
– => 100 sys pgmrs and 1 data admin per laptop!
• 1995 Federal study of their data centers
– 1 to 3 MIPS per admin! (http://research.microsoft.com/~gray/NC_Servers.doc)
• Thin client:
– move admin to server
– claim: save admin costs
– reality: move admin costs to expensive fixed staff
– Time will tell.


Content Costs

• For most web sites
– Most staff are doing content
– Admin is small fraction of content
• RULE OF THUMB:
– Hardware/software/facilities/admin is 10% of content
– Content is 90% of cost
– This seems to apply to microsoft.com, msn, WebTV, HotMail, Inktomi
• MAIN CONCLUSION
– Hardware, software, admin is in micro$/t range
– Unix and mainframes are 2x or 10x more micro$
– Who cares? Cost is in content
– Look for content creation/management tools


Legacy Latency: a personal tale

• 1970s helped company X convert to IMS/Fast Path
• 1980s helped company X experiment with Tandem mini-computers
• 1990s visit and ask:
– Why are you still buying those mainframes?
• Answers:
1. They are up all the time (99.99% up).
2. 25 years ago ROI was 18 months, now it is 1 week.
3. A rewrite would cost more than it would ever save.
4. My career would not survive a rewrite.
5. The devil you know is better than the devil you don’t.


Put Another Way

• You are ATT or the airline industry or... You do 300 M transactions/day
• The capital cost of these transactions is
– 300 $/day on NT
– 1,000 $/day on Solaris
– 10,000 $/day on MVS
• Who cares? Revenue and costs are 200,000,000 $/day. So, transaction cost is .01% or .0001%.
• But, if productivity is higher on Solaris or NT… Or if tools exist on them, then… Or if cost of 2nd or 3rd environment is huge (staff), then...
• New apps should not go on MVS!
• Investing in SNA? Investing in IMS? Investing in TPF?..
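The "who cares?" argument is a pair of divisions: cost per transaction, and capital cost as a share of daily revenue. A minimal sketch with the slide's figures follows; the function name `summarize` is hypothetical.

```python
TRANSACTIONS_PER_DAY = 300_000_000
REVENUE_PER_DAY = 200_000_000  # dollars

# Capital cost of the transaction load, in dollars per day, from the slide.
capital_cost_per_day = {"NT": 300, "Solaris": 1_000, "MVS": 10_000}


def summarize(platform):
    """Return (micro-dollars per transaction, percent of daily revenue)."""
    cost = capital_cost_per_day[platform]
    micro_dollars_per_txn = cost / TRANSACTIONS_PER_DAY * 1e6
    percent_of_revenue = cost / REVENUE_PER_DAY * 100
    return micro_dollars_per_txn, percent_of_revenue


for name in capital_cost_per_day:
    micro, percent = summarize(name)
    print(f"{name}: {micro:.1f} micro-dollars/transaction, {percent:.5f}% of revenue")
```

Even the most expensive platform stays at a few thousandths of a percent of revenue, which is the point of the slide: the platform decision should turn on productivity, tools, and staffing rather than on transaction cost.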


What Happens Next

• Last 10 years: 100,000x improvement

• Next 10 years: ????

• Today: text and image servers are free
– 25 $/hit => advertising pays for them

• Future: video, audio, … servers are free

“You ain’t seen nothing yet!”

[Figure: performance vs. time, 1985 to 2005]


And So...

• Traditional transaction processing is a zero-billion dollar industry
• Growth is in new apps

[Figure: 2 x 2 grid of applications]
Immediate, point-to-point: conversation, money
Immediate, broadcast: lecture, concert
Time-shifted, point-to-point: mail
Time-shifted, broadcast: book, newspaper
Network (immediate) and DataBase (time-shifted)

• It’s ALL going electronic
• Immediate is being stored for analysis (so ALL database)
• Analysis & Automatic Processing are being added


Why Put Everything in Cyberspace?

• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots

[Figure labels: Point-to-Point OR Broadcast; Immediate OR Time Delayed; Network; DataBase: Locate, Process, Analyze, Summarize]


Some Tera-Byte Databases

• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Many 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
– 15 PB by 2007
• Federal Clearing house: images of checks
– 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
– 10 Exabytes (???!!)

[Figure: size scale Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]


[Figure: information sizes placed on a Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta scale. Items shown: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]


Michael Lesk’s Points
www.lesk.com/mlesk/ksg97/ksg.html

• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious Resource: Human attention
• Auto-Summarization and Auto-Search will be a key enabling technology.


Outline

• Why cost per transaction has dropped 100,000x in 10 years.

• How does that change things?

• What next (technology trends)

• Clusters of Hardware and Software CyberBricks


Technology (hardware)

NOW
• CPU: nearing 1 BIPS
– but CPI rising fast (2-10) so less than 100 mips
– 1 $/mips to 10 $/mips
• DRAM: 3 $/MB
• DISK: 30 $/GB
• TAPE:
– 20 GB/tape, 6 MBps
– Lags disk
– 2 $/GB offline, 15 $/GB nearline

2003 Forecast (10x better)
• CPU: 1 BIPS real (smp)
– 0.1 $/mips to 1 $/mips
• DRAM: 1 Gb chip
– 0.1 $/MB
• Disk:
– 10 GB smart cards
– 500 GB RAID packs (NT inside)
– 3 $/GB
• Tape
– ?


System On A Chip

• Integrate Processing with memory on one chip
– chip is 75% memory now
– 1 MB cache >> 1960 supercomputers
– 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM, … projects abound
• Integrate Networking with processing on one chip
– system bus is a kind of network
– ATM, FiberChannel, Ethernet, … logic on chip.
– Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.


Thesis: Many little beat few big

[Figure: machine classes shrinking over time]
• Price: $1 million, $100 K, $10 K
• Class: Mainframe, Mini, Micro, Nano
• Disk form factor: 14", 9", 5.25", 3.5", 2.5", 1.8"
• Pico Processor: 1 mm³, 1 M SPECmarks, 1 TFLOP
• Memory hierarchy: 10 pico-second ram (1 MB), 10 nano-second ram (100 MB), 10 microsecond ram (10 GB), 10 millisecond disc (1 TB), 10 second tape archive (100 TB)
• 10^6 clocks to bulk ram
• Event-horizon on chip
• VM reincarnated
• Multi-program cache, On-Chip SMP

• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?


Storage Latency: How Far Away is the Data?

Clock ticks to reach each level, with a human-scale analogy:
• Registers: 1 (My Head, 1 min)
• On Chip Cache: 2 (This Room)
• On Board Cache: 10 (This Campus, 10 min)
• Memory: 100 (Sacramento, 1.5 hr)
• Disk: 10^6 (Pluto, 2 Years)
• Tape / Optical Robot: 10^9 (Andromeda, 2,000 Years)
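The analogy works by scaling one clock tick to one minute of human time. The sketch below reads the flattened figure as registers 1, on-chip cache 2, on-board cache 10, memory 100, disk 10^6, and tape 10^9 clock ticks; that pairing is an interpretation of the figure, and the helper name is hypothetical.

```python
# Assumed reading of the figure: latency of each storage level in clock ticks.
LATENCY_IN_CLOCKS = {
    "registers": 1,
    "on-chip cache": 2,
    "on-board cache": 10,
    "memory": 100,
    "disk": 10**6,
    "tape / optical robot": 10**9,
}

MINUTES_PER_YEAR = 60 * 24 * 365


def human_scale(clocks):
    """Map clock ticks to human time at a scale of 1 tick = 1 minute."""
    minutes = clocks
    if minutes < 60:
        return f"{minutes} min"
    if minutes < MINUTES_PER_YEAR:
        return f"{minutes / 60:.1f} hr"
    return f"{minutes / MINUTES_PER_YEAR:.1f} years"


for level, clocks in LATENCY_IN_CLOCKS.items():
    print(f"{level}: {clocks} clocks ~ {human_scale(clocks)}")
```

At this scale memory is a drive of well over an hour, disk is about two years away, and tape is roughly two millennia away, which matches the Sacramento, Pluto, and Andromeda labels to within rounding.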


Gilder’s Telecosm Law: 3x bandwidth/year for 25 more years

• Today:
– 10 Gbps per channel
– 4 channels per fiber: 40 Gbps
– 32 fibers/bundle = 1.2 Tbps/bundle
• In lab 3 Tbps/fiber (400 x WDM)
• In theory 25 Tbps per fiber
• 1 Tbps = USA 1996 WAN bisection bandwidth

[Figure label: 1 fiber = 25 Tbps]
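The bundle figure and the growth claim are both simple arithmetic. The sketch below multiplies out the per-channel numbers and, assuming the 3x/year rate holds, counts how many years separate today's fiber from the theoretical 25 Tbps; the function name is hypothetical.

```python
GBPS_PER_CHANNEL = 10
CHANNELS_PER_FIBER = 4
FIBERS_PER_BUNDLE = 32

fiber_gbps = GBPS_PER_CHANNEL * CHANNELS_PER_FIBER   # bandwidth of one fiber
bundle_tbps = fiber_gbps * FIBERS_PER_BUNDLE / 1000  # bandwidth of one bundle


def years_to_reach(target_gbps, start_gbps, growth_per_year=3):
    """Whole years of compound growth needed to go from start to target."""
    years = 0
    rate = start_gbps
    while rate < target_gbps:
        rate *= growth_per_year
        years += 1
    return years


print(f"per fiber: {fiber_gbps} Gbps, per bundle: {bundle_tbps:.2f} Tbps")
print(f"years at 3x/year to reach 25 Tbps per fiber: {years_to_reach(25_000, fiber_gbps)}")
```

The product comes to 1.28 Tbps per bundle, which the slide rounds to 1.2 Tbps.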


Networking: BIG!! Changes coming!

• Technology
– 10 GBps bus “now”
– 1 Gbps links “now”
– 1 Tbps links in 10 years
– Fast & cheap switches
• Standard interconnects
– processor-processor
– processor-device (=processor)
• Deregulation WILL work someday
• CHALLENGE
– reduce software tax on messages
– Today: 30 K ins + 10 ins/byte
– Goal: 1 K ins + .01 ins/byte
• Best bet:
– SAN/VIA
– Smart NICs
– Special protocol
– User-Level Net IO (like disk)
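The "software tax" on a message is a fixed instruction cost plus a per-byte cost. A minimal sketch of that cost model with the slide's two sets of coefficients follows; the 8 KB message size is an arbitrary example and the function name is hypothetical.

```python
def instructions_per_message(nbytes, fixed_ins, ins_per_byte):
    """CPU instructions spent in software to send one message of nbytes."""
    return fixed_ins + ins_per_byte * nbytes


TODAY = {"fixed_ins": 30_000, "ins_per_byte": 10}
GOAL = {"fixed_ins": 1_000, "ins_per_byte": 0.01}

message_bytes = 8 * 1024  # example size, not from the slide
today_cost = instructions_per_message(message_bytes, **TODAY)
goal_cost = instructions_per_message(message_bytes, **GOAL)

print(f"today: {today_cost:,.0f} instructions")
print(f"goal:  {goal_cost:,.0f} instructions")
print(f"reduction: {today_cost / goal_cost:.0f}x")
```

For small messages the fixed cost dominates (a 30x gap); for large messages the per-byte cost dominates (up to a 1,000x gap), which is why the goal attacks both terms.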


What if Networking Was as Cheap As Disk IO?

• TCP/IP
– Unix/NT: 100% cpu @ 40 MBps
• Disk
– Unix/NT: 8% cpu @ 40 MBps

Why the Difference?
• Host Bus Adapter does SCSI packetizing, checksum, … flow control, DMA
• Host does TCP/IP packetizing, checksum, … flow control, small buffers


The Promise of SAN/VIA: 10x better in 2 years

• Today:
– wires are 10 MBps (100 Mbps Ethernet)
– ~20 MBps tcp/ip saturates 2 cpus
– round-trip latency is ~300 us
• In two years
– wires are 100 MBps (1 Gbps Ethernet, ServerNet, …)
– tcp/ip ~ 100 MBps, 10% of each processor
– round-trip latency is 20 us
• Works in lab today; uses Winsock2 api. See http://www.viarch.org/

[Chart: Bandwidth, Latency, Overhead; Now vs. Soon]


SAN: Standard Interconnect

• LAN faster than memory bus?
• 1 GBps links in lab.
• 100$ port cost soon
• Port is computer

[Figure: bandwidth ladder]
• Gbps Ethernet: 110 MBps
• PCI: 70 MBps
• UW SCSI: 40 MBps
• FW SCSI: 20 MBps
• SCSI: 5 MBps

RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?


Data Gravity: Processing Moves to Transducers

• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition) & cameras (vision)
– Storage: Data storage and analysis


CyberBricks: Functionally Specialized Cards

• Storage
• Network
• Display

[Figure: each card is an ASIC plus a P mips processor and M MB DRAM]
Today: P = 20 mips, M = 2 MB
In a few years: P = 200 mips, M = 64 MB


With Tera Byte Interconnect and Super Computer Adapters

• Processing is incidental to
– Networking
– Storage
– UI
• Disk Controller/NIC is
– faster than device
– close to device
– Can borrow device package & power
• So use idle capacity for computation.
• Run app in device.

[Figure label: Tera Byte Backplane]


All Device Controllers will be Cray 1’s

• TODAY
– Disk controller is 10 mips risc engine with 2 MB DRAM
– NIC is similar power
• SOON
– Will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation (can run Oracle on NT in disk controller).
• Advantages
– Uniform programming model
– Great tools
– Security
– Economics (CyberBricks)
– Move computation to data (minimize traffic)

[Figure labels: Tera Byte Backplane; Central Processor & Memory]


It’s Already True of Printers: Peripheral = CyberBrick

• You buy a printer
• You get
– several network interfaces
– a Postscript engine
  • cpu,
  • memory,
  • software,
  • a spooler (soon)
– and… a print engine.


Disk = Node

• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment

[Figure: software stack on the disk: OS Kernel; SAN driver, Disk driver; File System, RPC, ...; Services, DBMS; Applications]


Outline

• Why cost per transaction has dropped 100,000x in 10 years.

• How does that change things?

• What next (technology trends): CyberBricks

• Clusters of Hardware and Software CyberBricks


All God’s Children Have Clusters! Buying Computing By the Slice

• People are buying computers by the dozens
– Computers only cost 1 k$/slice!
• Clustering them together


A cluster is a cluster is a cluster

• It’s so natural, even mainframes cluster!
• Looking closer at usage patterns, a few models emerge
• Looking closer at sites, you see
– hierarchies
– bunches
– functional specialization


“Commercial” NT Clusters

• 16-node Tandem Cluster
– 64 cpus
– 2 TB of disk
– Decision support
• 45-node Compaq Cluster
– 140 cpus
– 14 GB DRAM
– 4 TB RAID disk
– OLTP (Debit Credit)
  • 1 B tpd (14 k tps)


Tandem Oracle/NT

• 27,383 tpmC

• 71.50 $/tpmC

• 4 x 6 cpus

• 384 disks=2.7 TB


Microsoft.com: ~150x4 nodes

[Diagram: The Microsoft.Com Site (MidYear98a.vsd, 12/15/97). Server clusters (www.microsoft.com, home.microsoft.com, search.microsoft.com, premium.microsoft.com, register.microsoft.com, support.microsoft.com, activex.microsoft.com, cdm.microsoft.com, FTP.microsoft.com, msid.msn.com, register.msn.com) sit on four FDDI rings (MIS1 to MIS4) and switched Ethernet, with primary and secondary Gigaswitches and routers. Other groups shown: live SQL Servers, SQL consolidators, SQL reporting, log processing, DMZ and IDC staging servers, FTP and HTTP download servers with download replication, internal WWW, MOSWest admin LAN, SQLNet feeder LAN. Remote sites: European Data Center and Japan Data Center. Internet links: 13 DS3 (45 Mb/sec each), 2 OC3 (100 Mb/sec each), 2 Ethernet (100 Mb/sec each). All servers in Building 11 are accessible from corpnet. Average node configurations: 4xP6 (some 4xP5), 256 MB to 1 GB RAM, 12 GB to 180 GB disk, average cost $24K to $128K.]


The Microsoft TerraServer Hardware

• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks Disks
– 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0


HotMail: ~400 Computers

[Diagram labels: 200 MBps Internet link; Cisco Catalyst 5000 Enet Switch; Local 10 Mbps Switched Ethernet; LocalDirector (x4)]
• Front Door (P-200, 128 MB): 140 + 10/mo, FreeBSD/Apache
• Graphics: 15 x P6, FreeBSD/Hotmail
• Ad: 10 x P6, FreeBSD/Apache
• Incoming Mail: 25 x P-200, FreeBSD/hm-SMTP
• Security: 2 x P200, FreeBSD
• Member Dir
• U Store: E3k, xx MB, 384 GB RAID5 + DLT tape robot, Solaris/HMNNFS; 50 machines, many old; 13 + 1.5/mo; 1 per million users
• Ad Pacer: 3 P6, FreeBSD
• M Serv (SPARC Ultra-1, ?? MB): 4 replicas, Solaris
• Telnet Maintenance Interface


Inktomi (hotbot), WebTV: > 200 nodes

• Inktomi: ~250 UltraSparcs
– web crawl
– index crawled web and save index
– Return search results on demand
– Track Ads and click-thrus
– ACID vs BASE (basic Availability, Serialized Eventually)
• Web TV
– ~200 UltraSparcs
  • Render pages, Provide Email
– ~4 Network Appliance NFS file servers
– A large Oracle app tracking customers


Loki: Pentium Clusters for Science
http://loki-www.lanl.gov/

16 Pentium Pro Processors
x 5 Fast Ethernet interfaces
+ 2 Gbytes RAM
+ 50 Gbytes Disk
+ 2 Fast Ethernet switches
+ Linux
= 1.2 real Gflops for $63,000 (but that is the 1996 price)

Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html

• Scientists want cheap mips.


Your Tax Dollars At Work: ASCI for Stockpile Stewardship

• Intel/Sandia: 9000x1 node Ppro
• LLNL/IBM: 512x8 PowerPC (SP2)
• LNL/Cray: ?
• Maui Supercomputer Center
– 512x1 SP2


Berkeley NOW (network of workstations) Project
http://now.cs.berkeley.edu/

• 105 nodes
– Sun UltraSparc 170, 128 MB, 2 x 2 GB disk
– Myrinet interconnect (2 x 160 MBps per node)
– SBus (30 MBps) limited

• GLUNIX layer above Solaris

• Inktomi (HotBot search)

• NAS Parallel Benchmarks

• Crypto cracker

• Sort 9 GB per second


Wisconsin COW

• 40 UltraSparcs, 64 MB + 2 x 2 GB disk + Myrinet
• SUN OS
• Used as a compute engine


Andrew Chien’s JBOB
http://www-csag.cs.uiuc.edu/individual/achien.html

• 48 nodes
• 36 HP 2PIIx128 1 disk Kayak boxes
• 10 Compaq 2PIIx128 1 disk, Wkstation 6000
• 32-Myrinet & 16-ServerNet connected
• Operational
• All running NT


NCSA Super Cluster

• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A Super Computer for 3 M$
• Classic Fortran/MPI programming
• DCOM programming model

http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html


1.2 B tpd

• 1 B tpd ran for 24 hrs.
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• 20x smaller than Microsoft Internet Data Center (amazing!)
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction


Scalability

• 1 billion transactions
• 1.8 million mail messages
• 4 terabytes of data
• 100 million web hits

• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes


The Bricks of Cyberspace: 4 B PC’s (1 Bips, .1 GB dram, 10 GB disk, 1 Gbps Net, B=G)

• Cost 1,000 $

• Come with

– NT

– DBMS

– High speed Net

– System management

– GUI / OOUI

– Tools

• Compatible with everyone else

• CyberBricks


Super Server: 4T Machine

• Array of 1,000 4B machines
– 1 Bips processors
– 1 BB DRAM
– 10 BB disks
– 1 Bbps comm lines
– 1 TB tape robot
• A few megabucks
• Challenge:
– Manageability
– Programmability
– Security
– Availability
– Scalability
– Affordability
• As easy as a single system

Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work

[Figure: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc]


Cluster Vision: Buying Computers by the Slice

• Rack & Stack
– Mail-order components
– Plug them into the cluster
• Modular growth without limits
– Grow by adding small modules
• Fault tolerance:
– Spare modules mask failures
• Parallel execution & data search
– Use multiple processors and disks
• Clients and servers made from the same stuff
– Inexpensive: built with commodity CyberBricks


Nostalgia: Behemoth in the Basement

• Today’s PC is yesterday’s supercomputer
• Can use LOTS of them
• Main Apps changed:
– scientific, commercial, web
– Web & Transaction servers
– Data Mining, Web Farming


SMP -> nUMA: BIG FAT SERVERS

• Directory based caching lets you build large SMPs
• Every vendor building a HUGE SMP
– 256 way
– 3x slower remote memory
– 8-level memory hierarchy
  • L1, L2 cache
  • DRAM
  • remote DRAM (3, 6, 9, …)
  • Disk cache
  • Disk
  • Tape cache
  • Tape
• Needs
– 64 bit addressing
– nUMA sensitive OS
  • (not clear who will do it)
• Or Hypervisor
– like IBM LSF,
– Stanford Disco: www-flash.stanford.edu/Hive/papers.html
• You get an expensive cluster-in-a-box with very fast network


Great Debate: Shared What?

• Shared Memory (SMP): easy to program, difficult to build, difficult to scale. Examples: SGI, Sun, Sequent
• Shared Disk. Examples: VMScluster, Sysplex
• Shared Nothing (network): hard to program, easy to build, easy to scale. Examples: Tandem, Teradata, SP2, NT

NUMA blurs distinction, but has its own problems

[Figure: CLIENTS attached to each of the three architectures]


Technology Drivers

Plug & Play Software

• RPC is standardizing: (DCOM, IIOP, HTTP)
– Gives huge TOOL LEVERAGE
– Solves the hard problems for you:
  • naming,
  • security,
  • directory service,
  • operations, ...
• Commoditized programming environments
– FreeBSD, Linux, Solaris, … + tools
– NetWare + tools
– WinCE, WinNT, … + tools
– JavaOS + tools
• Apps gravitate to data.
• General purpose OS on controller runs apps.


Restatement

The huge clusters we saw are prototypes for CyberBrick systems:
• A Federation of Functionally specialized nodes
• Each node shrinks to a “point” device with embedded processing.
• Each node / device is autonomous
• Each talks a high-level protocol


Outline

• Clusters of Hardware CyberBricks
– all nodes are very intelligent
– Processing migrates to where the power is
  • Disk, network, display controllers have full-blown OS
  • Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
  • Computer is a federated distributed system.
• Software CyberBricks
– standard way to interconnect intelligent nodes
– needs execution model
– needs parallelism


Software CyberBricks: Objects!

• It’s a zoo
• Objects and 3-tier computing (transactions)
– Give natural distribution & parallelism
– Give remote management!
– TP & Web: Dispatch RPCs to pool of object servers
– Components are a 1 B$ business today!
• Need a Parallel & distributed computing model


The COMponent Promise

• Objects are Software CyberBricks
– productivity breakthrough (plug ins)
– manageability breakthrough (modules)
• Microsoft: DCOM + ActiveX
• IBM/Sun/Oracle/Netscape: CORBA + Java Beans
• Both promise
– parallel distributed execution
– centralized management of distributed system

Both camps share key goals:
• Encapsulation: hide implementation
• Polymorphism: generic ops, key to GUI and reuse
• Uniform Naming
• Discovery: finding a service
• Fault handling: transactions
• Versioning: allow upgrades
• Transparency: local/remote
• Security: who has authority
• Shrink-wrap: minimal inheritance
• Automation: easy


The OO Points So Far

• Objects are software Cyber Bricks

• Object interconnect standards are emerging

• Cyber Bricks become Federated Systems.

• Put processing close to data

• Next point:
– do parallel processing.


Kinds of Parallel Execution

• Pipeline: the output of one sequential program feeds the next (Any Sequential Program → Any Sequential Program)
• Partition: outputs split N ways, inputs merge M ways; each partition runs Any Sequential Program
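Both kinds of parallelism reuse ordinary sequential programs unchanged; only the plumbing around them differs. The sketch below shows the two dataflow shapes with Python generators. It illustrates the structure only (generators interleave rather than run on separate processors), and the stage names `parse` and `square` are hypothetical stand-ins for "any sequential program".

```python
from itertools import chain


def parse(lines):
    """Stage 1: an ordinary sequential program (text to integers)."""
    for line in lines:
        yield int(line)


def square(numbers):
    """Stage 2: another ordinary sequential program."""
    for n in numbers:
        yield n * n


def pipeline(lines):
    """Pipeline: stage 2 consumes each item as soon as stage 1 produces it."""
    return square(parse(lines))


def partitioned(lines, ways):
    """Partition: split the input N ways, run the same program on each part,
    then merge the outputs into one stream."""
    parts = [lines[i::ways] for i in range(ways)]
    return chain.from_iterable(pipeline(part) for part in parts)


data = ["1", "2", "3", "4"]
print(list(pipeline(data)))
print(sorted(partitioned(data, ways=2)))
```

In a real system each stage or partition would run on its own processor or node; the code that does the work would stay the same.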


Object Oriented Programming: Parallelism From Many Little Jobs

• Gives location transparency
• ORB/web/tpmon multiplexes clients to servers
• Enables distribution
• Exploits embarrassingly parallel apps (transactions)
• HTTP and RPC (dcom, corba, rmi, iiop, …) are basis

[Figure label: Tp mon / orb / web server]


Why Parallel Access To Data?

• At 10 MB/s: 1.2 days to scan 1 Terabyte
• 1,000 x parallel: 100 second SCAN of 1 Terabyte

Parallelism: divide a big problem into many smaller ones to be solved in parallel.

[Figure label: BANDWIDTH]
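The scan times follow directly from size divided by aggregate bandwidth. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even partitioning; the function name is hypothetical.

```python
TERABYTE = 10**12            # bytes, decimal units assumed
SCAN_RATE = 10 * 10**6       # bytes per second for one disk (10 MB/s)
SECONDS_PER_DAY = 24 * 60 * 60


def scan_seconds(size_bytes, bytes_per_second, ways=1):
    """Time to scan the data when it is spread evenly over `ways` disks."""
    return size_bytes / (bytes_per_second * ways)


serial = scan_seconds(TERABYTE, SCAN_RATE)
parallel = scan_seconds(TERABYTE, SCAN_RATE, ways=1000)

print(f"serial scan: {serial:,.0f} s = {serial / SECONDS_PER_DAY:.2f} days")
print(f"1,000-way parallel scan: {parallel:,.0f} s")
```

The serial scan takes 100,000 seconds, a little over a day, while the 1,000-way scan takes 100 seconds.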


Partitioned Execution

• Spreads computation and IO among processors
• Partitioned data gives NATURAL parallelism

[Figure: a Table partitioned by key range (A...E, F...J, K...N, O...S, T...Z); each partition runs its own Count, and a final Count combines the partial results]
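The partitioned count in the figure can be sketched as: range-partition the rows by first letter, count each partition independently, then add up the partial counts. The code below is an illustration only; a thread pool stands in for the separate processors, the helper names are hypothetical, and rows whose first character falls outside A to Z are simply skipped.

```python
from concurrent.futures import ThreadPoolExecutor

# Key ranges taken from the figure.
PARTITIONS = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]


def partition_table(rows):
    """Assign each row to the key range that contains its first letter."""
    parts = [[] for _ in PARTITIONS]
    for row in rows:
        first = row[0].upper()
        for i, (low, high) in enumerate(PARTITIONS):
            if low <= first <= high:
                parts[i].append(row)
                break
    return parts


def parallel_count(rows):
    """Count every partition independently, then combine the partial counts."""
    parts = partition_table(rows)
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_counts = list(pool.map(len, parts))
    return sum(partial_counts), partial_counts


names = ["Alice", "Bob", "Frank", "Kim", "Oscar", "Tina", "Zed"]
total, per_partition = parallel_count(names)
print(total, per_partition)
```

Because each partition is counted without looking at any other, the only coordination is the final sum, which is what makes the parallelism "natural".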


N x M way Parallelism

• N inputs, M outputs, no bottlenecks.
• Partitioned Data
• Partitioned and Pipelined Data Flows

[Figure: five partitions (A...E, F...J, K...N, O...S, T...Z), each with its own Sort and Join, feeding three Merge operators]
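An N x M flow can be sketched with a sort: each of N producers sorts its own partition and splits the result M ways by key range, and each of M consumers merges the N sorted runs it receives. This is a simplified illustration (the Join step in the figure is omitted, threads stand in for nodes, and the function names and range boundaries are hypothetical).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_by_range(sorted_run, boundaries):
    """Split one sorted run into len(boundaries) + 1 sorted sub-runs."""
    outputs = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted_run:
        index = sum(key >= b for b in boundaries)  # range that holds this key
        outputs[index].append(key)
    return outputs


def n_by_m_sort(input_partitions, boundaries):
    """N producers sort and split M ways; M consumers each merge N runs."""
    m = len(boundaries) + 1
    with ThreadPoolExecutor() as pool:
        split_runs = list(pool.map(
            lambda part: split_by_range(sorted(part), boundaries),
            input_partitions))
        merged = list(pool.map(
            lambda j: list(heapq.merge(*(runs[j] for runs in split_runs))),
            range(m)))
    return merged


result = n_by_m_sort([[5, 1, 9], [7, 3, 2], [8, 4, 6]], boundaries=[4, 7])
print(result)
```

No single operator sees all the data: every producer and every consumer handles only its own slice, which is the "no bottlenecks" property the slide claims.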


Summary

• Clusters of Hardware CyberBricks
– all nodes are very intelligent
– Processing migrates to where the power is
  • Disk, network, display controllers have full-blown OS
  • Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
  • Computer is a federated distributed system.
• Software CyberBricks
– standard way to interconnect intelligent nodes
– needs execution model
– needs parallelism


Summary

• Why cost per transaction has dropped 100,000x in 12 years.

• How does that change things?

• What next (technology trends)

• Hardware and Software CyberBricks


What I’m Doing

• TerraServer: Photo of the planet on the web
– a database (not a file system)
– 1 TB now, 15 PB in 10 years
– http://www.TerraServer.microsoft.com/
• Sloan Digital Sky Survey: picture of the universe
– just getting started, cyberbricks for astronomers
– http://www.sdss.org/
• Sorting:
– one node pennysort (http://research.microsoft.com/barc/SortBenchmark/)
– multinode: NT Cluster sort (shows off SAN and DCOM)


What I’m Doing

• NT Clusters:
– failover: Fault tolerance within a cluster
– NT Cluster Sort: balanced IO, cpu, network benchmark
– AlwaysUp: Geographical fault tolerance.
• RAGS: random testing of SQL systems
– a bug finder
• Telepresence
– Working with Gordon Bell on “the killer app”
– FileCast and PowerCast
– Cyberversity (international, on demand, free university)