Jim Gray, Microsoft Research
Gray@Microsoft.com, http://research.Microsoft.com/~Gray, 415 778 8222

Transcript of a slide presentation.
1
Computers are Free, Now What?
Premise: You're a Fortune 1,000 CIO. I'm a DB+OS guy selling CyberBricks.
What can I say in an hour that you do not know? How can I help you plan for CyberBricks?
Jim Gray
Microsoft Research
http://research.Microsoft.com/~Gray
415 778 8222
2
Outline
• Why cost per transaction dropped 100,000x in 10 years.
• How does that change things?
• What next (technology trends)
• Clusters of Hardware and Software CyberBricks
3
Systems 30 Years Ago
• MegaBuck per Mega Instruction Per Second (mips)
• MegaBuck per MegaByte
• Sys Admin & Data Admin per MegaBuck
4
Disks of 30 Years Ago
• 10 MB
• Failed every few weeks
5
1988: IBM DB2 + CICS Mainframe: 65 tps
• IBM 4391
• Simulated network of 800 clients
• 2m$ computer
• Staff of 6 to do benchmark
• 2 x 3725 network controllers
16 GB disk farm
4 x 8 x .5GB
Refrigerator-sized CPU
6
1987: Tandem Mini @ 256 tps
• 14 M$ computer (Tandem)
• A dozen people (1.8M$/y)
• False floor, 2 rooms of machines
Simulate 25,600 clients
32 node processor array
40 GB disk array (80 drives)
OS expert
Network expert
DB expert
Performance expert
Hardware experts
Admin expert
Auditor
Manager
7
1997: 9 years later: 1 person and 1 box = 1,250 tps
• 1 breadbox ~ 5x 1987 machine room
• 23 GB is hand-held
• One person does all the work
• Cost/tps is 100,000x less: 5 micro-dollars per transaction
• 4 x 200 MHz cpu, 1/2 GB DRAM, 12 x 4 GB disk
• 3 x 7 x 4 GB disk arrays
• Hardware expert, OS expert, net expert, DB expert, app expert: all one person
8
Cost Per Transaction
• Industry uses $/tps (or $/tpm): the 5-year cost of hardware and software to get 1 tps.
• There are about 1 Million seconds in 3 years.
• So, if $/tps is 1$, $/t is 1 micro-dollar.
• 1988: mini: 50K$/tps, mainframe: 150K$/tps
  – 5 cents to 15 cents per transaction
• 1998: micro: 30$/tpmC = 50¢/tpsC
  – 5 micro-dollars per transaction
  – Note: it is actually 6x less than this; tpcC is 6x tpcA.
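The arithmetic above can be sketched in a few lines. The one-million-transactions figure is the slide's own rule of thumb (a 1-tps system delivers about a million transactions over its accounting life), not a derived constant:

```python
# Cost per transaction from cost per tps, using the slide's rule of thumb.
TRANSACTIONS_PER_TPS = 1_000_000  # the slide's rule of thumb

def cost_per_transaction(dollars_per_tps):
    """Dollars per transaction for a system priced at dollars_per_tps."""
    return dollars_per_tps / TRANSACTIONS_PER_TPS

# 1988 mini at 50K$/tps -> 5 cents; mainframe at 150K$/tps -> 15 cents.
print(cost_per_transaction(50_000))   # 0.05
print(cost_per_transaction(150_000))  # 0.15
```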
9
UNIX vs Windows NT
• Solaris on SPARC ranges from 11,559 tpmC @ 57$/tpmC (Sybase) to 51,871 tpmC @ 135$/tpmC (Oracle).
• SQL Server on NT/Compaq ranges from 11,748 tpmC @ 27$/tpmC to 18,129 tpmC @ 27$/tpmC.
• NT price per transaction is 2x to 4x less; peak performance per node is 3x less.
• Markup is in Oracle and SPARC (disk and DRAM prices are OK).
• Note: current NT prices are 27$/tpmC, not 33$/tpmC, so lower than shown.
• UNIX is 5x less than MVS according to David Matthews, "Large Server TCO: The UNIX Advantage", Unix Review, Feb 1998 Reseller Supplement, pp. 3-11.
[Chart: TPC price/tpmC broken down into processor, disk, software, net, and total/10: Sun Oracle 52 ktpmC @ 134$/tpmC vs. HP + NT4 + SQL Server 16.2 ktpmC @ 33$/tpmC]
10
[Chart: price vs. time curves for mainframe, mini, and micro]
What Happened? Where did the 100,000x come from?
• Moore's law: 100X (at most)
• Software improvements: 10X (at most)
• Commodity pricing: 100X (at least)
• Total: 100,000X
• 100x from commodity:
  – DBMS was 100K$ to start; now 1K$ to start
  – IBM 390 MIPS is 7.5K$ today
  – Intel MIPS is 10$ today
  – Commodity disk is 50$/GB vs 1,500$/GB
  – ...
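A quick check that the three factors above multiply out to the claimed total, plus the commodity price ratios the sub-bullets imply (illustrative arithmetic only):

```python
# The slide's decomposition of the 100,000x cost-per-transaction drop.
factors = {"Moore's law": 100, "software": 10, "commodity pricing": 100}

total = 1
for f in factors.values():
    total *= f
print(total)  # 100000

# Price ratios implied by the commodity sub-bullets.
print(100_000 / 1_000)  # DBMS starting price: 100x
print(1_500 / 50)       # disk $/GB: 30x
print(7_500 / 10)       # IBM 390 MIPS vs Intel MIPS: 750x
```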
11
Outline
• Why cost per transaction has dropped 100,000x in 10 years.
• How does that change things?
• What next (technology trends)
• Clusters of Hardware and Software CyberBricks
12
What does 1 μ$/t Mean?
• Human attention is the precious resource.
• Content is the precious resource.
• Impressions (eyeballs) sell for 10,000 μ$ to 100,000 μ$.
• All costs (and value) are in content and admin.
• Aside: this month, the TerraServer got 400M hits and 40M impressions: a 2M$/mo asset (for satellite photos).
• That's why everyone is hot on portals.
13
Administration Costs
• Vendor rule of thumb (1970s mainframe):
  – one systems programmer per MIPS
  – one data admin per 10 GB
• DataCenter rule of thumb:
  – Hardware & facilities is 40%; labor is 60%
  – => 100 sys pgmrs and 1 data admin per laptop!
• 1995 Federal study of their data centers:
  – 1 to 3 MIPS per admin! (http://research.microsoft.com/~gray/NC_Servers.doc)
• Thin client:
  – move admin to server
  – claim: save admin costs
  – reality: move admin costs to expensive fixed staff
  – Time will tell.
14
Content Costs
• For most web sites:
  – Most staff are doing content
  – Admin is a small fraction of content
• RULE OF THUMB:
  – Hardware/software/facilities/admin is 10% of cost
  – Content is 90% of cost
  – This seems to apply to microsoft.com, MSN, WebTV, HotMail, Inktomi
• MAIN CONCLUSION:
  – Hardware, software, admin is in the micro$/t range
  – Unix and mainframes are 2x or 10x more micro$
  – Who cares? Cost is in content
  – Look for content creation/management tools
15
Legacy Latency: a personal tale
• 1970s: helped company X convert to IMS/Fast Path
• 1980s: helped company X experiment with Tandem mini-computers
• 1990s: visit and ask: Why are you still buying those mainframes?
• Answers:
  1. They are up all the time (99.99% up).
  2. 25 years ago ROI was 18 months; now it is 1 week.
  3. A rewrite would cost more than it would ever save.
  4. My career would not survive a rewrite.
  5. The devil you know is better than the devil you don't.
16
Put Another Way
• You are AT&T or the airline industry or... You do 300 M transactions/day.
• The capital cost of these transactions is:
  – 300 $/day on NT
  – 1,000 $/day on Solaris
  – 10,000 $/day on MVS
• Who cares? Revenue and costs are 200,000,000 $/day, so transaction cost is .01% or .0001%.
• But, if productivity is higher on Solaris or NT... or if tools exist on them... or if the cost of a 2nd or 3rd environment is huge (staff), then...
• New apps should not go on MVS! Investing in SNA? Investing in IMS? Investing in TPF?
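The "who cares" point is simple arithmetic. A sketch (the daily dollar figures are the slide's; the computed percentages land near its ".01% or .0001%"):

```python
# Daily transaction capital cost as a fraction of daily revenue/cost.
daily_capital_cost = {"NT": 300, "Solaris": 1_000, "MVS": 10_000}  # $/day
daily_revenue = 200_000_000  # $/day

for platform, cost in daily_capital_cost.items():
    pct = 100 * cost / daily_revenue
    print(f"{platform}: {pct:.5f}% of daily revenue")
```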
17
What Happens Next
• Last 10 years: 100,000x improvement
• Next 10 years: ????
• Today: text and image servers are free: 25 μ$/hit => advertising pays for them
• Future: video, audio, ... servers are free. "You ain't seen nothing yet!"
[Chart: performance vs. time, 1985-2005]
18
And So...
• Traditional transaction processing is a zero-billion dollar industry.
• Growth is in new apps:
  – Immediate, point-to-point: conversation, money (Network)
  – Immediate, broadcast: lecture, concert (Network)
  – Time-shifted, point-to-point: mail (DataBase)
  – Time-shifted, broadcast: book, newspaper (DataBase)
• It's ALL going electronic.
• Immediate is being stored for analysis (so ALL database).
• Analysis & automatic processing are being added.
19
Why Put Everything in Cyberspace?
• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots
[Diagram: point-to-point OR broadcast, immediate OR time-delayed; the Network feeds a DataBase, which supports locate, process, analyze, summarize]
20
Some Tera-Byte Databases
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Many 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week): 15 PB by 2007
• Federal clearing house (images of checks): 15 PB by 2006 (7-year history)
• Nuclear Stockpile Stewardship Program: 10 Exabytes (???!!)
[Scale ladder: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
22
[Diagram: information sizes on the Kilo-to-Yotta scale: a letter, a novel, Library of Congress (text), a movie, LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]
23
Michael Lesk's Points (www.lesk.com/mlesk/ksg97/ksg.html)
• Soon everything can be recorded and kept.
• Most data will never be seen by humans.
• Precious resource: human attention. Auto-summarization and auto-search will be key enabling technologies.
24
Outline
• Why cost per transaction has dropped 100,000x in 10 years.
• How does that change things?
• What next (technology trends)
• Clusters of Hardware and Software CyberBricks
25
Technology (hardware)
NOW:
• CPU: nearing 1 BIPS
  – but CPI rising fast (2-10), so less than 100 mips
  – 1$/mips to 10$/mips
• DRAM: 3 $/MB
• DISK: 30 $/GB
• TAPE: 20 GB/tape, 6 MBps; lags disk; 2$/GB offline, 15$/GB nearline
2003 Forecast (10x better):
• CPU: 1 BIPS real (SMP), 0.1$ to 1$/mips
• DRAM: 1 Gb chip, 0.1 $/MB
• DISK: 10 GB smart cards, 500 GB RAID packs (NT inside), 3$/GB
• TAPE: ?
26
System On A Chip
• Integrate processing with memory on one chip:
  – chip is 75% memory now
  – 1 MB cache >> 1960s supercomputers
  – 256 Mb memory chip is 32 MB!
  – IRAM, CRAM, PIM, ... projects abound
• Integrate networking with processing on one chip:
  – system bus is a kind of network
  – ATM, FiberChannel, Ethernet, ... logic on chip
  – direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
27
Thesis: Many Little Beat Few Big
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
[Diagram: $1 million mainframe, $100K mini, $10K micro, nano; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk RAM; event-horizon on chip; VM reincarnated; multi-program cache, on-chip SMP; pico processor: 1 mm^3, 10 pico-second RAM; storage hierarchy: 10 nano-second RAM (1 MB), 10 microsecond RAM (100 MB), 10 millisecond disc (10 GB to 1 TB), 10 second tape archive (100 TB)]
28
Storage Latency: How Far Away is the Data?
(access time in clocks, with an analogy scaled so 1 clock = 1 minute)
• Registers: 1 clock: my head (1 min)
• On-chip cache: 2 clocks: this room
• On-board cache: 10 clocks: this campus (10 min)
• Memory: 100 clocks: Sacramento (1.5 hr)
• Disk: 10^6 clocks: Pluto (2 years)
• Tape/optical robot: 10^9 clocks: Andromeda (2,000 years)
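The analogy rests on one scaling assumption: treat each clock cycle as a minute. A sketch (the clock counts are the slide's):

```python
# Scale storage latencies (in cpu clocks) to human time: 1 clock = 1 minute.
MINUTES_PER_YEAR = 60 * 24 * 365

latency_clocks = {
    "registers": 1,
    "on-chip cache": 2,
    "on-board cache": 10,
    "memory": 100,
    "disk": 10**6,
    "tape robot": 10**9,
}

for level, clocks in latency_clocks.items():
    years = clocks / MINUTES_PER_YEAR
    if years >= 1:
        print(f"{level}: {years:,.0f} year(s) away")   # disk ~2, tape ~1,900
    else:
        print(f"{level}: {clocks} minute(s) away")
```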
29
Gilder's Telecosm Law: 3x bandwidth/year for 25 more years
• Today:
  – 10 Gbps per channel
  – 4 channels per fiber: 40 Gbps
  – 32 fibers/bundle = 1.2 Tbps/bundle
• In the lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber
• 1 Tbps = USA 1996 WAN bisection bandwidth
• 1 fiber = 25 Tbps
30
Networking: BIG!! Changes coming!
• Technology:
  – 10 GBps bus "now"
  – 1 Gbps links "now"
  – 1 Tbps links in 10 years
  – Fast & cheap switches
• Standard interconnects:
  – processor-processor
  – processor-device (= processor)
• Deregulation WILL work someday.
• CHALLENGE: reduce the software tax on messages
  – Today: 30 K instructions + 10 instructions/byte
  – Goal: 1 K instructions + .01 instructions/byte
• Best bet:
  – SAN/VIA
  – Smart NICs
  – Special protocol
  – User-level net IO (like disk)
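The software-tax bullets translate directly into a per-message instruction budget. A sketch, for an assumed 8 KB message (the message size is illustrative; the fixed and per-byte costs are the slide's):

```python
# Instruction cost of sending one message: a fixed per-message tax
# plus a per-byte tax.
def message_instructions(nbytes, fixed, per_byte):
    return fixed + per_byte * nbytes

msg = 8 * 1024  # assumed message size, for illustration
today = message_instructions(msg, fixed=30_000, per_byte=10)
goal = message_instructions(msg, fixed=1_000, per_byte=0.01)

print(today)                # 111920
print(round(goal))          # 1082
print(round(today / goal))  # 103: roughly a 100x reduction
```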
31
What if Networking Was as Cheap as Disk IO?
• TCP/IP: Unix/NT: 100% cpu @ 40 MBps
• Disk: Unix/NT: 8% cpu @ 40 MBps
Why the difference?
• The host bus adapter does SCSI packetizing, checksum, flow control, and DMA.
• The host does TCP/IP packetizing, checksum, and flow control, with small buffers.
32
The Promise of SAN/VIA: 10x better in 2 years
• Today:
  – wires are 10 MBps (100 Mbps Ethernet)
  – ~20 MBps tcp/ip saturates 2 cpus
  – round-trip latency is ~300 μs
• In two years:
  – wires are 100 MBps (1 Gbps Ethernet, ServerNet, ...)
  – tcp/ip ~100 MBps at 10% of each processor
  – round-trip latency is 20 μs
• Works in the lab today; uses the Winsock2 API. See http://www.viarch.org/
[Chart: bandwidth, latency, and overhead, now vs. soon]
33
SAN: Standard Interconnect
• Gbps Ethernet: 110 MBps
• PCI: 70 MBps
• UW SCSI: 40 MBps
• FW SCSI: 20 MBps
• SCSI: 5 MBps
• LAN faster than memory bus?
• 1 GBps links in the lab.
• 100$ port cost soon.
• The port is a computer.
• RIP: FDDI, ATM, SCI, SCSI, FC, ?
34
Data Gravity: Processing Moves to Transducers
• Move processing to data sources.
• Move to where the power (and sheet metal) is.
• Processor in:
  – Modem
  – Display
  – Microphones (speech recognition) & cameras (vision)
  – Storage: data storage and analysis
35
CyberBricks: Functionally Specialized Cards
• Storage
• Network
• Display
[Diagram: each card has M MB DRAM, a P-mips processor, and ASICs. Today: P = 20 mips, M = 2 MB. In a few years: P = 200 mips, M = 64 MB.]
36
With Tera-Byte Interconnect and Super Computer Adapters
• Processing is incidental to networking, storage, and UI.
• The disk controller/NIC is faster than the device, close to the device, and can borrow the device package & power.
• So use idle capacity for computation.
• Run the app in the device.
[Diagram: Tera Byte Backplane]
37
Tera Byte Backplane: All Device Controllers will be Cray 1's
• TODAY: the disk controller is a 10-mips risc engine with 2 MB DRAM; the NIC is similar power.
• SOON: they will become 100-mips systems with 100 MB DRAM.
• They are nodes in a federation (can run Oracle on NT in the disk controller).
• Advantages:
  – Uniform programming model
  – Great tools
  – Security
  – Economics (CyberBricks)
  – Move computation to data (minimize traffic)
[Diagram: central processor & memory surrounded by controller nodes]
38
It's Already True of Printers: Peripheral = CyberBrick
• You buy a printer.
• You get:
  – several network interfaces
  – a Postscript engine: cpu, memory, software, a spooler (soon)
  – and... a print engine.
49
Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has an execution environment:
  Applications / Services, DBMS / File system, RPC, ... / OS kernel, SAN driver, disk driver
50
Outline
• Why cost per transaction has dropped 100,000x in 10 years.
• How does that change things?
• What next (technology trends): CyberBricks
• Clusters of Hardware and Software CyberBricks
51
All God's Children Have Clusters! Buying Computing by the Slice
• People are buying computers by the dozens: computers only cost 1K$/slice!
• Clustering them together.
52
A cluster is a cluster is a cluster
• It's so natural, even mainframes cluster!
• Looking closer at usage patterns, a few models emerge.
• Looking closer at sites, you see hierarchies, bunches, functional specialization.
53
"Commercial" NT Clusters
• 16-node Tandem cluster: 64 cpus, 2 TB of disk, decision support
• 45-node Compaq cluster: 140 cpus, 14 GB DRAM, 4 TB RAID disk, OLTP (Debit Credit), 1 B tpd (14 k tps)
54
Tandem Oracle/NT
• 27,383 tpmC
• 71.50 $/tpmC
• 4 x 6 cpus
• 384 disks=2.7 TB
55
Microsoft.com: ~150x4 nodes
[Site diagram: the Microsoft.Com site spans the MOSWest, European, and Japan data centers plus Building 11, linked by routers, switched Ethernet, FDDI rings (MIS1-MIS4), and primary/secondary Gigaswitches, and connects to the Internet over 13 DS3 (45 Mb/s each), 2 OC3 (100 Mb/s each), and 2 Ethernet (100 Mb/s each). Server groups include www.microsoft.com, home.microsoft.com, search.microsoft.com, premium.microsoft.com, register.microsoft.com, support.microsoft.com, register.msn.com, msid.msn.com, activex.microsoft.com, cdm.microsoft.com, FTP.microsoft.com, FTP/HTTP download servers, staging servers, log processing, and SQL Servers (live, feeders, consolidators, reporting). Typical configurations: 4xP5 or 4xP6, 256 MB to 1 GB RAM, 12-180 GB disk, at $24K-$128K per server.]
56
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks disks: 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• Windows NT 4 EE, SQL Server 7.0
57
HotMail: ~400 Computers
[Site diagram:
• Front door: 140 P-200s (128 MB), +10/mo, FreeBSD/Apache, behind LocalDirectors on a 200 MBps Internet link
• Graphics: 15 x P6, FreeBSD/Hotmail
• Ads: 10 x P6, FreeBSD/Apache; ad pacer: 3 P6, FreeBSD
• Incoming mail: 25 x P-200, FreeBSD/hm-SMTP
• Security: 2 x P200, FreeBSD
• Member directory and user store: E3k (xx MB, 384 GB RAID5) + DLT tape robot, Solaris/HMNNFS; 50 machines, many old; 13 + 1.5/mo; 1 per million users
• M Serv: SPARC Ultra-1 (?? MB), 4 replicas, Solaris
• Cisco Catalyst 5000 Ethernet switch; local 10 Mbps switched Ethernet; telnet maintenance interface]
58
Inktomi (HotBot), WebTV: > 200 nodes
• Inktomi: ~250 UltraSparcs
  – web crawl
  – index crawled web and save index
  – return search results on demand
  – track ads and click-thrus
  – ACID vs BASE (Basically Available, Soft state, Eventually consistent)
• WebTV: ~200 UltraSparcs
  – render pages, provide email
  – ~4 Network Appliance NFS file servers
  – a large Oracle app tracking customers
59
Loki: Pentium Clusters for Science (http://loki-www.lanl.gov/)
• 16 Pentium Pro processors x 5 Fast Ethernet interfaces + 2 GBytes RAM + 50 GBytes disk + 2 Fast Ethernet switches + Linux = 1.2 real Gflops for $63,000 (but that is the 1996 price)
• Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
• Scientists want cheap mips.
60
Your Tax Dollars At Work: ASCI for Stockpile Stewardship
• Intel/Sandia: 9000 x 1-node PPro
• LLNL/IBM: 512 x 8 PowerPC (SP2)
• LANL/Cray: ?
• Maui Supercomputer Center: 512 x 1 SP2
61
Berkeley NOW (Network Of Workstations) Project (http://now.cs.berkeley.edu/)
• 105 nodes: Sun UltraSparc 170, 128 MB, 2 x 2 GB disk
  – Myrinet interconnect (2 x 160 MBps per node)
  – SBus (30 MBps) limited
• GLUNIX layer above Solaris
• Inktomi (HotBot search)
• NAS parallel benchmarks
• Crypto cracker
• Sort 9 GB per second
62
Wisconsin COW
• 40 UltraSparcs: 64 MB + 2 x 2 GB disk + Myrinet
• SunOS
• Used as a compute engine
63
Andrew Chien's JBOB (http://www-csag.cs.uiuc.edu/individual/achien.html)
• 48 nodes
• 36 HP Kayak boxes: 2 PII x 128 MB, 1 disk
• 10 Compaq Workstation 6000s: 2 PII x 128 MB, 1 disk
• 32 Myrinet-connected, 16 ServerNet-connected
• Operational, all running NT
64
NCSA Super Cluster (http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html)
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + Windows NT
• A super computer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model
65
1.2 B tpd
• 1 B tpd ran for 24 hrs.
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• 20x smaller than the Microsoft Internet Data Center (amazing!)
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction
66
Scalability
• 1 billion transactions
• 1.8 million mail messages
• 4 terabytes of data
• 100 million web hits
• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes
67
The Bricks of Cyberspace
• 4B PC: 1 BIPS, .1 GB DRAM, 10 GB disk, 1 Gbps net (B = G)
• Cost 1,000 $
• Come with:
  – NT
  – DBMS
  – High-speed net
  – System management
  – GUI / OOUI
  – Tools
• Compatible with everyone else
• CyberBricks
68
Super Server: 4T Machine
• Array of 1,000 4B machines:
  – 1 BIPS processors
  – 1 BB DRAM
  – 10 BB disks
  – 1 Bbps comm lines
  – 1 TB tape robot
• A few megabucks
• Challenge: manageability, programmability, security, availability, scaleability, affordability
• As easy as a single system
• Future servers are CLUSTERS of processors, discs
• Distributed database techniques make clusters work
[Diagram: a Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB disc]
69
Cluster Vision: Buying Computers by the Slice
• Rack & stack: mail-order components; plug them into the cluster
• Modular growth without limits: grow by adding small modules
• Fault tolerance: spare modules mask failures
• Parallel execution & data search: use multiple processors and disks
• Clients and servers made from the same stuff: inexpensive, built with commodity CyberBricks
70
Nostalgia: Behemoth in the Basement
• Today's PC is yesterday's supercomputer.
• Can use LOTS of them.
• Main apps changed:
  – scientific -> commercial -> web
  – Web & transaction servers
  – Data mining, Web farming
71
SMP -> nUMA: BIG FAT SERVERS
• Directory-based caching lets you build large SMPs.
• Every vendor is building a HUGE SMP:
  – 256-way
  – 3x slower remote memory
  – 8-level memory hierarchy: L1, L2 cache; DRAM; remote DRAM (3, 6, 9, ...); disk cache; disk; tape cache; tape
• Needs:
  – 64-bit addressing
  – a nUMA-sensitive OS (not clear who will do it)
• Or a hypervisor, like IBM LSF or Stanford Disco (www-flash.stanford.edu/Hive/papers.html)
• You get an expensive cluster-in-a-box with a very fast network.
72
Great Debate: Shared What?
• Shared Memory (SMP): easy to program, difficult to build, difficult to scale. SGI, Sun, Sequent.
• Shared Disk: VMScluster, Sysplex.
• Shared Nothing (network): hard to program, easy to build, easy to scale. Tandem, Teradata, SP2, NT.
• Clients front all three configurations.
• NUMA blurs the distinction, but has its own problems.
73
Technology Drivers: Plug & Play Software
• RPC is standardizing (DCOM, IIOP, HTTP):
  – gives huge TOOL LEVERAGE
  – solves the hard problems for you: naming, security, directory service, operations, ...
• Commoditized programming environments:
  – FreeBSD, Linux, Solaris, ... + tools
  – NetWare + tools
  – WinCE, WinNT, ... + tools
  – JavaOS + tools
• Apps gravitate to data.
• A general-purpose OS on the controller runs apps.
74
Restatement
The huge clusters we saw are prototypes for CyberBrick systems: a federation of functionally specialized nodes. Each node shrinks to a "point" device with embedded processing. Each node/device is autonomous, and each talks a high-level protocol.
75
Outline
• Clusters of hardware CyberBricks
  – all nodes are very intelligent
  – processing migrates to where the power is
  – disk, network, display controllers have a full-blown OS
  – send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
  – the computer is a federated distributed system
• Software CyberBricks
  – standard way to interconnect intelligent nodes
  – needs an execution model
  – needs parallelism
76
Software CyberBricks: Objects!
• It's a zoo.
• Objects and 3-tier computing (transactions):
  – give natural distribution & parallelism
  – give remote management!
  – TP & Web: dispatch RPCs to a pool of object servers
  – components are a 1B$ business today!
• Need a parallel & distributed computing model.
77
The COMponent Promise
• Objects are software CyberBricks:
  – productivity breakthrough (plug-ins)
  – manageability breakthrough (modules)
• Microsoft: DCOM + ActiveX
• IBM/Sun/Oracle/Netscape: CORBA + Java Beans
• Both promise:
  – parallel distributed execution
  – centralized management of distributed systems
• Both camps share key goals:
  – Encapsulation: hide implementation
  – Polymorphism: generic ops, key to GUI and reuse
  – Uniform naming
  – Discovery: finding a service
  – Fault handling: transactions
  – Versioning: allow upgrades
  – Transparency: local/remote
  – Security: who has authority
  – Shrink-wrap: minimal inheritance
  – Automation: easy
89
The OO Points So Far
• Objects are software CyberBricks.
• Object interconnect standards are emerging.
• CyberBricks become federated systems.
• Put processing close to data.
• Next point: do parallel processing.
90
Kinds of Parallel Execution
• Pipeline: one sequential program feeds its output into the next sequential program.
• Partition: inputs split N ways, copies of the same sequential program run on each partition, and outputs merge M ways.
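A minimal sketch of the two forms, using plain Python generators for the pipeline and a split/apply/merge for the partition. The stage names and data are illustrative; in a real system each stage or partition runs on its own processor:

```python
# Pipeline parallelism: each stage is an ordinary sequential program
# that consumes the previous stage's output stream.
def scan(rows):
    yield from rows

def keep_even(rows):
    return (r for r in rows if r % 2 == 0)

def double(rows):
    return (2 * r for r in rows)

print(list(double(keep_even(scan(range(10))))))  # [0, 4, 8, 12, 16]

# Partition parallelism: split the input N ways, run the SAME sequential
# program on every partition, then merge the outputs.
def partition(rows, n):
    return [rows[i::n] for i in range(n)]

parts = partition(list(range(10)), n=3)
merged = [r for part in parts for r in double(keep_even(part))]
print(sorted(merged))  # [0, 4, 8, 12, 16]
```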
91
Object Oriented Programming: Parallelism From Many Little Jobs
• Gives location transparency.
• The ORB/web/TP monitor multiplexes clients to servers.
• Enables distribution.
• Exploits embarrassingly parallel apps (transactions).
• HTTP and RPC (DCOM, CORBA, RMI, IIOP, ...) are the basis.
92
Why Parallel Access To Data?
• At 10 MB/s, a 1-terabyte scan takes 1.2 days.
• 1,000-way parallel: a 100-second scan.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
• BANDWIDTH!
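The scan numbers above are straightforward arithmetic; a sketch:

```python
# Time to scan 1 TB at a given aggregate bandwidth.
TB = 10**12  # bytes

def scan_seconds(nbytes, mb_per_sec, ways=1):
    """Seconds to scan nbytes at mb_per_sec per stream, across `ways` streams."""
    return nbytes / (mb_per_sec * 10**6 * ways)

serial = scan_seconds(TB, 10)               # one 10 MB/s stream
parallel = scan_seconds(TB, 10, ways=1000)  # 1,000-way parallel

print(serial / 86_400)  # ~1.16 days
print(parallel)         # 100.0 seconds
```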
98
Partitioned Execution
• A table is partitioned A...E, F...J, K...N, O...S, T...Z.
• A Count operator runs on each partition; a final Count merges the partial results.
• Spreads computation and IO among processors.
• Partitioned data gives NATURAL parallelism.
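A toy version of the partitioned count, with a thread per partition standing in for a processor per partition. The table contents are made up; only the partitioning scheme comes from the slide:

```python
# Partitioned count: run the same Count operator on each partition
# concurrently, then merge the partial counts.
from concurrent.futures import ThreadPoolExecutor

# A "table" of names, partitioned on first letter: A..E, F..J, K..N, O..S, T..Z.
bounds = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]
table = ["Adams", "Baker", "Gray", "Jones", "Lampson", "Stonebraker", "Wong"]
partitions = [
    [row for row in table if lo <= row[0] <= hi] for lo, hi in bounds
]

def count(rows):  # the per-partition operator
    return len(rows)

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(count, partitions))

print(partial_counts)       # [2, 2, 1, 1, 1]
print(sum(partial_counts))  # 7  (the merged Count)
```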
99
N x M Way Parallelism
• Five partitioned inputs (A...E, F...J, K...N, O...S, T...Z) each feed a Sort; each Sort feeds a Join; the Join outputs merge three ways.
• N inputs, M outputs, no bottlenecks.
• Partitioned data; partitioned and pipelined data flows.
100
Summary
• Clusters of hardware CyberBricks
  – all nodes are very intelligent
  – processing migrates to where the power is
  – disk, network, display controllers have a full-blown OS
  – send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
  – the computer is a federated distributed system
• Software CyberBricks
  – standard way to interconnect intelligent nodes
  – needs an execution model
  – needs parallelism
101
Summary
• Why cost per transaction has dropped 100,000x in 12 years.
• How does that change things?
• What next (technology trends)
• Hardware and Software CyberBricks
102
What I'm Doing
• TerraServer: photo of the planet on the web
  – a database (not a file system)
  – 1 TB now, 15 PB in 10 years
  – http://www.TerraServer.microsoft.com/
• Sloan Digital Sky Survey: picture of the universe
  – just getting started; CyberBricks for astronomers
  – http://www.sdss.org/
• Sorting:
  – one node: pennysort (http://research.microsoft.com/barc/SortBenchmark/)
  – multinode: NT Cluster sort (shows off SAN and DCOM)
103
What I'm Doing
• NT Clusters:
  – failover: fault tolerance within a cluster
  – NT Cluster Sort: balanced IO, cpu, network benchmark
  – AlwaysUp: geographical fault tolerance
• RAGS: random testing of SQL systems; a bug finder
• Telepresence:
  – working with Gordon Bell on "the killer app"
  – FileCast and PowerCast
  – Cyberversity (international, on-demand, free university)