1
BARC
Microsoft Bay Area Research Center
Tom Barclay, Tyler Beam (U VA)*, Gordon Bell, Joe Barrera, Josh Coates (UCB)*, Jim Gemmell, Jim Gray, Steve Lucco, Erik Riedel (CMU)*, Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen (NTFS)*
http://www.research.Microsoft.com/barc/
2
Overview
• Telepresence
»Goals
»Prototypes
• Rags: automating software testing
• Scalable Systems
»Goals
»Prototypes
• Misc.
3
4
Telepresence: The next Killer App
• Space shifting:
»Reduce travel
• Time shifting:
»Retrospectives
»Condensations
»Just-in-time meetings
• Example: ACM 97
»http://research.Microsoft.com/barc/acm97/
»NetShow and Web site
»More web visitors than attendees
5
What We Are Doing
• Scalable Reliable Multicast (SRM)
»used by WB (white board) on the Mbone
»NACK suppression (backoff)
»N² message traffic to set up
• Error Correcting SRM (EC SRM)
»Do not resend lost packets
»Send error correction in addition to regular packets
»(or) Send error correction in response to a NACK
»One EC packet repairs any of k lost packets
»Improved scalability (millions of subscribers)
6
Telepresence Prototypes
• PowerCast: multicast PowerPoint
» Streaming - pre-sends the next anticipated slide
» Sends slides and voice rather than talking head and voice
» Uses ECSRM for reliable multicast
» 1000's of receivers can join and leave at any time
» No server needed; no pre-load of slides
» Cooperating with NetShow
• FileCast: multicast file transfer
» Erasure encodes all packets
» Receivers only need to receive as many bytes as the length of the file
» Multicast IE to solve the Midnight-Madness problem
• NT SRM: reliable IP multicast library for NT
• Spatialized Teleconference Station
» Texture-maps faces onto spheres
» Spatially maps voices
7
IP Multicast
• Is pruned broadcast to a multicast address
• Unreliable
• Reliable delivery would require ACK/NACK
• State or NACK implosion problem
[Diagram: routers forward from one sender to interested receivers; subtrees with no interested receivers are pruned]
8
(n,k) encoding
[Diagram: k original packets are encoded into n packets (the first k are copies of the originals); taking any k of the n packets, the decoder reconstructs the original k]
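A minimal sketch of the erasure-coding idea above, using the simplest possible code: one XOR parity packet over a group of k packets (so n = k + 1), which can rebuild any single lost packet. This is an illustrative assumption, not ECSRM's actual code; real (n,k) codes such as Reed-Solomon generalize this so that ANY k of the n packets reconstruct the originals.

```python
# Toy (k+1, k) XOR erasure code: one parity packet repairs one loss.
from functools import reduce

def xor_packets(packets):
    """Byte-wise XOR of equal-length packets."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def encode(originals):
    """Return the k originals plus one parity packet (n = k + 1)."""
    return originals + [xor_packets(originals)]

def repair(received, lost_index):
    """Rebuild the single missing original from the surviving packets."""
    survivors = [p for i, p in enumerate(received) if i != lost_index]
    return xor_packets(survivors)

group = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 original packets
encoded = encode(group)               # n = 4 packets on the wire
encoded[1] = b"\x00" * 4              # pretend packet 1 was lost
assert repair(encoded[:1] + [None] + encoded[2:], 1) == b"BBBB"
```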
9
Fcast
• File transfer protocol
• FEC-only
• Files transmitted in parallel
10
Fcast send order
[Diagram: each file is split into groups of k packets and erasure-encoded to n packets (1..k, k+1..n); the sender cycles one packet per group per round across Files 1 and 2. A receiver needs any k packets from each row/group to reconstruct - "Need k from each row"]
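The send order in the diagram above can be sketched as follows. This is an assumed reading of the figure (round-robin, one packet per group per round), not the shipped Fcast code; the point is that a receiver joining at any moment collects any k packets per group and decodes.

```python
# Assumed Fcast-style send order: one packet per group per round,
# cycling through all groups, so losses spread evenly across groups.
def fcast_send_order(num_groups, n):
    """Yield (group, packet_index) pairs in transmission order."""
    for packet in range(n):            # round = packet index 0..n-1
        for group in range(num_groups):
            yield (group, packet)

order = list(fcast_send_order(2, 4))   # 2 groups, each encoded to n = 4
# Round-robin across groups before advancing to the next packet:
assert order[:4] == [(0, 0), (1, 0), (0, 1), (1, 1)]
```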
11
ECSRM - Erasure Correcting SRM
• Combines:
» suppression
» erasure correction
12
Suppression
• Delay a NACK or repair in the hopes that someone else will do it.
• NACKs are multicast
• After NACKing, re-set timer and wait for repair
• If you hear a NACK that you were waiting to send, then re-set your timer as if you did send it.
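A toy simulation of the suppression rule above. The timer distribution and propagation delay are assumptions for illustration; the point is that with random backoff and multicast NACKs, only a few of many receivers actually send.

```python
# Toy NACK suppression: each receiver schedules a NACK at a random
# backoff; hearing an earlier NACK (multicast to all) suppresses yours.
import random

def simulate_nacks(num_receivers, seed=0):
    """Return how many NACKs are actually sent for one lost packet."""
    rng = random.Random(seed)
    timers = sorted(rng.uniform(0, 1.0) for _ in range(num_receivers))
    propagation_delay = 0.05          # assumed one-way multicast delay
    earliest = timers[0]
    # Only receivers whose timer fires before the first NACK can
    # reach them actually send; everyone else is suppressed.
    return sum(1 for t in timers if t <= earliest + propagation_delay)

# 100 receivers all missing the same packet, but only a handful NACK:
sent = simulate_nacks(100)
assert 1 <= sent < 100
```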
13
ECSRM - adding FEC to suppression
• Assign each packet to an EC group of size k
• NACK: (group, # missing)
• NACK of (g,c) suppresses all (g,x) with x ≤ c
• Don’t re-send originals; send EC packets using (n,k) encoding
14
ECSRM
• Combine suppression & erasure correction
• Assign each packet to an EC group of size k
• NACK: (group, # missing)
• NACK of (g,c) suppresses all (g,x) with x ≤ c
• Don't re-send originals; send EC packets using (n,k) encoding
• Below, one NACK and one EC packet fix all the errors
[Diagram: a seven-packet group is multicast to several receivers; different receivers each lose a different packet (marked X), a single NACK is multicast, and one EC packet repairs every receiver's loss]
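The group-NACK suppression rule on the slide above can be written as a one-line predicate. This encoding of the bullets is my assumption: a heard NACK for (g, c) promises c repair packets for group g, which also cover any receiver missing x ≤ c packets of that group.

```python
# ECSRM-style NACK suppression: NACKs name (group, count_missing).
def suppressed(pending, heard):
    """True if a heard NACK makes my pending NACK redundant."""
    g_mine, x = pending
    g_heard, c = heard
    # c EC packets for group g repair ANY x <= c losses in that group.
    return g_mine == g_heard and x <= c

assert suppressed((3, 2), (3, 5))        # 5 repairs cover my 2 losses
assert not suppressed((3, 7), (3, 5))    # I still need more repairs
assert not suppressed((4, 1), (3, 5))    # different group
```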
15
Multicast PowerPoint Add-in
[Diagram: slides, annotations, and control information flow over ECSRM; the slide master is delivered via Fcast]
16
Multicast PowerPoint - Late Joiners
• Viewers joining late don't impact others with session-persistent data (slide master)
[Diagram: timeline with viewers joining and leaving at any time; the slide master flows continuously over Fcast while live data flows over ECSRM]
17
Future Work
• Adding hierarchy (e.g. PGM by Cisco)
• Do we need 2 protocols?
18
Spatialized Teleconferences
• Map heads to “Eggs”
• Project voices in stereo using “nose vector”
19
RAGS: RAndom SQL test Generator
• Microsoft spends a LOT of money on testing. (60% of development according to one source).
• Idea: test SQL by
» generating random correct queries
» executing queries against the database
» comparing results with SQL 6.5, DB2, Oracle, Sybase
• Being used in SQL 7.0 testing
» 375 unique bugs found (since 2/97)
» Very productive test tool
20
Sample Rags Generated Statement
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notesFROM titles T0, roysched T1WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY (
SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS (
SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 )
This statement yields an error: SQLState=37000, Error=8623, Internal Query Processor Error: Query processor could not produce a query plan.
21
Automation
• Simpler statement with the same error:
SELECT roysched.royalty FROM titles, roysched
WHERE EXISTS (
SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1)
• Control statement attributes
»complexity, kind, depth, ...
• Multi-user stress tests
»tests concurrency, allocation, recovery
22
One 4-Vendor Rags Test: 3 of them vs Us
• 60 K SELECTs on MSS, DB2, Oracle, Sybase
• 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements
• Examined 10 suspects, filed 4 bugs! One duplicate; assume 3/10 are new
• Note: this is the SQL Server Beta 2 product. Quality rising fast (and RAGS sees that)
23
RAGS Next Steps
• Done:
»Patents, papers, talks
»Tech transfer to development: SQL 7 (over 400 bugs), FoxPro, OLE DB
• Next steps:
»Make it even more automatic
»Extend to other parts of SQL and T-SQL
»"Crawl" the config space (look for new holes)
»Apply ideas to other domains (OLE DB)
Scale Up and Scale Out
[Diagram: Grow Up with SMP (4xP6 is now standard) - from personal system to departmental server to SMP super server; Grow Out with a cluster of PCs - the cluster has inexpensive parts]
Billions Of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous
Billions Of Clients Need Millions Of Servers
• All clients networked to servers; may be nomadic or on-demand
• Fast clients want faster servers
• Servers provide shared data, control, coordination, communication
[Diagram: mobile and fixed clients connected to servers and superservers]
Thesis: Many little beat few big
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
[Diagram: mainframe ($1 million), mini ($100 K), micro ($10 K), nano; disk form factors shrinking 14" > 9" > 5.25" > 3.5" > 2.5" > 1.8"; 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk RAM; event horizon on chip; VM reincarnated; multiprogram cache, on-chip SMP]
[Diagram: future storage hierarchy in 1 mm³ - pico processor with 10 ps RAM, 10 ns RAM, 10 µs RAM, 10 ms disc, 10 s tape archive; capacities ranging from 1 MB to 100 TB]
28
Microsoft TerraServer: Scaleup to Big Databases
• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M sq km from USGS (1-meter resolution)
» 1 M sq km from the Russian Space Agency (2 m)
• On the web (world's largest atlas)
• Sell images with commerce server
29
Microsoft TerraServer Background
• Earth is 500 tera-square-meters
» USA is 10 Tm²
• 100 Tm² of land lies between 70ºN and 70ºS
• We have pictures of 6% of it
» 3 Tm² from USGS
» 2 Tm² from the Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks
• Store chunks in the DB
• Navigate with
» Encarta™ Atlas
• globe
• gazetteer
» StreetsPlus™ in the USA
• Image pyramid: 40x60 km jump image, 20x30 km browse image, 10x15 km thumbnail, 1.8x1.2 km tile
• Someday
» multi-spectral imagery
» of everywhere
» once a day / hour
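A back-of-envelope check of the numbers above (my arithmetic, using decimal units; not from the slide): 1.5 TB of compressed imagery sliced into 10 KB chunks implies the order of magnitude of rows the database must hold.

```python
# 1.5 TB of JPEG tiles / 10 KB per chunk = number of DB rows (decimal units).
TB = 10**12
KB = 10**3

compressed_bytes = 1.5 * TB
chunk_bytes = 10 * KB
chunks = compressed_bytes / chunk_bytes
assert chunks == 150e6    # ~150 million image chunks stored in SQL Server
```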
30
USGS Digital Ortho Quads (DOQ)
• US Geological Survey
• 4 terabytes
• Most data not yet published
• Based on a CRADA
» Microsoft TerraServer makes the data available
[Image caption: USGS "DOQ" - 1x1 meter, 4 TB, continental US, new data coming]
31
Russian Space Agency (SovInformSputnik)
SPIN-2 (Aerial Images is the worldwide distributor)
• 1.5-meter geo-rectified imagery of (almost) anywhere
• Almost equal-area projection
• De-classified satellite photos (from 200 km)
• More data coming (1 m)
• Selling imagery on the Internet
• Putting 2 Tm² onto Microsoft TerraServer
SPIN-2
32
Live on the Internet since 6/24/98 (18 months)
One Billion Served
• New since S-Day:
» More data: 4.8 TB USGS DOQ, .8 TB Russian
» Bigger server: Alpha 8400, 8 proc, 8 GB RAM, 2.9 TB disk
» Improved application: better UI, uses ASP, commerce app
» Load 6 TB more: 60% US, 4% world
• 30 M web hits per day peak
• 8 M hits per day average (1 M page views/day)
• 1 billion pages served!
• 99.95% available
• No NT failures; 30-minute SQL restart
33
http://www.TerraServer.Microsoft.com/
Demo
[Logos: SPIN-2, Microsoft BackOffice]
34
Demo
• Navigate by coverage map to the White House
• Download an image
• Buy imagery from USGS
• Navigate by name to Venice
• Buy a SPIN-2 image & Kodak photo
• Pop out to an Expedia street map of Venice
• Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)
35
Hardware
1 TB Database Server
• AlphaServer 8400, 4x400
• 10 GB RAM
• 324 StorageWorks disks
• 10-drive tape library (STK TimberWolf DLT7000)
[Diagram: Internet via DS3 to a 100 Mbps Ethernet switch; web servers, site servers, map server, and SPIN-2 feed the AlphaServer 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM) with an Enterprise Storage Array of 489 GB drive shelves and an STK 9710 DLT tape library]
36
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha CPUs
• 10 GB DRAM
• 324 9.2 GB StorageWorks disks
» 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• Windows NT 4 EE, SQL Server 7.0
37
Software
[Diagram: web clients (browser with HTML/Java viewer) reach the TerraServer web site over the Internet; Internet Information Server 4.0 with Active Server Pages and MTS runs the image delivery application; the Microsoft Automap ActiveX server and image provider sites (Microsoft Site Server EE, IIS 4.0 image server) feed the TerraServer DB - SQL Server 7 with TerraServer stored procedures - and the Automap server]
38
System Management & Maintenance
• Backup and recovery
»STK 9710 tape robot
»Legato NetWorker™
»SQL Server 7 backup & restore
»Clocked at 80 MBps peak (~200 GB/hr)
• SQL Server Enterprise Mgr
»DBA maintenance
»SQL Performance Monitor
39
Microsoft TerraServer File Group Layout
• Convert 324 disks to 28 RAID5 sets plus 28 spare drives
• Make 4 WinNT volumes (RAID 50), 595 GB per volume
• Build 30 20 GB files on each volume
• DB is a file group of 120 files
[Diagram: E:, F:, G:, H: volumes, each striped across RAID5 sets behind paired HSZ70 A/B controllers]
40
Image Delivery and Load
• Incremental load of 4 more TB in the next 18 months
[Diagram: DLT tape "tar" is dropped into a \Drop'N' staging area ("do job, wait 4 load"); ImgCutter cuts \Drop'N'\Images; LoadMgr and the LoadMgr DB drive AlphaServer 4100 loaders over a 100 Mbit Ethernet switch into the AlphaServer 8400's Enterprise Storage Array (108 9.1 GB drive shelves, 604.3 GB drives) and the STK DLT tape library (NTBackup); load steps: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place]
41
Technical Challenge - Key Idea
• Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server)
• Solution: a geo-spatial search key
» Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
» Z-transform X & Y into a single Z value; build a B-tree on Z
» Adjacent images are stored next to each other
• Search method: latitude and longitude => X, Y, then Z; select on matching Z value
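A sketch of the Z-transform key described above. One assumption here: that "Z-transform" means bit interleaving (a Z-order / Morton code), which is the standard way to fold X and Y into one B-tree key so that cells adjacent in X,Y tend to be near each other in Z.

```python
# Z-order (Morton) key: interleave the bits of grid coordinates x and y.
def z_value(x, y, bits=16):
    """Fold two cell coordinates into a single B-tree-friendly key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions from y
    return z

def cell(lat, lon):
    """Map lat/lon to the grid above: 1/48 deg lon (X), 1/96 deg lat (Y)."""
    x = int((lon + 180) * 48)    # longitude cells, 1/48th degree
    y = int((lat + 90) * 96)     # latitude cells, 1/96th degree
    return x, y

# Bit interleaving: (x,y) = (0,0),(1,0),(0,1),(1,1) -> Z = 0,1,2,3,
# so a B-tree range scan on Z touches spatially adjacent tiles.
assert [z_value(x, y) for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]] == [0, 1, 2, 3]
```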
42
Some Tera-Byte Databases
[Scale: kilo, mega, giga, tera, peta, exa, zetta, yotta]
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of the planet each week)
» 15 PB by 2007
• Federal clearing house: images of checks
» 15 PB by 2006 (7-year history)
• Nuclear Stockpile Stewardship Program
» 10 exabytes (???!!)
43
[Diagram: information scale from kilo to yotta - a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]
44
Michael Lesk’s Points www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious resource: human attention
• Auto-summarization and auto-search will be key enabling technologies
45
Scalability
• 1 billion transactions
• 1.8 million mail messages
• 4 terabytes of data
• 100 million web hits
• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes
46
1.2 B tpd
• 1 B tpd ran for 24 hrs
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction
47
[Chart: millions of transactions per day, log scale 0.1 to 1,000 - 1 Btpd vs Visa, ATT, BofA, NYSE]
How Much Is 1 Billion Tpd?
• 1 billion tpd = 11,574 tps (~700,000 transactions/minute)
• ATT
» 185 million calls per peak day (worldwide)
• Visa: ~20 million tpd
» 400 million customers
» 250 K ATMs worldwide
» 7 billion transactions (card + cheque) in 1994
• New York Stock Exchange
» 600,000 tpd
• Bank of America
» 20 million tpd checks cleared (more than any other bank)
» 1.4 million tpd ATM transactions
• Worldwide airline reservations: 250 Mtpd
48
NCSA Super Cluster
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II CPUs, 2,096 disks, SAN
• Compaq + HP + Myricom + Windows NT
• A supercomputer for 3 M$
• Classic Fortran/MPI programming
• DCOM programming model
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
49
NT Clusters (Wolfpack)
• Scale DOWN to PDA: Windows CE
• Scale UP an SMP: TerraServer
• Scale OUT with a cluster of machines
• Single-system image
»Naming
»Protection/security
»Management/load balance
• Fault tolerance
»“Wolfpack”
• Hot pluggable hardware & software
50
Symmetric Virtual Server Failover Example
[Diagram: a browser reaches two virtual servers - a web site (with web site files) and a database (with database files); Server 1 and Server 2 each host one virtual server, and when one server fails, both virtual servers run on the survivor]
51
Clusters & BackOffice
• Research: instant & transparent failover
• Making BackOffice PlugNPlay on Wolfpack
»Automatic install & configure
• Virtual Server concept makes it easy
»simpler management concept
»simpler context/state migration
»transparent to applications
• SQL 6.5E & 7.0 Failover
• MSMQ (queues), MTS (transactions).
52
Storage Latency: How Far Away is the Data?
[Diagram: latency in clock ticks, scaled to human terms - registers (1, my head, 1 min), on-chip cache (2, this room), on-board cache (10, this campus, 10 min), memory (100, Sacramento, 1.5 hr), disk (10^6, Pluto, 2 years), tape/optical robot (10^9, Andromeda, 2,000 years)]
53
The Memory Hierarchy
• Measuring & Modeling Sequential IO
• Where is the bottleneck?
• How does it scale with
»SMP, RAID, new interconnects
[Diagram: app address space > file cache > memory bus > PCI > adapter > SCSI controller > disk. Goals: balanced bottlenecks, low overhead, scale to many processors (10s) and many disks (100s)]
54
Sequential IO (your mileage will vary)
[Chart: throughput (MB/s) vs transfer size (2-128 KB) for 1-disk and 4-disk reads and writes - striping helps; the controller is the bottleneck]
• 40 MB/s advertised UW SCSI
• 35 r / 23 w MB/s actual disk transfer
• 29 r / 17 w MB/s at 64 KB requests (NTFS)
• 9 MB/s single-disk media
• 3 MB/s at 2 KB requests (SQL Server)
• Measuring hardware & software
• Looking for software fixes
• Aiming for the "out of the box" half-power point: 50% of peak power
[Chart: 1-disk read and write, with and without NTFS buffering, vs transfer size (2-128 KB) - NTFS read is good at 8 KB, but writes are uniformly slow]
55
PAP (peak advertised performance) vs RAP (real application performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Diagram: application data moves through file system buffers, PCI (133 MBps), SCSI (40 MBps), and disk (10-15 MBps media rate); actual throughput is 7.2 MB/s at every stage, against a 422 MBps system bus]
56
The Best Case: Temp File, NO IO
• Temp file read/write hits the file system cache
• Program uses a small (in-CPU-cache) buffer
• So write/read time is bus move time (3x better than copy)
• Paradox: the fastest way to move data is to write it, then read it
• This hardware is limited to 150 MBps per processor
[Chart: temp read 148 MBps, temp write 136 MBps, memcopy() 54 MBps]
57
Bottleneck Analysis
• Drawn to linear scale
[Bar chart: theoretical bus bandwidth 422 MBps (66 MHz x 64 bits); memory read/write ~150 MBps; memcopy ~50 MBps; disk R/W ~9 MBps]
58
3 Stripes and You're Out!
• 3 disks can saturate the adapter
• Similar story with UltraWide
• CPU time goes down with request size
• Ftdisk striping is cheap
[Charts: read and write throughput (MB/s) vs request size (2-192 KB) for 1-4 disk stripes, 3-deep, Fast SCSI; CPU milliseconds per MB vs request size]
59
Parallel SCSI Busses Help
• A second SCSI bus nearly doubles read and WCE write throughput
• Write needs deeper buffers
• Experiment is unbuffered (3-deep + WCE)
[Chart: throughput (MB/s) vs request size (2-192 KB) for read, write, and WCE on one vs two SCSI busses - two busses give ~2x]
60
File System Buffering & Stripes (UltraWide Drives)
• FS buffering helps small reads
• FS buffered writes peak at 12 MBps
• 3-deep async helps
• Write peaks at 20 MBps
• Read peaks at 30 MBps
[Charts: three disks at 1-deep and 3-deep queue depth - throughput (MB/s) vs request size (2-192 KB) for FS read, read, FS write WCE, and write WCE]
61
PAP vs RAP
• Reads are easy, writes are hard
• Async writes can match WCE
[Diagram: system bus 422 MBps advertised vs 142 MBps achieved; PCI 133 vs 72 MBps; SCSI 40 vs 31 MBps; disks 10-15 vs 9 MBps]
62
Bottleneck Analysis
• NTFS read/write: 9 disks, 2 SCSI busses, 1 PCI
» ~65 MBps unbuffered read
» ~43 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write
[Diagram: memory read/write ~150 MBps; one PCI bus ~70 MBps; each adapter ~30 MBps]
63
Hypothetical Bottleneck Analysis
• NTFS read/write: 12 disks, 4 SCSI, 2 PCI (not measured; we had only one PCI bus available, the 2nd one was "internal")
» ~120 MBps unbuffered read
» ~80 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write
[Diagram: memory read/write ~150 MBps; two PCI busses at ~70 MBps each; four adapters at ~30 MBps each]
64
Computers Shrink to a Point
• Disks on track
• 100x in 10 years: 2 TB 3.5" drive
• Shrink to 1" is 200 GB
• Disk replaces tape?
• Disk is a supercomputer!
[Scale: kilo, mega, giga, tera, peta, exa, zetta, yotta]
65
Data Gravity: Processing Moves to Transducers
• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
»Modem
»Display
»Microphones (speech recognition) & cameras (vision)
»Storage: Data storage and analysis
66
It's Already True of Printers: Peripheral = CyberBrick
• You buy a printer
• You get
»several network interfaces
»a PostScript engine
• CPU
• memory
• software
• a spooler (soon)
»and... a print engine
Remember Your Roots
68
Year 2002 Disks
• Big disk (10 $/GB)
» 3"
» 100 GB
» 150 kaps (k accesses per second)
» 20 MBps sequential
• Small disk (20 $/GB)
» 3"
» 4 GB
» 100 kaps
» 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)
69
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: a federation
• Each node does not completely trust the others
• Nodes use RPC to talk to each other
» CORBA? DCOM? IIOP? RMI?
» One or all of the above
• Huge leverage in high-level interfaces
• Same old distributed-system story
[Diagram: two application stacks exchanging RPC, streams, and datagrams over VIA (VIAL/VIPL) wires]
70
What if Networking Was as Cheap as Disk IO?
• TCP/IP
»Unix/NT: 100% CPU @ 40 MBps
• Disk
»Unix/NT: 8% CPU @ 40 MBps
• Why the difference? The host bus adapter does SCSI packetizing, checksum, flow control, and DMA; the host does TCP/IP packetizing, checksum, and flow control with small buffers
71
Technology Drivers: The Promise of SAN/VIA - 10x in 2 years
http://www.ViArch.org/
• Today:
»wires are 10 MBps (100 Mbps Ethernet)
»~20 MBps TCP/IP saturates 2 CPUs
»round-trip latency is ~300 µs
• In the lab:
»wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, ...
»fast user-level communication
• TCP/IP ~100 MBps at 10% of each processor
• round-trip latency is 15 µs
72
SAN: Standard Interconnect
• LAN faster than memory bus?
• 1 GBps links in the lab
• 100$ port cost soon
• Port is the computer
[Chart: Gbps Ethernet 110 MBps > PCI 70 MBps > UW SCSI 40 MBps > FW SCSI 20 MBps > SCSI 5 MBps; RIP: FDDI, ATM, SCI, SCSI, FC, ...?]
73
Technology Drivers: Plug & Play Software
• RPC is standardizing (DCOM, IIOP, HTTP)
» gives huge TOOL LEVERAGE
» solves the hard problems for you:
• naming
• security
• directory service
• operations, ...
• Commoditized programming environments
» FreeBSD, Linux, Solaris, ... + tools
» NetWare + tools
» WinCE, WinNT, ... + tools
» JavaOS + tools
• Apps gravitate to data
• A general-purpose OS on the controller runs apps
74
Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has an execution environment
[Diagram: applications on top of services (file system, RPC, DBMS) on top of an OS kernel with SAN and disk drivers]
75
Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark
• How much can you sort for a penny?
» hardware and software cost
» depreciated over 3 years
» a 1 M$ system gets about 1 second
» a 1 K$ system gets about 1,000 seconds
» Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
» 100-byte records (random data)
» key is the first 10 bytes
• Must create the output file and fill it with a sorted version of the input file
• Daytona (product) and Indy (special) categories
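The penny budget works out as follows (my arithmetic, depreciating over 3 years of 365 days to match the 946,080 constant): a penny buys 1/100th of a dollar's share of the machine's lifetime.

```python
# Seconds of system time one penny buys on a machine depreciated over
# 3 years (3 * 365 * 24 * 3600 = 94,608,000 seconds).
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 3600

def penny_seconds(system_price_dollars):
    """Time budget for one penny: 946,080 / price."""
    dollars_per_second = system_price_dollars / SECONDS_IN_3_YEARS
    return 0.01 / dollars_per_second

assert 0.9 < penny_seconds(1_000_000) < 1.0    # 1 M$ system: ~1 second
assert 900 < penny_seconds(1_000) < 1000       # 1 K$ system: ~1,000 seconds
```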
76
PennySort
• Hardware
» 266 MHz Intel PPro
» 64 MB SDRAM (10 ns)
» dual Fujitsu DMA 3.2 GB EIDE
• Software
» NT Workstation 4.3
» NT 5 sort
• Performance
» sorts 15 M 100-byte records (~1.5 GB)
» disk to disk
» elapsed time 820 sec (CPU time = 404 sec)
[Pie chart: PennySort machine (1,107$) - CPU 32%, disk 25%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%, other 22%]
77
Cluster Sort Conceptual Model
• Multiple data sources
• Multiple data destinations
• Multiple nodes
• Disks -> sockets -> disk -> disk
[Diagram: three nodes each scatter their local A/B/C records over sockets so that all A's land on one node, all B's on another, and all C's on a third]
78
Cluster Install & Execute
• If this is to be used by others, it must be:
» easy to install
» easy to execute
• Installations of distributed systems take time and can be tedious (AM2, GluGuard)
• Parallel remote execution is non-trivial (GLUnix, LSF)
• How do we keep this "simple" and "built-in" to NTClusterSort?
79
Remote Install
• Add a Registry entry to each remote node:
» RegConnectRegistry()
» RegCreateKeyEx()
80
Cluster Execution
• Setup: MULTI_QI struct, COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve the remote object handle from the MULTI_QI struct
• Invoke methods as usual
[Diagram: one driver holds three handles and invokes Sort() on three remote nodes]
81
Public Service
• Gordon Bell
» Computer Museum
» Vanguard Group
» edits a column in CACM
• Jim Gray
» National Research Council Computer Science and Telecommunications Board
» Presidential Advisory Committee on NGI-IT-HPPC
• Tom Barclay
» USGS and Russian cooperative research
82
A Plug for CoRR
• CoRR = Computing Research Repository
• All computer science literature in cyberspace
• http://xxx.lanl.gov/archive/cs
• Endorsed by CACM
• Reviewed & Refereed EJournals will evolve from this archive
• PLEASE submit articles
• Copyright issues are still problematic
83
BARC
Microsoft Bay Area Research Center
Tom Barclay, Tyler Beam (U VA)*, Gordon Bell, Joe Barrera, Josh Coates (UCB)*, Jim Gemmell, Jim Gray, Steve Lucco, Erik Riedel (CMU)*, Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen (NTFS)*
http://www.research.Microsoft.com/barc/