THsort PennySort Award Ceremony, Beijing, China, 19 October 2002: Peng Liu, Yao Shi, Li Zhang,...
THsort PennySort
Award Ceremony, Beijing, China
19 October 2002
Peng Liu, Yao Shi, Li Zhang, Kuo Zhang, Tian Wang, ZunChong Tian, Hao Wang, Xiaoge Wang
Trophy presentation by Jim Gray
Outline
• Penny Sort history and Award
• The need for long-range research
• Some long-range systems research goals.
• What I have been doing.
Benchmark History
[Figure: timeline of benchmarks, 1970 to 2000]
• Wisconsin (Bitton, Boral, DeWitt, Turbyfill)
• IBM TP 1-7 (CA and Tony Lukes)
• Debit Credit (Gray)
• Datamation (Anon et al)
• MCC (Boral &...)
• Teradata (Bollinger &...)
• TPC-A, TPC-B, TPC-C, TPC-D, TPC-W ?
• Sort, PennySort, MinuteSort
A Short History of Sort
• April Fools 1985: Datamation Sort
– Sort 1M 100 B records
– An IO benchmark: 15-min to 1 hr!
• 1993: {Minute | Penny} x {Daytona | Indy}
• 1998: TeraByte Sort
• Web site: http://research.Microsoft.com/barc/SortBenchmark/
Ground Rules
• How much can you sort for a penny (in a minute).
– Hardware and Software cost
– Depreciated over 3 years
– 1M$ system gets about 1 second,
– 1K$ system gets about 1,000 seconds.
– Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
– 100-byte records (random data)
– key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
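The time budget in the ground rules can be checked with a few lines of Python. This is a minimal sketch, not part of the benchmark rules; the function name `penny_time_budget` is made up for illustration, and it assumes 365-day years and straight-line depreciation over the 3 years stated above.

```python
# Sketch of the PennySort time budget: how many seconds of a system's
# 3-year life one penny buys. Names here are illustrative assumptions.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000
DEPRECIATION_YEARS = 3
PENNIES_PER_DOLLAR = 100


def penny_time_budget(system_price_dollars):
    """Seconds of run time that cost one penny on a system of this price."""
    lifetime_seconds = DEPRECIATION_YEARS * SECONDS_PER_YEAR  # 94,608,000
    seconds_per_dollar = lifetime_seconds / system_price_dollars
    return seconds_per_dollar / PENNIES_PER_DOLLAR


for price in (1_000_000, 1_000, 857):
    print(f"${price:>9,}: {penny_time_budget(price):8.1f} seconds per penny")
```

A 1M$ system gets just under 1 second and a 1K$ system gets about 946 seconds, which matches the "about 1 second" and "about 1,000 seconds" rules of thumb.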
PennySort
• Hardware
– 266 MHz Intel PPro
– 64 MB SDRAM (10ns)
– Dual Fujitsu DMA 3.2GB EIDE disks
• Software
– NT workstation 4.3
– NT 5 sort
• Performance
– sort 15 M 100-byte records (~1.5 GB)
– Disk to disk
– elapsed time 820 sec
• cpu time = 404 sec
PennySort Machine (1107$) cost breakdown:
• cpu 32%
• Disk 25%
• board 13%
• Memory 8%
• Other 22% (Cabinet + Assembly 7%; Network, Video, floppy 9%; Software 6%)
1999 PennySort
• Daytona & Indy: 2.58 GB in 917 sec
• HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
• Intel 400 MHz, 2 IDE disks
1998 TB Sort
• Chris Nyberg, Nsort, SGI 32x Origin2000, 151 Minutes
1999 Terabyte Sort
• Daytona: David Cossock, Sam Fineberg, Pankaj Mehra, John Peck, Tandem/Sandia TSort: 68 CPU ServerNet, 47 minutes
• Indy: IBM SPsort, 488 nodes, 1952 cpu, 2168 disks, 17.6 minutes = 1057 sec (all for 1/3 of 94M$; slice price is 64k$ for 4 cpu, 2GB ram, 6 9GB disks + interconnect)
SP sort
• 2 – 4 GBps!
• 488 nodes, 55 racks: 1952 processors, 732 GB RAM, 2168 disks
– Compute: 432 nodes, 37 racks
– Storage: 56 nodes, 18 racks
• Compute rack: 16 nodes, each has 4x332 MHz PowerPC604e, 1.5 GB RAM, 1 32x33 PCI bus, 9 GB scsi disk, 150MBps full duplex SP switch
• Storage rack: 8 nodes, each has 4x332 MHz PowerPC604e, 1.5 GB RAM, 3 32x33 PCI bus, 30x4 GB scsi disk (4+1 RAID5), 150MBps full duplex SP switch
• 56 storage nodes manage 1680 4GB disks: 336 4+P twin tail RAID5 arrays (30/node)
[Figure: throughput in GB/s (0.0 to 4.0) vs. elapsed time in seconds (0 to 900); series: GPFS read, GPFS write, Local read, Local write]
1999 Sort Records
2002 Sort Records (Daytona and Indy)
• Penny
– Daytona: 9.8 GB, 1098 seconds, 105 million records, $857 Linux/Intel. THsort, report as doc (128KB) or pdf (33KB). Peng Liu, Yao Shi, Li Zhang, Kuo Zhang, Tian Wang, ZunChong Tian, Hao Wang, Xiaoge Wang, High Performance Institute, Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084, China
– Indy: 11.6 GB, 1380 seconds, 125 m records on a $672 Linux/Intel system. DMsort pdf (660KB), ps (950KB). Aaron Darling, Alex Mohr, U. Wisconsin, Madison
• Minute
– Daytona: 12 GB in 60 seconds. Ordinal Nsort, SGI 32 cpu Origin IRIX
– Indy: 21.8 GB in 56.51 sec, 218 million records. NOW+HPVMsort, 64 nodes WinNT, pdf. Luis Rivera, Andrew Chien, UCSD
• TeraByte
– Daytona: 49 minutes. David Cossock, Sam Fineberg, Pankaj Mehra, John Peck, 68x2 Compaq & Sandia Labs
– Indy: 1057 seconds. SPsort, 1952 SP cluster, 2168 disks. Jim Wyllie, PDF SPsort.pdf (80KB)
The THsort Team (and friend)
• Partly hardware
• Partly software
• Partly economics
[Figure: log-scale plot (1.E-03 to 1.E+06), 1985 to 2000. Records Sorted per Second Doubles Every Year; GB Sorted per Dollar Doubles Every Year (2x/year!); THsort ~ 1TB/$]
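The "~ 1TB/$" figure follows from the PennySort results quoted in this talk, since each result is the amount sorted for one penny. A small sketch, assuming only the GB-per-penny numbers from the slides; the function and dictionary names are illustrative.

```python
# Convert PennySort results (GB sorted for one penny) into GB per dollar.
# The GB figures are the ones quoted in the slides; names are illustrative.
PENNIES_PER_DOLLAR = 100


def gb_sorted_per_dollar(gb_per_penny):
    """Scale a one-penny result up to a full dollar."""
    return gb_per_penny * PENNIES_PER_DOLLAR


penny_results_gb = {
    "NT PennySort (1107$ machine)": 1.5,
    "1999 HMsort": 2.58,
    "2002 THsort": 9.8,
    "2002 DMsort": 11.6,
}

for name, gb in penny_results_gb.items():
    print(f"{name:30s} {gb_sorted_per_dollar(gb):8.0f} GB/$")
```

THsort at 9.8 GB per penny comes to about 980 GB per dollar, which is the "~ 1TB/$" on the chart.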
Progress on Sorting
• Speedup comes from Moore’s law 40%/year
• Processor/Disk/Network arrays: 60%/year (this is a software speedup).
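The two rates above can be combined to see how they relate to "doubles every year". The sketch below assumes the hardware and software speedups compound multiplicatively, which the slide does not state explicitly; the function names are illustrative.

```python
import math


def combined_growth(*annual_rates):
    """Yearly growth factor when independent rates compound multiplicatively."""
    factor = 1.0
    for rate in annual_rates:
        factor *= 1.0 + rate
    return factor


def doubling_time_years(yearly_factor):
    """Years needed to double at a given yearly growth factor."""
    return math.log(2) / math.log(yearly_factor)


factor = combined_growth(0.40, 0.60)   # Moore's law x array (software) speedup
print(f"combined factor per year: {factor:.2f}")
print(f"doubling time: {doubling_time_years(factor):.2f} years")
```

Under that assumption the combined factor is 1.4 x 1.6 = 2.24 per year, slightly better than doubling annually.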
[Figure: Sort Records/second vs Time, log scale (1.E+02 to 1.E+08), 1985 to 2000. Points: Bitton M68000, Cray YMP, IBM 3090, Tandem, Kitsuregawa Hardware Sorter, Sequent, Intel HyperCube, IBM RS6000, NOW, Alpha, Ordinal+SGI, PennyNTsort, Sandia/Compaq/NT, SPsort/IB]
[Figure: log-scale plot (1.E-03 to 1.E+06), 1985 to 2000. Records Sorted per Second Doubles Every Year; GB Sorted per Dollar Doubles Every Year. Points: Compaq/NT, NT/PennySort, SPsort, THsort ~ 1TB/$]
Musings: PennySort=TBsort
• Sorts 1TB in 1Minute
• 2 pass so 3TB of disk
• = 10 disks if 330GB/disk
• = 5GBps (if each disk is 500MBps)
• So, 600 seconds (3TB/5GBps)
• So, node costs 1.5k$
• Costs 100x that today
• maybe in 4 years?
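The chain of estimates in this musing can be written out step by step. This is a rough sketch under stated assumptions: the per-disk rate is taken as 500 MB/s (an assumption, chosen so that ten disks give the 5 GBps aggregate used in the 3TB/5GBps step), decimal units are used throughout, and the node price comes from the 3-year depreciation rule in the ground rules. All names are illustrative.

```python
import math

# Penny budget from the ground rules: 3 years of seconds, divided by 100.
PENNY_SECONDS_TIMES_DOLLARS = 3 * 365 * 24 * 3600 / 100   # 946,080


def tb_pennysort_estimate(data_gb=1000, disk_space_factor=3,
                          disk_capacity_gb=330, disk_rate_gb_per_s=0.5):
    """Back-of-envelope sizing for sorting 1 TB for a penny on one node."""
    disk_space_gb = data_gb * disk_space_factor            # 2 pass so 3 TB
    disks = math.ceil(disk_space_gb / disk_capacity_gb)    # about 10 disks
    aggregate_gb_per_s = disks * disk_rate_gb_per_s        # total disk rate
    elapsed_s = disk_space_gb / aggregate_gb_per_s         # 3TB / 5GBps
    node_price = PENNY_SECONDS_TIMES_DOLLARS / elapsed_s   # price a penny allows
    return disks, aggregate_gb_per_s, elapsed_s, node_price


disks, rate, elapsed, price = tb_pennysort_estimate()
print(f"{disks} disks, {rate} GB/s, {elapsed:.0f} seconds, node price ${price:,.0f}")
```

With these inputs the run takes 600 seconds, so the node may cost roughly 1.5k$ for the sort to fit in a penny.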
Outline
• Penny Sort history and Award
• The need for long-range research
• Some long-range systems research goals.
• What I have been doing.
Properties of a Research Goal
• Simple to state.
• Not obvious how to do it.
• Clear benefit.
• Can be broken into smaller steps
– So that you can see intermediate progress.
• Progress and solution are testable.
I was motivated by a simple goal
1. Devise an architecture that scales up: Grow the system without limits*.
This is impossible (without limits?), but...
This meant automatic parallelism, automatic management, distributed, fault tolerant, high performance
• Benefits:
– long term vision guides research problems
– simple to state, so attracts colleagues and support
– Can tell your friends & family what it is that you do.
scaleup: 1,000,000 : 1
Three Seminal Papers
• Babbage: Computers
• Bush: Automatic Information storage & access
• Turing: Intelligent Machines
• Note:
– Previous Turing lectures described several “theory” problems.
– Problems here are “systems” problems.
– Some include a “and prove it” clause.
– They are enabling technologies, not applications.
– Newell’s: Intelligent Universe (Ubiquitous computing) missing because I could not find “simple-to-state” problems.
Charles Babbage (1791-1871)
• Babbage’s computing goals have been realized
– But we still need better algorithms & faster machines
• What happens when
– Computers are free and infinitely powerful?
– Bandwidth and storage are free and infinite?
• Remaining limits:
– Content: the core asset of cyberspace
– Software: Bugs, >100$ per line of code (!)
– Operations: > 1,000 $/node/year
ops/s/$ Had Three Growth Curves 1890-1990
• 1890-1945: Mechanical, Relay: 7-year doubling
• 1945-1985: Tube, transistor,..: 2.3 year doubling
• 1985-2000: Microprocessor: 1.0 year doubling
[Figure: ops per second/$ on a log scale (1.E-06 to 1.E+09) vs. year (1880 to 2000), with trend lines labeled "doubles every 7.5 years", "doubles every 2.3 years", "doubles every 1.0 years"]
Combination of Hans Moravec + Larry Roberts + Gordon Bell: WordSize*ops/s/sysprice
Trouble-Free Appliances
• Appliance just works. TV, PDA, desktop, ...
• State replicated in safe place (somewhere else)
• If hardware fails, or is lost or stolen, replacement arrives next day (plug&play).
• If software faults, software and state refresh from server.
• If you buy a new appliance, it plugs in and refreshes from the server (as though the old one failed)
• Most vendors are building towards this vision.
• Browsers come close to working this way.
Trouble-Free Systems
• Manager
– Sets goals
– Sets policy
– Sets budget
– System does the rest.
• Everyone is a CIO (Chief Information Officer)
9. Build a system
– used by millions of people each day
– Administered and managed by a ½ time person.
• On hardware fault, order replacement part
• On overload, order additional equipment
• Upgrade hardware and software automatically.
Trustworthy Systems
• Build a system used by millions of people that
10. Only services authorized users
• Service cannot be denied (can’t destroy data or power).
• Information cannot be stolen.
11. Is always available: (out less than 1 second per 100 years = 9 9’s of availability)
• 1950’s 90% availability, Today 99% uptime for web sites, 99.99% for well managed sites (50 minutes/year). 3 extra 9s in 45 years.
• Goal: 5 more 9s: 1 second per century.
– And prove it.
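The relation between a count of 9s and expected downtime can be sketched as follows. This is an illustrative calculation only, assuming a 365.25-day year and treating "n 9s" as an unavailability of 10^-n; the function names are not from the talk.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_seconds(nines, years=1.0):
    """Expected downtime over a period for an availability of `nines` 9s."""
    unavailability = 10.0 ** -nines
    return unavailability * years * SECONDS_PER_YEAR


def nines_for_downtime(seconds_down, years):
    """Number of 9s implied by a downtime allowance over a period."""
    return -math.log10(seconds_down / (years * SECONDS_PER_YEAR))


print(f"99.99% (4 nines): {downtime_seconds(4) / 60:.1f} minutes/year")
print(f"9 nines: {downtime_seconds(9, years=100):.1f} seconds/century")
print(f"1 second/century: {nines_for_downtime(1, 100):.1f} nines")
```

Four 9s works out to a little over 50 minutes per year, and an allowance of one second per century corresponds to roughly nine and a half 9s.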
100 $ line of code? 1 bug per thousand lines?
• 20 $ to design and write it.
• 30 $ to test and document it.
• 50 $ to maintain it.
100$ total
• The only thing in Cyber Space that is getting MORE expensive & LESS reliable
Solution so far:
• Write fewer lines: High level languages
• Non Procedural
• 10x not 1,000x better: Very domain specific
• Application generators: Web sites, Databases, ...
• Semi-custom apps: SAP, PeopleSoft,..
• Scripting & Objects: JavaScript & DOM
Automatic Programming
Do What I Mean (not 100$ Line of code!, no programming bugs)
The holy grail of programming languages & systems
12. Devise a specification language or UI
1. That is easy for people to express designs (1,000x easier),
2. That computers can compile, and
3. That can describe all applications (is complete).
• System should “reason” about application
– Ask about exception cases.
– Ask about incomplete specification.
– But not be onerous.
• This already exists in domain-specific areas. (i.e. 2 out of 3 already exist)
• An imitation game for a programming staff.
Outline
• Penny Sort history and Award
• The need for long-range research
• Some long-range systems research goals.
• What I have been doing.
What I Have Been Doing
• Traveling & Talking
• Helping Alex Build the SkyServer
• Loading data
• Helping build the Virtual Observatory
• Doing spatial geometry in SQL (no kidding)!
• Learning about web services (and implementing some)