David Culler, NSF Site Visit March 5, 2003
SimMillennium and Beyond
From “Computer Systems, Computational Science and Engineering in the Large” to “petabyte stores”
David Culler
NSF Site Visit, March 5, 2003
Millennium 2
SimMillennium Project Goals
• Vision: To work, think, and study in a computationally rich environment with deep information stores and powerful services
• Enable major advances in Computational Science and Engineering
– Simulation, Modeling, and Information Processing becoming ubiquitous
• Explore novel design techniques for large, complex systems
– Fundamental Computer Science problems ahead are problems of scale
– Organized in concert with Univ. structure => computational economy
• Develop fundamentally better ways of assimilating and interacting with large volumes of information and with each other
• Explore emerging technologies
– networking, OS, devices
Millennium 3
Research Infrastructure We Built
• Cluster of Clusters (CLUMPS) distributed over multiple departments
– gigabit ethernet within and between
– Myrinet High speed interconnect
• Vineyard Cluster System Architecture
– Rootstock remote cluster installation tools
– Ganglia remote cluster monitoring
– GEXEC remote execution, GM (Myricom) messaging, MPI
– PCP – parallel file tools
– collection of port daemons, tools to make it all hang together
• Gigabit to desktop, ImmersaDesk, ...
Millennium 4
Millennium Gen3 Network Topology
In a perfect world ....
[Figure: network diagram. Millennium | Clustered Computing Research Group | University of California, Berkeley | 15 Nov 02]
• Clusters: Millennium (Soda), Astrophysics (Campbell), Math (Evans), EECS (Cory), OceanStore (Soda), CITRIS (Soda), Administrative (Soda), NPACI Rocks, PlanetLab, CITRIS Pilot, NOW, future clusters
• Gateways: mil-gw, astro-gw, math-gw, eecs-gw, ocean-gw 1, ocean-gw 2, citris-gw, admin-gw
• Switches: Cisco-6500, BigIron8k, BigIron4k, FastIron1500; backbones soda-bb and evans-bb; UCB campus core; GigE desktops in Soda
• Key: primary link, secondary link; all links are 1000 Mbps; networks under Millennium management vs. networks not under Millennium management
• Future WAN/CITRIS: Millennium backbone provides plug and play support for 10 GigE LAN/WAN PHY and OC-3, 12, 48 POS
Millennium 5
Cluster Counts
• Millennium Central Cluster
– 99 Dell 2300/6400/6450 Xeon Dual/Quad: 336 processors
– Total: 238 GB memory, 2 TB disk
– Myrinet 2000 + 1000Mb fiber ethernet
• Millennium Campus Clusters (Astro, Math, CE, EE, Physics, Bio)
– 176 proc, 34 GB mem, 1.2 TB local disk
– total: 512 proc, 292 GB mem, 3.2 TB scratch
• NPACI ROCKS Cluster
– 8 proc, 2 GB mem, 36 GB
• OceanStore/ROC cluster
• PlanetLab Cluster
– 6 proc, 1.32 GHz, 3 GB mem, 180 GB
• CITRIS Cluster 1: 3/2002 deployment (Intel Donation)
– 4 Dell Precision 730 Itanium Duals: 8 processors
– Total: 8 GB memory, 128 GB disk
– Myrinet 2000 + 1000Mb copper ethernet (SimMil)
• CITRIS Cluster 2: deployment (Intel Donation)
– ~128 Dell McKinley class Duals: 256 processors
» 16x2 installed
– Total: ~512 GB memory, ~8 TB disk
– Myrinet 2000 + 1000Mb copper ethernet (SimMil)
• Many phasing out
– NOW, Ninja, Dig Lab. ...
Millennium 6
Cluster Top Users 2/2003
• ~800 users total on central cluster
• 84 major users for 2/2003: average 62% total CPU utilization
– ROC – middle tier storage layer testing/performance (bling, ach, fox@stanford)
– Computer Vision Group – image recognition, boundary detection and segmentation, data mining (aberg, lwalk, dmartin, ryanw, xren): “2 hours on cluster vs. 2 weeks on local resources”
– Computational Biology Lab – large-scale biological sequence database searches in parallel (brenner@compbio)
– Tempest – TCAD tools for Next Generation Lithography (yunfei)
– Internet services – performance characteristics of multithreaded servers (jrvb, jcondit)
– Sensor Networks – power reduction (vwen)
– Economic modeling (stanton@haas)
– Machine learning – information retrieval, text processing (blei)
– Analyzing trends in BGP routing tables (sagarwal, mccaesar)
– Graphics – optical simulation and high quality rendering (adamb, csh)
– Digital Library Project – image retrieval by image content (loretta)
– Bottleneck Analysis of Fine-grain Parallelism (bfields)
– SPUR – earthquake simulation (jspark@ce)
– Titanium – compiler and runtime system design for high performance parallel programming languages (bonachea)
– AMANDA – neutrino detection from polar ice core samples (amanda)
http://ganglia.millennium.berkeley.edu
Millennium 7
Impact
• Numerous groups doing research they could not have done without it
– Malik, photorealistic rendering, physics simulation, ...
– Yelick, Titanium, Heart Modeling, ...
– Wilensky, Digital Library, image segmentation
– Brewer, Culler, Ninja Internet Service Arch...
– Price, AMANDA, ...
– Kubiatowicz, OceanStore; Katz, Sahara; Hellerstein, PIER
• First eScience Portals
– Tempest, EUV lithography, Sugar MEMS simulation services
• safe.millennium.berkeley.edu on Sept 11
– built within hours, scaled to a million hits per day
• CS267 – core of MS of computation science
• Cluster tools widely adopted
– NPACI ROCKS
– Ganglia the most downloaded cluster tool, in all the distributions, OSCAR, open source development team
Millennium 8
Computational Economy
• Developed economic-based resource allocation
– decentralized design
– interactive and batch
• Advanced the state of the art
– controlled experiments with priced and unpriced clusters
– analysis of utility gain relative to traditional resource allocation algorithms
• Picked up in several other areas
– index – pricing internet bandwidth
– iceberg – pricing in telco/internet merge
– core to internet design for planetary scale services
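The slides do not spell out the allocation mechanism. As a rough illustration of what economic-based allocation can look like, here is a proportional-share rule in which each user receives a share of the processors proportional to a bid. This is a stand-in technique chosen for brevity, not the Millennium design; the user names, bids, and function name are all hypothetical.

```python
def proportional_share(bids, capacity):
    """Split `capacity` among users in proportion to their bids.

    bids: dict mapping user -> non-negative bid (for example, credits per minute).
    Returns a dict mapping user -> allocated share of capacity.
    """
    total = sum(bids.values())
    if total <= 0:
        # Nobody is bidding, so nobody is allocated anything.
        return {user: 0.0 for user in bids}
    return {user: capacity * bid / total for user, bid in bids.items()}


# Hypothetical example: three users bidding for the 336 central-cluster processors.
bids = {"alice": 6.0, "bob": 3.0, "carol": 1.0}
shares = proportional_share(bids, capacity=336)
for user, cpus in shares.items():
    print(f"{user}: {cpus:.1f} processors")
```

Under a rule like this, a user who needs interactive response can raise a bid to gain share, while batch work can run on a low standing bid.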
Millennium 9
Emergence of Planetary-Scale Services
• In the past year Millennium became THE simulation engine for P2P
– OceanStore, I^3, Sahara, BGP alternatives, PIER
• Ganglia was the technical enabler for PlanetLab
– > 100 machines at > 50 sites in > 8 countries
– THE testbed for internet-scale systems research
Millennium 10
Fundamental Bottleneck: Storage
• Current storage hierarchy
– based on NPACI reference
– 3 TB local /scratch and /net/MMxx/scratch 4-day deletion
– 0.5 TB global NFS /work 9-day deletion
» inadequate BW and capacity
– ~4 TB /home and /project
» uniform naming through automount
» doesn’t scale to cluster access
• => augment capacity, BW, and metadata BW
• we’ve been tracking cluster storage options since xFS on NOW and Tertiary Disk in 1995.
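The 4-day and 9-day deletion rules amount to an age-based sweep of the scratch areas. The slides do not describe the actual tool, so the following is only a sketch of the idea: `purge_older_than`, its dry-run flag, and the `POLICY` table are hypothetical, and file modification time is assumed to be the measure of age.

```python
import os
import time


def purge_older_than(root, max_age_days, now=None, dry_run=True):
    """Return files under `root` whose mtime is older than `max_age_days`.

    With dry_run=False the files are also deleted. Hypothetical helper,
    not the Millennium cleanup tool.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    expired.append(path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return sorted(expired)


# Retention table mirroring the slide: path -> days before deletion.
POLICY = {"/scratch": 4, "/work": 9}
```

A nightly job could then call `purge_older_than(path, days)` for each entry in `POLICY`, first in dry-run mode to report what would be removed.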
Millennium 11
Another Cluster – a storage cluster
[Figure: Millennium Clusters, Citris Clusters, Massive Storage Clusters, Scalable GigE Core, Myrinet SAN]
• Designed for higher reliability
• Avoid competition from on-going computation
• Local disks heavily used as scratch
Millennium 12
Initial Cluster Design with 3.5TB Distributed File Store
[Figure: cluster diagram]
• 128 Dual Itanium 2 compute nodes: 1 TFlop, 1.6 TB memory
• 2 frontend nodes
• 4 storage controllers, 2 metaservers, 3.5 TB Fibre Channel storage
• Interconnects: Myrinet 2000, 1 Gigabit Ethernet (Foundry 8000 and Foundry 1500 switches, campus core), Fibre Channel
Millennium 13
Initial 3.5 TB Cluster Data Store
[Figure: 4 storage controllers of 864 GB each (36 GB 15K rpm disks) and 2 meta servers; links are fibre channel, gbit ethernet, and myrinet]
• BlueArc si8300 with 24 36GB 15K rpm disks and growth room
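The 3.5 TB figure follows from the disk counts on this slide. A quick check, assuming each of the four storage controllers holds the 24 disks listed for the BlueArc unit and that capacities are decimal gigabytes (the slide states neither explicitly):

```python
# Assumed layout: 4 controllers, each with 24 disks of 36 GB (decimal units).
DISKS_PER_CONTROLLER = 24
DISK_GB = 36
CONTROLLERS = 4

per_controller_gb = DISKS_PER_CONTROLLER * DISK_GB
total_gb = CONTROLLERS * per_controller_gb
total_tb = total_gb / 1000

print(f"per controller: {per_controller_gb} GB")
print(f"total: {total_gb} GB = {total_tb:.2f} TB")
```

The per-controller result matches the 864 GB label in the figure, and the total rounds to the quoted 3.5 TB.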
Millennium 14
Lustre: A High-Performance, Scalable, Distributed File System for Clusters and Shared-Data Environments
• Progress since xFS
– TruCluster, GPFS, PVFS, ...
– need “production quality”
– NAS is finally here
• History: CMU, Seagate, Los Alamos, Sandia, TriLabs
• Distributed Filesystem replacing NFS
• Object based file storage
– an object, like an inode, represents a file
• Open source development managed by Cluster File Systems, Inc.
• Gaining wide acceptance for production high-performance computing
– PNNL and LLNL
– Los Alamos and Sandia Labs
– HP support as part of Linux cluster effort
– Intel Enterprise Architecture Lab
Millennium 15
Lustre: Key Advantages
• Open protocols, standards: Portals API, XML, LDAP
• Runs on commodity PC hardware + 3rd party OST
– such as BlueArc
• Uses commodity filesystems on OSTs
– such as ext3, JFS, ReiserFS and XFS
• Scalable and efficient design splits
– (qty 2) Metadata servers: storing file system metadata
– (up to 100) Object storage targets: storing files
– To support up to 2000+ clients
• Flexible model for adding new storage to existing Lustre file system
• Metadata server failover
Millennium 16
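Lustre: Functionality
[Figure: three components and the traffic between them]
• Clients and Meta Servers (Meta Data Servers): directory metadata and concurrency
• Clients and Storage Controllers (Object Storage Targets): system and parallel file I/O, file locking
• Meta Servers and Storage Controllers: recovery, file status, file creation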
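The division of labor can be mimicked with a toy in-memory model: a metadata server that records which objects make up a file, and storage targets that hold the object data. This is not the Lustre API or its on-disk layout. The class names, the round-robin placement, the tiny stripe size, and the one-object-per-chunk simplification are all assumptions made for illustration.

```python
class ObjectStorageTarget:
    """Holds object data keyed by object id (toy stand-in for an OST)."""

    def __init__(self):
        self.objects = {}


class MetadataServer:
    """Maps file names to the (ost_index, object_id) pairs that store them."""

    def __init__(self):
        self.layouts = {}
        self.next_id = 0

    def create(self, name, n_chunks, n_osts):
        # Place consecutive objects on consecutive targets, round-robin.
        layout = [((self.next_id + i) % n_osts, self.next_id + i)
                  for i in range(n_chunks)]
        self.next_id += n_chunks
        self.layouts[name] = layout
        return layout


def write_file(mds, osts, name, data, stripe_size=4):
    """Ask the metadata server for a layout, then send each chunk to its target."""
    chunks = [data[i:i + stripe_size]
              for i in range(0, len(data), stripe_size)] or [b""]
    layout = mds.create(name, len(chunks), len(osts))
    for (ost, oid), chunk in zip(layout, chunks):
        osts[ost].objects[oid] = chunk


def read_file(mds, osts, name):
    """Look up the layout once, then fetch the objects from the targets."""
    return b"".join(osts[ost].objects[oid] for ost, oid in mds.layouts[name])


mds = MetadataServer()
osts = [ObjectStorageTarget() for _ in range(4)]
write_file(mds, osts, "results.dat", b"simulation output")
print(read_file(mds, osts, "results.dat"))
```

The point of the split is visible even in the toy: the metadata server is consulted once per file, and the bulk data moves between the client and the storage targets without passing through it.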
Millennium 17
Growth Plan
• based on conservative 50% per year density
– expect roughly double

Year  Capacity  SS  MS
y03   3.5 TB    4   2
y04   8 TB      6   3
y05   14 TB     8   3
y06   23 TB     8   3
y07   35 TB     8   3
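One way to read the plan is to compare it with pure 50% per year compounding from the y03 starting point. The sketch below does that arithmetic. The capacities come from the plan above, while the `compound` helper and the comparison itself are illustrative additions.

```python
# Planned capacities in TB, taken from the growth plan above.
plan_tb = {"y03": 3.5, "y04": 8.0, "y05": 14.0, "y06": 23.0, "y07": 35.0}


def compound(start_tb, rate, years):
    """Capacity after `years` of growth at `rate` per year (hypothetical helper)."""
    return start_tb * (1 + rate) ** years


labels = list(plan_tb)
for i, label in enumerate(labels):
    density_only = compound(plan_tb["y03"], 0.5, i)
    if i == 0:
        note = "baseline"
    else:
        ratio = plan_tb[label] / plan_tb[labels[i - 1]]
        note = f"{ratio:.2f}x over prior year"
    print(f"{label}: plan {plan_tb[label]:5.1f} TB, "
          f"50%/yr alone {density_only:5.1f} TB, {note}")
```

Density growth alone would give about 17.7 TB by y07. The plan reaches 35 TB, so the remainder presumably comes from the added SS units in y04 and y05.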
Millennium 18
Example Projects
• Cluster monitoring trace
– ¼ TB per year for 300 nodes
• ROC failure data
– ¼ TB per year, much higher if we get industrial feeds
• Digital Library
• Video
– 100 GB/hour uncompressed
• Vision
– 100 GB per experiment
• PlanetLab
– internet wide instrumentation and logging
We will look back and say, “we are doing research today that we could not have done without this”
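To put the project volumes next to the initial 3.5 TB store, the rates listed above can be converted to common units. A rough sketch, assuming decimal units (1 TB = 1000 GB) and a 365-day year; the variable names are illustrative.

```python
STORE_GB = 3500                 # initial 3.5 TB data store
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 365 * 24

# Uncompressed video: 100 GB/hour.
video_mb_per_s = 100 * 1000 / SECONDS_PER_HOUR
video_hours_to_fill = STORE_GB / 100

# Cluster monitoring trace: 1/4 TB per year across 300 nodes.
trace_gb_per_node_year = 250 / 300
trace_kb_per_node_s = (trace_gb_per_node_year * 1e6
                       / (HOURS_PER_YEAR * SECONDS_PER_HOUR))

# Vision: 100 GB per experiment.
vision_experiments = STORE_GB // 100

print(f"video: {video_mb_per_s:.1f} MB/s sustained, "
      f"fills the store in {video_hours_to_fill:.0f} hours")
print(f"trace: {trace_gb_per_node_year:.2f} GB per node per year, "
      f"{trace_kb_per_node_s:.3f} KB/s per node")
print(f"vision: store holds {vision_experiments} experiments")
```

Under these assumptions, video and vision workloads consume the store in units of tens of hours or tens of experiments, while monitoring traces are small per node and add up only across the cluster and over years.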
Millennium 19
End of the Tape Era
Massive Cheap Storage (Aug 1999, NSF RI 99)
• Basic unit: 2 PCs double-ending four SCSI chains
• Currently serving Fine Art at http://www.thinker.org/imagebase/
[Figure: log $/GB versus year for disk and tape, with 2001 marked]
Millennium 20
Emergence of the Sensor Net Era
• 100s of research groups and companies using the Berkeley Mote / TinyOS platform
• dozens of projects on campus
• billions of networked devices connected to the physical world – constantly streaming data
• => start building the storage and processing infrastructure for this new class of system today!
Millennium 21
Environment Monitoring Experience
• Canonical “patch” net architecture
• live & historical readings at www.greatduckisland.net
• 43 nodes, 7/13-11/18
• above and below ground
• light, temperature, relative humidity, and occupancy data, at 1 minute resolution
• >1 million measurements
– Best nodes ~90,000
• 3 major maintenance events
• node design and packaging in harsh environment
– -20 to 100 degrees, rain, wind
• power mgmt and interplay with sensors and environment
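The deployment numbers give a feel for per-node yield. The sketch below assumes one reading per node per minute over the whole 7/13 to 11/18 window, which is an upper bound on what a node could report. The year 2002 is also an assumption; the day count is the same in any year.

```python
from datetime import date

NODES = 43
start, end = date(2002, 7, 13), date(2002, 11, 18)  # year assumed

days = (end - start).days
max_per_node = days * 24 * 60        # one reading per minute
max_total = NODES * max_per_node

best_node = 90_000                   # "Best nodes ~90,000"
print(f"deployment window: {days} days")
print(f"upper bound per node: {max_per_node:,} readings")
print(f"upper bound for {NODES} nodes: {max_total:,} readings")
print(f"best node yield: {best_node / max_per_node:.0%} of the upper bound")
```

Read this way, even the best nodes delivered roughly half of the theoretical maximum, which is consistent with the slide's emphasis on maintenance events, packaging, and power management.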
[Figure: patch network architecture. Labels: Sensor Node, Sensor Patch, Patch Network, Gateway, Transit Network, Basestation, Base-Remote Link, Internet, Data Service, Client Data Browsing and Processing]
Millennium 22
Sample Results
[Figure panels: Node Lifetime and Utility; Effective Communication Phase; Packet Loss; Correlation]