Transcript of “SimMillennium and Beyond” – David Culler, NSF Site Visit, March 5, 2003

Page 1

SimMillennium and Beyond

From “Computer Systems, Computational Science and Engineering in the Large” to “petabyte stores”

David Culler

NSF Site Visit, March 5, 2003

Page 2

SimMillennium Project Goals

• Vision: To work, think, and study in a computationally rich environment with deep information stores and powerful services

• Enable major advances in Computational Science and Engineering

– Simulation, Modeling, and Information Processing becoming ubiquitous

• Explore novel design techniques for large, complex systems

– Fundamental Computer Science problems ahead are problems of scale

– Organized in concert with Univ. structure => computational economy

• Develop fundamentally better ways of assimilating and interacting with large volumes of information and with each other

• Explore emerging technologies

– networking, OS, devices

Page 3

Research Infrastructure We Built

• Cluster of Clusters (CLUMPS) distributed over multiple departments

– Gigabit Ethernet within and between clusters

– Myrinet high-speed interconnect

• Vineyard Cluster System Architecture

– Rootstock remote cluster installation tools

– Ganglia remote cluster monitoring

– GEXEC remote execution, GM (Myricom) messaging, MPI

– PCP – parallel file tools

– collection of port daemons and tools to make it all hang together

• Gigabit to desktop, ImmersaDesk, ...
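To make the monitoring piece concrete: gmond, the Ganglia monitoring daemon, answers any TCP connection on its default port 8649 with an XML dump of cluster state. The sketch below polls that interface; the host name and the load_one metric are standard Ganglia defaults, but treat this as a minimal illustrative sketch, not project code.

```python
# Minimal sketch: polling a Ganglia gmond daemon for cluster state.
# By default gmond dumps its view of the cluster as XML to any client
# that connects on TCP port 8649; host and metric names are assumptions.
import socket
import xml.etree.ElementTree as ET

def poll_gmond(host="localhost", port=8649):
    """Read the complete XML dump from gmond and return the parsed tree."""
    chunks = []
    with socket.create_connection((host, port)) as sock:
        while True:
            data = sock.recv(4096)
            if not data:          # gmond closes the connection when done
                break
            chunks.append(data)
    return ET.fromstring(b"".join(chunks))

if __name__ == "__main__":
    root = poll_gmond()
    for host in root.iter("HOST"):
        # Each HOST element carries METRIC children (load, memory, ...).
        metric = host.find("METRIC[@NAME='load_one']")
        print(host.get("NAME"), metric.get("VAL") if metric is not None else "n/a")
```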

Page 4

[Figure: Millennium Gen3 Network Topology, “In a perfect world ....” – Millennium | Clustered Computing Research Group | University of California, Berkeley | 15 Nov 02. A Gigabit backbone (Foundry BigIron 8k and 4k, FastIron 1500, and Cisco 6500 switches; soda-bb and evans-bb) connects the Millennium, OceanStore, CITRIS, and Administrative clusters in Soda Hall, the Astrophysics cluster in Campbell, the Math cluster in Evans, the EECS cluster in Cory, GigE desktops in Soda, and the NOW, NPACI Rocks, PlanetLab, CITRIS Pilot, and future clusters to the UCB campus core. Gateways (mil-gw, eecs-gw, math-gw, astro-gw, ocean-gw 1/2, citris-gw, admin-gw) separate networks under Millennium management from those not under it. Key: primary and secondary links; all links are 1000 Mbps. The Millennium backbone provides plug-and-play support for future WAN/CITRIS connectivity: 10 GigE LAN/WAN PHY and OC-3/12/48 POS.]

Page 5

Cluster Counts

• Millennium Central Cluster

– 99 Dell 2300/6400/6450 Xeon Dual/Quad: 336 processors

– Total: 238 GB memory, 2 TB disk

– Myrinet 2000 + 1000 Mb fiber Ethernet

• Millennium Campus Clusters (Astro, Math, CE, EE, Physics, Bio)

– 176 proc, 34 GB mem, 1.2 TB local disk

– Total: 512 proc, 292 GB mem, 3.2 TB scratch

• NPACI ROCKS Cluster

– 8 proc, 2 GB mem, 36 GB disk

• OceanStore/ROC Cluster

• PlanetLab Cluster

– 6 proc at 1.32 GHz, 3 GB mem, 180 GB disk

• CITRIS Cluster 1: 3/2002 deployment (Intel donation)

– 4 Dell Precision 730 Itanium Duals: 8 processors

– Total: 8 GB memory, 128 GB disk

– Myrinet 2000 + 1000 Mb copper Ethernet (SimMil)

• CITRIS Cluster 2 deployment (Intel donation)

– ~128 Dell McKinley-class Duals: 256 processors (16x2 installed so far)

– Total: ~512 GB memory, ~8 TB disk

– Myrinet 2000 + 1000 Mb copper Ethernet (SimMil)

• Many phasing out

– NOW, Ninja, Dig Lab, ...

Page 6

Cluster Top Users 2/2003

• ~800 users total on central cluster

• 84 major users for 2/2003: average 62% total CPU utilization

– ROC – middle-tier storage layer testing/performance (bling, ach, fox@stanford)

– Computer Vision Group – image recognition, boundary detection and segmentation, data mining (aberg, lwalk, dmartin, ryanw, xren): “2 hours on cluster vs. 2 weeks on local resources”

– Computational Biology Lab – large-scale biological sequence database searches in parallel (brenner@compbio)

– Tempest – TCAD tools for Next Generation Lithography (yunfei)

– Internet services – performance characteristics of multithreaded servers (jrvb, jcondit)

– Sensor networks – power reduction (vwen)

– Economic modeling (stanton@haas)

– Machine learning – information retrieval, text processing (blei)

– Analyzing trends in BGP routing tables (sagarwal, mccaesar)

– Graphics – optical simulation and high-quality rendering (adamb, csh)

– Digital Library Project – image retrieval by image content (loretta)

– Bottleneck analysis of fine-grain parallelism (bfields)

– SPUR – earthquake simulation (jspark@ce)

– Titanium – compiler and runtime system design for high-performance parallel programming languages (bonachea)

– AMANDA – neutrino detection from polar ice core samples (amanda)

http://ganglia.millennium.berkeley.edu

Page 7

Impact

• Numerous groups doing research they could not have done without it

– Malik: photorealistic rendering, physics simulation, ...

– Yelick: Titanium, heart modeling, ...

– Wilensky: Digital Library, image segmentation

– Brewer, Culler: Ninja Internet service architecture, ...

– Price: AMANDA, ...

– Kubiatowicz: OceanStore; Katz: Sahara; Hellerstein: PIER

• First eScience portals

– Tempest (EUV lithography) and Sugar (MEMS) simulation services

• safe.millennium.berkeley.edu on Sept 11– built w/i hours, scaled to million hits per day

• CS267 – core of MS of computational science X

• Cluster tools widely adopted

– NPACI ROCKS

– Ganglia: the most downloaded cluster tool; in all the distributions, including OSCAR; open-source development team

Page 8

Computational Economy

• Developed economic-based resource allocation

– decentralized design

– interactive and batch

• Advanced the state of the art

– controlled experiments with priced and unpriced clusters

– analysis of utility gain relative to traditional resource allocation algorithms

• Picked up in several other areas

– INDEX – pricing Internet bandwidth

– ICEBERG – pricing in the telco/Internet convergence

– core to Internet design for planetary-scale services
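To make the idea concrete, here is a minimal sketch of bid-proportional resource allocation, one simple form of economics-based scheduling; the pricing rule and the numbers are illustrative assumptions, not the project's actual mechanism.

```python
# Minimal sketch of bid-proportional resource allocation: each user's
# share of the machine for an interval is proportional to their bid.

def allocate(bids, capacity):
    """Split `capacity` CPU shares among users in proportion to their bids."""
    total = sum(bids.values())
    if total == 0:
        return {user: 0.0 for user in bids}
    return {user: capacity * bid / total for user, bid in bids.items()}

# Example: a batch simulation outbids interactive users for this interval,
# so it receives the bulk of the 336 central-cluster processors.
print(allocate({"simulation": 80, "interactive": 20}, capacity=336))
# -> {'simulation': 268.8, 'interactive': 67.2}
```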

Page 9

Emergence of Planetary-Scale Services

• In the past year, Millennium became THE simulation engine for P2P

– OceanStore, i3, Sahara, BGP alternatives, PIER

• Ganglia was the technical enabler for PlanetLab

– >100 machines at >50 sites in >8 countries

– THE testbed for internet-scale systems research

Page 10

Fundamental Bottleneck: Storage

• Current storage hierarchy

– based on NPACI reference

– 3 TB local /scratch and /net/MMxx/scratch, 4-day deletion

– 0.5 TB global NFS /work, 9-day deletion

» inadequate BW and capacity

– ~4 TB /home and /project

» uniform naming through automount

» doesn’t scale to cluster access

• => augment capacity, BW, and metadata BW

• We have been tracking cluster storage options since xFS on NOW and Tertiary Disk in 1995.
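For concreteness, the deletion policies above amount to a periodic sweep of the sort sketched below; the path, the 4-day threshold, and the use of access time are assumptions for illustration, not the actual purge script.

```python
# Illustrative sketch of a scratch-purge policy like the "4-day deletion"
# on /scratch described above. Runs dry by default; prints candidates.
import os
import time

def purge(root="/scratch", max_age_days=4, dry_run=True):
    """Delete files under `root` not accessed for `max_age_days` days."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    print("purge:", path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                pass  # file vanished or unreadable; skip it

if __name__ == "__main__":
    purge(dry_run=True)
```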

Page 11

Another Cluster – a storage cluster

[Figure: the Millennium and CITRIS clusters reach a massive storage cluster through a scalable GigE core; the storage cluster is internally connected by a Myrinet SAN.]

• Designed for higher reliability

• Avoid competition from ongoing computation

• Local disks heavily used as scratch

Page 12

[Figure: Initial cluster design with 3.5 TB distributed file store. 128 dual Itanium 2 compute nodes (1 TFlop, 1.6 TB memory) and 2 frontend nodes connect over Myrinet 2000 and Gigabit Ethernet (Foundry 8000 and 1500 switches, uplinked to the campus core) to 4 storage controllers and 2 metaservers, which attach to 3.5 TB of Fibre Channel storage. Link key: 1 Gigabit Ethernet, Myrinet, Fibre Channel.]

Page 13

[Figure: Initial 3.5 TB cluster data store. Four storage controllers of 864 GB each (BlueArc Si8300 with 24 x 36 GB 15K-rpm disks and growth room) attach over Fibre Channel, alongside two meta servers; Gigabit Ethernet and Myrinet connect the store to the cluster.]

Page 14

Lustre: A High-Performance, Scalable, Distributed File System for Clusters and Shared-Data Environments

• Progress since xFS

– TruCluster, GPFS, PVFS, ...

– need “production quality”

– NAS is finally here

• History: CMU, Seagate, Los Alamos, Sandia, TriLabs

• Distributed Filesystem replacing NFS

• Object-based file storage

– an inode-like object represents a file

• Open-source development managed by Cluster File Systems, Inc.

• Gaining wide acceptance for production high-performance computing

– PNNL and LLNL

– Los Alamos and Sandia Labs

– HP support as part of Linux cluster effort

– Intel Enterprise Architecture Lab

Page 15

Lustre: Key Advantages

• Open protocols, standards: Portals API, XML, LDAP

• Runs on commodity PC hardware + 3rd-party OSTs

– such as BlueArc

• Uses commodity file systems on OSTs

– such as ext3, JFS, ReiserFS, and XFS

• Scalable and efficient design splits

– (qty 2) metadata servers: storing file system metadata

– (up to 100) Object storage targets: storing files

– supporting 2000+ clients

• Flexible model for adding new storage to an existing Lustre file system

• Metadata server failover

Page 16

Lustre: Functionality

[Figure: clients exchange directory metadata and concurrency control with the meta servers (metadata servers), perform system and parallel file I/O and file locking against the storage controllers (object storage targets), and the meta servers coordinate recovery, file status, and file creation with the storage controllers.]
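The division of labor in the figure can be summarized in a toy model: clients touch the metadata servers only to learn a file's layout, then move data directly against the object storage targets. Everything below (class names, striping rule) is an illustrative assumption, not Lustre's actual API.

```python
# Toy model of the Lustre split pictured above: one metadata round trip,
# then direct data transfer between client and object storage targets.

class MetadataServer:
    """Tracks which objects, on which OSTs, make up each file."""
    def __init__(self):
        self.layout = {}  # filename -> list of (ost_index, object_id)

    def create(self, name, num_osts, stripes=2):
        self.layout[name] = [(i % num_osts, f"{name}.{i}") for i in range(stripes)]
        return self.layout[name]

class ObjectStorageTarget:
    """Stores opaque objects; knows nothing about directories."""
    def __init__(self):
        self.objects = {}

    def write(self, obj_id, data):
        self.objects[obj_id] = data

    def read(self, obj_id):
        return self.objects[obj_id]

# Client path: ask the MDS for the layout, then stripe data across OSTs.
osts = [ObjectStorageTarget() for _ in range(4)]
mds = MetadataServer()
layout = mds.create("result.dat", num_osts=len(osts))
for (ost_idx, obj_id), chunk in zip(layout, [b"hello ", b"world"]):
    osts[ost_idx].write(obj_id, chunk)
print(b"".join(osts[ost_idx].read(obj_id) for ost_idx, obj_id in layout))
```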

Page 17

Growth Plan

• Based on a conservative 50%-per-year density improvement

– expect capacity to roughly double each year

Year  Capacity  Storage Servers  Meta Servers
y03   3.5 TB    4 SS             2 MS
y04   8 TB      6 SS             3 MS
y05   14 TB     8 SS             3 MS
y06   23 TB     8 SS             3 MS
y07   35 TB     8 SS             3 MS
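A quick back-of-envelope check of this plan: compounding 50% annual density growth and scaling by the number of storage servers reproduces the table to within a couple of TB.

```python
# Back-of-envelope check of the growth plan above, assuming the stated
# ~50%/year disk density improvement compounds and capacity scales with
# the number of storage servers (SS). Starting point: 3.5 TB on 4 SS.
base_tb, base_ss = 3.5, 4
servers = {"y03": 4, "y04": 6, "y05": 8, "y06": 8, "y07": 8}
for i, (year, ss) in enumerate(servers.items()):
    projected = base_tb * (1.5 ** i) * ss / base_ss
    print(f"{year}: ~{projected:.1f} TB")
# Prints ~3.5, ~7.9, ~15.8, ~23.6, ~35.4 TB: close to the planned
# 3.5/8/14/23/35 TB, with y05 slightly conservative in the table.
```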

Page 18

Example Projects

• Cluster monitoring trace

– ¼ TB per year for 300 nodes

• ROC failure data

– ¼ TB per year; much higher with industrial feeds

• Digital Library

• Video

– 100 GB/hour uncompressed

• Vision

– 100 GB per experiment

• PlanetLab

– Internet-wide instrumentation and logging
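Back-of-envelope rates implied by the figures above: the totals are from the slide, and the derived per-unit rates are simple arithmetic.

```python
# Derived rates for two of the data sources listed above. The totals
# (1/4 TB/year for 300 nodes; 100 GB/hour video) are from the slide.
SECONDS_PER_YEAR = 365 * 24 * 3600

monitor_rate = 0.25e12 / 300 / SECONDS_PER_YEAR  # bytes/s per monitored node
video_rate = 100e9 / 3600                        # bytes/s of uncompressed video
print(f"monitoring: ~{monitor_rate:.0f} B/s per node")  # ~26 B/s
print(f"video: ~{video_rate / 1e6:.0f} MB/s")           # ~28 MB/s
```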

We will look back and say, “we are doing research today that we could not have done without this.”

Page 19

End of the Tape Era

(Aug 1999, NSF RI 99, slide 18)

Massive Cheap Storage

• Basic unit: 2 PCs double-ending four SCSI chains

• Currently serving fine art at http://www.thinker.org/imagebase/

[Plot: log $/GB vs. year for disk and tape – the disk cost curve drops below tape around 2001.]

Page 20

Emergence of the Sensor Net Era

• 100s of research groups and companies using the Berkeley Mote / TinyOS platform

• dozens of projects on campus

• billions of networked devices connected to the physical world – constantly streaming data

• => start building the storage and processing infrastructure for this new class of system today!

Page 21

Environment Monitoring Experience

• Canonical “patch” net architecture

• Live and historical readings at www.greatduckisland.net

• 43 nodes, 7/13-11/18

• above and below ground

• light, temperature, relative humidity, and occupancy data, at 1 minute resolution

• >1 million measurements

– best nodes ~90,000

• 3 major maintenance events

• node design and packaging in harsh environment

– −20 to 100 degrees, rain, wind

• power mgmt and interplay with sensors and environment

[Figure: canonical patch architecture – sensor nodes in the sensor patch form a patch network to a gateway; a transit network connects the gateway to the basestation; the base-remote link carries readings across the Internet to the data service for client data browsing and processing.]
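A quick sanity check on these deployment numbers: 7/13 to 11/18 is about 128 days, so a node sampling once a minute could produce at most roughly 184,000 readings, which makes the best nodes' ~90,000 about half of the ideal.

```python
# Sanity check on the deployment figures above: 7/13 through 11/18 is
# about 128 days, and the nodes sampled at 1-minute resolution.
days = 128
per_node = days * 24 * 60      # ideal readings per node: 184,320
network = 43 * per_node        # ideal network-wide total: ~7.9 million
print(per_node, network)
# The best nodes' ~90,000 readings are roughly half the per-node ideal;
# the >1 million total reflects node and network losses over the run.
```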

Page 22

Sample Results: Node Lifetime and Utility

[Charts: effective communication phase; packet loss; correlation.]