Cyberinfrastructure and Applications Overview: Howard University, June 22


Description

Presentation at the High Performance Computing and Cyberinfrastructure (CI) Campus Bridging Workshop at Howard University, June 22, 2009.

Transcript of Cyberinfrastructure and Applications Overview: Howard University, June 22

Page 1

Overview of Cyberinfrastructure and the Breadth of Its Application

Geoffrey Fox, Computer Science, Informatics, Physics

Chair, Informatics Department; Director, Community Grids Laboratory and Digital Science Center

Indiana University, Bloomington IN 47404 (Presenter: Marlon Pierce)

[email protected]

http://www.infomall.org

[email protected]

Page 2

Evolution of Scientific Computing, 1985-2010

Figure: a timeline running from Parallel Computing through Grids and Federated Computing, Scientific Enterprise Computing, Scientific Web 2.0, and Cloud Computing, and back to Parallel Computing. The x-axis is Time; "Y-Axis is whatever you want it to be." Caption: "Evidence of Intelligent Design?"

Page 3

What is High Performance Computing?

The meaning of this was clear 20 years ago when we were planning/starting the HPCC (High Performance Computing and Communication) Initiative. It meant parallel computing, and HPCC lasted for 10 years.
• As an outgrowth of this, NSF started funding supercomputer centers, and we debated vector versus “massively parallel” systems. Data did not exist ….
• TeraGrid is the current incarnation.
NSF subsequently established the Office of Cyberinfrastructure.
• Comprehensive approach to physical infrastructure.
Complementary NSF concept: “Computational Thinking”.
• Everyone needs cyberinfrastructure.
The core idea is always connecting resources through messages: MPI, JMS, XML, Twitter, etc.
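To make the message idea concrete, here is a minimal sketch of two processes exchanging a message, assuming the mpi4py binding to MPI is installed (run with, e.g., `mpiexec -n 2 python hello_mpi.py`; the script name is arbitrary):

```python
from mpi4py import MPI  # assumed dependency: the mpi4py MPI binding

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # this process's id within the communicator

if rank == 0:
    # Resource 0 sends a message to resource 1.
    comm.send({"greeting": "hello from rank 0"}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print("rank 1 received:", data)
```

The same connect-by-messages shape recurs whether the transport is MPI, JMS, XML over HTTP, or a Twitter feed.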


Page 4

TeraGrid High Performance Computing Systems, 2007-8

Figure: computational resources (size approximate, not to scale) at SDSC, TACC, NCSA, ORNL, PU, IU, PSC, NCAR, Tennessee, LONI/LSU, and UC/ANL; annotations mark 504 TF, growing to ~1 PF in 2008. Slide courtesy of Tommy Minyard, TACC.

Page 5

• Resources for many disciplines!

• > 120,000 processors in aggregate

• Resource availability grew during 2008 at unprecedented rates

Page 6

Large Hadron Collider, CERN, Geneva: 2008 Start

27 km tunnel in Switzerland & France. pp collisions at √s = 14 TeV, L = 10³⁴ cm⁻² s⁻¹.

Experiments: ATLAS and CMS (pp, general purpose; heavy ions), ALICE (heavy ions), LHCb (B-physics), TOTEM.

Physics goals: Higgs, SUSY, Extra Dimensions, CP Violation, QG Plasma, … the Unexpected.

5000+ Physicists, 250+ Institutes, 60+ Countries.

Challenges: analyze petabytes of complex data cooperatively; harness global computing, data, and network resources.

Page 7

Linked Environments for Atmospheric Discovery: Grid services, triggered by abnormal events and controlled by workflow, process real-time data from radar and high-resolution simulations for tornado forecasts.

Figure: a typical graphical interface to service composition.
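As a toy illustration of this event-triggered pattern, the sketch below fires a chain of services when an abnormal reading arrives. The threshold, data, and service steps are hypothetical stand-ins, not the actual LEAD components:

```python
# Hypothetical stand-ins for services chained by a workflow engine.
def ingest(reading):
    return {"reflectivity": reading}          # pull the triggering observation

def simulate(obs):
    obs["forecast_index"] = obs["reflectivity"] * 2.0  # stand-in model run
    return obs

def forecast(obs):
    return f"tornado risk index: {obs['forecast_index']:.1f}"

def on_observation(reading, threshold=3.0):
    """Trigger the workflow only when the event is abnormal."""
    if reading <= threshold:
        return None                           # normal data: no workflow fires
    result = reading
    for service in (ingest, simulate, forecast):
        result = service(result)
    return result

print(on_observation(4.2))   # abnormal reading triggers the chain
print(on_observation(1.0))   # normal reading is ignored
```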

Page 8

CYBERINFRASTRUCTURE CENTER FOR POLAR SCIENCE (CICPS)

Page 9

Environmental Monitoring Cyberinfrastructure at Clemson

Page 10

Page 11

Forces on Cyberinfrastructure: Clouds, Multicore, and Web 2.0

Page 12

Gartner 2008 Technology Hype Curve

Clouds, Microblogs, and Green IT appear; Basic Web Services, Wikis, and SOA are becoming mainstream.

Page 13

Gartner’s 2005 Hype Curve

Page 14

Relevance of Web 2.0

Web 2.0 can help e-Research in many ways:
• Its tools (web sites) can enhance scientific collaboration, i.e. effectively support virtual organizations, in different ways from grids.
• The popularity of Web 2.0 can provide high-quality technologies and software that (due to large commercial investment) can be very useful in e-Research and preferable to complex Grid or Web Service solutions.
• The usability and participatory nature of Web 2.0 can bring science and its informatics to a broader audience.
• Cyberinfrastructure is the research analogue of major commercial initiatives, e.g. leading to important job opportunities for students!

Page 15

Enterprise Approach | Web 2.0 Approach
JSR 168 Portlets | Google Gadgets, Widgets, badges
Server-side integration and processing | AJAX, client-side integration and processing, JavaScript
SOAP | RSS, Atom, JSON
WSDL | REST (GET, PUT, DELETE, POST)
Portlet Containers | Open Social Containers (Orkut, LinkedIn, Shindig); Facebook; StartPages
User-Centric Gateways | Social Networking Portals
Workflow managers (Taverna, Kepler, XBaya, etc.) | Mash-ups
WS-Eventing, WS-Notification, Enterprise Messaging | Blogging and Micro-blogging with REST, RSS/Atom, and JSON messages (Blogger, Twitter)
Semantic Web: RDF, OWL, ontologies | Microformats, folksonomies
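As a taste of the Web 2.0 column: a REST call is just an HTTP GET returning JSON, with no SOAP envelope or WSDL contract. A minimal sketch; the endpoint URL and response field are hypothetical, for illustration only:

```python
import json
import urllib.request

# REST-style GET returning JSON (contrast with SOAP's XML envelopes).
# The endpoint is a hypothetical placeholder, not a real service.
url = "https://api.example.org/experiments/42"
with urllib.request.urlopen(url) as resp:
    record = json.load(resp)   # parse the JSON body
print(record.get("status"))    # e.g., "running"
```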

Page 16

Cloud Computing: Infrastructure and Runtimes

Cloud infrastructure: outsourcing of servers, computing, data, file space, etc.
• Handled through Web services that control virtual machine lifecycles.
Cloud runtimes: tools for using clouds to do data-parallel computations.
• Apache Hadoop, Google MapReduce, Microsoft Dryad, and others.
• Designed for information retrieval but excellent for a wide range of machine learning and science applications (e.g. Apache Mahout).
• Also may be a good match for the 32-128 core computers available in the next 5 years.
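The programming model behind these runtimes fits in a few lines. This is a toy, single-process sketch of the map and reduce phases that Hadoop or Dryad would execute at scale over partitioned files; the word-count task is the standard textbook illustration, not anything from the slides:

```python
from collections import defaultdict

def map_phase(fragment):
    """Emit (key, 1) pairs for each word in one data fragment."""
    return [(word, 1) for word in fragment.split()]

def reduce_phase(pairs):
    """Sum the values for each key emitted by the map phase."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Fragments stand in for blocks of a large file spread across a cloud.
fragments = ["the cloud runs the job", "the runtime restarts failed tasks"]
pairs = [pair for frag in fragments for pair in map_phase(frag)]
print(reduce_phase(pairs))   # {'the': 3, 'cloud': 1, ...}
```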

Page 17

Some Commercial Clouds

Cloud/Service | Amazon | Microsoft Azure | Google (and Apache)
Data | S3, EBS, SimpleDB | Blob, Table, SQL Services | GFS, BigTable
Computing | EC2, Elastic MapReduce (runs Hadoop) | Compute Service | MapReduce (not public, but Hadoop)
Service Hosting | Amazon Load Balancing | Web Hosting Service | AppEngine/AppDrop

Bold-faced entries have open source equivalents.

Page 18

Clouds as Cost-Effective Data Centers

Exploit the Internet by building giant data centers with hundreds of thousands of computers, packed ~200-1000 to a shipping container.

“Microsoft will cram between 150 and 220 shipping containers filled with data center gear into a new 500,000 square foot Chicago facility. This move marks the most significant, public use of the shipping container systems popularized by the likes of Sun Microsystems and Rackable Systems to date.”

Page 19

Clouds Hide Complexity

• Build portals around all computing capability.
• SaaS: Software as a Service.
• IaaS: Infrastructure as a Service (or HaaS: Hardware as a Service).
• PaaS: Platform as a Service delivers SaaS on IaaS.
• Cyberinfrastructure is “Research as a Service”.

Figure: two Google warehouses of computers on the banks of the Columbia River in The Dalles, Oregon. Such centers use 20 MW-200 MW (future) each, at 150 watts per core; they save money through large size, positioning near cheap power, and Internet access.

Page 20

Open Architecture Clouds

Amazon, Google, Microsoft, et al., don't tell you how to build a cloud.
• Proprietary knowledge.
Indiana University and others want to document this publicly.
• What is the right way to build a cloud?
• It is more than just running software.
What is the minimum-sized organization to run a cloud?
• Department? University? University consortium? Outsource it all?
• Analogous issues in government, industry, and enterprise.
Example issues:
• What hardware setups work best? What are you getting into?
• What is the best virtualization technology for different problems?

Page 21

Data-File Parallelism and Clouds

Now that you have a cloud, you may want to do large-scale processing with it. Classic problems perform the same (sequential) algorithm on fragments of extremely large data sets. Cloud runtime engines manage these replicated algorithms in the cloud.
• They can be chained together in pipelines (Hadoop) or DAGs (Dryad).
• Runtimes manage problems like failure control.
We are exploring both scientific applications and classic parallel algorithms (clustering, matrix multiplication) using clouds and cloud runtimes; a sketch of the basic pattern follows.
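A minimal sketch of that pattern on a single many-core node, assuming hypothetical fragment files in a `fragments/` directory and a stand-in `process_fragment` analysis. A cloud runtime adds distribution, scheduling, and failure recovery on top of exactly this shape:

```python
from multiprocessing import Pool
from pathlib import Path

def process_fragment(path):
    """Run the unchanged sequential algorithm on one fragment
    (stand-in: count whitespace-separated records)."""
    return len(Path(path).read_text().split())

if __name__ == "__main__":
    # Each .dat file is one fragment of the extremely large data set.
    fragments = sorted(Path("fragments").glob("*.dat"))
    with Pool() as pool:   # one worker per core
        results = pool.map(process_fragment, fragments)
    print("total records:", sum(results))
```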

Page 22

Data Intensive Research

Research is advanced by observation, i.e. analyzing data from:
• Gene sequencers
• Accelerators
• Telescopes
• Environmental sensors
• Web crawlers
• Ethnographic interviews
This data is “filtered”, “analyzed”, and “data mined” (the term used in Computer Science) to produce conclusions. Weather forecasting and climate prediction are of this type.

Page 23

Geospatial Examples

• Image processing and mining, e.g. SAR images from the Polar Grid project (J. Wang); applied to 20 TB of data.
• Flood modeling I: chaining flood models over a geographic area.
• Flood modeling II: parameter fits and inversion problems.
• Real-time GPS processing (filtering).

Page 24

Parallel Clustering and Parallel Multidimensional Scaling (MDS)

Figure panels: 4500 points, pairwise aligned; 4500 points, Clustal MSA; 3000 points, Clustal MSA with Kimura2 distance; 4000 points, patient record data on obesity and environment.

Applied to ~5000-dimensional gene sequences and ~20-dimensional patient record data, with very good parallel speedup. A sketch of the MDS step follows.
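As a sketch of the MDS step only (the parallel implementation is not shown), the following embeds points given nothing but a pairwise distance matrix. Random vectors stand in for sequence or patient data, and scikit-learn is an assumed dependency:

```python
import numpy as np
from sklearn.manifold import MDS  # assumed dependency

# Random 20-dimensional vectors stand in for patient records;
# real runs would use alignment-derived distances instead.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 20))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Embed into 3-D from the precomputed pairwise distances alone.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)
print(coords.shape)   # (100, 3): coordinates suitable for visualization
```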

Page 25

Some Other File/Data Parallel Examples from the Indiana University Biology Dept

• EST (Expressed Sequence Tag) assembly (Dong): 2 million mRNA sequences generate 540,000 files, taking 15 hours on 400 TeraGrid nodes (the CAP3 run dominates).
• MultiParanoid/InParanoid gene sequence clustering (Dong): 476 core-years just for prokaryotes.
• Population genomics (Lynch): looking at all pairs separated by up to 1000 nucleotides.
• Sequence-based transcriptome profiling (Cherbas, Innes): MAQ, SOAP.
• Systems microbiology (Brun): BLAST, InterProScan.
• Metagenomics (Fortenberry, Nelson): pairwise alignment of 7243 16S sequences took 12 hours on TeraGrid.
All can use Dryad or Hadoop.

Page 26

Intel's Projection

Technology might support:
• 2010: 16-64 cores, 200 GF-1 TF
• 2013: 64-256 cores, 500 GF-4 TF
• 2016: 256-1024 cores, 2 TF-20 TF

Page 27

Too Much Computing?

Historically, both grids and parallel computing have tried to increase computing capabilities by:
• Optimizing performance of codes at the cost of re-usability.
• Exploiting all possible CPUs, such as graphics co-processors and “idle cycles” (across administrative domains).
• Linking central computers together, as in NSF/DoE/DoD supercomputer networks, without clear user requirements.
The next crisis in the technology area will be the opposite problem: commodity chips will be 32-128-way parallel in 5 years' time, and we currently have no idea how to use them on commodity systems, especially on clients.
• Only 2 releases of standard software (e.g. Office) fit in this time span, so we need solutions that can be implemented in the next 3-5 years.
Intel's RMS analysis: gaming and generalized decision support (data mining) are ways of using these cycles; a sketch follows.
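As a toy sketch of soaking up client cores with data mining, the following scores records in parallel across every available core; the scoring rule and data are hypothetical placeholders for a real decision-support kernel:

```python
from concurrent.futures import ProcessPoolExecutor
import os
import random

def score(record):
    """Hypothetical decision-support scoring of one record."""
    return sum(x * x for x in record) ** 0.5

if __name__ == "__main__":
    random.seed(0)
    records = [[random.random() for _ in range(64)] for _ in range(10_000)]
    workers = os.cpu_count()   # would use all 32-128 cores on a future client
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, records, chunksize=256))
    print(f"scored {len(scores)} records on {workers} cores")
```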