Rethinking cyberinfrastructure for massive data...Small science is struggling More data, more...

www.ci.anl.gov www.ci.uchicago.edu Rethinking cyberinfrastructure for massive data but modest budgets Ian Foster

Transcript of Rethinking cyberinfrastructure for massive data...Small science is struggling More data, more...

Page 1: Rethinking cyberinfrastructure for massive data...Small science is struggling More data, more complex data Ad-hoc solutions Inadequate software, hardware Data plan mandates . 7 Medium

www.ci.anl.gov www.ci.uchicago.edu

Rethinking cyberinfrastructure for massive data but modest budgets

Ian Foster

Page 2:

The data deluge

Page 3:

The data deluge in biology

x10^5 in 6 years

x10 in 6 years
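
To make the gap between the two growth curves concrete, the short sketch below converts each six-year growth factor into an equivalent doubling time. It assumes the first figure is read as a factor of 10^5 (the superscript appears to have been lost in extraction) and that growth is smoothly exponential; the function name is illustrative only.

```python
import math


def doubling_time_months(growth_factor: float, years: float) -> float:
    """Months per doubling for a quantity that grows by growth_factor over the given years."""
    doublings = math.log2(growth_factor)  # how many times the quantity doubled
    return years * 12 / doublings


# Assumed reading of the slide: x10^5 in 6 years versus x10 in 6 years.
fast = doubling_time_months(1e5, 6)
slow = doubling_time_months(10, 6)

print(f"x10^5 in 6 years: doubles about every {fast:.1f} months")
print(f"x10   in 6 years: doubles about every {slow:.1f} months")
```

Under these assumptions the faster curve doubles roughly every four months while the slower one takes nearly two years, which is the mismatch the slide points at.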

Page 4:

Number of sequencing machines

http://omicsmaps.com/

Page 5:

The challenge of staying competitive

"Well, in our country," said Alice … "you'd generally get to somewhere else – if you ran very fast for a long time, as we've been doing."

"A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"

Page 6:

Small science is struggling

• More data, more complex data

• Ad-hoc solutions

• Inadequate software, hardware

• Data plan mandates

Page 7:

Medium science struggles too

• Dark Energy Survey receives 100,000 files each night in Illinois

• They transmit files to Texas for analysis … then move results back to Illinois

• Process must be reliable, routine, and efficient

• The IT team is not large

Blanco 4m on Cerro Tololo (Image credit: Roger Smith/NOAO/AURA/NSF)
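
One generic way to make a nightly bulk transfer "reliable and routine" is to compare checksum manifests at both ends. The sketch below is only an illustration of that idea, not a description of how the Dark Energy Survey actually moves its files; the directory layout, file names, and function names are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(directory: Path) -> dict[str, str]:
    """Map each file's path (relative to directory) to its SHA-256 digest."""
    return {
        str(p.relative_to(directory)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def find_problems(sent: dict[str, str], received: dict[str, str]) -> list[str]:
    """Names that are missing at the destination or whose contents differ."""
    return sorted(name for name, digest in sent.items() if received.get(name) != digest)


# Hypothetical demo: three source files, one arrives intact, one corrupted, one missing.
with tempfile.TemporaryDirectory() as tmp:
    source, dest = Path(tmp, "source"), Path(tmp, "dest")
    source.mkdir()
    dest.mkdir()
    for i in range(3):
        (source / f"exposure_{i}.dat").write_bytes(f"image {i}".encode())
    (dest / "exposure_0.dat").write_bytes(b"image 0")
    (dest / "exposure_1.dat").write_bytes(b"corrupted")

    problems = find_problems(build_manifest(source), build_manifest(dest))

print("Files to resend:", problems)
```

With 100,000 files a night, a check like this is what lets a small team resend only the files that failed instead of inspecting transfers by hand.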

Page 8:

A crisis that demands new approaches

• We have exceptional infrastructure for the 1% (e.g., supercomputers, Large Hadron Collider, …)

• But not for the 99% (e.g., the vast majority of the 1.8M publicly funded researchers in the EU)

We need new approaches to providing research cyberinfrastructure that:

– Reduce barriers to entry

– Are cheaper

– Are sustainable

Page 9:

The Computation Institute

• Home of the Research Cloud: Amplifying human capabilities

• Leader in Data Science: Understanding systems of systems

• Interdisciplinary Forum: Exchange, education, engagement

Page 10:

You can run a company from a coffee shop

Page 11:

Because businesses outsource their IT

Web presence

Email (hosted Exchange)

Calendar

Telephony (hosted VOIP)

Human resources and payroll

Accounting

Customer relationship mgmt

Software as a Service (SaaS)

Page 12:

And often their large-scale computing too

Web presence

Email (hosted Exchange)

Calendar

Telephony (hosted VOIP)

Human resources and payroll

Accounting

Customer relationship mgmt

Data analytics

Content distribution

Infrastructure as a Service (IaaS)

Software as a Service (SaaS)

Page 13:

Consumers also outsource much of their IT

Page 14:

Let’s rethink how we provide research IT

Accelerate discovery and innovation worldwide by providing research IT as a service

Leverage the cloud to

• provide millions of researchers with unprecedented access to powerful tools;

• enable a massive shortening of cycle times in time-consuming research processes; and

• reduce research IT costs dramatically via economies of scale

Page 15:

Cloud layers

Software-as-a-Service: SaaS

Platform-as-a-Service: PaaS

Infrastructure-as-a-Service: IaaS

Page 16:

Common research data management steps

• Dark Energy Survey

• Galaxy genomics

• LIGO observatory

• SBGrid structural biology consortium

• NCAR climate data applications

• Land use change; economics

Page 17:

Scientific data delivery, 2012

• “[A] majority of users at BES facilities … physically transport data to a home institution using portable media … data volumes are going to increase significantly in the next few years (to 70 TB/day or more) – data must be transferred over the network”

• “the effectiveness of data transfer middleware [is] not just on the transfer speed, but also the time and interruption to other work required to supervise and check on the success of large data transfers”

• “It took two weeks and email traffic between network specialists at NERSC and ORNL, sys-admins at NERSC, … and combustion staff at ORNL and SNL to move 10 TB from NERSC to ORNL”

[ESnet Network Requirements Workshops, 2007-2010]

Major usability, productivity, performance problems
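
The figures quoted above translate into sustained network rates with a little arithmetic. The sketch below assumes decimal terabytes (10^12 bytes) and a perfectly steady transfer, so the results are rough averages rather than measured throughput; the names are illustrative.

```python
SECONDS_PER_DAY = 86_400
BITS_PER_TB = 8 * 10**12  # assumption: decimal terabytes


def sustained_gbps(terabytes: float, seconds: float) -> float:
    """Average rate in Gbit/s needed to move the given volume in the given time."""
    return terabytes * BITS_PER_TB / seconds / 1e9


# 70 TB produced per day must leave the facility at this average rate.
daily_rate = sustained_gbps(70, SECONDS_PER_DAY)

# 10 TB moved in two weeks corresponds to this effective rate.
two_week_rate = sustained_gbps(10, 14 * SECONDS_PER_DAY)

print(f"70 TB/day needs about {daily_rate:.2f} Gbit/s sustained")
print(f"10 TB in two weeks is about {two_week_rate * 1000:.1f} Mbit/s effective")
```

Under these assumptions, the projected facility output needs roughly a hundred times the effective rate achieved in the two-week NERSC to ORNL transfer.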

1980

Page 18:

The challenge: Moving big data easily

What should be trivial …

“I need my data over there – at my _____” (supercomputing center, campus server, etc.)

[Diagram: Data Source to Data Destination]

… can be painfully tedious and time-consuming

[Diagram: Data Source to Data Destination, with obstacles]

! Config issues

! Firewall issues

! Unexpected failure = manual retry

“GAAAH!%&@#&”

Page 19:

[Image placeholder: GO picture]

Page 20:

Globus Online: Data transfer as SaaS

• Reliable file transfer

  – Easy “fire-and-forget” transfers

  – Automatic fault recovery

  – High performance

  – Across multiple security domains

• No IT required

  – Software as a Service (SaaS)

    • No client software installation

    • New features automatically available

  – Consolidated support & troubleshooting

  – Works with existing GridFTP servers

  – Globus Connect solves “last mile problem”

• >3500 registered users, >3 Petabytes moved

Recommended by XSEDE, NERSC, Blue Waters, and many campuses
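
The idea behind "fire-and-forget" transfers with automatic fault recovery can be sketched generically: checkpoint progress, retry failed steps with a growing delay, and resume from the last good point instead of starting over. The code below is a minimal illustration under those assumptions, not Globus Online's actual implementation; `transfer_with_recovery`, `flaky_send`, and the chunk model are hypothetical.

```python
import time
from typing import Callable


def transfer_with_recovery(
    send_chunk: Callable[[int], None],
    total_chunks: int,
    max_attempts: int = 5,
    base_delay: float = 0.01,
) -> int:
    """Send every chunk in order, retrying failures; return the number of retries."""
    retries = 0
    next_chunk = 0  # checkpoint: everything before this index has been delivered
    attempt = 0     # consecutive failures on the current chunk
    while next_chunk < total_chunks:
        try:
            send_chunk(next_chunk)
        except ConnectionError:
            attempt += 1
            retries += 1
            if attempt >= max_attempts:
                raise  # give up and surface the fault to the user
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
            continue  # resume from the checkpoint, not from chunk 0
        next_chunk += 1
        attempt = 0
    return retries


# Hypothetical flaky link: chunk 2 fails once, chunk 5 fails twice.
delivered: list[int] = []
failures_left = {2: 1, 5: 2}


def flaky_send(index: int) -> None:
    if failures_left.get(index, 0) > 0:
        failures_left[index] -= 1
        raise ConnectionError(f"link dropped on chunk {index}")
    delivered.append(index)


retries = transfer_with_recovery(flaky_send, total_chunks=8)
print(f"Delivered {len(delivered)} chunks after {retries} automatic retries")
```

The point of the design is that the person who started the transfer never sees the three failures; the "manual retry" from the earlier slide becomes the service's job.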

Page 21:

Towards “research IT as a service”

Research data management as a service

SaaS: Globus Transfer, Globus Storage, Globus Collaborate, ...

PaaS: Globus Integrate platform

Page 22:

Page 23:

A 21st C research cyberinfrastructure

[Diagram: many nodes labeled L and four labeled P. Labels: Small and medium laboratories and projects; Research data management; Collaboration, computation; Research administration; SaaS; PaaS]

• To provide more capability for more people at less cost …

• Create infrastructure

– Robust and universal

– Economies of scale

– Positive returns to scale

• Via the creative use of

– Aggregation (“cloud”)

– Federation (“grid”)

Page 24:

Acknowledgments

• Colleagues at UChicago and Argonne

– Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu Malik, Rachana Ananthakrishnan, Raj Kettimuthu, and others listed at www.globusonline.org/about/goteam/

• NSF OCI and MPS

• DOE ASCR

• NIH

Page 25:

For more information

Attend GlobusWorld in Chicago, April 10-12, 2012

• www.globusonline.org

• Twitter: @globusonline, Globus Online on Facebook

• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.

• Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Software as a Service for Data Scientists. Communications of the ACM, Feb. 2012.

Page 26:

Thank you!

www.globusonline.org

Twitter: @globusonline, @ianfoster