www.ci.anl.gov www.ci.uchicago.edu
Rethinking cyberinfrastructure for massive data but modest budgets
Ian Foster
The data deluge
The data deluge in biology
x10^5 in 6 years
x10 in 6 years
Number of sequencing machines
http://omicsmaps.com/
The challenge of staying competitive
"Well, in our country," said Alice … "you'd generally get to somewhere else — if you run very fast for a long time, as we've been doing."
"A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"
Small science is struggling
• More data, more complex data
• Ad-hoc solutions
• Inadequate software, hardware
• Data plan mandates
Medium science struggles too
• Dark Energy Survey receives 100,000 files each night in Illinois
• They transmit files to Texas for analysis … then move results back to Illinois
• Process must be reliable, routine, and efficient
• The IT team is not large
Blanco 4m on Cerro Tololo (image credit: Roger Smith/NOAO/AURA/NSF)
A crisis that demands new approaches
• We have exceptional infrastructure for the 1% (e.g., supercomputers, Large Hadron Collider, …)
• But not for the 99% (e.g., the vast majority of the 1.8M publicly funded researchers in the EU)
We need new approaches to providing research cyberinfrastructure that:
– Reduce barriers to entry
– Are cheaper
– Are sustainable
The Computation Institute
• Home of the Research Cloud: amplifying human capabilities
• Leader in Data Science: understanding systems of systems
• Interdisciplinary Forum: exchange, education, engagement
You can run a company from a coffee shop
Because businesses outsource their IT
Web presence
Email (hosted Exchange)
Calendar
Telephony (hosted VOIP)
Human resources and payroll
Accounting
Customer relationship mgmt
Software as a Service (SaaS)
And often their large-scale computing too
Web presence
Email (hosted Exchange)
Calendar
Telephony (hosted VOIP)
Human resources and payroll
Accounting
Customer relationship mgmt
Data analytics
Content distribution
Infrastructure as a Service (IaaS)
Software as a Service (SaaS)
Consumers also outsource much of their IT
Let’s rethink how we provide research IT
Accelerate discovery and innovation worldwide by providing research IT as a service
Leverage the cloud to
• provide millions of researchers with unprecedented access to powerful tools;
• enable a massive shortening of cycle times in time-consuming research processes; and
• reduce research IT costs dramatically via economies of scale
Cloud layers
Software-as-a-Service: SaaS
Platform-as-a-Service: PaaS
Infrastructure-as-a-Service: IaaS
Common research data management steps
• Dark Energy Survey
• Galaxy genomics
• LIGO observatory
• SBGrid structural biology consortium
• NCAR climate data applications
• Land use change; economics
Scientific data delivery, 2012
• “[A] majority of users at BES facilities … physically transport data to a home institution using portable media … data volumes are going to increase significantly in the next few years (to 70 TB/day or more) – data must be transferred over the network”
• “the effectiveness of data transfer middleware [is] not just on the transfer speed, but also the time and interruption to other work required to supervise and check on the success of large data transfers”
• “It took two weeks and email traffic between network specialists at NERSC and ORNL, sys-admins at NERSC, … and combustion staff at ORNL and SNL to move 10 TB from NERSC to ORNL”
[ESnet Network Requirements Workshops, 2007-2010]
Major usability, productivity, performance problems
1980
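To put the quoted figures on a common scale, the short calculation below converts them to sustained network rates. It is only a back-of-the-envelope sketch: it assumes decimal terabytes (10^12 bytes) and a perfectly steady transfer, and the function name `sustained_rate_bps` is made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BITS_PER_TB = 8 * 10**12  # assumption: decimal terabyte, 10^12 bytes


def sustained_rate_bps(terabytes: float, days: float) -> float:
    """Average bit rate needed to move `terabytes` in `days` with no pauses."""
    return terabytes * BITS_PER_TB / (days * SECONDS_PER_DAY)


# Projected facility output quoted above: 70 TB per day.
required = sustained_rate_bps(70, 1)
# The NERSC-to-ORNL anecdote quoted above: 10 TB in two weeks.
observed = sustained_rate_bps(10, 14)

print(f"70 TB/day needs about {required / 1e9:.2f} Gbit/s sustained")
print(f"10 TB in two weeks averages about {observed / 1e6:.1f} Mbit/s")
```

Under these assumptions the projected volume needs roughly 6.5 Gbit/s around the clock, while the two-week transfer averaged about 66 Mbit/s, a gap of about two orders of magnitude.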
The challenge: Moving big data easily
What should be trivial …
… can be painfully tedious and time-consuming
“I need my data over there – at my _____” (supercomputing center, campus server, etc.)
[Diagram: Data Source → Data Destination]
! Config issues
! Firewall issues
! Unexpected failure = manual retry
“GAAAH!%&@#&”
Globus Online: Data transfer as SaaS
• Reliable file transfer
– Easy “fire-and-forget” transfers
– Automatic fault recovery
– High performance
– Across multiple security domains
• No IT required
– Software as a Service (SaaS)
  • No client software installation
  • New features automatically available
– Consolidated support & troubleshooting
– Works with existing GridFTP servers
– Globus Connect solves “last mile problem”
• >3500 registered users, >3 Petabytes moved
Recommended by XSEDE, NERSC, Blue Waters, and many campuses
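The idea behind "fire-and-forget" transfers with automatic fault recovery can be sketched with a local file copy. This is not the Globus Online API or its implementation. It is a minimal illustration that assumes a plain `shutil.copyfile` as the transport, and the names `reliable_copy` and `sha256_of` are hypothetical. The caller submits one request, and the retry loop and checksum check replace the manual supervision described earlier.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reliable_copy(src, dst, copy_fn=shutil.copyfile, max_attempts=5,
                  base_delay=1.0, sleep=time.sleep):
    """Copy src to dst, retrying on errors and verifying the checksum.

    Returns the number of attempts used; raises RuntimeError if all fail.
    `copy_fn` and `sleep` are injectable so the transport can be swapped
    or the waiting skipped (hypothetical hooks for this sketch).
    """
    src, dst = Path(src), Path(dst)
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        try:
            copy_fn(src, dst)
            if sha256_of(dst) == expected:
                return attempt  # verified copy, nothing left for the user to do
        except OSError:
            pass  # treat I/O errors like a failed attempt and retry
        if attempt < max_attempts:
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"transfer failed after {max_attempts} attempts")
```

A hosted service would add the pieces this sketch leaves out, such as credentials across security domains, parallel streams for performance, and status notification.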
Towards “research IT as a service”
Research data management as a service
• SaaS: Globus Transfer, Globus Collaborate, Globus Storage, ...
• PaaS: Globus Integrate platform
A 21st C research cyberinfrastructure
[Figure labels: Research data management; Collaboration, computation; Research administration]
• To provide more capability for more people at less cost …
• Create infrastructure
– Robust and universal
– Economies of scale
– Positive returns to scale
• Via the creative use of
– Aggregation (“cloud”)
– Federation (“grid”)
Small and medium laboratories and projects
Acknowledgments
• Colleagues at UChicago and Argonne
– Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu Malik, Rachana Ananthakrishnan, Raj Kettimuthu, and others listed at www.globusonline.org/about/goteam/
• NSF OCI and MPS
• DOE ASCR
• NIH
For more information
Attend GlobusWorld in Chicago, April 10-12, 2012
• www.globusonline.org
• Twitter: @globusonline, Globus Online on Facebook
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Software as a Service for Data Scientists. Communications of the ACM, Feb, 2012.
Thank you!
www.globusonline.org
Twitter: @globusonline, @ianfoster