Time to Science/Time to Results: Transforming Research in the Cloud
Transcript of Time to Science/Time to Results: Transforming Research in the Cloud
Accelerating Time to Science:
Transforming Research in the Cloud
Jamie Kinney - @jamiekinney
Director of Scientific Computing, a.k.a. “SciCo” – Amazon Web Services
Michael Franklin - @amplab
Director, AMPLab - UC Berkeley
Agenda
• An introduction to scientific computing on AWS
• How are researchers using AWS today?
• Case study: The UC Berkeley AMP Lab
• Q & A
What do we mean by Scientific Computing?
Scientific Computing refers to the application of simulation,
mathematical modeling and quantitative analysis to analyze and
solve scientific problems.
How is AWS Used for Scientific Computing?
• High Performance Computing (HPC) for Engineering and Simulation
• High Throughput Computing (HTC) for Data-Intensive Analytics
• Hybrid Supercomputing centers
• Collaborative Research Environments
• Citizen Science
• Science-as-a-Service
Why do researchers love using AWS?
• Time to Science: Access research infrastructure in minutes
• Low Cost: Pay-as-you-go pricing
• Elastic: Easily add or remove capacity
• Globally Accessible: Easily collaborate with researchers around the world
• Secure: A collection of tools to protect data and privacy
• Scalable: Access to effectively limitless capacity
Why does AWS care about Scientific Computing?
• We want to improve our world by accelerating the pace of scientific discovery
• It is a great application of AWS with a broad customer base
• The scientific community helps us innovate on behalf of all customers
– Streaming data processing & analytics
– Exabyte scale data management solutions and exaflop scale compute
– Collaborative research tools and techniques
– New AWS regions
– Significant advances in low-power compute, storage and data centers
– Efficiencies which will lower our costs and therefore pricing for all customers
Research Grants
AWS provides free usage
credits to help researchers:
• Teach advanced courses
• Explore new projects
• Create resources for the
scientific community
aws.amazon.com/grants
Peering with all global research networks
Image courtesy John Hover - Brookhaven National Lab
Breaking news! Restricted-access genomics on
AWS
aws.amazon.com/genomics
How are researchers using AWS today?
High Throughput Computing at Scale
The Large Hadron Collider
@ CERN includes 6,000+
researchers from over 40
countries and produces
approximately 25PB of data
each year.
The ATLAS and CMS
experiments are using AWS
for Monte Carlo simulations
and analysis of LHC data.
Data-Intensive Computing
The Square Kilometre Array will link 250,000 radio
telescopes together, creating the world’s most
sensitive telescope. The SKA will generate zettabytes
of raw data, publishing exabytes annually over 30-40
years.
Researchers are using AWS to develop and test:
• Data processing pipelines
• Image visualization tools
• Exabyte-scale research data management
• Collaborative research environments
aws.amazon.com/solutions/case-studies/icrar/
High Performance Computing
Simulations in the Automotive Sector
• Crash and materials simulations
• Fluid and thermal dynamics simulations
• Car body aerodynamics
• Electronics and electromagnetic simulations
Honda materials science simulations on AWS:
• Deploying scalable HPC clusters on AWS Spot – up to 1000 C3 instances
• Running more simulations than before, for more accurate results
“Cloud offers us an opportunity, as we can innovate faster than before.”
- Ayumi Tada, IT System Administrator, Honda R&D
Schrödinger & Cycle Computing:
Computational Chemistry for Better Solar Power
Simulation by Mark Thompson of the
University of Southern California to see
which of 205,000 organic compounds
could be used for photovoltaic cells for
solar panel material.
Estimated computation time 264 years
completed in 18 hours.
• 156,314 core cluster, 8 regions
• 1.21 petaflops (Rpeak)
• $33,000 or 16¢ per molecule
Loosely Coupled
Science-as-a-Service
Globus Genomics, DNAnexus, and SevenBridges Genomics offer inexpensive, easy-to-use, and secure platforms for processing and analyzing genomic data.
The Weather Company pushes four gigabytes of data to AWS each second in order to deliver 15 billion forecasts each day to their customers around the world.
aws.amazon.com/solutions/case-studies/the-weather-company/
Citizen Science
The Asteroid Data Hunters competition used AWS to develop better mechanisms for
finding near-Earth asteroids. The top algorithm is 18% better at finding asteroids!
Case Study: The UC Berkeley AMP Lab
Scalable Data-Driven
Science at the AMPLab
UC BERKELEY
Michael Franklin
April 9, 2015
AWS Summit SF
AMPLab Overview
• 80+ Students, Postdocs, Faculty and Staff from:
Databases, Machine Learning, Systems, Security, and Networking
• 28 Industry Sponsors +
White House Big Data Program:
NSF CISE Expeditions in Computing and DARPA XData
• Founding Sponsors:
“… Berkeley’s AMPLab has already left an indelible mark on the world of
information technology, and even the web. But we haven’t yet experienced
the full impact of the group … Not even close.”
– Derrick Harris, GigaOM, Aug 2, 2014
Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Culler
AMPLab: Integrating 3 Resources
Algorithms
• Machine Learning, Statistical Methods
• Prediction, Business Intelligence
Machines
• Clusters and Clouds
• Warehouse Scale Computing
People
• Crowdsourcing, Human Computation
• Data Scientists, Analysts
Berkeley Data Analytics Stack
(Apache and BSD open source)
Layers (bottom to top):
• Resource Virtualization
• Storage
• Processing Engine
• Access and Interfaces
• In-house Apps
Open Source Community Building
MeetUp on MLbase @Twitter (Aug 6, 2013)
Spark Summit SF (June 30, 2014)
Apps: Genomics Patterson et al.
Using BDAS, SNAP (Scalable Nucleotide
Alignment) aligns in minutes vs. days
Why Speed Matters: A real-world use case
ADAM – Data formats and Processing
Patterns for Genomics on Big Data Platforms
(e.g., Spark)
Collaborations with: UCSF, UCSC, OHSU, Microsoft Research, Mt. Sinai
M. Wilson, …, and C. Chiu, “Actionable Diagnosis of Neuroleptospirosis by Next-Generation Sequencing”, New England Journal of Medicine, June 4, 2014.
In a First, Test of DNA Finds Root of Illness By CARL ZIMMER JUNE 4, 2014
Joshua Osborn, 14, lay in a coma at American Family Children’s Hospital in Madison,
Wis. For weeks his brain had been swelling with fluid, and a battery of tests had failed to
reveal the cause.
The doctors told his parents, Clark and Julie, that they wanted to run one more test with
an experimental new technology. Scientists would search Joshua’s cerebrospinal fluid for
pieces of DNA. Some of them might belong to the pathogen causing his encephalitis.
The Osborns agreed, although they were skeptical that the test would succeed where so
many others had failed. But in the first procedure of its kind, researchers at the University
of California, San Francisco, managed to pinpoint the cause of Joshua’s problem —
within 48 hours. He had been infected with an obscure species of bacteria. Once
identified, it was eradicated within days.
The case, reported on Wednesday in The New England Journal of Medicine, signals an
important advance in the science of diagnosis. For years, scientists have been sequencing
DNA to identify pathogens. But until now, the process has been too cumbersome to yield
useful information about an individual patient in a life-threatening emergency.
“This is an absolutely great story — it’s a tremendous tour de force,” said Tom Slezak,
the leader of the pathogen informatics team at the Lawrence Livermore National
Laboratory, who was not involved in the study.
Mr. Slezak and other experts noted that it would take years of further research before
such a test might become approved for regular use. But it could be immensely useful: Not
only might it provide speedy diagnoses to critically ill patients, they said, it could lead to
more effective treatments for maladies that can be hard to identify, such as Lyme disease.
Diagnosis is a crucial step in medicine, but it can also be the most difficult. Doctors
usually must guess the most likely causes of a medical problem and then order individual
tests to see which is the right diagnosis.
The guessing game can waste precious time. The causes of some conditions, like
encephalitis, can be so hard to diagnose that doctors often end up with no answer at all.
“About 60 percent of the time, we never make a diagnosis” in encephalitis, said Dr.
Michael R. Wilson, a neurologist at the University of California, San Francisco, and an
author of the new paper. “It’s frustrating whenever someone is doing poorly, but it’s
especially frustrating when we can’t even tell the parents what the hell is going on.”
For the last decade, researchers at the university have been working on methods for identifying pathogens based on their DNA. In 2003 Dr. Joseph DeRisi, a biochemist at
https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/
SNAP
Carat Collaborative Battery App
750,000+ downloads
Big Data Ecosystem
Evolution
MapReduce
Pregel
Dremel
GraphLab
Storm
Giraph
Drill Tez
Impala
S4 …
Specialized systems (iterative, interactive, and streaming apps)
General batch processing
AMPLab Unification Philosophy
Don’t specialize MapReduce – generalize it!
Two additions to Hadoop MR can enable all the models shown earlier!
1. General Task DAGs
2. Data Sharing
For Users:
Fewer Systems to Use
Less Data Movement
[Figure: Spark Streaming, GraphX, Spark SQL, MLbase, … layered on Spark, an in-memory dataflow system]
M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets”, USENIX HotCloud, 2010.
“It’s only September but it’s already clear that 2014 will
be the year of Apache Spark”
-- Datanami, 9/15/14
• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark
Apache Spark Contributors:
[Chart: contributor count per release, growing from 2011 through 2014]
400+ contributors to current release
Apache Spark:
Compared to Other Projects
[Charts: commits and lines of code changed in the past 6 months for MapReduce, YARN, HDFS, Storm, and Spark]
2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …
Iteration in MapReduce
[Figure: iterative training — an initial model w(0) and training data pass through Map/Reduce, producing learned models w(1), w(2), w(3)]
Cost of Iteration in MapReduce
[Figure: the same pipeline, highlighting that the same training data is repeatedly loaded on every iteration]
Cost of Iteration in MapReduce
[Figure: the same pipeline, highlighting that output is redundantly saved between stages]
Dataflow View
[Figure: training data in HDFS flowing through successive Map/Reduce stages]
Memory Opt. Dataflow
[Figure: the same pipeline with the training data loaded once and cached in memory]
Memory Opt. Dataflow View
[Figure: the cached pipeline efficiently moves data between stages]
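The caching idea above can be sketched in plain Python (an illustrative toy, not the Spark API): parsing stands in for loading data from HDFS, and a cached list stands in for an in-memory dataset reused across iterations.

```python
# Illustrative sketch (plain Python, not the Spark API) of why caching the
# parsed training data in memory pays off across iterations.

raw_lines = ["1.0,2.0", "3.0,4.0", "5.0,6.0"]  # stand-in for data in HDFS

def parse(line):
    return [float(x) for x in line.split(",")]

# Naive dataflow: re-parse ("re-load") the same data on every iteration.
def iterate_uncached(iterations):
    parses = 0
    for _ in range(iterations):
        for line in raw_lines:
            parse(line)
            parses += 1
    return parses

# Memory-optimized dataflow: parse once, cache, then iterate in memory.
def iterate_cached(iterations):
    cached = [parse(line) for line in raw_lines]  # single pass over the input
    for _ in range(iterations):
        for row in cached:
            pass  # each iteration reads the in-memory rows directly
    return len(raw_lines)

print(iterate_uncached(10))  # 30 parses: the data is loaded 10 times over
print(iterate_cached(10))    # 3 parses: loaded once, reused from memory
```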
Spark: 10-100× faster than Hadoop MapReduce
Resilient Distributed Datasets (RDDs)
API: coarse-grained transformations (map, group-by, join, sort, filter, sample, …) on immutable collections
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
Rich enough to capture many models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models: Pregel, Hama, …
M. Zaharia, et al., “Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing”, NSDI 2012.
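The lineage idea behind RDDs can be illustrated with a small plain-Python sketch (a hypothetical `MiniRDD` class, not Spark's actual implementation): each coarse-grained transformation records itself, and `collect()` replays the recorded lineage from the source data, which is also exactly what a rebuild-on-failure would do.

```python
# Minimal sketch (hypothetical class, not Spark) of the RDD idea: an
# immutable collection defined by its lineage of transformations, so a
# lost partition can be recomputed from the source data.

class MiniRDD:
    def __init__(self, source, lineage=()):
        self.source = source    # base data (stands in for a file in HDFS)
        self.lineage = lineage  # recorded coarse-grained transformations

    def map(self, f):
        return MiniRDD(self.source, self.lineage + (("map", f),))

    def filter(self, p):
        return MiniRDD(self.source, self.lineage + (("filter", p),))

    def collect(self):
        # Recompute from the source by replaying the lineage; rebuilding
        # after a failure would replay the same steps.
        data = list(self.source)
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            elif op == "filter":
                data = [x for x in data if fn(x)]
        return data

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```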
Abstraction: Dataflow Operators
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
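The semantics of two of the operators above can be sketched in plain Python (an illustration of the idea, not the Spark API): `groupByKey` collects values sharing a key, and `reduceByKey` folds each key's values with a combining function.

```python
# Plain-Python analogues of groupByKey and reduceByKey over (key, value)
# pairs; a sketch of their semantics, not how Spark implements them.
from collections import defaultdict
from functools import reduce

def group_by_key(pairs):
    """groupByKey: collect all values sharing a key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, fn):
    """reduceByKey: fold each key's values with fn."""
    return {k: reduce(fn, vs) for k, vs in group_by_key(pairs).items()}

# Word count, the classic use of these operators:
words = "to be or not to be".split()
counts = reduce_by_key([(w, 1) for w in words], lambda a, b: a + b)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```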
Apache Spark v1.3 (3/15)
Includes
» Spark (core)
» Spark Streaming
» GraphX
» MLlib
» Spark SQL – Query Processing
Wide range of interfaces:
» Enhanced DataFrames API
» Python / interactive ipython
» Scala / interactive scala shell
» R / interactive R-shell
» Java
Now included in all major Hadoop distributions
Data-Intensive Genomics
New population-scale experiments will sequence 10-100k samples
• 100k samples @ 60x WGS will generate ~20PB of read data and ~300TB of genotype data
End-to-end pipeline latency is important to clinical work
We want to jointly analyze samples to uncover low
frequency variations
How can we improve analysis productivity?
Flat file formats sacrifice interoperability but do not improve performance
Common sort order invariants imposed by tools compromise
correctness
Genomics APIs tend to be at a lower level of abstraction, which
compromises productivity
ADAM
An open source, high-performance, distributed platform for genomic analysis
ADAM defines a:
1. Data schema and layout on disk*
2. Programming interface for distributed processing of genomic
data**
3. Command line interface
* Via Parquet and Avro
** Work on Python integration is underway
Data Model is the "Narrow Waist"
Data Format
Schema can be updated without breaking backwards compatibility
Normalize metadata fields into schema
for O(1) metadata access
Models are “dumb”; enhance as
necessary with rich objects
record AlignmentRecord {
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, int } mapq = null;
union { null, string } readName = null;
union { null, string } sequence = null;
union { null, string } mateReference = null;
union { null, long } mateAlignmentStart = null;
union { null, string } cigar = null;
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { int, null } basesTrimmedFromStart = 0;
union { int, null } basesTrimmedFromEnd = 0;
union { boolean, null } readPaired = false;
union { boolean, null } properPair = false;
union { boolean, null } readMapped = false;
union { boolean, null } mateMapped = false;
union { boolean, null } firstOfPair = false;
union { boolean, null } secondOfPair = false;
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { boolean, null } readNegativeStrand = false;
union { boolean, null } mateNegativeStrand = false;
union { boolean, null } primaryAlignment = false;
union { boolean, null } secondaryAlignment = false;
union { boolean, null } supplementaryAlignment = false;
union { null, string } mismatchingPositions = null;
union { null, string } origQual = null;
union { null, string } attributes = null;
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;
union { null, Contig } mateContig = null;
}
Schemas at https://www.github.com/bigdatagenomics/bdg-formats
Parquet: A Modern Big Data Storage Format
ASF Incubator project, based on Google Dremel
High performance columnar store with
support for projections and push-down
predicates
Short read data stored in Parquet achieves a
25% improvement in size over compressed
BAM
Enables scale-out using modern Big Data
technology (e.g., Spark)
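A toy plain-Python sketch (field names borrowed from the ADAM schema above; this is not how Parquet actually encodes data) of why a columnar layout supports cheap projections and push-down predicates:

```python
# Toy columnar-store sketch: projections read only the requested columns,
# and predicates filter on column values before materializing any rows.
rows = [
    {"readName": "r1", "start": 100, "mapq": 60},
    {"readName": "r2", "start": 250, "mapq": 0},
    {"readName": "r3", "start": 900, "mapq": 60},
]

# Columnar layout: one array per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Projection: touch only the 'start' column, never readName or mapq.
starts = columns["start"]

# Push-down predicate: scan one column, keep only the matching positions.
keep = [i for i, q in enumerate(columns["mapq"]) if q >= 30]
selected = [{f: columns[f][i] for f in columns} for i in keep]
print(selected)  # rows r1 and r3, the reads with mapq >= 30
```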
Image from Parquet format definition: https://www.github.com/apache/incubator-parquet-format
ADAM’s API
ADAM is built on top of Apache Spark, which provides the RDD abstraction → distributed arrays
Common primitives include:
• Aggregates: BQSR, Indel Realignment
• Bucketing: Duplicate Marking, Concordance
• Region Joins: Variant Calling and Filtration
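A region join pairs up genomic features whose coordinate intervals overlap. The following plain-Python toy (hypothetical helper names; a naive nested loop, not ADAM's partitioned Spark implementation) shows the core idea:

```python
# Toy region join: pair reads with variants whose intervals overlap, as
# used for variant calling and filtration. Naive nested-loop version;
# a production system would partition by genomic region first.

def overlaps(a, b):
    """Half-open intervals (start, end) overlap iff each starts before
    the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def region_join(reads, variants):
    return [(r, v) for r in reads for v in variants
            if overlaps(r[1:], v[1:])]

reads = [("read1", 100, 200), ("read2", 500, 600)]
variants = [("snp_a", 150, 151), ("snp_b", 700, 701)]
print(region_join(reads, variants))  # [(('read1', 100, 200), ('snp_a', 150, 151))]
```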
ADAM Performance Bottom Line
F. Nothaft, et al., “Rethinking Data-Intensive Science Using Scalable Analytics Systems”, ACM SIGMOD Conf., June 2015, to appear.
$214.39
$78.92
ADAM Performance Update
Analysis run using Amazon EC2, single node was hs1.8xlarge, cluster was m2.4xlarge
Scripts available at https://www.github.com/fnothaft/bdg-recipes.git, “sigmod" branch
Achieve linear scalability out to
128 nodes for most tasks
2-4x improvement over {GATK, samtools, Picard} on single node
Scalable Analytics for Science
Data Model is the “narrow waist” of the architecture
Modern “NoSQL” models support evolution and heterogeneity with high performance.
BDAS Declarative Analytics: Specify What not How
MLBase chooses:
• Algorithms/Operators
• Ordering and Physical Placement
• Parameter and Hyperparameter Settings
• Featurization
Leverages BDAS (Spark, GraphX, Tachyon) and Hadoop File System
for Speed and Scale
To find out more or get
involved:
amplab.berkeley.edu
UC BERKELEY
Thanks to NSF CISE Expeditions in Computing, DARPA XData,
Founding Sponsors: Amazon Web Services, Google, and SAP,
the Thomas and Stacy Siebel Foundation,
all our industrial sponsors and partners, and all the members of the AMPLab Team.
Additional resources…
• aws.amazon.com/hpc
• aws.amazon.com/big-data
• aws.amazon.com/grants
• aws.amazon.com/genomics
• aws.amazon.com/compliance
• aws.amazon.com/security
SAN FRANCISCO
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved