High Performance Biomedical Applications Using Cloud Technologies
HPC and Grid Computing in the Cloud Workshop (OGF27), October 13, 2009, Banff, Canada
Judy Qiu, [email protected], www.infomall.org/salsa
Community Grids Laboratory, Pervasive Technology Institute
Indiana University
Collaborators in SALSA Project
Indiana University SALSA Technology Team
Geoffrey Fox, Judy Qiu, Scott Beason, Jaliya Ekanayake, Thilina Gunarathne,
Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Saliya Ekanayake
Microsoft Research Technology Collaboration
Azure (Clouds): Dennis Gannon
Dryad (Parallel Runtime): Roger Barga, Christophe Poulain
CCR (Threading): George Chrysanthakopoulos
DSS (Services): Henrik Frystyk Nielsen
Applications
Bioinformatics, CGB: Haixu Tang, Mina Rho, Peter Cherbas, Qunfeng Dong
IU Medical School: Gilbert Liu
Demographics (Polis Center): Neil Devadasan
Cheminformatics: David Wild, Qian Zhu
Physics: CMS group at Caltech (Julian Bunn)
Community Grids Lab and UITS RT – PTI
Bio-Computing: a Major Focus of the 22nd Annual SC Conference
• “Biological research today is driven by the acceleration of knowledge creation, explosion in data around the world, and growing interdependence of disciplines. New HPC solutions allow for far more comprehensive approaches to scientific investigation and enable a systems approach to understanding and predicting life, which is fundamental to the global challenges in medicine, energy and defense.”
Peg Folta, head of the SC09 Bio-Computing Thrust Area
• “Our discussion at SC09 will explore the possibility of on-demand access to computing resources that democratize access to the diverse, rapidly expanding and distributed data generated in biology, along with sharing information about our planned Systems Biology Knowledgebase.”
Susan Gregurick, DOE Program Manager
• “Bio-Computing and computationally intense applications in genomics and sequencing represent a tremendous growth area for HPC technologies, and an emerging area of interest for a large amount of HPC professionals.”
Chris Heier, president of Tycrid Platform Technology
Data Intensive (Science) Applications
[Layered architecture, top to bottom:]
Applications: Biology: Expressed Sequence Tag (EST) sequence assembly (CAP3); Biology: pairwise Alu sequence alignment (SW); Health: correlating childhood obesity with environmental factors; Cheminformatics: mapping PubChem data into low dimensions to aid drug discovery
Data mining algorithms: Clustering (pairwise, vector), MDS, GTM, PCA, CCA
Visualization: PlotViz
Cloud technologies (MapReduce, Dryad, Hadoop) | Classic HPC (MPI, threading)
FutureGrid/VM
Bare metal (computer, network, storage)
FutureGrid Architecture
Cloud Computing: Infrastructure and Runtimes
• Cloud infrastructure: outsourcing of servers, computing, data, file space, etc.
– Handled through Web services that control virtual machine lifecycles
• Cloud runtimes: tools (for using clouds) to do data-parallel computations
– Apache Hadoop, Google MapReduce, Microsoft Dryad, and others
– Designed for information retrieval but excellent for a wide range of science data analysis applications
– Can also do much traditional parallel computing for data mining if extended to support iterative operations
– Not usually run on virtual machines
Data Intensive Architecture
[Pipeline diagram: Instruments and User Data feed Initial Processing; results flow into Databases and Files; Higher Level Processing (such as R: PCA, clustering, correlations; maybe MPI) reads them and prepares data for visualization (MDS); a Visualization / User Portal supports Knowledge Discovery by Users]
MapReduce “File/Data Repository” Parallelism
[Diagram: Instruments write to Disks; Map1, Map2, Map3 run on Computers/Disks, followed by Reduce, with communication via messages/files; results reach Portals/Users]
Map = (data parallel) computation reading and writing data
Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram
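As a concrete illustration of this map/reduce split, here is a minimal sketch in Python (purely illustrative; the runs described in this talk use Hadoop, Dryad, and MPI, not this script). Each map task computes partial histogram counts over its partition, and the reduce phase forms the global sums:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_partition(values):
    # Map: data-parallel computation over one data partition,
    # emitting partial histogram counts (one bin per integer part)
    return Counter(int(v) for v in values)

def reduce_counts(a, b):
    # Reduce: collective/consolidation phase forming global sums
    return a + b

if __name__ == "__main__":
    partitions = [[1.2, 3.7, 3.1], [0.4, 3.3], [1.9, 0.2, 0.8]]
    with Pool() as pool:
        partials = pool.map(map_partition, partitions)
    histogram = reduce(reduce_counts, partials, Counter())
    print(dict(histogram))  # {1: 2, 3: 3, 0: 3}
```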
Cluster Configurations

Feature                   | GCB-K18 @ MSR                     | iDataplex @ IU                          | Tempest @ IU
CPU                       | Intel Xeon L5420, 2.50 GHz        | Intel Xeon L5420, 2.50 GHz              | Intel Xeon E7450, 2.40 GHz
# CPUs / # cores per node | 2 / 8                             | 2 / 8                                   | 4 / 24
Memory                    | 16 GB                             | 32 GB                                   | 48 GB
# Disks                   | 2                                 | 1                                       | 2
Network                   | Gigabit Ethernet                  | Gigabit Ethernet                        | Gigabit Ethernet / 20 Gbps Infiniband
Operating system          | Windows Server Enterprise, 64-bit | Red Hat Enterprise Linux Server, 64-bit | Windows Server Enterprise, 64-bit
# Nodes used              | 32                                | 32                                      | 32
Total CPU cores used      | 256                               | 256                                     | 768
Runtimes                  | DryadLINQ                         | Hadoop / Dryad / MPI                    | DryadLINQ / MPI
Alu Sequencing Workflow
• Data is a collection of N sequences, each hundreds of characters long
– These cannot be thought of as vectors because there are missing characters
– “Multiple Sequence Alignment” (creating vectors of characters) doesn’t seem to work if N is larger than O(100)
• First calculate the N² dissimilarities (distances) between all pairs of sequences (decomposition sketched after this list)
• Find families by clustering (much better methods than Kmeans); as there are no vectors, use vector-free O(N²) methods
• Map to 3D for visualization using Multidimensional Scaling (MDS), also O(N²)
• N = 50,000 runs in 10 hours (all of the above) on 768 cores
• Our collaborators just gave us 170,000 sequences and want to look at 1.5 million; we will develop new algorithms!
• MapReduce++ will do all steps, as MDS and clustering just need MPI broadcast/reduce
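The all-pairs step decomposes into independent blocks of the symmetric N x N distance matrix, which is what makes it "doubly data parallel" and a good fit for Dryad/Hadoop. A minimal sketch of that decomposition (the function names are ours; the actual SALSA code is C#/DryadLINQ and Java/Hadoop):

```python
def block_tasks(n, block):
    # Enumerate upper-triangle blocks of the symmetric n x n distance matrix;
    # each (row range, column range) pair is an independent task, and
    # symmetry fills in the lower triangle for free.
    starts = list(range(0, n, block))
    for r0 in starts:
        for c0 in starts:
            if c0 >= r0:
                yield (r0, min(r0 + block, n)), (c0, min(c0 + block, n))

def compute_block(seqs, rows, cols, dist):
    # One "map" task: fill a rectangular block of pairwise distances
    (r0, r1), (c0, c1) = rows, cols
    return [[dist(seqs[i], seqs[j]) for j in range(c0, c1)]
            for i in range(r0, r1)]

# For n = 50,000 and block = 1,000 this yields 50 * 51 / 2 = 1,275 tasks.
```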
Gene Family from Alu Sequencing
• Calculate pairwise distances for a collection of genes (used for clustering and MDS)
• An O(N²) problem, “doubly data parallel” at the Dryad stage
• Performance close to MPI
• Performed on 768 cores (Tempest cluster)
[Figure: total runtime for 35,339 and 50,000 sequences, DryadLINQ vs. MPI; 1,250 million distances took 4 hours and 46 minutes]
Processes work better than threads when used inside vertices: 100% utilization vs. 70%.
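The kernel being evaluated here is Smith-Waterman-Gotoh alignment. As a reference point, a minimal Smith-Waterman scoring sketch in Python (linear gap penalty rather than Gotoh's affine gaps, with made-up scoring parameters) shows the dynamic program that runs 1,250 million times:

```python
def smith_waterman(a, b, match=5, mismatch=-3, gap=-4):
    # Local alignment score by dynamic programming; O(len(a) * len(b))
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best  # a dissimilarity can then be derived from this score

print(smith_waterman("ACACACTA", "AGCACACA"))
```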
Clustering by Deterministic Annealing
Pairwise clustering of 30,000 points on Tempest
[Figure: parallel overhead vs. degree of parallelism (1 to 744 parallel units), comparing thread-based and MPI-based configurations]
Dryad versus MPI for Smith Waterman
Performance of Dryad vs. MPI of SW-Gotoh Alignment
[Figure: time per distance calculation per core (milliseconds) vs. number of sequences (0 to 60,000) for Dryad (replicated data), block scattered MPI (replicated data), Dryad (raw data), space filling curve MPI (raw data), and space filling curve MPI (replicated data)]
Flat is perfect scaling
Dryad Scaling on Smith Waterman
DryadLINQ Scaling Test on SW-G Alignment
[Figure: time per distance calculation per core (milliseconds) vs. cores (288 to 720)]
Flat is perfect scaling
Dryad for Inhomogeneous Data
Flat is perfect scaling, measured on Tempest
[Figure: time vs. standard deviation of sequence lengths (0 to 350), with mean length 400]
Hadoop/Dryad Comparison: Inhomogeneous Data
[Figure: total time (1200 to 1800) vs. sequence length standard deviation (0 to 350), mean length 400, for Hadoop and Dryad]
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex
Hadoop/Dryad Comparison: “Homogeneous” Data
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex, using real data with standard deviation/length = 0.1
[Figure: time per alignment (ms) vs. number of sequences (30,000 to 55,000) for Dryad and Hadoop]
DryadLINQ on Cloud
• The HPC release of DryadLINQ requires Windows Server 2008
• Amazon does not provide this VM yet
• Used the GoGrid cloud provider
• Before running applications:
– Create a VM image with the necessary software, e.g. the .NET framework
– Deploy a collection of images (one by one, a feature of GoGrid)
– Configure IP addresses (requires login to individual nodes)
– Configure an HPC cluster
– Install DryadLINQ
– Copy data from “cloud storage”
• We configured a 32-node virtual cluster in GoGrid
DryadLINQ on Cloud (contd.)
• CloudBurst and Kmeans did not run on the cloud; VMs were crashing/freezing even at data partitioning
– Communication and data access simply froze the VMs
– VMs became unreachable
• We expect some communication overhead, but the above observations are more GoGrid-related than inherent to clouds
• CAP3 works on the cloud
– Used 32 CPU cores with 100% utilization of virtual CPU cores
– Roughly 3 times more time in the cloud than in bare-metal runs (on different hardware)
• FutureGrid would give us much better results
MPI on Clouds: Kmeans Clustering
• Perform Kmeans clustering for up to 40 million 3D data points
• The amount of communication depends only on the number of cluster centers (see the sketch below)
• Communication is much smaller than the computation and the amount of data processed
• At the highest granularity, VMs show at least 3.5 times overhead compared to bare metal
• Extremely large overheads appear at smaller grain sizes
[Figures: performance on 128 CPU cores; overhead]
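A minimal sketch of this communication pattern using mpi4py (illustrative only; the experiments reported here used their own MPI codes, and the sizes below are toys). Each rank keeps its slice of the points local, and only the k centroid sums and counts cross the network each iteration, which is why communication depends only on the number of cluster centers:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
k, dim, iters = 8, 3, 10
rng = np.random.default_rng(comm.Get_rank())
points = rng.random((100_000, dim))  # this rank's local slice of the data

# Rank 0 picks initial centers; broadcast them to all ranks
centers = comm.bcast(rng.random((k, dim)) if comm.Get_rank() == 0 else None)

for _ in range(iters):
    # Local step: assign each point to its nearest center, form partial sums
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    np.add.at(sums, labels, points)
    np.add.at(counts, labels, 1)
    # Global step: reduce k sums and counts; message size is O(k), not O(N)
    sums, counts = comm.allreduce(sums), comm.allreduce(counts)
    centers = sums / np.maximum(counts, 1)[:, None]
```

Run with, e.g., mpiexec -n 4 python kmeans_mpi.py (the file name is ours).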
Application Classes
• Application = parallel software/hardware in terms of 5 “Application Architecture” structures:
– 1) Synchronous: lockstep operation as in SIMD architectures
– 2) Loosely Synchronous: iterative compute-communication stages with independent compute (map) operations for each CPU; the heart of most MPI jobs
– 3) Asynchronous: computer chess and combinatorial search, often supported by dynamic threads
– 4) Pleasingly Parallel: each component independent; in 1988, Fox estimated these at 20% of the total number of applications
– 5) Metaproblems: coarse-grain (asynchronous) combinations of classes 1)-4); the preserve of workflow
• Grids greatly increased work in classes 4) and 5)
• Previous parallel computing work largely described simulations, not data processing. Now we should admit the class which crosses classes 2), 4), and 5) above:
– 6) MapReduce++, which describes file (database) to file (database) operations
– 6a) Pleasingly Parallel Map Only (CAP3, HEP)
– 6b) Map followed by reductions (SW-G)
– 6c) Iterative “Map followed by reductions”: an extension of current technologies that supports much linear algebra and data mining (pairwise clustering, MDS); see the sketch after this list
• Note that overheads in 1), 2), and 6c) scale like communication time / calculation time; basic MapReduce pays file read/write costs while MPI pays only microseconds
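To make class 6c concrete, here is a toy driver (our own sketch, not the authors' MapReduce++ API): each iteration maps over the data partitions with the current state and reduces the partial results into the next state, so the reduced result is passed back in memory instead of paying per-iteration file read/write costs.

```python
def iterative_mapreduce(map_fn, reduce_fn, state, partitions, iterations):
    # Class 6c: "map followed by reductions", iterated; the reduced result
    # (e.g. updated cluster centers) becomes the next iteration's state.
    for _ in range(iterations):
        partials = [map_fn(state, part) for part in partitions]  # map phase
        state = reduce_fn(partials)                              # reduction
    return state
```

Plugging in a partial-sums map and a centroid-update reduce gives exactly the Kmeans pattern discussed earlier.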
Applications & Different Interconnection Patterns

Map Only (input -> map -> output):
CAP3 analysis; document conversion (PDF -> HTML); brute force searches in cryptography; parametric sweeps
Examples: CAP3 gene assembly; PolarGrid Matlab data analysis (see the sketch below)

Classic MapReduce (input -> map -> reduce -> output):
High Energy Physics (HEP) histograms; distributed search; distributed sorting; information retrieval
Examples: information retrieval; HEP data analysis; calculation of pairwise distances for ALU sequences

Iterative Reductions (input -> map -> reduce, iterated):
Expectation maximization algorithms; clustering; linear algebra
Examples: Kmeans; deterministic annealing clustering; multidimensional scaling (MDS)

Loosely Synchronous (pairwise exchange Pij):
Many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions
Examples: solving differential equations; particle dynamics with short-range forces

The first three patterns are the domain of MapReduce and its iterative extensions; the last is the domain of MPI.
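The Map Only pattern is simply independent tasks over input files. A minimal sketch of farming CAP3-style assembly jobs over a pool of workers (illustrative only: the directory layout and worker count are made up, and the actual runs used DryadLINQ and Hadoop rather than this script):

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

def run_one(fasta: Path) -> int:
    # Each input file is an independent external job ("map only",
    # no communication between tasks); cap3 takes a FASTA file argument
    return subprocess.call(["cap3", str(fasta)])

if __name__ == "__main__":
    inputs = sorted(Path("est_data").glob("*.fsa"))  # hypothetical input dir
    with Pool(processes=8) as pool:
        codes = pool.map(run_one, inputs)
    print(sum(c == 0 for c in codes), "of", len(inputs), "jobs succeeded")
```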
Summary: Key Features of our Approach
• Cloud technologies work very well for data intensive applications
• Iterative MapReduce allows one to build a complete system with a single cloud technology, without MPI
• FutureGrid allows easy Windows vs. Linux comparison, with and without VMs
• We intend to implement a range of biology applications with Dryad/Hadoop
• Initially we will make key capabilities available as services that we will eventually implement on virtual clusters (clouds) to address very large problems:
– Basic pairwise dissimilarity calculations
– R (done already by us and others)
– MDS in various forms (see the sketch below)
– Vector and pairwise deterministic annealing clustering
• Point viewer (PlotViz) either as a download (to Windows!) or as a Web service
• Note that much of our code is written in C# (high-performance managed code) and runs on Microsoft HPCS 2008 (with Dryad extensions); the Hadoop code is written in Java
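For the MDS step above, here is a sketch of mapping a precomputed dissimilarity matrix to 3D. This is purely illustrative: it uses scikit-learn's SMACOF-based MDS rather than SALSA's own parallel C# implementation, and the random matrix is a stand-in for real pairwise alignment distances.

```python
import numpy as np
from sklearn.manifold import MDS

# Stand-in for a real N x N pairwise dissimilarity matrix
rng = np.random.default_rng(0)
X = rng.random((200, 10))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)  # N x 3 coordinates for a PlotViz-style viewer
print(coords.shape)            # (200, 3)
```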
Project website
www.infomall.org/SALSA