Scientific Data Analytics on Cloud and HPC Platforms
SALSA HPC Group http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University
Judy Qiu
CAREER Award
"... computing may someday be organized as a public utility just as the telephone system is a public utility... The computer utility could become the basis of a new and important industry."
-- John McCarthy, 1961 (Inventor of LISP, Emeritus at Stanford)
[Slide credit: Bill Howe, eScience Institute]
Challenges and Opportunities
• Iterative MapReduce
– A programming model instantiating the paradigm of bringing computation to data
– Support for data mining and data analysis
• Interoperability
– Using the same computational tools on HPC and Cloud
– Enabling scientists to focus on science, not on programming distributed systems
• Reproducibility
– Using cloud computing for scalable, reproducible experimentation
– Sharing results, data, and software
(Iterative) MapReduce in Context
[Layered architecture, top to bottom:]
• Applications – Support Scientific Simulations (Data Mining and Data Analysis): Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topographic Mapping
• Services and Workflow: Security, Provenance, Portal
• High Level Language
• Programming Model: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
• Runtime: Distributed File Systems, Object Store, Data Parallel File System
• Storage
• Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Azure Cloud, Grid Appliance, Virtualization
• Hardware: CPU Nodes, GPU Nodes, Virtualization
MapReduce Programming Model
• Moving computation to data
• Scalable
• Fault tolerance
• Ideal for data-intensive, loosely coupled (pleasingly parallel) applications
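For readers new to the model, here is a minimal, framework-free word-count sketch of the map and reduce roles described above (plain Java; class and method names are illustrative, not any specific framework's API):

import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // map: emit a <word, 1> pair for every word in one input line
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // reduce: sum all counts emitted for one word
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat", "the cat ran");
        // the "shuffle" step: group intermediate pairs by key, then reduce each group
        Map<String, List<Integer>> grouped = input.stream()
            .flatMap(line -> map(line).stream())
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        grouped.forEach((w, counts) -> System.out.println(w + " -> " + reduce(w, counts)));
    }
}

Because the map calls are independent, a runtime can partition the input and push each map task to the node that holds its partition, which is the "moving computation to data" property listed above.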
Iterative MapReduce Frameworks
• Twister [1]
– Map->Reduce->Combine->Broadcast
– Long-running map tasks (data in memory)
– Centralized driver based, statically scheduled
• Daytona [3]
– Iterative MapReduce on Azure using cloud services
– Architecture similar to Twister
• Haloop [4]
– On-disk caching, map/reduce input caching, reduce output caching
• Spark [5]
– Iterative MapReduce using Resilient Distributed Datasets to ensure fault tolerance
• Pregel [6]
– Graph processing from Google

Others
• Mate-EC2 [6]
– Local reduction object
• Network Levitated Merge [7]
– RDMA/InfiniBand-based shuffle & merge
• Asynchronous Algorithms in MapReduce [8]
– Local & global reduce
• MapReduce Online [9]
– Online aggregation and continuous queries
– Push data from Map to Reduce
• Orchestra [10]
– Data transfer improvements for MapReduce
• iMapReduce [11]
– Asynchronous iterations, one-to-one map & reduce mapping, automatically joins loop-variant and loop-invariant data
• CloudMapReduce [12] & Google AppEngine MapReduce [13]
– MapReduce frameworks utilizing cloud infrastructure services
Twister v0.9
• Distinction between static and variable data
• Configurable long-running (cacheable) map/reduce tasks
• Pub/sub messaging based communication/data transfers
• Broker network for facilitating communication
New Infrastructure for Iterative MapReduce Programming
configureMaps(..)          // configure cacheable map tasks (static data)
configureReduce(..)        // configure cacheable reduce tasks
while(condition){
    runMapReduce(..)       // per-iteration results gathered at the driver via Combine()
    updateCondition()
} //end while
close()
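A self-contained toy expansion of this skeleton is sketched below; the IterativeJob interface and the convergence test are illustrative stand-ins (not Twister's actual driver classes), but the loop structure (run MapReduce, update the condition, repeat until converged) is the same:

import java.util.Arrays;

public class IterativeDriverSketch {
    // hypothetical stand-in for the framework's runMapReduce(..) call
    interface IterativeJob { double[] runMapReduce(double[] loopVariantData); }

    public static void main(String[] args) {
        // stand-in "job": nudges each value halfway toward 1.0 per iteration
        IterativeJob job = data -> {
            double[] next = new double[data.length];
            for (int i = 0; i < data.length; i++) next[i] = data[i] + 0.5 * (1.0 - data[i]);
            return next;
        };

        double[] loopVariant = {0.0, 2.0};   // small data broadcast each iteration
        double eps = 1e-3;
        boolean condition = true;
        while (condition) {                              // while(condition)
            double[] updated = job.runMapReduce(loopVariant);   // runMapReduce(..)
            double delta = 0;
            for (int i = 0; i < loopVariant.length; i++)
                delta = Math.max(delta, Math.abs(updated[i] - loopVariant[i]));
            loopVariant = updated;
            condition = delta >= eps;                    // updateCondition()
        }
        System.out.println(Arrays.toString(loopVariant));       // close()
    }
}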
[Figure: Twister architecture. The Master Node runs the Twister Driver and the Main Program, whose process space may contain many MapReduce invocations or iterative MapReduce invocations. Each Worker Node runs a Twister Daemon with a worker pool, local disk, and cacheable map/reduce tasks. Communications/data transfers go through the pub/sub broker network and direct TCP (one broker serves several Twister daemons), and iterations may send <Key,Value> pairs directly. Map(), Reduce(), and the Combine() operation form each iteration. Scripts perform data distribution, data collection, and partition file creation.]
Applications of Twister4Azure
• Implemented
– Multi-Dimensional Scaling
– KMeans Clustering
– PageRank
– Smith-Waterman-GOTOH sequence alignment
– WordCount
– Cap3 sequence assembly
– Blast sequence search
– GTM & MDS interpolation
• Under Development
– Latent Dirichlet Allocation
Twister4Azure Architecture
Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.
Data Intensive Iterative Applications
• Growing class of applications
– Clustering, data mining, machine learning & dimension reduction applications
– Driven by data deluge & emerging computation fields
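As an illustration of the loop-invariant vs. loop-variant distinction in these applications, the sketch below caches the large point partition inside a map task and receives only the small centroid array each iteration (illustrative names, not a framework API):

import java.util.Arrays;

public class CachedKMeansMapSketch {
    private final double[][] points;   // loop-invariant: large, cached across iterations

    CachedKMeansMapSketch(double[][] points) { this.points = points; }

    // One iteration's map step: assign each cached point to the nearest of the
    // broadcast centroids and emit per-centroid partial sums plus a count.
    double[][] map(double[][] centroids) {
        int k = centroids.length, d = points[0].length;
        double[][] partial = new double[k][d + 1];      // [sum_0..sum_d-1, count]
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int j = 0; j < d; j++) dist += (p[j] - centroids[c][j]) * (p[j] - centroids[c][j]);
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            for (int j = 0; j < d; j++) partial[best][j] += p[j];
            partial[best][d] += 1;
        }
        return partial;   // a reduce step would combine partials from all workers
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {9, 9}, {10, 10}};
        double[][] centroids = {{0, 0}, {10, 10}};       // loop-variant: small, broadcast
        System.out.println(Arrays.deepToString(new CachedKMeansMapSketch(pts).map(centroids)));
    }
}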
[Figure: iterative computation pattern. Each iteration alternates compute and communication, ending in a reduce/barrier. The larger loop-invariant data is held in a data cache while the smaller loop-variant data is broadcast. Map/Combine tasks feed Reduce and Merge steps; after the Merge, if another iteration is needed ("Yes") hybrid scheduling launches the new iteration, otherwise ("No") the job finishes (Job Start ... Job Finish).]
Iterative MapReduce for Azure Cloud
http://salsahpc.indiana.edu/twister4azure
• Merge step
• Extensions to support broadcast data
• Multi-level caching of static data
• Hybrid intermediate data transfer
• Cache-aware hybrid task scheduling
• Collective communication primitives
Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure, Thilina Gunarathne, BingJing Zang, Tak-Lon Wu and Judy Qiu, (UCC 2011), Melbourne, Australia.
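The cache-aware hybrid task scheduling listed above can be illustrated with a small sketch (a simplification, not Twister4Azure's actual Azure Queue/Table code): a worker first claims tasks whose input partitions it cached in an earlier iteration, then falls back to pulling any remaining tasks from the shared queue.

import java.util.*;

public class CacheAwareSchedulingSketch {
    public static void main(String[] args) {
        // partitions this worker cached during a previous iteration
        Set<String> cachedPartitions = new HashSet<>(List.of("p1", "p3"));
        // shared task queue for the new iteration (one task per partition)
        Deque<String> taskQueue = new ArrayDeque<>(List.of("p0", "p1", "p2", "p3", "p4"));

        List<String> executionOrder = new ArrayList<>();
        // pass 1: cache-aware, claim tasks whose data is already local
        for (Iterator<String> it = taskQueue.iterator(); it.hasNext(); ) {
            String task = it.next();
            if (cachedPartitions.contains(task)) { executionOrder.add(task); it.remove(); }
        }
        // pass 2: hybrid fallback, pull whatever is left from the shared queue
        while (!taskQueue.isEmpty()) executionOrder.add(taskQueue.poll());

        System.out.println(executionOrder);   // [p1, p3, p0, p2, p4]
    }
}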
Performance of Pleasingly Parallel Applications on Azure
[Chart: BLAST Sequence Search: parallel efficiency (0–100%) vs. number of query files (128–728); series: Twister4Azure, Hadoop-Blast, DryadLINQ-Blast]
[Chart: Cap3 Sequence Assembly and Smith-Waterman Sequence Alignment: parallel efficiency (50–100%) vs. number of cores × number of files; series: Twister4Azure, Amazon EMR, Apache Hadoop]
MapReduce in the Clouds for Science, Thilina Gunarathne, et al. CloudCom 2010, Indianapolis, IN.
Performance – KMeans Clustering
[Charts: relative parallel efficiency (0–1.2) vs. number of instances/cores (32–256); series: Twister4Azure, Twister, Hadoop; panels: strong scaling with 128M data points, weak scaling, task execution time histogram, number of executing map tasks histogram]
• First iteration performs the initial data fetch
• Overhead between iterations
• Scales better than Hadoop on bare metal
Performance – Multi-Dimensional Scaling
[Charts: weak scaling and data size scaling; time (ms, 0–1,000) vs. num nodes × num data points (32 × 32M through 256 × 256M); series: Twister4Azure (adjusted); performance adjusted for sequential performance difference]
Each MDS iteration runs three Map-Reduce-Merge stages (X: Calculate invV(BX); BC: Calculate BX; Calculate Stress) and then starts a new iteration.
Scalable Parallel Scientific Computing Using Twister4Azure. Thilina Gunarathne, BingJing Zang, Tak-Lon Wu and Judy Qiu. Submitted to Journal of Future Generation Computer Systems. (Invited as one of the best 6 papers of UCC 2011)
Parallel Data Analysis using Twister
• MDS (Multi-Dimensional Scaling)
• Clustering (KMeans)
• SVM (Support Vector Machine)
• Indexing
Xiaoming Gao, Vaibhav Nachankar and Judy Qiu, Experimenting Lucene Index on HBase in an HPC Environment, position paper in the proceedings of ACM High Performance Computing meets Databases workshop (HPCDB'11) at SuperComputing 11, December 6, 2011
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute
Application #1: Twister-MDS Output
Application #2: Data-Intensive KMeans Clustering (Image Classification): 1.5 TB of images; 500 features per image; 10k clusters; 1,000 Map tasks; 1 GB data transfer per Map task
Twister Communications
• Broadcast: data could be large; Chain & MST methods
• Map Collectives: local merge
• Reduce Collectives: collect but no merge
• Combine: direct download or Gather
[Figure: Broadcast → Map Tasks (Map Collective) → Reduce Tasks (Reduce Collective) → Gather, repeated across parallel groups of tasks]
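To make the Chain broadcast idea concrete, here is a toy pipelined chain relay (BlockingQueues stand in for network links; this is a sketch of the general technique, not Twister's broadcast code): each node forwards a chunk to its successor as soon as it arrives, so large broadcast data streams down the chain instead of being sent from the root to every node separately.

import java.util.*;
import java.util.concurrent.*;

public class ChainBroadcastSketch {
    public static void main(String[] args) throws Exception {
        int nodes = 4, chunks = 8;
        // links.get(i) is the inbound "link" of node i (node 0 is the root)
        List<BlockingQueue<byte[]>> links = new ArrayList<>();
        for (int i = 0; i < nodes; i++) links.add(new LinkedBlockingQueue<>());

        ExecutorService pool = Executors.newFixedThreadPool(nodes);
        List<Future<Integer>> received = new ArrayList<>();
        for (int n = 1; n < nodes; n++) {               // non-root nodes relay chunks
            final int id = n;
            received.add(pool.submit(() -> {
                int bytes = 0;
                for (int c = 0; c < chunks; c++) {
                    byte[] chunk = links.get(id).take();               // receive from predecessor
                    bytes += chunk.length;
                    if (id + 1 < nodes) links.get(id + 1).put(chunk);  // forward downstream immediately
                }
                return bytes;
            }));
        }
        for (int c = 0; c < chunks; c++) {              // root pushes chunks into the chain
            links.get(1).put(new byte[1024]);
        }
        for (int n = 1; n < nodes; n++)
            System.out.println("node " + n + " received " + received.get(n - 1).get() + " bytes");
        pool.shutdown();
    }
}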
Polymorphic Scatter-Allgather in Twister
[Chart: time (seconds, 0–35) vs. number of nodes (0–140); series: Multi-Chain, Scatter-Allgather-BKT, Scatter-Allgather-MST, Scatter-Allgather-Broker]
Twister Performance on Kmeans Clustering
[Chart: per-iteration cost (seconds, 0–450), before vs. after; stacked components: Broadcast, Map, Shuffle & Reduce, Combine]
Twister on InfiniBand
• InfiniBand successes in the HPC community
– More than 42% of Top500 clusters use InfiniBand
– Extremely high throughput and low latency: up to 40 Gb/s between servers and 1 μsec latency
– Reduces CPU overhead by up to 90%
• The cloud community can benefit from InfiniBand
– Accelerated Hadoop (SC11)
– HDFS benchmark tests
• RDMA can make Twister faster
– Accelerate static data distribution
– Accelerate data shuffling between mappers and reducers
• In collaboration with ORNL on a large InfiniBand cluster
Twister Broadcast Comparison: Ethernet vs. InfiniBand
[Chart: InfiniBand speedup for a 1 GB broadcast; time (seconds, 0–25); series: Ethernet, InfiniBand]
Building Virtual Clusters Towards Reproducible eScience in the Cloud
Separation of concerns between two layers:
• Infrastructure Layer – interactions with the Cloud API
• Software Layer – interactions with the running VM
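A toy sketch of that separation (interface and method names are illustrative, not the project's actual code): because the software layer only ever sees a running host, the same software-layer artifacts can be reused no matter which cloud's infrastructure layer provisioned the VM.

public class LayeredProvisioningSketch {
    // infrastructure layer: the only part that talks to a cloud API
    interface InfrastructureLayer { String launchInstance(String imageId); }
    // software layer: only talks to the running VM (e.g., via configuration management)
    interface SoftwareLayer { void configure(String host, String recipe); }

    static void buildNode(InfrastructureLayer infra, SoftwareLayer software) {
        String host = infra.launchInstance("base-image");
        software.configure(host, "hadoop");   // identical regardless of which cloud launched the VM
    }

    public static void main(String[] args) {
        SoftwareLayer chefLike = (host, recipe) -> System.out.println("apply " + recipe + " on " + host);
        buildNode(imageId -> "eucalyptus-host-1", chefLike);   // stand-in for a Eucalyptus-backed layer
        buildNode(imageId -> "ec2-host-1", chefLike);          // stand-in for an EC2-backed layer
    }
}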
Separation Leads to Reuse
By separating the infrastructure layer from the software layer, one can reuse software-layer artifacts in separate clouds.
Design and Implementation
• Equivalent machine images (MI) built in separate clouds
• Common underpinning in separate clouds for software installations and configurations
• Configuration management used for software automation
• Extend to Azure
Cloud Image Proliferation
[Chart: FG Eucalyptus Images per Bucket (N = 120): number of images (0–14) per user bucket, across roughly 30 buckets (e.g., ashley-image-bucket, grid-appliance, gridappliance-twister, pegasus-images, saga-mr-euca-bucket, ubuntu-image-bucket)]
Implementation - Hadoop Cluster
Hadoop cluster commands:
• knife hadoop launch {name} {slave count}
• knife hadoop terminate {name}
Running CloudBurst on Hadoop
Running CloudBurst on a 10-node Hadoop cluster:
• knife hadoop launch cloudburst 9
• echo '{"run_list": "recipe[cloudburst]"}' > cloudburst.json
• chef-client -j cloudburst.json
[Chart: CloudBurst sample data run-time results: run time (seconds, 0–400) vs. cluster size (10, 20, 50 nodes); series: CloudBurst, FilterAlignments]
CloudBurst on 10-, 20-, and 50-node Hadoop clusters
Applications & Different Interconnection Patterns
• Map Only (Input → map → Output): CAP3 analysis; document conversion (PDF -> HTML); brute force searches in cryptography; parametric sweeps. Examples: CAP3 gene assembly; PolarGrid Matlab data analysis.
• Classic MapReduce (Input → map → reduce): High Energy Physics (HEP) histograms; SWG gene alignment; distributed search; distributed sorting; information retrieval. Examples: information retrieval; HEP data analysis; calculation of pairwise distances for ALU sequences.
• Iterative MapReduce – Twister (Input → map → reduce, with iterations): expectation maximization algorithms; clustering; linear algebra. Examples: KMeans; deterministic annealing clustering; multidimensional scaling (MDS).
• Loosely Synchronous (MPI, Pij interactions): many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions. Examples: solving differential equations; particle dynamics with short-range forces.
The first three patterns form the domain of MapReduce and iterative extensions; the last is MPI.
Acknowledgements
SALSA HPC Group http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University