Scientific Data Analytics on Cloud and HPC Platforms


Scientific Data Analytics on Cloud and HPC Platforms

SALSA HPC Group http://salsahpc.indiana.edu

School of Informatics and Computing

Indiana University

Judy Qiu

CAREER Award

Slide credit: Bill Howe, eScience Institute

"... computing may someday be organized as a public utility just asthe telephone system is a public utility... The computer utility couldbecome the basis of a new and important industry.”

-- John McCarthy, 1961 (Emeritus at Stanford, inventor of LISP)

Joseph L. Hellerstein, Google

Challenges and Opportunities

• Iterative MapReduce
  – A programming model instantiating the paradigm of bringing computation to data
  – Support for Data Mining and Data Analysis
• Interoperability
  – Using the same computational tools on HPC and Cloud
  – Enabling scientists to focus on science, not on programming distributed systems
• Reproducibility
  – Using Cloud Computing for scalable, reproducible experimentation
  – Sharing results, data, and software

Intel's Application Stack

(Iterative) MapReduce in Context

[Architecture stack diagram, top to bottom]
• Applications: Support Scientific Simulations (Data Mining and Data Analysis) – Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping
• Services and Workflow: Security, Provenance, Portal
• Programming Model: High Level Language; Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
• Runtime / Storage: Distributed File Systems, Data Parallel File System, Object Store
• Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Azure Cloud, Grid Appliance, Virtualization
• Hardware: CPU Nodes, GPU Nodes, Virtualization

MapReduce Programming Model

• Moving computation to data
• Scalable
• Fault tolerance
• Ideal for data-intensive, loosely coupled (pleasingly parallel) applications
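To make the map and reduce roles above concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce API; it is illustrative only and not taken from the slides.

    // Illustrative sketch only: a classic Hadoop word count showing the map
    // and reduce roles described above (standard org.apache.hadoop.mapreduce API).
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit <word, 1> for every token in the input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }
    }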

MapReduce in Heterogeneous Environment


Iterative MapReduce Frameworks

• Twister [1]
  – Map -> Reduce -> Combine -> Broadcast
  – Long-running map tasks (data in memory)
  – Centralized driver based, statically scheduled
• Daytona [3]
  – Iterative MapReduce on Azure using cloud services
  – Architecture similar to Twister
• HaLoop [4]
  – On-disk caching, map/reduce input caching, reduce output caching
• Spark [5]
  – Iterative MapReduce using Resilient Distributed Datasets to ensure fault tolerance
• Pregel [6]
  – Graph processing from Google

Other Frameworks

• Mate-EC2 [6]
  – Local reduction object
• Network Levitated Merge [7]
  – RDMA/InfiniBand-based shuffle & merge
• Asynchronous Algorithms in MapReduce [8]
  – Local & global reduce
• MapReduce Online [9]
  – Online aggregation and continuous queries
  – Push data from Map to Reduce
• Orchestra [10]
  – Data transfer improvements for MapReduce
• iMapReduce [11]
  – Asynchronous iterations; one-to-one map & reduce mapping; automatically joins loop-variant and loop-invariant data
• CloudMapReduce [12] & Google AppEngine MapReduce [13]
  – MapReduce frameworks utilizing cloud infrastructure services

Twister v0.9

• Distinction between static and variable data
• Configurable long-running (cacheable) map/reduce tasks
• Pub/sub messaging based communication/data transfers
• Broker network for facilitating communication

New Infrastructure for Iterative MapReduce Programming

configureMaps(..)
configureReduce(..)
while(condition){
    runMapReduce(..)
    updateCondition()
} //end while
close()
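A hedged Java sketch of the driver loop above; the IterativeJob interface and its method names are hypothetical stand-ins used only to illustrate the control flow, not the actual Twister API.

    // Hypothetical sketch of the iterative driver pattern shown above.
    public class IterativeDriverSketch {

        interface IterativeJob<T> {
            void configureMaps(String partitionFile);   // cache static (loop-invariant) data
            void configureReduce();                     // set up long-running reduce tasks
            T runMapReduce(T loopVariantData);          // one iteration: map -> reduce -> combine
            void close();                               // release cached tasks and brokers
        }

        static <T> T run(IterativeJob<T> job, T initial, int maxIterations, double tolerance) {
            job.configureMaps("partition.pf");
            job.configureReduce();
            T current = initial;
            double change = Double.MAX_VALUE;
            int iteration = 0;
            // Loop-variant data (e.g. cluster centers) is broadcast each round,
            // while the large loop-invariant data stays cached in the map tasks.
            while (iteration < maxIterations && change > tolerance) {
                T next = job.runMapReduce(current);
                change = difference(current, next);     // the updateCondition() step
                current = next;
                iteration++;
            }
            job.close();
            return current;
        }

        static <T> double difference(T a, T b) {
            // Placeholder convergence measure for the sketch.
            return a.equals(b) ? 0.0 : 1.0;
        }
    }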

[Twister architecture diagram] The master node runs the Twister Driver and the main program; the main program's process space may contain many MapReduce invocations or iterative MapReduce invocations. Each worker node runs a Twister Daemon with a worker pool of cacheable map/reduce tasks (Map(), Reduce(), and the Combine() operation) and a local disk. Communications and data transfers go through the pub/sub broker network and direct TCP; <Key,Value> pairs may be sent directly, and one broker serves several Twister daemons. Scripts perform data distribution, data collection, and partition file creation.

Applications of Twister4Azure

• Implemented
  – Multi-Dimensional Scaling
  – KMeans Clustering
  – PageRank
  – Smith-Waterman-GOTOH sequence alignment
  – WordCount
  – Cap3 sequence assembly
  – BLAST sequence search
  – GTM & MDS interpolation
• Under Development
  – Latent Dirichlet Allocation

Twister4Azure Architecture

Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.
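As a rough illustration of that division of labor, a worker might pull task descriptors from a queue, read input from blob storage, and record progress in a table. The interfaces below are hypothetical stand-ins for illustration, not the Azure storage SDK or the Twister4Azure code.

    // Hypothetical sketch of the queue/table/blob division of labor described above.
    public class AzureWorkerSketch {

        interface TaskQueue { String pollTask(); }                                 // Queue role: scheduling
        interface MetadataTable { void recordStatus(String taskId, String status); } // Table role: monitoring
        interface BlobStore {                                                      // Blob role: input/output data
            byte[] read(String blobName);
            void write(String blobName, byte[] data);
        }

        static void workerLoop(TaskQueue queue, MetadataTable table, BlobStore blobs) {
            while (true) {
                String taskId = queue.pollTask();     // workers pull work from the queue
                if (taskId == null) break;            // no more tasks queued
                table.recordStatus(taskId, "running");
                byte[] input = blobs.read("input/" + taskId);
                byte[] output = process(input);       // map or reduce computation
                blobs.write("output/" + taskId, output);
                table.recordStatus(taskId, "done");
            }
        }

        static byte[] process(byte[] input) { return input; } // placeholder computation
    }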

Data Intensive Iterative Applications

• Growing class of applications
  – Clustering, data mining, machine learning & dimension reduction applications
  – Driven by data deluge & emerging computation fields
• Iteration structure: compute, communication, reduce/barrier, then a new iteration
• Larger loop-invariant data; smaller loop-variant data

[Twister4Azure iteration flow diagram] Job Start -> Map/Combine tasks (reading from the Data Cache) -> Reduce -> Merge (add) -> iteration check: if yes, broadcast and hybrid scheduling of the new iteration; if no, Job Finish.

Iterative MapReduce for Azure Cloud
http://salsahpc.indiana.edu/twister4azure

• Merge step
• Extensions to support broadcast data
• Multi-level caching of static data
• Hybrid intermediate data transfer
• Cache-aware hybrid task scheduling (see the sketch below)
• Collective communication primitives
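The cache-aware scheduling bullet above can be sketched as follows, assuming each worker advertises which data partitions it already caches; this is a simplified illustration, not the actual Twister4Azure scheduler.

    import java.util.*;

    // Simplified illustration of cache-aware task scheduling: prefer the worker
    // that already caches a task's input partition, otherwise fall back to any
    // idle worker (the "hybrid" part).
    public class CacheAwareSchedulerSketch {

        static Map<String, String> assign(List<String> taskPartitions,
                                          Map<String, Set<String>> workerCaches) {
            Map<String, String> assignment = new HashMap<>();   // task partition -> worker
            Deque<String> idleWorkers = new ArrayDeque<>(workerCaches.keySet());

            for (String partition : taskPartitions) {
                String chosen = null;
                // First preference: a worker whose cache already holds this partition.
                for (String worker : idleWorkers) {
                    if (workerCaches.get(worker).contains(partition)) { chosen = worker; break; }
                }
                // Fallback: any idle worker, which will fetch the partition remotely.
                if (chosen == null && !idleWorkers.isEmpty()) chosen = idleWorkers.peekFirst();
                if (chosen != null) {
                    idleWorkers.remove(chosen);
                    assignment.put(partition, chosen);
                }
            }
            return assignment;
        }
    }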

Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure. Thilina Gunarathne, BingJing Zang, Tak-Lon Wu and Judy Qiu. UCC 2011, Melbourne, Australia.

Performance of Pleasingly Parallel Applications on Azure

[Charts: parallel efficiency vs. number of query files for BLAST sequence search (Twister4Azure, Hadoop-Blast, DryadLINQ-Blast); parallel efficiency vs. number of cores x number of files for Cap3 sequence assembly and Smith-Waterman sequence alignment (Twister4Azure, Amazon EMR, Apache Hadoop)]

MapReduce in the Clouds for Science, Thilina Gunarathne, et al. CloudCom 2010, Indianapolis, IN.

Performance – KMeans Clustering

[Charts: relative parallel efficiency vs. number of instances/cores for Twister4Azure, Twister, and Hadoop (strong scaling with 128M data points; weak scaling); number of executing map tasks histogram; task execution time histogram]

• First iteration performs the initial data fetch
• Overhead between iterations
• Scales better than Hadoop on bare metal

Performance – Multi-Dimensional Scaling

[Charts: weak scaling and data size scaling, time (ms) vs. num nodes x num data points (32 x 32M up to 256 x 256M), for Twister4Azure and Twister4Azure adjusted]

• Performance adjusted for sequential performance difference
• Iteration structure: each iteration runs three Map-Reduce-Merge steps (BC: Calculate BX; X: Calculate invV(BX); Calculate Stress), then starts a new iteration
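For context, these three steps correspond to the standard SMACOF formulation of MDS; as a reference (not reproduced from the slides), the stress function and iterative update are:

    \sigma(X) = \sum_{i<j} w_{ij} \bigl( d_{ij}(X) - \delta_{ij} \bigr)^2,
    \qquad
    X^{(k+1)} = V^{+} \, B\bigl(X^{(k)}\bigr) \, X^{(k)}

where \delta_{ij} are the input dissimilarities, d_{ij}(X) the distances in the embedding X, B(X) the SMACOF majorization matrix, and V^{+} a pseudo-inverse; "Calculate BX", "Calculate invV(BX)", and "Calculate Stress" match these terms.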

Scalable Parallel Scientific Computing Using Twister4Azure. Thilina Gunarathne, BingJing Zang, Tak-Lon Wu and Judy Qiu. Submitted to Journal of Future Generation Computer Systems. (Invited as one of the best 6 papers of UCC 2011)

Application #1: Twister-MDS Output
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters, in preliminary work with Seattle Children's Research Institute

Application #2: Data Intensive KMeans Clustering
Image classification: 1.5 TB; 500 features per image; 10k clusters; 1000 map tasks; 1 GB data transfer per map task
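To show how such a KMeans computation fits the iterative model, here is a generic sketch (not the SALSA implementation): each map task assigns its cached points to the nearest broadcast center and emits partial sums, reduce/merge recomputes the centers, and a driver loop such as the one sketched earlier iterates the two phases until the centers converge.

    import java.util.*;

    // Generic sketch of KMeans as iterative MapReduce (not the SALSA code).
    public class KMeansSketch {

        // "Map" over one cached partition: per-center partial sums and counts.
        static double[][] mapPartition(double[][] points, double[][] centers) {
            int k = centers.length, d = centers[0].length;
            double[][] partial = new double[k][d + 1];          // last column holds the count
            for (double[] p : points) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) dist += (p[j] - centers[c][j]) * (p[j] - centers[c][j]);
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                for (int j = 0; j < d; j++) partial[best][j] += p[j];
                partial[best][d] += 1;
            }
            return partial;
        }

        // "Reduce/merge": combine partial sums from all map tasks into new centers.
        static double[][] reduce(List<double[][]> partials, double[][] oldCenters) {
            int k = oldCenters.length, d = oldCenters[0].length;
            double[][] centers = new double[k][d];
            double[] counts = new double[k];
            for (double[][] partial : partials)
                for (int c = 0; c < k; c++) {
                    for (int j = 0; j < d; j++) centers[c][j] += partial[c][j];
                    counts[c] += partial[c][d];
                }
            for (int c = 0; c < k; c++)
                for (int j = 0; j < d; j++)
                    centers[c][j] = counts[c] > 0 ? centers[c][j] / counts[c] : oldCenters[c][j];
            return centers;
        }
    }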

Twister Communications

• Broadcast
  – Broadcast data could be large
  – Chain & MST
• Map Collectives
  – Local merge
• Reduce Collectives
  – Collect but no merge
• Combine
  – Direct download or Gather

[Diagram: map tasks feed a Map Collective; reduce tasks feed a Reduce Collective; results are gathered]

Improving Performance of Map Collectives
• Scatter and Allgather
• Full mesh broker network
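The chain broadcast mentioned above can be pictured as each node forwarding the payload to its successor; the toy sketch below ignores chunk pipelining, which a real implementation would add, and is not the Twister code.

    // Toy sketch of chain broadcast: data enters at the head of the chain and each
    // node forwards it to its successor, so every link carries the payload once.
    // A real implementation pipelines fixed-size chunks so transfers overlap; an
    // MST broadcast instead fans out along a spanning tree.
    public class ChainBroadcastSketch {

        static class Node {
            final String name;
            Node next;                       // successor in the chain, null at the tail
            byte[] received;
            Node(String name) { this.name = name; }

            void receive(byte[] data) {
                received = data;                        // keep a local copy (e.g. static input data)
                if (next != null) next.receive(data);   // forward down the chain
            }
        }

        public static void main(String[] args) {
            // Build a three-node chain and broadcast a payload from the head.
            Node a = new Node("node1"), b = new Node("node2"), c = new Node("node3");
            a.next = b; b.next = c;
            a.receive("broadcast payload".getBytes());
            System.out.println("tail received " + c.received.length + " bytes");
        }
    }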

Polymorphic Scatter-Allgather in Twister

[Chart: time (seconds) vs. number of nodes for Multi-Chain, Scatter-Allgather-BKT, Scatter-Allgather-MST, and Scatter-Allgather-Broker]

Twister Performance on KMeans Clustering

[Chart: per-iteration cost (seconds), before and after, broken down into Broadcast, Map, Shuffle & Reduce, and Combine]

Twister on InfiniBand

• InfiniBand successes in the HPC community
  – More than 42% of Top500 clusters use InfiniBand
  – Extremely high throughput and low latency: up to 40 Gb/s between servers and 1 μs latency
  – Reduces CPU overhead by up to 90%
• The cloud community can benefit from InfiniBand
  – Accelerated Hadoop (SC11)
  – HDFS benchmark tests
• RDMA can make Twister faster
  – Accelerate static data distribution
  – Accelerate data shuffling between mappers and reducers
• In collaboration with ORNL on a large InfiniBand cluster

Using RDMA for Twister on InfiniBand

Twister Broadcast Comparison: Ethernet vs. InfiniBand
[Chart: speed-up for a 1 GB broadcast, time in seconds, Ethernet vs. InfiniBand]

Building Virtual Clusters: Towards Reproducible eScience in the Cloud

Separation of concerns between two layers
• Infrastructure Layer – interactions with the Cloud API
• Software Layer – interactions with the running VM

Separation Leads to Reuse
Infrastructure Layer = (*), Software Layer = (#)
By separating layers, one can reuse software-layer artifacts in separate clouds

Design and Implementation

Equivalent machine images (MI) built in separate clouds
• Common underpinning in separate clouds for software installations and configurations
• Configuration management used for software automation

Extend to Azure

Cloud Image Proliferation

[Chart: FG Eucalyptus images per bucket (N = 120); the x-axis lists per-user image buckets (e.g. grid-appliance, gridappliance-twister, centos53, centos56, fedora-image-bucket, ubuntu-image-bucket, pegasus-images, clovr) against the number of images per bucket]

Changes of Hadoop Versions


Implementation - Hadoop Cluster

Hadoop cluster commands
• knife hadoop launch {name} {slave count}
• knife hadoop terminate {name}

Running CloudBurst on Hadoop

Running CloudBurst on a 10-node Hadoop Cluster
• knife hadoop launch cloudburst 9
• echo '{"run_list": ["recipe[cloudburst]"]}' > cloudburst.json
• chef-client -j cloudburst.json

[Chart: CloudBurst sample data run-time results – run time (seconds) vs. cluster size (10, 20, and 50 nodes), broken into CloudBurst and FilterAlignments phases]

CloudBurst on a 10, 20, and 50 node Hadoop Cluster

Applications & Different Interconnection Patterns

Map Only (Input -> map -> Output)
• CAP3 Analysis
• Document conversion (PDF -> HTML)
• Brute force searches in cryptography
• Parametric sweeps
• CAP3 Gene Assembly
• PolarGrid Matlab data analysis

Classic MapReduce (Input -> map -> reduce)
• High Energy Physics (HEP) Histograms
• SWG gene alignment
• Distributed search
• Distributed sorting
• Information Retrieval
• HEP Data Analysis
• Calculation of Pairwise Distances for ALU Sequences

Iterative MapReduce, Twister (Input -> map -> reduce, iterations)
• Expectation maximization algorithms
• Clustering
• Linear Algebra
• Kmeans
• Deterministic Annealing Clustering
• Multidimensional Scaling (MDS)

Loosely Synchronous, MPI (Pij)
• Many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions
• Solving Differential Equations
• Particle dynamics with short-range forces

The first three columns are the domain of MapReduce and its iterative extensions; the last is the domain of MPI.

Acknowledgements

SALSA HPC Group http://salsahpc.indiana.edu

School of Informatics and Computing

Indiana University