Performance Analysis and Diagnosis of Cloud-based DDDAS ... · Performance Analysis and Diagnosis...

Performance Analysis and Diagnosis of Cloud-based DDDAS Applications FA 9550-15-1-0184 Program Director: Dr. Frederica Darema PIs: Mohammad Maifi Khan, Swapna Gokhale Department of Computer Science and Engineering University of Connecticut


Performance Analysis and Diagnosis of Cloud-based DDDAS Applications

FA 9550-15-1-0184

Program Director: Dr. Frederica Darema

PIs: Mohammad Maifi Khan, Swapna Gokhale Department of Computer Science and Engineering

University of Connecticut


[Figure: layered architecture of a cloud-based DDDAS. Layers: System Layer, Cloud-based Storage Layer, Application Layer. Sensor inputs: Video; Audio; Sound, Light, Temperature; Vibration, Accelerometer; Twitter feed. Services: Data Handling Service, Data Replication Service, Data Lookup Service, Failure Recovery Service, Cloud Data Storage Service. Applications: Data Analytic Application 1 through N, End User Visualization.]


Predictable Performance is of utmost importance!


Challenges

• The sampling rate of different sensors and sensing modalities may change
– Cascading effect on the cloud side

• Different data processing algorithms requiring different sets of resources (e.g., motif mining vs. image analysis)

• Virtualization, wide adoption of parallel and multi-threaded programs, and increasingly larger scales
– High degree of interactive complexity

• We need a solution that can answer questions such as:
– How is the changed sampling rate going to affect the execution?
– Why is allocating more servers not improving the execution time?
– How long is it going to take to finish the job?
– …


[Figure: performance analysis and diagnosis framework. Components: Performance Monitors, Resource Allocation Service, Performance Modeling Framework, Diagnostic Service. Other labels: Sensor Stream Configurations, System Configurations, Expected Performance, Actual Performance, Target Performance.]


What is expected to happen in response to critical events?

[Figure: the highly important nodes are marked.]


System Architecture


• Each node has a non-negative Importance function:

\[ \sum_{i=1}^{N} I_i(t) = 1 \]

N: number of sensors, t: time point

• Each node i has a non-negative Quality of Information function $QoI(s_i) \in [0, 1]$:

\[ QoI(s_i) = 0 \ \text{for } s_i \le min_i, \qquad QoI(s_i) = 1 \ \text{for } s_i \ge Max_i \]

• QoI(minimum) = f(importance)

Jin, J., Palaniswami, M., Krishnamachari, B., 2012. Rate control for heterogeneous wireless sensor networks: characterization, algorithms and performance. Computer Networks 56(17), 3783–3794.


Overview

• System Capacity (SC): the capacity of the bottleneck resource in the system:

\[ \sum_{i=1}^{N} s_i \cdot ms_i \cdot x_i \le SC \]

$s_i$: sampling rate of node i

$ms_i$: message size

$x_i \in \{0, 1\}$: node status

\[ \text{maximize} \quad \sum_{i=1}^{N} QoI_i(s_i) \cdot x_i \]

\[ \text{s.t.} \quad \sum_{i=1}^{N} I_i(t) = 1, \qquad \sum_{i=1}^{N} s_i \cdot ms_i \cdot x_i \le SC, \qquad x_i \in \{0, 1\} \]
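The optimization above can be illustrated with a small Python sketch. It is not the authors' implementation: it fixes the sampling rates, searches exhaustively over the node status vector x (practical only for a handful of nodes), and assumes a linear QoI ramp between the minimum and maximum rates, since the slides only define the two endpoints. The function names are hypothetical.

```python
from itertools import product


def qoi(rate, rate_min, rate_max):
    """QoI in [0, 1]: 0 at or below rate_min, 1 at or above rate_max.

    The linear ramp in between is an assumption made for illustration.
    """
    if rate <= rate_min:
        return 0.0
    if rate >= rate_max:
        return 1.0
    return (rate - rate_min) / (rate_max - rate_min)


def load(rates, sizes, status):
    """Left-hand side of the capacity constraint: sum of s_i * ms_i * x_i."""
    return sum(s * m * x for s, m, x in zip(rates, sizes, status))


def best_node_selection(rates, sizes, bounds, capacity):
    """Try every status vector x and keep the feasible one with the highest total QoI."""
    best_value, best_status = -1.0, None
    for status in product((0, 1), repeat=len(rates)):
        if load(rates, sizes, status) > capacity:
            continue  # violates the system capacity constraint
        value = sum(
            qoi(s, lo, hi) * x for s, (lo, hi), x in zip(rates, bounds, status)
        )
        if value > best_value:
            best_value, best_status = value, status
    return best_status, best_value


# Three hypothetical nodes; node 0 has a large message size and cannot fit.
status, value = best_node_selection(
    rates=[10, 10, 5],
    sizes=[100, 10, 20],
    bounds=[(0, 10), (0, 10), (0, 10)],
    capacity=250,
)
print(status, value)
```

In this example node 0 alone would need 1000 units of capacity, so the search keeps nodes 1 and 2 instead.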


Knapsack Rate Allocation Algorithm?

• A sampling rate assignment that maximizes the QoI for the whole network may not be the ideal solution

– The node with the highest importance may have a smaller ratio $QoI_i / (s_i \cdot ms_i)$ due to its large message size
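A small worked example may make the concern concrete. The two nodes and all numbers below are hypothetical and are not taken from the evaluation: node A is the more important one but sends larger messages.

```latex
\[
\text{Node A: } I_A = 0.6,\quad QoI_A = 0.9,\quad s_A = 2,\quad ms_A = 4
\;\Rightarrow\; \frac{QoI_A}{s_A \cdot ms_A} = \frac{0.9}{8} = 0.1125
\]
\[
\text{Node B: } I_B = 0.2,\quad QoI_B = 0.8,\quad s_B = 2,\quad ms_B = 2
\;\Rightarrow\; \frac{QoI_B}{s_B \cdot ms_B} = \frac{0.8}{4} = 0.2
\]
```

An allocation that ranks nodes by QoI gained per unit of capacity would favor node B, even though node A carries three times the importance.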


Preliminary Approach

• Use a threshold to separate nodes into two (or more) groups:
– Nodes that have importance higher than the threshold
– The remaining nodes

• Apply Knapsack rate allocation algorithm for each group separately

• First, pick nodes from the critical group

• Next, pick nodes from the less critical group (if system capacity allows)
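The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: a textbook 0/1 knapsack (dynamic programming over integer loads) stands in for the knapsack rate allocation algorithm of Jin et al., each node's load is taken to be its s_i · ms_i product, and the names `knapsack` and `threshold_allocation` are hypothetical.

```python
def knapsack(items, capacity):
    """0/1 knapsack by dynamic programming.

    items: list of (node_id, load, qoi) with non-negative integer loads.
    Returns the tuple of node ids that maximizes total QoI within capacity.
    """
    # best[c] holds (total QoI, chosen node ids) for capacity c.
    best = [(0.0, ())] * (capacity + 1)
    for node_id, node_load, node_qoi in items:
        # Iterate capacities downward so each node is used at most once.
        for c in range(capacity, node_load - 1, -1):
            candidate = best[c - node_load][0] + node_qoi
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - node_load][1] + (node_id,))
    return best[capacity][1]


def threshold_allocation(nodes, capacity, threshold):
    """nodes: dict of node_id -> (importance, load, qoi).

    Serve the critical group first, then spend leftover capacity on the rest.
    """
    critical = [(n, l, q) for n, (imp, l, q) in nodes.items() if imp > threshold]
    others = [(n, l, q) for n, (imp, l, q) in nodes.items() if imp <= threshold]
    chosen_critical = knapsack(critical, capacity)
    used = sum(nodes[n][1] for n in chosen_critical)
    chosen_others = knapsack(others, capacity - used)
    return list(chosen_critical), list(chosen_others)


# Hypothetical nodes: "a" is highly important but expensive to serve.
nodes = {"a": (0.6, 8, 0.9), "b": (0.2, 4, 0.8), "c": (0.2, 4, 0.8)}
plain = knapsack([(n, l, q) for n, (_, l, q) in nodes.items()], 10)
grouped = threshold_allocation(nodes, capacity=10, threshold=0.5)
print(plain, grouped)
```

With these numbers a single knapsack over all nodes selects b and c and suspends the most important node a, while the threshold-based variant keeps a and suspends b and c for lack of remaining capacity.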


Simulated network topology

[Figure: simulated network topology with nodes numbered 0 to 24, organized into seven groups (Group 1 to Group 7).]


Evaluation

[Figure: The Importance Metrics of Sensor Nodes. Y-axis: Importance of node (%), 0 to 0.3. X-axis: Time (minute), 0 to 14. Series: Nodes 1, 12, 15 (Group 1); Nodes 7, 14, 22 (Group 5); Nodes 6, 8, 21 (Group 7). Marked intervals: duration of Event 1 in Group 1, Event 2 in Group 5, and Event 3 in Group 7.]


Quality of Information vs. Network Coverage

[Figure, two panels. Panel 1: Quality of Information for the Entire Network. Y-axis: Quality of Information (%), 0 to 1. X-axis: Time of different experiment, 1 to 3. Panel 2: Number of Suspended Nodes. Y-axis: Number of suspended nodes, 0 to 16. X-axis: Time of different experiment, 1 to 3. Series: Rate Allocation Algorithm, Threshold Based Rate Allocation Algorithm, Maximum Greedy Algorithm.]


How does changing sampling rate affect backend?

• If a server gets overloaded due to changed sampling rate, we may lose critical data

• One way to address this is by redirecting certain sensor streams to different servers on the fly

• The challenges are:

– How to determine possible overload in advance?

– How to determine the set of sensors that need to be redirected?


Modeling Performance of Application layer jobs


A Different Platform

Apache Spark™ is a widely used cloud-based platform for large-scale data processing. [1,2]

Its resilient distributed datasets (RDDs) feature supports in-memory computation.

[1] Apache Spark™. http://spark.apache.org/.

[2] Powered By Spark. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.


Initial Idea

1. Execute a sample job and measure the performance of the job
– Run the same program as the actual job, but only with a fraction of the input data

2. Predict the performance of the actual job based on the performance of the sample job


[Figure: prediction workflow. Labels: Sampled Input, Execute Sample Job, Event Logs, Performance Info, Performance Model, Predict Performance of the Actual Job.]


Apache Spark Job

• One job consists of sequential stages

• One stage contains parallel and sequential tasks

• Tasks run in batches

One batch of P tasks runs in parallel; H is the number of working nodes, M is the number of stages, and N is the number of tasks in a stage


Performance Metrics

Execution time

$K_c$ is the number of sequential tasks running on CPU core c, and P is the total number of cores.


• The average execution time of the first batch differs from that of the subsequent batches within the same stage

Here $n_h$ is the number of tasks running on host h, and $P_h$ is the number of tasks in the first batch
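The batch structure described on these slides can be illustrated with a rough Python sketch. The slides' own equations did not survive extraction, so the formula below is an assumption made for illustration and may differ from the authors' model: a stage with N tasks on P parallel slots is treated as ceil(N / P) batches, with the first batch timed separately from the later ones. All function names and numbers are hypothetical.

```python
import math


def num_batches(num_tasks, parallel_slots):
    """Number of batches when at most parallel_slots tasks run at once."""
    return math.ceil(num_tasks / parallel_slots)


def estimate_stage_time(num_tasks, parallel_slots, first_batch_avg, later_batch_avg):
    """Assumed stage time: one first batch plus the remaining batches.

    first_batch_avg and later_batch_avg are average task times in seconds,
    for example as measured from a sample job run.
    """
    batches = num_batches(num_tasks, parallel_slots)
    if batches == 0:
        return 0.0
    return first_batch_avg + (batches - 1) * later_batch_avg


def estimate_job_time(stages):
    """Stages run sequentially, so the job estimate is the sum over stages."""
    return sum(estimate_stage_time(*stage) for stage in stages)


# Hypothetical stage: 100 tasks on 16 slots gives 7 batches.
print(num_batches(100, 16))
print(estimate_stage_time(100, 16, first_batch_avg=4.0, later_batch_avg=2.5))
```

Keeping the first batch as a separate term reflects the observation above that its average execution time differs from that of subsequent batches in the same stage.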


Experimental Setup


Cluster Setup


Sample Input

                 One Batch    Two Batches
Reduced Scale    1.25 GB      2.5 GB
Full Scale       7.5 GB       15 GB

Example Jobs

Job                    Input
WordCount              75 GB Wikipedia Dump
Logistic Regression    50 GB SDSS CMD Data
K-Means                50 GB SDSS CMD Data
PageRank               25 GB SNAP Network Dataset

- Sloan Digital Sky Survey. http://www.sdss.org/.

- Stanford SNAP. http://snap.stanford.edu/.


Prediction Accuracy Calculation

• Prediction accuracy is calculated for each stage and summed up over all stages, where M is the number of stages


App – I: WordCount

[Figure panels: Prediction accuracy; Time Prediction; I/O Write Prediction; I/O Read Prediction]


App – II: Logistic Regression

[Figure panels: Prediction accuracy; Time Prediction]


App – III: K-Means

[Figure panels: Prediction accuracy; Time Prediction; I/O Write Prediction; I/O Read Prediction]


App – IV: PageRank

[Figure panels: Prediction accuracy; Time Prediction; I/O Write Prediction; I/O Read Prediction]


Can we model Interference among multiple Jobs?


Main Idea

• Model the slowdown ratio using different kinds of job mix (CPU bound, I/O bound)

• Next, based on stage models, estimate the impact on execution time per stage

• Account for the cascading effect on execution to predict the total execution time
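The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' model: the slowdown ratios are made-up placeholder values, each stage is labeled simply as CPU bound or I/O bound, and the co-running job's kind is held fixed for the whole run. The names `SLOWDOWN` and `predict_finish_times` are hypothetical.

```python
# Hypothetical slowdown ratios keyed by (stage kind, co-runner kind).
# Real values would have to be measured from job-mix experiments.
SLOWDOWN = {
    ("cpu", "cpu"): 2.0,
    ("cpu", "io"): 1.25,
    ("io", "cpu"): 1.25,
    ("io", "io"): 1.5,
}


def predict_finish_times(stages, corunner_kind, slowdown=SLOWDOWN):
    """stages: list of (kind, isolated_seconds) in execution order.

    Each stage is stretched by its slowdown ratio. Because stages run
    sequentially, every delay pushes back all later stages (the cascading
    effect), which the running total captures.
    """
    finish_times = []
    elapsed = 0.0
    for kind, isolated_seconds in stages:
        elapsed += isolated_seconds * slowdown[(kind, corunner_kind)]
        finish_times.append(elapsed)
    return finish_times


# A hypothetical two-stage job running next to a CPU-bound job.
job = [("io", 10.0), ("cpu", 20.0)]
print(predict_finish_times(job, corunner_kind="cpu"))
```

The last element of the returned list is the predicted total execution time. A fuller model would also track which stage of the co-running job overlaps each stage, since delays shift that alignment.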


Preliminary Evaluation

• We choose four Apache Spark jobs:
– PageRank
– K-Means
– Logistic Regression
– WordCount

• For PageRank, we use the 20 GB LiveJournal network dataset from SNAP

• The K-Means and Logistic Regression applications use 20 GB of numerical Color-Magnitude Diagram data of galaxies from the Sloan Digital Sky Survey (SDSS)

• The WordCount application uses 20 GB of Wikipedia dump data


Next Goal

• Modeling the interference among multiple jobs and validating the model

• Developing Algorithms for Interference Aware Job Scheduling


Long Term Goal

- Leveraging performance models for performance troubleshooting
- Modeling interference between application layer and storage layer
- Scalable instrumentation
- Scalable troubleshooting algorithms


Publications Relevant to this Project

• Published
– Performance Prediction for Apache Spark Platform. Kewen Wang and Mohammad Maifi Hasan Khan. In Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications (HPCC), 2015.
– A closed-loop context aware data acquisition and resource allocation framework for dynamic data driven applications systems (DDDAS) on the cloud. Nhan Nguyen and Mohammad Maifi Hasan Khan. Journal of Systems and Software, Elsevier, 2015.
– Context aware data acquisition framework for dynamic data driven applications systems (DDDAS). Nhan Nguyen and Mohammad Maifi Hasan Khan. In Proceedings of the 32nd IEEE Military Communication Conference (MILCOM), San Diego, CA, USA, 2013.

• Under Preparation
– Modeling interference on Apache Spark Platform
– Interference aware job scheduling for Apache Spark Platform


THANK YOU!

Please feel free to contact us if you have any questions.

Mohammad Maifi Khan <[email protected]> Swapna Gokhale <[email protected]>