Measuring Quality of Service on Worker Node in Cluster

15/02/2006 CHEP 06 1 Measuring Quality of Service on Worker Node in Cluster Rohitashva Sharma, R S Mundada, Sonika Sachdeva, P S Dhekne, Computer Division, BARC, Mumbai, India Helge Mainhard, Tony Cass, Olof Barring, CERN Geneva, Switzerland


Page 1: Measuring Quality of Service on  Worker Node in Cluster


Measuring Quality of Service on Worker Node in Cluster

Rohitashva Sharma, R S Mundada, Sonika Sachdeva, P S Dhekne, Computer Division, BARC, Mumbai, India

Helge Mainhard, Tony Cass, Olof Barring, CERN Geneva, Switzerland

Page 2: Measuring Quality of Service on  Worker Node in Cluster

INTRODUCTION

Quality of Service (QoS)

  Defines goodness of a node for a type of task

  Needed for better/optimum utilization of resources

Computer Division, BARC and IT Division, CERN collaborated to explore ways to predict QoS

Page 3: Measuring Quality of Service on  Worker Node in Cluster

QoS – Definition

QoS defines how good a node is for a given task

QoS relates the execution times as follows:

$$\mathrm{QoS} = \frac{T_{noload}}{T_{execution}}$$

QoS varies between 0 and 1

$T_{execution}$ = wall-clock execution time of a task on the node under its current load

$T_{noload}$ = wall-clock execution time of the task on a given node without load

QoS = Quality of Service
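As a minimal sketch of the definition above, the ratio can be computed directly from two wall-clock timings. The function name `qos` and the cap at 1 are illustrative assumptions, not part of the presented work.

```python
def qos(t_noload: float, t_execution: float) -> float:
    """Return QoS = T_noload / T_execution for one task on one node."""
    if t_noload <= 0 or t_execution <= 0:
        raise ValueError("execution times must be positive")
    # Assumption: a loaded run is never faster than the unloaded run,
    # so timing noise that pushes the ratio above 1 is capped at 1.
    return min(1.0, t_noload / t_execution)
```

For example, a task that takes 30 s on an idle node and 120 s under load would get a QoS of 0.25 on that node.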

Page 4: Measuring Quality of Service on  Worker Node in Cluster

Methodology

Three task categories: CPU intensive, disk IO intensive, network IO intensive

Representative probe programs for each category

Load generating program for each category

Page 5: Measuring Quality of Service on  Worker Node in Cluster

Methodology

Monitor system metrics: load average, CPU utilization, memory utilization, disk utilization, swap utilization, etc.

Execute probe programs in different load conditions (generated using load generating programs)

Correlate probe execution time, system metrics and no load execution time of probe
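One simple way to carry out the correlation step is an ordinary least-squares fit of probe execution time against a single system metric. The sketch below is only an illustration: the helper `fit_line` and the sample values are hypothetical, and the slides do not state which fitting method was actually used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ intercept + slope * xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all metric values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical samples: metric value (e.g. load average) and the
# probe's wall-clock execution time in seconds at that load.
metric_values = [0.0, 1.0, 2.0, 4.0]
probe_times = [30.0, 60.0, 90.0, 150.0]

intercept, slope = fit_line(metric_values, probe_times)
# The intercept estimates the no-load execution time of the probe.
```

If the fitted slope is close to the intercept, execution time scales as the no-load time multiplied by (1 + metric), which is the shape a QoS of the form 1 / (1 + metric) would imply.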

Page 6: Measuring Quality of Service on  Worker Node in Cluster

Probe Selection

A probe should: represent real-world applications, have a short execution time, be non-interactive

Selected probes: Linpack for CPU intensive tasks, Bonnie for disk IO intensive tasks; network IO intensive tasks were not considered

Page 7: Measuring Quality of Service on  Worker Node in Cluster

Load Generating programs

A load generating program should: generate load in a given category, have a long execution time, provide a feature for varying the load

Two types of disk IO load: block IO (IO in large data blocks) and character IO (IO in small data blocks)
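The difference between the two disk IO load types can be sketched with a small write-load generator whose block size is the knob. This is not the load generator used in the study; the function names, block sizes and file name below are assumptions chosen for illustration.

```python
import os
import tempfile


def generate_write_load(path, total_bytes, block_size):
    """Write total_bytes to path in block_size chunks; return the write-call count."""
    if total_bytes < 0 or block_size <= 0:
        raise ValueError("total_bytes must be >= 0 and block_size > 0")
    chunk = b"\0" * block_size
    written = 0
    calls = 0
    # buffering=0 makes every write() a separate request to the OS.
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            n = min(block_size, total_bytes - written)
            written += f.write(chunk[:n])
            calls += 1
        os.fsync(f.fileno())  # push the data to the disk
    return calls


def compare_loads(total_bytes=4096, large_block=1024, small_block=1):
    """Write the same amount of data as 'block IO' and as 'character IO'."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "load.dat")  # hypothetical scratch file
        block_calls = generate_write_load(path, total_bytes, large_block)
        char_calls = generate_write_load(path, total_bytes, small_block)
    return block_calls, char_calls
```

Raising `total_bytes` lengthens the run, and changing the block size moves the load between the block IO and character IO regimes.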

Page 8: Measuring Quality of Service on  Worker Node in Cluster

SETUP

32 node cluster

Each node consists of: [email protected] GHz, 640 MB memory, 40 GB HDD, Redhat Linux version 7.3

EDG Fabric Monitoring System for gathering system metrics

Page 9: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe

CPU probe in different loading conditions

Correlation using load average

Execution time varies linearly with load average

Problem in block IO load

$$\mathrm{QoS} = \frac{1}{1 + \mathrm{LoadAverage}} \qquad \text{(Equation 1)}$$
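Reading Equation 1 as QoS = 1 / (1 + LoadAverage), it corresponds to an execution time that grows linearly with load average, T_execution = T_noload × (1 + LoadAverage). The sketch below applies that reading; the function names are illustrative assumptions.

```python
def qos_from_load_average(load_average: float) -> float:
    """Equation 1, read as QoS = 1 / (1 + LoadAverage)."""
    if load_average < 0:
        raise ValueError("load average cannot be negative")
    return 1.0 / (1.0 + load_average)


def predicted_execution_time(t_noload: float, qos: float) -> float:
    """Invert the QoS definition: T_execution = T_noload / QoS."""
    if not 0.0 < qos <= 1.0:
        raise ValueError("QoS must be in (0, 1]")
    return t_noload / qos
```

Under this reading, a probe that takes 30 s on an idle node would be predicted to take 120 s at a load average of 3.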

Page 10: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe

[Figure: Execution Time vs 1 Min. Load Average. X axis: 1 min. Load Average x 100. Series: CPU Load, Char IO Load, Block IO Load]

Page 11: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe

Load average represents combined CPU and IO load

CPU probe depends only on CPU load

Two ways to isolate the CPU load: use the average CPU load (VmStatR), or calculate the CPU available to the probe

Page 12: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe

Average CPU Load

1 minute running average of run queue

Called VmStatR

Predicted QoS will be

$$\mathrm{QoS} = \frac{1}{1 + \mathrm{VmStatR}} \qquad \text{(Equation 2)}$$
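A running average of the run queue can be kept with a fixed-size window of samples, and Equation 2 (read as QoS = 1 / (1 + VmStatR)) then applied to it. The sketch below is an assumption-laden illustration: the class name, the 5-second sampling interval and the way samples are obtained are not taken from the slides.

```python
from collections import deque


class RunQueueAverage:
    """Running average of run-queue length samples over a fixed time window."""

    def __init__(self, window_seconds: int = 60, sample_interval: int = 5):
        # Assumed sampling scheme: one sample every sample_interval seconds,
        # keeping only the samples that fall inside the window.
        self._samples = deque(maxlen=max(1, window_seconds // sample_interval))

    def add(self, run_queue_length: float) -> None:
        self._samples.append(run_queue_length)

    def value(self) -> float:
        if not self._samples:
            raise ValueError("no samples recorded yet")
        return sum(self._samples) / len(self._samples)


def qos_from_vmstat_r(vmstat_r: float) -> float:
    """Equation 2, read as QoS = 1 / (1 + VmStatR)."""
    if vmstat_r < 0:
        raise ValueError("VmStatR cannot be negative")
    return 1.0 / (1.0 + vmstat_r)
```

Each sample would typically be the number of runnable processes at that instant, for example the `r` column reported by `vmstat`.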

Page 13: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe

[Figure: Execution Time vs VmStatR. X axis: VmStatR x 100. Series: CPU Load, Char IO Load, Block IO Load]

Page 14: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe

Available CPU to probe

Calculate using CPU utilization metric

Probe is eligible for: the available idle time, and a share of the system and user time

$$\mathrm{QoS} = \frac{\mathrm{IdleTime} + \dfrac{\mathrm{UserTime}}{1 + \mathrm{VmStatR}} + \dfrac{\mathrm{SystemTime}}{1 + \mathrm{VmStatR}}}{100} \qquad \text{(Equation 3)}$$
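The following sketch evaluates Equation 3 under the assumption that it has the form QoS = (IdleTime + UserTime / (1 + VmStatR) + SystemTime / (1 + VmStatR)) / 100, with all three times given as percentages of CPU time. The function name and the input checks are illustrative.

```python
def qos_from_cpu_share(idle_pct: float, user_pct: float,
                       system_pct: float, vmstat_r: float) -> float:
    """Fraction of the CPU a new probe could expect (assumed form of Equation 3)."""
    if vmstat_r < 0:
        raise ValueError("VmStatR cannot be negative")
    if min(idle_pct, user_pct, system_pct) < 0:
        raise ValueError("CPU percentages cannot be negative")
    if idle_pct + user_pct + system_pct > 100.0 + 1e-9:
        raise ValueError("CPU percentages cannot add up to more than 100")
    # The probe gets all idle time plus an equal share of the busy time,
    # competing with the VmStatR processes already in the run queue.
    busy_share = (user_pct + system_pct) / (1.0 + vmstat_r)
    return (idle_pct + busy_share) / 100.0
```

With no idle time and user plus system time at 100%, this reduces to 1 / (1 + VmStatR), which is the form of Equation 2.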

Page 15: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe

The table compares the QoS predicted using Equations 1 and 3 under block IO load

QoS from Equation 3 shows the correct characteristic: it stays nearly constant, like the measured execution time, while the Equation 1 value keeps falling

| QoS using Equation 1 | QoS using Equation 3 | Execution Time (Sec) |
|---|---|---|
| 0.2433 | 0.4300487 | 32 |
| 0.1605 | 0.4375441 | 31 |
| 0.1329 | 0.4624468 | 32 |
| 0.1136 | 0.415 | 30 |
| 0.1042 | 0.4536079 | 31 |
| 0.0952 | 0.4290476 | 30 |
| 0.0869 | 0.4430435 | 31 |

Page 16: Measuring Quality of Service on  Worker Node in Cluster

Comparison of results

Compare the QoS results obtained using the three equations for CPU probe in different loads

Equation 1 does not give correct prediction in block IO load conditions

Equations 2 & 3 give acceptable results in any load condition

Page 17: Measuring Quality of Service on  Worker Node in Cluster

CPU Probe – Comparison of results

[Figure: Comparison of the Measured and Predicted Exec Time for CPU Probe. Series: Measured Exec Time, Equation 1, Equation 2, Equation 3. Load conditions: LC, LC + LB, LC + LCh, LCh + LB]

LC – CPU Load

LC + LB – CPU + Block IO Load

LC + LCh – CPU + Character IO Load

LCh + LB – Character + Block IO Load

Page 18: Measuring Quality of Service on  Worker Node in Cluster

Disk IO Probe

Modified ‘Bonnie’ to act as both a block IO probe and a character IO probe

Considered only the block IO probe, as most of the applications were block IO intensive

Correlated the execution time of the probe under different loading conditions

Predicted QoS using the three equations and compared the results

Page 19: Measuring Quality of Service on  Worker Node in Cluster

Disk IO Probe – Comparison of results

[Figure: Comparison of Measured and Predicted Execution Time of Block IO Probe. Series: Measured Exec Time, Equation 1, Equation 2, Equation 3. Load conditions: LC, LC + LB, LC + LCh, LCh + LB]

LC – CPU Load

LC + LB – CPU + Block IO Load

LC + LCh – CPU + Character IO Load

LCh + LB – Character + Block IO Load

Page 20: Measuring Quality of Service on  Worker Node in Cluster

CMSIM Results

Predicted execution time using QoS from Equation 2

The % error against the measured execution time is acceptable (under 4.5% in magnitude for every run)

| Measured Execution Time (Sec) | Predicted Execution Time (Sec) | % Error |
|---|---|---|
| 585 | 610.8687 | 4.422 |
| 739 | 744.3209 | 0.720017 |
| 929 | 934.466 | 0.588377 |
| 1082 | 1080.702 | -0.11999 |
| 1230 | 1216.43 | -1.10328 |
| 1413 | 1381.166 | -2.25294 |
| 1687 | 1707.317 | 1.204332 |
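The % Error column is consistent with (predicted − measured) / measured × 100. The sketch below recomputes it from the measured and predicted times listed in the table; the function and variable names are illustrative.

```python
def percent_error(measured: float, predicted: float) -> float:
    """Signed error of the prediction relative to the measurement, in percent."""
    if measured <= 0:
        raise ValueError("measured time must be positive")
    return (predicted - measured) / measured * 100.0


# (measured, predicted) execution times in seconds, taken from the table.
cmsim_runs = [
    (585, 610.8687),
    (739, 744.3209),
    (929, 934.466),
    (1082, 1080.702),
    (1230, 1216.43),
    (1413, 1381.166),
    (1687, 1707.317),
]

errors = [percent_error(m, p) for m, p in cmsim_runs]
worst = max(abs(e) for e in errors)  # largest error magnitude over all runs
```

A positive value means the prediction overestimated the execution time; a negative value means it underestimated it.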

Page 21: Measuring Quality of Service on  Worker Node in Cluster

Problem Areas

Effect of swapping

  If the available memory is less than the size of the task, the Linux kernel dynamically changes the priorities of tasks and swaps tasks accordingly

  Difficult to predict QoS

Page 22: Measuring Quality of Service on  Worker Node in Cluster

Problem Areas – Swapping

[Figure: Variation of Memory, Swap and Exec Time. X axis: Sample No. (1 to 14). Series: % Used Memory, % Used Swap, Exec Time]

Page 23: Measuring Quality of Service on  Worker Node in Cluster

Problem Areas

Metric sampling frequency of monitoring system

  Immediate metric value ensures better QoS prediction

  At a higher sampling frequency, monitoring itself loads the node

Change in state after submission of task

  QoS can’t consider load changes after submission of task

  Submission or removal of other tasks may change QoS

Page 24: Measuring Quality of Service on  Worker Node in Cluster

Conclusion

Equations 2 & 3 provide better QoS prediction for CPU bound applications

Equation 1 can be used for IO bound applications

Successfully predicted execution time for CMSIM, which is a mostly CPU bound job

Load balancing programs can use the derived equations for job submissions
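As a sketch of how a load balancer might use the equations, the snippet below ranks nodes by predicted QoS for a CPU bound job and picks the best one. It assumes Equation 2 in the form QoS = 1 / (1 + VmStatR); the function name, host names and metric values are hypothetical.

```python
def pick_node(vmstat_r_by_node: dict) -> tuple:
    """Return (node, predicted QoS) for the node with the highest predicted QoS."""
    if not vmstat_r_by_node:
        raise ValueError("no candidate nodes")
    # Assumed form of Equation 2 for a CPU bound job.
    qos_by_node = {
        node: 1.0 / (1.0 + vmstat_r)
        for node, vmstat_r in vmstat_r_by_node.items()
    }
    best = max(qos_by_node, key=qos_by_node.get)
    return best, qos_by_node[best]


# Hypothetical snapshot of VmStatR reported by the monitoring system.
snapshot = {"node01": 2.0, "node02": 0.5, "node03": 1.0}
best_node, best_qos = pick_node(snapshot)
```

Because the prediction uses the metric values at submission time, it shares the limitation noted under Problem Areas: load changes after submission are not reflected.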

Page 25: Measuring Quality of Service on  Worker Node in Cluster


Thanks