Studying the RJMS, applications and File System triptych: a first step toward an experimental approach
Joseph Emeras, Yiannis Georgiou, Olivier Richard, Philippe Deniel
November 22-24, 2010
Outline

1. Context (Thesis, Petascale)
2. Experimenting (Performance Evaluation, Experimental Methodology, Tools, Steps)
3. Bonus
PhD Thesis Context
Study large computing infrastructures:
- study their performance
- understand their behaviour
- comprehend their evolution

Consider them at a global level:
- Resource and Job Management System (RJMS)
- I/O: Distributed File System, Storage Backup, Network, ...
- Applications
PhD Thesis Context
Proposal
- an experimental approach
- traces (real or synthetic) used to evaluate the system's performance
- a dedicated performance evaluation framework
PhD Thesis Context
Why consider the I/Os?
- Generally, the RJMS and the I/Os are studied separately
- In HPC, the File System / Network is a nerve center (cf. NERSC's article [UAUS10])
- Studying them separately gives an incomplete view: jobs in the RJMS do I/O operations, and depending on a job's I/O pattern, performance is altered for:
  - the job itself
  - other jobs
  - the overall system
- So, a global view in performance evaluation must consider both the RJMS and the I/Os
PhD Thesis Context: Questions
How to evaluate a large computing infrastructure?
- My organization has such an infrastructure
  - Nice, but can I get the whole platform for my experiments?
  - If so, how frequently?
  - I cannot spend too much time monopolizing the platform
- My organization doesn't have enough resources for that
  - What to do then?
Petascale Context
Large scale computing resources
Goals
- Study the architecture as a whole:
  - physical infrastructure
  - software
  - users
- Understand the global behavior to diagnose problems
Petascale Context
Constraints
- thousands of compute nodes to manage
- high occupation rate of the resources
- failures
- I/O congestion: FS / network

A global study of the platform will consider:
- the Resource and Job Management System
- the (Distributed) File System
- the users' applications

Need
- having some knowledge of the applications' requirements
- knowing how they react to I/O variation (in terms of performance)
Performance Evaluation
Challenges when evaluating a platform’s performance
- The study of HPC systems depends on a large number of parameters and conditions
- Need: ease experimentation (cf. Kameleon, below)
  - reproducibility of the results
  - recreating the experiment's environment
- How to do the experiment?
- How to evaluate per se?
- Analysis: log the results, the experiment's conditions, the environment setup and parameters
Preliminary results
Platform for conducting experiments

Kameleon tool: recipes to recreate a software environment
- RJMS: OAR, SLURM
- File System: Lustre, NFS
- Evaluation tools: ESP, Xionee
- Export to several output formats: KVM, Grid'5000
How to do the experiment?
Evaluating scalability
How to evaluate large scale performance?
- Not enough physical resources for that
- Use emulation to virtually enlarge our infrastructure
- Keep in mind: evaluating the noise generated by the experiments
- ...but also the noise introduced by the emulation itself (performance loss); see the sketch below
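As a rough illustration of how such emulation noise could be quantified, here is a minimal sketch in Python; run_benchmark is a placeholder workload and all numbers are hypothetical. The idea is simply to repeat the same benchmark on the physical nodes and again inside the emulated cluster, then compare means and spread.

    # Hypothetical sketch: estimate the "noise" an emulation layer adds to
    # a benchmark by comparing repeated runs on physical and emulated nodes.
    import statistics
    import time

    def run_benchmark():
        # Placeholder workload; replace with a real kernel (e.g. an IOR run).
        return sum(i * i for i in range(10**6))

    def sample_runtimes(n_runs=10):
        times = []
        for _ in range(n_runs):
            start = time.perf_counter()
            run_benchmark()
            times.append(time.perf_counter() - start)
        return times

    physical = sample_runtimes()   # collected on the physical nodes
    emulated = sample_runtimes()   # collected again inside the emulated cluster

    overhead = statistics.mean(emulated) / statistics.mean(physical) - 1.0
    jitter = statistics.stdev(emulated) - statistics.stdev(physical)
    print(f"mean overhead: {overhead:+.1%}, extra jitter: {jitter:.4f}s")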
Proposed Experimental Methodology for Performance Evaluation

- Start from a physical infrastructure
  - Optional: emulate to make it bigger and evaluate scalability
- Inject load on the RJMS and File System
  - from real-life experiments: traces
  - from benchmark patterns: synthetic workloads
- Evaluate the system's load and responsiveness
- Log the experiment's conditions and results in a "notebook" (like physicists do); see the sketch below
- Analyze and compare with other experiments
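A minimal sketch of such a notebook, assuming a simple JSON-lines file; every field name and value below is illustrative, not a fixed schema:

    # Append one JSON record per experiment, capturing parameters,
    # environment and results so that runs can be compared later.
    import json
    import platform
    import time

    def log_experiment(notebook_path, params, results):
        record = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "host": platform.node(),
            "kernel": platform.release(),
            "params": params,        # e.g. RJMS name, FS type, node count
            "results": results,      # e.g. ESP efficiency, I/O bandwidth
        }
        with open(notebook_path, "a") as notebook:
            notebook.write(json.dumps(record) + "\n")

    log_experiment("notebook.jsonl",
                   {"rjms": "OAR", "fs": "Lustre", "nodes": 64},
                   {"esp_efficiency": 0.83})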
Modelling the Workload
Job workload modelling
- Performance evaluation by executing a sequence of jobs
- Two common ways to use a workload for system evaluation:
  - either a workload log (like the Standard Workload Format, SWF)
  - or a workload model (a synthetic workload like the ESP benchmark [Kra08])

File System activity modelling
- Performance evaluation by executing an I/O pattern
- Same possibilities:
  - either an I/O log
  - or an I/O model (patterns from I/O benchmarks like IOR [SAS08])

Modelling both the job and FS workloads: difficulties
- correlating job workload logs with I/O logs (see the sketch below)
- smartly mixing patterns
- stage-in mechanisms, application checkpoints
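To illustrate the correlation difficulty, here is a minimal sketch under assumed, hypothetical job-log and I/O-log formats: I/O records are attributed to jobs by overlapping time windows, and a record falling inside two concurrent jobs (or none) shows exactly where the ambiguity lies.

    jobs = [  # (job_id, start_time, end_time), as derived from an RJMS log
        (2488, 4444343, 4453057),
        (2490, 4447935, 4448569),
    ]
    io_events = [  # (timestamp, bytes_written), as derived from an FS log
        (4450000, 1 << 20),  # inside job 2488 only
        (4448000, 4 << 20),  # inside 2488 AND 2490: ambiguous attribution
        (4470000, 2 << 20),  # inside no job at all: unattributable
    ]

    def attribute_io(jobs, io_events):
        """Naively credit each I/O record to the first job whose window contains it."""
        per_job = {job_id: 0 for job_id, _, _ in jobs}
        for ts, nbytes in io_events:
            for job_id, start, end in jobs:
                if start <= ts <= end:
                    per_job[job_id] += nbytes
                    break  # the concurrent job never gets credited
        return per_job

    print(attribute_io(jobs, io_events))  # {2488: 5242880, 2490: 0}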
Tools for the experiments
Grid’5000
- INRIA-CNRS
- 10 sites: 9 in France, 1 in Brazil
- more than 5000 cores
- highly reconfigurable experimental platform

TGCC
- CEA
- equivalent to CEA's Tera100 (military, classified)
- 60 MW
- petaflop-class
- 7774 square yards of building
Experiments: first step and later
First Step
- experiment separately for reference:
  - benchmark several RJMSs with ESP2
  - benchmark several Distributed File Systems with IOR, GoFilebench, IOzone
- mix ESP2 jobs with FS benchmark I/O patterns (see the sketch after this list)
- compare

What next
- use real RJMS and I/O traces: correlation?
- from job patterns to job workload patterns
- use emulation for scalability testing
- compare emulated vs. physical results
- evaluate the experiment's "noise" in the different cases:
  - emulated or not
  - RJMS vs. I/O
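A hedged sketch of what "mixing" an ESP-style job with an FS benchmark I/O pattern could look like: each synthetic job alternates a compute phase with a bulk-write phase. The durations, sizes and file path are illustrative only, not part of ESP2 or IOR.

    import os
    import time

    def synthetic_job(compute_s=5.0, io_bytes=64 << 20, path="/tmp/espjob.dat"):
        deadline = time.time() + compute_s
        while time.time() < deadline:        # compute phase (busy-loop stand-in)
            sum(i * i for i in range(10**4))
        with open(path, "wb") as f:          # I/O phase (IOR-like bulk write)
            f.write(os.urandom(io_bytes))
            f.flush()
            os.fsync(f.fileno())             # force the write to reach the FS
        os.remove(path)

    synthetic_job()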
Roadmap
Goal
Identify the problems one can encounter when doing an experimental study in this domain.

- Experiment with synthetic traces (ESP, FS benchmarks): in progress
- Tool to help reconstruct experiment environments: done
- Correlate FS traces with job traces: in progress
- Get some complete real traces: not done yet
- Play with emulation: in progress
- Performance evaluation framework: not done yet
- Follow the idea of [UAUS10] to partition the bandwidth and dynamically reserve resources: not done yet
- ...
Thank you for your attention
Questions / Comments ?
Bibliography
[Kra08] William T.C. Kramer. PERCU: A Holistic Method for Evaluating High Performance Computing Systems. PhD thesis, EECS Department, University of California, Berkeley, November 2008.

[SAS08] Hongzhang Shan, Katie Antypas, and John Shalf. Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark. In SC, page 42, 2008.

[UAUS10] Andrew Uselton, Katie Antypas, Daniela Ushizima, and Jeffrey Sukharev. File system monitoring as a window into user I/O requirements. In CUG, 2010.
Experimenting
How to ensure our experiments are valuable?
Figure: Performance Results
Reproducibility in experiments
Reproducibility
- of the results
  - need to reproduce the experiments

Experiment
- One definition: "A test under controlled conditions that is made to demonstrate a known truth, examine the validity of a hypothesis, or determine the efficacy of something previously untried." (http://www.thefreedictionary.com/experiment)
Reproducibility in experiments
Experiment
- Observe the experiment's conditions:
  - the object (experiment plan: parameters, configuration...)
  - and its environment: software and hardware
- Need a complete environment
- The experiment's conditions determine the experiment itself
- Reproducibility: need to recreate the environment (at least the software)

⇒ Need an environment model!
Experiment reconstructability
The experiment depends on its environment.

Reconstructability
- Replay the experiments under the same conditions
- Recreate the experiment's environment:
  - Hardware: hard
  - Software: feasible

Software environment
Software evolves over time:
- software versions
- configurations
Kameleon
A Tool to Generate Software Appliances
Kameleon: a tool to generate software appliances
A tool that generates software appliances from an environment description:
- a Recipe (high level)
- Steps (mid and low level)

One can combine these to:
- model the software environment: recreate it ad libitum
- describe everything: keep a log of what has been done

Goals:
- simple
- easy to handle
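As a flavour of what a recipe expresses, here is a hedged sketch (Python emitting YAML, since Kameleon recipes are YAML documents); the step names and keys below are illustrative, not Kameleon's actual step library:

    # Hypothetical sketch of a Kameleon-style recipe: a high-level recipe
    # referencing macrosteps, serialised to YAML.
    import yaml  # pip install pyyaml

    recipe = {
        "global": {"distrib": "debian", "output_format": "kvm"},
        "steps": [
            "bootstrap",                             # build the base system
            "software_install",                      # e.g. OAR or SLURM packages
            {"oar_config": {"server": "frontend"}},  # parameterised macrostep
            "save_appliance",                        # export the appliance
        ],
    }
    print(yaml.safe_dump(recipe, sort_keys=False))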
Figure: Kameleon basics

Figure: Recipes and Steps
Kameleon history
- Initially: create appliances for OAR testing
- 2nd step: make it more generic
- 3rd step: provide basic steps
- 4th step: provide specific steps (G5K, OAR, SLURM, ESP2, Xionee...), in progress
- 5th step: world contamination...
Kameleon cool features
- Checkpointing
- Embedded shell
- Idempotent (depends on the command itself)
- Stand-alone
- Several abstraction levels (Recipes, Macrosteps, Microsteps)
- Simple but powerful
Kameleon more technical features: commands
- File operations: append, create
- Macrostep dependencies
- Breakpoints
- Check for the presence of a command
- Execution of a command
Kameleon Bonus
Friendliness: embedded shell
- history
- colored output
- manual step execution
- retry on error
- environment variables

Hooks
- customizable clean-up
Kameleon Bonus
YAML powered
- indentation
- code readability

Provided ingredients
- an appliance with Kameleon preinstalled
- default recipes/steps to use and play with:
  - Debian
  - Redhat
  - Grid5000
Appendix
Standard Workload Format
Definition of the SWF format, describing the execution of a sequence of jobs.

; Computer: Linux cluster (Atlas)
; Installation: High Performance Computing - Lawrence Livermore National Laboratory
; MaxJobs: 60332
; MaxRecords: 60332
; Preemption: No
; UnixStartTime: 1163199901
; StartTime: Fri Nov 10 15:05:01 PST 2006
; EndTime: Fri Jun 29 14:02:41 PDT 2007
; MaxNodes: 1152
; MaxProcs: 9216
; Note: Scheduler is Slurm (https://computing.llnl.gov/linux/slurm/)
; MaxPartitions: 4
; Partition: 1 pbatch
; Partition: 2 pdebug
; Partition: 3 pbroke
; Partition: 4 moody20
; Fields (18 per record): job number, submit time, wait time, run time,
; allocated processors, average CPU time used (sec), used memory (Kb),
; requested processors, requested time, requested memory, status, user ID,
; group ID, executable number, queue number, partition number,
; preceding job number, think time
2488 4444343 0 8714 1024 -1 -1 1024 -1 -1 0 17 -1 307 -1 1 -1 -1
2489 4444343 0 103897 1024 -1 -1 1024 -1 -1 1 17 -1 309 -1 1 -1 -1
2490 4447935 0 634 2336 -1 -1 2336 10800 -1 1 3 -1 5 -1 1 -1 -1
2491 4448583 0 792 2336 -1 -1 2336 10800 -1 1 3 -1 5 -1 1 -1 -1
2492 4449388 0 284 2336 -1 -1 2336 10800 -1 0 3 -1 5 -1 1 -1 -1
→ But the SWF lacks information about I/O: file system, network
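Reading such a trace is straightforward; here is a minimal sketch. The field names follow the SWF definition (-1 means "not available"); "atlas.swf" is a hypothetical file name.

    # Minimal SWF reader: skip header comments (lines starting with ';')
    # and map the 18 whitespace-separated fields of each record.
    SWF_FIELDS = [
        "job_id", "submit_time", "wait_time", "run_time", "alloc_procs",
        "avg_cpu_time", "used_memory", "req_procs", "req_time", "req_memory",
        "status", "user_id", "group_id", "executable", "queue",
        "partition", "preceding_job", "think_time",
    ]

    def read_swf(path):
        with open(path) as trace:
            for line in trace:
                line = line.strip()
                if not line or line.startswith(";"):
                    continue
                values = [int(v) for v in line.split()]
                yield dict(zip(SWF_FIELDS, values))

    for job in read_swf("atlas.swf"):
        print(job["job_id"], job["run_time"], job["alloc_procs"])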
ESP2 Benchmark
- provides a quantitative evaluation of launching and scheduling via a single metric: time
- is completely independent from the hardware performance
- allows scalability evaluation of the RJMS
ESP Efficiency = Theoretic Duration / Measured Duration    (1)
Job Type | Fraction of total system size | Job size on a 512-core cluster (cores) | Number of job instances | Target run time (s)
A        | 0.03125                       | 16                                     | 75                      | 267
B        | 0.06250                       | 32                                     | 9                       | 322
...      | ...                           | ...                                    | ...                     | ...
L        | 0.12500                       | 64                                     | 36                      | 366
M        | 0.25000                       | 128                                    | 15                      | 187
Z        | 1.00000                       | 512                                    | 2                       | 100
Total    |                               |                                        | 230                     |

Table: ESP2 benchmark [Kra08] characteristics for a 512-core cluster
Courtesy of Yiannis Georgiou
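As a worked example of formula (1): the theoretic duration is the total core-seconds of the workload divided by the system size. The sketch below uses only the job types visible in the abridged table above (the full benchmark runs 230 jobs over 14 types), and the measured duration is hypothetical.

    # Each tuple: (job type, cores, instance count, target run time in s).
    JOB_TYPES = [
        ("A", 16, 75, 267),
        ("B", 32, 9, 322),
        ("L", 64, 36, 366),
        ("M", 128, 15, 187),
        ("Z", 512, 2, 100),
    ]
    SYSTEM_CORES = 512

    # Theoretic duration: total core-seconds at 100% utilisation.
    theoretic = sum(cores * count * runtime
                    for _, cores, count, runtime in JOB_TYPES) / SYSTEM_CORES

    measured = 4500.0  # hypothetical wall-clock duration of the run, seconds
    print(f"ESP efficiency = {theoretic / measured:.2f}")  # ~0.75 here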
ESP2 Benchmark
Figure: ESP2 benchmark for SLURM RJMS with 64 resources, no Z jobs (partition size utilization over time, with job start time impulses)
Figure: ESP2 benchmark for OAR RJMS with 64 resources, no Z jobs (partition size utilization over time, with job start time impulses)
Figure: ESP2 benchmark for OAR RJMS with 64 virtual (Xen) resources, no Z jobs (16 Xen virtual machines with 4 cores each, running on 4 bi-dual-core hosts)
Comparison between SLURM, OAR, and OAR in a virtual cluster: 64 resources
ESP2 Benchmark
Figure: ESP2 benchmark for SLURM RJMS with 512 resources (system utilization for the ESP synthetic workload, 512 cores)
Figure: ESP2 benchmark for OAR RJMS with 512 resources (64 nodes, bi-CPU/quad-core)
Figure: ESP2 benchmark for OAR RJMS with 512 virtual (Xen) resources (64 Xen virtual machines with 8 cores each, running on 8 bi-quad-core hosts)
Figure: ESP2 benchmark for OAR RJMS with 512 virtual (Xen) resources, OCaml scheduler (64 Xen virtual machines with 8 cores each, running on 8 bi-quad-core hosts)
Comparison between SLURM and OAR: 512 resources
[UAUS10]
In [UAUS10], NERSC experimented on their Cray XT4 with statically partitioning the File System bandwidth in two parts:
- one dedicated to "big I/O" users
- one for the "regular" users
This led to improved overall performance and decreased variability in I/O performance.