Copyright 2012 FUJITSU LIMITED
Technical Computing Suite Job Management Software
PRIMERGY x86 cluster
Supercomputer PRIMEHPC FX10
Toshiaki Mikamo
Fujitsu Limited
Outline
System Configuration and Software Stack
Features
The major functions of the job scheduler
• Efficient Resource Usage
• Fair Share Scheduling
• System-optimal Resource Assignment
Summary and Future
Hybrid System Configuration
[Figure: hybrid system configuration under Technical Computing Suite, combining the Supercomputer PRIMEHPC FX10 (6D mesh/torus interconnect, Tofu) and the PRIMERGY x86 cluster (fat-tree interconnect, InfiniBand).]
Login nodes (User)
• Login • Compilation • Job submission
Management nodes (Administrator)
• System operations management • Job operations management
Control nodes
Job management nodes
File management nodes
Global file system (Data storage area)
Local file system (Temporary area occupied by jobs)
IO network (IB), management network (GbE)
System Software Stack
User/ISV Applications
HPC Portal / System Management Portal
System operations management
• System configuration management
• System control
• System monitoring
• System installation & operation
Job operations management
• Job manager
• Job scheduler
• Resource management
• Parallel execution environment
High-performance file system
• Lustre-based distributed file system
• High scalability
• IO bandwidth guarantee
• High reliability & availability
VISIMPACT™
• Shared L2 cache on a chip
• Hardware intra-processor synchronization
Compilers
• Hybrid parallel programming
• Sector cache support
• SIMD / Register file extensions
Support Tools
• IDE
• Profiler & Tuning tools
• Interactive debugger
MPI Library
• Scalability of High-Func. Barrier Comm.
Operating system
• Supercomputer PRIMEHPC FX10: Linux-based enhanced Operating System
• PRIMERGY x86 cluster: Red Hat Enterprise Linux
Features
• Same job operations on FX10 and PRIMERGY
• Efficient, fair and system-optimal job scheduling (see the following slides for details)
• Resource / Access control
  • Elapsed time limit / CPU time limit / Physical memory limit
  • Enable / Disable execute permission of job operation commands
• Reduce OS jitter / Power saving control
• Job statistical information
  • The amount of CPU time / Memory / IO
  • SIMD rate / MIPS / MFLOPS
Job Scheduler
We renewed our job scheduler for large-scale systems.
Our job scheduler features:
• Multi-process: multiple schedulers can coexist in a cluster.
• Multi-thread: the load of scheduling can be balanced.
Efficient Resource Usage
Backfill scheduling for keeping the resources busy
• Our scheduler manages both space (compute nodes) and time.
• It backfills low-priority jobs only where they do not delay high-priority jobs.
[Figure: schedules of a running job and Jobs A, B, C and D over time (Now, t1, t2, t3), comparing the not-backfilled and the backfilled case.]
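The backfill idea above can be illustrated with a minimal sketch. It is not Fujitsu's implementation: it assumes a simple "conservative backfill" model in which jobs are placed in priority order at their earliest feasible start time, using the elapsed time limit as the planned run time. The names `Job`, `free_nodes` and `schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int   # number of compute nodes requested
    limit: int   # elapsed time limit, used here as the planned run time


def free_nodes(total, placed, t0, t1):
    """Minimum number of free nodes during [t0, t1).

    placed holds (start, end, nodes) reservations. Usage can only rise at a
    start time, so checking t0 and every start inside the interval suffices.
    """
    points = {t0} | {s for s, _, _ in placed if t0 < s < t1}
    return min(total - sum(n for s, e, n in placed if s <= p < e)
               for p in points)


def schedule(total, running, queue):
    """Place queued jobs in priority order (highest first).

    running: list of (end_time, nodes) for jobs already running at time 0.
    Returns {job name: start time}. A job larger than the system is skipped.
    """
    placed = [(0, end, n) for end, n in running]
    starts = {}
    for job in queue:
        # Candidate start times: now, or any moment a reservation ends.
        for t in sorted({0} | {e for _, e, _ in placed}):
            if free_nodes(total, placed, t, t + job.limit) >= job.nodes:
                placed.append((t, t + job.limit, job.nodes))
                starts[job.name] = t
                break
    return starts


queue = [Job("A", 8, 3), Job("B", 4, 3), Job("C", 4, 6)]
print(schedule(8, [(4, 4)], queue))
```

In this example, on 8 nodes with 4 nodes busy until time 4, Job A must wait for the whole system, Job B is backfilled into the idle nodes because it finishes before A starts, and Job C is not backfilled because it would overlap A's reservation.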
Fair Share Scheduling
Fairly share resources between users/groups based on past usage.
① Fair share value is issued in advance for each user/group.
② The value is changed by the result of resource usage.
③ The job execution priority is determined dynamically according to the value.
Fair share value is like money.
[Figure: fair share value (money) over time, showing Payment, Return of overpaid and Deposit.]
Payment [P] = (#Node allocated) x (Elapsed time limit of job)
Deposit [D] = (Elapsed time) x (Recovery rate)
Return of overpaid [R] = P - ((#Node allocated) x (Actual elapsed time of job))
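The three formulas can be sketched as a small account object. This is only an illustration under stated assumptions: the class `FairShareAccount`, the moment at which the payment is charged (job start), and the rule that a larger value means a higher priority are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class FairShareAccount:
    value: float          # current fair share value ("money")
    recovery_rate: float  # value recovered per unit of elapsed time

    def pay(self, nodes, time_limit):
        """Payment P = nodes x elapsed time limit (assumed charged at job start)."""
        payment = nodes * time_limit
        self.value -= payment
        return payment

    def refund(self, payment, nodes, actual_elapsed):
        """Return of overpaid R = P - nodes x actual elapsed time."""
        returned = payment - nodes * actual_elapsed
        self.value += returned
        return returned

    def deposit(self, elapsed):
        """Deposit D = elapsed time x recovery rate."""
        self.value += elapsed * self.recovery_rate


accounts = {"alice": FairShareAccount(1000, 2), "bob": FairShareAccount(1000, 2)}

p = accounts["alice"].pay(nodes=10, time_limit=60)            # P = 600
accounts["alice"].refund(p, nodes=10, actual_elapsed=40)      # R = 200
for acct in accounts.values():
    acct.deposit(elapsed=30)                                  # D = 60 each

# Assumption: a larger remaining value gives a higher execution priority.
order = sorted(accounts, key=lambda u: accounts[u].value, reverse=True)
print(order, accounts["alice"].value, accounts["bob"].value)
```

Here alice ends with 660 and bob with 1060, so bob's next job would be ordered first under the assumed priority rule.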
Optimal Job Scheduling for FX10
Interconnect topology-aware resource assignment
• One interconnect unit: 12 nodes (2 x 3 x 2)
• Job assignment rule: rectangular solid shape
  • Guaranteeing neighbor communication
  • Avoiding interference with other jobs
• Rotates the rectangular solid of interconnect units to reduce fragmentation
[Figure: in-use and unoccupied regions of the interconnect on the x, y and z axes, with blocks labeled 4, 6 and 8.]
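The rectangular-solid rule and the rotation step can be sketched as a first-fit search over a 3D grid. This is a simplified illustration, not the FX10 allocator: it assumes a plain mesh without torus wraparound, treats every grid cell as one allocatable unit, and the function names are hypothetical.

```python
from itertools import permutations, product


def box_cells(origin, shape):
    """All (x, y, z) cells of a rectangular solid with the given origin and shape."""
    return {tuple(o + d for o, d in zip(origin, delta))
            for delta in product(*(range(s) for s in shape))}


def find_placement(grid, occupied, shape, rotate=True):
    """First-fit search for a free rectangular solid.

    grid: (X, Y, Z) size of the mesh; occupied: set of in-use cells;
    shape: requested (a, b, c). With rotate=True every axis permutation of
    the shape is tried. Returns (origin, shape_used) or None.
    """
    rotations = sorted(set(permutations(shape))) if rotate else [tuple(shape)]
    for rot in rotations:
        # An empty range (shape larger than the grid) yields no origins.
        ranges = [range(g - r + 1) for g, r in zip(grid, rot)]
        for origin in product(*ranges):
            if not box_cells(origin, rot) & occupied:
                return origin, rot
    return None


grid = (4, 3, 2)
occupied = box_cells((0, 0, 0), (2, 3, 2))   # an existing job fills half the mesh

print(find_placement(grid, occupied, (3, 2, 2), rotate=False))
print(find_placement(grid, occupied, (3, 2, 2), rotate=True))
```

In this example the request for a 3 x 2 x 2 solid cannot be placed as given, but its rotation to 2 x 3 x 2 fits exactly into the remaining free region, which is the fragmentation benefit the slide refers to.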
Optimal Job Scheduling for FX10
Asynchronous file staging
• Stage IN: asynchronously transfer files from the global to the local file system before the job starts.
• Stage OUT: asynchronously transfer files from the local to the global file system after the job ends.
• Co-scheduling of computation and file transfer.
[Figure: PRIMEHPC FX10 compute nodes and IO nodes on the interconnect with the local file system, connected to the login nodes and the global file system (data storage area) through the IO network (IB) and the management network (GbE). A timeline (Now, t1, t2, t3) shows asynchronous Stage IN and Stage OUT alongside the running job and Jobs A, B and C.]
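The effect of co-scheduling computation and file transfer can be estimated with a small timeline model. It is a sketch under strong assumptions that are not in the slides: jobs run back to back on the same compute nodes, stage-ins are serialized on one transfer channel, and stage-outs never contend with other transfers. The function names are hypothetical.

```python
def synchronous_makespan(jobs):
    """Total time when each job stages in, runs and stages out in sequence."""
    return sum(stage_in + run + stage_out for stage_in, run, stage_out in jobs)


def asynchronous_makespan(jobs):
    """Total time when staging overlaps with the computation of other jobs."""
    stage_in_done = 0   # time at which the latest stage-in finished
    compute_free = 0    # time at which the compute nodes become free
    finish = 0          # time at which everything, including stage-out, is done
    for stage_in, run, stage_out in jobs:
        stage_in_done += stage_in                 # serialized stage-ins (assumption)
        start = max(stage_in_done, compute_free)  # needs both input data and nodes
        compute_free = start + run
        finish = max(finish, compute_free + stage_out)  # stage-out overlaps later runs
    return finish


# (stage-in, run, stage-out) durations in arbitrary time units
jobs = [(2, 10, 3), (4, 8, 2), (1, 5, 1)]
print(synchronous_makespan(jobs), asynchronous_makespan(jobs))
```

With these illustrative durations, only the first stage-in and the last stage-out remain on the critical path in the asynchronous case.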
Optimal Job Scheduling for PRIMERGY
Fine-grained node assignment
• Node selection method: balancing / concentration
• Rank placement policy: pack / unpack
• Priority control of allocated nodes
• Execution mode: whether a node is occupied by a job or not
Strict core assignment
• Processes are bound to cores in the job territory.
• No process can move to cores in another job's territory.
[Figure: node concentration of Jobs A to D on Node#0 to Node#2; rank pack and rank unpack placement of ranks R0 and R1 on nodes; cores 0 to 7 of a node divided between Job A and Job B, with processes (P) bound to cores.]
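The difference between the two rank placement policies can be sketched as follows. The slides do not define the policies precisely, so this assumes that "pack" fills one node with consecutive ranks before using the next and that "unpack" deals ranks to nodes in round-robin order. The function `place_ranks` and its parameters are hypothetical.

```python
def place_ranks(num_ranks, nodes, cores_per_node, policy):
    """Return a list mapping each rank index to a node.

    policy "pack":   consecutive ranks share a node until it is full (assumed meaning).
    policy "unpack": ranks are spread over the nodes round-robin (assumed meaning).
    """
    if num_ranks > len(nodes) * cores_per_node:
        raise ValueError("not enough cores for the requested ranks")
    if policy == "pack":
        return [nodes[rank // cores_per_node] for rank in range(num_ranks)]
    if policy == "unpack":
        return [nodes[rank % len(nodes)] for rank in range(num_ranks)]
    raise ValueError(f"unknown policy: {policy}")


nodes = ["Node#0", "Node#1"]
print(place_ranks(4, nodes, cores_per_node=2, policy="pack"))
print(place_ranks(4, nodes, cores_per_node=2, policy="unpack"))
```

Under these assumptions, pack keeps R0 and R1 on the same node, while unpack puts them on different nodes.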
Summary and Future
We developed the job management software.
• Unified operability on PRIMEHPC FX10 and PRIMERGY
• New job scheduler: efficiency, fairness and system optimization
• Practical resource control and job statistical information
Future Work
• Operation simulator: administrators will be able to simulate how operation changes after operation parameters are modified.