A Scalable Heterogeneous Parallelization Framework for
Iterative Local Searches
Martin Burtscher1 and Hassan Rabeti2
1Department of Computer Science, Texas State University-San Marcos
2Department of Mathematics, Texas State University-San Marcos
Problem: HPC is Hard to Exploit
HPC application writers are domain experts
  They are typically not computer scientists and have little or no formal education in parallel programming
  Parallel programming is difficult and error prone
Modern HPC systems are complex
  They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance
Target Area: Iterative Local Searches
Important application domain
  Widely used in engineering & real-time environments
Examples
  All sorts of random-restart greedy algorithms
  Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
ILS properties
  Iteratively produce better solutions
  Can exploit large amounts of parallelism
  Often have exponential search spaces
Our Solution: ILCS Framework
Iterative Local Champion Search (ILCS) framework
  Supports non-random restart heuristics
    Genetic algorithms, tabu search, particle swarm optimization, etc.
  Simplifies implementation of ILS on parallel systems
Design goal
  Ease of use and scalability
Framework benefits
  Handles threading, communication, locking, resource allocation, heterogeneity, load balancing, termination decisions, and result recording (checkpointing)
User Interface
The user writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions:

size_t CPU_Init(int argc, char *argv[]);
void CPU_Exec(long seed, void const *champion, void *result);
void CPU_Output(void const *champion);

See the paper for the GPU interface and sample code
The framework runs the Exec (map) functions in parallel
Internal Operation: Threading
[Diagram: the ILCS master/comm thread, one CPU worker thread per core, and one handler thread per GPU, each running user CPU or GPU code]
  ILCS master thread starts
  The master forks a worker thread per core
  The master forks a handler thread per GPU
  CPU workers evaluate seeds and record their local optima
  GPU workers evaluate seeds and record their local optima
  Handlers launch the GPU code, sleep, and record the results
  The master sporadically finds the global optimum via MPI, then sleeps
Internal Operation: Seed Distribution
E.g., 4 nodes with 4 cores (a, b, c, d) and 2 GPUs (1, 2)
[Diagram: the 64-bit seed range 0 ... 2^64-1 split into per-node chunks; CPU threads take one seed at a time from the bottom of their node's chunk, GPUs take strided ranges of seeds from the top]
  Each node gets a chunk of the 64-bit seed range
  CPUs process their chunk bottom up
  GPUs process their chunk top down
Benefits
  Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
  Users can generate other distributions from the seeds
  Any injective mapping results in no redundant evaluations
Related Work
MapReduce/Hadoop/MARS and PADO
  Their generality and features unnecessary for ILS incur overhead and increase the learning curve
  Some do not support accelerators; some require Java
ILCS framework is optimized for ILS applications
  Provides reduction, does not require multiple keys, does not need secondary storage to buffer data, directly supports non-random restart heuristics, allows early termination, works with GPUs and MICs, and targets everything from single-node workstations to HPC clusters
Evaluation Methodology
Three HPC systems (at TACC and NICS):

system     compute   CPUs    CPU      CPU clock  GPUs  GPU      GPU clock
           nodes             cores    frequency        cores    frequency
Keeneland      264     528     4,224    2.6 GHz   792  405,504    1.3 GHz
Ranger       3,936  15,744    62,976    2.3 GHz     -        -          -
Stampede     6,400  12,800   102,400    2.7 GHz  128*      n/a        n/a

Largest tested configuration:

system     compute  total   total  total      total
           nodes    CPUs    GPUs   CPU cores  GPU cores
Keeneland      128    256     384      2,048    196,608
Ranger       2,048  8,192       0     32,768          0
Stampede     1,024  2,048       0     16,384          0
Sample ILS Codes
Traveling Salesman Problem (TSP)
  Find the shortest tour
  4 inputs from TSPLIB
  2-opt hill climbing
Finite State Machine (FSM)
  Find the best FSM configuration to predict hit/miss events
  4 sizes (n = 3, 4, 5, 6)
  Monte Carlo method
[Diagram: FSM transition table with 2^(n+1) entries, indexed by the n-bit current state and the input bit, giving the next state]
FSM Transitions/Second Evaluated
[Chart: transitions evaluated per second (trillions) on Keeneland, Ranger, and Stampede for the 3-bit to 6-bit FSMs]
  Peak: 21,532,197,798,304 transitions/s
  Larger FSMs run into the GPU shared-memory limit
  Ranger uses twice as many cores as Stampede
TSP Tour-Changes/Second Evaluated
[Chart: moves evaluated per second (trillions) on Keeneland, Ranger, and Stampede for kroE100, ts225, rat575, and d1291]
  Peak: 12,239,050,704,370 moves/s (based on serial CPU code)
  CPU pre-computes distances: O(n^2) memory
  GPU re-computes distances: O(n) memory
  Each core evaluates a tour change every 3.6 cycles
TSP Moves/Second/Node Evaluated
[Chart: moves evaluated per second per node (billions) on Keeneland, Ranger, and Stampede for kroE100, ts225, rat575, and d1291]
  GPUs provide >90% of the performance on Keeneland
ILCS Scaling on Ranger (FSM)
[Chart, log scale: transitions evaluated per second (billions) vs. compute nodes for the 3-bit to 6-bit FSMs]
  >99% parallel efficiency on 2048 nodes
  The other two systems are similar
ILCS Scaling on Ranger (TSP)
[Chart, log scale: moves evaluated per second (billions) vs. compute nodes for kroE100, ts225, rat575, and d1291]
  >95% parallel efficiency on 2048 nodes
  Longer runs are even better
Intra-Node Scaling on Stampede (TSP)
[Chart: moves evaluated per second (billions) vs. worker threads (1-16) for kroE100, ts225, rat575, and d1291]
  >98.9% parallel efficiency on 16 threads
  The framework overhead is very small
Tour Quality Evolution (Keeneland)
[Chart: deviation from the optimal tour length (%) vs. step (1-29) for kroE100, ts225, rat575, and d1291]
  Quality depends on chance: ILS provides a good solution quickly, then progressively improves it
Tour Quality after 6 Steps (Stampede)
[Chart: deviation from the optimal tour length (%) vs. compute nodes (1-1024) for kroE100, ts225, rat575, and d1291]
  Larger node counts typically yield better results faster
Summary and Conclusions
ILCS Framework
  Automatic parallelization of iterative local searches
  Provides MPI, OpenMP, and multi-GPU support
  Checkpoints the currently best solution every few seconds
  Scales very well (decentralized design)
Evaluation
  2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  AMD and Intel CPUs, NVIDIA GPUs, and Intel MICs
ILCS source code is freely available
  http://cs.txstate.edu/~burtscher/research/ILCS/
Work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS