Exploiting Modern Hardware Features via Lightweight Profiling
Probir RoyScalable Tools Workshop’19
1
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
IBM POWER 9 CPU
2
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Amazon CloudFront
IBM POWER 9 CPU
2
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Amazon CloudFront
NAMD
IBM POWER 9 CPU
2
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Amazon CloudFront
NAMD
MPI
IBM POWER 9 CPU
2
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Amazon CloudFront
NAMD
MPI
IBM POWER 9 CPU
Common Goal: Performance
2
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Amazon CloudFront
NAMD
MPI
IBM POWER 9 CPU
Common Goal: Performance
Variable characteristics of hardware and software is a challenge
2
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Amazon CloudFront
NAMD
MPI
IBM POWER 9 CPU
Common Goal: Performance
Variable characteristics of hardware and software is a challenge
Deep insights2
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
3
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Peak FLOPS per socket increasing at 50%-60% per year
http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
3
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Peak FLOPS per socket increasing at 50%-60% per year
Memory bandwidth increasing at ~23% per year
http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
3
High performance and challenges
Exploiting Modern Hardware Features via Lightweight Profiling
Peak FLOPS per socket increasing at 50%-60% per year
Memory bandwidth increasing at ~23% per year
Memory latency increasing at ~4% per year
http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
3
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
Application
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
ApplicationWhat?
Why?
How?
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
ProfilerApplicationWhat?
Why?
How?
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
ProfilerApplicationWhat?
Why?
How?
Simulation Measurement
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
Profiler
Profiles
ApplicationWhat?
Why?
How?
Simulation Measurement
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
Profiler
Profiles
Analyzer
ApplicationWhat?
Why?
How?
Simulation Measurement
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
Profiler
Profiles
AnalyzerCode
optimization
ApplicationWhat?
Why?
How?
Simulation Measurement
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
Profiler
Profiles
AnalyzerCode
optimization
ApplicationWhat?
Why?
How?
Simulation Measurement
4
Steps of performance analysis
Exploiting Modern Hardware Features via Lightweight Profiling
Profiler
Profiles
AnalyzerCode
optimization
ApplicationWhat?
Why?
How?
Simulation Measurement
4
Limitations of performance analysis (Cont.)
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead
Exploiting Modern Hardware Features via Lightweight Profiling5
Limitations of performance analysis (Cont.)
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overheadCache simulation: average 38x (Xiang et al. A higher order theory of locality )
Exploiting Modern Hardware Features via Lightweight Profiling5
Limitations of performance analysis (Cont.)
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead
Selective instrumentation: 2x - 5x (Rane et al. MACPO)
Cache simulation: average 38x (Xiang et al. A higher order theory of locality )
Exploiting Modern Hardware Features via Lightweight Profiling5
Limitations of performance analysis (Cont.)
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead
Selective instrumentation: 2x - 5x (Rane et al. MACPO)
Cache simulation: average 38x (Xiang et al. A higher order theory of locality ) Profiling: < 10% (Liu et al. A Data-centric Profiler for Parallel Programs)
Exploiting Modern Hardware Features via Lightweight Profiling5
Limitations of performance analysis (Cont.)
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead
Goal
Selective instrumentation: 2x - 5x (Rane et al. MACPO)
Cache simulation: average 38x (Xiang et al. A higher order theory of locality ) Profiling: < 10% (Liu et al. A Data-centric Profiler for Parallel Programs)
Exploiting Modern Hardware Features via Lightweight Profiling5
Research statement
Exploiting Modern Hardware Features via Lightweight Profiling
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead6
Research statement
Exploiting Modern Hardware Features via Lightweight Profiling
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead
Lightweight profiling with PMUs can provide deep insights into performance issues caused by memory hierarchies and poor algorithm choice
6
Research statement
Exploiting Modern Hardware Features via Lightweight Profiling
Low overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead
Tools to detect memory and computational
inefficiency
Lightweight profiling with PMUs can provide deep insights into performance issues caused by memory hierarchies and poor algorithm choice
6
My research at a glance
Exploiting Modern Hardware Features via Lightweight Profiling
CPU
Cache
Physical Memory
Mem
ory
Inef
fici
ency
Programming model
7
My research at a glance
Exploiting Modern Hardware Features via Lightweight Profiling
CPU
Cache
Physical Memory
Simultaneous multi-threading Memory contention
Mem
ory
Inef
fici
ency
Programming model
7
My research at a glance
Exploiting Modern Hardware Features via Lightweight Profiling
CPU
Cache
Physical Memory
Simultaneous multi-threading Memory contention
Set-associative cache Conflict miss
Mem
ory
Inef
fici
ency
Programming model
7
My research at a glance
Exploiting Modern Hardware Features via Lightweight Profiling
CPU
Cache
Physical Memory
Simultaneous multi-threading Memory contention
Set-associative cache Conflict miss
Cache line utilization
Mem
ory
Inef
fici
ency
Programming model
7
My research at a glance
Exploiting Modern Hardware Features via Lightweight Profiling
CPU
Cache
Physical Memory
Simultaneous multi-threading Memory contention
Set-associative cache Conflict miss
Cache line utilization
Non-uniform memory
Scalability
Mem
ory
Inef
fici
ency
Programming model
7
My research at a glance
Exploiting Modern Hardware Features via Lightweight Profiling
CPU
Cache
Physical Memory
Simultaneous multi-threading Memory contention
Set-associative cache Conflict miss
Cache line utilization
Non-uniform memory
Scalability
Mem
ory
Inef
fici
ency
Programming model Charm++
7
My research at a glance
Exploiting Modern Hardware Features via Lightweight Profiling
CPU
Cache
Physical Memory
Simultaneous multi-threading Memory contention
Set-associative cache Conflict miss
Cache line utilization
Non-uniform memory
Scalability
Mem
ory
Inef
fici
ency
Programming model Charm++
7
Outline
• Lightweight profiling
• SMT-aware optimization
• Detection of cache conflicts
• Guiding data-structure layout transformation
Exploiting Modern Hardware Features via Lightweight Profiling8
Lightweight memory profiling
• Hardware profiling
• Event based sampling• Intel (Precise event based sampling - PEBS)
• AMD (Instruction based sampling - IBS)
• IBM (Marked event sampling - MRK)
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
9
Lightweight memory profiling
• Hardware profiling
• Event based sampling• Intel (Precise event based sampling - PEBS)
• AMD (Instruction based sampling - IBS)
• IBM (Marked event sampling - MRK)
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
ApplicationTime
9
Lightweight memory profiling
• Hardware profiling
• Event based sampling• Intel (Precise event based sampling - PEBS)
• AMD (Instruction based sampling - IBS)
• IBM (Marked event sampling - MRK)
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
ApplicationTime
9
Lightweight memory profiling
• Hardware profiling
• Event based sampling• Intel (Precise event based sampling - PEBS)
• AMD (Instruction based sampling - IBS)
• IBM (Marked event sampling - MRK)
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
Sample Sample Sample Sample
ApplicationTime
9
Lightweight memory profiling
• Hardware profiling
• Event based sampling• Intel (Precise event based sampling - PEBS)
• AMD (Instruction based sampling - IBS)
• IBM (Marked event sampling - MRK)
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
Sample Sample Sample Sample
ApplicationTime
Reference Type
Data AddressInstruction
Pointer
{L1 miss, L2 hit etc.}
9
Outline
✓Lightweight profiling
✓SMT-aware optimization
• Detection of cache conflicts
• Guiding data-structure layout transformation
Exploiting Modern Hardware Features via Lightweight Profiling10
SMT-Aware Instantaneous Footprint Optimization
Probir Roy, Shuaiwen Leon Song, Xu Liu
[HPDC – 2016]
Exploiting Modern Hardware Features via Lightweight Profiling11
SMT (Simultaneous Multi-Threading)
Superscalar 2-way SMT
Clo
ck C
ycle
s
Thread 1 Thread 2
Idle Cycle
Exploiting Modern Hardware Features via Lightweight Profiling12
SMT scalability
Runtime ratio = SMT runtime / non-SMT runtime
Shared memory SPMD application
Exploiting Modern Hardware Features via Lightweight Profiling
Ru
nti
me
rati
o
Lower is better
13
SMT architecture: shared cache
Exploiting Modern Hardware Features via Lightweight Profiling
Thread 1 Thread 2
Core 1
Thread 3 Thread 4
Core 1
L1 Cache L1 Cache
L2 Cache
LLC Cache
14
SMT: Memory scalability
SMT scaling factor (F) = access Latency of SMT/ access Latency of non-SMT
Exploiting Modern Hardware Features via Lightweight Profiling
SMT
scal
ing
fact
or
Lower is better
15
Characterization based on sensitivity
(L,F) Benchmarks Characterization
(high, high) srad, streamcluster1, Lulesh2.0, IRSmk, LU, 3D tensor, Stencil, streamcluster2,
hotspot, Clomp
potentially sensitiveto mem-centric SMT
optimizations
(high, low) lud, needle, bfs, nn, bp, canneal,Ferret
not clear if they canfurther benefit fromSMT optimizations
(low, high) leucocite, heartwall, pathfinder, myocyte
little benefit frommem-centric SMT
optimization
(low, low) b+tree, cfd, kmeans, lavaMD, particle filter, hotspot3D, blackscholes,
bodytrack, facesim,Swaptions
good memoryperformance with
SMT enabled
L = Memory Access Latency; F = scaling factor
Exploiting Modern Hardware Features via Lightweight Profiling16
Characterization based on sensitivity
(L,F) Benchmarks Characterization
(high, high) srad, streamcluster1, Lulesh2.0, IRSmk, LU, 3D tensor, Stencil, streamcluster2,
hotspot, Clomp
potentially sensitiveto mem-centric SMT
optimizations
(high, low) lud, needle, bfs, nn, bp, canneal,Ferret
not clear if they canfurther benefit fromSMT optimizations
(low, high) leucocite, heartwall, pathfinder, myocyte
little benefit frommem-centric SMT
optimization
(low, low) b+tree, cfd, kmeans, lavaMD, particle filter, hotspot3D, blackscholes,
bodytrack, facesim,Swaptions
good memoryperformance with
SMT enabled
L = Memory Access Latency; F = scaling factor
Exploiting Modern Hardware Features via Lightweight Profiling16
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
17
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
Intra-threadSMT thread 2SMT thread 1
time
space
17
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
Intra-threadSMT thread 2SMT thread 1
time
space
17
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
Intra-threadSMT thread 2SMT thread 1
time
space
Optimization: Improvecache line utilization 17
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
Inter-threadIntra-threadSMT thread 2SMT thread 1
time
space
Optimization: Improvecache line utilization 17
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
Inter-threadIntra-threadSMT thread 2SMT thread 1
time
space
Optimization: Improvecache line utilization
time
space
17
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
Inter-threadIntra-threadSMT thread 2SMT thread 1
time
space
Optimization: Improvecache line utilization
Cache line 1
Cache line 1
time
space
17
Source of memory contention
Exploiting Modern Hardware Features via Lightweight Profiling
Little/no locality
Inter-threadIntra-threadSMT thread 2SMT thread 1
time
space
Optimization: Improvecache line utilization
Cache line 1
Cache line 1
time
space
Optimization:Collaboration 17
SMT locality (Stencil code)
Exploiting Modern Hardware Features via Lightweight Profiling
#pragma omp parallel forfor (int i=T; i
SMT locality (Stencil code)
Exploiting Modern Hardware Features via Lightweight Profiling
#pragma omp parallel forfor (int i=T; i
SMT locality (Stencil code)
Exploiting Modern Hardware Features via Lightweight Profiling
#pragma omp parallel forfor (int i=T; i
SMT locality (Stencil code)
Exploiting Modern Hardware Features via Lightweight Profiling
#pragma omp parallel forfor (int i=T; i
SMT locality (Stencil code)
Exploiting Modern Hardware Features via Lightweight Profiling
#pragma omp parallel forfor (int i=T; i
SMT-Analyzer: Analyzing memory access pattern
Exploiting Modern Hardware Features via Lightweight Profilingthreads
Me
mo
ry r
ange
T1 T2 T3 T4 T5Loop
19
SMT-Analyzer: Analyzing memory access pattern
Exploiting Modern Hardware Features via Lightweight Profilingthreads
Me
mo
ry r
ange
T1 T2 T3 T4 T5
PMU
Loop
19
SMT-Analyzer: Analyzing memory access pattern
Exploiting Modern Hardware Features via Lightweight Profilingthreads
Me
mo
ry r
ange
T1 T2 T3 T4 T5
PMU S S S SApplication Time
Reference Type
Data Address
Instruction Pointer
Thread
Loop
19
SMT-Analyzer: Analyzing memory access pattern
Exploiting Modern Hardware Features via Lightweight Profilingthreads
Me
mo
ry r
ange
T1 T2 T3 T4 T5
PMU S S S SApplication Time
Reference Type
Data Address
Instruction Pointer
Thread
Loop
19
Benchmarks
Benchmarks bottleneck region % of total latency
overhead OPT method Speedups
lulesh2.0 lulesh.cc: 604-609 3.6% +3.1% inter-thread 1.43×
IRSmk rmatmult3.c: 86-103 78.6% +3.2% intra-thread 4.86×
needle needle.cpp:185-187 20% +2.99% inter-thread 2.37×
srad srad.cpp:136-167 80.1% +2.47% intra-thread 1.74×
LU rhs.f:318-328 8.4% +10.6% inter-thread 1.36×
Stencil stencil.c:16-21 95.7% +1.55% inter-thread 10.9×
3D tensor mt.c: 22-22 69.4% +2.4% inter-thread 1.44×
streamcluster2 streamcluster.cpp:653 14.1% +15.2% inter-thread 6.72×
Exploiting Modern Hardware Features via Lightweight Profiling20
Benchmarks
Benchmarks bottleneck region % of total latency
overhead OPT method Speedups
lulesh2.0 lulesh.cc: 604-609 3.6% +3.1% inter-thread 1.43×
IRSmk rmatmult3.c: 86-103 78.6% +3.2% intra-thread 4.86×
needle needle.cpp:185-187 20% +2.99% inter-thread 2.37×
srad srad.cpp:136-167 80.1% +2.47% intra-thread 1.74×
LU rhs.f:318-328 8.4% +10.6% inter-thread 1.36×
Stencil stencil.c:16-21 95.7% +1.55% inter-thread 10.9×
3D tensor mt.c: 22-22 69.4% +2.4% inter-thread 1.44×
streamcluster2 streamcluster.cpp:653 14.1% +15.2% inter-thread 6.72×
Exploiting Modern Hardware Features via Lightweight Profiling20
Benchmarks
Benchmarks bottleneck region % of total latency
overhead OPT method Speedups
lulesh2.0 lulesh.cc: 604-609 3.6% +3.1% inter-thread 1.43×
IRSmk rmatmult3.c: 86-103 78.6% +3.2% intra-thread 4.86×
needle needle.cpp:185-187 20% +2.99% inter-thread 2.37×
srad srad.cpp:136-167 80.1% +2.47% intra-thread 1.74×
LU rhs.f:318-328 8.4% +10.6% inter-thread 1.36×
Stencil stencil.c:16-21 95.7% +1.55% inter-thread 10.9×
3D tensor mt.c: 22-22 69.4% +2.4% inter-thread 1.44×
streamcluster2 streamcluster.cpp:653 14.1% +15.2% inter-thread 6.72×
Exploiting Modern Hardware Features via Lightweight Profiling20
Benchmarks
Benchmarks bottleneck region % of total latency
overhead OPT method Speedups
lulesh2.0 lulesh.cc: 604-609 3.6% +3.1% inter-thread 1.43×
IRSmk rmatmult3.c: 86-103 78.6% +3.2% intra-thread 4.86×
needle needle.cpp:185-187 20% +2.99% inter-thread 2.37×
srad srad.cpp:136-167 80.1% +2.47% intra-thread 1.74×
LU rhs.f:318-328 8.4% +10.6% inter-thread 1.36×
Stencil stencil.c:16-21 95.7% +1.55% inter-thread 10.9×
3D tensor mt.c: 22-22 69.4% +2.4% inter-thread 1.44×
streamcluster2 streamcluster.cpp:653 14.1% +15.2% inter-thread 6.72×
Exploiting Modern Hardware Features via Lightweight Profiling20
Benchmarks
Benchmarks bottleneck region % of total latency
overhead OPT method Speedups
lulesh2.0 lulesh.cc: 604-609 3.6% +3.1% inter-thread 1.43×
IRSmk rmatmult3.c: 86-103 78.6% +3.2% intra-thread 4.86×
needle needle.cpp:185-187 20% +2.99% inter-thread 2.37×
srad srad.cpp:136-167 80.1% +2.47% intra-thread 1.74×
LU rhs.f:318-328 8.4% +10.6% inter-thread 1.36×
Stencil stencil.c:16-21 95.7% +1.55% inter-thread 10.9×
3D tensor mt.c: 22-22 69.4% +2.4% inter-thread 1.44×
streamcluster2 streamcluster.cpp:653 14.1% +15.2% inter-thread 6.72×
Exploiting Modern Hardware Features via Lightweight Profiling
Related work: MACPO (selective instrumentation): 2x - 5x
20
Outline
✓Lightweight profiling
✓SMT-aware optimization
•Detection of cache conflicts• Guiding data-structure layout transformation
Exploiting Modern Hardware Features via Lightweight Profiling21
Lightweight Detection of Cache Conflicts
Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy, Xu Liu
[CGO – 2018]
Exploiting Modern Hardware Features via Lightweight Profiling22
Set-associative cache
Exploiting Modern Hardware Features via Lightweight Profiling
8 waySet 0
Set 1
Set 63
.
.
.
Cache Line
Intel SkylakeL1 cache: 32 KB
23
Set-associative cache
Exploiting Modern Hardware Features via Lightweight Profiling
8 waySet 0
Set 1
Set 63
.
.
.
Cache Line
Address64 Bits
Intel SkylakeL1 cache: 32 KB
23
Set-associative cache
Exploiting Modern Hardware Features via Lightweight Profiling
8 waySet 0
Set 1
Set 63
.
.
.
Cache Line
Address64 Bits
TAG SET Index Offset
Intel SkylakeL1 cache: 32 KB
23
Set-associative cache
Exploiting Modern Hardware Features via Lightweight Profiling
8 waySet 0
Set 1
Set 63
.
.
.
Cache Line
Address64 Bits
TAG SET Index Offset
Intel SkylakeL1 cache: 32 KB
23
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
0 16 32 48 0 16 32 48 0Time
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
0 16 32 48 0 16 32 48 0Time
Pad
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
Set mappingafter padding
0 1 15 16
17 32 33
49
51 52 2
4 5 19
21 22 36
55 56 6
18
34 35 50
20
37
7
3
…
Pad128
0 16 32 48 0 16 32 48 0Time
Pad
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
Set mappingafter padding
0 1 15 16
17 32 33
49
51 52 2
4 5 19
21 22 36
55 56 6
18
34 35 50
20
37
7
3
…
Pad128
0 16 32 48 0 16 32 48 0Time
Pad
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
Set mappingafter padding
0 1 15 16
17 32 33
49
51 52 2
4 5 19
21 22 36
55 56 6
18
34 35 50
20
37
7
3
…
Pad128
0 16 32 48 0 16 32 48 0Time
0 817 34 51 4 21 48 55Time
Pad
24
Set conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
double Array [20,000][128];
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
Set mappingafter padding
0 1 15 16
17 32 33
49
51 52 2
4 5 19
21 22 36
55 56 6
18
34 35 50
20
37
7
3
…
Pad128
0 16 32 48 0 16 32 48 0Time
0 817 34 51 4 21 48 55Time
Pad
Is your application suffering conflict cache miss?
24
L1 cache
Conflict cache miss
Memory trace
Cache simulator
Classifying miss
Simulation methods
Exploiting Modern Hardware Features via Lightweight Profiling
Trace driven cache simulation
A[0][0] A[1][0] A[2][0] A[0][0] A[3][0] A[2][0] … A[0][0]Time
25
L1 cache
Conflict cache miss
Memory trace
Cache simulator
Classifying miss
Simulation methods
Exploiting Modern Hardware Features via Lightweight Profiling
Trace driven cache simulation
A[0][0] A[1][0] A[2][0] A[0][0] A[3][0] A[2][0] … A[0][0]Time
Overhead: average 38 times
25
L1 cache
Conflict cache miss
Memory trace
Cache simulator
Classifying miss
Simulation methods
Exploiting Modern Hardware Features via Lightweight Profiling
Trace driven cache simulation
A[0][0] A[1][0] A[2][0] A[0][0] A[3][0] A[2][0] … A[0][0]Time
Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.
Overhead: average 38 times
25
L1 cache
Conflict cache miss
Memory trace
Cache simulator
Classifying miss
Simulation methods
Exploiting Modern Hardware Features via Lightweight Profiling
Trace driven cache simulation
A[0][0] A[1][0] A[2][0] A[0][0] A[3][0] A[2][0] … A[0][0]TimeHigh overhead
Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.
Overhead: average 38 times
25
L1 cache
Conflict cache miss
Memory trace
Cache simulator
Classifying miss
Simulation methods
Exploiting Modern Hardware Features via Lightweight Profiling
Trace driven cache simulation
A[0][0] A[1][0] A[2][0] A[0][0] A[3][0] A[2][0] … A[0][0]TimeHigh overhead
Difficult to simulate hardware
Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.
Overhead: average 38 times
25
L1 cache
Conflict cache miss
Memory trace
Cache simulator
Classifying miss
Simulation methods
Exploiting Modern Hardware Features via Lightweight Profiling
Trace driven cache simulation
A[0][0] A[1][0] A[2][0] A[0][0] A[3][0] A[2][0] … A[0][0]TimeHigh overhead
Difficult to simulate hardware
Difficult in practiceTheoretically accurate
Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.
Overhead: average 38 times
25
Exploiting Modern Hardware Features via Lightweight Profiling
Memory trace
Cache simulator
Classifying miss
Simulation methods
A practical low overhead solution
26
Exploiting Modern Hardware Features via Lightweight Profiling
CCProf
Memory trace
Cache simulator
Classifying miss
Simulation methods
A practical low overhead solution
26
Exploiting Modern Hardware Features via Lightweight Profiling
CCProf
Measurement methods
Memory trace
Cache simulator
Classifying miss
Simulation methods
A practical low overhead solution
26
Exploiting Modern Hardware Features via Lightweight Profiling
CCProf
Memory sampling
Statistical analysis
Classifying miss
Measurement methods
Memory trace
Cache simulator
Classifying miss
Simulation methods
A practical low overhead solution
26
Exploiting Modern Hardware Features via Lightweight Profiling
CCProf
Memory sampling
Statistical analysis
Classifying miss
Measurement methods
Memory trace
Cache simulator
Classifying miss
Simulation methods
A practical low overhead solution
Overhead
>>
Accuracy
~26
Hardware-based address sampling (Cont.)
A[0][0] A[1][0] A[4][0] A[0][0] A[1][0] A[2][0] … A[0][0]Memory
references
L1 Miss L1 Miss L1 Miss L1 Hit L1 Hit L1 Miss L1 Miss
Exploiting Modern Hardware Features via Lightweight Profiling
Time
27
Hardware-based address sampling (Cont.)
A[0][0] A[1][0] A[4][0] A[0][0] A[1][0] A[2][0] … A[0][0]Memory
references
L1 Miss L1 Miss L1 Miss L1 Hit L1 Hit L1 Miss L1 Miss
Exploiting Modern Hardware Features via Lightweight Profiling
Time
A[0][0] A[4][0] A[2][0]Precise event
sampling (PEBS)
PMU
27
Hardware-based address sampling (Cont.)
A[0][0] A[1][0] A[4][0] A[0][0] A[1][0] A[2][0] … A[0][0]Memory
references
L1 Miss L1 Miss L1 Miss L1 Hit L1 Hit L1 Miss L1 Miss
Exploiting Modern Hardware Features via Lightweight Profiling
TAG SET Index Offset
Time
A[0][0] A[4][0] A[2][0]Precise event
sampling (PEBS)
PMU
27
Hardware-based address sampling (Cont.)
A[0][0] A[1][0] A[4][0] A[0][0] A[1][0] A[2][0] … A[0][0]Memory
references
L1 Miss L1 Miss L1 Miss L1 Hit L1 Hit L1 Miss L1 Miss
Exploiting Modern Hardware Features via Lightweight Profiling
TAG SET Index Offset
S0 S4 S2
Time
A[0][0] A[4][0] A[2][0]Precise event
sampling (PEBS)
PMU
27
Hardware-based address sampling (Cont.)
A[0][0] A[1][0] A[4][0] A[0][0] A[1][0] A[2][0] … A[0][0]Memory
references
L1 Miss L1 Miss L1 Miss L1 Hit L1 Hit L1 Miss L1 Miss
Exploiting Modern Hardware Features via Lightweight Profiling
TAG SET Index Offset
S0 S4 S2
Time
A[0][0] A[4][0] A[2][0]Precise event
sampling (PEBS)
PMU
Set conflict?
27
Observation: temporal pattern of conflict
Exploiting Modern Hardware Features via Lightweight Profiling
………
……
[0] [1] [2] [127]
[0]
[1]
[2]
[20,000]
Array [20,000][128]
0 16 32 48 0 16 32 48 0 0 017 34 51 4 21 48 …
Set mapping
0 1 15
16 17 31
32 33 47
48 49 63
0 1 15
16 17 31
48 49 63…
128
Set mappingafter padding
0 1 15 16
17 32 33
49
51 52 2
4 5 19
21 22 36
55 56 6
18
34 35 50
20
37
7
3
…
Pad128
TimeTime
28
Observation: temporal pattern of conflict (cont.)
Exploiting Modern Hardware Features via Lightweight Profiling
0 16 32 48 0 16 32 48 0
0 017 34 51 4 21 48 …
Conflict
No conflict
Time
Time
29
Observation: temporal pattern of conflict (cont.)
Exploiting Modern Hardware Features via Lightweight Profiling
0 16 32 48 0 16 32 48 0
0 017 34 51 4 21 48 …
Conflict
No conflict
Time
Time
29
Observation: temporal pattern of conflict (cont.)
Exploiting Modern Hardware Features via Lightweight Profiling
0 16 32 48 0 16 32 48 0
0 017 34 51 4 21 48 …
Distance = 3 Distance = 3
Conflict
No conflict
Time
Time
29
Observation: temporal pattern of conflict (cont.)
Exploiting Modern Hardware Features via Lightweight Profiling
0 16 32 48 0 16 32 48 0
0 017 34 51 4 21 48 …
Distance = 3 Distance = 3
Distance = 63
Conflict
No conflict
Time
Time
29
Re-conflict Distance (RCD)
• Number of cache misses in other cache sets between twoconsecutive misses in one particular set
S1 S2 S0 S1 S3 S2 S1 S0
RCD=2 RCD=2
Exploiting Modern Hardware Features via Lightweight Profiling
Set Misses Time
30
Re-conflict Distance (RCD)
• Number of cache misses in other cache sets between twoconsecutive misses in one particular set
S1 S2 S0 S1 S3 S2 S1 S0
RCD=2 RCD=2
Exploiting Modern Hardware Features via Lightweight Profiling
Set Misses Time
PMU S1 S1 S2 S3
RCD=0
Time
30
Re-conflict Distance (RCD)
• Number of cache misses in other cache sets between twoconsecutive misses in one particular set
S1 S2 S0 S1 S3 S2 S1 S0
RCD=2 RCD=2
Exploiting Modern Hardware Features via Lightweight Profiling
Set Misses Time
PMU S1 S1 S2 S3
RCD=0
Time
Approximate RCD
30
RCD and it’s contributionRCD Count Is conflict?
Short Large Yes
Short Small No
Long ~ No
Exploiting Modern Hardware Features via Lightweight Profiling31
RCD and it’s contributionRCD Count Is conflict?
Short Large Yes
Short Small No
Long ~ No
Exploiting Modern Hardware Features via Lightweight Profiling
Regression model
Benchmark RCD
Application RCD ModelConflict
No-Conflict
Pre
dic
tio
nTr
ain
ing
31
Case Study: PolyBench/C -ADI
Exploiting Modern Hardware Features via Lightweight Profiling
CCPROF PREDICTS >>> *** CONFLICT MISS *** in LOOP(line: 102). Loop contribution is *** HIGH *** 94.26%CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 108). Loop's contribution to total L1 miss: 3.13%CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 117). Loop's contribution to total L1 miss: 0.86%CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 122). Loop's contribution to total L1 miss: 1.74%
32
Case Study: PolyBench/C -ADI
Exploiting Modern Hardware Features via Lightweight Profiling
CCPROF PREDICTS >>> *** CONFLICT MISS *** in LOOP(line: 102). Loop contribution is *** HIGH *** 94.26%CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 108). Loop's contribution to total L1 miss: 3.13%CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 117). Loop's contribution to total L1 miss: 0.86%CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 122). Loop's contribution to total L1 miss: 1.74%
32
RCD – before and after optimization
Speedup: 3× Speedup: 1.26×Speedup: 1.13×
Speedup: 1.09×* Speedup: 94.6×* Speedup: 1.12×
*Loop level SpeedupExploiting Modern Hardware Features via Lightweight Profiling33
RCD – before and after optimization
Speedup: 3× Speedup: 1.26×Speedup: 1.13×
Speedup: 1.09×* Speedup: 94.6×* Speedup: 1.12×
*Loop level Speedup
Median overhead: 37%Compare with simulation: 38x
Exploiting Modern Hardware Features via Lightweight Profiling33
Outline
✓Lightweight profiling
✓SMT-aware optimization
✓Detection of cache conflicts
•Guiding data-structure layout transformation
Exploiting Modern Hardware Features via Lightweight Profiling34
StructSlim: A lightweight profiler to guide structure splitting
Probir Roy , Xu Liu[CGO – 2016]
LWPTool: A Lightweight Profiler to Guide Data Layout Optimization
Chao Yu, Probir Roy, Yuebin Bai, Hailong Yang, Xu Liu[TPDS – 2018]
Exploiting Modern Hardware Features via Lightweight Profiling35
StructSlim: A lightweight profiler to guide structure splitting
Probir Roy , Xu Liu[CGO – 2016]
LWPTool: A Lightweight Profiler to Guide Data Layout Optimization
Chao Yu, Probir Roy, Yuebin Bai, Hailong Yang, Xu Liu[TPDS – 2018]
Exploiting Modern Hardware Features via Lightweight Profiling35
Inefficient data-structure
struct type {int a; int b; int c; int d;};struct type Arr[N];for (i = 0; i < N; i++)
B[i] = Arr[i].a + Arr[i].c;
a b c d a b c d
L1 cache
Exploiting Modern Hardware Features via Lightweight Profiling
Cache line
36
Inefficient data-structure
struct type {int a; int b; int c; int d;};struct type Arr[N];for (i = 0; i < N; i++)
B[i] = Arr[i].a + Arr[i].c;
a b c d a b c d
L1 cache
Exploiting Modern Hardware Features via Lightweight Profiling
Cache line
Utilization = 50%
36
Inefficient data-structure
struct type {int a; int b; int c; int d;};struct type Arr[N];for (i = 0; i < N; i++)
B[i] = Arr[i].a + Arr[i].c;
a b c d a b c d
L1 cache
Exploiting Modern Hardware Features via Lightweight Profiling
Cache line
a c a c a c a c
L1 cache
struct type_part1 {int a; int c;};struct type_part2 {int b; int d;};
Split structure
Cache line
Utilization = 50%
36
Inefficient data-structure
struct type {int a; int b; int c; int d;};struct type Arr[N];for (i = 0; i < N; i++)
B[i] = Arr[i].a + Arr[i].c;
a b c d a b c d
L1 cache
Exploiting Modern Hardware Features via Lightweight Profiling
Cache line
a c a c a c a c
L1 cache
struct type_part1 {int a; int c;};struct type_part2 {int b; int d;};
Split structure
Cache line
Utilization = 50%
Utilization = 100%
36
Structure splitting- Questions to ask
Exploiting Modern Hardware Features via Lightweight Profiling
How to split structure?Which data structures are significant?
Which fields to keep together?
37
Structure splitting- Questions to ask
Exploiting Modern Hardware Features via Lightweight Profiling
How to split structure?Which data structures are significant?
High usageWhich fields to keep together?
37
Structure splitting- Questions to ask
Exploiting Modern Hardware Features via Lightweight Profiling
How to split structure?Which data structures are significant?
High usageWhich fields to keep together?
Loop level analysis Field affinity
37
Structure splitting- Questions to ask
Exploiting Modern Hardware Features via Lightweight Profiling
How to split structure?Which data structures are significant?
High usageWhich fields to keep together?
Loop level analysis Field affinity
Field 1
Field 2
Field 3Loop 1 Loop 2
37
Structure splitting- Questions to ask
Exploiting Modern Hardware Features via Lightweight Profiling
How to split structure?Which data structures are significant?
High usageWhich fields to keep together?
Loop level analysis Field affinity
Field 1
Field 2
Field 3Loop 1 Loop 2
Field 1
Field 2
Field 2
Field 3
or
37
Structure splitting- Questions to ask
Exploiting Modern Hardware Features via Lightweight Profiling
How to split structure?Which data structures are significant?
High usageWhich fields to keep together?
Loop level analysis Field affinity
Field 1
Field 2
Field 3Loop 1 Loop 2
Field 1
Field 2
Field 2
Field 3
or
90%
10%37
Structure splitting- Questions to ask
Exploiting Modern Hardware Features via Lightweight Profiling
How to split structure?Which data structures are significant?
High usageWhich fields to keep together?
Loop level analysis Field affinity
Field 1
Field 2
Field 3Loop 1 Loop 2
Field 1
Field 2
Field 2
Field 3
or
90%
10%37
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling38
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
38
S S S SApplication Time
Reference Type
Data Address
Instruction Pointer
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
38
S S S SApplication Time
Reference Type
Data Address
Instruction Pointer
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
PMULoop 1
Loop 2
38
S S S SApplication Time
Reference Type
Data Address
Instruction Pointer
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
PMULoop 1
Loop 2
Mem AllocationMonitor
38
S S S SApplication Time
Reference Type
Data Address
Instruction Pointer
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
PMULoop 1
Loop 2
Heap
Mem AllocationMonitor
38
S S S SApplication Time
Reference Type
Data Address
Instruction Pointer
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
PMULoop 1
Loop 2
Heap
Mem AllocationMonitor
38
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
PMU
Loop 1
Loop 2
Heap
Struct 1
Struct 2
S
S
S
S
S
S
SS
S = sample
39
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
S = sample
PMUHeap
Struct 1
S
SS
SS
Distances
40
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
S = sample
PMUHeap
Struct 1
S
SS
SS
Distances
Distances
Field offset
40
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
S = sample
PMUHeap
Struct 1
S
SS
SS
Distances
Field 1
Field 2
Field 3Loop 1 Loop 2
Distances
Field offset
40
Code-centric and data-centric attribution
Exploiting Modern Hardware Features via Lightweight Profiling
S = sample
PMUHeap
Struct 1
S
SS
SS
Distances
Field 1
Field 2
Field 3Loop 1 Loop 2
Distances
Field offset
90%
10%40
Case study: SPEC CPU 2000 ART
Loops with line numbers
Latency percentage Accessed fields
131-138 1.59% U,P
559-570 8.42% X,Q
553-554 1.98% W
545-548 10.83% U, I
615-616 56.57% P
607-608 14.40% P
589-592 2.25% U, P
575-576 3.72% V
1015-1016 0.24% I
Exploiting Modern Hardware Features via Lightweight Profiling
typedef struct{
double *I; double W; double X; double V; double U; double P; double Q; double R;}f1_neuron
U
I P
Q X
R W V
86% 5%
100%
Affinity graph
41
Case study: SPEC CPU 2000 ART
Loops with line numbers
Latency percentage Accessed fields
131-138 1.59% U,P
559-570 8.42% X,Q
553-554 1.98% W
545-548 10.83% U, I
615-616 56.57% P
607-608 14.40% P
589-592 2.25% U, P
575-576 3.72% V
1015-1016 0.24% I
Exploiting Modern Hardware Features via Lightweight Profiling
typedef struct{
double *I; double W; double X; double V; double U; double P; double Q; double R;}f1_neuron
U
I P
Q X
R W V
86% 5%
100%
Affinity graph
typedef struct{ double *I; double U;} f1_neuron_IU;typedef struct{ double Q; double X;} f1_neuron_QX;typedef struct{ double P;} f1_neuron_P;typedef struct{ double V;} f1_neuron_V;typedef struct{ double W;} f1_neuron_W;typedef struct{ double R;} f1_neuron_R;
41
Benchmarks: speedup, overhead, cache missBenchmarks Speedups Runtime
overheadL1 miss
reductionL2 miss
reduction
179.ART 1.37× 2.05% 46.5% 51.1%
462.Libquantum 1.09× 2.79% 49% 82.6%
TSP 1.09× 2.42% 13.3% 19.9%
Mser 1.03× 2.95% 8.3% 8.4%
CLOMP 1.2 1.25× 16.1% 15.5% 26.4%
Health 1.12× 18.3% 66.7% 90.8%
NN 1.33× 5.21% 87.2% 98.0%
Average 1.18× 7.1%
Exploiting Modern Hardware Features via Lightweight Profiling
gcc -O3
Yan, Jianian, Jiangzhou He, Wenguang Chen, Pen-Chung Yew, and Weimin Zheng. "ASLOP: A field-access affinity-based structure data layout optimizer."
Related work: Overhead: average 4x
42
Conclusions
Exploiting Modern Hardware Features via Lightweight ProfilingLow overhead
Deep insight
Simulation methods(PinTool, GPGPUSim,
GEMS)
Measurement methods(Perf, Oprofile, PAPI)
Shallow insight
High overhead
SMTAnalyzer
StructSlim
CCProf
Lightweight profiling with PMUs can provide deep insights into performance issues cause by memory hierarchies and poor algorithm choice.
43
Publications
• [CGO'18] "Lightweight Detection of Cache Conflicts", Probir Roy, Shuaiwen Leon Song, SriramKrishnamoorthy and Xu Liu, The 2018 International Symposium on Code Generation andOptimization, Feb 24 - 28th, 2018, Vienna, Austria. Acceptance ratio: 28%.
• [TACO'18] "NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks", Probir Roy, ShuaiwenLeon Song, Sriram Krishnamoorthy, Abhinav Vishnu, Dipanjan Sengupta, Xu Liu, ACM Transactionson Architecture and Code Optimization, 2018.
• [TPDS'18] "LWPTool: A Lightweight Profiler to Guide Data Layout Optimization", Chao Yu, ProbirRoy, Yuebin Bai, Hailong Yang, Xu Liu, IEEE Transactions on Parallel and Distributed Systems, 2018.
• [HPDC'16] "SMT-Aware Instantaneous Footprint Optimization", Probir Roy, Xu Liu and ShuaiwenLeon Song, The 25th ACM International Symposium on High-Performance and DistributedComputing, May 31 - Jun 4, 2016, Kyoto, Japan. Acceptance ratio: 15.5% (20/129).
• [CGO'16] "StructSlim: A Lightweight Profiler to Guide Structure Splitting", Probir Roy and Xu Liu,The 2016 International Symposium on Code Generation and Optimization, Mar 12-18, 2016,Barcelona, Spain. Acceptance ratio: 23%.
Exploiting Modern Hardware Features via Lightweight Profiling44
Publications
• [CGO'18] "Lightweight Detection of Cache Conflicts", Probir Roy, Shuaiwen Leon Song, SriramKrishnamoorthy and Xu Liu, The 2018 International Symposium on Code Generation andOptimization, Feb 24 - 28th, 2018, Vienna, Austria. Acceptance ratio: 28%.
• [TACO'18] "NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks", Probir Roy, ShuaiwenLeon Song, Sriram Krishnamoorthy, Abhinav Vishnu, Dipanjan Sengupta, Xu Liu, ACM Transactionson Architecture and Code Optimization, 2018.
• [TPDS'18] "LWPTool: A Lightweight Profiler to Guide Data Layout Optimization", Chao Yu, ProbirRoy, Yuebin Bai, Hailong Yang, Xu Liu, IEEE Transactions on Parallel and Distributed Systems, 2018.
• [HPDC'16] "SMT-Aware Instantaneous Footprint Optimization", Probir Roy, Xu Liu and ShuaiwenLeon Song, The 25th ACM International Symposium on High-Performance and DistributedComputing, May 31 - Jun 4, 2016, Kyoto, Japan. Acceptance ratio: 15.5% (20/129).
• [CGO'16] "StructSlim: A Lightweight Profiler to Guide Structure Splitting", Probir Roy and Xu Liu,The 2016 International Symposium on Code Generation and Optimization, Mar 12-18, 2016,Barcelona, Spain. Acceptance ratio: 23%.
Exploiting Modern Hardware Features via Lightweight Profiling
???
?44
Challenges ahead
• Program analysis for declarative programming languages• Domain specific languages provide high-level abstraction
• Machine learning (PyTorch), HPC (HDF5), big-data (SQL)
• Analyzing and optimizing data center and cloud application• Resource utilization/scheduling in multi-tenant environment
• Heterogenous architecture resource management
• Security analysis• Program analysis to identify vulnerable source code
• Analysis of emerging hardware• GPU, FPGA, Tensor processing units
Exploiting Modern Hardware Features via Lightweight Profiling45
Top Related