Big Concept Slide QUICK TUTORIAL OPENCONTRAIL QUICK TUTORIAL Contrail Virtual Networking.
TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48...
Transcript of TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48...
![Page 1: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/1.jpg)
TUNING SLIDE
![Page 2: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/2.jpg)
MICRO-48 Tutorial
December 5, 2015
Fast and Accurate Microarchitectural
Simulation with ZSim
Daniel Sanchez, Nathan Beckmann,
Anurag Mukkara, Po-An Tsai
MIT CSAIL
![Page 3: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/3.jpg)
Welcome!
![Page 4: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/4.jpg)
Agenda4
8:30 – 9:10 Intro and Overview
9:10 – 9:25 Simulator Organization
9:25 – 10:00 Core Models
10:00 – 10:20 Break / Q&A
10:20 – 11:00 Memory System
11:00 – 11:20 Configuration and Stats
11:20 – 11:40 Validation
11:40 – 12:00 Q&A
![Page 5: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/5.jpg)
Introduction and Overview
5
![Page 6: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/6.jpg)
Motivation6
Current detailed simulators are slow (~200 KIPS)
![Page 7: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/7.jpg)
Motivation6
Current detailed simulators are slow (~200 KIPS)
Simulation performance wall
More complex targets (multicore, memory hierarchy, …)
Hard to parallelize
![Page 8: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/8.jpg)
Motivation6
Current detailed simulators are slow (~200 KIPS)
Simulation performance wall
More complex targets (multicore, memory hierarchy, …)
Hard to parallelize
Problem: Time to simulate 1000 cores @ 2GHz for 1s at
200 KIPS: 4 months
![Page 9: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/9.jpg)
Motivation6
Current detailed simulators are slow (~200 KIPS)
Simulation performance wall
More complex targets (multicore, memory hierarchy, …)
Hard to parallelize
Problem: Time to simulate 1000 cores @ 2GHz for 1s at
200 KIPS: 4 months
200 MIPS: 3 hours
![Page 10: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/10.jpg)
Motivation6
Current detailed simulators are slow (~200 KIPS)
Simulation performance wall
More complex targets (multicore, memory hierarchy, …)
Hard to parallelize
Problem: Time to simulate 1000 cores @ 2GHz for 1s at
200 KIPS: 4 months
200 MIPS: 3 hours
Alternatives?
FPGAs: Fast, good progress, but still hard to use
Simplified/abstract models: Fast but inaccurate
![Page 11: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/11.jpg)
ZSim Techniques7
Three techniques to make 1000-core simulation practical:
1. Detailed DBT-accelerated core models to speed up sequential
simulation
2. Bound-weave to scale parallel simulation
3. Lightweight user-level virtualization to bridge user-level/full-
system gap
![Page 12: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/12.jpg)
ZSim Techniques7
Three techniques to make 1000-core simulation practical:
1. Detailed DBT-accelerated core models to speed up sequential
simulation
2. Bound-weave to scale parallel simulation
3. Lightweight user-level virtualization to bridge user-level/full-
system gap
ZSim achieves high performance and accuracy:
Simulates 1024-core systems at 10s-1000s of MIPS
100-1000x faster than current simulators
Validated against real Westmere system, avg error ~10%
![Page 13: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/13.jpg)
This Presentation is Also a Demo!8
ZSim is simulating these slides
OOO Westmere cores running at 2 GHz
3-level cache hierarchy
Will illustrate other features as I present them
Total cycles and instructions
simulated (in billions)
Current simulation speed and basic stats
(updated every 500ms)
![Page 14: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/14.jpg)
This Presentation is Also a Demo!8
ZSim is simulating these slides
OOO Westmere cores running at 2 GHz
3-level cache hierarchy
Will illustrate other features as I present them
Total cycles and instructions
simulated (in billions)
Current simulation speed and basic stats
(updated every 500ms)
Busy (> 0.9 cores active)
0.1 < cores active < 0.9
Idle (< 0.1 cores active)
![Page 15: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/15.jpg)
This Presentation is Also a Demo!8
ZSim is simulating these slides
OOO Westmere cores running at 2 GHz
3-level cache hierarchy
Will illustrate other features as I present them
Total cycles and instructions
simulated (in billions)
Current simulation speed and basic stats
(updated every 500ms)
ZSim performance relevant when busy
Running on 2-core laptop CPU @ 1.7 GHz
~12x slower than 16-core server @ 2.6 GHz
Busy (> 0.9 cores active)
0.1 < cores active < 0.9
Idle (< 0.1 cores active)
!
![Page 16: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/16.jpg)
Main Design Decisions9
General execution-driven simulator:
Functional
model
Timing
model
![Page 17: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/17.jpg)
Main Design Decisions9
General execution-driven simulator:
Functional
model
Timing
model
Emulation? (e.g., gem5, MARSSx86)
Instrumentation? (e.g., Graphite, Sniper)
![Page 18: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/18.jpg)
Main Design Decisions9
General execution-driven simulator:
Functional
model
Timing
model
Emulation? (e.g., gem5, MARSSx86)
Instrumentation? (e.g., Graphite, Sniper)
Functional model “for free”
Base ISA = Host ISA (x86)
Dynamic Binary Translation (Pin)
![Page 19: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/19.jpg)
Main Design Decisions9
General execution-driven simulator:
Functional
model
Timing
model
Emulation? (e.g., gem5, MARSSx86)
Instrumentation? (e.g., Graphite, Sniper)
Cycle-driven?
Event-driven?
Functional model “for free”
Base ISA = Host ISA (x86)
Dynamic Binary Translation (Pin)
![Page 20: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/20.jpg)
Main Design Decisions9
General execution-driven simulator:
Functional
model
Timing
model
Emulation? (e.g., gem5, MARSSx86)
Instrumentation? (e.g., Graphite, Sniper)
Cycle-driven?
Event-driven?
Functional model “for free”
Base ISA = Host ISA (x86)
DBT-accelerated,
instruction-driven core
+
Event-driven uncore
Dynamic Binary Translation (Pin)
![Page 21: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/21.jpg)
Outline10
Introduction
Detailed DBT-accelerated core models
Bound-weave parallelization
Lightweight user-level virtualization
![Page 22: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/22.jpg)
Shift most of the work to DBT instrumentation phase
Accelerating Core Models11
mov (%rbp),%rcx
add %rax,%rbx
mov %rdx,(%rbp)
ja 40530a
Load(addr = (%rbp))
mov (%rbp),%rcx
add %rax,%rdx
Store(addr = (%rbp))
mov %rdx,(%rbp)
BasicBlock(BBLDescriptor)
ja 10840530a
Basic block Instrumented basic block Basic block descriptor
Insµop decoding
µop dependencies,
functional units, latency
Front-end delays
+
![Page 23: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/23.jpg)
Shift most of the work to DBT instrumentation phase
Instruction-driven models: Simulate all stages at once for each
instruction/ µop
Accelerating Core Models11
mov (%rbp),%rcx
add %rax,%rbx
mov %rdx,(%rbp)
ja 40530a
Load(addr = (%rbp))
mov (%rbp),%rcx
add %rax,%rdx
Store(addr = (%rbp))
mov %rdx,(%rbp)
BasicBlock(BBLDescriptor)
ja 10840530a
Basic block Instrumented basic block Basic block descriptor
Insµop decoding
µop dependencies,
functional units, latency
Front-end delays
+
![Page 24: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/24.jpg)
Shift most of the work to DBT instrumentation phase
Instruction-driven models: Simulate all stages at once for each
instruction/ µop
Accurate even with OOO if instruction window prioritizes older instructions
Faster, but more complex than cycle-driven
Accelerating Core Models11
mov (%rbp),%rcx
add %rax,%rbx
mov %rdx,(%rbp)
ja 40530a
Load(addr = (%rbp))
mov (%rbp),%rcx
add %rax,%rdx
Store(addr = (%rbp))
mov %rdx,(%rbp)
BasicBlock(BBLDescriptor)
ja 10840530a
Basic block Instrumented basic block Basic block descriptor
Insµop decoding
µop dependencies,
functional units, latency
Front-end delays
+
![Page 25: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/25.jpg)
Detailed OOO Model12
OOO core modeled and validated against Westmere
Main Features
Fetch
Decode
Issue
OOO
Exec
Commit
Wrong-path fetches
Branch Prediction
Front-end delays (predecoder, decoder)
Detailed instruction to µop decoding
Rename/capture stalls
IW with limited size and width
Functional unit delays and contention
Detailed LSU (forwarding, fences,…)
Reorder buffer with limited size and width
![Page 26: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/26.jpg)
Detailed OOO Model13
OOO core modeled and validated against Westmere
Fetch
Decode
Issue
OOO
Exec
Commit
![Page 27: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/27.jpg)
Detailed OOO Model13
OOO core modeled and validated against Westmere
Fetch
Decode
Issue
OOO
Exec
Commit
Fundamentally Hard to Model
Wrong-path execution
![Page 28: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/28.jpg)
Detailed OOO Model13
OOO core modeled and validated against Westmere
Fetch
Decode
Issue
OOO
Exec
Commit
Fundamentally Hard to Model
Wrong-path execution
In Westmere, wrong-path instructions don’t
affect recovery latency or pollute caches
Skipping OK
![Page 29: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/29.jpg)
Detailed OOO Model13
OOO core modeled and validated against Westmere
Fetch
Decode
Issue
OOO
Exec
Commit
Fundamentally Hard to Model
Wrong-path execution
Rarely used
instructions
BTB
LSD
TLBs
In Westmere, wrong-path instructions don’t
affect recovery latency or pollute caches
Skipping OK
Not Modeled (Yet)
![Page 30: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/30.jpg)
Single-Thread Accuracy14
8.5% average IPC error, max 26%, 21/29 within 10%
29 SPEC CPU2006 apps for 50 Billion instructions
Real: Xeon L5640 (Westmere), 3x DDR3-1333, no HT
Simulated: OOO cores @ 2.27 GHz, detailed uncore
![Page 31: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/31.jpg)
Single-Thread Performance15
Host: E5-2670 @ 2.6 GHz (single-thread simulation)
29 SPEC CPU2006 apps for 50 Billion instructions
![Page 32: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/32.jpg)
Single-Thread Performance15
Host: E5-2670 @ 2.6 GHz (single-thread simulation)
29 SPEC CPU2006 apps for 50 Billion instructions
40 MIPS hmean
12 MIPS hmean
![Page 33: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/33.jpg)
Single-Thread Performance15
Host: E5-2670 @ 2.6 GHz (single-thread simulation)
29 SPEC CPU2006 apps for 50 Billion instructions
40 MIPS hmean
12 MIPS hmean
~10-100x faster
![Page 34: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/34.jpg)
Single-Thread Performance15
Host: E5-2670 @ 2.6 GHz (single-thread simulation)
29 SPEC CPU2006 apps for 50 Billion instructions
40 MIPS hmean
12 MIPS hmean
~3x between least and
most detailed models!
~10-100x faster
![Page 35: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/35.jpg)
Outline16
Introduction
Detailed DBT-accelerated core models
Bound-weave parallelization
Lightweight user-level virtualization
![Page 36: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/36.jpg)
Parallelization Techniques17
Parallel Discrete Event Simulation (PDES):
Divide components across host threads
Execute events from each component
maintaining illusion of full order
Core 1Core 0
Mem 0
L3 Bank 0 L3 Bank 1
![Page 37: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/37.jpg)
Parallelization Techniques17
Parallel Discrete Event Simulation (PDES):
Divide components across host threads
Execute events from each component
maintaining illusion of full order
Core 1Core 0
Mem 0
L3 Bank 0 L3 Bank 1
Host
Thread 0
Host
Thread 1
![Page 38: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/38.jpg)
Parallelization Techniques17
Parallel Discrete Event Simulation (PDES):
Divide components across host threads
Execute events from each component
maintaining illusion of full order
Core 1Core 0
Mem 0
L3 Bank 0 L3 Bank 1
Host
Thread 0
Host
Thread 1
5 10
15 15
10 5Skew < 10 cycles
![Page 39: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/39.jpg)
Parallelization Techniques17
Parallel Discrete Event Simulation (PDES):
Divide components across host threads
Execute events from each component
maintaining illusion of full order
Core 1Core 0
Mem 0
L3 Bank 0 L3 Bank 1
Host
Thread 0
Host
Thread 1
5 10
15 15
10 5Accurate
Not scalableSkew < 10 cycles
![Page 40: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/40.jpg)
Parallelization Techniques17
Parallel Discrete Event Simulation (PDES):
Divide components across host threads
Execute events from each component
maintaining illusion of full order
Lax synchronization: Allow skews above inter-component
latencies, tolerate ordering violations
Core 1Core 0
Mem 0
L3 Bank 0 L3 Bank 1
Host
Thread 0
Host
Thread 1
5 10
15 15
10 5
Scalable
Inaccurate
Accurate
Not scalableSkew < 10 cycles
![Page 41: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/41.jpg)
Characterizing Interference18
Path-altering interference
If we simulate two accesses out of order, their
paths through the memory hierarchy change
GETS A
HIT
Core 0
LLC
Mem
GETS A
MISS
1 2
Core1
![Page 42: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/42.jpg)
Characterizing Interference18
Path-altering interference
If we simulate two accesses out of order, their
paths through the memory hierarchy change
GETS A
HIT
Core 0
LLC
Mem
GETS A
MISS
1 2
Core1 Core 0
LLC
Mem
GETS A
HIT
GETS A
MISS
2 1
Core 1
![Page 43: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/43.jpg)
Characterizing Interference18
Path-altering interference
If we simulate two accesses out of order, their
paths through the memory hierarchy change
Path-preserving interference
If we simulate two accesses out of order, their
timing changes but their paths do not
GETS A
HIT
Core 0
LLC
Mem
GETS A
MISS
1 2
Core1 Core 0
LLC
Mem
GETS A
HIT
GETS A
MISS
2 1
Core 1
GETS B
HIT
Core 0
LLC (blocking)
Mem
GETS A
MISS
1 2
Core 1
3 4
5
6
![Page 44: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/44.jpg)
Characterizing Interference18
Path-altering interference
If we simulate two accesses out of order, their
paths through the memory hierarchy change
Path-preserving interference
If we simulate two accesses out of order, their
timing changes but their paths do not
GETS A
HIT
Core 0
LLC
Mem
GETS A
MISS
1 2
Core1 Core 0
LLC
Mem
GETS A
HIT
GETS A
MISS
2 1
Core 1
GETS B
HIT
Core 0
LLC (blocking)
Mem
GETS A
MISS
1 2
Core 1
3 4
5
6GETS B
HIT
Core 0
LLC (blocking)
Mem
GETS A
MISS
2 1
Core 1
4 5
6
3
![Page 45: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/45.jpg)
Characterizing Interference19
Accesses with path-altering interference with barrier synchronization every 1K/10K/100K cycles (64 cores):
1 in10K accesses
![Page 46: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/46.jpg)
Characterizing Interference19
Path-altering interference extremely rare in small intervals
Accesses with path-altering interference with barrier synchronization every 1K/10K/100K cycles (64 cores):
1 in10K accesses
![Page 47: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/47.jpg)
Characterizing Interference19
Path-altering interference extremely rare in small intervals
Strategy:
Simulate path-preserving interference faithfully
Ignore (but optionally profile) path-altering interference
Accesses with path-altering interference with barrier synchronization every 1K/10K/100K cycles (64 cores):
1 in10K accesses
![Page 48: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/48.jpg)
Bound-Weave Parallelization20
Divide simulation in small intervals (e.g., 1000 cycles)
Two parallel phases per interval: Bound and weave
![Page 49: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/49.jpg)
Bound-Weave Parallelization20
Divide simulation in small intervals (e.g., 1000 cycles)
Two parallel phases per interval: Bound and weave
Bound phase: Find paths
Weave phase: Find timings
![Page 50: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/50.jpg)
Bound-Weave Parallelization20
Divide simulation in small intervals (e.g., 1000 cycles)
Two parallel phases per interval: Bound and weave
Bound-Weave equivalent to PDES
for path-preserving interference
Bound phase: Find paths
Weave phase: Find timings
![Page 51: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/51.jpg)
Bound-Weave Example21
2-core host simulating4-core system
1000-cycle intervals
Core 1
L1I
Core 0 Core 2 Core 3
L1D L1I L1D L1I L1D L1I L1D
Mem Ctrl 0 Mem Ctrl 1
L2 L2 L2 L2
L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3
![Page 52: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/52.jpg)
Bound-Weave Example21
2-core host simulating4-core system
1000-cycle intervals
Divide components
among 2 domains Core 1
L1I
Core 0 Core 2 Core 3
L1D L1I L1D L1I L1D L1I L1D
Mem Ctrl 0 Mem Ctrl 1
L2 L2 L2 L2
L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3
Domain 0 Domain 1
![Page 53: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/53.jpg)
Bound-Weave Example21
2-core host simulating4-core system
1000-cycle intervals
Divide components
among 2 domains Core 1
L1I
Core 0 Core 2 Core 3
L1D L1I L1D L1I L1D L1I L1D
Mem Ctrl 0 Mem Ctrl 1
L2 L2 L2 L2
L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3
Domain 0 Domain 1
Host Thread 0
Host Thread 1Host
Time
![Page 54: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/54.jpg)
Bound-Weave Example21
2-core host simulating4-core system
1000-cycle intervals
Divide components
among 2 domains Core 1
L1I
Core 0 Core 2 Core 3
L1D L1I L1D L1I L1D L1I L1D
Mem Ctrl 0 Mem Ctrl 1
L2 L2 L2 L2
L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3
Domain 0 Domain 1
Core 0
Core 3
Core 1
Core 2
Bound Phase: Parallel simulation until cycle
1000, gather access traces
Host Thread 0
Host Thread 1Host
Time
![Page 55: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/55.jpg)
Bound-Weave Example21
2-core host simulating4-core system
1000-cycle intervals
Divide components
among 2 domains Core 1
L1I
Core 0 Core 2 Core 3
L1D L1I L1D L1I L1D L1I L1D
Mem Ctrl 0 Mem Ctrl 1
L2 L2 L2 L2
L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3
Domain 0 Domain 1
Core 0
Core 3
Core 1
Core 2
Bound Phase: Parallel simulation until cycle
1000, gather access traces
Domain 0
Domain 1
Weave Phase: Parallel event-driven simulation of
gathered traces until actual cycle 1000
Host Thread 0
Host Thread 1Host
Time
![Page 56: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/56.jpg)
Bound-Weave Example21
2-core host simulating4-core system
1000-cycle intervals
Divide components
among 2 domains Core 1
L1I
Core 0 Core 2 Core 3
L1D L1I L1D L1I L1D L1I L1D
Mem Ctrl 0 Mem Ctrl 1
L2 L2 L2 L2
L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3
Domain 0 Domain 1
Core 0
Core 3
Core 1
Core 2
Bound Phase: Parallel simulation until cycle
1000, gather access traces
Domain 0
Domain 1
Weave Phase: Parallel event-driven simulation of
gathered traces until actual cycle 1000
Feedback: Adjust core cycles
Host Thread 0
Host Thread 1Host
Time
![Page 57: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/57.jpg)
Bound-Weave Example21
2-core host simulating4-core system
1000-cycle intervals
Divide components
among 2 domains Core 1
L1I
Core 0 Core 2 Core 3
L1D L1I L1D L1I L1D L1I L1D
Mem Ctrl 0 Mem Ctrl 1
L2 L2 L2 L2
L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3
Domain 0 Domain 1
Core 0
Core 3
Core 1
Core 2
Bound Phase: Parallel simulation until cycle
1000, gather access traces
Domain 0
Domain 1
Weave Phase: Parallel event-driven simulation of
gathered traces until actual cycle 1000
Feedback: Adjust core cycles
Bound Phase
(until cycle 2000)
…Core 3
Core 2 Core 0
Core 1Host Thread 0
Host Thread 1Host
Time
![Page 58: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/58.jpg)
Example: Bound Phase22
Host thread 0 simulates core 0, records trace:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
Edges fix minimum latency between events
Minimum L3 and main memory latencies (no interference)
20
20
20
3030 100
30 120
2020 20
40
![Page 59: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/59.jpg)
Example: Weave Phase23
Host threads simulate components from domains 0,1
Host threads only sync when needed
e.g., thread 1 simulates other events (not shown) until cycle 110, syncs
Lower bounds guarantee no order violations
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1
![Page 60: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/60.jpg)
Example: Weave Phase24
Delays propagate as events are simulated:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1
![Page 61: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/61.jpg)
Example: Weave Phase24
Delays propagate as events are simulated:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1Row miss +50 cycles
![Page 62: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/62.jpg)
Example: Weave Phase24
Delays propagate as events are simulated:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1Row miss +50 cycles
170
![Page 63: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/63.jpg)
Example: Weave Phase24
Delays propagate as events are simulated:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1Row miss +50 cycles
280
170
![Page 64: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/64.jpg)
Example: Weave Phase24
Delays propagate as events are simulated:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1Row miss +50 cycles
280
290
300
170
![Page 65: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/65.jpg)
Example: Weave Phase24
Delays propagate as events are simulated:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1Row miss +50 cycles
280
290
300
320
170
![Page 66: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/66.jpg)
Example: Weave Phase24
Delays propagate as events are simulated:
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
Core0 @ 290
L3b3 @ 270
HIT
20
20
20
3030 100
30 120
20
20 20
40
Host Thread 0
Host Thread 1Row miss +50 cycles
280
290
300
320
340
170
![Page 67: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/67.jpg)
Bound-Weave Scalability25
Bound phase scales almost linearly
Using novel shared-memory synchronization protocol (later)
Weave phase scales much better than PDES
Threads only need to sync when an event crosses domains
A lot of work shifted to bound phase
![Page 68: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/68.jpg)
Bound-Weave Scalability25
Bound phase scales almost linearly
Using novel shared-memory synchronization protocol (later)
Weave phase scales much better than PDES
Threads only need to sync when an event crosses domains
A lot of work shifted to bound phase
Need bound and weave models for each component, but
division is often very natural
e.g., caches: hit/miss on bound phase; MSHRs, pipelined
accesses, port contention on weave phase
![Page 69: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/69.jpg)
Bound-Weave Take-Aways26
Minimal synchronization:
Bound phase: Unordered accesses (like lax)
Weave: Only sync on actual dependencies
![Page 70: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/70.jpg)
Bound-Weave Take-Aways26
Minimal synchronization:
Bound phase: Unordered accesses (like lax)
Weave: Only sync on actual dependencies
No ordering violations in weave phase
![Page 71: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/71.jpg)
Bound-Weave Take-Aways26
Minimal synchronization:
Bound phase: Unordered accesses (like lax)
Weave: Only sync on actual dependencies
No ordering violations in weave phase
Works with standard event-driven models
e.g., 110 lines to integrate with DRAMSim2
![Page 72: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/72.jpg)
Multithreaded Accuracy27
23 apps: PARSEC, SPLASH-2, SPEC OMP2001, STREAM
11.2% avg perf error (not IPC), 10/23 within 10%
Similar differences as single-core results
![Page 73: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/73.jpg)
1024-Core Performance28
Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)
Results for the 14/23 parallel apps that scale
![Page 74: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/74.jpg)
1024-Core Performance28
Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)
Results for the 14/23 parallel apps that scale
200 MIPS hmean
41 MIPS hmean
![Page 75: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/75.jpg)
1024-Core Performance28
Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)
Results for the 14/23 parallel apps that scale
200 MIPS hmean
41 MIPS hmean
~100-1000x faster
![Page 76: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/76.jpg)
1024-Core Performance28
Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)
Results for the 14/23 parallel apps that scale
200 MIPS hmean
41 MIPS hmean
~5x between least and
most detailed models!
~100-1000x faster
![Page 77: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/77.jpg)
Bound-Weave Scalability29
![Page 78: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/78.jpg)
Bound-Weave Scalability29
10.1-13.6x speedup @ 16 cores
![Page 79: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/79.jpg)
Outline30
Introduction
Detailed DBT-accelerated core models
Bound-weave parallelization
Lightweight user-level virtualization
![Page 80: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/80.jpg)
Lightweight User-Level Virtualization31
No 1Kcore OSs
No parallel full-system DBT
ZSim has to be
user-level for now
![Page 81: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/81.jpg)
Lightweight User-Level Virtualization31
No 1Kcore OSs
No parallel full-system DBT
Problem: User-level simulators limited to simple workloads
Lightweight user-level virtualization: Bridge the gap with
full-system simulation
Simulate accurately if time spent in OS is minimal
ZSim has to be
user-level for now
![Page 82: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/82.jpg)
Lightweight User-Level Virtualization32
Multiprocess workloads
Scheduler (threads > cores)
Time virtualization
System virtualization
Simulator-OS deadlockavoidance
Signals
ISA extensions
Fast-forwarding
![Page 83: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/83.jpg)
ZSim Limitations33
Not implemented yet:
Multithreaded cores
Detailed NoC models
Virtual memory (TLBs)
![Page 84: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/84.jpg)
ZSim Limitations33
Not implemented yet:
Multithreaded cores
Detailed NoC models
Virtual memory (TLBs)
Fundamentally hard:
Systems or workloads with frequent path-altering interference
(e.g., fine-grained message-passing across whole chip)
Kernel-intensive applications
![Page 85: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/85.jpg)
Summary34
Three techniques to make 1Kcore simulation practical
DBT-accelerated models: 10-100x faster core models
Bound-weave parallelization: ~10-15x speedup from
parallelization with minimal accuracy loss
Lightweight user-level virtualization: Simulate complex
workloads without full-system support
ZSim achieves high performance and accuracy:
Simulates 1024-core systems at 10s-1000s of MIPS
Validated against real Westmere system, avg error ~10%
![Page 86: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/86.jpg)
Simulator Organization
35
![Page 87: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/87.jpg)
Main Components36
Harness
DriverSystem
Initialization
Config
Core timing
models
Memory system
timing models
Global
Memory
User-
level
virtualiz
ation
Stats
![Page 88: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/88.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
![Page 89: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/89.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
./build/opt/zsim test.cfg
![Page 90: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/90.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
zsim
./build/opt/zsim test.cfg
![Page 91: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/91.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
zsim
./build/opt/zsim test.cfg
Global Memory
![Page 92: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/92.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
zsim
./build/opt/zsim test.cfg
Global Memory
![Page 93: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/93.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
zsim
./build/opt/zsim test.cfg
process0 = {command = “ls”;
};
process1 = {command = “echo foo”;
};
…
Global Memory
![Page 94: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/94.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
zsim
./build/opt/zsim test.cfg
process0 = {command = “ls”;
};
process1 = {command = “echo foo”;
};
…
Global Memory
pin –t libzsim.so -- ls
![Page 95: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/95.jpg)
ZSim Harness37
Most of zsim implemented as
a pintool (libzsim.so)
A separate harness process
(zsim) controls simulation
Initializes global memory
Launches pin processes
Checks for deadlock
zsim
./build/opt/zsim test.cfg
process0 = {command = “ls”;
};
process1 = {command = “echo foo”;
};
…
Global Memory
pin –t libzsim.so -- ls
pin –t libzsim.so – echo foo
![Page 96: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/96.jpg)
Global Memory38
Pin processes communicate through a shared memory
segment, managed as a single global heap
All simulator objects must be allocated in the global heap
![Page 97: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/97.jpg)
Global Memory38
Pin processes communicate through a shared memory
segment, managed as a single global heap
All simulator objects must be allocated in the global heap
Process 0
address space
Program code
Local heap
Global heap
libzsim.so
![Page 98: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/98.jpg)
Global Memory38
Pin processes communicate through a shared memory
segment, managed as a single global heap
All simulator objects must be allocated in the global heap
Process 0
address space
Program code
Local heap
Global heap
libzsim.so
Process 1
address space
Program code
Local heap
Global heap
libzsim.so
![Page 99: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/99.jpg)
Global Memory38
Pin processes communicate through a shared memory
segment, managed as a single global heap
All simulator objects must be allocated in the global heap
Process 0
address space
Program code
Local heap
Global heap
libzsim.so
Process 1
address space
Program code
Local heap
Global heap
libzsim.so
![Page 100: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/100.jpg)
Global Memory38
Pin processes communicate through a shared memory
segment, managed as a single global heap
All simulator objects must be allocated in the global heap
Process 0
address space
Program code
Local heap
Global heap
libzsim.so
Process 1
address space
Program code
Local heap
Global heap
libzsim.so
Global heap and
libzsim.so code in
same memory
locations across all
processes Can
use normal pointers
& virtual functions
![Page 101: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/101.jpg)
Global Memory Allocation Idioms39
Globally-allocated objects: Inherit from GlobAlloc
class SimObject : GlobAlloc { …
![Page 102: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/102.jpg)
Global Memory Allocation Idioms39
Globally-allocated objects: Inherit from GlobAlloc
class SimObject : GlobAlloc { …
STL classes that allocate heap memory: Use g_stl variants
g_vector<uint64_t> cacheLines;
![Page 103: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/103.jpg)
Global Memory Allocation Idioms39
Globally-allocated objects: Inherit from GlobAlloc
class SimObject : GlobAlloc { …
STL classes that allocate heap memory: Use g_stl variants
g_vector<uint64_t> cacheLines;
C-style memory allocation (discouraged):
gm_malloc, gm_calloc, gm_free, …
![Page 104: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/104.jpg)
Global Memory Allocation Idioms39
Globally-allocated objects: Inherit from GlobAlloc
class SimObject : GlobAlloc { …
STL classes that allocate heap memory: Use g_stl variants
g_vector<uint64_t> cacheLines;
C-style memory allocation (discouraged):
gm_malloc, gm_calloc, gm_free, …
Declare globally-scoped variables under struct zinfo
![Page 105: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/105.jpg)
Initialization Sequence40
Harness
1
![Page 106: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/106.jpg)
Initialization Sequence40
Harness
1
Config
2
![Page 107: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/107.jpg)
Initialization Sequence40
Harness
1
Config
2
Global
Memory
3
![Page 108: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/108.jpg)
Initialization Sequence40
Harness
1
Config
2
Global
Memory
3
Driver
4
![Page 109: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/109.jpg)
Initialization Sequence40
Harness
1
Config
2
Global
Memory
3
Driver
4
User-
level
virtualiz
ation
5
![Page 110: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/110.jpg)
Initialization Sequence40
Harness
1
Config
2
Global
Memory
3
Driver
4System
Initialization
6
User-
level
virtualiz
ation
5
![Page 111: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/111.jpg)
Initialization Sequence40
Harness
1
Config
2
Global
Memory
3
Driver
4System
Initialization
6
User-
level
virtualiz
ation
5
Stats
7
![Page 112: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/112.jpg)
Initialization Sequence40
Harness
1
Config
2
Global
Memory
3
Driver
4System
Initialization
6
User-
level
virtualiz
ation
5
Stats
7
Memory system
timing models
8
![Page 113: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/113.jpg)
Initialization Sequence40
Harness
1
Config
2
Global
Memory
3
Driver
4System
Initialization
6
User-
level
virtualiz
ation
5
Stats
7
Memory system
timing models
8
Core timing
models
9
![Page 114: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/114.jpg)
Thanks For Your Attention!
Questions?
![Page 115: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/115.jpg)
Backup Slides
![Page 116: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/116.jpg)
Single-Thread Accuracy: Traces116
![Page 117: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/117.jpg)
Single-Thread Accuracy: Traces117
![Page 118: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/118.jpg)
Motivation118
Timeline:
2008: Decide to study 1K-core systems for my Ph.D. thesis
2009: Try every sim out there, none fast enough
Got M5+GEMS to 512 threads [ASPLOS 2010], barely usable
2010: Start developing ZSim [ZCache, MICRO 2010]
2011: Make ZSim flexible, scalable, develop detailed models, other groups start using it
2012: Let’s publish a paper and release it…
ZSim design approach:
Make judicious tradeoffs to achieve detailed 1K core sims efficiently
Verify that those tradeoffs result in minor inaccuracies
Disclaimer: Not a silver bullet & tradeoffs may not be accurate for your target system; you should validate the tradeoffs!
![Page 119: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/119.jpg)
Instruction-Driven Timing Models119
Cycle/event-driven models: Simulate all stages cycle by cycle
Instruction-driven models: Simulate all stages at once for each ins/uop
Each stage has separate clocks
Ordered queues (FetchQ, UopQ, LoadQ, StoreQ, ROB) model feedback loops between stages
Issue window tracks cycles each FU is used to determine dispatch cycle
Even with OOO, accurate if:
1. IW prioritizes older uops (OK)
2. uop exec times not affected by newer uops(OK except mem uops, ignore for now)
Fetc
h
Deco
de
Issu
e
OO
O
Exec
Com
mit
Instr code drives directly
DBT can accelerate better
Harder to develop
![Page 120: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/120.jpg)
DBT-based Acceleration120
With instruction-driven models, can push most overheads into instrumentation phase
mov -0x38(%rbp),%rcx
lea -0x2040(%rbp),%rdx
add %rax,%rbx
mov %rdx,-0x2068(%rbp)
cmp $0x1fff,%rax
jne 40530a
Load(addr = -0x38(%rbp))
mov -0x38(%rbp),%rcx
lea -0x2040(%rbp),%rdx
add %rax,%rdx
mov %rdx,-0x2068(%rbp)
Store(addr = -0x2068(%rbp))
cmp $0x1fff,%rax
BasicBlock(DecodedBBL)
jne 10840530a
Basic block descriptor
Type Src1 Src2 Dst1 Dst2 Lat PortMsk
Load rbp rcx 001000
Exec rbp rdx 3 110001
Exec rax rdx rdx rflgs 1 110001
StAddr rbp S0 1 000100
StData rdx S0 000010
Exec rax rip rip rflgs 1 000001
Instrumented code
Original code (1 basic block)
…
Predecoder/decoder delays
Instruction to uop fission
Instruction fusion
Uop dependencies, latency, ports
![Page 121: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/121.jpg)
Parallelization Techniques121
Parallel Discrete Event Simulation (PDES):
Core 1Core 0
Mem 0
L3 Bank 0 L3 Bank 1
Thread 0 Thread 1 Divide components across threads
Execute events from each component
maintaining illusion of full order
Pessimistic PDES: Keep skew between
threads below inter-component latency5 10
15 15
10 5
Optimistic PDES: Speculate & roll back
on ordering violations
Simple
Excessive sync
Less sync
Heavyweight
Lax synchronization: Allow skews above inter-component latencies,
tolerate ordering violations Scalable
Inaccurate
Accurate
Scales poorly
![Page 122: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/122.jpg)
Bound-Weave Parallelization122
Divide simulation in small intervals (e.g., 1000 cycles)
Two parallel phases per interval: Bound and weave
Bound phase:
Simulate each core independently using instruction-driven models
Record paths of all accesses through the memory hierarchy
Uncore models assume no interference, use minimum response time for all accesses puts lower bound on all events e.g., for a main memory access: uncontended caches, buses, row hit
Weave phase:
Perform parallel event-driven simulation of recorded events
Leverage prior knowledge of events to scale
Bound-Weave equivalent to PDES
for path-preserving interference
Find paths
Find timings
![Page 123: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/123.jpg)
Bound-Weave Example123
Weave phase: Events spread across two threads
Crossing events ( ) to only synchronize when needed
e.g., thread 1 reaches cycle 110, “L3b0 @ 80” not done checks thread 0’s progress, requeues itself later
Other synchronization-avoiding mechanisms in paper
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ Mem0 @ 130
WBACK
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
L3b0 @ 250
FREE MSHR
L3b3 @ 270
HIT
Core0 @ 290
Thread 0 Thread 1
Domain 0
![Page 124: TUNING SLIDE - GitHubzsim.csail.mit.edu/tutorial/slides/intro.pdf · TUNING SLIDE. MICRO-48 Tutorial December 5, 2015 Fast and Accurate Microarchitectural ... MIT CSAIL. Welcome!](https://reader034.fdocuments.in/reader034/viewer/2022051510/5fecc8ae2b9f371c39733fb2/html5/thumbnails/124.jpg)
Events are lower-bounded No ordering violations
e.g., 110 lines of code to integrate with DRAMSim2
Bound-Weave Example124
Delays propagate across crossings:
Works with standard event-driven models!
L3b1 @ 50
HIT
Core0 @ 30 Core0 @ 60
L3b0 @ 80
MISS
Mem1 @ 110
READ Mem0 @ 130
WBACK
Core0 @ 90 Core0 @ 250
L3b0 @ 230
RESP
L3b0 @ 250
FREE MSHR
L3b3 @ 270
HIT
Core0 @ 290
Thread 0 Thread 1
Domain 0
Row miss +50 cycles
280
290
300
320
350