Supporting x86-64 Address Translation for 100s of GPU Lanes
UW-Madison Computer Sciences
Supporting x86-64 Address Translation for 100s of GPU Lanes
Jason Power, Mark D. Hill, David A. Wood. Based on HPCA 2014 paper.
2/19/2014
Summary
• CPUs & GPUs: physically integrated, logically separate
– Near future: cache coherence, shared virtual address space
• Proof-of-concept GPU MMU design
– Per-CU TLBs, highly-threaded PTW, page walk cache
– Full x86-64 support
– Modest performance decrease (2% vs. ideal MMU)
• Design alternatives not chosen
Motivation
• Closer physical integration of GPGPUs
• Programming model still decoupled
– Separate address spaces
– Unified virtual addressing (NVIDIA) simplifies code
– Want: shared virtual address space (HSA hUMA)
[Diagram: a CPU address space containing pointer-based data structures: a tree index and a hash-table index]
Separate Address Space
[Diagram: with separate address spaces, data is copied between the CPU and GPU address spaces, and pointers must be transformed to the new addresses in each direction]
Unified Virtual Addressing
[Diagram: unified virtual addressing maps allocations 1-to-1 between the CPU and GPU address spaces]
```c
void main() {
    int *h_in, *h_out;
    h_in = cudaHostMalloc(sizeof(int)*1024);  // allocate input array on host
    h_in = ...                                // initialize host array
    h_out = cudaHostMalloc(sizeof(int)*1024); // allocate output array on host
    Kernel<<<1,1024>>>(h_in, h_out);
    ... h_out                                 // continue host computation with result
    cudaHostFree(h_in);                       // free memory on host
    cudaHostFree(h_out);
}
```
Unified Virtual Addressing
Shared Virtual Address Space
[Diagram: shared virtual address space: the GPU uses the same 1-to-1 addresses as the CPU rather than a separate GPU address space]
Shared Virtual Address Space
• No caveats
• Programs “just work”
• Same as multicore model
Shared Virtual Address Space
• Simplifies code
• Enables rich pointer-based data structures
– Trees, linked lists, etc.
• Enables composability

Need: an MMU (memory management unit) for the GPU
• Low overhead
• Support for CPU page tables (x86-64)
• 4 KB pages
• Page faults, TLB flushes, TLB shootdowns, etc.
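Supporting x86-64 page tables means the GPU's walker must traverse the standard four-level radix tree. As a sketch (illustrative names, not code from the paper), this is how a 4 KB-page walk decomposes a 48-bit virtual address:

```cpp
#include <cstdint>

struct WalkIndices {
    unsigned pml4, pdpt, pd, pt;  // 9-bit index at each of the four levels
    unsigned offset;              // 12-bit byte offset within a 4 KB page
};

// Decompose a 48-bit x86-64 virtual address for a 4 KB-page table walk.
WalkIndices decompose(uint64_t va) {
    return WalkIndices{
        static_cast<unsigned>((va >> 39) & 0x1FF),  // PML4 index
        static_cast<unsigned>((va >> 30) & 0x1FF),  // PDPT index
        static_cast<unsigned>((va >> 21) & 0x1FF),  // page directory index
        static_cast<unsigned>((va >> 12) & 0x1FF),  // page table index
        static_cast<unsigned>(va & 0xFFF)
    };
}
```

Each 9-bit index selects one of 512 entries at its level, so a single translation can take up to four dependent memory accesses, which is what the page walk cache later targets.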
Outline
• Motivation
• Data-driven GPU MMU design
– D1: Post-coalescer MMU
– D2: +Highly-threaded page table walker
– D3: +Shared page walk cache
• Alternative designs
• Conclusions
System Overview
[Diagram: system overview: a CPU core with L1 and L2 caches, and a GPU whose compute units share an L2 cache, both connected to DRAM]
GPU Overview
[Diagram: GPU overview: each compute unit (32 lanes) contains instruction fetch/decode, a register file, lanes feeding a coalescer, an L1 cache, and scratchpad memory; the compute units share the GPU L2 cache]
Methodology
• Target system
– Heterogeneous CPU-GPU system
• 16 CUs, 32 lanes each
– Linux OS 2.6.22.9
• Simulation
– gem5-gpu in full-system mode (gem5-gpu.cs.wisc.edu)
• Ideal MMU
– Infinite caches
– Minimal latency
Workloads
• Rodinia benchmarks
– backprop
– bfs: breadth-first search
– gaussian
– hotspot
– lud: LU decomposition
– nn: nearest neighbor
– nw: Needleman-Wunsch
– pathfinder
– srad: anisotropic diffusion
• Database sort
– 10 byte keys, 90 byte payload
GPU MMU Design 0
[Diagram: Design 0: per-lane MMUs, with a TLB at every lane of every CU before the coalescer; the CUs' L1 caches share the L2 cache]
GPU MMU Design 1
Per-CU MMUs: Reducing the translation request rate
[Diagram: the lanes of a CU, each issuing its own memory requests]
[Diagram: scratchpad memory filters the lanes' memory accesses from 1x down to 0.45x]
[Diagram: after scratchpad memory, the coalescer further filters translation requests from 0.45x down to 0.06x]
Reducing translation request rate
• Shared (scratchpad) memory and the coalescer effectively filter global memory accesses
• Average of 39 TLB accesses per 1000 cycles for a 32-lane CU
Breakdown of memory operations
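The filtering the slide describes can be sketched as follows; `coalesce` is a hypothetical helper, and the 128-byte line size is an assumption (matching the request sizes mentioned later):

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Collapse the per-lane addresses of one memory instruction into unique
// cache-line requests; only the survivors reach the per-CU TLB.
std::size_t coalesce(const std::vector<uint64_t>& lane_addrs,
                     uint64_t line_bytes = 128) {
    std::set<uint64_t> lines;
    for (uint64_t a : lane_addrs)
        lines.insert(a / line_bytes);  // one request per distinct line
    return lines.size();
}
```

For example, 32 consecutive 4-byte lane accesses collapse to a single request, while a 128-byte-strided pattern coalesces not at all and produces 32.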
GPU MMU Design 1
[Diagram: Design 1: post-coalescer per-CU TLBs backed by a shared page walk unit with a page fault register, sitting between the per-CU L1 caches and the L2 cache]
Performance
GPU MMU Design 2
Highly-threaded page table walker: increasing TLB miss bandwidth
Outstanding TLB Misses
[Diagram: many lanes of a CU can have TLB misses outstanding at once after the coalescer and scratchpad memory]
Multiple Outstanding Page Walks
• Many workloads are bursty
• With a blocking page walker, miss latency skyrockets due to queuing delays
Concurrent page walks per CU when non-zero (log scale)
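A first-order way to see the queuing effect, as a sketch with made-up numbers (the 500-cycle walk latency is illustrative, not a measured value):

```cpp
// Cycles for a burst of TLB misses to finish: a walker with `threads`
// concurrent walk buffers services the burst in ceil(burst/threads)
// sequential rounds of `walk_latency` cycles each (contention ignored).
long burst_completion_cycles(long burst, long walk_latency, long threads) {
    long rounds = (burst + threads - 1) / threads;
    return rounds * walk_latency;
}
```

A blocking walker multiplies the burst's completion time by the burst size (32 misses at 500 cycles each take 16,000 cycles), while 32 concurrent walks service the same burst in a single 500-cycle round.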
GPU MMU Design 2
[Diagram: Design 2: per-CU TLBs backed by a shared page walk unit containing a highly-threaded page table walker, page walk buffers, and a page fault register]
Highly-threaded PTW
[Diagram: the highly-threaded page table walker: the per-CU TLBs feed a single page walk state machine with 32 page walk buffers, each tracking an outstanding address and walk state, with request and response ports to memory]
Performance
GPU MMU Design 3
Page walk cache: overcoming the high TLB miss rate
High L1 TLB Miss Rate
• Average miss rate: 29%!
Per-access miss rate with a 128-entry L1 TLB
High TLB Miss Rate
• Requests from the coalescer are spread out
– Multiple warps issuing unique streams
– Mostly 64 and 128 byte requests
• Often streaming access patterns
– Each entry is not reused many times
– Many demand misses
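For a purely streaming pattern the miss rate is easy to bound: each 4 KB page is touched once per page worth of data, so even an infinite TLB sees one compulsory miss per page. A sketch of that arithmetic:

```cpp
// Compulsory TLB miss rate of a streaming access pattern: one miss per
// page, independent of TLB size (request_bytes assumed to divide the page).
double streaming_miss_rate(double request_bytes, double page_bytes = 4096) {
    return request_bytes / page_bytes;
}
```

At 128-byte requests that floor is 1/32, about 3% per stream; with many warps streaming independently and evicting each other's entries from the 128-entry TLB, observed rates climb well above this compulsory floor.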
GPU MMU Design 3
[Diagram: Design 3: Design 2 plus a page walk cache inside the shared page walk unit, which is shared by all 16 CUs]
Performance
Worst case: 12% slowdown
Average: Less than 2% slowdown
Correctness Issues
• Page faults
– Stall GPU as if TLB miss
– Raise interrupt on CPU core
– Allow OS to handle it in the normal way
– One change to CPU microcode, no OS changes
• TLB flush and shootdown
– Leverage the CPU core, again
– Flush GPU TLBs whenever the CPU core's TLB is flushed
Page-fault Overview
1. Write faulting address to CR2
2. Interrupt CPU
3. OS handles page fault
4. iret instruction executed
5. GPU notified of completed page fault
[Diagram: the CPU core (CR2, CR3) and the GPU MMU (CR3, page fault register) both reference the shared page table]
Page Faults

| Benchmark | Number of page faults | Average page fault latency |
| --- | --- | --- |
| Lud | 1 | 1698 cycles |
| Pathfinder | 15 | 5449 cycles |
| Cell | 31 | 5228 cycles |
| Hotspot | 256 | 5277 cycles |
| Backprop | 256 | 5354 cycles |
• Minor page faults handled correctly by Linux 2.6.22.9
• 5000 cycles ≈ 3.5 µs: too fast to justify a GPU context switch
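The 3.5 µs figure follows directly from the simulated 1.4 GHz GPU clock (see the simulation details); a sketch of the conversion:

```cpp
// Convert GPU cycles to microseconds at a given clock rate in GHz.
double cycles_to_microseconds(double cycles, double ghz = 1.4) {
    return cycles / (ghz * 1000.0);  // GHz * 1000 = cycles per microsecond
}
```

5000 cycles at 1.4 GHz comes out to roughly 3.6 µs, orders of magnitude less than the cost of saving and restoring the register state of hundreds of GPU lanes.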
Summary
• D3: proof of concept
• Non-exotic design
| Design | Per-CU L1 TLB entries | Highly-threaded page table walker | Page walk cache size | Shared L2 TLB entries |
| --- | --- | --- | --- | --- |
| Ideal MMU | Infinite | Infinite | Infinite | None |
| Design 0 | N/A (per-lane MMUs) | None | None | None |
| Design 1 | 128 | Per-CU walkers | None | None |
| Design 2 | 128 | Yes (32-way) | None | None |
| Design 3 | 64 | Yes (32-way) | 8 KB | None |
Low overhead. Fully compatible. Correct execution.
Outline
• Motivation
• Data-driven GPU MMU design
• Alternative designs
– Shared L2 TLB
– TLB prefetcher
– Large pages
• Conclusions
Alternative Designs
• Shared L2 TLB
– Shared between all CUs
– Captures address translation sharing
– Poor performance: increased TLB-miss penalty
• Shared L2 TLB and PWC
– No performance difference
– Trade-offs in the design space
Alternative Design Summary
• Each design has 16 KB of extra SRAM across all 16 CUs
• PWC needed for some workloads
• Trade-offs in energy and area
| Design | Per-CU L1 TLB entries | Page walk cache size | Shared L2 TLB entries | Performance |
| --- | --- | --- | --- | --- |
| Design 3 | 64 | 8 KB | None | 0% |
| Shared L2 | 64 | None | 1024 | 50% |
| Shared L2 & PWC | 32 | 8 KB | 512 | 0% |
GPU MMU Design Tradeoffs
• Shared L2 & PWC provides lower area, but higher energy
• Proof-of-concept Design 3 provides both low area and low energy
*data produced using McPAT and CACTI 6.5
Alternative Designs
• TLB prefetching:
– A simple one-ahead prefetcher does not affect performance
• Large pages:
– Effective at reducing TLB misses
– Requiring them places a burden on the programmer
– Must maintain compatibility
Related Work
• “Architectural Support for Address Translation on GPUs.” Bharath Pichai, Lisa Hsu, Abhishek Bhattacharjee (ASPLOS 2014).
– Concurrent with our work
– High level: modest changes yield high performance
– Pichai et al. investigate warp scheduling
– We use a highly-threaded PTW
Conclusions
• Shared virtual memory is important
• Non-exotic MMU design
– Post-coalescer L1 TLBs
– Highly-threaded page table walker
– Page walk cache
• Full compatibility with minimal overhead
Questions?
Simulation Details
Benchmark memory footprints:

| Benchmark | Footprint | Benchmark | Footprint |
| --- | --- | --- | --- |
| backprop | 74 MB | nn | 500 KB |
| bfs | 37 MB | nw | 128 MB |
| gaussian | 340 KB | pathfinder | 38 MB |
| hotspot | 12 MB | srad | 96 MB |
| lud | 4 MB | sort | 208 MB |

| Component | Configuration |
| --- | --- |
| CPU | 1 core, 2 GHz, 64 KB L1, 2 MB L2 |
| GPU | 16 CUs, 1.4 GHz, 32 lanes |
| L1 cache (per-CU) | 64 KB, 15 ns latency |
| Scratchpad memory | 16 KB, 15 ns latency |
| GPU L2 cache | 1 MB, 16-way set associative, 130 ns latency |
| DRAM | 2 GB, DDR3 timing, 8 channels, 667 MHz |
Shared L2 TLB
Sharing pattern
• Many benchmarks share translations between CUs
Perf. of Alternative Designs
• D3 uses an AMD-like physically-addressed page walk cache
• An ideal PWC performs slightly better (1% on average, up to 10%)
Large pages are effective
• 2 MB pages significantly reduce miss rate
• But can they be used?
– Slow adoption on CPUs
– Special allocation API
Relative miss rate with 2MB pages
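The benefit of 2 MB pages is mostly TLB reach (entries times page size); a sketch using the 128-entry L1 TLB from the earlier slides:

```cpp
#include <cstdint>

// TLB reach: total bytes covered by a fully populated TLB.
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes) {
    return entries * page_bytes;
}
```

Moving the same 128 entries from 4 KB to 2 MB pages grows reach from 512 KB to 256 MB, a 512x increase with no additional TLB hardware, which is why the miss-rate reduction above is so large.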
Shared L2 TLB
[Diagram: the Shared L2 TLB alternative: per-CU L1 TLBs backed by a shared L2 TLB inside the shared page walk unit, alongside the highly-threaded page table walker, page walk buffers, and page fault register]
Shared L2 TLB and PWC
[Diagram: the Shared L2 TLB and PWC alternative: the shared page walk unit contains both a shared L2 TLB and a page walk cache]
Shared L2 TLB
Performance relative to D3
• A shared L2 TLB can work as well as a page walk cache
• Some workloads need low-latency TLB misses
Shared L2 TLB and PWC
• Benefits of both a page walk cache and a shared L2 TLB
• Can reduce the size of the L1 TLB without a performance penalty
Unified MMU Design
[Diagram: unified MMU design: coalesced memory requests (virtual addresses) look up the per-CU TLBs (a); misses enter the shared page walk unit (b) with its multi-threaded page table walker, page walk buffers, page fault register, optional page walk cache, and optional L2 TLB; translated requests (physical addresses) proceed to the L1 caches and L2 cache, with numbered steps tracing the page walk request and response path]