Compiling a Parallel DSL to GPU - GPU Technology Conference … · 2013. 8. 23. · © Synopsys...
Transcript of Compiling a Parallel DSL to GPU - GPU Technology Conference … · 2013. 8. 23. · © Synopsys...
© Synopsys 2012 1
Compiling a Parallel DSL to GPU
Ramesh Narayanaswamy
Badri Gopalan
Synopsys Inc.
GTC 2012
© Synopsys 2012 2
Agenda
• Overview of Verilog Simulation
• Parallel Verilog Simulation Algorithms
• Parallel Simulation Tradeoffs on GPU
• Challenges
GTC 2012
© Synopsys 2012 3
Verilog Hardware Description Language Objectives
• Hardware Models
– Multiple Implementation Levels – Gate, Register Transfer
– Input Output Behavior – Behavioral
• Testbench
– Generate Input Data, Expected Results
– Drive and Compare
– Input Output Behavior Level
– SystemVerilog Testbench Extensions
GTC 2012
Model Hardware • Modern chips run into 100s of Millions of Gates
• Gates in the real world can be thought as parallel blocks in simulation
Model Testbench to test Hardware
© Synopsys 2012 4
Verilog – Models
• Gate Level
– Interconnect
− Combinational logic
− State holding gates: Latches, Flip Flops
− Largely Bit level
• Register Transfer Level (RTL)
– Interconnect Guarded Processes
– Word Level
– Process has Multi line behavior
– Some Process advance clock by one
– Clocks, Sequences are explicit
• Behavioral Level
– Process encompasses many clock cycles
– Process may be longer
GTC 2012
b
a
c q
clk
reg [31:0] q, a, b; reg c, clk;
always @(posedge clk)
if (c)
q <= a;
else
q <= b;
reg [31:0] q, a, b; reg c, clk;
always @(posedge clk)
if (c)
wait @(posedge clk);q <= c;
else
q <= b;
Combinational
Logic Sequential
Logic
Guarded
Process
Behavioral
statements
© Synopsys 2012 5
How Verilog Models Parallelism
GTC 2012
reg [31:0] a,c,b;
always @(a or b)
c = ^b | a;
reg [31:0] a,b,y;
always @(posedge clk)
if (reset)
a <= y;
Wakeup
Guard
reg [31:0] a,b,c,y;
always @(posedge clk)
if (reset)
y <= c & b ^ a;
Read
Global State Simulation Time
Variable Values
Process State
• Where to continue ?
• Blocked on a guard ?
• Executing ?
• Guarded Processes
– @(A or B) change on A or change on B
– @(posedge C) 0 to {1,x,z} or {x,z} to 1 change on C
– #10; 10 time units have elapsed
• Process body - Sequential code with Assignments
• Global State
– Simulation Time
– Values of Variables
– Process State
Collection of Executing Processes: Parallel Workload
HDL Simulators can implement Parallel Workload with:
•Serial Simulation Algorithms
•Parallel Simulation Algorithms
© Synopsys 2012 6
Serial Simulation Algorithms Event Driven Simulation with Dynamic Scheduling
GTC 2012
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
q
q
Each block guarded by
a change-check
Functional evaluation
dynamically scheduled,
executes only what is
needed
Many optimizations for
serial simulation, works
in normal scenarios of
low event activity
But: not easy to expose
parallelism
0->1
0->1
Blocks evaluated in
Current clock cycle
Blocks inactive in
Current clock cycle
© Synopsys 2012 7
Serial Simulation Algorithms Oblivious simulation
GTC 2012
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
q
q
All parallel blocks
evaluated on any
change
No event guards
required: simplifies
implementation
On the surface, looks
suitable for fine grained
parallel implementation
But: high overhead in
the case of low event
activity %
Blocks evaluated in
Current clock cycle
0->1
0->1
Blocks inactive in
Current clock cycle
© Synopsys 2012 8
Parallel Simulation Algorithms Task Based Parallelism
GTC 2012
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
q
q
Design (and / or
testbench) split into
partitions
Each partition mapped
into an execution unit
More suitable for bigger
cores
Speedup depends on
relative and balanced
activity in partitions
0->1
0->1
Blocks evaluated in
Current clock cycle
Blocks inactive in
Current clock cycle
Partition 1 Partition 2
© Synopsys 2012 9
Parallel Simulation Algorithms Fine Grain Parallelism + Oblivious simulation
GTC 2012
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
q
q
All parallel blocks
evaluated on any
change
No event guards
required: simplifies
implementation
Lots of Parallel Blocks
Blocks evaluated in
Current clock cycle
0->1
0->1
Blocks inactive in
Current clock cycle
© Synopsys 2012 10
Parallel Simulation Algorithms Mapping Oblivious simulation to GPU
GTC 2012
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
b
a
c q
clk
q
q
0->1
0->1
Level
1
Level
2
Level
3
Level
1
Level
2
Level
3
© Synopsys 2012 11
Parallel Simulation Complexity Hypothetical Medium Sized Chip RTL
GTC 2012
Model: 1 Million+ user processes
Workload:
• 10K+Targeted Tests
• Each test targets a chip feature
Simulation
• Data Dependence between partitions
• Data Dependent Activity
• 0.5 – 3% Activity per Test Phase
• Low Effective Parallelism
• Activity
• Tail Region
Data
Dependent
Activity
Data
dependence
between
partitions
Tail Data flow
© Synopsys 2012 12
Summary Challenges / Benefits in using GPU for RTL Sim
• What Works ?
– A lot of potential parallelism in the model
− Fermi / CUDA thread scheduling
– A lot of memory accesses per clock cycle
− Fermi provides 144 GBps
– CUDA software ecosystem is robust and improving
• Wishlist
– Global Barrier
– Latency Optimized Core
– Lower launch overhead
– CUDA profiler for large datasets
GTC 2012
© Synopsys 2012 13
References
• Mary L. Bailey, Jack V. Briner, Jr., and Roger D.
Chamberlain. 1994. Parallel logic simulation of VLSI
systems. ACM Comput. Surv. 26, 3 (September 1994),
255-294
• Keckler, S.W.; Dally, W.J.; Khailany, B.; Garland, M.;
Glasco, D.; , "GPUs and the Future of Parallel
Computing," Micro, IEEE , vol.31, no.5, pp.7-17, Sept.-
Oct. 2011
GTC 2012
© Synopsys 2012 14
Questions ?
GTC 2012