Compiling a Parallel DSL to GPU - GPU Technology Conference … · 2013. 8. 23. · © Synopsys...

© Synopsys 2012 1

Compiling a Parallel DSL to GPU

Ramesh Narayanaswamy

Badri Gopalan

Synopsys Inc.

GTC 2012

© Synopsys 2012 2

Agenda

• Overview of Verilog Simulation

• Parallel Verilog Simulation Algorithms

• Parallel Simulation Tradeoffs on GPU

• Challenges

GTC 2012

© Synopsys 2012 3

Verilog Hardware Description Language Objectives

• Hardware Models

– Multiple Implementation Levels – Gate, Register Transfer

– Input Output Behavior – Behavioral

• Testbench

– Generate Input Data, Expected Results

– Drive and Compare

– Input Output Behavior Level

– SystemVerilog Testbench Extensions

GTC 2012

Model Hardware • Modern chips run into 100s of Millions of Gates

• Gates in the real world can be thought as parallel blocks in simulation

Model Testbench to test Hardware

© Synopsys 2012 4

Verilog – Models

• Gate Level

– Interconnect

− Combinational logic

− State holding gates: Latches, Flip Flops

− Largely Bit level

• Register Transfer Level (RTL)

– Interconnect Guarded Processes

– Word Level

– Process has Multi line behavior

– Some Process advance clock by one

– Clocks, Sequences are explicit

• Behavioral Level

– Process encompasses many clock cycles

– Process may be longer

GTC 2012

b

a

c q

clk

reg [31:0] q, a, b; reg c, clk;

always @(posedge clk)

if (c)

q <= a;

else

q <= b;

reg [31:0] q, a, b; reg c, clk;


if (c)

wait @(posedge clk);q <= c;

else

q <= b;

Combinational

Logic Sequential

Logic

Guarded

Process

Behavioral

statements

© Synopsys 2012 5

How Verilog Models Parallelism

GTC 2012

reg [31:0] a,c,b;

always @(a or b)

c = ^b | a;

reg [31:0] a,b,y;


if (reset)

a <= y;

Wakeup

Guard

reg [31:0] a,b,c,y;


if (reset)

y <= c & b ^ a;

Read

Global State Simulation Time

Variable Values

Process State

• Where to continue ?

• Blocked on a guard ?

• Executing ?

• Guarded Processes

– @(A or B) change on A or change on B

– @(posedge C) 0 to {1,x,z} or {x,z} to 1 change on C

– #10; 10 time units have elapsed

• Process body - Sequential code with Assignments

• Global State

– Simulation Time

– Values of Variables

– Process State

Collection of Executing Processes: Parallel Workload

HDL Simulators can implement Parallel Workload with:

•Serial Simulation Algorithms

•Parallel Simulation Algorithms

© Synopsys 2012 6

Serial Simulation Algorithms Event Driven Simulation with Dynamic Scheduling

GTC 2012

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

q

q

Each block guarded by

a change-check

Functional evaluation

dynamically scheduled,

executes only what is

needed

Many optimizations for

serial simulation, works

in normal scenarios of

low event activity

But: not easy to expose

parallelism

0->1

0->1

Blocks evaluated in

Current clock cycle

Blocks inactive in

Current clock cycle

© Synopsys 2012 7

Serial Simulation Algorithms Oblivious simulation

GTC 2012

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

q

q

All parallel blocks

evaluated on any

change

No event guards

required: simplifies

implementation

On the surface, looks

suitable for fine grained

parallel implementation

But: high overhead in

the case of low event

activity %

Blocks evaluated in

Current clock cycle

0->1

0->1

Blocks inactive in

Current clock cycle

© Synopsys 2012 8

Parallel Simulation Algorithms Task Based Parallelism

GTC 2012

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

q

q

Design (and / or

testbench) split into

partitions

Each partition mapped

into an execution unit

More suitable for bigger

cores

Speedup depends on

relative and balanced

activity in partitions

0->1

0->1

Blocks evaluated in

Current clock cycle

Blocks inactive in

Current clock cycle

Partition 1 Partition 2

© Synopsys 2012 9

Parallel Simulation Algorithms Fine Grain Parallelism + Oblivious simulation

GTC 2012

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

q

q

All parallel blocks

evaluated on any

change

No event guards

required: simplifies

implementation

Lots of Parallel Blocks

Blocks evaluated in

Current clock cycle

0->1

0->1

Blocks inactive in

Current clock cycle

© Synopsys 2012 10

Parallel Simulation Algorithms Mapping Oblivious simulation to GPU

GTC 2012

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

b

a

c q

clk

q

q

0->1

0->1

Level

1

Level

2

Level

3

Level

1

Level

2

Level

3

© Synopsys 2012 11

Parallel Simulation Complexity Hypothetical Medium Sized Chip RTL

GTC 2012

Model: 1 Million+ user processes

Workload:

• 10K+Targeted Tests

• Each test targets a chip feature

Simulation

• Data Dependence between partitions

• Data Dependent Activity

• 0.5 – 3% Activity per Test Phase

• Low Effective Parallelism

• Activity

• Tail Region

Data

Dependent

Activity

Data

dependence

between

partitions

Tail Data flow

© Synopsys 2012 12

Summary Challenges / Benefits in using GPU for RTL Sim

• What Works ?

– A lot of potential parallelism in the model

− Fermi / CUDA thread scheduling

– A lot of memory accesses per clock cycle

− Fermi provides 144 GBps

– CUDA software ecosystem is robust and improving

• Wishlist

– Global Barrier

– Latency Optimized Core

– Lower launch overhead

– CUDA profiler for large datasets

GTC 2012

© Synopsys 2012 13

References

• Mary L. Bailey, Jack V. Briner, Jr., and Roger D.

Chamberlain. 1994. Parallel logic simulation of VLSI

systems. ACM Comput. Surv. 26, 3 (September 1994),

255-294

• Keckler, S.W.; Dally, W.J.; Khailany, B.; Garland, M.;

Glasco, D.; , "GPUs and the Future of Parallel

Computing," Micro, IEEE , vol.31, no.5, pp.7-17, Sept.-

Oct. 2011

GTC 2012

Compiling a Parallel DSL to GPU - GPU Technology Conference … · 2013. 8. 23. · © Synopsys...

Documents

Transcript of Compiling a Parallel DSL to GPU - GPU Technology Conference … · 2013. 8. 23. · © Synopsys...