1 Dependence in C and Hardware Design Allen and Kennedy, Chapter 12 Presented by Tali Shragai.

  • date post

    20-Dec-2015

Dependence in C and Hardware Design

Allen and Kennedy, Chapter 12

Presented by Tali Shragai


Today’s lecture…


Introduction

So far, we’ve discussed dependence analysis in Fortran

Dependence analysis applies to any language and translation context where arrays and loops are useful

Application to C and C++
    Modern features (pointers, structures…)

Application to hardware design
    Language-based approach


Outline

Optimizing C
    Overview
    The challenges

HW design
    Overview
    HW Description Languages (HDL)
    Optimizing simulation
    Synthesis optimization methods

Summary


Problems of C

C/C++ focuses on simplified software development, at the expense of optimizability

Optimization may not be desired. Example: polling a keyboard
    while (!(t = *p));
    An optimizer would move the load of *p outside the loop…

Use of C/C++ has expanded into areas where optimization is required…
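The polling loop only works if the load of *p stays inside the loop. In standard C this is requested with the volatile qualifier. A minimal sketch follows; the name poll_key and the max_spins bound are illustrative assumptions, not taken from the slides:

```c
#include <stddef.h>

/* Hypothetical helper: spin until the location p holds a nonzero value.
 * The volatile qualifier tells the compiler that *p may change outside
 * the program's control, so the load must stay inside the loop.
 * max_spins is an illustrative bound so the sketch always terminates. */
int poll_key(volatile const int *p, size_t max_spins)
{
    int t = 0;
    while (max_spins-- > 0 && !(t = *p))
        ;  /* busy-wait: *p is re-read on every iteration */
    return t;  /* 0 means no key arrived within max_spins polls */
}
```

Without volatile, a compiler is free to load *p once and spin on the cached value.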


Problems of C - Example

void vadd(double *a, double *b, double *c, int n)
{
    while (n--)
        *a++ = *b++ + *c++;
}

Would be easily vectorized & optimized in Fortran, but not in C:

Pointers
    The memory locations accessed through pointers are not clear (unlike for arrays…)

Aliasing
    C does not guarantee that arrays passed into a subroutine do not overlap
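To see why overlap blocks vectorization, consider calling vadd with a and b pointing into the same array, one element apart. The driver below is an illustrative sketch (overlapped_last is a made-up name); vadd itself is the routine from the slide:

```c
void vadd(double *a, double *b, double *c, int n)
{
    while (n--)
        *a++ = *b++ + *c++;
}

/* Illustrative driver (not from the slides): a overlaps b, shifted by one. */
double overlapped_last(void)
{
    double x[5] = {1.0, 0.0, 0.0, 0.0, 0.0};
    double ones[4] = {1.0, 1.0, 1.0, 1.0};
    /* a = &x[1], b = &x[0]: each store feeds the next iteration's load. */
    vadd(x + 1, x, ones, 4);
    return x[4];   /* sequential semantics give the running sum */
}
```

Executed sequentially, x becomes 1, 2, 3, 4, 5. A vector version that loaded all of b before storing a would leave 1.0 in x[4], so the compiler cannot vectorize unless it can rule this call out.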


Problems of C – Example (cont.)

void vadd(double *a, double *b, double *c, int n)
{
    while (n--)
        *a++ = *b++ + *c++;
}

Side-effect operators
    Pre/post-increment operators conceal the index calculations for addressing arrays
    The optimizer must spend extra effort on transformations (induction-variable substitution…)

Loops
    Fortran loops provide values and restrictions that simplify optimization
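Induction-variable substitution on vadd would produce roughly the following form, in which the subscripts are explicit. This is a hand-written sketch of the result, and vadd_indexed is a hypothetical name:

```c
/* Same computation as vadd, with the hidden induction variable i made
 * explicit and the three pointer bumps replaced by subscripts. */
void vadd_indexed(double *a, double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

The loop now has a visible index, start value, and bound, which is what a Fortran-style dependence tester expects. The aliasing question between a, b, and c is still open.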


Pointers

The most difficult challenge for C optimizers is unrestricted pointers:

Hard to resolve indirect pointer accesses: a pointer variable can point to different memory locations during its use

Aliased memory locations: a memory location can be accessed through more than one pointer variable at any given time

The result is much more difficult and expensive dependence testing


Pointers dependence testing

For dependence testing, the compiler can replace pointer indirections such as *p with subscripted references n[e] to a pseudo-array n.

But another pointer q might access the same storage, so its dereferences need to be replaced with references to the pseudo-array n too…

In the worst case, must assume that each pair of references is dependent!


Dependence testing strategies

Safety assertions: use compiler options / pragmas to indicate “disciplined” code
    Safe parameters: all pointer parameters point to independent storage
    Safe pointers: all pointer variables (parameter, local, global) point to independent storage

Whole-program analysis: without separate compilation, dependence analysis over the entire program is solvable, but still unsatisfactory
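Standard C later gained a per-parameter form of the "safe parameters" assertion: the C99 restrict qualifier. It is not mentioned in the slides; the sketch below (vadd_safe is a hypothetical name) only shows what such an assertion looks like in source:

```c
/* restrict is the programmer's promise that, within this call, the storage
 * written through a is not accessed through b or c. Calling vadd_safe with
 * overlapping arrays is undefined behavior, so the compiler may vectorize. */
void vadd_safe(double *restrict a, const double *restrict b,
               const double *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

As with the pragmas on the slide, the compiler trusts the assertion and does not check it.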


Naming and Structures

In Fortran, unlike C, a block of storage can be uniquely identified by a single name, which simplifies dependence analysis

Dependence analysis requires a single name for all references to the same location

C’s constructs complicate this:
    p;  *p;  **p;  *(p+4);  *(&p+4);

[Figure: storage diagram labeling the locations named by &p, p, *p, **p and p[1]]


Naming and Structures (cont.)

Troublesome structures
    Naming problem: what is the name of ‘a.b’?
    Unions: allow objects of different sizes to overlap the same storage

Need to reduce references to a common unit of the smallest possible storage
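A small union makes the overlap concrete. The type and function names below are illustrative assumptions:

```c
#include <stdint.h>

/* Illustrative union: a 32-bit word and its four bytes share storage. */
union word {
    uint32_t w;
    uint8_t  b[4];
};

/* Writes through one member and reads through the other. A dependence
 * tester must treat u->b[k] and u->w as references to overlapping bytes. */
uint32_t set_byte_then_read_word(union word *u, int k, uint8_t v)
{
    u->w = 0;
    u->b[k] = v;      /* modifies one byte of w */
    return u->w;      /* nonzero whenever v is nonzero, on any byte order */
}
```

Reducing both references to byte granularity shows that the store to b[k] and the load of w touch the same storage, so there is a true dependence between them.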


Loops

Lack of constraints in C
    Jumping into the loop body is permitted
    The induction variable (if there is one) can be modified in the body of the loop
    The loop increment value may also be changed
    The conditions controlling the initiation, increment, and termination of the loop have no constraints on their form

It might be hard to identify a loop variable with start and end values


Loops (cont.)

Rewrite a while loop as a DO loop when:
    There is exactly one induction variable
    It is initialized with the same value on all paths into the loop
    It has one and only one increment in the loop
    The increment is executed on every iteration
    The termination condition matches the form a DO loop expects
    There are no jumps into the loop body from outside
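As a sketch of a loop that meets these conditions, the while loop below can be rewritten as a counted loop without changing its meaning. The names sum_while and sum_for are hypothetical:

```c
/* A while loop that meets the conditions: one induction variable i,
 * a single initialization, one unconditional increment per iteration,
 * and a simple bound test. */
long sum_while(const int *x, int n)
{
    long s = 0;
    int i = 0;
    while (i < n) {
        s += x[i];
        i++;            /* executed on every iteration */
    }
    return s;
}

/* The equivalent counted loop, the C analogue of DO i = 0, n-1. */
long sum_for(const int *x, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

A break, a conditional increment, or a goto into the body would each violate one of the conditions and block the rewrite.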


Scoping and Statics

Scoping rules might create extra aliasing
    Handled by creating unique symbols for variables with the same name but different scopes

Static variables
    A file-static variable can only be modified by procedures that see its declaration
    Access to the variable can be determined from scope information in the symbol table
    Storing an address parameter in a static variable makes it accessible from any other procedure
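The last point can be sketched as follows; all names are illustrative:

```c
/* File-static: only functions in this file can name saved directly. */
static int *saved;

/* Stores an address parameter in a static variable... */
void remember(int *p) { saved = p; }

/* ...so a later call can modify the caller's object without receiving it
 * as an argument. An optimizer must assume bump() may write that object. */
void bump(void) { if (saved) ++*saved; }

int escape_demo(void)
{
    int local = 41;
    remember(&local);
    bump();             /* modifies local through the saved address */
    saved = 0;          /* do not leave a dangling pointer behind */
    return local;
}
```

Once the address has escaped into saved, the compiler cannot keep local in a register across the call to bump().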


Problematic C Dialects

Some C code might look “tidy”, as in Fortran

“Messy” style conventions:
    Use of pointers instead of arrays
    Use of address and dereference operators
    Use of side-effect operators

These idioms previously mapped directly to machine instructions; now they complicate the work of optimizers


Problematic C Dialects (cont.)

Titan C Compiler: remove side-effect operators!

But this requires enhancements in some transformations

Constant propagation
    Treat address operators as constants and propagate them where possible
    Replace a generic pointer in a dereference with the actual address

Expression simplification and recognition
    Need stronger recognition, within an expression, of which variable is actually the ‘base variable’
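A before/after pair illustrates the effect of propagating an address constant. It is a hand-written sketch of what the transformation aims for, not output of the Titan compiler:

```c
/* Before: the dereference goes through a generic pointer p. */
int sum_before(void)
{
    int a[4] = {1, 2, 3, 4};
    int *p = &a[0];         /* address operator: a compile-time-known base */
    int s = 0;
    for (int i = 0; i < 4; i++)
        s += *(p + i);
    return s;
}

/* After propagating &a[0] into the dereference, *(&a[0] + i) simplifies
 * to a[i], and the base variable a is visible to dependence analysis. */
int sum_after(void)
{
    int a[4] = {1, 2, 3, 4};
    int s = 0;
    for (int i = 0; i < 4; i++)
        s += a[i];
    return s;
}
```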


Problematic C Dialects (cont.)

Conversion of pointers into array references
    Simplifies dependence testing

Induction-variable substitution needs to be enhanced
    Deal with indirect access to array references through pointers
    Recognize and remove usage of side-effect operators


C Miscellaneous

Volatile variables
    Functions that use these variables are best left unoptimized
    Volatile code usually isn’t targeted for optimization (example: vector unit initialization)

setjmp and longjmp
    Commonly used for error handling: calling setjmp saves the current context in a buffer; longjmp can then be called to bypass a section of the calling chain
    Storing and loading the current state of the computation is complex when optimization is performed and variables are allocated to registers
    No optimization used!
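The error-handling pattern looks roughly like this. The function names are illustrative; setjmp, longjmp, and jmp_buf are the standard <setjmp.h> facilities:

```c
#include <setjmp.h>

static jmp_buf on_error;   /* context saved by setjmp */

static void deep_failure(int code)
{
    longjmp(on_error, code);   /* unwinds past every intermediate caller */
}

static void middle(int code)
{
    deep_failure(code);
}

/* Returns 0 on the normal path and 1 if longjmp was taken.
 * attempts is volatile because it is modified between setjmp and longjmp;
 * without volatile its value after the jump would be indeterminate if the
 * compiler had kept it in a register. */
int run_with_recovery(int code, int *attempts_out)
{
    volatile int attempts = 0;
    int failed = 0;
    if (setjmp(on_error) != 0) {
        failed = 1;            /* reached via longjmp */
    } else {
        attempts = 1;
        if (code != 0)
            middle(code);      /* does not return */
    }
    *attempts_out = attempts;
    return failed;
}
```

The volatile requirement is the register-allocation problem from the slide seen from the programmer's side.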


C Miscellaneous (cont.)

varargs and stdarg
    Variable number of arguments, e.g. printf(…)
    Implemented by a compiler directive:
        Save all register parameters to the stack
        Access them using pointer manipulation over the stack
    The pointer variable is an alias for many parameters in the program
    No optimization
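A minimal variadic function using the standard <stdarg.h> macros (sum_ints is an illustrative name):

```c
#include <stdarg.h>

/* Sums count int arguments. va_arg steps through the argument area much
 * like a pointer walking over saved parameters, which is why one va_list
 * can alias any of them. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```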


Hardware Design: Overview

In the past, HW design was done at the gate/transistor level

Today, HW design is language-based, similar to SW development

The level of abstraction may vary
    The current trend is high-level behavioral specification

Key factor: the compiler’s efficiency


Abstraction levels for HW Design

Circuit / physical level
    Diagrams of electronic components

Logic level
    Boolean equations

Register transfer level (RTL)
    Control state transitions and data transfers
    Synthesis: convert RTL to gates and flip-flops

System level
    Behavior expressed by variables, no timing
    Behavioral synthesis: select arithmetic units, impose timing

Most Common!


Hardware Design

Behavioral synthesis is really a compilation problem

Two fundamental tasks
    Verification (simulation)
    Implementation (synthesis)

Optimization is essential for both:
    HW simulation is inherently very slow
    Efficient synthesis raises the device’s value


Hardware Description Languages

Two main HDLs are used today:

Verilog
    Supports all abstraction levels; mostly used for gates and RTL
    Extended C

VHDL
    Extended Ada

The primitives and extensions used for HW description provide similar functionality


Verilog extensions

Multi-valued logic: 0, 1, x, z
    x = unknown state, z = high impedance (undriven)
    E.g. division by zero produces the x state
    Higher data types (integer…) are vectors of multi-valued bits
    Operations with x will result in the x state -> the simulator can’t execute addition directly…

Reactivity
    Automatic propagation of changes
        always @(b or c)
            a = b + c;
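To illustrate why a simulator cannot map four-valued operations straight onto machine instructions, here is a C model of a single four-valued AND. The enum and function names are assumptions made for this sketch:

```c
/* The four Verilog bit values; the enum names are illustrative. */
typedef enum { V0, V1, VX, VZ } bit4;

/* Bitwise AND on one four-valued bit: a 0 on either input forces 0;
 * otherwise any x or z input makes the result x. */
bit4 and4(bit4 a, bit4 b)
{
    if (a == V0 || b == V0) return V0;
    if (a == V1 && b == V1) return V1;
    return VX;
}
```

Every bit of every operation needs this kind of case analysis unless the simulator can show that no unknowns are present.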


Verilog extensions (cont.)

Objects
    A specific area of silicon, with its own state and registers
    Semantics differ from those of C function calls
    Data encapsulation using module

Connectivity
    Continuous passing of information
    Input ports and output ports

module add(a,b,c)
    output a;
    input b, c;
    integer a, b, c;
    always @(b or c)
        a = b + c;
endmodule


Verilog extensions (cont.)

Instantiation
    Verilog only allows static instantiation
    Each instance is a different area of the silicon
        integer x, y, z;
        add adder1(x,y,z);

Vector operations
    Other data structures can be viewed as vectors of scalars
    Bit selection: A[1]
    Vector concatenation: {A[0], A[1:15]}


Optimization in Verilog

Advantages                  Disadvantages
No aliasing                 Non-procedural continuation semantics
Restricted subscripts       Lack of loops (only implicitly using “always” blocks)
No separate compilation     HW design is large


Optimizing Simulation

Philosophy:
    A higher abstraction level is more efficient!
    Details consume simulation time and obscure behavioral functionality

Next…
    Optimization techniques

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    always @(b or c)
        a = b + c;
endmodule


Inlining Modules

Data encapsulation:
    Hides details from the programmer
    Hides optimization information from the compiler

HDLs have two properties that make module inlining simpler
    The whole design is reachable at one time
    Recursion is not permitted, so modules can be inlined in linear time using topological order

No use inlining above the level of functional units


Inlining example - before

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry;
    add2 add_l(a[0:1], 0, b[0:1], c[0:1], carry);
    add2 add_r(a[2:3], carry, b[2:3], c[2:3], 0);
endmodule

module add2(sum, c_out, op1, op2, c_in)
    input op1[0:1], op2[0:1], c_in;
    output sum[0:1], c_out;
    wire carry;
    add1 add_l(sum[0], carry, op1[0], op2[0], c_in);
    add1 add_r(sum[1], c_out, op1[1], op2[1], carry);
endmodule

module add1(sum, c_out, op1, op2, c_in)
    input op1, op2, c_in;
    output sum, c_out;
    always @(op1 or op2 or c_in) begin
        sum = op1 ^ op2 ^ c_in;
        c_out = (op1&op2) | (op2&c_in) | (op1&c_in);
    end
endmodule


Inlining example - after

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(b[1] or c[1] or carry) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1])|(c[1]&carry)|(carry&b[1]);
    end
    always @(b[0] or c[0] or temp) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0])|(c[0]&temp)|(temp&b[0]);
    end
    always @(b[3] or c[3] or 0) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3])|(c[3]&0)|(0&b[3]);
    end
    always @(b[2] or c[2] or temp1) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2])|(c[2]&temp1)|(temp1&b[2]);
    end
endmodule


Execution Ordering

The order of statement execution can dramatically affect efficiency!

HW is fast, thanks to triggering on each bit’s change

SW simulation cannot afford to track individual bits
    Memory overhead
    May consider all bits as a group

Execute blocks in topological order based on the dependence graph of individual array elements
    No memory overhead


Ordering example

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(b[3] or c[3] or 0) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3])|(c[3]&0)|(0&b[3]);
    end
    always @(b[2] or c[2] or temp1) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2])|(c[2]&temp1)|(temp1&b[2]);
    end
    always @(b[1] or c[1] or carry) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1])|(c[1]&carry)|(carry&b[1]);
    end
    always @(b[0] or c[0] or temp) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0])|(c[0]&temp)|(temp&b[0]);
    end
endmodule


Dynamic vs. Static Scheduling

Dynamic scheduling
    Dynamically track changes in values and propagate them
    Naturally mimics hardware
    Overhead of change checks, especially if performed per bit

Static scheduling
    Based on a topological model
    Blindly sweeps through all values for all objects regardless of changes
    No need for change checks
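The two schedules can be contrasted on the 4-bit ripple-carry adder with a small C model. This is a sketch under stated assumptions: bit 0 is the least significant bit here (the slides number bits the other way round), and the function names are invented for illustration:

```c
#include <stdbool.h>

#define NBITS 4

/* One full-adder block; carry[i] is its carry-in, carry[i + 1] its carry-out. */
static void eval_block(int i, const bool b[], const bool c[],
                       bool carry[], bool a[])
{
    bool cin = carry[i];
    a[i] = b[i] ^ c[i] ^ cin;
    carry[i + 1] = (b[i] & c[i]) | (c[i] & cin) | (cin & b[i]);
}

/* Static schedule: evaluate every block once, in topological order of
 * the carry chain, with no change checks. Returns the 4-bit sum. */
unsigned static_sweep(unsigned bv, unsigned cv)
{
    bool b[NBITS], c[NBITS], a[NBITS], carry[NBITS + 1] = {false};
    unsigned sum = 0;
    for (int i = 0; i < NBITS; i++) {
        b[i] = (bv >> i) & 1u;
        c[i] = (cv >> i) & 1u;
    }
    for (int i = 0; i < NBITS; i++)
        eval_block(i, b, c, carry, a);
    for (int i = 0; i < NBITS; i++)
        sum |= (unsigned)a[i] << i;
    return sum;
}

/* Dynamic schedule: after input bit k changes, re-evaluate block k and
 * keep going only while the carry out of a block actually changes.
 * Returns the number of blocks evaluated. */
int dynamic_update(int k, const bool b[], const bool c[],
                   bool carry[], bool a[])
{
    int evaluated = 0;
    for (int i = k; i < NBITS; i++) {
        bool old = carry[i + 1];
        eval_block(i, b, c, carry, a);
        evaluated++;
        if (carry[i + 1] == old)   /* change check: stop propagation */
            break;
    }
    return evaluated;
}
```

The static sweep always does four block evaluations and no comparisons. The dynamic update may do fewer evaluations but pays for a comparison at every step.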


Dynamic vs. Static Scheduling (cont.)

Prefer static scheduling for a highly active circuit
    When changes are frequent, there is no need to check before updating
    Can we really tell in advance?

Common strategy: use static analysis to guide dynamic scheduling; this improves simulation performance!


Fusing always blocks

The high cost of change checks motivates fusing always blocks

Blocks with the same trigger conditions can be fused
    Saves the overhead of invoking blocks
    Most useful for synchronous designs

But fusion may change the design’s output
    Still semantically correct
    A bad surprise for the designer…
    Simulators try to avoid output changes


Fusion example

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(posedge(clk)) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3])|(c[3]&0)|(0&b[3]);
    end
    always @(posedge(clk)) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2])|(c[2]&temp1)|(temp1&b[2]);
    end
    always @(posedge(clk)) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1])|(c[1]&carry)|(carry&b[1]);
    end
    always @(posedge(clk)) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0])|(c[0]&temp)|(temp&b[0]);
    end
endmodule


Vectorizing always block

Regrouping low-level operations back together into higher-level abstractions

Vectorizing the bit operations


Vectorizing example

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3]) begin
        a[3] = b[3] ^ c[3] ^ carry[3];
        carry[2] = (b[3]&c[3])|(c[3]&carry[3])|(carry[3]&b[3]);
    end
    always @(b[2] or c[2] or carry[2]) begin
        a[2] = b[2] ^ c[2] ^ carry[2];
        carry[1] = (b[2]&c[2])|(c[2]&carry[2])|(carry[2]&b[2]);
    end
    always @(b[1] or c[1] or carry[1]) begin
        a[1] = b[1] ^ c[1] ^ carry[1];
        carry[0] = (b[1]&c[1])|(c[1]&carry[1])|(carry[1]&b[1]);
    end
    always @(b[0] or c[0] or carry[0]) begin
        a[0] = b[0] ^ c[0] ^ carry[0];
        cout = (b[0]&c[0])|(c[0]&carry[0])|(carry[0]&b[0]);
    end
endmodule

Scalar expansion


Vectorizing example – merge blocks

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3] or
             b[2] or c[2] or carry[2] or
             b[1] or c[1] or carry[1] or
             b[0] or c[0] or carry[0]) begin
        a[3] = b[3] ^ c[3] ^ carry[3];
        carry[2] = (b[3]&c[3])|(c[3]&carry[3])|(carry[3]&b[3]);
        a[2] = b[2] ^ c[2] ^ carry[2];
        carry[1] = (b[2]&c[2])|(c[2]&carry[2])|(carry[2]&b[2]);
        a[1] = b[1] ^ c[1] ^ carry[1];
        carry[0] = (b[1]&c[1])|(c[1]&carry[1])|(carry[1]&b[1]);
        a[0] = b[0] ^ c[0] ^ carry[0];
        cout = (b[0]&c[0])|(c[0]&carry[0])|(carry[0]&b[0]);
    end
endmodule


Vectorizing example – rearranged following dependencies

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3] or
             b[2] or c[2] or carry[2] or
             b[1] or c[1] or carry[1] or
             b[0] or c[0] or carry[0]) begin
        carry[2] = (b[3]&c[3])|(c[3]&carry[3])|(carry[3]&b[3]);
        carry[1] = (b[2]&c[2])|(c[2]&carry[2])|(carry[2]&b[2]);
        carry[0] = (b[1]&c[1])|(c[1]&carry[1])|(carry[1]&b[1]);
        cout = (b[0]&c[0])|(c[0]&carry[0])|(carry[0]&b[0]);
        a[3] = b[3] ^ c[3] ^ carry[3];
        a[2] = b[2] ^ c[2] ^ carry[2];
        a[1] = b[1] ^ c[1] ^ carry[1];
        a[0] = b[0] ^ c[0] ^ carry[0];
    end
endmodule

Can Vectorize!!


Vectorizing example – vectorize now

module adder(a, b, c)
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b or c or carry) begin
        carry[0:2] = (b[1:3]&c[1:3]) | (c[1:3]&carry[1:3]) | (carry[1:3]&b[1:3]);
        cout = (b[0]&c[0])|(c[0]&carry[0])|(carry[0]&b[0]);
        a = b ^ c ^ carry;
    end
endmodule

Pattern Matching

always @(b or c) begin
    a = b + c;
end


Two state vs. four state logic

Extra overhead in 4-state HW
    Do we want HW with unknown states?!
    2-state logic can be 3-5 times faster!

But…
    It is hard to find regions without unknowns
    Use interprocedural analysis

Check for unknowns, but default to 2-state
    The test for detecting an unknown is low cost: 2-3 instructions
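One way a simulator might implement "check for unknowns, but default to 2-state" is to keep a separate mask of unknown bits next to the value. The layout and names below are assumptions for this sketch, not a description of any particular simulator:

```c
#include <stdint.h>

/* Illustrative 4-state vector: bit i is unknown (x or z) when bit i of
 * unk is set; otherwise its value is bit i of val. */
typedef struct {
    uint32_t val;
    uint32_t unk;
} vec4;

/* Addition that defaults to 2-state arithmetic. The check for unknowns
 * is one OR and one compare-and-branch. */
vec4 add4(vec4 a, vec4 b)
{
    vec4 r;
    if ((a.unk | b.unk) == 0) {      /* common case: no unknown bits */
        r.val = a.val + b.val;       /* native machine add */
        r.unk = 0;
    } else {                         /* any unknown operand bit: all x */
        r.val = 0;
        r.unk = UINT32_MAX;
    }
    return r;
}
```

The fast path costs only the mask test on top of the native add, which matches the slide's estimate of a few instructions.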


Rewriting block conditions

Semantics of a synchronous block: changes are updated with every clock update

In Verilog: recompute results on every clock tick

always @(posedge(clk)) begin
    sum = op1 ^ op2 ^ c_in;
    c_out = (op1 & op2) | (op2 & c_in) | (c_in & op1);
end


Rewriting block conditions (cont.)

Actually, HW only computes when an input changes
    Clocking is simply a matter of gating results through a register

Rewrite the code to achieve the same in the simulator
    Avoid excessive computations

always @(op1 or op2 or c_in) begin
    t_sum = op1 ^ op2 ^ c_in;
    t_c_out = (op1 & op2) | …
end

always @(posedge(clk)) begin
    sum = t_sum;
    c_out = t_c_out;
end
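The saving can be seen in a small C model of the two rewritten blocks. The struct and function names are illustrative, and comb_evals exists only to count how often the logic is recomputed:

```c
#include <stdbool.h>

/* Simulator-side model of the rewritten blocks; names are illustrative. */
typedef struct {
    bool op1, op2, c_in;      /* inputs */
    bool t_sum, t_c_out;      /* combinational temporaries */
    bool sum, c_out;          /* registered outputs */
    int  comb_evals;          /* how often the logic was recomputed */
} adder_sim;

/* Plays the role of always @(op1 or op2 or c_in): runs only on a change. */
void set_inputs(adder_sim *s, bool op1, bool op2, bool c_in)
{
    if (op1 == s->op1 && op2 == s->op2 && c_in == s->c_in)
        return;
    s->op1 = op1; s->op2 = op2; s->c_in = c_in;
    s->t_sum = op1 ^ op2 ^ c_in;
    s->t_c_out = (op1 & op2) | (op2 & c_in) | (c_in & op1);
    s->comb_evals++;
}

/* Plays the role of always @(posedge(clk)): just latches the temporaries. */
void clock_edge(adder_sim *s)
{
    s->sum = s->t_sum;
    s->c_out = s->t_c_out;
}
```

With stable inputs, any number of clock edges costs two copies each and no recomputation of the adder logic.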


Using Basic Compiler Optimizations

Mostly useful for high-level abstraction

Optimize inside an always block
    Control flow between blocks is too complex

Useful methods:
    Loop vectorization
    Constant propagation
    Dead code elimination
    Common subexpression elimination


Synthesis Optimization

Goal: automatically insert the details

Analogous to standard compilers

Harder than standard compilers
    Not targeted towards a fixed target
    Many goals:
        Minimize cycle time
        Minimize area
        Minimize power consumption


Basic Framework

Fundamental problem: reduce this computation to a series of gates

    for(i=0; i<100; i++)
        t = t + a[i]*b[i];

The simple approach: convert add/multiply into gates, optimize later

Not so easy in the real world
    Converting high-level code directly to gates is inefficient; better to select components first
    There are various ways to perform multiply, add, etc.
    Need a good library to select from
    Need to make the optimal selection


Basic Framework (cont.)

One option:
    t = ADD(t, MUL(LOAD(A[i]), LOAD(B[i])));
    Will take 3 cycles

Another option:
    t = MAC(t, LOAD(A[i]), LOAD(B[i]));
    Only 2 cycles…

A lot more options for the unrolled version:
    for(i=0; i<100; i+=4)
        t = t + a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
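For reference, the rolled and unrolled loops in plain C. The function names are illustrative, and int data is an assumption made so that both orders of summation give exactly the same result:

```c
/* Reference dot product, as in the slides' loop. */
int dot(const int *a, const int *b, int n)
{
    int t = 0;
    for (int i = 0; i < n; i++)
        t = t + a[i] * b[i];
    return t;
}

/* Unrolled by 4; assumes n is a multiple of 4 (100 in the slides). */
int dot_unrolled4(const int *a, const int *b, int n)
{
    int t = 0;
    for (int i = 0; i < n; i += 4)
        t = t + a[i] * b[i] + a[i + 1] * b[i + 1]
              + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    return t;
}
```

The unrolled body exposes four independent multiplies per iteration, which is what gives a synthesis tool more choices of functional units.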


Basic Framework (cont.)

Analogous to instruction selection for CISC architectures
    Extensively researched
    Fast & effective tree-matching algorithms, applicable to synthesis

For the undefined target
    Algorithms will adapt to the current HW
    Achieve multiple goals
        Bound the types & number of functional units
        Just need to minimize time now…


Loop Transformations

Execution order affects functional unit utilization & synthesized HW efficiency

for(i=0; i<100; i++) {
    t[i] = 0;
    for(j=0; j<3; j++)
        t[i] = t[i] + (a[i-j]>>2);
}

for(i=0; i<100; i++) {
    o[i] = 0;
    for(j=0; j<100; j++)
        o[i] = o[i] + m[i][j] * t[j];
}


Loop Transformations (cont.)

Distribute the loops & rearrange topologically:

for(i=0; i<100; i++)
    t[i] = 0;

for(i=0; i<100; i++)
    o[i] = 0;

for(i=0; i<100; i++)
    for(j=0; j<3; j++)
        t[i] = t[i] + (a[i-j] >> 2);

for(i=0; i<100; i++)
    for(j=0; j<100; j++)
        o[i] = o[i] + m[i][j] * t[j];

No improvement so far…


Loop Transformations (cont.)

Let’s try some fusion…

Can you see the interchange?

for(i=0; i<100; i++)
    o[i] = 0;

for(i=0; i<100; i++) {
    t[i] = 0;
    for(j=0; j<3; j++)
        t[i] = t[i] + (a[i-j] >> 2);
    for(j=0; j<100; j++)
        o[j] = o[j] + m[j][i] * t[i];
}


Loop Transformations (cont.)

Scalar replacement on t

Exploit input dependence on a

for(i=0; i<100; i++)
    o[i] = 0;
a0 = a[0]; a1 = a[-1]; a2 = a[-2];
for(i=0; i<100; i++) {
    t = (a0>>2) + (a1>>2) + (a2>>2);
    a2 = a1; a1 = a0; a0 = a[i+1];
    for(j=0; j<100; j++)
        o[j] = o[j] + m[j][i] * t;
}
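The chain of transformations can be checked by running the original and final loop nests on the same data. The sketch below makes several assumptions: N is shrunk to 8, the data are int, a points two elements into a larger buffer so that a[-1] and a[-2] are valid, and a guard is added so the last iteration does not read past the end of a:

```c
#define N 8   /* small stand-in for the slides' 100 */

/* Original two loop nests. a must have two valid elements before a[0]. */
void original(const int *a, int m[N][N], int o[N])
{
    int t[N];
    for (int i = 0; i < N; i++) {
        t[i] = 0;
        for (int j = 0; j < 3; j++)
            t[i] = t[i] + (a[i - j] >> 2);
    }
    for (int i = 0; i < N; i++) {
        o[i] = 0;
        for (int j = 0; j < N; j++)
            o[i] = o[i] + m[i][j] * t[j];
    }
}

/* After distribution, interchange, fusion, and scalar replacement. */
void transformed(const int *a, int m[N][N], int o[N])
{
    for (int i = 0; i < N; i++)
        o[i] = 0;
    int a0 = a[0], a1 = a[-1], a2 = a[-2];
    for (int i = 0; i < N; i++) {
        int t = (a0 >> 2) + (a1 >> 2) + (a2 >> 2);
        a2 = a1; a1 = a0;
        if (i + 1 < N)
            a0 = a[i + 1];          /* guard: do not read past a[N-1] */
        for (int j = 0; j < N; j++)
            o[j] = o[j] + m[j][i] * t;
    }
}
```

In the transformed version the array t has disappeared and each element of a is loaded once instead of three times.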


Loop transformation - summary

Loop fusion: fuse 2 loops which use different functional units

Loop distribution: separate loops using the same functional units

Vectorization: when a functional unit can be pipelined

Loop interchange: mostly to help other transformations


Control and Data Flow

Von Neumann architecture
    Data flow = data movement among memory and registers
    Control flow = changes in the PC due to sequential execution and branches

Synthesized hardware
    Data flow = data movement among functional units
    Control flow = which functional unit should be active on what data at which time steps
    Requires a state machine


Control and Data Flow – special HW constructs

Wires
    Immediate data transfer

Latches
    Values hold throughout one clock cycle

Registers
    Static variables in C
    Held for one or more clock cycles

Memories
    Arrays in C
    Special registers: large, long lifetime


Memory Reduction

Memory access is slower than unit access, so strive to minimize the frequency & volume of access

Application of techniques
    Loop interchange
    Loop fusion
    Scalar replacement
    Strip mining
    Unroll and jam
    Prefetching


Summary

Application of dependence is not limited to Fortran!

The analysis framework can be adapted to C
    Ardent Titan compiler

Several techniques are useful for HW simulation & synthesis

Early stage of research…


Questions???

Thanks for listening…