Copy Propagation Optimizations for VLIW DSP Processors with...

Copy Propagation Optimization for VLIW DSP Processors with Distributed Register Files

Chung-Ju Wu, Sheng-Yuan Chen, Jenq-Kuen Lee

Department of Computer Science, National Tsing Hua University

Hsinchu, Taiwan

Outline

Introduction & Background

Issues & Motivations

Enhanced Data Flow Analysis (EDFA)

Cost Models

Algorithm

Running Examples for Algorithm

Experimental Results & Summary

Introduction & Background

Multi-port Design

Not good for area, access time, or power

Modern VLIW DSP design

Cluster-based architecture

Distributed register files

Registers are no longer general purpose!

[Figure: block diagrams. The clustered design shows cluster 1 and cluster 2, each labeled with a Load / Store Unit, an Arithmetic Unit, Private Registers (A), Private Registers (AC), and Public Registers (D), together with a Scalar Unit with Private Registers (R), a Program Sequence Control Unit, and a Memory Interface. A second diagram shows functional units (FU FU FU FU…) attached to a single Register File.]

• Impact on Compiler
-- Instruction Scheduling / Register Allocation
-- Software Pipelining
-- Global Register Allocation
-- Data Flow Optimization

Distributed Register File

Impact on Compiler Techniques

Instruction Scheduling & Register Allocation

“Instruction scheduling for clustered VLIW DSPs”, R. Leupers, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2000.

“Register Allocation for VLIW DSP Processors with Irregular Register Files”, Yung-Chia Lin, Yi-Ping You, and Jenq-Kuen Lee, in Proceedings of Compilers for Parallel Computers (CPC), Jan 2006.

Software Pipelining

“Optimizing Loop Performance for Clustered VLIW Architectures”, Yi Qian, Philip Sweany, Steve Carr, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2002.

GRA (Global Register Allocation)

“Global Register Partitioning”, Jason Hiser, Steve Carr, in Proceedings International Conference on Parallel Architectures and Compilation Techniques (PACT), 2000.

Data Flow Optimizations

Copy Propagation

Common Subexpression Elimination

Copy Propagation

Replaces variables with their earlier equivalents at compile time.

Reduces data dependencies.

Exposes further optimization opportunities in the program.

What happens if we apply conventional copy propagation to distributed register files?

Optimization: Copy Propagation

Original code:

(1) Y := X;
(2) Z := Y;
(3) C := A + B + Z;

After propagating X into statement (2):

(1) Y := X;
(2) Z := X;
(3) C := A + B + Z;

After propagating X into statement (3):

(1) Y := X;
(2) Z := X;
(3) C := A + B + X;
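The three-statement example can be reproduced with a small sketch of conventional copy propagation over straight-line code. This is a minimal illustration, not the compiler's implementation: the statement encoding (a destination plus a list of operand names) and the function name `propagate_copies` are assumptions made for the example.

```python
def propagate_copies(statements):
    """Conventional copy propagation over straight-line code.

    Each statement is (dest, operands). A statement with exactly one
    operand is treated as a copy, e.g. ("Y", ["X"]) means Y := X.
    This encoding is hypothetical and exists only for the sketch.
    """
    copies = {}   # dest -> earliest equivalent source that is still valid
    result = []
    for dest, operands in statements:
        # Replace every operand with its earliest known equivalent.
        new_operands = [copies.get(v, v) for v in operands]
        # Writing dest invalidates any recorded copy that defines or reads dest.
        copies = {d: s for d, s in copies.items() if d != dest and s != dest}
        # Record the statement as a copy if it has a single, different source.
        if len(new_operands) == 1 and new_operands[0] != dest:
            copies[dest] = new_operands[0]
        result.append((dest, new_operands))
    return result


program = [("Y", ["X"]), ("Z", ["Y"]), ("C", ["A", "B", "Z"])]
optimized = propagate_copies(program)
for dest, operands in optimized:
    print(f"{dest} := {' + '.join(operands)};")
```

The sketch ignores where X, Y, and Z are placed, which is exactly the blind spot the question above raises for distributed register files.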

PAC DSP Architecture

PAC: Parallel Architecture Core

Developed by STC (System-on-chip Technology Center) of the Industrial Technology Research Institute in Taiwan.

Potentially for audio/video applications.

PAC features

32-bit, 5-way issue VLIW

Clustered design

Distributed register files

Ports of the D register file

3 read ports / 2 write ports for the D[0-15] register file

1r 1w for Scalar

2r 1w for ALU/LSU

Presented at Microprocessor Forum 2006

Motivation Example I (Cluster Communication)

[Figure: Cluster 1 and Cluster 2, each with an LSU, an ALU, and A, AC, and D register files. x, t6, and t8 are placed in Cluster 1; t3 is placed in Cluster 2.]

What if C1<x,t6,t8> and C2<t3>?

x := t3

t3 + t6

t3 + t8

Performance anomaly!

Motivation Example II (Register Nature with Irregular Accessibility)

[Figure: one cluster with an LSU, an ALU, and A, AC, and D register files; labels a1 and ac1, ac2 mark the registers involved.]

(3) ADD d4,a1,ac1

(5) ADD d6,a1,ac2

An instruction cannot use A and AC registers as operands at the same time!

The compiler must insert an extra copy assignment to move the data into the D register file:

(3.1) MOV d5,a1

(3.2) ADD d4,d5,ac1

(5.1) MOV d5,a1

(5.2) ADD d6,d5,ac2
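The rewrite from (3) and (5) to (3.1) through (5.2) can be sketched as a small legalization pass. This is a rough illustration under stated assumptions: the tuple encoding of instructions, the prefix-based register classification, and the fixed scratch register `d5` are all hypothetical choices for the example, not the PAC DSP compiler's actual mechanism.

```python
def register_file(reg):
    """Classify a register by name prefix (assumed naming: a*, ac*, d*)."""
    if reg.startswith("ac"):
        return "AC"
    if reg.startswith("a"):
        return "A"
    if reg.startswith("d"):
        return "D"
    raise ValueError(f"unknown register: {reg}")


def legalize(instructions, scratch="d5"):
    """Insert a MOV into a D register when an instruction reads both A and AC."""
    out = []
    for op, dest, src1, src2 in instructions:
        if {register_file(src1), register_file(src2)} == {"A", "AC"}:
            # Move the A operand into the scratch D register, then read the copy.
            a_src = src1 if register_file(src1) == "A" else src2
            out.append(("MOV", scratch, a_src))
            src1 = scratch if src1 == a_src else src1
            src2 = scratch if src2 == a_src else src2
        out.append((op, dest, src1, src2))
    return out


code = [("ADD", "d4", "a1", "ac1"), ("ADD", "d6", "a1", "ac2")]
legal = legalize(code)
for instr in legal:
    print(instr[0], ",".join(instr[1:]))
```

Each offending ADD costs one extra MOV, so a propagation that replaces a D operand with an A operand can make the code longer rather than shorter.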

Motivation Example III (Port Pressure)

Cluster communication

If copy propagation crosses clusters, it can add inter-cluster communication overhead.

Register nature with irregular accessibility

Private registers are accessible only to their corresponding functional units.

Port pressure

The limited number of read/write ports forces the scheduler to split code into different instruction packets.

Performance Anomaly

Enhanced Data Flow Analysis

Definition

At every propagation decision point, for every propagation from variable n to variable m, written (n, m), a data flow profit P is computed:

P = Gain(n, m) − Cost(n, m)

Gain(n, m): the cost saved by applying copy propagation.

Cost(n, m): the penalty incurred if copy propagation is performed.

Gain(n, m) ≥ Cost(n, m): good!

Gain(n, m) < Cost(n, m): bad!
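The decision rule can be written down directly. The sketch below is only an illustration of the inequality; the function names are hypothetical, and the (gain, cost) pairs in the usage lines are the figures that appear on the deck's example slides.

```python
def profit(gain, cost):
    """Data flow profit P = Gain(n, m) - Cost(n, m)."""
    return gain - cost


def should_propagate(gain, cost):
    """Propagation is considered good when Gain >= Cost, i.e. P >= 0."""
    return profit(gain, cost) >= 0


# (gain, cost) pairs taken from the example slides of this deck.
for gain, cost in [(1, 3), (1, 0), (6, 1)]:
    print(gain, cost, profit(gain, cost), should_propagate(gain, cost))
```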

EDFA Cost Function Gain(n,m)

Gain(n, m) =

RCC(n, m): the communication cost saved by propagating n to m.

ACA(c[j]): returns the number of available copy assignments that can be eliminated along the n → m path.

EDFA Cost Function Cost(n,m)

CBC(n, m): returns the cost of propagating across clusters.

RP(n, m): returns the extra copy assignments needed to move data between private registers.

PP(n, m) =

Cost(n, m) = CBC(n, m) + RP(n, m) + PP(n, m)
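How the three terms add up can be sketched as follows. Everything about how each term is computed here is an assumption made for illustration: the `Placement` type, the fixed 3-cycle cluster penalty and 1-cycle copy cost (borrowed from the deck's example slides), and the rule that RP charges one copy when both values sit in different private files. The PP term is passed in by the caller because its formula is not given in the text.

```python
from dataclasses import dataclass

INTER_CLUSTER_CYCLES = 3            # assumed: cluster communication costs 3 cycles
COPY_CYCLES = 1                     # assumed: each instruction costs 1 cycle
PRIVATE_FILES = {"A", "AC", "R"}    # private register files named in the deck


@dataclass(frozen=True)
class Placement:
    """Hypothetical description of where a variable lives."""
    cluster: int
    reg_file: str


def cbc(src, dst):
    """Assumed CBC: fixed penalty when source and target are in different clusters."""
    return INTER_CLUSTER_CYCLES if src.cluster != dst.cluster else 0


def rp(src, dst):
    """Assumed RP: one extra copy when both values live in different private files."""
    both_private = src.reg_file in PRIVATE_FILES and dst.reg_file in PRIVATE_FILES
    return COPY_CYCLES if both_private and src.reg_file != dst.reg_file else 0


def cost(src, dst, pp=0):
    """Cost(n, m) = CBC(n, m) + RP(n, m) + PP(n, m); PP is supplied by the caller."""
    return cbc(src, dst) + rp(src, dst) + pp


print(cost(Placement(1, "D"), Placement(2, "D")))    # cross-cluster only
print(cost(Placement(1, "A"), Placement(1, "AC")))   # private-to-private only
```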

EDFA algorithm

Enhanced Data Flow Analysis (EDFA) algorithm

Data flows between nodes form an acyclic Data Flow Graph.

Run the conventional copy propagation analysis without actually propagating variables.

Collect all possible propagation paths, recalculate the profit of each, and output the revised result.

[Figure: two data flow graphs over the variables in statements; nodes carry I and M labels.]
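The steps of the EDFA algorithm (analyze without rewriting, collect the candidate propagation paths, recompute the profit, then decide) might be sketched like this. It is a rough illustration, not the authors' implementation: the `copy_of` map, both function names, and the copy chain c := b, b := a are hypothetical, and the profit values reuse the figures from the deck's example slide.

```python
def copy_chain(var, copy_of):
    """Return every earlier equivalent of var, nearest first.

    copy_of maps a copy target to its source, as recorded by a conventional
    analysis that has not rewritten anything yet (hypothetical encoding).
    """
    chain, seen = [], {var}
    while var in copy_of and copy_of[var] not in seen:
        var = copy_of[var]
        chain.append(var)
        seen.add(var)
    return chain


def choose_source(var, copy_of, profit):
    """Pick the equivalent with the highest non-negative profit, else keep var."""
    best, best_p = None, None
    for candidate in copy_chain(var, copy_of):
        p = profit(candidate, var)
        if p >= 0 and (best_p is None or p > best_p):
            best, best_p = candidate, p
    return var if best is None else best


# Hypothetical chain: b := a, then c := b.
copy_of = {"c": "b", "b": "a"}
# Profit figures as on the example slide: a to c is -2, b to c is 1.
table = {("a", "c"): -2, ("b", "c"): 1}
chosen = choose_source("c", copy_of, lambda n, m: table[(n, m)])
print(chosen)
```

A conventional pass would always pick the earliest equivalent (a here); evaluating each path separately lets the pass stop at b when going further back does not pay off.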

EDFA Estimation Algorithm

Edges can be shared on the propagation tree.

Traverse each propagation path and revise the cost.

This evaluates the gain more accurately.

[Figure annotation: half cost]

Example

P = Gain(n, m) − Cost(n, m)

[Figure: Cluster 1 and Cluster 2, showing the AC, A, and D register files; legend labels M, I, LSU, ALU.]

P(a−c) = 1 − 3 = −2

P(b−c) = 1 − 0 = 1 (better!)

Communication between clusters: 3 cycles

Each instruction costs 1 cycle

Example (cont'd)

P = Gain(n, m) − Cost(n, m)

[Figure: Cluster 1 and Cluster 2, showing the AC, A, and D register files; legend labels M, I, LSU, ALU.]

P(a−c) = 6 − 1 = 5 (good!)

Communication between clusters: 3 cycles

Each instruction costs 1 cycle

Experiment Platform

PACDSP Compiler

Based on Open Research Compiler (ORC)

Intermediate Representation: WHIRL (Winning Hierarchical Intermediate Representation Language)

Low Power Optimization (ongoing work), TODAES’06

Register Allocation

SA (Simulated Annealing), LCPC’05

PALF (Ping-pong Aware Local Favorable), CPC’06

EBO carries out peephole optimizations.

EDFA is implemented in the EBO phase.

[Figure: compiler flow from Source Codes to Assembly Codes. Boxes shown include: Front-end; WHIRL-level Optimizers (IPA, WOPT, LNO, …); Lowering & Code Selection & Intrinsics; SIMD & DSP Optimizations; Hyperblock Formation & IF-conversion; Loop Optimization; SWP; Cluster-aware GRA; Global Scheduling; Low Power Optimization; SA-LRA / PALF-LRA; EBO Peephole Optimization; Local Instruction Scheduling; Code Emission; Target Info.; PA; C code generator. Legend: New Phase for PAC; Specifically Tuned for PAC; Ported for Target Dependency.]

Experiment Platform

Environment

PAC DSP Compiler (using ORC infrastructure)

PAC DSP binutils (modified GNU binutils)

Instruction Set Simulator

Cycle accurate

Benchmark

DSP-stone

System

Software Development Suite

[Figure: software development suite, with boxes labeled Profiler, Debugger, Libraries, Assembler, Linker, C Compiler, and Instruction Set Simulator.]

Experimental Result

[Figure: bar chart comparing No Propagation, Original Data Flow Analysis, and Enhanced Data Flow Analysis; vertical axis from 40% to 120%; benchmarks: mat1x3_BB_3, dot_product_BB_3, fir2dim_BB_5, real_update_BB_2, n_real_update_BB_3, matrix1_BB_5, matrix2_BB_5, matrix2_BB_7, convolution_BB_2.]

Conclusion

Summary

We revisit the conventional data-flow equations for distributed register files.

We propose an Enhanced Data Flow Analysis (EDFA) framework that lets compilers avoid performance anomalies.

EDFA keeps the advantages of copy propagation.

Future Work

Integrate with the common subexpression elimination module.

Thank You !!
