Copy Propagation Optimization for VLIW DSP Processors with Distributed Register Files
Chung-Ju Wu, Sheng-Yuan Chen, Jenq-Kuen Lee
Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan
2
Outline
• Introduction & Background
• Issues & Motivations
• Enhanced Data Flow Analysis (EDFA)
-- Cost Models
-- Algorithm
-- Running Examples for Algorithm
• Experimental Results & Summary
3
Introduction & Background
• Multi-port design
-- Poor in area, access time, and power
• Modern VLIW DSP design
-- Cluster-based architecture
-- Distributed register files
-- Registers are no longer general purpose!
[Figure: a single register file shared by several functional units (FU), contrasted with a clustered design consisting of a Program Sequence Control Unit, a Scalar Unit with private registers (R), a memory interface, and two clusters. Each cluster contains a Load/Store Unit, an Arithmetic Unit, private registers (A and AC), and public registers (D).]
• Impact on Compiler
-- Instruction Scheduling/Register Allocation
-- Software Pipelining
-- Global Register Allocation
-- Data Flow Optimization
4
Distributed Register File: Impact on Compiler Techniques
• Instruction Scheduling & Register Allocation
-- "Instruction scheduling for clustered VLIW DSPs", R. Leupers, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2000.
-- "Register Allocation for VLIW DSP Processors with Irregular Register Files", Yung-Chia Lin, Yi-Ping You, and Jenq-Kuen Lee, in Proceedings of Compilers for Parallel Computers (CPC), Jan 2006.
• Software Pipelining
-- "Optimizing Loop Performance for Clustered VLIW Architectures", Yi Qian, Philip Sweany, Steve Carr, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2002.
• GRA (Global Register Allocation)
-- "Global Register Partitioning", Jason Hiser, Steve Carr, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2000.
• Data Flow Optimizations
-- Copy Propagation
-- Common Subexpression Elimination
5
Copy Propagation
• Replaces variables with earlier equivalents at compile time.
• Reduces some data dependencies.
• Exposes other optimizations available in the program.
• What happens if we apply conventional copy propagation on distributed register files?
Optimization: Copy Propagation
Original code:
(1) Y := X;
(2) Z := Y;
(3) C := A + B + Z;
After propagating X into statement (2):
(1) Y := X;
(2) Z := X;
(3) C := A + B + Z;
After propagating X into statement (3):
(1) Y := X;
(2) Z := X;
(3) C := A + B + X;
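The three-statement example above can be reproduced with a minimal Python sketch of conventional copy propagation on straight-line code. This is an illustration only, not the PAC compiler's implementation: the `(dest, expr)` statement encoding and the function name `copy_propagate` are assumptions made for this sketch, and it handles a single basic block with no control flow.

```python
import re

IDENT = r"[A-Za-z_]\w*"

def copy_propagate(stmts):
    """Forward copy propagation over straight-line (dest, expr) statements."""
    copies = {}  # variable -> earlier equivalent variable
    out = []
    for dest, expr in stmts:
        # Replace every used variable with its earlier equivalent, if one is known.
        new_expr = re.sub(IDENT, lambda m: copies.get(m.group(0), m.group(0)), expr)
        # Redefining dest invalidates any recorded copy that mentions it.
        copies = {k: v for k, v in copies.items() if k != dest and v != dest}
        # A statement of the form "dest := var" is a copy; remember it.
        if re.fullmatch(IDENT, new_expr) and new_expr != dest:
            copies[dest] = new_expr
        out.append((dest, new_expr))
    return out

program = [("Y", "X"), ("Z", "Y"), ("C", "A + B + Z")]
result = copy_propagate(program)
for dest, expr in result:
    print(f"{dest} := {expr};")
```

The sketch never asks where X, Y, and Z are stored, which is the gap the following slides address for distributed register files.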
6
PAC DSP Architecture
• Parallel Architecture Core
-- Developed by STC (System-on-chip Technology Center) of the Industrial Technology Research Institute in Taiwan.
-- Potentially for audio/video applications.
• PAC features
-- 32-bit, 5-way issue VLIW
-- Clustered design
-- Distributed register files
• Ports of the D register file
-- 3 read ports / 2 write ports for the D[0-15] register file
-- 1r 1w for Scalar
-- 2r 1w for ALU/LSU
• Presented at Microprocessor Forum 2006
7
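Motivation Example I (Cluster Communication)
[Figure: Cluster 1 and Cluster 2, each with an LSU, an ALU, and A, AC, and D register files. Variables x, t6, and t8 are placed in Cluster 1; t3 is placed in Cluster 2.]
What if C1<x,t6,t8> and C2<t3>?
x := t3
t3 + t6
t3 + t8
Once t3 is propagated in place of x, both additions read t3 from Cluster 2 while t6 and t8 reside in Cluster 1.
Performance anomaly!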
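A rough way to see the anomaly is to count operand reads that cross a cluster boundary before and after propagation. The Python sketch below makes several assumptions not stated on the slide: every operation executes in Cluster 1, the original code used x in both additions, and the copy x := t3 becomes dead and is removed after propagation. The function name and data layout are hypothetical.

```python
def cross_cluster_reads(ops, home):
    """Count operand reads whose variable lives in a different cluster than the op."""
    return sum(
        1
        for exec_cluster, operands in ops
        for var in operands
        if home[var] != exec_cluster
    )

# Variable placement from the slide: C1<x,t6,t8> and C2<t3>.
home = {"x": 1, "t6": 1, "t8": 1, "t3": 2}

# Each entry is (executing cluster, operands read); all ops assumed to run in Cluster 1.
before = [(1, ["t3"]), (1, ["x", "t6"]), (1, ["x", "t8"])]  # x := t3; x + t6; x + t8
after = [(1, ["t3", "t6"]), (1, ["t3", "t8"])]              # t3 + t6; t3 + t8

print(cross_cluster_reads(before, home), cross_cluster_reads(after, home))
```

Under these assumptions the propagated code removes one copy but doubles the cross-cluster reads, which is the kind of trade-off a cluster-unaware analysis cannot see.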
8
Motivation Example II (Register Nature with Irregular Accessibility)
[Figure: one cluster with an LSU, an ALU, and A, AC, and D register files; a1 is in the A register file, and ac1 and ac2 are in the AC register file.]
(3) ADD d4,a1,ac1
(5) ADD d6,a1,ac2
An instruction cannot use A and AC registers as operands at the same time!
The compiler must insert an extra copy assignment to move the data into the D register file:
(3.1) MOV d5,a1
(3.2) ADD d4,d5,ac1
(5.1) MOV d5,a1
(5.2) ADD d6,d5,ac2
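The extra MOV instructions above can be produced by a small legalization pass. The Python sketch below is only an illustration of that idea: classifying registers by name prefix, using d5 as a fixed scratch register, and the names `reg_file` and `legalize` are all assumptions for this example, not the PAC compiler's actual mechanism.

```python
def reg_file(reg):
    """Classify a register by name prefix (assumed convention: ac*, a*, d*)."""
    if reg.startswith("ac"):
        return "AC"
    if reg.startswith("a"):
        return "A"
    if reg.startswith("d"):
        return "D"
    raise ValueError(f"unknown register: {reg}")

def legalize(instrs, scratch="d5"):
    """Insert a MOV into a D register when an op reads A and AC registers together."""
    out = []
    for op, dst, src1, src2 in instrs:
        if {reg_file(src1), reg_file(src2)} == {"A", "AC"}:
            # Move the A operand into the scratch D register first.
            a_src = src1 if reg_file(src1) == "A" else src2
            out.append(("MOV", scratch, a_src))
            src1, src2 = (scratch if s == a_src else s for s in (src1, src2))
        out.append((op, dst, src1, src2))
    return out

code = [("ADD", "d4", "a1", "ac1"), ("ADD", "d6", "a1", "ac2")]
legal = legalize(code)
for ins in legal:
    print(ins[0], ",".join(ins[1:]))
```

Each inserted MOV costs an instruction, so a propagation that puts an A register next to an AC register can make the code slower rather than faster.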
9
Motivation Example III (Port Pressure)
10
• Cluster communication
-- If copy propagation occurs between clusters, it may add communication overhead.
• Register nature with irregular accessibility
-- Private registers are accessible only by their corresponding functional units.
• Port pressure
-- The limited number of read/write ports forces the scheduler to split code into different instruction packets.
Performance Anomaly
11
Enhanced Data Flow Analysis
Definition
At every propagation decision point, for every propagation from variable n to variable m, written (n, m), a data flow profit P is computed:
P = Gain(n, m) - Cost(n, m)
-- Gain(n, m): the cost reduced by applying copy propagation.
-- Cost(n, m): the penalty incurred if copy propagation is performed.
Gain() ≥ Cost(): good!
Gain() < Cost(): bad!
12
EDFA Cost Function Gain(n, m)
Gain(n, m) = [formula not preserved in this transcript; it is defined in terms of RCC(n, m) and ACA(c[j])]
-- RCC(n, m): the communication cost reduced by propagating n to m.
-- ACA(c[j]): the number of available copy assignments that can be removed along the path from n to m.
13
EDFA Cost Function Cost(n, m)
-- CBC(n, m): the cost of propagating across clusters.
-- RP(n, m): the extra copy assignments needed to move data between private registers.
-- PP(n, m) = [formula not preserved in this transcript]
Cost(n, m) = CBC(n, m) + RP(n, m) + PP(n, m)
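The cost sum and the profit test can be written as a short Python sketch. The class name `Propagation`, its fields, and the numbers are hypothetical and chosen only to exercise the formulas. Because the PP formula does not survive in the transcript, PP is treated here as a number supplied by the caller.

```python
from dataclasses import dataclass

@dataclass
class Propagation:
    """One candidate propagation (n, m) with its estimated terms, in cycles."""
    n: str
    m: str
    gain: int  # Gain(n, m)
    cbc: int   # CBC(n, m): cost of propagating across clusters
    rp: int    # RP(n, m): extra copies between private registers
    pp: int    # PP(n, m): supplied directly, since its formula is not given here

    def cost(self) -> int:
        # Cost(n, m) = CBC(n, m) + RP(n, m) + PP(n, m)
        return self.cbc + self.rp + self.pp

    def profit(self) -> int:
        # P = Gain(n, m) - Cost(n, m)
        return self.gain - self.cost()

    def is_profitable(self) -> bool:
        # Gain >= Cost is marked good on the slides, Gain < Cost bad.
        return self.gain >= self.cost()

# Hypothetical values, not taken from the slides.
same_cluster = Propagation("b", "c", gain=2, cbc=0, rp=1, pp=0)
cross_cluster = Propagation("a", "c", gain=2, cbc=3, rp=0, pp=1)

for p in (same_cluster, cross_cluster):
    print(p.n, p.m, p.profit(), p.is_profitable())
```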
14
EDFA Algorithm
Enhanced Data Flow Analysis (EDFA) algorithm:
• Data flows between nodes form an acyclic Data Flow Graph.
• Run conventional copy propagation analysis without actually propagating any variables.
• Collect all possible propagation paths, recalculate the profit of each, and output the revised result.
[Figure: data flow graph over the variables in statements, with nodes labeled I and M.]
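Collecting every propagation path in an acyclic data flow graph can be sketched as a depth-first walk. The Python below is a minimal illustration under assumptions: the graph is an adjacency dictionary from a variable to the variables it can be propagated to, and the function name `propagation_paths` is invented for this example. It terminates only because the graph is acyclic, as the slide states.

```python
def propagation_paths(graph, source):
    """Enumerate every path that starts at source in an acyclic graph."""
    paths = []

    def walk(node, path):
        for succ in graph.get(node, []):
            new_path = path + [succ]
            paths.append(new_path)  # every prefix is itself a candidate path
            walk(succ, new_path)

    walk(source, [source])
    return paths

# Copy chain from the earlier example: Y := X; Z := Y.
graph = {"X": ["Y"], "Y": ["Z"]}
paths = propagation_paths(graph, "X")
print(paths)
```

Each collected path is then a candidate (n, m) pair whose profit can be evaluated before any code is rewritten.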
15
EDFA Estimation Algorithm
• Edges are shared on the propagation tree.
• Traverse each propagation path and revise the cost.
• This evaluates the gain more accurately.
[Figure label: half cost]
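One possible reading of edge sharing and the "half cost" label is that when two propagation paths use the same tree edge, each path is charged half of that edge's cost. The Python sketch below implements that reading only. The even split, the function name `shared_edge_costs`, and the example costs are assumptions for illustration and are not confirmed by the slides.

```python
from collections import Counter
from fractions import Fraction

def shared_edge_costs(paths, edge_cost):
    """Charge each path its edges' costs, splitting shared edges evenly."""
    edges_of = [list(zip(p, p[1:])) for p in paths]
    # Number of paths that use each edge.
    use_count = Counter(e for edges in edges_of for e in edges)
    return [
        sum(Fraction(edge_cost[e], use_count[e]) for e in edges)
        for edges in edges_of
    ]

# Two paths share the edge (a, b); hypothetical costs in cycles.
paths = [["a", "b", "c"], ["a", "b", "d"]]
edge_cost = {("a", "b"): 2, ("b", "c"): 1, ("b", "d"): 1}
costs = shared_edge_costs(paths, edge_cost)
print(costs)
```

Without the split, each path would be charged the full 2 cycles for the shared edge and the total cost would be overestimated.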
16
Example
P = Gain(n, m) - Cost(n, m)
Communication between clusters: 3 cycles
Each instruction costs: 1 cycle
[Figure: Cluster 1 and Cluster 2, each with AC, A, and D register files and units labeled M, I, LSU, and ALU.]
P_{a-c} = 1 - 3 = -2
P_{b-c} = 1 - 0 = 1 (better!)
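The comparison in this example amounts to computing the profit of each candidate and keeping the best one that is not negative. The Python sketch below uses the gain and cost values shown on the slide (1 and 3 for a, 1 and 0 for b). The function name `best_source` and the dictionary layout are assumptions for illustration.

```python
def best_source(candidates):
    """Return (name, profit) of the most profitable candidate, or None if all lose."""
    if not candidates:
        return None
    profits = {name: gain - cost for name, (gain, cost) in candidates.items()}
    best = max(profits, key=profits.get)
    return (best, profits[best]) if profits[best] >= 0 else None

# (gain, cost) in cycles for propagating a or b to c.
candidates = {"a": (1, 3), "b": (1, 0)}
choice = best_source(candidates)
print(choice)
```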
17
Example (cont'd)
P = Gain(n, m) - Cost(n, m)
Communication between clusters: 3 cycles
Each instruction costs: 1 cycle
[Figure: Cluster 1 and Cluster 2, each with AC, A, and D register files and units labeled M, I, LSU, and ALU.]
P_{a-c} = 6 - 1 = 5 (good!)
18
PACDSP Compiler
• Based on the Open Research Compiler (ORC)
• Intermediate representation: WHIRL (Winning Hierarchical Intermediate Representation Language)
• Low Power Optimization (ongoing work), TODAES'06
• Register Allocation
-- SA (Simulated Annealing), LCPC'05
-- PALF (Ping-pong Aware Local Favorable), CPC'06
• EBO carries out peephole optimizations.
• EDFA is implemented in the EBO phase.
Experiment Platform
[Figure: PAC DSP compiler flow from Source Codes to Assembly Codes. Phases shown: Front-end; WHIRL-level Optimizers (IPA, WOPT, LNO, …); Lowering & Code Selection & Intrinsics; SIMD & DSP Optimizations; Hyperblock Formation & IF-conversion; Loop Optimization; SWP; Cluster-aware GRA; Global Scheduling; Low Power Optimization; SA-LRA and PALF-LRA; EBO Peephole Optimization; Local Instruction Scheduling; Code Emission; Target Info.; PA; C code generator. Legend: New Phase for PAC; Specifically Tuned for PAC; Ported for Target Dependency. The EBO Peephole Optimization phase is highlighted.]
19
Experiment Platform
• Environment
-- PAC DSP Compiler (using ORC infrastructure)
-- PAC DSP binutils (modified GNU binutils)
-- Instruction Set Simulator (cycle accurate)
• Benchmark
-- DSP-stone
[Figure: System Software Development Suite, comprising the C Compiler, Assembler, Linker, Instruction Set Simulator, Profiler, Debugger, and Libraries.]
20
Experimental Result
[Figure: bar chart with a vertical axis from 40% to 120%, comparing No Propagation, Original Data Flow Analysis, and Enhanced Data Flow Analysis on the basic blocks mat1x3_BB_3, dot_product_BB_3, fir2dim_BB_5, real_update_BB_2, n_real_update_BB_3, matrix1_BB_5, matrix2_BB_5, matrix2_BB_7, and convolution_BB_2.]
21
Conclusion
• Summary
-- We address how the conventional data-flow equations behave over distributed register files.
-- We propose an Enhanced Data Flow Analysis (EDFA) framework that lets compilers avoid performance anomalies.
-- EDFA keeps the advantages of the copy propagation optimization.
• Future Work
-- Integrate with the common subexpression elimination module.
22
Thank You !!