Design and Implementation of a TriCore Backend for the LLVM … · Register Allocation Virtual...
Transcript of Design and Implementation of a TriCore Backend for the LLVM … · Register Allocation Virtual...
Design and Implementation of a TriCore Backendfor the LLVM Compiler Framework
Studienarbeit
Christoph Erhardt
Friedrich-Alexander-Universitat Erlangen-Nurnberg
November 20, 2009
A TriCore Backend for LLVM (November 20, 2009) 1 – 25
Overview
Overview
The TriCore Processor Architecture
The LLVM Compiler Infrastructure
Design and Implementation of the Backend
Evaluation & Conclusion
A TriCore Backend for LLVM (November 20, 2009) Overview 2 – 25
Motivation What do we need it for?
TriCore chips are omnipresent around here:
QuadcopterHigh strikerCarolo Cup...
The RTSC (Real-Time Systems Compiler) project:
Operating system aware compiler for real-time applicationsProcesses atomic basic blocksBased on LLVM
RTSC should be able to generate TriCore machine code
A TriCore Backend for LLVM (November 20, 2009) Overview 3 – 25
Motivation What do we need it for?
TriCore chips are omnipresent around here:
QuadcopterHigh strikerCarolo Cup...
The RTSC (Real-Time Systems Compiler) project:
Operating system aware compiler for real-time applicationsProcesses atomic basic blocksBased on LLVM
RTSC should be able to generate TriCore machine code
A TriCore Backend for LLVM (November 20, 2009) Overview 3 – 25
The TriCore Processor Architecture Overview
Three-in-one architecture
Real-time microcontroller unit
DSP
Superscalar RISC processor
Basic features
Load/store architecture
32-bit data, address, and instruction words
Some special 16-bit instruction words for higher code density
Little-endian byte order
16 data + 16 address registers
A TriCore Backend for LLVM (November 20, 2009) The TriCore Processor Architecture 4 – 25
The TriCore Processor Architecture Overview
Three-in-one architecture
Real-time microcontroller unit
DSP
Superscalar RISC processor
Basic features
Load/store architecture
32-bit data, address, and instruction words
Some special 16-bit instruction words for higher code density
Little-endian byte order
16 data + 16 address registers
A TriCore Backend for LLVM (November 20, 2009) The TriCore Processor Architecture 4 – 25
Peculiarities Some things that TriCore handles in an unusual way
Strict distinction between data and address registers:
Also reflected in the calling conventionsSerious problem for the compiler!
Data registers are also used for floating-point operands
Special DSP-oriented instructions and addressing modes
Task/context model:
Automatic context save/restore upon call/returnContext save areas (linked lists managed by hardware)
A TriCore Backend for LLVM (November 20, 2009) The TriCore Processor Architecture 5 – 25
The LLVM Compiler Infrastructure Overview
Open-source compiler infrastructure project started in 2000
Main sponsor: Apple Inc.
Written in C++
A TriCore Backend for LLVM (November 20, 2009) The LLVM Compiler Infrastructure 6 – 25
Basic Architecture The classical three tiers of a compiler
x86assembly
SPARCassembly
C source
Fortransource
LLVM-GCCfrontend Optimizer
x86 codegenerator
SPARC codegenerator
Clangfrontend
... ...
LLVMassembly/
bitcode
LLVMassembly/
bitcode
... ...
Language-specific frontends
Optimizer: generic IR, analysis/transformation passes
Several backends for machine code generation
A TriCore Backend for LLVM (November 20, 2009) The LLVM Compiler Infrastructure 7 – 25
Unique Characteristics What does LLVM have that others don’t?
Not merely a compiler, but a compiler infrastructure:
Static compilationJust-in-time compilation
Strictly modular, library-based architecture:
Easily extendiblePossibility to incorporate parts of LLVM in other projects
BSD-style licence
Produces highly optimized machine code in an efficient way:
Memory-efficientTime-efficient
A TriCore Backend for LLVM (November 20, 2009) The LLVM Compiler Infrastructure 8 – 25
Design and Implementation of the Backend Overview
Extensive generic code generation framework:
Makes work a lot easier... but also imposes some problems in specific cases
Fixed class hierarchy
Many target-independent algorithms:
Instruction schedulingRegister colouring...
Code generation process executed by a series of passes
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 9 – 25
Code Generation Process
DAGLowering
DAGLegalization
InstructionSelection
SchedulingSSA-BasedOptimization
RegisterAllocation
Pro-/EpilogueInsertion
PeepholeOptimization
AssemblyPrinting
ListLLVM code(SSA form)
DAGsnot legalized
DAGslegalized
DAGsnative
instructions
ListSSA form
ListSSA form
Listwith physical
registers
Listwith resolved
stackreferences
Listwith resolved
stackreferences
Textassembly
code
TriCoreTargetLowering TriCoreDAGToDAGISelTriCoreInstrInfo
TriCoreRegisterInfoTriCoreInstrInfoTriCoreAsmPrinter
TriCoreInstrInfo
TriCoreInstrInfoTriCoreLoadStoreOpt
Post-AllocationPasses
Listwith physical
registers
TriCoreVirtInstrResolver
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 10 – 25
TableGen One tool to rule them all...
Problem
Backend contains large portions of descriptive data
C++ obviously not suitable
TableGen
Language for domain-specific modelling
Similar to object-oriented approach:
Classes, records (objects), attributesInheritance
Definition files (.td) preprocessed by tblgen tool→ Auto-generation of C++ code
Used for description of:
Subtargets, registersCalling conventionsInstruction set
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 11 – 25
TableGen One tool to rule them all...
Problem
Backend contains large portions of descriptive data
C++ obviously not suitable
TableGen
Language for domain-specific modelling
Similar to object-oriented approach:
Classes, records (objects), attributesInheritance
Definition files (.td) preprocessed by tblgen tool→ Auto-generation of C++ code
Used for description of:
Subtargets, registersCalling conventionsInstruction set
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 11 – 25
SelectionDAG Construction Largely automated
Directed acyclic graph
Per basic block
Nodes: instructions
Edges:
Data dependenciesControl flow dependencies
Example
%mul = mul i32 %a, %a
%mul4 = mul i32 %b, %b
%add = add nsw i32 %mul4, %mul
ret i32 %add
↗
isel input for euclidSquare:entry
EntryToken
0xa8321e8
ch
Register %reg1024
0xa832900
i32
Register %reg1025
0xa832988
i32
Register %D2
0xa8327f0
i32
CopyFromReg
0xa832a10
i32 ch
CopyFromReg
0xa832878
i32 ch
mul
0xa8325d0
i32
mul
0xa832548
i32
add
0xa832658
i32
CopyToReg
0xa8324c0
ch flag
TriCoreISD::RET_FLAG
0xa8326e0
ch
GraphRoot
SelectionDAG Construction Largely automated
Directed acyclic graph
Per basic block
Nodes: instructions
Edges:
Data dependenciesControl flow dependencies
Example
%mul = mul i32 %a, %a
%mul4 = mul i32 %b, %b
%add = add nsw i32 %mul4, %mul
ret i32 %add
↗
isel input for euclidSquare:entry
EntryToken
0xa8321e8
ch
Register %reg1024
0xa832900
i32
Register %reg1025
0xa832988
i32
Register %D2
0xa8327f0
i32
CopyFromReg
0xa832a10
i32 ch
CopyFromReg
0xa832878
i32 ch
mul
0xa8325d0
i32
mul
0xa832548
i32
add
0xa832658
i32
CopyToReg
0xa8324c0
ch flag
TriCoreISD::RET_FLAG
0xa8326e0
ch
GraphRoot
Troubles The integer vs. pointer problem
Problem
TriCore strictly distinguishes between addresses and data integers
Have to be put into separate register files → calling conventions!
LLVM’s backend framework treats pointers just like integers...
Solution
Annotation of “pointer / no pointer” flag in value type class
Promotion of this flag throughout the DAG construction phase(required some hacks...)
Case differentiations in all relevant situations
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 13 – 25
Troubles The integer vs. pointer problem
Problem
TriCore strictly distinguishes between addresses and data integers
Have to be put into separate register files → calling conventions!
LLVM’s backend framework treats pointers just like integers...
Solution
Annotation of “pointer / no pointer” flag in value type class
Promotion of this flag throughout the DAG construction phase(required some hacks...)
Case differentiations in all relevant situations
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 13 – 25
Instruction Selection Largely auto-generated
isel input for euclidSquare:entry
EntryToken
0xa8321e8
ch
Register %reg1024
0xa832900
i32
Register %reg1025
0xa832988
i32
Register %D2
0xa8327f0
i32
CopyFromReg
0xa832a10
i32 ch
CopyFromReg
0xa832878
i32 ch
mul
0xa8325d0
i32
mul
0xa832548
i32
add
0xa832658
i32
CopyToReg
0xa8324c0
ch flag
TriCoreISD::RET_FLAG
0xa8326e0
ch
GraphRoot
Pattern matching→
def MULrr2 : Rr2Instr<0x0a,
(outs DR:$c), (ins DR:$a, DR:$b),
"mul\t$c, $a, $b",
[(set DR:$c, (mul DR:$a, DR:$b))]>;
scheduler input for euclidSquare:entry
EntryToken
0xa8321e8
ch
Register %reg1024
0xa832900
i32
Register %reg1025
0xa832988
i32
Register %D2
0xa8327f0
i32
CopyFromReg
0xa832a10
i32 ch
CopyFromReg
0xa832878
i32 ch
MULrr2
0xa832548
i32
MADDrrr2
0xa832658
i32
CopyToReg
0xa8324c0
ch flag
RETsys
0xa8326e0
ch
GraphRoot
Instruction Selection Largely auto-generated
isel input for euclidSquare:entry
EntryToken
0xa8321e8
ch
Register %reg1024
0xa832900
i32
Register %reg1025
0xa832988
i32
Register %D2
0xa8327f0
i32
CopyFromReg
0xa832a10
i32 ch
CopyFromReg
0xa832878
i32 ch
mul
0xa8325d0
i32
mul
0xa832548
i32
add
0xa832658
i32
CopyToReg
0xa8324c0
ch flag
TriCoreISD::RET_FLAG
0xa8326e0
ch
GraphRoot
Pattern matching→
def MULrr2 : Rr2Instr<0x0a,
(outs DR:$c), (ins DR:$a, DR:$b),
"mul\t$c, $a, $b",
[(set DR:$c, (mul DR:$a, DR:$b))]>;
scheduler input for euclidSquare:entry
EntryToken
0xa8321e8
ch
Register %reg1024
0xa832900
i32
Register %reg1025
0xa832988
i32
Register %D2
0xa8327f0
i32
CopyFromReg
0xa832a10
i32 ch
CopyFromReg
0xa832878
i32 ch
MULrr2
0xa832548
i32
MADDrrr2
0xa832658
i32
CopyToReg
0xa8324c0
ch flag
RETsys
0xa8326e0
ch
GraphRoot
Scheduling & Register Allocation Target-independent algorithms
Scheduling
DAGs → list (SSA form)
Target-independent algorithm using data from the instructiondescription table
Register Allocation
Virtual registers → physical registers
SSA deconstruction
Target-independent colouring algorithm using the registerinformation table
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 15 – 25
Scheduling & Register Allocation Target-independent algorithms
Scheduling
DAGs → list (SSA form)
Target-independent algorithm using data from the instructiondescription table
Register Allocation
Virtual registers → physical registers
SSA deconstruction
Target-independent colouring algorithm using the registerinformation table
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 15 – 25
Final passes Handwork
Virtual Instruction Resolution
Some instructions (e. g., moves) had operands of unknownregister classes at the time of their creation
Now that physical registers have been assigned, these instructionscan be resolved
Pre-/Epilogue Insertion
Insertion of pre-/epilogue code to entry/exits of all functions
Virtual stack slots → physical stack frame references
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 16 – 25
Final passes Handwork
Virtual Instruction Resolution
Some instructions (e. g., moves) had operands of unknownregister classes at the time of their creation
Now that physical registers have been assigned, these instructionscan be resolved
Pre-/Epilogue Insertion
Insertion of pre-/epilogue code to entry/exits of all functions
Virtual stack slots → physical stack frame references
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 16 – 25
Code Emission Partly auto-generated
Peephole Optimization
Merging of two subsequent 32-bit loads/stores into a single 64-bitload/store
# Before:st.w [%a10]4, %d9st.w [%a10]0, %d8
# After:st.d [%a10]0, %e8
Assembly Printing
Output of assembly code in text form
Large parts auto-generated from the instruction description table
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 17 – 25
Code Emission Partly auto-generated
Peephole Optimization
Merging of two subsequent 32-bit loads/stores into a single 64-bitload/store
# Before:st.w [%a10]4, %d9st.w [%a10]0, %d8
# After:st.d [%a10]0, %e8
Assembly Printing
Output of assembly code in text form
Large parts auto-generated from the instruction description table
A TriCore Backend for LLVM (November 20, 2009) Design and Implementation of the Backend 17 – 25
In practice How do I use it?
Required software
Clang compiler frontend (support has been integrated)
GNU Binutils for TriCore:
AssemblerLinker
Headers and libraries from TriCore-GCC
Small Perl wrapper script
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 18 – 25
Comparison to GCC
Criteria
1. Compilation speed
2. Code size
3. Code performance
Testing system
Benchmark application: CoreMark
Compilation PC: Core 2 Quad Q6600 (2.40 GHz)
Runtime system: TC1796 board (40 MHz)
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 19 – 25
Comparison to GCC
Criteria
1. Compilation speed
2. Code size
3. Code performance
Testing system
Benchmark application: CoreMark
Compilation PC: Core 2 Quad Q6600 (2.40 GHz)
Runtime system: TC1796 board (40 MHz)
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 19 – 25
The CoreMark Benchmark
Alternative to the well-known Dhrystone benchmark
C source code publicly available (albeit not open source software)
Easily portable
Prevents compilers from “cheating” by optimizing away unusedcomputation results
Operations:
Linked list processingMatrix manipulationState machine operationsCRC computation
Results can be validated
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 20 – 25
Compilation Time
0
50
100
150
200
250
300
350
400
450
500
-O0 -O1 -O2 -O3 -Os
Com
pila
tion
time
(in m
illise
cond
s)
Optimization level
GCCLLVM
Takes about 10 % less time than GCC
Even faster when compiling at -O0
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 21 – 25
Code Size
0
5
10
15
20
25
30
-O0 -O1 -O2 -O3 -Os
Size
of t
he te
xt s
egm
ent (
in K
iB)
Optimization level
GCCLLVM
Generates slightly smaller code
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 22 – 25
Code Performance
0
5
10
15
20
25
30
35
40
-O0 -O1 -O2 -O3 -Os
Itera
tions
per
sec
ond
Optimization level
GCCLLVM
Code 12–20 % slower than the code generated by GCC
Further work needed to become fully competitive
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 23 – 25
Summary
All of the basic functionality is there and is working reliably
Space for further optimizations and extensions
First TriCore compiler to be released under a BSD-style licence!
End goal: inclusion into LLVM’s repository
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 24 – 25
The End
Any Questions?
A TriCore Backend for LLVM (November 20, 2009) Evaluation & Conclusion 25 – 25