Introduction to Compiler Development
INTRO TO COMPILER DEVELOPMENT
LOGAN CHIEN
http://slide.logan.tw/compiler-intro/
LOGAN CHIEN
Software engineer at MediaTek
Amateur LLVM/Clang developer
Integrated LLVM/Clang into Android NDK
WHY I LOVE COMPILERS?
I have been a faithful reader of Jserv's blog for ten years.
I was inspired by the compiler and virtual machine technologies mentioned in his blog.
Since then, I have chosen compiler technologies as my research topic.
UNDERGRADUATE COMPILER COURSE
I took the undergraduate compiler course when I was a sophomore.
MY PROFESSOR SAID ...
“We spent many lectures discussing the parser. However, when people say they are doing compiler research, they are most likely not referring to parsing techniques.”
I AM HERE TO ...
Re-introduce compiler technologies,
Give a lightning talk on industrial-strength compiler design,
Explain the connection between compiler technologies and the industry.
AGENDA
Re-introduction to Compiler (30min)
Industrial-strength Compiler Design (90min)
Compiler and ICT Industry (20min)
RE-INTRODUCTION TO COMPILER
WHAT IS A COMPILER?
Compilers are tools that translate a programmer's thoughts into programs a computer can run.
ANALOGY: Translators who translate from one language to another, such as those who translate Chinese to English.
WHAT HAVE WE LEARNED IN THE UNDERGRADUATE COMPILER COURSE?
LEXER
Reads the input source code (as a sequence of bytes) and converts it into a stream of tokens.
unsigned background(unsigned foreground) {
  if ((foreground >> 16) > 0x80) {
    return 0;
  } else {
    return 0xffffff;
  }
}

unsigned background ( unsigned foreground ) { if ( ( foreground >> 16 ) > 128 ) { return 0 ; } else { return 16777215 ; } }
PARSER
Reads the tokens and builds an AST according to the syntax.
unsigned background ( unsigned foreground ) { if ( ( foreground >> 16 ) > 128 ) { return 0 ; } else { return 16777215 ; } }

(procedure background
  (args '(foreground))
  (compound-stmt
    (if-stmt
      (bin-expr GT (bin-expr RSHIFT foreground 16) 128)
      (return-stmt 0)
      (return-stmt 16777215))))
CODE GENERATOR
Generates machine code (or assembly) from the AST. In the undergraduate course, we usually just do syntax-directed translation.
(procedure background
  (args '(foreground))
  (compound-stmt
    (if-stmt
      (bin-expr GT (bin-expr RSHIFT foreground 16) 128)
      (return-stmt 0)
      (return-stmt 16777215))))

  lsr w8, w0, #16
  cmp w8, #128
  b.ls .Lelse
  mov w0, wzr
  ret
.Lelse:
  orr w0, wzr, #0xffffff
  ret
WHAT'S MISSING?
Can a person who can only lex and parse sentences translate articles well?
COMPILER REQUIREMENTS
A compiler should translate the source code precisely.
A compiler should utilize the device efficiently.
THREE RELATED FIELDS
Programming Language
Computer Architecture
Compiler
PROGRAMMING LANGUAGE THEORY
Essential components of a programming language: type theory, variable scoping, language semantics, etc.
How do people reason about and compose a program?
Create an abstraction that is understandable to humans and tractable for computers.
EXAMPLE: SUBTYPE AND MUTABLE RECORDS
Why can't you perform the following conversion in C++?
void test(int *ptr) {
  int **p = &ptr;
  const int **a = p;  // Rejected by the C++ compiler (C compilers warn)
  // ...
}
This is related to covariance and contravariance. From PLT, we know that we can have only two of (a) covariant types, (b) mutable records, and (c) type consistency.
void test(int *ptr) {
  const int c = 0;
  int **p = &ptr;
  const int **a = p;  // If it is allowed, bad programs will pass.
  *a = &c;            // Now ptr points to the const object c.
  **p = 5;            // No warning here, yet it writes to c.
}
EXAMPLE: DYNAMIC SCOPING IN BASH
#!/bin/sh
v=1  # Initialize v with 1
foo() {
  echo "foo:v=${v}"  # Which v is referred to?
  v=2                # Which v is assigned?
}
bar() {
  local v=3
  foo
  echo "bar:v=${v}"  # What will be printed?
}
v=4  # Assign 4 to v
bar
echo "v=${v}"  # What will be printed?
Ans: foo:v=3, bar:v=2, v=4. Surprisingly, foo accesses the local v in bar instead of the global v.
EXAMPLE: NON-LEXICAL SCOPING IN JAVASCRIPT
// Javascript, the bad part
function bad(v) {
  var sum;
  with (v) {
    sum = a + b;
  }
  return sum;
}
console.log(bad({a: 5, b: 10}));
console.log(bad({a: 5, b: 10, sum: 100}));
Ans: The second console.log() prints undefined.
COMPUTER ARCHITECTURE
Instruction set architecture: CISC vs. RISC.
Out-of-Order Execution vs. Instruction Scheduling.
Memory hierarchy
Memory model
QUIZ: DO YOU REALLY KNOW C?
Is it guaranteed that v will always be loaded after pred?
int pred;
int v;
int get(int a, int b) {
  int res;
  if (pred > 0) {
    res = v * a - v / b;
  } else {
    res = v * a + v / b;
  }
  return res;
}
Ans: No. Independent reads/writes can be reordered. The standard only requires that the result be the same as running from top to bottom (in a single thread).
COMPILER ANALYSIS
Data-flow analysis: Analyze value ranges, check conditions or constraints, figure out modifications to variables, etc.
Control-flow analysis: Analyze the structure of the program, such as control dependencies and loop structure.
Memory dependency analysis: Analyze the memory access patterns of array element accesses and pointer dereferences, e.g. alias analysis.
ALIAS ANALYSIS
Determine whether two pointers can refer to the same object or not.
void move(char *dst, const char *src, int n) {
  for (int i = 0; i < n; ++i) {
    dst[i] = src[i];
  }
}
int sum(int *ptr, const int *val, int n) {
  int res = 0;
  for (int i = 0; i < n; ++i) {
    res += *val;
    *ptr++ = 10;
  }
  return res;
}
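A minimal sketch of why the answer matters for sum() above: if val may point into the memory written through ptr, the compiler cannot load *val once before the loop, because the result differs. The two helper functions and the buffer contents are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/* Same function as on the slide: reads *val and writes through ptr. */
int sum(int *ptr, const int *val, int n) {
    assert(n >= 0);
    int res = 0;
    for (int i = 0; i < n; ++i) {
        res += *val;
        *ptr++ = 10;
    }
    return res;
}

/* Hypothetical helper: val points into the array that ptr overwrites,
   so *val changes in the middle of the loop (2 + 2 + 10). */
int sum_with_alias(void) {
    int buf[3] = {1, 2, 3};
    return sum(buf, &buf[1], 3);
}

/* Hypothetical helper: val points to a separate object, so hoisting
   the load of *val out of the loop would be safe (2 + 2 + 2). */
int sum_without_alias(void) {
    int buf[3] = {1, 2, 3};
    int two = 2;
    return sum(buf, &two, 3);
}
```

Only when alias analysis can prove the second situation may the load of *val be moved out of the loop.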
QUIZ: DO YOU KNOW C++?
class QMutexLocker {
public:
  union {
    QMutex *mtx_;
    uintptr_t val_;
  };
  void unlock() {
    if (val_) {
      if ((val_ & (uintptr_t)1) == (uintptr_t)1) {
        val_ &= ~(uintptr_t)1;
        mtx_->unlock();
      }
    }
  }
};
Pitfall: Reading from a union field other than the one that was written most recently results in undefined behavior. Type-Based Alias Analysis (TBAA) exploits this rule.
COMPILER OPTIMIZATION
Scalar optimization: Fold constants, remove redundancies, rewrite expressions using identities, etc.
Vector optimization: Convert several scalar operations into one vector operation, e.g. combining four add instructions into one vector add.
Interprocedural optimization: Function inlining, devirtualization, cross-function analysis, etc.
OTHER COMPILER-RELATED TECHNOLOGY
Just-in-time compilers
Binary translators
Program profiling and performance measurement
Facilities to run compiled executables, e.g. garbage collectors
INDUSTRIAL-STRENGTH COMPILER DESIGN
What's the difference between your final project and an industrial-strength compiler?
KEY DIFFERENCE
Analysis: Reasons about program structures and changes of values.
Optimization: Applies several provably correct transformations which should make the program run faster.
Intermediate Representation: Data structure on which analyses and optimizations are based.
INTERMEDIATE REPRESENTATION 1/4
A data structure for program analyses and optimizations.
High-level enough to capture important properties and encapsulate hardware limitations.
Low-level enough to be analyzed by analyses and manipulated by transformations.
An abstraction layer for multiple front-ends and back-ends.
INTERMEDIATE REPRESENTATION 2/4
[Figure: GCC Compiler Pipeline. Front-ends (C/C++, Ada, Fortran, Go) -> GENERIC -> GIMPLE -> Tree SSA -> RTL -> back-ends (arm, aarch64, mips, nds, x86)]
INTERMEDIATE REPRESENTATION 3/4
[Figure: LLVM Compiler Pipeline. Front-ends (C/C++, Swift, Rust, Obj-C) -> LLVM IR -> SelDAG -> MachInstr -> back-ends (arm, aarch64, mips, sparc, x86)]
INTERMEDIATE REPRESENTATION 4/4
GCC — GENERIC, GIMPLE, Tree SSA, and RTL.
LLVM — LLVM IR, Selection DAG, and Machine Instructions.
Java HotSpot — HIR, LIR, and MIR.
CONTROL FLOW GRAPH 1/3
Basic block: A sequence of instructions that can only be entered from the top and exited from the end.
Edge: If the basic block s may branch to t, then we add a directed edge (s, t).
Predecessor/Successor: If there is an edge (s, t), then s is a predecessor of t and t is a successor of s.
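The definitions above can be sketched as a small data structure. This is only an illustration under assumed names (struct cfg, cfg_add_edge, cfg_is_pred, cfg_two_level_loop are all hypothetical), with fixed-size arrays instead of the dynamic containers a real compiler would use. The sample graph has the shape of a two-level loop nest with five blocks.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 8
#define MAX_EDGES  4

/* Hypothetical minimal CFG: blocks are numbered 0..num_blocks-1. */
struct cfg {
    int num_blocks;
    int num_succs[MAX_BLOCKS];
    int succs[MAX_BLOCKS][MAX_EDGES];
    int num_preds[MAX_BLOCKS];
    int preds[MAX_BLOCKS][MAX_EDGES];
};

/* Adds the directed edge (s, t): t becomes a successor of s,
   and s becomes a predecessor of t. */
void cfg_add_edge(struct cfg *g, int s, int t) {
    assert(g->num_succs[s] < MAX_EDGES && g->num_preds[t] < MAX_EDGES);
    g->succs[s][g->num_succs[s]++] = t;
    g->preds[t][g->num_preds[t]++] = s;
}

/* Returns true iff s is a predecessor of t. */
bool cfg_is_pred(const struct cfg *g, int s, int t) {
    for (int i = 0; i < g->num_preds[t]; ++i)
        if (g->preds[t][i] == s) return true;
    return false;
}

/* Builds a five-block CFG (B0..B4) shaped like a two-level loop nest:
   B1 is the outer loop header, B2 the inner loop body, B3 the outer
   loop latch, and B4 the exit. */
struct cfg cfg_two_level_loop(void) {
    struct cfg g = { .num_blocks = 5 };
    cfg_add_edge(&g, 0, 1); cfg_add_edge(&g, 0, 4);
    cfg_add_edge(&g, 1, 2); cfg_add_edge(&g, 1, 3);
    cfg_add_edge(&g, 2, 2); cfg_add_edge(&g, 2, 3);
    cfg_add_edge(&g, 3, 1); cfg_add_edge(&g, 3, 4);
    return g;
}
```

Storing both successor and predecessor lists costs a little memory but lets forward and backward analyses walk the graph in either direction.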
CONTROL FLOW GRAPH 2/3
// y = a * x + b;
void matmul(double *restrict y, unsigned long m, unsigned long n,
            const double *restrict a, const double *restrict x,
            const double *restrict b) {
  for (unsigned long r = 0; r < m; ++r) {
    double t = b[r];
    for (unsigned long i = 0; i < n; ++i) {
      t += a[r * n + i] * x[i];
    }
    y[r] = t;
  }
}
Input source program
CONTROL FLOW GRAPH 3/3
B0 (ENTRY):
  r = 0
  cmp = icmp lt r, m
  br cmp, B1, B4
B1:
  bp = getelementptr b, r
  t = load bp
  i = 0
  cmp = icmp lt, i, n
  br cmp, B2, B3
B2:
  rn = mul r, n
  ai = add rn, i
  ap = getelementptr a, ai
  ad = load ap
  xp = getelementptr x, i
  xd = load xp
  ax = mul ad, xd
  t = add t, ax
  i = add i, 1
  cmp = icmp lt, i, n
  br cmp, B2, B3
B3:
  yp = getelementptr y, r
  store t, yp
  r = add r, 1
  cmp = icmp lt, r, m
  br cmp, B1, B4
B4:
  ret void
VARIABLE DEFINITION
[Figure: the CFG of matmul with the definitions of r highlighted: "r = 0" in B0 and "r = add r, 1" in B3]
The place where a variable is assigned or defined.
VARIABLE USE
[Figure: the CFG of matmul with the definitions and uses of r highlighted. r is used by the comparison in B0, by "getelementptr b, r" in B1, by "mul r, n" in B2, and by "getelementptr y, r", "add r, 1" and the comparison in B3]
The places where a variable is referred to or used.
REACHING DEFINITION 1/2
[Figure: the CFG of matmul annotated with the reaching definitions of r, where d0 is "r = 0" in B0 and d1 is "r = add r, 1" in B3. The use in B0 is reached by {d0}; the uses in B1 and B2 and the uses in B3 before d1 are reached by {d0, d1}; the comparison in B3 after d1 is reached by {d1}]
Definitions that reach a use.
REACHING DEFINITION 2/2
Constant propagation is a good example to show the usefulness of reaching definitions.
void test(int cond) {
  int a = 1;  // d0
  int b = 2;  // d1
  int c;
  if (cond) {
    c = 3;  // d2
  } else {
    // ReachDef[a] = {d0}
    // ReachDef[b] = {d1}
    c = a + b;  // d3
  }
  // ReachDef[c] = {d2, d3}
  use(c);
}
DOMINANCE RELATION 1/2
A basic block s dominates t iff every path from the entry to t passes through s.
Every basic block in a CFG except the entry has an immediate dominator, and these relations form a dominator tree.
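As a rough sketch, dominators can be computed with the classic iterative data-flow formulation Dom(b) = {b} ∪ (intersection of Dom(p) over all predecessors p of b). The five-block CFG below is a hypothetical example (a two-level loop nest), and the bit-set encoding and function names are assumptions made for brevity; production compilers typically use faster algorithms.

```c
#include <assert.h>

#define N 5  /* number of basic blocks; B0 is the entry */

/* Predecessor lists of a hypothetical CFG with edges B0->B1, B0->B4,
   B1->B2, B1->B3, B2->B2, B2->B3, B3->B1, B3->B4.
   -1 terminates each list. */
static const int preds[N][3] = {
    {-1, -1, -1},  /* B0 */
    { 0,  3, -1},  /* B1 */
    { 1,  2, -1},  /* B2 */
    { 1,  2, -1},  /* B3 */
    { 0,  3, -1},  /* B4 */
};

/* Computes dom[b] as a bit set: bit s is set iff Bs dominates Bb.
   Starts from "everything dominates everything" and shrinks the sets
   until nothing changes. */
void compute_dominators(unsigned dom[N]) {
    const unsigned all = (1u << N) - 1;
    dom[0] = 1u << 0;
    for (int b = 1; b < N; ++b) dom[b] = all;
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int b = 1; b < N; ++b) {
            unsigned meet = all;
            for (int i = 0; i < 3 && preds[b][i] >= 0; ++i)
                meet &= dom[preds[b][i]];
            unsigned next = meet | (1u << b);
            if (next != dom[b]) { dom[b] = next; changed = 1; }
        }
    }
}

/* Returns nonzero iff Bs dominates Bt. */
int dominates(const unsigned dom[N], int s, int t) {
    assert(0 <= s && s < N && 0 <= t && t < N);
    return (dom[t] >> s) & 1u;
}
```

For this CFG, B1 dominates B2 and B3, while B4 is dominated only by B0 and itself, because B4 can be reached directly from B0.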
DOMINANCE RELATION 2/2
[Figure: the CFG with basic blocks B0 to B4 and its dominator tree: B0 is the root with children B1 and B4; B1 has children B2 and B3]
DOMINANCE FRONTIER
A basic block t is in the dominance frontier of a basic block s if s dominates a predecessor of t but does not strictly dominate t.
[Figure: a CFG with basic blocks B0 to B6]
DF[B2] = {B3, B5}
DF[B4] = {B4}
DF[B1] = {B3, B5}
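One common way to compute dominance frontiers, sketched below, walks up the dominator tree from each predecessor of a block t until it reaches the immediate dominator of t. The CFG here is a hypothetical five-block loop nest (not the seven-block graph in the figure above), and its immediate dominators are supplied as a precomputed table; all names are assumptions for illustration.

```c
#include <assert.h>

#define N 5  /* number of basic blocks; B0 is the entry */

/* Predecessor lists of a hypothetical CFG with edges B0->B1, B0->B4,
   B1->B2, B1->B3, B2->B2, B2->B3, B3->B1, B3->B4.
   -1 terminates each list. */
static const int preds[N][3] = {
    {-1, -1, -1},  /* B0 */
    { 0,  3, -1},  /* B1 */
    { 1,  2, -1},  /* B2 */
    { 1,  2, -1},  /* B3 */
    { 0,  3, -1},  /* B4 */
};

/* Immediate dominators of that CFG; the entry maps to itself. */
static const int idom[N] = {0, 0, 1, 1, 0};

/* Fills df[s] with a bit set: bit t is set iff Bt is in DF[Bs]. */
void compute_dominance_frontiers(unsigned df[N]) {
    for (int b = 0; b < N; ++b) df[b] = 0;
    for (int t = 0; t < N; ++t) {
        for (int i = 0; i < 3 && preds[t][i] >= 0; ++i) {
            /* Every block on the dominator-tree path from the
               predecessor up to (but excluding) idom[t] dominates a
               predecessor of t without strictly dominating t. */
            int runner = preds[t][i];
            while (runner != idom[t]) {
                assert(0 <= runner && runner < N);
                df[runner] |= 1u << t;
                runner = idom[runner];
            }
        }
    }
}
```

For this graph the sketch yields DF[B1] = {B1, B4}, DF[B2] = {B2, B3} and DF[B3] = {B1, B4}; loop headers appear in their own frontier because of the back edges.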
STATIC SINGLE-ASSIGNMENT FORM
static: A static analysis of the program (not of its execution).
single-assignment: Every variable can only be assigned once.
SSA form is currently the most popular intermediate representation.
It is adopted by a wide range of compilers, such as GCC, LLVM, Java HotSpot, Android ART, etc.
SSA PROPERTIES
Every variable can only be defined once.
Every use can only refer to one definition.
Use phi functions to handle merged control flow.
PHI FUNCTIONS
define void @foo(i1 cond, i32 a, i32 b) {
ent:
  br cond, b1, b2
b1:
  t0 = mul a, 4
  br b3
b2:
  t1 = mul b, 5
  br b3
b3:
  t2 = phi (t0), (t1)
  use(t2)
  ret
}

define void @foo(i32 n) {
ent:
  br loop
loop:
  i0 = phi (0), (i1)
  cmp = icmp ge i0, n
  br cmp, end, cont
cont:
  use(i0)
  i1 = add i0, 1
  br loop
end:
  ret
}
ADVANTAGE OF SSA FORM
Compact: Reduces the def-use chains.
Referential transparency: The properties associated with a variable never change, i.e. they are context-free.
REDUCED DEF-USE CHAIN
void foo(int cond1, int cond2, int a, int b) {
  int t;
  if (cond1) {
    t = a * 4;  // d0
  } else {
    t = b * 5;  // d1
  }
  if (cond2) {
    // reach-def: {d0, d1}
    use(t);
  } else {
    // reach-def: {d0, d1}
    use(t);
  }
}

void foo(int cond1, int cond2, int a, int b) {
  if (cond1) {
    t.0 = a * 4;
  } else {
    t.1 = b * 5;
  }
  t.2 = phi(t.0, t.1);
  if (cond2) {
    use(t.2);
  } else {
    use(t.2);
  }
}
REFERENTIAL TRANSPARENCY 1/2
void foo() {
  int r = 5;  // d0
  // ... SOME CODE ...
  // We can only assume
  // "r == 5" if d0 is the
  // only reaching definition.
  use(r);
}

void foo() {
  r.0 = 5;
  // ... SOME CODE ...
  // No matter what code is
  // skipped above, it is safe
  // to replace the following r.0
  // with 5.
  use(r.0);
}
REFERENTIAL TRANSPARENCY 2/2
void foo(int a, int b) {
  int c = a + b;
  int r = a + b;
  // Can we simply replace
  // all occurrences of r with
  // c? (NO)
  // ... SOME CODE ...
  use(r);
}

void foo(int a, int b) {
  c.0 = a + b;
  r.0 = a + b;
  // ... SOME CODE ...
  // No matter what code is
  // skipped above, it is safe
  // to replace the following r.0
  // with c.0.
  use(r.0);
}
BUILDING SSA FORM
1. Compute dominance relationships between basic blocks and build the dominator tree.
2. Compute dominance frontiers.
3. Insert phi functions at dominance frontiers.
4. Traverse the dominator tree and rename the variables.
BEFORE SSA CONSTRUCTION
B0 (ENTRY):
  r = 0
  cmp = icmp lt r, m
  br cmp, B1, B4
B1:
  bp = getelementptr b, r
  t = load bp
  i = 0
  cmp = icmp lt, i, n
  br cmp, B2, B3
B2:
  rn = mul r, n
  ai = add rn, i
  ap = getelementptr a, ai
  ad = load ap
  xp = getelementptr x, i
  xd = load xp
  ax = mul ad, xd
  t = add t, ax
  i = add i, 1
  cmp = icmp lt, i, n
  br cmp, B2, B3
B3:
  yp = getelementptr y, r
  store t, yp
  r = add r, 1
  cmp = icmp lt, r, m
  br cmp, B1, B4
B4:
  ret void
AFTER SSA CONSTRUCTION
B0 (ENTRY):
  cmp = icmp lt 0, m
  br cmp, B1, B4
B1:
  r.0 = phi (0, B0) (r.1, B3)
  bp = getelementptr b, r.0
  t.0 = load bp
  cmp = icmp lt, 0, n
  br cmp, B2, B3
B2:
  t.1 = phi (t.0, B1) (t.2, B2)
  i.0 = phi (0, B1) (i.1, B2)
  rn = mul r.0, n
  ai = add rn, i.0
  ap = getelementptr a, ai
  ad = load ap
  xp = getelementptr x, i.0
  xd = load xp
  ax = mul ad, xd
  t.2 = add t.1, ax
  i.1 = add i.0, 1
  cmp = icmp lt, i.1, n
  br cmp, B2, B3
B3:
  t.3 = phi (t.0, B1) (t.2, B2)
  yp = getelementptr y, r.0
  store t.3, yp
  r.1 = add r.0, 1
  cmp = icmp lt, r.1, m
  br cmp, B1, B4
B4:
  ret void
OPTIMIZATIONS
CONSTANT PROPAGATION 1/3
Constant propagation, usually performed together with constant folding, evaluates instructions with constant operands and propagates the constant results.
a = add 2, 3
b = a
c = mul a, b

a = 5
b = 5
c = 25
CONSTANT PROPAGATION 2/3
Why do we need constant propagation?
struct queue *create_queue() {
  return (struct queue *)malloc(sizeof(struct queue *) * 16);
}
int process_data(int a, int b, int c) {
  int k[] = {0x13, 0x17, 0x19};
  if (DEBUG) {
    verify_data(k, data);
  }
  return (a * k[1] * k[2] + b * k[0] * k[2] + c * k[0] * k[1]);
}
CONSTANT PROPAGATION 3/3
For each basic block b in the CFG, in reverse postorder:
For each instruction i in basic block b, from top to bottom:
If all of its operands are constants, the operation has no side effects, and we know how to evaluate it at compile time, then evaluate the result, remove the instruction, and replace all uses with the result.
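The inner loop of the procedure above can be sketched for a single basic block as follows. The instruction encoding (struct inst, the OP_* opcodes, NUM_REGS) is a hypothetical toy IR invented for this example, and instead of deleting folded instructions the sketch rewrites them into constant definitions; overflow and side effects are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum opcode { OP_CONST, OP_COPY, OP_ADD, OP_MUL };

/* Hypothetical three-address instruction: dst = op lhs, rhs.
   Operands are virtual register numbers, except for OP_CONST,
   where lhs holds the literal value. */
struct inst {
    enum opcode op;
    int dst, lhs, rhs;
};

#define NUM_REGS 8

/* Rewrites every instruction whose operands are known constants into
   an OP_CONST. Returns the number of instructions that were folded. */
int propagate_constants(struct inst *code, int n) {
    bool known[NUM_REGS] = { false };
    int value[NUM_REGS] = { 0 };
    int folded = 0;
    for (int i = 0; i < n; ++i) {
        struct inst *in = &code[i];
        assert(in->dst >= 0 && in->dst < NUM_REGS);
        int result;
        switch (in->op) {
        case OP_CONST:
            known[in->dst] = true;
            value[in->dst] = in->lhs;
            continue;
        case OP_COPY:
            if (!known[in->lhs]) { known[in->dst] = false; continue; }
            result = value[in->lhs];
            break;
        default:  /* OP_ADD or OP_MUL */
            if (!known[in->lhs] || !known[in->rhs]) {
                known[in->dst] = false;
                continue;
            }
            result = in->op == OP_ADD ? value[in->lhs] + value[in->rhs]
                                      : value[in->lhs] * value[in->rhs];
            break;
        }
        *in = (struct inst){ OP_CONST, in->dst, result, 0 };
        known[in->dst] = true;
        value[in->dst] = result;
        ++folded;
    }
    return folded;
}
```

Encoding the slide's example as r0 = 2, r1 = 3, r2 = add r0, r1, r3 = r2, r4 = mul r2, r3 turns the last three instructions into the constants 5, 5 and 25.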
GLOBAL VALUE NUMBERING 1/2
Global value numbering assigns numbers to computed expressions and maps each newly visited expression to an equivalent one that has already been visited.
a = c * d;  // [(*, c, d) -> a]
e = c;      // [(*, c, d) -> a]
f = e * d;  // query: is (*, c, d) available?
use(a, e, f);

a = c * d;
use(a, c, a);
GLOBAL VALUE NUMBERING 2/2
Traverse the basic blocks in depth-first order on the dominator tree.
Maintain a stack of hash tables. Once we return from a child node on the dominator tree, pop the top of the stack.
Visit the instructions of the form t = op a, b in the basic block and compute the hash for (op, a, b).
If (op, a, b) is already in the hash table, then replace t = op a, b with t = hash_tab[(op, a, b)].
Otherwise, insert (op, a, b) -> t into the hash table.
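A minimal sketch of the table lookup at the heart of value numbering, restricted to one basic block. For brevity it uses a linear array where the procedure above uses a hash table, and every name (struct expr, struct value_table, value_number) is hypothetical. Values are identified by small integers, assuming the input is already in SSA form.

```c
#include <assert.h>

enum { MAX_EXPRS = 16 };

/* Hypothetical instruction of the form t = op a, b, where t, a and b
   are SSA value ids and op is a character such as '*' or '+'. */
struct expr { char op; int a, b, t; };

struct value_table {
    struct expr seen[MAX_EXPRS];
    int count;
};

/* Returns the value that already holds (op, a, b), or records t as
   its holder and returns t when the expression has not been seen. */
int value_number(struct value_table *tab, char op, int a, int b, int t) {
    for (int i = 0; i < tab->count; ++i) {
        const struct expr *e = &tab->seen[i];
        if (e->op == op && e->a == a && e->b == b)
            return e->t;
    }
    assert(tab->count < MAX_EXPRS);
    tab->seen[tab->count++] = (struct expr){ op, a, b, t };
    return t;
}
```

In this array-based variant, saving count before visiting a dominator-tree child and restoring it afterwards would play the role of pushing and popping the stack of tables.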
DEAD CODE ELIMINATION 1/6
Dead code elimination (DCE) removes unreachable instructions and instructions whose results are ignored.
Constant propagation might reveal more dead code since branch conditions become constant values.
On the other hand, DCE can expose more constants for constant propagation because several definitions are removed from the program.
DEAD CODE ELIMINATION 2/6
Conditional statements with constant condition
if (kIsDebugBuild) {            // Dead code
  check_invariant(a, b, c, d);  // Dead code
}
DEAD CODE ELIMINATION 3/6
Platform-specific constant
void hash_ent_set_val(struct hash_ent *h, int v) {
  if (sizeof(int) <= sizeof(void *)) {
    h->p = (void *)(uintptr_t)(v);
  } else {
    h->p = malloc(sizeof(int));  // Dead code
    *(int *)(h->p) = v;          // Dead code
  }
}
DEAD CODE ELIMINATION 4/6
Computed result ignored
int compute_sum(int a, int b) {
  int sum = (a + b) * (a - b + 1) / 2;  // Dead code: Not used
  return 0;
}
Dead code after dead store elimination (DSE)
void test(int *p, bool cond, int a, int b, int c) {
  if (cond) {
    int t = a + b;  // Dead code
    *p = t;         // Will be removed by DSE
  }
  *p = c;
}
DEAD CODE ELIMINATION 5/6
Dead code after code specialization
void matmul(double *restrict y, unsigned long m, unsigned long n,
            const double *restrict a, const double *restrict x,
            const double *restrict b) {
  if (n == 0) {
    for (unsigned long r = 0; r < m; ++r) {
      double t = b[r];
      for (unsigned long i = 0; i < n; ++i) {  // Dead code
        t += a[r * n + i] * x[i];              // Dead code
      }                                        // Dead code
      y[r] = t;
    }
  } else {
    // ... skipped ...
  }
}
DEAD CODE ELIMINATION 6/6
Traverse the CFG starting from the entry in reverse postorder.
Only traverse the successors that may be visited, i.e. if the branch condition is a constant, then ignore the other side.
While traversing a basic block, remove the dead instructions within it.
After the traversal, remove the unvisited basic blocks and remove the variable uses that refer to variables defined in the unvisited basic blocks.
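The block-level part of this procedure can be sketched as a reachability walk that ignores the untaken side of constant branches. The block representation below (struct block, enum cond) is a hypothetical simplification, and the sketch uses a plain depth-first walk, which is enough for marking reachable blocks even though the slide traverses in reverse postorder to clean instructions along the way.

```c
#include <assert.h>
#include <stdbool.h>

#define N 4  /* number of basic blocks; B0 is the entry */

enum cond { COND_UNKNOWN, COND_TRUE, COND_FALSE };

/* Hypothetical basic block terminator: a two-way branch. An
   unconditional branch is modelled as COND_TRUE; -1 means no target. */
struct block {
    enum cond cond;
    int if_true, if_false;
};

/* Marks the blocks reachable from b, following only the taken side of
   branches whose condition is a known constant. */
void mark_reachable(const struct block *blocks, int b, bool visited[N]) {
    if (b < 0) return;
    assert(b < N);
    if (visited[b]) return;
    visited[b] = true;
    if (blocks[b].cond != COND_FALSE)
        mark_reachable(blocks, blocks[b].if_true, visited);
    if (blocks[b].cond != COND_TRUE)
        mark_reachable(blocks, blocks[b].if_false, visited);
}
```

Blocks that remain unvisited after the walk, such as the body of an if statement whose condition folded to false, are the ones to delete.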
LOOP OPTIMIZATIONS
It is reasonable to assume that a program spends most of its time in loop bodies. Thus, loop optimization is an important issue in compilers.
LOOP INVARIANT CODE MOTION 1/3
Loop invariant code motion (LICM) is an optimization which moves loop invariants or loop constants out of the loop.
int test(int n, int a, int b, int c) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += i * a * b * c;  // a * b * c is loop invariant
  }
  return sum;
}
LOOP INVARIANT CODE MOTION 2/3
for (unsigned long r = 0; r < m; ++r) {
  double t = b[r];
  for (unsigned long i = 0; i < n; ++i) {
    t += a[(r * n) + i] * x[i];  // "r * n" is an inner-loop invariant
  }
  y[r] = t;
}

for (unsigned long r = 0; r < m; ++r) {
  double t = b[r];
  unsigned long k = r * n;  // "r * n" moved out of the inner loop
  for (unsigned long i = 0; i < n; ++i) {
    t += a[k + i] * x[i];
  }
  y[r] = t;
}
LOOP INVARIANT CODE MOTION 3/3
How do we know whether a variable is a loop invariant?
If the computation of a value does not (transitively) depend on anything in the following blacklist, we can assume the value is a loop invariant:
Phi instructions associated with the loop being considered
Instructions with side effects or non-pure instructions, e.g. function calls
INDUCTION VARIABLE
If x represents the trip count (iteration count), then we call values of the form a*x+b induction variables.
for (unsigned long r = 0; r < m; ++r) {
  double t = b[r];
  unsigned long k = r * n;  // outer loop IV
  for (unsigned long i = 0; i < n; ++i) {
    unsigned long m = k + i;  // inner loop IV
    t += a[m] * x[i];
  }
  y[r] = t;
}
k and r are induction variables of the outer loop.
i and m are induction variables of the inner loop.
LOOP STRENGTH REDUCTION 1/2
Strength reduction is an optimization that replaces the multiplication in induction variables with an addition.
for (unsigned long r = 0; r < m; ++r) {
  double t = b[r];
  unsigned long k = r * n;  // outer loop IV
  for (unsigned long i = 0; i < n; ++i) {
    unsigned long m = k + i;  // inner loop IV
    t += a[m] * x[i];
  }
  y[r] = t;
}

for (unsigned long r = 0, k = 0; r < m; ++r, k += n) {  // k
  double t = b[r];
  for (unsigned long i = 0, m = k; i < n; ++i, ++m) {  // m
    t += a[m] * x[i];
  }
  y[r] = t;
}
LOOP STRENGTH REDUCTION 2/2
We can even rewrite the range to eliminate the index computation.
int sum(const int *a, int n) {
  int res = 0;
  for (int i = 0; i < n; ++i) {
    res += a[i];  // implicit *(a + sizeof(int) * i)
  }
  return res;
}

int sum(const int *a, int n) {
  int res = 0;
  const int *end = a + n;  // implicit a + sizeof(int) * n
  for (const int *p = a; p != end; ++p) {  // range updated
    res += *p;
  }
  return res;
}
LOOP UNROLLING 1/2
Unroll the loop body multiple times.
Purpose: Reduce amortized loop iteration overhead.
Purpose: Reduce load/store stalls and exploit instruction-level parallelism, e.g. software pipelining.
Purpose: Prepare for vectorization, e.g. SIMD.
LOOP UNROLLING 2/2
for (unsigned long r = 0, k = 0; r < m; ++r, k += n) {
  double t = b[r];
  switch (n & 0x3) {  // Duff's device
  case 3: t += a[k + 2] * x[2];
  case 2: t += a[k + 1] * x[1];
  case 1: t += a[k] * x[0];
  }
  for (unsigned long i = n & 0x3; i < n; i += 4) {
    t += a[k + i] * x[i];
    t += a[k + i + 1] * x[i + 1];
    t += a[k + i + 2] * x[i + 2];
    t += a[k + i + 3] * x[i + 3];
  }
  y[r] = t;
}
INSTRUCTION SELECTION 1/5
Instruction selection is the process of mapping IR into machine instructions.
Compiler back-ends perform pattern matching to select the best instructions (according to heuristics).
INSTRUCTION SELECTION 2/5
Complex operations, e.g. shift-and-add or multiply-accumulate.
Array load/store instructions, which can be translated to one shift-add-and-load on some architectures.
IR instructions which are not natively supported by the target machine.
INSTRUCTION SELECTION 3/5
define i64 @mac(i64 %a, i64 %b, i64 %c) {
ent:
  %0 = mul i64 %a, %b
  %1 = add i64 %0, %c
  ret i64 %1
}

mac:
  madd x0, x1, x0, x2
  ret
INSTRUCTION SELECTION 4/5
define i64 @load_shift(i64* %a, i64 %i) {
ent:
  %0 = getelementptr i64, i64* %a, i64 %i
  %1 = load i64, i64* %0
  ret i64 %1
}

load_shift:
  ldr x0, [x0, x1, lsl #3]
  ret
INSTRUCTION SELECTION 5/5
%struct.A = type { i64*, [16 x i64] }
define i64 @get_val(%struct.A* %p, i64 %i, i64 %j) {
ent:
  %0 = getelementptr %struct.A, %struct.A* %p, i64 %i, i32 1, i64 %j
  %1 = load i64, i64* %0
  ret i64 %1
}

get_val:
  movz w8, #0x88
  madd x8, x1, x8, x0
  add x8, x8, x2, lsl #3
  ldr x0, [x8, #8]
  ret
SSA DESTRUCTION 1/5
How do we deal with the phi functions?
Replace each phi function with copy assignments placed at the end of its predecessors.
(Must be done carefully)
SSA DESTRUCTION 2/5
B0:
  cmp = icmp eq, cond, 0
  br cmp, B1, B2
B1:
  br B3
B2:
  br B3
B3:
  a = phi (3, B1) (7, B2)
  print(a)

converts to

B0:
  cmp = icmp eq, cond, 0
  br cmp, B1, B2
B1:
  a = 3
  br B3
B2:
  a = 7
  br B3
B3:
  print(a)
Example to show the assignment copying
SSA DESTRUCTION 3/5
B0:
  br B1
B1:
  i.0 = phi (0, B0) (i.1, B2)
  cmp = icmp lt, i.0, 100
  br cmp, B2, B3
B2:
  print(i.0)
  i.1 = add i.0, 1
  br B1
B3:
  ret

Converts to

B0:
  i.0 = 0
  br B1
B1:
  cmp = icmp lt, i.0, 100
  br cmp, B2, B3
B2:
  print(i.0)
  i.1 = add i.0, 1
  i.0 = i.1
  br B1
B3:
  ret
Example to show the assignment copying with loop
SSA DESTRUCTION 4/5
B0:
  br B1
B1:
  i.0 = phi (0, B0) (i.1, B2)
  cmp = icmp lt, i.0, 100
  br cmp, B2, B3
B2:
  print(i.0)
  i.1 = add i.0, 1
  br B1
B3:
  print(i.0)
  ret

Doesn't work!

B0:
  i.0 = 0
  br B1
B1:
  cmp = icmp lt, i.0, 100
  br cmp, B2, B3
B2:
  print(i.0)
  i.1 = add i.0, 1
  i.0 = i.1
  br B1
B3:
  print(i.0)
  ret
Lost copy problem: The naive de-SSA algorithm doesn't work due to live range conflicts. Extra assignments and renaming are needed to avoid the conflicts.
SSA DESTRUCTION 5/5
Original program:
B0:
  x = 1
  y = 2
  br B1
B1:
  t = x
  x = y
  y = t
  br cmp, B2, B1
B2:
  print(x)
  print(y)

SSA form:
B0:
  x0 = 1
  y0 = 2
  br B1
B1:
  x1 = phi (x0, y1)
  y1 = phi (y0, x1)
  br cmp, B2, B1
B2:
  print(x1)
  print(y1)

After naive phi elimination:
B0:
  x0 = 1
  y0 = 2
  x1 = x0
  y1 = y0
  br B1
B1:
  x1 = y1
  y1 = x1
  br cmp, B2, B1
B2:
  print(x1)
  print(y1)
Swap problem: Conflicts due to the parallel assignment semantics of phi instructions. A correct algorithm should detect the cycle and implement the parallel assignment with swap instructions.
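The difference between the two lowerings can be reproduced in a few lines of C. This is only an illustration of the semantics, not a de-SSA algorithm: both function names are hypothetical, and the temporary stands in for the swap instruction (or extra register) a real back-end would use to break the copy cycle.

```c
#include <assert.h>

/* Naive lowering of x1 = phi(.., y1) and y1 = phi(.., x1) on the loop
   back edge: the copies run one after another, so the second copy
   reads the value that the first copy has just overwritten. */
void naive_copies(int *x1, int *y1) {
    assert(x1 != y1);
    *x1 = *y1;
    *y1 = *x1;
}

/* Parallel-copy semantics: every source is read before any
   destination is written, here with one temporary to break the cycle. */
void parallel_copies(int *x1, int *y1) {
    assert(x1 != y1);
    int t = *x1;
    *x1 = *y1;
    *y1 = t;
}
```

Starting from x1 = 1 and y1 = 2, the naive version ends with both variables equal to 2, while the parallel version swaps them as the original program intended.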
REGISTER ALLOCATION 1/2
We have to replace infinite virtual registers with finite machine registers.
Register Allocation: Make sure that the maximum number of simultaneously live registers is at most k.
Register Assignment: Assign a register to each variable, given that registers are always sufficient.
The classical solution is to compute the lifetime of each variable, build the interference graph, and spill variables to the stack if the interference graph is not k-colorable.
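As a rough sketch of the coloring step, the code below assigns registers greedily in a fixed variable order. This is a simplification chosen for brevity, not the classical simplify-and-spill algorithm: greedy coloring can fail on graphs that are in fact k-colorable, and the sketch only reports the failure instead of choosing a spill candidate. All names and the fixed NUM_VARS are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_VARS 8  /* e.g. the variables a, b, c, d, e, f, g, y */

/* Hypothetical helper: records that variables u and v are live at the
   same time and therefore cannot share a register. */
void add_interference(bool interferes[NUM_VARS][NUM_VARS], int u, int v) {
    interferes[u][v] = interferes[v][u] = true;
}

/* Gives each variable the lowest register number not used by an
   already colored neighbour. Returns the number of registers used, or
   -1 if some variable would need more than k registers and would have
   to be spilled. */
int color_greedy(bool interferes[NUM_VARS][NUM_VARS], int k,
                 int reg[NUM_VARS]) {
    assert(k > 0 && k <= 32);
    int used = 0;
    for (int v = 0; v < NUM_VARS; ++v) {
        unsigned taken = 0;  /* bit r set iff a neighbour holds register r */
        for (int u = 0; u < v; ++u)
            if (interferes[v][u]) taken |= 1u << reg[u];
        int r = 0;
        while (r < k && ((taken >> r) & 1u)) ++r;
        if (r == k) return -1;
        reg[v] = r;
        if (r + 1 > used) used = r + 1;
    }
    return used;
}
```

If four variables all interfere with each other, the function succeeds with k = 4 and returns -1 with k = 3, which is the point where a real allocator would insert spill code and retry.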
REGISTER ALLOCATION 2/2
mov a, #1
mov b, #1
mov c, #2
add d, a, b
add e, a, c
add f, d, e
add g, a, f
add y, c, g
ret y
[Figure: live ranges of a, b, c, d, e, f, g, y and the resulting interference graph]
INSTRUCTION SCHEDULING 1/2
Reorder the instructions according to the number of cycles they take in order to reduce the execution time of the critical path.
Constraints: (a) Data dependence, (b) Functional units, (c) Issue width, (d) Datapath forwarding, (e) Instruction cycle time, etc.
INSTRUCTION SCHEDULING 2/2
0: add r0, r0, r1
1: add r1, r2, r2
2: ldr r4, [r3]
3: sub r4, r4, r0
4: str r4, [r1]

2: ldr r4, [r3]  # load instruction needs more cycles
0: add r0, r0, r1
1: add r1, r2, r2
3: sub r4, r4, r0
4: str r4, [r1]
COMPILER AND ICT INDUSTRY
WHERE CAN WE FIND COMPILERS?
SMART PHONES
Android ART (Java VM)
OpenGL|ES GLSL Compiler (Graphics)
BROWSERS
Javascript-related
Javascript Engine, e.g. V8, IonMonkey, JavaScriptCore, Chakra
Regular Expression Engine
WebAssembly
WebGL: OpenGL binding for the Web (Graphics)
WebCL: OpenCL binding for the Web (Computing)
DESKTOP APPLICATION RUN-TIME ENVIRONMENTS
Java run-time (Java, Scala, Clojure, Groovy)
.NET run-time (C#, F#)
Ruby
Python
PHP
VIRTUAL MACHINES AND EMULATORS
Simulate different architectures (using dynamic recompilation), e.g. QEMU
Simulate special instructions that are not supported by the hypervisor.
DEVELOPMENT ENVIRONMENT
C/C++ compilers, e.g. MSVC, GCC, LLVM
Java compilers, e.g. OpenJDK
DSP compilers for digital signal processors.
VHDL compilers for IC design teams.
Profiling and/or instrumentation tools, e.g. Valgrind, Dr. Memory, Address Sanitizer, Memory Sanitizer
WHAT DOES A COMPILER TEAM DO?
Develop the compiler toolchain for in-house processors, e.g. DSP, GPU, CPU, etc.
Tune the performance of existing compilers or virtual machines.
Develop a new programming language or model (e.g. CUDA from NVIDIA).
SOME RESEARCH TOPICS ...
Memory model for concurrency: Designing a good memory model at the programming language level, which should be intuitive to humans while leaving room for compiler/architecture optimization, is still a challenging problem.
Both Java and C++ have made efforts to standardize one. However, there are still some unintuitive cases that are allowed and some desirable cases that are ruled out.
THE END
Q & A