Introduction to Compiler Development

INTRO TO COMPILERDEVELOPMENT

LOGAN CHIENhttp://slide.logan.tw/compiler-intro/

http://logan.tw/

http://slide.logan.tw/compiler-intro/

LOGAN CHIENSo�ware engineer at MediaTek

Ametuar LLVM/Clang developer

Integrated LLVM/Clang into Android NDK

WHY I LOVE COMPILERS?I have been a faithful reader of Jserv's blog for ten years.

I was inspired by the compiler and the virtual machinetechnologies mentioned in his blog.

I have decided to choose compiler technologies as myresearch topic since then.

UNDERGRADUATE COMPILERCOURSE

I took the undergraduate compiler course when I was asophomore.

MY PROFESSOR SAID ...“We took many lectures to discuss about the parser.

However, when people say they are doing compilerresearch, with large possibility, they are not referring to the

parsing technique.”

I AM HERE TO ...Re-introduce the compiler technologies,

Give a lightening talk on industrial-strength compilerdesign,

Explain the connection between compiler technologies andthe industry.

AGENDARe-introduction to Compiler (30min)

Industrial-strength Compiler Design (90min)

Compiler and ICT Industry (20min)

RE-INTRODUCTION TOCOMPILER

WHAT IS A COMPILER?Compiler are tools for programmers to translate

programmer's thought into computer runnable programs.

ANALOGY — Translators who turn from one language toanother, such as those who translate Chinese to English.

WHAT HAVE WE LEARNED INUNDERGRADUATE COMPILER

COURSE?

LEXERReads the input source code (as a sequence of bytes) and

converts them into a stream of tokens.u n s i g n e d b a c k g r o u n d ( u n s i g n e d f o r e g r o u n d ) { i f ( ( f o r e g r o u n d > > 1 6 ) > 0 x 8 0 ) { r e t u r n 0 ; } e l s e { r e t u r n 0 x f f f f f f ; } }

unsigned background ( unsigned foreground ) { if ( ( foreground >>16 ) > 128 ) { return 0 ; } else { return 16777215 ; } }

PARSERReads the tokens and build an AST according to the syntax.

unsigned background ( unsigned foreground ) { if ( ( foreground >>16 ) > 128 ) { return 0 ; } else { return 16777215 ; } }( p r o c e d u r e b a c k g r o u n d ( a r g s ' ( f o r e g r o u n d ) ) ( c o m p o u n d - s t m t ( i f - s t m t ( b i n - e x p r G E ( b i n - e x p r R S H I F T f o r e g r o u n d 1 6 ) 1 2 8 ) ( r e t u r n - s t m t 0 ) ( r e t u r n - s t m t 1 6 7 7 7 2 1 5 ) ) ) )

CODE GENERATORGenerate the machine code or (assembly) according to the

AST. In the undergraduate course, we usually simply dosyntax-directed translation.

( p r o c e d u r e b a c k g r o u n d ( a r g s ' ( f o r e g r o u n d ) ) ( c o m p o u n d - s t m t ( i f - s t m t ( b i n - e x p r G E ( b i n - e x p r R S H I F T f o r e g r o u n d 1 6 ) 1 2 8 ) ( r e t u r n - s t m t 0 ) ( r e t u r n - s t m t 1 6 7 7 7 2 1 5 ) ) ) )

l s r w 8 , w 0 , # 1 6 c m p w 8 , # 1 2 8 b . l o . L e l s e m o v w 0 , w z r r e t . L e l s e : o r r w 0 , w z r , # 0 x f f f f f f r e t

WHAT'S MISSING?Can a person who can only lex and parse sentences translate

articles well?

COMPILER REQUIREMENTSA compiler should translate the source code precisely.

A compiler should utilize the device efficiently.

THREE RELATED FIELDSProgramming Language

Computer Architecture

Compiler

PROGRAMMING LANGUAGE THEORYEssential component of a programming language: type

theory, variable scoping, language semantics, etc.

How do people reason and compose a program?

Create an abstraction that is understandable to human andtracable to computers.

EXAMPLE: SUBTYPE AND MUTABLERECORDS

Why you can't perform following conversion in C++?v o i d t e s t ( i n t * p t r ) { i n t * * p = & p t r ; c o n s t i n t * * a = p ; / / C o m p i l e r g i v e s w a r n i n g / / . . . }

This is related to covariant type and contravarience type. With PLT, we know that we can only choosetwo of (a) covariant type, (b) mutable records, and (c) type consistency.

v o i d t e s t ( i n t * p t r ) { c o n s t i n t c = 0 ; i n t * * p = & p t r ; c o n s t i n t * * a = p ; / / I f i t i s a l l o w e d , b a d p r o g r a m s w i l l p a s s . * a = & c ; * p = 5 ; / / N o w a r n i n g h e r e .}

EXAMPLE: DYNAMIC SCOPING INBASH

# ! / b i n / s h v = 1 # I n i t i a l i z e v w i t h 1 f o o ( ) { e c h o " f o o : v = $ { v } " # W h i c h v i s r e f e r r e d ? v = 2 # W h i c h v i s a s s i g n e d ? } b a r ( ) { l o c a l v = 3 f o o e c h o " b a r : v = $ { v } " # W h a t w i l l b e p r i n t e d ? } v = 4 # A s s i g n 4 t o v b a r e c h o " v = $ { v } " # W h a t w i l l b e p r i n t e d ?

Ans: foo:v=3 , bar:v=2 , v=4 . Surprisingly, foo isaccessing local v in bar instead of the global v.

EXAMPLE: NON-LEXICAL SCOPING INJAVASCRIPT

/ / J a v a s c r i p t , t h e b a d p a r t f u n c t i o n b a d ( v ) { v a r s u m ; w i t h ( v ) { s u m = a + b ; } r e t u r n s u m ;}

c o n s o l e . l o g ( b a d ( { a : 5 , b : 1 0 } ) ) ; c o n s o l e . l o g ( b a d ( { a : 5 , b : 1 0 , s u m : 1 0 0 } ) ) ;

Ans: The second console.log() prints undefined.

COMPUTER ARCHITECTUREInstruction set architecture: CISC vs. RISC.

Out-of-Order Execution vs. Instruction Scheduling.

Memory hierarchy

Memory model

QUIZ: DO YOU REALLY KNOW C?Is it guaranteed that v will always be loaded a�er pred?i n t p r e d ; i n t v ;

i n t g e t ( i n t a , i n t b ) { i n t r e s ; i f ( p r e d > 0 ) { r e s = v * a - v / b ; } e l s e { r e s = v * a + v / b ; } r e t u r n r e s ;}

Ans: No. Independent reads/writes can be reordered. The standard only requires the result shouldbe the same as running from top to bottom (in a single thread.)

COMPILER ANALYSISData-flow analysis — Analyze value ranges, check theconditions or contraints, figure out modifications tovariables, etc.

Control-flow analysis — Analyze the structure of theprogram, such as control dependency and loop structure.

Memory dependency analysis — Analyze the memoryaccess pattern of the access to array elements or pointerdereferences, e.g. alias analysis.

ALIAS ANALYSISDetermine whether two pointers can refers to the same

object or not.v o i d m o v e ( c h a r * d s t , c o n s t c h a r * s r c , i n t n ) { f o r ( i n t i = 0 ; i < n ; + + i ) { d s t [ i ] = s r c [ i ] ; } }

i n t s u m ( i n t * p t r , c o n s t i n t * v a l , i n t n ) { i n t r e s = 0 ; f o r ( i n t i = 0 ; i < n ; + + i ) { r e s + = * v a l ; * p t r + + = 1 0 ; } r e t u r n r e s ;}

QUIZ: DO YOU KNOW C++?c l a s s Q M u t e x L o c k e r { p u b l i c : u n i o n { Q M u t e x * m t x _ ; u i n t p t r _ t v a l _ ; } ; v o i d u n l o c k ( ) { i f ( v a l _ ) { i f ( ( v a l _ & ( u i n t p t r _ t ) 1 ) = = ( u i n t p t r _ t ) 1 ) { v a l _ & = ~ ( u i n t p t r _ t ) 1 ; m t x _ - > u n l o c k ( ) ; } } } } ;

Pitfall: Reading from union fields that were not writtenpreviously results in undefined behavior. Type-Based Alias

Analysis (TBAA) exploits this rule.

COMPILER OPTIMIZATIONScalar optimization — Fold the constants, remove theredundancies, change expressions with identities, etc.

Vector optimization — Convert several scalar operationsinto one vector operation, e.g. combining for add instructioninto one vector add.

Interprocedural optimization — function inlining,devirtualization, cross-function analysis, etc.

OTHER COMPILER-RELATEDTECHNOLOGYJust-in-time compilers

Binary translators

Program profiling and performance measurement

Facilities to run compiled executables, e.g. garbagecollectors

INDUSTRIAL-STRENGTH COMPILER

DESIGN

What's the difference between yourfinal project and the industrial-

strength compiler?

KEY DIFFERENCEAnalysis — Reasons program structures and changes ofvalues.

Optimization — Applies several provably correcttransformation which should make program run faster.

Intermediate Representation — Data structure on whichanalyses and optimizations are based.

INTERMEDIATE REPRESENTATION1/4

A data structure for program analyses and optimizations.

High-level enough to capture important properties andencapsulates hardware limitation.

Low-level enough to be analyzed by analyses andmanipulated by transformations.

An abstraction layer for multiple front-ends and back-ends.


C/C++

Ada

Fortran

Go

arm

aarch64

mips

nds

x86

GENERIC RTLGIMPLE

Tree SSA

GCC Compiler Pipeline


C/C++

Swift

Rust

Obj‑C

arm

aarch64

mips

sparc

x86

MachInstrSelDAGLLVM IR

LLVM Compiler Pipeline


GCC — GENERIC, GIMPLE, Tree SSA, and RTL.

LLVM — LLVM IR, Selection DAG, and Machine Instructions.

Java HotSpot — HIR, LIR, and MIR.

CONTROL FLOW GRAPH1/3

Basic block — A sequence of instructions that will only beentered from the top and exited from the end.

Edge — If the basic block s may branch to t, then we add adirected edge (s, t).

Predecessor/Successor — If there is an edge (s, t), then s is apredecessor of t and t is a successor of s.


/ / y = a * x + b ; v o i d m a t m u l ( d o u b l e * r e s t r i c t y , u n s i g n e d l o n g m , u n s i g n e d l o n g n , c o n s t d o u b l e * r e s t r i c t a , c o n s t d o u b l e * r e s t r i c t x , c o n s t d o u b l e * r e s t r i c t b ) { f o r ( u n s i g n e d l o n g r = 0 ; r < m ; + + r ) { d o u b l e t = b [ r ] ; f o r ( u n s i g n e d l o n g i = 0 ; i < n ; + + i ) { t + = a [ r * n + i ] * x [ i ] ; } y [ r ] = t ; } }

Input source program


bp = getelementptr b, rt = load bpi = 0cmp = icmp lt, i, nbr cmp, B2, B3

B1

B2rn = mul r, nai = add rn, iap = getelementptr a, aiad = load apxp = getelementptr x, ixd = load xpax = mul ad, xdt = add t, axi = add i, 1cmp = icmp lt, i, nbr cmp, B2, B3

B3yp = getelementptr y, rstore t, ypr = add r, 1cmp = icmp lt, r, nbr cmp, B1, B4B4

ret void

B0 (ENTRY)

r = 0cmp = icmp lt r, mbr cmp, B1, B4

VARIABLE DEFINITION


B1



ret void

B0 (ENTRY)


definitions of r

The place where a variable is assigned or defined.

VARIABLE USE


B1



ret void

B0 (ENTRY)


definitions of ruses of r

The places where a variable is referred or used.

REACHING DEFINITION1/2


B1



ret void

B0 (ENTRY)


{d0} {d0, d1}

{d0, d1}

d0:

d1:

{d1}

{d0, d1}

{d0, d1}

Definitions that reaches a use.

REACHING DEFINITION2/2

Constant propagation is a good example to show theusefulness of reaching definition.

v o i d t e s t ( i n t c o n d ) { i n t a = 1 ; / / d 0 i n t b = 2 ; / / d 1 i f ( c o n d ) { c = 3 ; / / d 2 } e l s e { / / R e a c h D e f [ a ] = { d 0 } / / R e a c h D e f [ b ] = { d 1 } c = a + b ; / / d 3 }

/ / R e a c h D e f [ c ] = { d 2 , d 3 } u s e ( c ) ; }

DOMINANCE RELATION1/2

A basic block s dominates t iff every paths that goes fromentry to t will pass through s.

Every basic block in a CFG has an immediate dominator andforms a dominator tree.

DOMINANCE RELATION2/2

B0

B1

B2

B3

B4

B0

B1

B2B3

B4

Dominator Tree

DOMINANCE FRONTIERA basic block t is a dominance frontier of a basic block s, if

one of predecessor of t is dominated by s but t is not strictlydominated by s.

B0

B1 B2

B3

B6

B4B5

DF[B2] = {B3, B5}

DF[B4] = {B4}

DF[B1] = {B3, B5}

STATIC SINGLE-ASSIGNMENT FORMstatic — A static analysis to the program (not the execution.)

single-assignment — Every variable can only be assignedonce.

SSA form is the most popular intermediate representationrecently.

It is adopted by a wide range of compilers, such as GCC,LLVM, Java HotSpot, Android ART, etc.

SSA PROPERTIESEvery variables can only be defined once.

Every uses can only refer to one definition.

Use phi function to handle the merged control-flow.

PHI FUNCTIONS

d e f i n e v o i d @ f o o ( i 1 c o n d , i 3 2 a , i 3 2 b ) { e n t : b r c o n d , b 1 , b 2 b 1 : t 0 = m u l a , 4 b r b 3 b 2 : t 1 = m u l b , 5 b r b 3 b 3 : t 2 = p h i ( t 0 ) , ( t 1 ) u s e ( t 2 ) r e t }

d e f i n e v o i d @ f o o ( i 3 2 n ) { e n t : b r l o o p l o o p : i 0 = p h i ( 0 ) , ( i 1 ) c m p = i c m p g e i 0 , n b r c m p , e n d , c o n t c o n t : u s e ( i 0 ) i 1 = a d d i 0 , 1 b r l o o p e n d : r e t }

ADVANTAGE OF SSA FORMCompact — Reduce the def-use chain.

Referential transparency — The properties associated with avariable will not be changed, aka. context-free.

REDUCED DEF-USE CHAINv o i d f o o ( i n t c o n d 1 , i n t c o n d 2 , i n t a , i n t b ) { i n t t ; i f ( c o n d 1 ) { t = a * 4 ; / / d 0 } e l s e { t = b * 5 ; / / d 1 } i f ( c o n d 2 ) { / / r e a c h - d e f : { d 0 , d 1 } u s e ( t ) ; } e l s e { / / r e a c h - d e f : { d 0 , d 1 } u s e ( t ) ; } }

v o i d f o o ( i n t c o n d 1 , i n t c o n d 2 , i n t a , i n t b ) { i f ( c o n d 1 ) { t . 0 = a * 4 ; } e l s e { t . 1 = b * 5 ; } t . 2 = p h i ( t . 0 , t . 1 ) ; i f ( c o n d 2 ) { u s e ( t . 2 ) ; } e l s e { u s e ( t . 2 ) ; } }

REFERENTIAL TRANSPARENCY1/2

v o i d f o o ( ) { i n t r = 5 ; / / d 0

/ / . . . S O M E C O D E . . .

/ / W e c a n o n l y a s s u m e / / " r = = 5 " i f d 0 i s t h e / / o n l y r e a c h i n g d e f i n i t i o n . u s e ( r ) ; }

v o i d f o o ( ) { r . 0 = 5 ;

/ / . . . S O M E C O D E . . .

/ / N o m a t t e r w h a t c o d e a r e / / s k i p p e d a b o v e , i t i s s a f e / / t o r e p l a c e f o l l o w i n g r . 0 / / w i t h 5 . u s e ( r . 0 ) ; }

REFERENTIAL TRANSPARENCY2/2

v o i d f o o ( i n t a , i n t b ) { i n t c = a + b ; i n t r = a + b ; / / C a n w e s i m p l y r e p l a c e / / a l l o c c u r r e n c e o f r w i t h / / c ? ( N O )

/ / . . . S O M E C O D E . . .

u s e ( r ) ; }

v o i d f o o ( i n t a , i n t b ) { c . 0 = a + b ; r . 0 = a + b ;

/ / . . . S O M E C O D E . . .

/ / N o m a t t e r w h a t c o d e a r e / / s k i p p e d a b o v e , i t i s s a f e / / t o r e p l a c e f o l l o w i n g r . 0 / / w i t h c . 0 . u s e ( r . 0 ) ; }

BUILDING SSA FORM1. Compute domination relationships between basic blocks

and build the dominator tree.2. Compute dominator frontiers.3. Insert phi functions at dominator frontiers.4. Traverse the dominator tree and rename the variables.

BEFORE SSA CONSTRUCTION


B1



ret void

B0 (ENTRY)


AFTER SSA CONSTRUCTION

r.0 = phi (0, B0) (r.1 B3)bp = getelementptr b, r.0t.0 = load bpcmp = icmp lt, 0, nbr cmp, B2, B3

B1

B2t.1 = phi (t.0, B1) (t.2 B2)i.0 = phi (0, B1) (i.1, B2)rn = mul r.0, nai = add rn, i.1ap = getelementptr a, aiad = load apxp = getelementptr x, i.1xd = load xpax = mul ad, xdt.2 = add t.1, axi.1 = add i.0, 1cmp = icmp lt, i.1, nbr cmp, B2, B3

B3t.3 = phi (t0, B1) (t.2, B2)yp = getelementptr y, r.0store t.3, ypr.1 = add r.0, 1cmp = icmp lt, r.1, nbr cmp, B1, B4

B4ret void

B0 (ENTRY)

cmp = icmp lt 0, mbr cmp, B1, B4

OPTIMIZATIONS

CONSTANT PROPAGATION1/3

Constant propagation, sometimes known as constantfolding, will evaluate the instructions with constant

operands and propagate the constant result.a = a d d 2 , 3 b = a c = m u l a , b

a = 5 b = 5 c = 2 5


Why do we need constant propagation?s t r u c t q u e u e * c r e a t e _ q u e u e ( ) { r e t u r n ( s t r u c t q u e u e * ) m a l l o c ( s i z e o f ( s t r u c t q u e u e * ) * 1 6 ) ; }

i n t p r o c e s s _ d a t a ( i n t a , i n t b , i n t c ) { i n t k [ ] = { 0 x 1 3 , 0 x 1 7 , 0 x 1 9 } ; i f ( D E B U G ) { v e r i f y _ d a t a ( k , d a t a ) ; } r e t u r n ( a * k [ 1 ] * k [ 2 ] + b * k [ 0 ] * k [ 2 ] + c * k [ 0 ] * k [ 1 ] ) ; }


For each basic block b from CFG in reversed postorder:

For each instruction i in basic block b from top to bottom:

If all of its operands are constants, the operation has noside-effect, and we know how to evaluate it at compile time,then evaluate the result, remove the instruction, and replaceall usages with the result.

GLOBAL VALUE NUMBERING1/2

Global value numbering tries to give numbers to thecomputed expression and maps the newly visited expression

to the visited ones.

a = c * d ; / / [ ( * , c , d ) - > a ] e = c ; / / [ ( * , c , d ) - > a ] f = e * d ; / / q u e r y : i s ( * , c , d ) a v a i l a b l e ? u s e ( a , e , f ) ;

a = c * d ; u s e ( a , c , a ) ;

GLOBAL VALUE NUMBERING2/2

Traverse the basic blocks in depth-first order on dominatortree.

Maintain a stack of hash table. Once we have returned froma child node on the dominator tree, then we have to pop thestack top.

Visit the instructions in the basic block with tﾠ=ﾠopﾠa,ﾠbform and compute the hash for (op, a, b).

If (op, a, b) is already in the hash table, then change tﾠ=ﾠopﾠa,ﾠb with tﾠ=ﾠhash_tab[(op,ﾠa,ﾠb)]

Otherwise, insert (op, a, b) -> t to the hash table.

Otherwise, insert to the hash table.

DEAD CODE ELIMINATION1/6

Dead code elimination (DCE) removes unreachableinstructions or ignored results.

Constant propagation might reveal more dead code sincethe branch conditions become constant value.

On the other hand, DCE can exploit more constant forconstant propagation because several definitions are

removed from the program.


Conditional statements with constant conditioni f ( k I s D e b u g B u i l d ) { / / D e a d c o d e c h e c k _ i n v a r i a n t ( a , b , c , d ) ; / / D e a d c o d e }


Platform-specific constantv o i d h a s h _ e n t _ s e t _ v a l ( s t r u c t h a s h _ e n t * h , i n t v ) { i f ( s i z e o f ( i n t ) < = s i z e o f ( v o i d * ) ) { h - > p = ( v o i d * ) ( u i n t p t r _ t ) ( v ) ; } e l s e { h - > p = m a l l o c ( s i z e o f ( i n t ) ) ; / / D e a d c o d e * ( h - > p ) = v ; / / D e a d c o d e } }


Computed result ignoredi n t c o m p u t e _ s u m ( i n t a , i n t b ) { i n t s u m = ( a + b ) * ( a - b + 1 ) / 2 ; / / D e a d c o d e : N o t u s e d r e t u r n 0 ; }

Dead code a�er dead store optimizationv o i d t e s t ( i n t * p , b o o l c o n d , i n t a , i n t b , i n t c ) { i f ( c o n d ) { t = a + b ; / / D e a d c o d e * p = t ; / / W i l l b e r e m o v e d b y D S E } * p = c ; }


Dead code a�er code specializationv o i d m a t m u l ( d o u b l e * r e s t r i c t y , u n s i g n e d l o n g m , u n s i g n e d l o n g n , c o n s t d o u b l e * r e s t r i c t a , c o n s t d o u b l e * r e s t r i c t x , c o n s t d o u b l e * r e s t r i c t b ) { i f ( n = = 0 ) { f o r ( u n s i g n e d l o n g r = 0 ; r < m ; + + r ) { d o u b l e t = b [ r ] ; f o r ( u n s i g n e d l o n g i = 0 ; i < n ; + + i ) { / / D e a d c o d e t + = a [ r * n + i ] * x [ i ] ; / / D e a d c o d e } / / D e a d c o d e y [ r ] = t ; } } e l s e { / / . . . s k i p p e d . . . } }


Traverse the CFG starting from entry in reversed post order.

Only traverse the successor that may be visited, i.e. if thebranch condition is a constant, then ignore the other side.

While traversing the basic block, clear the dead instructionswithin the basic block.

A�er the traversal, remove the unvisited basic blocks andremove the variable uses that refers to the variables that aredefined in the unvisited basic blocks.

LOOP OPTIMIZATIONSIt is reasonable to assume a program spends more time in

the loop body. Thus, loop optimization is an important issuein the compiler.

LOOP INVARIANT CODE MOTION1/3

Loop invariant code motion (LICM) is an optimization whichmoves loop invariants or loop constants out of the loop.i n t t e s t ( i n t n , i n t a , i n t b , i n t c ) { i n t s u m = 0 ; f o r ( i n t i = 0 ; i < n ; + + i ) { s u m + = i * a * b * c ; / / a * b * c i s l o o p i n v a r i a n t } r e t u r n s u m ;}


f o r ( u n s i g n e d l o n g r = 0 ; r < m ; + + r ) { d o u b l e t = b [ r ] ; f o r ( u n s i g n e d l o n g i = 0 ; i < n ; + + i ) { t + = a [ ( r * n ) + i ] * x [ i ] ; / / " r * n " i s a i n n e r l o o p i n v a r i a n t } y [ r ] = t ; }

f o r ( u n s i g n e d l o n g r = 0 ; r < m ; + + r ) { d o u b l e t = b [ r ] ; u n s i g n e d l o n g k = r * n ; / / " r * n " m o v e d o u t o f t h e i n n e r l o o p f o r ( u n s i g n e d l o n g i = 0 ; i < n ; + + i ) { t + = a [ k + i ] * x [ i ] ; } y [ r ] = t ; }


How do we know whether a variable is a loop invariant?

If the computation of a value is not (transitively) dependingon following black lists, we can assume such value is a loopinvariant:

Phi instructions associating with the loop beingconsideredInstructions with side-effects or non-pure instructions,e.g. function calls

INDUCTION VARIABLEIf x represents the trip count (iteration count), then we call

a*x+b induction variables.f o r ( u n s i g n e d l o n g r = 0 ; r < m ; + + r ) { d o u b l e t = b [ r ] ; u n s i g n e d l o n g k = r * n ; / / o u t e r l o o p I V f o r ( u n s i g n e d l o n g i = 0 ; i < n ; + + i ) { u n s i g n e d l o n g m = k + i ; / / i n n e r l o o p I V t + = a [ m ] * x [ i ] ; } y [ r ] = t ; }

k and r are induction variables of outer loop.

i and m are induction variables of inner loop.

LOOP STRENGTH REDUCTION1/2

Strength reduction is an optimization to replace themultiplication in induction variables with an addition.f o r ( u n s i g n e d l o n g r = 0 ; r < m ; + + r ) { d o u b l e t = b [ r ] ; u n s i g n e d l o n g k = r * n ; / / o u t e r l o o p I V f o r ( u n s i g n e d l o n g i = 0 ; i < n ; + + i ) { u n s i g n e d l o n g m = k + i ; / / i n n e r l o o p I V t + = a [ m ] * x [ i ] ; } y [ r ] = t ; }

f o r ( u n s i g n e d l o n g r = 0 , k = 0 ; r < m ; + + r , k + = n ) { / / k d o u b l e t = b [ r ] ; f o r ( u n s i g n e d l o n g i = 0 , m = k ; i < n ; + + i , + + m ) { / / m t + = a [ m ] * x [ i ] ; } y [ r ] = t ; }

LOOP STRENGTH REDUCTION2/2

We can even rewrite the range to eliminate the indexcomputation.

i n t s u m ( c o n s t i n t * a , i n t n ) { i n t r e s = 0 ; f o r ( i n t i = 0 ; i < n ; + + i ) { r e s + = a [ i ] ; / / i m p l i c i t * ( a + s i z e o f ( i n t ) * i ) } r e t u r n r e s ;}

i n t s u m ( c o n s t i n t * a , i n t n ) { i n t r e s = 0 ; c o n s t i n t * e n d = a + n ; / / i m p l i c i t a + s i z e o f ( i n t ) * n f o r ( c o n s t i n t * p = a ; p ! = e n d ; + + p ) { / / r a n g e u p d a t e d r e s + = * p ; } r e t u r n r e s ;}

LOOP UNROLLING1/2

Unroll the loop body multiple times.

Purpose: Reduce amortized loop iteration overhead.

Purpose: Reduce load/store stall and exploit instruction-level parallelism, e.g. so�ware pipelining.

Purpose: Prepare for vectorization, e.g. SIMD.

LOOP UNROLLING2/2

f o r ( u n s i g n e d l o n g r = 0 , k = 0 ; r < m ; + + r , k + = n ) { d o u b l e t = b [ r ] ; s w i t c h ( n & 0 x 3 ) { / / D u f f ' s d e v i c e c a s e 3 : t + = a [ k + 2 ] * x [ 2 ] ; c a s e 2 : t + = a [ k + 1 ] * x [ 1 ] ; c a s e 1 : t + = a [ k ] * x [ 0 ] ; } f o r ( u n s i g n e d l o n g i = n & 0 x 3 ; i < n ; i + = 4 ) { t + = a [ k + i ] * x [ i ] ; t + = a [ k + i + 1 ] * x [ i + 1 ] ; t + = a [ k + i + 2 ] * x [ i + 2 ] ; t + = a [ k + i + 3 ] * x [ i + 3 ] ; } y [ r ] = t ; }

INSTRUCTION SELECTION1/5

Instruction selection is a process to map IR into machineinstructions.

Compiler back-ends will perform pattern matching to thebest select instructions (according to the heuristic.)


Complex operations, e.g. shi�-and-add or multiply-accumulate.

Array load/store instructions, which can be translated to oneshi�-add-and-load on some architectures.

IR instructions are which not natively supported by the targetmachine.


d e f i n e i 6 4 @ m a c ( i 6 4 % a , i 6 4 % b , i 6 4 % c ) { e n t : % 0 = m u l i 6 4 % a , % b % 1 = a d d i 6 4 % 0 , % c r e t i 6 4 % 1 }

m a c : m a d d x 0 , x 1 , x 0 , x 2 r e t


d e f i n e i 6 4 @ l o a d _ s h i f t ( i 6 4 * % a , i 6 4 % i ) { e n t : % 0 = g e t e l e m e n t p t r i 6 4 , i 6 4 * % a , i 6 4 % i % 1 = l o a d i 6 4 , i 6 4 * % 0 r e t i 6 4 % 1 }

l o a d _ s h i f t : l d r x 0 , [ x 0 , x 1 , l s l # 3 ] r e t


% s t r u c t . A = t y p e { i 6 4 * , [ 1 6 x i 6 4 ] }

d e f i n e i 6 4 @ g e t _ v a l ( % s t r u c t . A * % p , i 6 4 % i , i 6 4 % j ) { e n t : % 0 = g e t e l e m e n t p t r % s t r u c t . A , % s t r u c t . A * % p , i 6 4 % i , i 3 2 1 , i 6 4 % j % 1 = l o a d i 6 4 , i 6 4 * % 0 r e t i 6 4 % 1 }

g e t _ v a l : m o v z w 8 , # 0 x 8 8 m a d d x 8 , x 1 , x 8 , x 0 a d d x 8 , x 8 , x 2 , l s l # 3 l d r x 0 , [ x 8 , # 8 ] r e t

SSA DESTRUCTION1/5

How do we deal with the phi functions?

Copy the assignment operator to the end of the predecessor.

(Must be done carefully)

SSA DESTRUCTION2/3

cmp = icmp eq, cond, 0br cmp, B1, B2

B0

br B3B1

br B3B2

a = phi (3, B1) (7, B2)print(a)

B3

cmp = icmp eq, cond, 0br cmp, B1, B2

B0

a = 3br B3

B1a = 7br B3

B2

print(a)

B3converts to

Example to show the assignment copying

SSA DESTRUCTION3/5

i.0 = phi (0, B0) (i.1, B2)cmp = icmp lt, i.0, 100br cmp, B2, B3

B1retB3

print(i.0)i.1 = add i.0, 1br B1

B2

br B1B0

cmp = icmp lt, i.0, 100br cmp, B2, B3

B1 B3

print(i.0)i.1 = add i.0, 1i.0 = i.1br B1

B2

i.0 = 0br B1

B0

ret

Converts to

Converts to

Example to show the assignment copying with loop

SSA DESTRUCTION4/5

i.0 = phi (0, B0) (i.1, B2)cmp = icmp lt, i.0, 100br cmp, B2, B3

B1print(i.0)ret

B3

print(i.0)i.1 = add i.0, 1br B1

B2

br B1B0

cmp = icmp lt, i.0, 100br cmp, B2, B3

B1 B3

print(i.0)i.1 = add i.0, 1i.0 = i.1br B1

B2

i.0 = 0br B1

B0

print(i.0)ret

Doesn'twork!

Lost copy problem — Naive de-SSA algorithm doesn't workdue to live range conflicts. Need extra assignments and

renaming to avoid conflicts.

SSA DESTRUCTION5/5

t = xx = yy = tbr cmp, B2, B1

B1

print(x)print(y)

B2

x = 1y = 2br B1

B0

x1 = phi (x0, y1)y1 = phi (y0, x1)br cmp, B2, B1

B1

print(x1)print(y1)

B2

x0 = 1y0 = 2br B1

B0

x1 = y1y1 = x1br cmp, B2, B1

B1

print(x1)print(y1)

B2

x0 = 1y0 = 2x1 = x0y1= y0br B1

B0

Swap problem — Conflicts due to parallel assignmentsemantics of phi instructions. A correct algorithm shoulddetect the cycle and implement parallel assignment with

swap instructions.

REGISTER ALLOCATION1/2

We have to replace infinite virtual registers with finitemachine registers.

Register Allocation — Make sure that the maximumsimultaneous register usages are less then k.

Register Assignment — Assign a register given the fact thatregisters are always sufficient.

The classical solution is to compute the life time of eachvariables, build the inference graph, spill the variable to

stack if the inference graph is not k-colorable.

REGISTER ALLOCATION2/2

ret yadd y, c, gadd g a fadd f, d, eadd e, a, cadd d, a, bmov c, #2mov b, #1mov a, #1

abcdefgy

a

g

bc

ed

fy

INSTRUCTION SCHEDULING1/2

Sort the instruction according the number of cycles in orderto reduce execution time of the critical path.

Constraints: (a) Data dependence, (b) Functional units, (c)Issue width, (d) Datapath forwarding, (e) Instruction cycle

time, etc.

INSTRUCTION SCHEDULING2/2

0 : a d d r 0 , r 0 , r 1 1 : a d d r 1 , r 2 , r 2 2 : l d r r 4 , [ r 3 ] 3 : s u b r 4 , r 4 , r 0 4 : s t r r 4 , [ r 1 ]

2 : l d r r 4 , [ r 3 ] # l o a d i n s t r u c t i o n n e e d s m o r e c y c l e s 0 : a d d r 0 , r 0 , r 1 1 : a d d r 1 , r 2 , r 2 3 : s u b r 4 , r 4 , r 0 4 : s t r r 4 , [ r 1 ]

COMPILER AND ICTINDUSTRY

WHERE CAN WE FIND COMPILER?

SMART PHONESAndroid ART (Java VM)OpenGL|ES GLSL Compiler (Graphics)

BROWSERSJavascript-related

Javascript Engine, e.g. V8, IonMonkey, JavaScriptCore,ChakraRegular Expression EngineWebAssembly

WebGL — OpenGL binding for the Web (Graphics)WebCL — OpenCL binding for the Web (Computing)

DESKTOP APPLICATION RUN-TIMEENVIRONMENTS

Java run-time (Java, Scala, Clojure, Groovy).NET run-time (C#, F#)RubyPythonPHP

VIRTUAL MACHINES AND EMULATORSSimulate different architecture (using dynamicrecompilation), e.g. QEMUSimulate special instructions that are not supported byhypervisor.

DEVELOPMENT ENVIRONMENTC/C++ compiler, e.g. MSVC, GCC, LLVMJava compiler, e.g. OpenJDKDSP compilers for digital signal processors.VHDL compilers for IC design team.Profiling and/or intrusmentation tools, e.g. Valgrind, Dr.Memory, Address Sanitizer, Memory Sanitizer

WHAT DOES A COMPILER TEAM DO?Develop the compiler toolchain for the in-house processors,

e.g. DSP, GPU, CPU, and etc.

Tune the performance of the existing compiler or virtualmachines.

Develop a new programming language or model (e.g. Cudafrom NVIDIA.)

SOME RESEARCH TOPICS ...Memory model for concurrency — Designing a good

memory model at programming language level, whichshould be intuitive to human while open for

compiler/architecture optimization, is still a challengingproblem.

Both Java and C++ took efforts to standardize. However,there are still some unintuitive cases that are allowed and

some desirable cases being ruled out.

THE ENDQ & A

Introduction to Compiler Development

Software

Transcript of Introduction to Compiler Development