CS 201 Computer Systems Programming Chapter 3 “Architecture Overview”
1
CS 201 Computer Systems Programming
Chapter 3 “Architecture Overview”
Herbert G. Mayer, PSU CS
Status 6/29/2014
2
Syllabus
Computing History
Evolution of Microprocessor µP Performance
Processor Performance Growth
Key Architecture Messages
Code Sequences for 3 Different Architectures
Dependencies, AKA Dependences
Score Board
References
3
Computing History
Long, long before 1940s:
1643 Pascal’s Arithmetic Machine
About 1660 Leibnitz Four Function Calculator
1710-1750 Punched Cards by Bouchon, Falcon, Jacquard
1810 Babbage Difference Engine, unfinished; 1st programmer ever in the world was Ada, poet Lord Byron’s daughter, after whom the language Ada was named: Lady Ada Lovelace
1835 Babbage Analytical Engine, also unfinished
1920 Hollerith Tabulating Machine to help with census in the USA
4
Computing History
Decade of 1940s
1939-1942 John Atanasoff built programmable, electronic computer at Iowa State University
1936-1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; colleague advised use of “vacuum tubes”
1946 John von Neumann’s computer design of stored program
1946 Mauchly and Eckert built ENIAC, modeled after Atanasoff’s ideas, built at University of Pennsylvania: Electronic Numerical Integrator and Computer, 30 ton monster
1980s John Atanasoff got acknowledgment and patent officially
5
Computing History
Decade of the 1950s
Univac Uniprocessor based on ENIAC, commercially viable, developed by John Mauchly and John Presper Eckert
Commercial systems sold by Remington Rand
Mark III computer
Decade of the 1960s
IBM’s 360 family co-developed with GE, Siemens, et al.
Transistor replaces vacuum tube
Burroughs stack machines, compete with GPR architectures
All still von Neumann architectures
1969 ARPANET
Cache and VMM developed, first at Manchester University
6
Computing History
Decade of the 1970s
Birth of Microprocessor at Intel, see Gordon Moore
High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series
Architecture advances: Caches, virtual memories (VMM) ubiquitous, since real memories were expensive
Intel 4004, Intel 8080, single-chip microprocessors
Programmable controllers
Mini-computers, PDP 11, HP 3000 16-bit computer
Height of Digital Equipment Corp. (DEC)
Birth of personal computers, which DEC misses!
7
Computing History
Decade of the 1980s
Decrease of mini-computer use
32-bit computing even on minis
Architecture advances: superscalar, faster caches, larger caches
Multitude of Supercomputer manufacturers
Compiler complexity: trace-scheduling, VLIW
Workstations common: Apollo, HP, DEC’s Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.
8
Computing History
Decade of the 1990s
Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution
Powerful desktops
End of mini-computer and of many super-computer manufacturers
Microprocessor powerful as early supercomputers
Consolidation of many computer companies into few larger ones
End of USSR marked the demise of several supercomputer companies
9
Evolution of µP Performance (by: James C. Hoe @ CMU)
                            1970s       1980s       1990s        2000+
Transistor Count            10k-100k    100k-1M     1M-100M      1B
Clock Frequency             0.2-2 MHz   2-20 MHz    0.02-1 GHz   10 GHz
Instructions / cycle (ipc)  < 0.1       0.1-0.9     0.9-2.0      > 10 (?)
MIPs, FLOPs                 < 0.2       0.2-20      20-2,000     100,000
10
Processor Performance Growth
Moore’s Law, from Webopedia 8/27/2004:
“The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.
In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades.”
Others coin a more general law, a bit lamely stating that “the circuit density increases predictably over time.”
11
Processor Performance Growth
As of 2014, Moore’s Law has held true since ~1968.
Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semi-conductor material.
Such phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:
Cars would travel at 2,400,000 mph, and get 600,000 mpg
Air travel LA to NYC would be at Mach 36,000, take 0.5 seconds
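The compounding behind such comparisons can be sketched in a few lines of Python. This is only an illustration: the function name `growth_factor` and the 30-year span are assumptions, since the slide states neither its base year nor its baseline speeds.

```python
def growth_factor(years, doubling_months=18):
    """Improvement factor after `years` of doubling every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings

# Assumed span of 30 years (not from the slide): 20 doublings.
print(growth_factor(30))  # 1048576.0
```

Thirty years of 18-month doublings already give a factor of about one million, which is the order of magnitude the car and air-travel comparisons rely on.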
12
Key Architecture Messages
13
Message 1: Memory is Slow
The inner core of the processor, the CPU or the µP, is getting faster at a steady rate
Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on progressively slower memories, relatively speaking
It is not uncommon on an MP server that the processor has to wait >100 cycles before a memory access completes; that is one single memory access. On a Multi-Processor the bus protocol is more complex due to snooping, backing-off, arbitration, thus the number of cycles to complete a memory access can grow high
IO simply compounds the problem of slow memory access
14
Message 1: Memory is Slow
Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option for 64-bit architectures, due to the price/size/cost/power if you pursue full memory population with 2^64 bytes
Another way of seeing this: Using solely reasonably-priced cache memories (say at < 10 times the cost of regular memory) is not feasible: resulting physical address space would be too small, or price too high
Significant intellectual effort in computer architecture focuses on reducing the performance impact of fast processors accessing slow, virtualized memories
All else, except IO, seems easy compared to this fundamental problem!
IO is even slower by further orders of magnitude
15
Message 1: Memory is Slow
[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) over time, 1980 to 2002. CPU (µProc) improves 60%/yr. (“Moore’s Law”), DRAM improves 7%/yr.; the gap grows 50% / year. Source: David Patterson, UC Berkeley]
16
Message 2: Events Tend to Cluster
A strange thing happens during program execution: Seemingly unrelated events tend to cluster
Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time such clustering is observed. Intuitively, one memory access seems independent of another, but they both happen to fall onto the same page (or working set of pages)
We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches and increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial from temporal locality
17
Message 2: Events Tend to Cluster
Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries
Incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key, happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential, linear search
Programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!
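A small Python sketch can make the clustering visible. The hash function `weak_hash`, the helper `bucket_sizes`, the table size of 16, and the sample identifiers are all made up for illustration; they are not taken from the slides.

```python
def weak_hash(key, table_size):
    """Hypothetical poor hash: sum of character codes modulo the table size."""
    return sum(ord(ch) for ch in key) % table_size

def bucket_sizes(keys, table_size, hash_fn):
    """Count how many keys land in each table entry."""
    sizes = [0] * table_size
    for key in keys:
        sizes[hash_fn(key, table_size)] += 1
    return sizes

# Made-up identifiers that are anagrams of one another.
keys = ["abc", "acb", "bac", "bca", "cab", "cba"]
sizes = bucket_sizes(keys, 16, weak_hash)
print(max(sizes))  # 6: every key collides in one table entry
```

Because the sum of character codes ignores character order, all six keys share one entry, and a lookup in that entry degenerates into a linear search of the chain.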
18
Message 2: Events Tend to Cluster
Clustering happens in all diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by having a copy of frequently used data in a faster memory unit, it happens that a small cache suffices to speed up execution
This is due to Data Locality (spatial and temporal): data that have been accessed recently will again be accessed in the near future, or at least data that live close by will be accessed in the near future
Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon
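The effect of spatial locality on a small cache can be sketched with a toy model in Python. Everything here is an assumption for illustration: a fully associative cache with LRU replacement, 8 lines of 64 bytes, and the two synthetic address streams.

```python
from collections import OrderedDict

def hit_rate(addresses, num_lines=8, line_bytes=64):
    """Fraction of accesses served by a tiny fully associative LRU cache."""
    cache = OrderedDict()            # line number -> True, oldest first
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / len(addresses)

sequential = [4 * i for i in range(4096)]   # walk an array of 4-byte words
strided = [4096 * i for i in range(4096)]   # jump 4 KB on every access

print(hit_rate(sequential))  # 0.9375
print(hit_rate(strided))     # 0.0
```

With 16 words per 64-byte line, the sequential walk misses only once per line, so even this 512-byte cache serves 15 of every 16 accesses; the strided walk touches a new line each time and gains nothing.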
19
Message 3: Heat is Bad
Clocking a processor fast (e.g. > 3-5 GHz) can increase performance and thus generally “is good”
Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable
This comes at the cost of higher current, thus more heat generated in the identical physical geometry (the real-estate) of the silicon processor or also the chipset
But the silicon part acts like a heat-conductor, conducting better as it gets warmer (negative temperature coefficient resistor, or NTC). Since the power-supply is a constant-current source, a lower resistance causes lower voltage, shown as VDroop in the figure below
20
Message 3: Heat is Bad
21
Message 3: Heat is Bad
This in turn means voltage must be increased artificially to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the part
Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep-states of the Silicon part (processor as well as chip-set), and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time
It is good that to date Silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Else CPUs would become larger, more expensive, and above all: hotter.
22
Message 4: Resource Replication
Architects cannot increase clock speed beyond physical limitations
One cannot decrease the die size beyond evolving technology
Yet speed improvements are desired, and must be achieved
This conflict can partly be overcome with replicated resources! But careful!
Why careful? Resources could be used for better purpose!
23
Message 4: Resource Replication
Key obstacle to parallel execution is data dependence in the SW under execution. A datum cannot be used before it has been computed
Compiler optimization technology calls this use-def dependence (short for use-before-definition), AKA true dependence, AKA data dependence
Goal is to search for program portions that are independent of one another. This can be at multiple levels of focus
24
Message 4: Resource Replication
At the very low level of registers, at the machine level: done by HW; see also score board
At the low level of individual machine instructions: done by HW; see also superscalar architecture
At the medium level of subexpressions in a program: done by compiler; see CSE
At the higher level of several statements written in sequence in high-level language program: done by optimizing compiler or by programmer
Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results: done by the user running concurrent programs
25
Message 4: Resource Replication
Whenever program portions are independent of one another, they can be computed at the same time: in parallel; but will they?
Architects provide resources for this parallelism
Compilers need to uncover opportunities for parallelism
If two actions are independent of one another, they can be computed simultaneously
Provided that HW resources exist, that the absence of dependence has been proven, and that independent execution paths are scheduled on these replicated HW resources
26
Code Samples for 3 Different, Hypothetical Architectures
27
The 3 Different Architectures
1. Single Accumulator Architecture
Has one implicit register for all/any operations: accumulator
Operations frequently require intermediate temps!
Code relies heavily on load-store to-from temps
2. Three-Address GPR Architecture
Allows complex operations with multiple operands all in one instruction
Hence complex opcode bits
3. Stack Machine Architecture
Operands are implied on the stack, except load/store
Hence all operations are simple, few bits, but all are memory accesses
28
Code 1 for 3 Different Architectures
Example 1: Object Code Sequence Without Optimization
Strict left-to-right translation, no smarts in mapping
Consider non-commutative subtraction and division operators
We’ll use no common subexpression elimination (CSE), and no register reuse
Conventional operator precedence
For Single Accumulator SAA, Three-Address GPR, Stack Architectures
Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c
29
Code 1 for 3 Different Architectures
No   Single-Accumulator   Three-Address GPR      Stack Machine
                          dest ← op1 op op2
1    ld a                 add r1, a, #3          push a
2    add #3               mult r2, r1, b         pushlit #3
3    mult b               add r3, a, #3          add
4    st temp1             div r4, r3, c          push b
5    ld a                 sub d, r2, r4          mult
6    add #3                                      push a
7    div c                                       pushlit #3
8    st temp2                                    add
9    ld temp1                                    push c
10   sub temp2                                   div
11   st d                                        sub
12                                               pop d
30
Code 1 for 3 Different Architectures
Three-address code looks shortest, w.r.t. number of instructions
Maybe optical illusion, must also consider number of bits for instructions
Must consider number of I-fetches, operand fetches, total number of stores
Numerous memory accesses on SAA (Single Accumulator Architecture) due to temporary values held in memory
Most memory accesses on SA (Stack Architecture), since everything requires a memory access
Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either order
No need for reverse-operation opcodes for Three-Address architecture
Decide in Three-Address architecture how to encode operand types
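To make the remark about SAA memory traffic concrete, the following Python sketch interprets the single-accumulator sequence of Example 1 and counts data-memory accesses. The interpreter `run_accumulator`, its counting rule (one access per non-literal operand or store, instruction fetches ignored), and the input values are assumptions for illustration, not part of the hypothetical machine's definition.

```python
def run_accumulator(program, memory):
    """Run single-accumulator code; return (memory, data-memory accesses)."""
    acc = 0
    accesses = 0
    for instr in program:
        op, arg = instr.split()
        if op == "st":                 # store accumulator to memory
            memory[arg] = acc
            accesses += 1
            continue
        if arg.startswith("#"):        # literal operand, no memory access
            value = int(arg[1:])
        else:                          # operand fetched from memory
            value = memory[arg]
            accesses += 1
        if op == "ld":
            acc = value
        elif op == "add":
            acc += value
        elif op == "sub":
            acc -= value
        elif op == "mult":
            acc *= value
        elif op == "div":
            acc /= value
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory, accesses

saa_code1 = ["ld a", "add #3", "mult b", "st temp1", "ld a", "add #3",
             "div c", "st temp2", "ld temp1", "sub temp2", "st d"]

memory, accesses = run_accumulator(saa_code1, {"a": 5, "b": 2, "c": 4})
print(memory["d"], accesses)  # 14.0 9
```

Of the 11 instructions, 9 touch data memory, and 4 of those accesses exist only to park and reload the temporaries `temp1` and `temp2`.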
31
Code 2 for Different Architectures
This time we eliminate the common subexpression (CSE)
Compiler handles left-to-right order for non-commutative operators on SAA
Better: d ← ( a + 3 ) * b - ( a + 3 ) / c
32
Code 2 for Different Architectures
No   Single-Accumulator   Three-Address GPR      Stack Machine
                          dest ← op1 op op2
1    ld a                 add r1, a, #3          push a
2    add #3               mult r2, r1, b         pushlit #3
3    st temp1             div r1, r1, c          add
4    div c                sub d, r2, r1          dup
5    st temp2                                    push b
6    ld temp1                                    mult
7    mult b                                      xch
8    sub temp2                                   push c
9    st d                                        div
10                                               sub
11                                               pop d
33
Code 2 for Different Architectures
Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses temp1 for common subexpression; has no other register for temps!!
SAA could use negate instruction or reverse subtract
Register-use optimized for Three-Address architecture
Common subexpression optimized on Stack Machine by duplicating (dup) and exchanging (xch)
Instruction count reduced by 20% for Three-Address, 18% for SAA, only 8% for Stack Machine
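The two stack-machine sequences can be checked against each other with a small interpreter. This Python sketch is an assumption-laden model: `run_stack` and the input values are invented here, and `xch` is taken to swap the two topmost stack entries, which is what the Code 2 sequence needs to keep the subtraction operands in order.

```python
def run_stack(program, memory):
    """Execute stack-machine instructions; return the updated memory."""
    stack = []
    binary = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mult": lambda x, y: x * y,
        "div": lambda x, y: x / y,
    }
    for instr in program:
        op, *args = instr.split()
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "pushlit":
            stack.append(int(args[0].lstrip("#")))
        elif op == "pop":
            memory[args[0]] = stack.pop()
        elif op == "dup":                      # duplicate top of stack
            stack.append(stack[-1])
        elif op == "xch":                      # exchange the two top entries
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:                                  # binary operator
            right = stack.pop()
            left = stack.pop()
            stack.append(binary[op](left, right))
    return memory

code1 = ["push a", "pushlit #3", "add", "push b", "mult",
         "push a", "pushlit #3", "add", "push c", "div", "sub", "pop d"]
code2 = ["push a", "pushlit #3", "add", "dup", "push b", "mult",
         "xch", "push c", "div", "sub", "pop d"]

values = {"a": 5, "b": 2, "c": 4}              # arbitrary test inputs
d1 = run_stack(code1, dict(values))["d"]
d2 = run_stack(code2, dict(values))["d"]
reduction = 100 * (len(code1) - len(code2)) / len(code1)
print(d1, d2, round(reduction))  # 14.0 14.0 8
```

Both sequences compute the same value, and dropping from 12 to 11 instructions is the roughly 8% saving quoted for the Stack Machine.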
34
Code 3 for Different Architectures
Analyze 2 similar expressions but with increasing operator precedence left-to-right; in 2nd case precedences are overridden by ( )
One operator sequence associates right-to-left, due to arithmetic precedence
Compiler uses commutativity
The other left-to-right, due to explicit parentheses ( )
Use simple-minded code generation model: no cache, no optimization
Will there be advantages/disadvantages caused by the architecture?
Expression 1 is: e ← a + b * c ^ d
35
Code 3 for Different Architectures
Expression 1 is: e ← a + b * c ^ d
No   Single-Accumulator   Three-Address GPR      Stack Machine
                          dest ← op1 op op2      Implied Operands
1    ld c                 expo r1, c, d          push a
2    expo d               mult r1, b, r1         push b
3    mult b               add e, a, r1           push c
4    add a                                       push d
5    st e                                        expo
6                                                mult
7                                                add
8                                                pop e
Expression 2 is: f ← ( ( g + h ) * i ) ^ j
Here the operators associate left-to-right due to parentheses
36
Code 3 for Different Architectures
Expression 2 is: f ← ( ( g + h ) * i ) ^ j
No   Single-Accumulator   Three-Address GPR      Stack Machine
                          dest ← op1 op op2      Implied Operands
1    ld g                 add r1, g, h           push g
2    add h                mult r1, i, r1         push h
3    mult i               expo f, r1, j          add
4    expo j                                      push i
5    st f                                        mult
6                                                push j
7                                                expo
8                                                pop f
Observations, Interaction of Precedence and Architecture
Software eliminates constraints imposed by precedence: looking ahead
Execution times identical for the 2 different expressions on the same architecture, unless blurred by secondary effect; see cache example below
Conclusion: all architectures handle arithmetic and logic operations well
37
Code For Stack Architecture
Stack Machine with no register inherently slow, due to: Memory Accesses!!!
Implement few top of stack elements via HW shadow registers → Cache
Let us then measure equivalent code sequences with and without consideration for cache
Top-of-stack register tos identifies the last valid word on physical stack
Two shadow registers may hold 0, 1, or 2 true top words
Top of stack cache counter tcc specifies number of shadow registers actually used
Thus tos plus tcc jointly specify true top of stack
38
Code For Stack Architecture
[Figure: stack in memory with tos marking the last valid word and free space beyond it; 2 tos shadow registers; tcc = 0, 1, or 2]
39
Code For Stack Architecture
Timings for push, pushlit, add, pop operations depend on tcc
Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itself
Generally, memory access adds 2 cycles
For stack changes use some defined policy, e.g. keep tcc 50% full
Table below refines timings for stack with shadow registers
Note: push x into cache with free space requires 2 cycles, which are for the memory fetch: cache adjustment is done at the same time as memory fetch
40
Code For Stack Architecture
operation    cycles   tcc before   tcc after   tos change   comment
add          1        tcc = 2      tcc = 1     no change
add          1+2      tcc = 1      tcc = 1     tos--        underflow?
add          1+2+2    tcc = 0      tcc = 1     tos -= 2     underflow?
push x       2        tcc = 0,1    tcc++       no change    tcc update in parallel
push x       2+2      tcc = 2      tcc = 2     tos++        overflow?
pushlit #3   1        tcc = 0,1    tcc++       no change
pushlit #3   1+2      tcc = 2      tcc = 2     tos++        overflow?
pop y        2        tcc = 1,2    tcc--       no change
pop y        2+2      tcc = 0      tcc = 0     tos--        underflow?
41
Code For Stack Architecture
Code emission for: a + b * c ^ ( d + e * f ^ g )
Let + and * be commutative, by language rule
Architecture here has 2 shadow registers, compiler exploits this
Assume initially empty 2-word cache
42
Code For Stack Architecture
#    1: Left-to-Right   cycles 1   2: Exploit Cache         cycles 2
1    push a             2          push f                   2
2    push b             2          push g                   2
3    push c             4          expo                     1
4    push d             4          push e                   2
5    push e             4          mult                     1
6    push f             4          push d                   2
7    push g             4          add                      1
8    expo               1          push c                   2
9    mult               3          r_expo = swap + expo     1
10   add                3          push b                   2
11   expo               3          mult                     1
12   mult               3          push a                   2
13   add                3          add                      1
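The per-row cycle counts of both columns can be re-derived from the timing table for the stack with shadow registers. The following Python sketch models only tcc and the cycle cost. It is an illustration, not a hardware description: the function name `run`, the operation lists, and the treatment of `r_expo` as an ordinary two-operand operation in every tcc state are assumptions.

```python
SHADOW_REGS = 2                      # tcc ranges over 0, 1, 2
BINARY_OPS = {"add", "mult", "expo", "r_expo"}   # r_expo: assumed to cost like add


def run(ops):
    """Return the total cycles for a list of operations, starting with tcc = 0."""
    tcc = 0
    total = 0
    for op in ops:
        kind = op.split()[0]
        if kind == "push":
            if tcc < SHADOW_REGS:
                total += 2           # memory fetch; tcc update in parallel
                tcc += 1
            else:
                total += 2 + 2       # cache full: one word spills to memory
        elif kind == "pushlit":
            if tcc < SHADOW_REGS:
                total += 1
                tcc += 1
            else:
                total += 1 + 2
        elif kind == "pop":
            if tcc > 0:
                total += 2
                tcc -= 1
            else:
                total += 2 + 2       # operand must come from memory
        elif kind in BINARY_OPS:
            if tcc == 2:
                total += 1           # both operands in shadow registers
            elif tcc == 1:
                total += 1 + 2       # one operand fetched from memory
            else:
                total += 1 + 2 + 2   # both operands fetched from memory
            tcc = 1                  # the result stays in a shadow register
        else:
            raise ValueError(f"unknown operation: {op}")
    return total


LEFT_TO_RIGHT = ["push a", "push b", "push c", "push d", "push e", "push f",
                 "push g", "expo", "mult", "add", "expo", "mult", "add"]
EXPLOIT_CACHE = ["push f", "push g", "expo", "push e", "mult", "push d", "add",
                 "push c", "r_expo", "push b", "mult", "push a", "add"]

if __name__ == "__main__":
    print(run(LEFT_TO_RIGHT), run(EXPLOIT_CACHE))
```

Summing each column by hand gives the same totals as the model: 40 cycles for the left-to-right order and 20 cycles for the cache-exploiting order.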
43
Code For Stack Architecture
Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance
Code emission with shadow register consideration costs 20 cycles
True penalty for memory access is worse in practice
Tremendous speed-up is always possible when fixing a system with severe flaws
Return on investment for 2 registers is twice the original performance
Such strong speedup is an indicator that the starting architecture was poor
Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance
Note that indexing, looping, indirection, call/return are not addressed here
44
Data Dependences (sic.)
Register Dependencies
45
Register Dependencies
Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers being defined and used
One instruction computes a result into a register (or memory), another instruction needs that result from that same register (or that memory location)
Or, one instruction uses a datum; and after its use the same item is then recomputed
Dependences require sequential execution, lest the result be unpredictable
46
Register Dependencies
True-Dependence, AKA Data Dependence: <- synonymous!
   r3 ← r1 op r2
   r5 ← r3 op r4
   Read after Write, RAW
Anti-Dependence, not a true dependence; parallelize under right condition
   r3 ← r1 op r2
   r1 ← r5 op r4
   Write after Read, WAR
Output Dependence
   r3 ← r1 op r2
   r5 ← r3 op r4
   r3 ← r6 op r7
   Write after Write, WAW, with a use in between
47
Register Dependencies
Control Dependence:
if ( condition1 ) {
   r3 = r1 op r2;
}else{               // see the jump here?
   r5 = r3 op r4;
} // end if
write( r3 );
48
Register Renaming
Only data dependence is a real dependence, hence called true dependence
Other dependences are artifacts of insufficient resources, generally insufficient registers
This means: if additional registers were available, then replacing some of these conflicting registers with new ones could make the conflict disappear!
Anti- and Output-Dependences are indeed such false dependences
49
Register Renaming
Original Code:
L1:   r1 ← r2 op r3
L2:   r4 ← r1 op r5
L3:   r1 ← r3 op r6
L4:   r3 ← r1 op r7
Dependences before:
Lx, Ly   which dependence?
50
Register Renaming
      Original Code:       New Code, after adding regs:
L1:   r1 ← r2 op r3        r10 ← r2 op r30   -- r30 instead
L2:   r4 ← r1 op r5        r4 ← r10 op r5    -- r10 instead
L3:   r1 ← r3 op r6        r1 ← r30 op r6
L4:   r3 ← r1 op r7        r3 ← r1 op r7

Dependences before:            Dependences after:
L1, L2 true-Dep with r1        L1, L2 true-Dep with r10
L1, L3 output-Dep with r1      L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3
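The two dependence lists can be checked mechanically. The sketch below is one possible way to do it in Python, assuming each instruction is written as a hypothetical tuple `(dest, src1, src2)` and that a dependence on a register is not reported when an instruction in between rewrites that register (which is why L1, L4 is not listed as a true dependence on r1).

```python
from typing import List, Tuple

Instr = Tuple[str, str, str]   # (dest, src1, src2) for "dest <- src1 op src2"


def dependences(code: List[Instr]) -> List[Tuple[int, int, str, str]]:
    """Return (Lx, Ly, kind, register) for every dependence with x < y."""
    found = []
    for i, (di, *si) in enumerate(code):
        for j in range(i + 1, len(code)):
            dj, *sj = code[j]
            # registers redefined strictly between instruction i and j
            written_between = {code[k][0] for k in range(i + 1, j)}
            if di in sj and di not in written_between:
                found.append((i + 1, j + 1, "true", di))     # RAW
            if dj in si and dj not in written_between:
                found.append((i + 1, j + 1, "anti", dj))     # WAR
            if di == dj and di not in written_between:
                found.append((i + 1, j + 1, "output", di))   # WAW
    return found


ORIGINAL = [("r1", "r2", "r3"),
            ("r4", "r1", "r5"),
            ("r1", "r3", "r6"),
            ("r3", "r1", "r7")]

RENAMED = [("r10", "r2", "r30"),
           ("r4", "r10", "r5"),
           ("r1", "r30", "r6"),
           ("r3", "r1", "r7")]

if __name__ == "__main__":
    for name, code in (("before", ORIGINAL), ("after", RENAMED)):
        print(name)
        for x, y, kind, reg in dependences(code):
            print(f"  L{x}, L{y} {kind}-Dep with {reg}")
```

For the original code this reports the six dependences listed under "before"; for the renamed code only the two true dependences remain.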
51
Register Renaming
With these additional --or renamed-- regs, the new code could possibly run in half the time, since the pairs L1, L2 and L3, L4 no longer depend on each other and can execute in parallel!
First: Compute into r10 instead of r1, but you need to have the additional register r10; no time penalty!
Also: Compute into r30 instead of r3, if r30 available; also no time penalty!
Then the following regs are live afterwards: r1, r3, r4
While r10 and r30 are don’t cares afterwards
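The "half the time" claim can be made concrete with a small Python sketch. It assumes an idealized machine in which every instruction takes one step, any number of instructions may run in the same step, and every dependence (true, anti, or output) forces the later instruction into a later step. The function name `schedule_length` and the pair encoding are hypothetical; the pairs themselves are the dependences before and after renaming.

```python
def schedule_length(n, deps):
    """Length in steps of the shortest schedule for instructions L1..Ln.

    deps is a list of pairs (x, y) with x < y, meaning Ly must run after Lx.
    """
    step = [1] * (n + 1)             # step[k]: earliest step of Lk; index 0 unused
    for later in range(1, n + 1):    # increasing order, so step[x] is final
        for x, y in deps:
            if y == later:
                step[later] = max(step[later], step[x] + 1)
    return max(step[1:])


DEPS_BEFORE = [(1, 2), (1, 3), (1, 4), (3, 4), (2, 3), (3, 4)]
DEPS_AFTER = [(1, 2), (3, 4)]

if __name__ == "__main__":
    print(schedule_length(4, DEPS_BEFORE))   # L1, L2, L3, L4 strictly in sequence
    print(schedule_length(4, DEPS_AFTER))    # {L1, L3} first, then {L2, L4}
```

Under these assumptions the original code needs 4 steps and the renamed code needs 2.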
52
Score Board
Score-board is an array of HW programmable bits sb[]
Manages other HW resources, specifically registers
In the single-bit HW array sb[], every bit i in sb[i] is associated with a specific register, the one identified by i, i.e. ri
Association is by index, i.e. by name: sb[i] belongs to reg ri
Only if sb[i] = 0, does that register i have valid data
If sb[i] = 0 then register ri is NOT in process of being written
If bit i is set, i.e. if sb[i] = 1, then that register ri is reserved
Initially all sb[*] are free to use, i.e. set to 0
53
Score Board
Execution constraints:
rd ← rs op rt
if sb[s] or if sb[t] is set → RAW dependence, hence stall the computation; wait until both rs and rt are available
if sb[d] is set → WAW dependence, hence stall the write; wait until rd has been used; SW can sometimes determine to use another register instead of rd
Else, if none of the 3 registers are in use, dispatch the instruction immediately
54
Score Board
To allow out of order (ooo) execution, upon computing the value of rd
Update rd, and clear sb[d]
For uses (references), HW may use any register i, whose sb[i] is 0
For definitions (assignments), HW may set any register j, whose sb[j] is 0
Independent of original order, in which source program was written, i.e. possibly ooo
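The score board rules can be sketched as a small software model in Python. This is an illustration only; the class `ScoreBoard`, its method names, and the returned status strings are hypothetical, and real hardware performs these checks in parallel logic rather than in sequence.

```python
class ScoreBoard:
    """One bit per register: sb[i] = 1 means ri is reserved (being written)."""

    def __init__(self, num_regs):
        self.sb = [0] * num_regs          # initially all registers are free

    def try_dispatch(self, d, s, t):
        """Try to dispatch rd <- rs op rt; return the resulting status."""
        if self.sb[s] or self.sb[t]:
            return "stall: RAW"           # a source is still being computed
        if self.sb[d]:
            return "stall: WAW"           # destination has a pending write
        self.sb[d] = 1                    # reserve rd until its value arrives
        return "dispatched"

    def complete(self, d):
        """The value of rd has been computed: update rd and clear sb[d]."""
        self.sb[d] = 0


if __name__ == "__main__":
    board = ScoreBoard(8)
    print(board.try_dispatch(3, 1, 2))    # r3 <- r1 op r2
    print(board.try_dispatch(5, 3, 4))    # r5 <- r3 op r4, needs pending r3
    print(board.try_dispatch(3, 6, 7))    # r3 <- r6 op r7, r3 still pending
    print(board.try_dispatch(0, 6, 7))    # r0 <- r6 op r7, independent
    board.complete(3)
    print(board.try_dispatch(5, 3, 4))    # r3 is valid now
```

In this model an independent instruction is dispatched while earlier ones are stalled, which is the out-of-order behavior the score board is meant to allow.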
55
References
1. The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html
2. Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations
3. http://en.wikipedia.org/wiki/Moore's_law
4. C. A. R. Hoare’s comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf
5. Gibbons, P. B, and Steven Muchnick [1986]. “Efficient Instruction Scheduling for a Pipelined Architecture”, ACM Sigplan Notices, Proceeding of ’86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp 11-16
6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/
7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm
8. Words of wisdom: http://www.cs.yale.edu/quotes.html
9. John von Neumann’s computer design: A.H. Taub (ed.), “Collected Works of John von Neumann”, vol 5, pp. 34-79, The MacMillan Co., New York 1963