CS 201 Computer Systems Programming, Chapter 3: “Architecture Overview”

1

CS 201 Computer Systems Programming

Chapter 3 “Architecture Overview”

Herbert G. Mayer, PSU CS, Status 6/29/2014

2

Syllabus

Computing History

Evolution of Microprocessor µP Performance

Processor Performance Growth

Key Architecture Messages

Code Sequences for 3 Different Architectures

Dependencies, AKA Dependences

Score Board

References

3

Computing History

Long, long before the 1940s:

1643 Pascal’s Arithmetic Machine

About 1660 Leibniz Four Function Calculator

1725 - 1804 Punched Cards by Bouchon, Falcon, Jacquard

1822 Babbage Difference Engine, unfinished; 1st programmer ever in the world was Ada, poet Lord Byron’s daughter, after whom the language Ada was named: Lady Ada Lovelace

1835 Babbage Analytical Engine, also unfinished

1890 Hollerith Tabulating Machine to help with census in the USA

4

Computing History

Decade of the 1940s

1939 – 1942 John Atanasoff built programmable, electronic computer at Iowa State University

1936 - 1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; colleague advised use of “vacuum tubes”

1946 John von Neumann’s computer design of stored program

1946 Mauchly and Eckert built ENIAC, modeled after Atanasoff’s ideas, built at University of Pennsylvania: Electronic Numerical Integrator and Computer, 30 ton monster

1980s John Atanasoff got acknowledgment and patent officially

5

Computing History

Decade of the 1950s

Univac Uniprocessor based on ENIAC, commercially viable, developed by John Mauchly and John Presper Eckert

Commercial systems sold by Remington Rand

Mark III computer

Decade of the 1960s

IBM’s 360 family co-developed with GE, Siemens, et al.

Transistor replaces vacuum tube

Burroughs stack machines, compete with GPR architectures

All still von Neumann architectures

1969 ARPANET

Cache and VMM developed, first at Manchester University

6

Computing History

Decade of the 1970s

Birth of Microprocessor at Intel, see Gordon Moore

High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series

Architecture advances: Caches, virtual memories (VMM) ubiquitous, since real memories were expensive

Intel 4004, Intel 8080, single-chip microprocessors

Programmable controllers

Mini-computers, PDP 11, HP 3000 16-bit computer

Height of Digital Equipment Corp. (DEC)

Birth of personal computers, which DEC misses!

7


Decade of the 1980s

Decrease of mini-computer use

32-bit computing even on minis

Architecture advances: superscalar, faster caches, larger caches

Multitude of Supercomputer manufacturers

Compiler complexity: trace-scheduling, VLIW

Workstations common: Apollo, HP, DEC’s Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.

8


Decade of the 1990s

• Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution

• Powerful desktops

• End of mini-computer and of many super-computer manufacturers

• Microprocessor powerful as early supercomputers

• Consolidation of many computer companies into few larger ones

• End of USSR marked the demise of several supercomputer companies

9

Evolution of µP Performance (by: James C. Hoe @ CMU)

                          1970s       1980s       1990s          2000+
Transistor Count          10k-100k    100k-1M     1M-100M        1B
Clock Frequency           0.2-2 MHz   2-20 MHz    0.02 - 1 GHz   10 GHz
Instructions / cycle: ipc < 0.1       0.1 - 0.9   0.9 - 2.0      > 10 (?)
MIPs, FLOPs               < 0.2       0.2 - 20    20 - 2,000     100,000

10

Processor Performance Growth

Moore’s Law, from Webopedia 8/27/2004:

“The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.

In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades.”

Others coin a more general law, a bit lamely stating that “the circuit density increases predictably over time.”

11

Processor Performance Growth

As of 2014, Moore’s Law has held true since ~1968.

Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semi-conductor material.

Such phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:

Cars would travel at 2,400,000 mph, and get 600,000 mpg

Air travel LA to NYC would be at Mach 36,000, and take 0.5 seconds
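As a rough check on what "doubling every 18 months" implies, the sketch below computes the compound growth factor over a span of years. The function name `growth_factor` and the 30-year span are illustrative assumptions; the slide does not state the base year or base values behind its car and air-travel figures.

```python
def growth_factor(years: float, doubling_period_years: float = 1.5) -> float:
    """Compound growth after `years` if a quantity doubles every `doubling_period_years`."""
    doublings = years / doubling_period_years
    return 2.0 ** doublings


if __name__ == "__main__":
    # Assumed span, for illustration only: 30 years is 20 doublings at 18 months each.
    print(f"{growth_factor(30):,.0f}x")
```

Twenty doublings already amount to a factor of about one million, which is why the comparisons with other industries look so extreme.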

12

Key Architecture Messages

13

Message 1: Memory is Slow

The inner core of the processor, the CPU or the µP, is getting faster at a steady rate

Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on progressively slower memories, relatively speaking

It is not uncommon on an MP server that a processor has to wait >100 cycles before a memory access completes; that is one single memory access. On a Multi-Processor the bus protocol is more complex due to snooping, backing-off, arbitration, thus the number of cycles to complete a memory access can grow high

IO simply compounds the problem of slow memory access

14

Message 1: Memory is Slow

Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option for 64-bit architectures, due to the price/size/cost/power if you pursue full memory population with 2^64 bytes

Another way of seeing this: Using solely reasonably-priced cache memories (say at < 10 times the cost of regular memory) is not feasible: resulting physical address space would be too small, or price too high

Significant intellectual effort in computer architecture focuses on reducing the performance impact of fast processors accessing slow, virtualized memories

All else, except IO, seems easy compared to this fundamental problem!

IO is even slower by further orders of magnitude

15

Message 1: Memory is Slow

[Figure: Processor-Memory Performance Gap, 1980 to 2002. Performance on a log scale (1 to 1000) over time: CPU (µProc, “Moore’s Law”) improves 60%/yr., DRAM improves 7%/yr.; the gap grows 50% / year. Source: David Patterson, UC Berkeley]
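The 50% figure follows from the two growth rates: if processor performance grows by 60% per year and DRAM by 7%, their ratio grows by 1.60 / 1.07 each year. A minimal sketch of that arithmetic (the function names are illustrative and not from the slides):

```python
def gap_growth_per_year(cpu_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Yearly growth factor of the CPU/DRAM performance ratio."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


def gap_after(years: int, cpu_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Relative processor-memory gap after `years`, starting from a gap of 1."""
    return gap_growth_per_year(cpu_rate, dram_rate) ** years


if __name__ == "__main__":
    print(f"gap grows by {gap_growth_per_year() - 1:.1%} per year")
    print(f"gap after 10 years: {gap_after(10):.0f}x")
```

The yearly factor comes out just under 1.5, matching the "grows 50% / year" label, and it compounds: over a decade the relative gap widens by more than a factor of 50.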

16

Message 2: Events Tend to Cluster

A strange thing happens during program execution: Seemingly unrelated events tend to cluster

Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time such clustering is observed. Intuitively, one memory access seems independent of another, but they both happen to fall onto the same page (or working set of pages)

We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches and increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial from temporal locality
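To make the notion of a working set concrete, the sketch below counts how many distinct pages an address trace touches. The 4096-byte page size, the 8-byte element size, and both synthetic traces are assumptions chosen for illustration, not measurements.

```python
def pages_touched(addresses, page_size=4096):
    """Return the set of page numbers referenced by a sequence of byte addresses."""
    return {addr // page_size for addr in addresses}


def sequential_trace(base, count, stride=8):
    """Addresses of `count` consecutive 8-byte elements starting at `base`."""
    return [base + i * stride for i in range(count)]


def scattered_trace(base, count, stride=4096):
    """Same number of accesses, but each one lands on a different page."""
    return [base + i * stride for i in range(count)]


if __name__ == "__main__":
    near = sequential_trace(base=0x10000, count=1000)
    far = scattered_trace(base=0x10000, count=1000)
    print(len(pages_touched(near)), "pages for the clustered trace")
    print(len(pages_touched(far)), "pages for the scattered trace")
```

Both traces issue 1000 accesses, yet the clustered one stays within 2 pages while the scattered one needs 1000. Real programs mostly resemble the first, which is what caches and virtual memory rely on.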

17


Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries

An incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential, linear search

Programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!
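The effect is easy to provoke with a weak hash function. The sketch below compares a deliberately poor hash (first character only) against a simple polynomial hash; both functions, the table size of 64, and the `tmpN` identifiers are hypothetical choices made for illustration.

```python
from collections import Counter


def weak_hash(key: str, table_size: int) -> int:
    """Deliberately poor hash: uses only the first character of the key."""
    return ord(key[0]) % table_size


def better_hash(key: str, table_size: int) -> int:
    """Polynomial hash over all characters (multiplier 31 is a common choice)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % table_size
    return h


def longest_chain(keys, hash_fn, table_size):
    """Size of the fullest bucket, i.e. the worst-case probe length with chaining."""
    buckets = Counter(hash_fn(k, table_size) for k in keys)
    return max(buckets.values())


if __name__ == "__main__":
    identifiers = [f"tmp{i}" for i in range(64)]
    print("weak:  ", longest_chain(identifiers, weak_hash, 64))
    print("better:", longest_chain(identifiers, better_hash, 64))
```

With the weak hash all 64 identifiers land in one bucket, so a lookup degenerates into a linear search over a chain, which is the extreme case described above.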

18


Clustering happens in all diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by having a copy of frequently used data in a faster memory unit, it happens that a small cache suffices to speed up execution

This is due to Data Locality (spatial and temporal). Data that have been accessed recently will again be accessed in the near future, or at least data that live close by will be accessed in the near future

Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon
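A toy simulation shows why a small cache can suffice. The sketch below models a direct-mapped cache over word addresses; the geometry (4 lines of 8 words) and both traces are assumptions for illustration, and real caches differ in associativity and replacement policy.

```python
def hit_rate(addresses, num_lines=4, words_per_line=8):
    """Simulate a direct-mapped cache over word addresses and return the hit fraction."""
    lines = [None] * num_lines          # memory block currently held by each cache line
    hits = 0
    for addr in addresses:
        block = addr // words_per_line  # which memory block the word belongs to
        index = block % num_lines       # cache line that block maps to
        if lines[index] == block:
            hits += 1                   # spatial or temporal locality pays off
        else:
            lines[index] = block        # miss: fetch the whole line
    return hits / len(addresses)


if __name__ == "__main__":
    # A loop that sweeps the same 16 words ten times: both kinds of locality.
    looping = [w for _ in range(10) for w in range(16)]
    # Every access hits a new block that maps to the same line: no locality.
    striding = [i * 32 for i in range(100)]
    print(f"looping trace:  {hit_rate(looping):.4f}")
    print(f"striding trace: {hit_rate(striding):.4f}")
```

The looping trace misses only twice (once per cache line it uses), while the striding trace never hits, even though the cache is identical in both runs.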

19

Message 3: Heat is Bad

Clocking a processor fast (e.g. > 3-5 GHz) can increase performance and thus generally “is good”

Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable

This comes at the cost of higher current, thus more heat generated in the identical physical geometry (the real-estate) of the silicon processor or also the chipset

But the silicon part acts like a resistor with a negative temperature coefficient (NTC), conducting better as it gets warmer. Since the power-supply is a constant-current source, a lower resistance causes lower voltage (V = I × R), shown as VDroop in the figure below

20

Message 3: Heat is Bad

21

Message 3: Heat is Bad

This in turn means that voltage must be increased artificially to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the part

Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep-states of the Silicon part (processor as well as chip-set), and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time

Good that to date Silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Else CPUs would become larger, more expensive, and above all: hotter.

22

Message 4: Resource Replication

Architects cannot increase clock speed beyond physical limitations

One cannot decrease the die size beyond evolving technology

Yet speed improvements are desired, and must be achieved

This conflict can partly be overcome with replicated resources! But careful!

Why careful? Resources could be used for better purpose!

23

Message 4: Resource Replication

Key obstacle to parallel execution is data dependence in the SW under execution. A datum cannot be used before it has been computed

Compiler optimization technology calls this use-def dependence (short for use-before-definition), AKA true dependence, AKA data dependence

Goal is to search for program portions that are independent of one another. This can be at multiple levels of focus

24

Message 4: Resource Replication

At the very low level of registers, at the machine level; done by HW; see also score board

At the low level of individual machine instructions; done by HW; see also superscalar architecture

At the medium level of subexpressions in a program; done by compiler; see CSE

At the higher level of several statements written in sequence in a high-level language program; done by optimizing compiler or by programmer

Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results; done by the user running concurrent programs

25

Message 4: Resource Replication

Whenever program portions are independent of one another, they can be computed at the same time: in parallel; but will they?

Architects provide resources for this parallelism

Compilers need to uncover opportunities for parallelism

If two actions are independent of one another, they can be computed simultaneously

Provided that HW resources exist, that the absence of dependence has been proven, that independent execution paths are scheduled on these replicated HW resources
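One common way to prove the absence of dependence between two statements is to compare the sets of variables each one reads and writes (often called Bernstein's conditions). The sketch below is an illustrative model rather than a compiler algorithm from the slides: the `Stmt` type and helper names are hypothetical, and besides the true dependence named above it also checks anti and output dependences, which the slides do not discuss here.

```python
from typing import FrozenSet, NamedTuple


class Stmt(NamedTuple):
    """A statement reduced to the variables it writes and reads."""
    writes: FrozenSet[str]
    reads: FrozenSet[str]


def stmt(dest: str, *sources: str) -> Stmt:
    """Model `dest = f(sources...)` as read and write sets."""
    return Stmt(frozenset([dest]), frozenset(sources))


def independent(first: Stmt, second: Stmt) -> bool:
    """True if the two statements may execute in either order or in parallel."""
    return (first.writes.isdisjoint(second.reads)        # no true (def-then-use) dependence
            and first.reads.isdisjoint(second.writes)    # no anti dependence (assumption: also checked)
            and first.writes.isdisjoint(second.writes))  # no output dependence (assumption: also checked)


if __name__ == "__main__":
    s1 = stmt("x", "a", "b")   # x = a + b
    s2 = stmt("y", "c", "d")   # y = c * d
    s3 = stmt("z", "x", "e")   # z = x - e, needs the x computed by s1
    print(independent(s1, s2))  # independent: may run on replicated resources
    print(independent(s1, s3))  # dependent: s3 must wait for s1
```

Passing this test is only one of the three conditions listed above: the hardware must still offer a second unit, and something must schedule the two statements onto it.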

26

Code Samples for 3 Different, Hypothetical Architectures

27

The 3 Different Architectures

1. Single Accumulator Architecture
   Has one implicit register for all/any operations: accumulator
   Operations frequently require intermediate temps!
   Code relies heavily on load-store to-from temps

2. Three-Address GPR Architecture
   Allows complex operations with multiple operands all in one instruction
   Hence complex opcode bits

3. Stack Machine Architecture
   Operands are implied on the stack, except load/store
   Hence all operations are simple, few bits, but all are memory accesses

28

Code 1 for 3 Different Architectures

Example 1: Object Code Sequence Without Optimization

Strict left-to-right translation, no smarts in mapping

Consider non-commutative subtraction and division operators

We’ll use no common subexpression elimination (CSE), and no register reuse

Conventional operator precedence

For Single Accumulator SAA, Three-Address GPR, Stack Architectures

Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c

29

Code 1 for 3 Different Architectures

No  Single-Accumulator  Three-Address GPR    Stack Machine
                        dest ← op1 op op2
1   ld a                add r1, a, #3        push a
2   add #3              mult r2, r1, b       pushlit #3
3   mult b              add r3, a, #3        add
4   st temp1            div r4, r3, c        push b
5   ld a                sub d, r2, r4        mult
6   add #3                                   push a
7   div c                                    pushlit #3
8   st temp2                                 add
9   ld temp1                                 push c
10  sub temp2                                div
11  st d                                     sub
12                                           pop d
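To check that the Stack Machine column really computes ( a + 3 ) * b - ( a + 3 ) / c, the sketch below interprets those twelve instructions. The interpreter, the opcode semantics it assumes (binary operators pop the right operand first, then the left), and the sample values for a, b, c are illustrative assumptions; the machine itself is hypothetical.

```python
def run_stack_code(program, memory):
    """Execute a list of (opcode, operand) pairs on a pure stack machine."""
    stack = []
    binary_ops = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mult": lambda x, y: x * y,
        "div": lambda x, y: x / y,
    }
    for opcode, operand in program:
        if opcode == "push":            # push the value of a memory variable
            stack.append(memory[operand])
        elif opcode == "pushlit":       # push a literal such as #3
            stack.append(operand)
        elif opcode == "pop":           # store top of stack into a memory variable
            memory[operand] = stack.pop()
        elif opcode in binary_ops:
            right = stack.pop()         # the operand pushed last is the right one
            left = stack.pop()
            stack.append(binary_ops[opcode](left, right))
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


# Stack Machine column of the unoptimized table, instructions 1 through 12.
CODE_1_STACK = [
    ("push", "a"), ("pushlit", 3), ("add", None), ("push", "b"), ("mult", None),
    ("push", "a"), ("pushlit", 3), ("add", None), ("push", "c"), ("div", None),
    ("sub", None), ("pop", "d"),
]

if __name__ == "__main__":
    result = run_stack_code(CODE_1_STACK, {"a": 5, "b": 2, "c": 4})
    print(result["d"])  # (5 + 3) * 2 - (5 + 3) / 4
```

Every `push` and `pop` here touches `memory`, and on a pure stack machine the stack itself lives in memory as well, which is the cost the next slide points out.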

30

Code 1 for 3 Different Architectures

Three-address code looks shortest, w.r.t. number of instructions

Maybe optical illusion, must also consider number of bits for instructions

Must consider number of I-fetches, operand fetches, total number of stores

Numerous memory accesses on SAA (Single Accumulator Architecture) due to temporary values held in memory

Most memory accesses on SA (Stack Architecture), since everything requires a memory access

Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either order

No need for reverse-operation opcodes for Three-Address architecture

Decide in Three-Address architecture how to encode operand types

31

Code 2 for Different Architectures

This time we eliminate common subexpression (CSE)

Compiler handles left-to-right order for non-commutative operators on SAA

Better: d ← ( a + 3 ) * b - ( a + 3 ) / c

32

Code 2 for Different Architectures

No  Single-Accumulator  Three-Address GPR    Stack Machine
                        dest ← op1 op op2
1   ld a                add r1, a, #3        push a
2   add #3              mult r2, r1, b       pushlit #3
3   st temp1            div r1, r1, c        add
4   div c               sub d, r2, r1        dup
5   st temp2                                 push b
6   ld temp1                                 mult
7   mult b                                   xch
8   sub temp2                                push c
9   st d                                     div
10                                           sub
11                                           pop d
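The Single-Accumulator column can be checked the same way. The sketch below interprets the nine optimized instructions; the interpreter and the sample values are illustrative assumptions, with `#n` literals from the table represented as plain integers.

```python
def operand_value(operand, memory):
    """Integers stand for '#n' literals; strings name memory cells."""
    return operand if isinstance(operand, (int, float)) else memory[operand]


def run_accumulator_code(program, memory):
    """Execute (opcode, operand) pairs on a machine with one implicit accumulator."""
    acc = 0
    for opcode, operand in program:
        if opcode == "st":              # store accumulator into memory
            memory[operand] = acc
            continue
        value = operand_value(operand, memory)
        if opcode == "ld":
            acc = value
        elif opcode == "add":
            acc = acc + value
        elif opcode == "sub":
            acc = acc - value
        elif opcode == "mult":
            acc = acc * value
        elif opcode == "div":
            acc = acc / value
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


# Single-Accumulator column of the CSE-optimized table, instructions 1 through 9.
CODE_2_SAA = [
    ("ld", "a"), ("add", 3), ("st", "temp1"), ("div", "c"), ("st", "temp2"),
    ("ld", "temp1"), ("mult", "b"), ("sub", "temp2"), ("st", "d"),
]

if __name__ == "__main__":
    result = run_accumulator_code(CODE_2_SAA, {"a": 5, "b": 2, "c": 4})
    print(result["d"])  # (5 + 3) * 2 - (5 + 3) / 4
```

Note how the division is computed first so that the final `sub temp2` has the product in the accumulator: this is the compiler handling left-to-right order for the non-commutative subtraction.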

33

Code 2 for Different Architectures

Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses temp1 for common subexpression; has no other register for temps!!

SAA could use negate instruction or reverse subtract

Register-use optimized for Three-Address architecture

Common subexpression optimized on Stack Machine by duplicating (dup) and exchanging (xch)

Instruction count reduced by 20% for Three-Address (5 to 4), 18% for SAA (11 to 9), only 8% for Stack Machine (12 to 11)

34

Code 3 for Different Architectures

Analyze 2 similar expressions but with increasing operator precedence left-to-right; in the 2nd case precedences are overridden by ( )

One operator sequence associates right-to-left, due to arithmetic precedence

Compiler uses commutativity

The other left-to-right, due to explicit parentheses ( )

Use simple-minded code generation model: no cache, no optimization

Will there be advantages/disadvantages caused by the architecture?

Expression 1 is: e ← a + b * c ^ d

35

Expression 1 is: e ← a + b * c ^ d

No  Single-Accumulator  Three-Address GPR    Stack Machine
                        dest ← op1 op op2    Implied Operands
1   ld c                expo r1, c, d        push a
2   expo d              mult r1, b, r1       push b
3   mult b              add e, a, r1         push c
4   add a                                    push d
5   st e                                     expo
6                                            mult
7                                            add
8                                            pop e

Expression 2 is: f ← ( ( g + h ) * i ) ^ j
Here the operators associate left-to-right due to parentheses

36


Expression 2 is: f ← ( ( g + h ) * i ) ^ j

No  Single-Accumulator  Three-Address GPR    Stack Machine
                        dest ← op1 op op2    Implied Operands
1   ld g                add r1, g, h         push g
2   add h               mult r1, i, r1       push h
3   mult i              expo f, r1, j        add
4   expo j                                   push i
5   st f                                     mult
6                                            push j
7                                            expo
8                                            pop f

Observations, Interaction of Precedence and Architecture

• Software eliminates constraints imposed by precedence: looking ahead

• Execution times identical for the 2 different expressions on the same architecture, unless blurred by secondary effect; see cache example below

• Conclusion: all architectures handle arithmetic and logic operations well
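How software "looks ahead" to resolve precedence can be sketched with the classic operator-stack (shunting-yard) method, which yields stack-machine code directly. This is a standard technique chosen for illustration, not necessarily the code generator the author has in mind; it assumes single-letter variable names, and the precedence table and function name are hypothetical.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}
OPCODE = {"+": "add", "-": "sub", "*": "mult", "/": "div", "^": "expo"}


def emit_stack_code(dest, expression):
    """Translate an infix expression over single-letter variables into stack code."""
    code, pending = [], []              # emitted instructions, operator stack
    for tok in expression.replace(" ", ""):
        if tok.isalpha():
            code.append(f"push {tok}")
        elif tok == "(":
            pending.append(tok)
        elif tok == ")":
            while pending[-1] != "(":   # flush operators inside the parentheses
                code.append(OPCODE[pending.pop()])
            pending.pop()               # discard the "("
        else:
            # Emit pending operators that bind tighter (or equally, if left-associative).
            while pending and pending[-1] != "(" and (
                PRECEDENCE[pending[-1]] > PRECEDENCE[tok]
                or (PRECEDENCE[pending[-1]] == PRECEDENCE[tok]
                    and tok not in RIGHT_ASSOCIATIVE)
            ):
                code.append(OPCODE[pending.pop()])
            pending.append(tok)
    while pending:
        code.append(OPCODE[pending.pop()])
    code.append(f"pop {dest}")
    return code


if __name__ == "__main__":
    print(emit_stack_code("e", "a + b * c ^ d"))
    print(emit_stack_code("f", "((g + h) * i) ^ j"))
```

For both expressions the emitted sequences have eight instructions and match the Stack Machine columns of the two tables: precedence only changes when operators are emitted, not how many there are.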

37

Code For Stack Architecture

Stack Machine with no register inherently slow, due to: Memory Accesses!!!

Implement few top of stack elements via HW shadow registers → Cache

Let us then measure equivalent code sequences with and without consideration for cache

Top-of-stack register tos identifies the last valid word on physical stack

Two shadow registers may hold 0, 1, or 2 true top words

Top of stack cache counter tcc specifies number of shadow registers actually used

Thus tos plus tcc jointly specify true top of stack

38

Code For Stack Architecture

[Figure: physical stack in memory, with tos marking the last valid word and free space beyond it; 2 tos shadow registers, with tcc = 0, 1, 2 giving the number of registers in use]

39

Code For Stack Architecture

Timings for push, pushlit, add, pop operations depend on tcc

Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itself

Generally, memory access adds 2 cycles

For stack changes use some defined policy, e.g. keep tcc 50% full

Table below refines timings for stack with shadow registers

Note: push x into cache with free space requires 2 cycles, which are for the memory fetch: cache adjustment is done at the same time as memory fetch

40


operation    cycles  tcc before  tcc after  tos change  comment
add          1       tcc = 2     tcc = 1    no change
add          1+2     tcc = 1     tcc = 1    tos--       underflow?
add          1+2+2   tcc = 0     tcc = 1    tos -= 2    underflow?
push x       2       tcc = 0,1   tcc++      no change   tcc update in parallel
push x       2+2     tcc = 2     tcc = 2    tos++       overflow?
pushlit #3   1       tcc = 0,1   tcc++      no change
pushlit #3   1+2     tcc = 2     tcc = 2    tos++       overflow?
pop y        2       tcc = 1,2   tcc--      no change
pop y        2+2     tcc = 0     tcc = 0    tos--       underflow?

41

Code For Stack Architecture

Code emission for: a + b * c ^ ( d + e * f ^ g )

Let + and * be commutative, by language rule

Architecture here has 2 shadow registers, compiler exploits this

Assume initially empty 2-word cache

42


#    1 Left-to-Right   cycles 1   2 Exploit Cache          cycles 2
1    push a            2          push f                   2
2    push b            2          push g                   2
3    push c            4          expo                     1
4    push d            4          push e                   2
5    push e            4          mult                     1
6    push f            4          push d                   2
7    push g            4          add                      1
8    expo              1          push c                   2
9    mult              3          r_expo = swap + expo     1
10   add               3          push b                   2
11   expo              3          mult                     1
12   mult              3          push a                   2
13   add               3          add                      1
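Both instruction orders in the table can be derived mechanically from the expression tree. The sketch below is one possible way to do so and is not taken from the slides: the nested-tuple tree encoding, the function names, and the rule "if the left operand is a leaf and the right one is not, evaluate the right operand first" are assumptions made for illustration. For the non-commutative ^ the reversed operand order is compensated by emitting r_expo.

```python
# Expression a + b * c ^ ( d + e * f ^ g ) as nested tuples (op, left, right).
# Leaves are variable names. The encoding is an assumption for this sketch.
EXPR = ("+", "a", ("*", "b", ("^", "c", ("+", "d", ("*", "e", ("^", "f", "g"))))))

OPNAME = {"+": "add", "*": "mult", "^": "expo"}
COMMUTATIVE = {"+", "*"}          # by language rule, as stated on the slide

def emit_left_to_right(node):
    """Plain post-order emission: left operand, right operand, operator."""
    if isinstance(node, str):
        return [f"push {node}"]
    op, left, right = node
    return emit_left_to_right(left) + emit_left_to_right(right) + [OPNAME[op]]

def emit_exploit_cache(node):
    """Evaluate a non-leaf right operand first so operands stay in registers."""
    if isinstance(node, str):
        return [f"push {node}"]
    op, left, right = node
    if isinstance(left, str) and not isinstance(right, str):
        code = emit_exploit_cache(right) + [f"push {left}"]
        # Operands are now in reversed order: harmless for commutative
        # operators, otherwise use the reverse form (swap + operation).
        return code + [OPNAME[op] if op in COMMUTATIVE else "r_" + OPNAME[op]]
    return emit_exploit_cache(left) + emit_exploit_cache(right) + [OPNAME[op]]

for number, (first, second) in enumerate(
        zip(emit_left_to_right(EXPR), emit_exploit_cache(EXPR)), start=1):
    print(f"{number:2}  {first:8}  {second}")
```

With the second order, at most two values are ever live on the stack, which is why a 2-word cache suffices for this expression.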

43

Code For Stack Architecture

Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance

Code emission with shadow register consideration costs 20 cycles

True penalty for memory access is worse in practice

Tremendous speed-up is always possible when fixing a system with severe flaws

Return on investment for 2 registers is twice the original performance

Such strong speedup is an indicator that the starting architecture was poor

Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance

Note that indexing, looping, indirection, call/return are not addressed here

44

Data Dependences (sic.)

Register Dependencies

45

Register Dependencies

Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers being defined and used

One instruction computes a result into a register (or memory), another instruction needs that result from that same register (or that memory location)

Or, one instruction uses a datum; and after its use the same item is then recomputed

Dependences require sequential execution, lest the result be unpredictable

46

Register Dependencies

True-Dependence, AKA Data Dependence: <- synonymous!
r3 ← r1 op r2
r5 ← r3 op r4
Read after Write, RAW

Anti-Dependence, not a true dependence; parallelize under right condition
r3 ← r1 op r2
r1 ← r5 op r4
Write after Read, WAR

Output Dependence
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
Write after Write, WAW, use in between
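The three register dependences can be told apart by comparing the destination and source registers of two instructions. The sketch below illustrates this for a pair of instructions only; the tuple encoding `(dest, src1, src2)` and the function name `classify` are assumptions for illustration, and instructions lying between the two are deliberately ignored.

```python
def classify(first, second):
    """Name the dependences of `second` on the earlier instruction `first`.

    Each instruction is a tuple (dest, src1, src2) of register names.
    """
    dest1, *srcs1 = first
    dest2, *srcs2 = second
    kinds = []
    if dest1 in srcs2:
        kinds.append("RAW")   # true dependence: second reads what first wrote
    if dest2 in srcs1:
        kinds.append("WAR")   # anti-dependence: second overwrites a source of first
    if dest1 == dest2:
        kinds.append("WAW")   # output dependence: both write the same register
    return kinds

print(classify(("r3", "r1", "r2"), ("r5", "r3", "r4")))   # ['RAW']
print(classify(("r3", "r1", "r2"), ("r1", "r5", "r4")))   # ['WAR']
print(classify(("r3", "r1", "r2"), ("r3", "r6", "r7")))   # ['WAW']
```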

47

Register Dependencies

Control Dependence:

if ( condition1 ) {
    r3 = r1 op r2;
}else{                  // see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );

48

Register Renaming

Only data dependence is a real dependence, hence called true dependence

Other dependences are artifacts of insufficient resources, generally insufficient registers

This means: if additional registers were available, then replacing some of these conflicting registers with new ones could make the conflict disappear!

Anti- and Output-Dependences are indeed such false dependences

49

Register Renaming

Original Code:

L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7

Dependences before:

Lx Ly which dependence?

50

Register Renaming

     Original Code:       New Code, after adding regs:
L1:  r1 ← r2 op r3        r10 ← r2 op r30    -- r30 instead
L2:  r4 ← r1 op r5        r4 ← r10 op r5     -- r10 instead
L3:  r1 ← r3 op r6        r1 ← r30 op r6
L4:  r3 ← r1 op r7        r3 ← r1 op r7

Dependences before:            Dependences after:
L1, L2 true-Dep with r1        L1, L2 true-Dep with r10
L1, L3 output-Dep with r1      L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
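The dependence lists can be cross-checked mechanically. The following sketch scans a straight-line sequence once, remembering the last writer of each register and the readers since that write; the function name `dependences` and the tuple encoding `(dest, src1, src2)` are assumptions for illustration. Applied to the original code it reports the four dependences listed above and, in addition, the anti-dependences (L2, L3) on r1 and (L3, L4) on r3; applied to the renamed code only the two true dependences remain.

```python
def dependences(code):
    """List (earlier, later, kind, register) for straight-line code.

    `code` is a list of (dest, src1, src2) tuples; labels are L1, L2, ...
    """
    last_write = {}    # register -> label of its most recent writer
    readers = {}       # register -> labels that read it since that write
    found = []
    for index, (dest, *srcs) in enumerate(code):
        me = f"L{index + 1}"
        for src in dict.fromkeys(srcs):          # each source register once
            if src in last_write:
                found.append((last_write[src], me, "true", src))
            readers.setdefault(src, []).append(me)
        for reader in readers.get(dest, []):
            if reader != me:                     # own read is not a conflict
                found.append((reader, me, "anti", dest))
        if dest in last_write:
            found.append((last_write[dest], me, "output", dest))
        last_write[dest] = me
        readers[dest] = []                       # new value, no readers yet
    return found

ORIGINAL = [("r1", "r2", "r3"), ("r4", "r1", "r5"),
            ("r1", "r3", "r6"), ("r3", "r1", "r7")]
RENAMED = [("r10", "r2", "r30"), ("r4", "r10", "r5"),
           ("r1", "r30", "r6"), ("r3", "r1", "r7")]

for name, code in (("before", ORIGINAL), ("after", RENAMED)):
    print(name, dependences(code))
```

Tracking only the most recent writer keeps the scan from reporting a true dependence from L1 to L4 on r1, since L3 redefines r1 in between.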



51

Register Renaming

With these additional --or renamed-- regs, the new code could possibly run in half the time!

First: Compute into r10 instead of r1, but you need to have the additional register r10; no time penalty!

Also: Compute into r30 instead of r3, if r30 available; also no time penalty!

Then the following regs are live afterwards: r1, r3, r4

While r10 and r30 are don't cares afterwards
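The "half the time" estimate can be illustrated by counting the longest chain of dependent instructions. This is a rough sketch under strong assumptions that are not in the slides: every instruction takes one step, any number of independent instructions can run in the same step, and every dependence, true or false, pushes the later instruction into a later step. The edge list for the original code also includes the anti-dependence from L2 to L3 on r1 (L2 reads r1 before L3 overwrites it). The function name `steps_needed` is hypothetical.

```python
def steps_needed(count, deps):
    """Longest dependence chain among `count` instructions (0-based indices).

    `deps` holds (earlier, later) pairs with earlier < later.
    """
    level = [1] * count
    # Visit edges by increasing target so each source level is already final.
    for earlier, later in sorted(deps, key=lambda edge: edge[1]):
        level[later] = max(level[later], level[earlier] + 1)
    return max(level)

# L1..L4 are indices 0..3.
BEFORE = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3)]   # original registers
AFTER = [(0, 1), (2, 3)]                            # with r10 and r30

print(steps_needed(4, BEFORE), steps_needed(4, AFTER))   # 4 2
```

Under these assumptions the original code is one chain of four steps, while the renamed code splits into two independent chains of two steps each.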

52

Score Board

Score-board is an array of HW programmable bits sb[]

Manages other HW resources, specifically registers

In the single-bit HW array sb[], every bit i in sb[i] is associated with a specific register, the one identified by i, i.e. r_i

Association is by index, i.e. by name: sb[i] belongs to reg r_i

Only if sb[i] = 0, does that register i have valid data

If sb[i] = 0 then register r_i is NOT in process of being written

If bit i is set, i.e. if sb[i] = 1, then that register r_i is reserved

Initially all sb[*] are free to use, i.e. set to 0

53

Score Board

Execution constraints:

r_d ← r_s op r_t

if sb[s] or if sb[t] is set → RAW dependence, hence stall the computation; wait until both r_s and r_t are available

if sb[d] is set → WAW dependence, hence stall the write; wait until r_d has been used; SW can sometimes determine to use another register instead of r_d

Else, if none of the 3 registers are in use, dispatch the instruction immediately

54

Score Board

To allow out of order (ooo) execution, upon computing the value of r_d

Update r_d, and clear sb[d]

For uses (references), HW may use any register i, whose sb[i] is 0

For definitions (assignments), HW may set any register j, whose sb[j] is 0

Independent of original order, in which source program was written, i.e. possibly ooo
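The score board rules can be modeled in software as a list of bits with a check at dispatch and a clear at completion. The class below is a minimal sketch of the rules as stated on the slides, not a description of any particular hardware; the class and method names are assumptions for illustration, and instruction latency and the register values themselves are left out.

```python
class ScoreBoard:
    """One bit per register: 1 = reserved (being written), 0 = valid data."""

    def __init__(self, num_regs):
        self.sb = [0] * num_regs          # initially all registers are free

    def check(self, d, s, t):
        """Decide what to do with r_d <- r_s op r_t."""
        if self.sb[s] or self.sb[t]:
            return "stall: RAW"           # a source is still being computed
        if self.sb[d]:
            return "stall: WAW"           # destination has a pending write
        return "dispatch"

    def dispatch(self, d, s, t):
        """Try to start the instruction; reserve r_d on success."""
        status = self.check(d, s, t)
        if status == "dispatch":
            self.sb[d] = 1
        return status

    def complete(self, d):
        """Value of r_d has been computed: update r_d and clear sb[d]."""
        self.sb[d] = 0

board = ScoreBoard(8)
print(board.dispatch(3, 1, 2))   # dispatch: r3 <- r1 op r2
print(board.dispatch(5, 3, 4))   # stall: RAW, r3 is not ready yet
print(board.dispatch(0, 6, 7))   # dispatch: independent, may run out of order
board.complete(3)
print(board.dispatch(5, 3, 4))   # dispatch: r3 now holds valid data
```

In the usage lines the third instruction starts while the second one is still stalled, which is the out of order behavior the score board is meant to allow.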

55

References

1. The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html

2. Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations

3. http://en.wikipedia.org/wiki/Moore's_law

4. C. A. R. Hoare's comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf

5. Gibbons, P. B, and Steven Muchnick [1986]. "Efficient Instruction Scheduling for a Pipelined Architecture", ACM Sigplan Notices, Proceeding of '86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp 11-16

6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/

7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm

8. Words of wisdom: http://www.cs.yale.edu/quotes.html

9. John von Neumann's computer design: A.H. Taub (ed.), "Collected Works of John von Neumann", vol 5, pp. 34-79, The MacMillan Co., New York 1963
