CS184b: Computer Architecture (Abstractions and Optimizations)
Caltech CS184 Spring2003 -- DeHon1
CS184b:Computer Architecture
(Abstractions and Optimizations)
Day 5: April 14, 2003
ILP 2
Today
• ILP Limits
• Practical Issues
  – Finite size issues
• Cost Scaling
• Ultrascalar
Limit Studies
• Goal: understand how far you can go
  – in this case, how much ILP we can find
• Remove current/artificial limits
  – do full renaming, arbitrary lookahead
  – perfect control prediction, memory disambiguation
• Careful with assumptions
  – can still be pessimistic
  – is there another way to do it?
  – another way around the limitation?
Available ILP
[Hennessy and Patterson 4.38e2/3.35e3]
What do we achieve today?
• Pentium … < 1 instruction/cycle retired
  – but low cycle time
  – $\text{Time} = \text{CPI} \times \text{Instructions} \times \text{CycleTime}$
• Have not seen attempts to issue more than 4 instructions/cycle
  – much less sustain retirement of more than 4
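A small worked example of the execution-time relation above may help. The two machines and every number below are hypothetical, chosen only to illustrate that a design retiring fewer than one instruction per cycle can still come out ahead if its cycle time is short enough.

```python
def exec_time_ns(instructions, cpi, cycle_time_ns):
    """Time = CPI x Instructions x CycleTime (result in nanoseconds)."""
    return cpi * instructions * cycle_time_ns


# Hypothetical machines (illustrative numbers, not measurements).
N = 1_000_000
deep_fast = exec_time_ns(N, cpi=1.25, cycle_time_ns=0.5)  # 0.8 instr/cycle, short cycle
wide_slow = exec_time_ns(N, cpi=0.5, cycle_time_ns=2.0)   # 2 instr/cycle, long cycle

print(deep_fast, wide_slow)
print("short-cycle machine wins:", deep_fast < wide_slow)
```

The point of the sketch is only that instructions per cycle is one of three factors; it cannot be compared across machines without the cycle time.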
Limit Effects
Superscalar
[Figure: superscalar block diagram. Blocks: IF, Decode, Queue, EX (ALU, MPY, LD/ST), RF, RUU. Labeled parameters: Fetch Width, Window Size, Physical Registers, Issue Width, Number/Types of Functional Units.]
Window Size (unlimited issue)
[Hennessy and Patterson 4.39e2/3.36e3]
There’s quite a bit of non-local parallelism.
Window Size (Issue limited)
[64-issue; Hennessy and Patterson 4.47e2/3.45e3]
Operation Organization
• Consider tree-structured calculation
  – freedom in ordering
  – consider:
    • post-order traversal
    • by levels from leaves
  – where is parallelism?
  – storage cost?
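The two orderings can be compared with a short sketch. It assumes a complete binary tree with `2**depth` leaves in which every internal node is one two-input operation; the function names are made up for this illustration. Post-order keeps few values live but exposes one operation at a time, while level order exposes a whole level of independent operations at the cost of holding every leaf at once.

```python
def postorder_peak_live(depth):
    """Peak number of live values when a complete binary tree with
    2**depth leaves is evaluated in post-order, one operation at a time."""
    live = 0
    peak = 0

    def visit(d):
        nonlocal live, peak
        if d == 0:
            live += 1        # fetch a leaf operand
        else:
            visit(d - 1)     # left subtree leaves one live result
            visit(d - 1)     # right subtree leaves a second live result
            live -= 1        # combine: two operands in, one result out
        peak = max(peak, live)

    visit(depth)
    return peak


def level_order_profile(depth):
    """Peak live values and independent operations per level when the same
    tree is evaluated level by level, starting from the leaves."""
    leaves = 2 ** depth
    ops_per_level = [leaves >> (level + 1) for level in range(depth)]
    return leaves, ops_per_level


print(postorder_peak_live(4))   # storage grows with tree depth
print(level_order_profile(4))   # storage grows with leaf count
```

Under these assumptions, the instruction order chosen for the same computation determines how large a window is needed before the parallelism becomes visible.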
Window Size
• How many instructions forward do we look?
  – Only look at next = in-order issue
[Johnson, Fig. 3.9 (32 issue window?)]
Branch Prediction
[Hennessy & Patterson Fig 3.38/e3]
Window Cost?
• Issue condition: no one before you in the window writes a value you need
• $R_{src,i} \neq R_{dst,i-1}$; $R_{src,i} \neq R_{dst,i-2}$; …
• $O(WS^2)$ comparisons
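The check above can be sketched directly. This is a simplified model, not any particular machine's issue logic: it assumes each instruction has one destination and two sources, and that no earlier instruction in the window has completed yet. The function name and the register names in the example are hypothetical.

```python
def ready_mask(window):
    """window: list of (dst, src1, src2) register names, oldest first.

    Returns (ready, comparisons). ready[i] is True when no earlier
    instruction in the window writes a register that instruction i reads.
    Assumes none of the earlier instructions have completed.
    """
    comparisons = 0
    ready = []
    for i, (_, *srcs) in enumerate(window):
        ok = True
        for j in range(i):                  # every earlier instruction
            for s in srcs:                  # against each source register
                comparisons += 1
                if s == window[j][0]:
                    ok = False
        ready.append(ok)
    return ready, comparisons


window = [
    ("r1", "r2", "r3"),
    ("r4", "r1", "r5"),   # reads r1, written by the first instruction
    ("r6", "r7", "r8"),
    ("r9", "r6", "r4"),   # reads r6 and r4, both written earlier
]
ready, comparisons = ready_mask(window)
print(ready, comparisons)
```

With two sources per instruction the count is `WS * (WS - 1)`, which is the quadratic growth the slide refers to.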
Cost?
• Anecdotal [Farrell, Fischer JSSC v33n5]
  – DEC 20-instruction queue
  – 4 instruction issue
  – (80 physical registers)
  – 10 mm² in 0.35 µm (300 Mλ²+)
  – 600 MHz (≈1.67 ns cycle)
• Compare:
  – 300 4-LUTs (w/ interconnect)
  – MIPS-X 32b CPU w/ 1KB memory = 68 Mλ²
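The area and cycle-time figures on this slide can be checked with a few lines of arithmetic. The sketch assumes that λ is half the drawn feature size (a common convention, but an assumption here, since the slide does not define it); the helper name is made up for this example.

```python
def area_in_lambda2(area_mm2, feature_um):
    """Convert a silicon area to units of lambda^2.

    Assumption: lambda is half the drawn feature size.
    """
    lam_um = feature_um / 2.0
    area_um2 = area_mm2 * 1e6          # 1 mm^2 = 1e6 um^2
    return area_um2 / (lam_um ** 2)


queue_lambda2 = area_in_lambda2(10.0, 0.35)   # issue queue: 10 mm^2 at 0.35 um
mipsx_lambda2 = 68e6                          # MIPS-X figure quoted on the slide
cycle_ns = 1e3 / 600.0                        # 600 MHz clock period in ns

print(f"queue area  ~ {queue_lambda2 / 1e6:.0f} M lambda^2")
print(f"vs. MIPS-X  ~ {queue_lambda2 / mipsx_lambda2:.1f}x")
print(f"cycle time  ~ {cycle_ns:.2f} ns")
```

Under that assumption the issue queue alone comes out several times the area of the whole MIPS-X processor, which is the comparison the slide is drawing.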
Costs?
• Both DEC and “Quantifying” (also DEC)
  – appear to use a scoreboarded scheme to avoid
  – accept not issuing until result computed?
• “Quantifying” suggests:
  – wakeup time $\propto IW^2 \times WS^2$
    • but assuming quadratic wire delay in length
    • (never buffer wire)
  – but $WS = F(IW)$
  – certainly faster than linear time
  – $A \propto IW \times WS$
Registers
• How many virtual registers needed?
[Hennessy and Patterson 4.43e2/3.41e3]
Register Costs?
• First Order
  – area linear in number of registers
  – delay linear in number of registers
• Bank RF
  – maybe sublinear delay
  – at least square root number of registers
    • wire delay sqrt of area
RF and IW interaction
• Larger Issue (Decode)
  – want to read/retire more registers per cycle
  – RF ports = $3 \times IW$ [Op Rdst ← Rsrc1, Rsrc2]
  – $A \propto \text{ports} \times \text{number of registers}$
  – …and number of registers = $F(IW)$
  – $A \propto IW \times F(IW)$
• RF grows faster than linear
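A tiny numerical sketch of the scaling argument above. It uses only the slide's first-order model (area proportional to ports times registers); the particular choice of `F(IW)` as 16 registers per issue slot is a hypothetical placeholder, since the slide leaves `F` unspecified.

```python
def rf_area(iw, regs_for_iw, ports_per_instr=3):
    """First-order register-file area in arbitrary units:
    ports x registers, with ports = 3 x IW (two reads and one write)."""
    ports = ports_per_instr * iw
    return ports * regs_for_iw(iw)


def linear_regs(iw):
    """Hypothetical F(IW): 16 physical registers per issue slot."""
    return 16 * iw


for iw in (1, 2, 4, 8):
    print(iw, rf_area(iw, linear_regs))
```

Even with `F(IW)` only linear, doubling the issue width quadruples the modeled area, so the register file grows faster than linearly in IW.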
Bypass: Control
• Control comparison
  – every functional input ($2 \times IW$)
  – get input from
    • every pipestage ($d$) from issue/produce to wb
    • for every result producer ($> IW$)
• Total comparisons: $d \times IW^2$
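The comparison count can be tallied explicitly. This sketch assumes exactly `IW` result producers per pipestage and two inputs per functional unit, so it keeps the constant factor of 2 that the slide's total drops; the function name is made up for the example.

```python
def bypass_comparisons(iw, d, inputs_per_fu=2):
    """Register-tag comparisons needed for full bypass control.

    Each of the (inputs_per_fu x IW) functional-unit inputs is compared
    against every result producer (assumed IW per stage) in each of the
    d pipestages between result production and writeback.
    """
    consumer_inputs = inputs_per_fu * iw
    producers = d * iw
    return consumer_inputs * producers


for iw in (2, 4, 8):
    print(iw, bypass_comparisons(iw, d=3))
```

Doubling the issue width quadruples the comparator count at a fixed pipeline depth.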
Bypass: Interconnect
• Linear layout
  – bypass spans functional units and RF
  – physical RF grows with IW
    • read/write ports
    • more physical registers to support IW
  – FU bypass muxes grow with IW
• Consequently
  – width grows with IW
  – cycle grows with IW?
Bypass: Interconnect
• “Quantifying”
  – quadratic wire delay
  – (but asymptotically, we can buffer)
  – largest delay component calculated
    • (>1 ns for IW=8) [180 nm]
    • IW=8 about 5-6 times IW=4
Aliasing
• Do memory operations depend on one another?
• E.g. A[j+3]=x*x+y; Z=A[i-2]+A[i+2];
• Is A[i-2] or A[i+2] another name for A[j+3]?
• E.g. *a++; *b+=3; *a++; d=*c+3;
• Are these operations all independent?
• Or do some name the same memory location?
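For the array example above, the aliasing question reduces to comparing index values once `i` and `j` are known. The sketch below is a minimal illustration of that run-time check, not a description of any particular disambiguation hardware; the function names are hypothetical.

```python
def conflicting_loads(store_index, load_indices):
    """Return the load indices that name the same element as the store."""
    return [k for k in load_indices if k == store_index]


def loads_may_pass_store(i, j):
    """Slide example: store to A[j+3], then loads of A[i-2] and A[i+2].

    The loads can be reordered ahead of the store only if neither index
    equals the store's index.
    """
    return not conflicting_loads(j + 3, [i - 2, i + 2])


print(loads_may_pass_store(i=10, j=0))   # store A[3]; loads A[8], A[12]
print(loads_may_pass_store(i=10, j=5))   # store A[8]; loads A[8], A[12]
```

Without knowing `i` and `j` at issue time, the loads have to be treated as possibly dependent on the store, which is what limits the parallelism.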
Aliasing
[Hennessy & Patterson Fig 3.43/e3]
…And now for something Completely Different
Different Solution
• These assume Number of Regs > IW
• If IW > R, different approach…
• From Henry, Kuszmaul, et al.
  – ARVLSI’99
  – SPAA’99
  – ISCA’00
Consider Machine
• Each FU has a full RF
• Build network between FUs
  – use network to connect producer/consumer
  – use register names to configure interconnect
• Signal data ready along network
Ultrascalar: concept model
Ultrascalar concept
• Linear delay
• O(1) register cost / FU
• Complete renaming at each FU
  – different set of registers
  – so when we say complete RF at each FU, that’s only the logical registers
Ultrascalar: cyclic prefix
Parallel Prefix
• Basic idea is one we saw with adders
• An FU will either
  – produce a register (generate)
  – or transmit a register (propagate)
  – can do tree combining
    • pair of FUs will either both propagate or will generate
    • compute function by pair in one stage
    • recurse to next stage
    • get log-depth tree network connecting producer and consumer
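The generate/propagate idea can be sketched for a single logical register. Each station either writes the register (generate) or passes along whatever it received (propagate); combining two adjacent stations is associative, so the "nearest earlier writer" seen by every station can be computed in a logarithmic number of combining stages. This is a simplified, non-cyclic, Kogge-Stone-style scan written for illustration; it is not the Ultrascalar circuit itself, and all names are hypothetical.

```python
def combine(older, newer):
    """Result of two adjacent spans: the newer span's writer if it
    generates one, otherwise whatever the older span propagates."""
    return newer if newer is not None else older


def last_writer_prefix(writes):
    """writes[i] is i if station i writes the register, else None.

    Returns (incoming, stages): incoming[i] is the index of the nearest
    earlier station that writes the register (None if there is none),
    computed in ceil(log2(n)) combining stages.
    """
    n = len(writes)
    span = list(writes)        # span[i]: last writer within a window ending at i
    stages = 0
    shift = 1
    while shift < n:
        span = [combine(span[i - shift] if i >= shift else None, span[i])
                for i in range(n)]
        shift *= 2
        stages += 1
    return [None] + span[:-1], stages


def last_writer_linear(writes):
    """Reference version: walk the stations in order (linear depth)."""
    incoming, latest = [], None
    for w in writes:
        incoming.append(latest)
        latest = combine(latest, w)
    return incoming


writes = [0, None, None, 3, None, None, None, None]   # stations 0 and 3 write
incoming, stages = last_writer_prefix(writes)
print(incoming, stages)
print(incoming == last_writer_linear(writes))
```

Both versions connect each consumer to the same producer; the prefix form needs only a logarithmic number of stages of logic, which is the same trick used in carry-lookahead adders.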
Ultrascalar: cyclic prefix
Cyclic Prefix
• Gets delay down to log(WS)
  – w/ linear layout, delay still linear
• Issue into, retire from Window in order
  – serves
    • rename
    • shared RF
    • issue
    • bypass
    • reorder
Ultrascalar: layout
Register paths not growing. (p=0 tree!) Wide, but constant width.
If memory width < $\sqrt{n}$: area goes as $n$, wire goes as $\sqrt{n}$
Ultrascalar: asymptotics
• Assume $M(n) < O(\sqrt{n})$
  – Area ~ $nR^2$
  – Delay ~ $\sqrt{n}\,R$
• Claim can do
  – Area ~ $nR$
  – Delay ~ $\sqrt{nR}$
• If memory grows faster, it will dominate interconnect growth, hence area and delay
  – get extra term for memory growth (like Rent’s Rule)
UltraScalar:
• 0.25 µm
• 128-window, 32 logical regs
• 64b ops ?
• 8 instruction fetch
• delays <2 ns [0.25 µm]
  – commit, wakeup, schedule
  – wire delay dominates logic
• area ~2 Gλ² (not including datapath)
Solution for:
• Object/binary compatibility is paramount
• Performance is King
• Recompilation not an option
• Cost (area, energy) is no object
Friday
• …an alternative way to exploit ILP
• rely on compiler and feedback
• [reminder: no lecture Wednesday]
(Semi?) Big Ideas
• Good to look at
  – Extremes (what can this possibly do?)
  – Sensitivity (how important is this to…)
• Balance
• Size Matters
• Interconnect delay dominates
• As parameters grow
  – watch tradeoffs
  – widely different solutions prevail in different points in space (different asymptotes)