VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · Pentium® proc...
-
Vlsi_03_2005.ppt/April_2005 1
VLSI Trends in Microarchitecture: Past, present and future
TAU university, January 24, 2006
Uri Weiser
-
Vlsi_03_2005.ppt/April_2005 2
Agenda
Microarchitecture
– VLSI
– Trends
Past and present:
» Pipeline, superpipeline
» Out of Order
» Branch prediction
» Caches
» Trace cache
» Threads and Chip Multiprocessing
– Future
» Asymmetric
» Accelerators
-
Vlsi_03_2005.ppt/April_2005 3
“[In the beginning] we had little idea of what we had started. ...I remember... saying, ‘Okay, we’ve done integrated circuit. What do we do next?”
Gordon E. Moore
-
Vlsi_03_2005.ppt/April_2005 4
TRENDS IN VLSI
Sources: Shekhar Borkar, Uri Weiser
-
Vlsi_03_2005.ppt/April_2005 5
Process Technology
[Figure: Intel processors placed along process generations 1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ: Intel386™ DX Processor, Intel486™ DX Processor, Pentium® Processor, Pentium® Pro Processor, Pentium® II Processor, Pentium® III Processor, Pentium® 4 Processor]
Technology trend
Trends
-
Vlsi_03_2005.ppt/April_2005 6
Performance History
[Figure: Freq (MHz, log scale 10 to 10,000) vs. process generation (1.0μ, 0.7μ, 0.5μ, 0.35μ, 0.25μ, 0.18μ), split into Freq (uArch) and Freq (Process); points: i486, Pentium® proc, Pentium® II & III proc]
1.0μ-0.18μ, 1989-2001
Frequency increased 61X
1. 18.3X due to process technology
2. Additional 3.3X due to uArch
[Figure: Relative Performance (log scale 1 to 100) vs. process generation (1.0μ to 0.18μ), split into Perf due to uArchitecture and Perf due to Freq; points: i486, Pentium® proc, Pentium® II & III proc, Pentium® 4 proc]
Performance increased ~100X
1. 14X due to process tech
2. Additional 7X due to uArch & design
Trends
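The two contributions quoted on this slide combine multiplicatively, not additively. A minimal Python sketch of that arithmetic follows; `combined_gain` is a hypothetical helper name, not something from the deck. The product of the rounded frequency factors comes out near 60.4, which is consistent with the quoted 61X.

```python
# Multiplicative decomposition of the gains quoted on the slide.
# combined_gain is a hypothetical helper name, not from the original deck.

def combined_gain(*factors: float) -> float:
    """Return the product of independent improvement factors."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total

# Frequency: 18.3X from process technology, 3.3X from microarchitecture.
freq_gain = combined_gain(18.3, 3.3)

# Performance: 14X from process technology, 7X from uArch and design.
perf_gain = combined_gain(14, 7)

print(f"frequency gain   ~{freq_gain:.1f}X (slide quotes 61X)")
print(f"performance gain ~{perf_gain:.0f}X (slide quotes ~100X)")
```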
-
Vlsi_03_2005.ppt/April_2005 7
Process Technology: Minimum Feature Size
[Figure: Feature Size (microns, log scale 0.01 to 10) vs. year (’68 to ’08); series: Intel, SIA]
Source: Intel, SIA Technology Roadmap
Trends
-
Vlsi_03_2005.ppt/April_2005 8
Transistors on a Chip
[Figure: Transistors (MT, log scale 0.001 to 1000) vs. Year (1970 to 2010); points: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, P6, Pentium® 4 proc]
2X growth in 1.96 years!
Transistors on a chip doubled every two years
Trends
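A fixed doubling period means exponential growth. The sketch below shows how a doubling period translates into a growth factor over a longer span; the helper name `growth_factor` and the 10-year horizon are illustrative assumptions, not figures from the slide.

```python
# Growth implied by a fixed doubling period (exponential growth).
# growth_factor is a hypothetical helper, not from the slides.

def growth_factor(years: float, doubling_period_years: float) -> float:
    """Factor by which a quantity grows over `years` if it doubles every period."""
    return 2.0 ** (years / doubling_period_years)

# With an exact 2-year doubling period, one decade gives 2**5 = 32X.
print(growth_factor(10, 2.0))

# With the 1.96-year period quoted on the slide, a decade gives slightly more.
print(round(growth_factor(10, 1.96), 1))
```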
-
Vlsi_03_2005.ppt/April_2005 9
Die Size Growth
[Figure: Die side (mm) = Area^(1/2) (log scale 1 to 100) vs. Year (1970 to 2010); points: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, P6]
Die size grows? Is it saturated?
Trends
-
Vlsi_03_2005.ppt/April_2005 10
Frequency
[Figure: Frequency (MHz, log scale 0.1 to 10000) vs. Year (1970 to 2010); points: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, P6; trend line: doubles every 2 years]
Lead microprocessors' frequency doubles every 2 years
Trends
-
Vlsi_03_2005.ppt/April_2005 11
Frequency of Operation
[Figure: Frequency (MHz, log scale 1 to 1000) vs. Years (1970 to 2000); series: Intel, PPC, Alpha, Other; points: 8080, 8085, 8086, 8088, 80286, 386DX-16, 386DX-33, 486DX-25, 486DX-33, 486DX-50, 486DX2-66, PP-66, PP-90, PPro-225, PPro-300]
Trends
-
Vlsi_03_2005.ppt/April_2005 12
Frequency of Operation (cont.)
[Figure: Frequency (MHz, 0 to 400) vs. Years (1992 to 1998); series: Intel, PPC, Other; points: 486DX2, PP-66, PP-90, PP-100, PP-120, PP-133/PPC604-133, PPro-150/PPC604-150, PPro-180, PPro-225, PPro-300, PPC601-100, PPC620C-200, A21064A-275, A21164-300, A21164-333, A21164A-417, M-R4400-200]
Trends
-
Vlsi_03_2005.ppt/April_2005 13
Brainiacs and Speed demons
[Figure: SPECInt92/MHz (0 to 2) vs. MHz (0 to 400), with iso-performance curves SPECInt92 = 50, 100, 200, 300, 400 and regions labeled "Brainiacs" and "Speed demons"; series: ALPHA (21064, 21164), X86 (PENTIUM, PENTIUM PRO), PowerPC; labeled values: 1.6, 0.57, 0.6]
Source: ISCA 95, p. 174
Trends
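The chart's iso-performance curves follow from treating the overall score as clock rate times work per clock. The sketch below illustrates that relation; the function names and the sample operating points are assumptions chosen for illustration, not data points read from the chart.

```python
# Performance as the product of clock rate and per-clock efficiency.
# Function names and sample values are hypothetical; the deck only shows the chart.

def specint(mhz: float, specint_per_mhz: float) -> float:
    """Overall score = frequency x work done per clock."""
    return mhz * specint_per_mhz

def required_efficiency(target_specint: float, mhz: float) -> float:
    """Per-MHz efficiency needed to sit on a given iso-performance curve."""
    return target_specint / mhz

# Two ways to land on the SPECInt92 = 200 curve:
print(required_efficiency(200, 100))  # "brainiac": low clock, much work per clock
print(required_efficiency(200, 400))  # "speed demon": high clock, little work per clock
```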
-
Vlsi_03_2005.ppt/April_2005 14
Trends of Future Processors
[Figure: SPECint/100MHz (0.0 to 8.0) vs. MHz (0 to 2500), with iso-performance curves SPECint = 4, 8, 12, 18, 27, 40, 60, 90; points: PII, PIII, P IV, P M]
Trends
-
Vlsi_03_2005.ppt/April_2005 15
[Figure: Watts/cm² (log scale 1 to 1000) vs. process generation (1.5μ, 1μ, 0.7μ, 0.5μ, 0.35μ, 0.25μ, 0.18μ, 0.13μ, 0.1μ, 0.07μ); points: i386, i486, Pentium® processor, Pentium Pro® processor, Pentium II® processor, Pentium III® processor; reference levels: Hot plate, Nuclear Reactor, Rocket Nozzle, Sun’s Surface]
Power density continues to get worse
Trends
-
Vlsi_03_2005.ppt/April_2005 16
On Die Cache Memory
[Figure: die diagrams showing Core Logic vs. Cache at 0.18μ, 0.13μ and 0.10μ; chart of cache % of full chip area (0% to 100%) vs. process generation (0.7μ to 0.10μ) for Pentium, Pentium Pro, Pentium II, Pentium III, Pentium III & 4]
Larger % of die area will be memory
Trends
-
Vlsi_03_2005.ppt/April_2005 17
Process trend – the theory (cont)
Performance driven era vs. Power aware era
[Figure: Power ratio (1.0 to 4.0) of process generation p vs. p-1 @ same die area; “old” processes n, n+1, n+2, n+3 (1µ to 0.35µ) and “new” processes m, m+1, m+2, m+3 (0.25µ to 0.09µ)]
– 1.4X frequency
– 0.9X voltage
– 0.7X capacitance/transistor
– 1X area
– 2X transistors
Power increase/generation: 1.6X
Power aware era (performance within power envelope)
– 1.4X frequency
– 0.75X voltage
– 0.7X capacitance/transistor
– 1X area
– 2X transistors
Power increase/generation: 1.1X
Performance driven era (no power constraints)
Reference: Eric Spargle
Trends
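The two per-generation power figures on this slide (1.6X and 1.1X) can be reproduced from the listed scaling factors. The sketch below assumes the textbook dynamic-power relation, power proportional to transistor count x capacitance per transistor x voltage² x frequency; that formula and the name `power_ratio` are not stated on the slide.

```python
# Dynamic power scaling per process generation at constant die area.
# Assumption (not on the slide): P ~ N * C * V^2 * f.
# power_ratio is a hypothetical helper name.

def power_ratio(freq: float, voltage: float,
                cap_per_transistor: float, transistors: float) -> float:
    """Generation-over-generation power ratio from the scaling factors."""
    return transistors * cap_per_transistor * voltage ** 2 * freq

# Case with 0.9X voltage scaling per generation.
with_0_90_voltage = power_ratio(1.4, 0.9, 0.7, 2.0)

# Case with 0.75X voltage scaling per generation.
with_0_75_voltage = power_ratio(1.4, 0.75, 0.7, 2.0)

print(f"0.9X voltage : {with_0_90_voltage:.2f}X power per generation")
print(f"0.75X voltage: {with_0_75_voltage:.2f}X power per generation")
```

Because voltage enters squared, the difference between 0.9X and 0.75X voltage scaling accounts for the whole gap between the two results.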
-
Vlsi_03_2005.ppt/April_2005 18
Processor roadmap trend – real life (cont)
Extension of Pollack’s Rule (Micro32, 1999)
[Figure: Growth (X, 0 to 4) in Power and Performance for processor generation k vs. k-1, compacted @ the same process technology, across Technology Generation 1.5, 1, 0.7, 0.5, 0.35, 0.18]
Perf/power delta ratio 1 : >3
! Perf/power delta ratio 3 : 1
Trends
-
Vlsi_03_2005.ppt/April_2005 19
Microarchitecture
-
Vlsi_03_2005.ppt/April_2005 20
The Generic Processor
[Figure: block diagram: Instruction supply, Execution engine, Data supply]
Sophisticated organization to “service” instructions
Instruction supply
– Instruction cache
– Branch prediction
– Instruction decoder
– ...
Execution engine
– Instruction scheduler
– Register files
– Execution units
– ...
Data supply
– Data cache
– TLB’s
– …
Goal - Maximum throughput – balanced design
Microarchitecture
-
Vlsi_03_2005.ppt/April_2005 21
“The Core” - A Block Diagram
[Figure: block diagram: Scheduler, Reg. File, execution units (FPU, ALU1, ALU2, AGU + MEM), D-TLB, Data Cache; paths: Instruction flow, Read Sources, Write Destination, Write Bypass, Write Back]
Microarchitecture
-
Vlsi_03_2005.ppt/April_2005 22
Parallelism Evolution
[Figure: arrangements of Processor Elements (PE) and Instructions (a, b, c, ... n) for each configuration: Basic configuration, Pipeline, VLIW, Superscalar - In order, Superscalar - Out of Order]
Parallelism
-
Vlsi_03_2005.ppt/April_2005 23
Pipeline
Break the work to smaller pieces
Increased throughput
– Increases the # of completed instructions per cycle and reduces cycle time
– Number of stages varies
– Small: 4-5 (Pentium), “Superpipeline” ~14 (Pentium Pro), “ultra-pipeline” ~25 (PIV)
Calls for good balancing among stages
[Figure: timing diagram over cycles 0 to 12 of instructions I1, I2, I3 passing through stages F (Fetch), D (Decode), E (Execute), W (Write Back); one instruction at a time: 1/4 IPC = 4 CPI; pipelined: 1 IPC = 1 CPI]
IPC = Instructions Per Cycle
CPI = Cycles Per Instruction
Examples: Intel 486, NS 32532
Parallelism
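The IPC figures on this slide can be checked with a small cycle count. The sketch below assumes an ideal 4-stage pipeline with one cycle per stage and no stalls; the function names are illustrative, not from the deck.

```python
# Cycle counts for an ideal pipeline: one cycle per stage, no stalls.
# These helpers are a hypothetical sketch, not from the slides.

def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipe, then one completes every cycle."""
    return n_stages + (n_instructions - 1)

def ipc(n_instructions: int, cycles: int) -> float:
    return n_instructions / cycles

stages = 4  # F, D, E, W
print(ipc(3, cycles_unpipelined(3, stages)))      # 0.25 IPC, i.e. 4 CPI
print(ipc(1000, cycles_pipelined(1000, stages)))  # approaches 1 IPC for long runs
```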
-
Vlsi_03_2005.ppt/April_2005 24
Pipeline Stalls
But there are “stalls” in the pipeline
– “Data Hazards”: Data flow dependency (instructions output/input)
» Solved by: bypasses, renaming
– “Control Hazards”: Control flow dependencies
» Solved by branch prediction
– “Structural Hazards”: Limited resources
– Other (Cache misses, long latency instructions, page faults….)
[Figure: timing diagram over cycles 0 to 12 (F: Fetch, D: Decode, E: Execute, W: Write Back) showing a Data Flow stall (w/o bypass), a Control Flow stall, and an Address Generation Interlock stall]
Parallelism
-
Vlsi_03_2005.ppt/April_2005 25
Super Scalar
Performs more in a single cycle
Ideally, can multiply the throughput
– But stalls occur more frequently
[Figure: timing diagram over cycles 0 to 12 with two instructions per cycle passing through F, D, E, W, including one stall; 2 IPC = 1/2 CPI]
Examples: Intel Pentium® Proc., Alpha 21164
Parallelism
-
Vlsi_03_2005.ppt/April_2005 26
Super Pipeline
Split to shorter stages - allows higher frequency
Ideally, can (again) multiply the throughput, but
– Stall penalties do not scale (e.g., control flow stall, cache misses)
– Clock setup/hold reduces net cycle time - each instruction takes longer!
In the example above: 2X stages, but performance gain is
-
Vlsi_03_2005.ppt/April_2005 27
Out Of Order Execution
In Order Execution: instructions are processed in their program order.
– Limitation to potential Parallelism.
OOO: Instructions are executed based on “data flow” rather than program order
Before: src -> dest
(1) load (r10), r21
(2) mov r21, r31 (2 depends on 1)
(3) load a, r11
(4) mov r11, r22 (4 depends on 3)
(5) mov r22, r23 (5 depends on 4)
After:
(1) load (r10), r21; (3) load a, r11;
(2) mov r21,r31; (4) mov r11,r22;
(5) mov r22,r23;
Usually highly superscalar
[Figure: timing diagrams of In Order Processing vs. Out of Order Processing for instructions 1 to 5 through stages F, D, E, W]
In Order vs. OOO execution. Assuming:
- Unlimited resources
- 2 cycles load latency
Examples: Intel Pentium® II/III/4, Compaq Alpha 21264
OOO
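The benefit of data-flow ordering for the five instructions on this slide can be sketched with a tiny timing model. Everything beyond the instruction list and the 2-cycle load latency is an assumption made for illustration: only the execute stage is modelled, a mov takes 1 cycle, and the in-order machine may not start an instruction before the one preceding it in program order has started.

```python
# Execute-stage timing for the five instructions on the slide.
# Assumptions (not from the deck): only execution is modelled, movs take
# 1 cycle, issue width is unlimited, and "in order" means an instruction
# cannot start earlier than its predecessor in program order.

LATENCY = {"load": 2, "mov": 1}

# (id, opcode, ids of producer instructions), in program order
PROGRAM = [
    (1, "load", []),
    (2, "mov", [1]),
    (3, "load", []),
    (4, "mov", [3]),
    (5, "mov", [4]),
]

def out_of_order_finish(program):
    """Each instruction starts as soon as its producers have finished."""
    done = {}
    for ident, op, deps in program:
        start = max((done[d] for d in deps), default=0)
        done[ident] = start + LATENCY[op]
    return done

def in_order_finish(program):
    """Like the above, but start times never go backwards in program order."""
    done = {}
    previous_start = 0
    for ident, op, deps in program:
        start = max([previous_start] + [done[d] for d in deps])
        done[ident] = start + LATENCY[op]
        previous_start = start
    return done

print("in order    :", max(in_order_finish(PROGRAM).values()), "cycles")
print("out of order:", max(out_of_order_finish(PROGRAM).values()), "cycles")
```

In this model the second load is held behind the first mov when issuing in order, while the out-of-order machine starts both loads together.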
-
Vlsi_03_2005.ppt/April_2005 28
Out Of Order (cont.)
Advantages
– Help exploit Instruction Level Parallelism (ILP)
– Help cover latencies (e.g., cache miss, divide)
– Artificially increase the Register file size (i.e. number of registers)
– Superior/complementary to compiler scheduler
» Dynamic instruction window
» Make usage of more registers than the Architecture Registers
Complex microarchitecture
– Complex scheduler. Involves also
» Large instruction window
» Speculative execution
– Requires reordering back-end mechanism (retirement) for:
» Precise interrupt resolution
» Misprediction/speculation recovery
» Memory ordering
OOO
-
Vlsi_03_2005.ppt/April_2005 29
Speculation
-
Vlsi_03_2005.ppt/April_2005 30
Branch Prediction
Goal - ensure enough instruction supply by correct prefetching
In the past - prefetcher assumed fall-through
– Lose on unconditional branch (e.g., call)
– Lose on frequently taken branches (e.g., loops)
Branch prediction
– Predicts whether a branch is taken/not taken
– Predicts the branch target address
Misprediction cost varies (higher w/ increased pipeline length)
Typical Branch prediction rates: ~90%-96%
– 4%-10% misprediction
– 10-25 branches between mispredictions
– 50-125 instructions between mispredictions
Misprediction cost increased with
– Pipeline depth
– Machine width
» e.g. 3 width x 10 stages = 30 inst flushed!
Speculation - BP
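The figures on this slide are linked by simple arithmetic, sketched below. The ratio of one branch per 5 instructions is an assumption inferred from the slide's own numbers (50 instructions for 10 branches); the function names are illustrative.

```python
# Arithmetic behind the misprediction figures on the slide.
# Assumption: about one branch every 5 instructions (inferred from the
# slide's 10 branches <-> 50 instructions); helper names are hypothetical.

def branches_between_mispredictions(mispredict_percent: float) -> float:
    return 100 / mispredict_percent

def instructions_between_mispredictions(mispredict_percent: float,
                                        instructions_per_branch: int = 5) -> float:
    return branches_between_mispredictions(mispredict_percent) * instructions_per_branch

def flushed_instructions(width: int, pipeline_depth: int) -> int:
    """Worst-case in-flight instructions discarded on a misprediction."""
    return width * pipeline_depth

for percent in (4, 10):
    print(percent, "% ->",
          branches_between_mispredictions(percent), "branches,",
          instructions_between_mispredictions(percent), "instructions")

print(flushed_instructions(3, 10), "instructions flushed")
```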
-
Vlsi_03_2005.ppt/April_2005 31
Target Array + Direction Prediction
Target and direction are predicted separately
Tag may be partial
[Figure: the Branch IP indexes the Target Prediction array (tag, predicted target), which outputs the predicted Target Address and hit / miss (indicates a branch), and the Direction Prediction (for conditional branches only), which outputs the predicted direction (taken/not-taken)]
Speculation - BP
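One way to picture separate target and direction prediction is sketched below. The slide does not say how either structure is built, so the specifics here are assumptions: a direct-mapped target array with 8-bit partial tags, and a table of 2-bit saturating counters for direction. The class and method names are hypothetical.

```python
# Sketch of separate target and direction prediction.
# Assumptions (not from the deck): direct-mapped target array, 8-bit
# partial tags, 2-bit saturating counters for the direction.

class BranchPredictor:
    def __init__(self, entries: int = 256, tag_bits: int = 8):
        self.entries = entries
        self.tag_mask = (1 << tag_bits) - 1
        self.targets = [None] * entries   # each entry: (partial_tag, target)
        self.counters = [1] * entries     # 0..3, starting at "weakly not taken"

    def _index(self, ip: int) -> int:
        return ip % self.entries

    def _tag(self, ip: int) -> int:
        # Only some of the upper address bits are kept, so tags can alias.
        return (ip // self.entries) & self.tag_mask

    def predict(self, ip: int):
        """Return (hit, taken, target); a miss means 'not known to be a branch'."""
        entry = self.targets[self._index(ip)]
        if entry is None or entry[0] != self._tag(ip):
            return False, False, None
        taken = self.counters[self._index(ip)] >= 2
        return True, taken, entry[1]

    def update(self, ip: int, taken: bool, target: int) -> None:
        """Train both structures with the resolved branch outcome."""
        i = self._index(ip)
        self.targets[i] = (self._tag(ip), target)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = BranchPredictor()
bp.update(0x4000, taken=True, target=0x4800)
print(bp.predict(0x4000))    # hit, predicted taken, with the stored target
print(bp.predict(0x14000))   # different address, same partial tag: a false hit
```

The second lookup shows the price of a partial tag: two addresses that differ only in the dropped bits are indistinguishable to the target array.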
-
Vlsi_03_2005.ppt/April_2005 32
Speculative Execution
Execution of instructions from a predicted (yet unsure) path
Eventually, the path may turn out to be wrong.
Advantages:
– Ensure instruction supply
– Allow large scheduling window (for out of order)
Issues:
– Misprediction cost
– Misprediction recovery
Speculation - execution
-
Vlsi_03_2005.ppt/April_2005 33
Cache - Motivation & Principle
Memory consumption is growing about 2X every 2 years
– Typical size: (Y2000) 64M-128M, (Y2002) 128M-256M
CPU speed grows faster than memory and buses
– CPU/Bus grew from 1:1 to 6:1, and still growing
       486        Pentium     P-II         P-III         P4
CPU    25-66MHz   66-233MHz   200-450MHz   0.5-1.33GHz   1.4-2.4GHz
Bus    33MHz      66MHz       66-100MHz    133-200MHz    400MHz
– Memory: DRAM: 60-100ns (“10-16MHz”), Cost:
-
Vlsi_03_2005.ppt/April_2005 34
The Generic Processor
[Figure: block diagram: Instruction Cache, BTB, Instruction fetch/decode, Trace/decoded cache, Rename, Scheduler, Execution engine, Data supply]
Speculation – Trace Cache
-
Vlsi_03_2005.ppt/April_2005 35
Fetch bandwidth example
[Figure: control flow graph of instruction blocks A, B, C, D, E, and the dynamic instruction stream of blocks A, B, C over time]
Control flow graph: A, B, C are instruction blocks
Speculation – Trace Cache
-
Vlsi_03_2005.ppt/April_2005 36
Trace Cache Concept
• Hold in the “instruction” cache the dynamic stream of the executed instructions
=> Trace cache acts as “branch predictor” + wide instruction supplier
Speculation – Trace Cache
-
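Vlsi_03_2005.ppt/April_2005 37
Trace Cache Overview
[Figure: the I Cache feeds the Decoder; the Fill Unit builds traces (A B C) into the Trace Cache, which is looked up by address and delivers instructions to the Execution Core; two modes: Build Mode and Stream Mode]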
Speculation – Trace Cache
-
Vlsi_03_2005.ppt/April_2005 38
Trace cache line
• Tag: identifies starting address of trace
• N instructions (potentially decoded)
• Next address: next fetch address
• Path info: branch flags (T, NT), number of branches, trace ends w/ branch?, …
[Figure: line layout: PC Tag | N Instructions | Next address | path info]
Speculation – Trace Cache
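The line layout above maps naturally onto a small record type. The sketch below is one possible rendering, not the design of any particular processor: the field names, the dictionary-based storage, and the rule that a lookup hits only when the predicted branch directions match the stored path are all assumptions made for illustration.

```python
# Sketch of a trace cache line and a toy trace cache.
# Field names, storage, and the path-matching rule are assumptions.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TraceLine:
    tag: int                   # starting address (PC) of the trace
    instructions: List[str]    # up to N (possibly decoded) instructions
    next_address: int          # next fetch address after the trace
    branch_flags: List[bool]   # path info: True = taken, False = not taken
    ends_with_branch: bool     # path info: does the trace end with a branch?

class TraceCache:
    def __init__(self) -> None:
        self.lines: Dict[int, TraceLine] = {}

    def fill(self, line: TraceLine) -> None:
        """Build mode: the fill unit stores a completed trace."""
        self.lines[line.tag] = line

    def lookup(self, pc: int, predicted_path: List[bool]) -> Optional[TraceLine]:
        """Stream mode: hit only if the stored path matches the prediction."""
        line = self.lines.get(pc)
        if line is not None and line.branch_flags == list(predicted_path):
            return line
        return None

cache = TraceCache()
cache.fill(TraceLine(tag=0x100, instructions=["A", "B", "C"],
                     next_address=0x180, branch_flags=[True, False],
                     ends_with_branch=True))
hit = cache.lookup(0x100, [True, False])
print(hit.instructions if hit else "miss")
```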
-
Vlsi_03_2005.ppt/April_2005 39
Threads
-
Vlsi_03_2005.ppt/April_2005 40
Scalar Execution
Dependencies reduce throughput/utilization
Time
Threads
-
Vlsi_03_2005.ppt/April_2005 41
Superscalar Execution
Generally increases throughput, but decreases utilization
Time
Threads
-
Vlsi_03_2005.ppt/April_2005 42
Predication
Generally increases utilization, increases throughput less(much of the utilization is thrown away)
Time
Threads
-
Vlsi_03_2005.ppt/April_2005 43
CMP – Chip Multi-Processor
Time
Low utilization / higher throughput
Threads
-
Vlsi_03_2005.ppt/April_2005 44
Blocked Multithreading
May increase utilization and throughput, but must switch when current thread goes to low utilization/throughput section (e.g. L2 cache miss)
Time
Threads
-
Vlsi_03_2005.ppt/April_2005 45
Fine Grained Multithreading
Increases utilization/throughput by reducing impact of dependences
Time
Threads
-
Vlsi_03_2005.ppt/April_2005 46
Simultaneous Multithreading
Time
Increases utilization/throughput
Threads
-
Vlsi_03_2005.ppt/April_2005 47
Future
-
Vlsi_03_2005.ppt/April_2005 48
Analog Circuit Paradigm
[Figure: Gain vs. Frequency, showing two operating points (Gain1, BW1) and (Gain2, BW2)]
GBWP = Gain Bandwidth Product = constant @ a given technology
e.g. Gain1*BW1 = Gain2*BW2
Future
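The constant-product relation lets one operating point be derived from another. A minimal sketch follows; the helper name and the example values (gain of 100 at 10 kHz) are hypothetical and chosen only to make the trade-off visible.

```python
# Constant gain-bandwidth product: Gain1 * BW1 = Gain2 * BW2.
# bandwidth_at_gain and the sample values are hypothetical.

def bandwidth_at_gain(gain1: float, bw1_hz: float, gain2: float) -> float:
    """Bandwidth available at gain2, given one known (gain1, bw1) point."""
    return gain1 * bw1_hz / gain2

# Trading a 10X lower gain for a 10X wider bandwidth:
print(bandwidth_at_gain(100, 10_000, 10))
```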
-
Vlsi_03_2005.ppt/April_2005 49
Analog Circuit Paradigm (cont.)
[Figure: Gain vs. Frequency, with bandwidths BW1 ... BWn]
Future
-
Vlsi_03_2005.ppt/April_2005 50
“Theory”
Analog Gain Bandwidth Product (GBWP) is constant for a specific technology; this is also true for other “environments”…
A computer structure can excel in performance for a specific application set but not at all applications (also true for benchmarks)
A person can excel in several areas but not at all...
Examples: benchmarks, applications in the coming foils, people…
Future
-
Vlsi_03_2005.ppt/April_2005 51
Tuning for Applications
[Figure: Performance vs. “Applications” (Apps1 ... Appsn)]
Future
-
Vlsi_03_2005.ppt/April_2005 52
Provide Specialized “efficient” MIPS
Find a way to support the new performance requirements via an efficient “mechanism”
A tailored solution (to a specific application set) can provide “efficient” MIPS via INTEGRATION. How?
Future
-
Vlsi_03_2005.ppt/April_2005 53
The Need: the environment
These days is the PC's 20th birthday
– 835 million PCs sold 1981-2001
– 138 million PCs in year 2001 (IDC), 10X number of cars, 1.5X of televisions sold annually
– 2.2 billion emails a day, 10X of the first class mail
– 400 million online users (200 in Sep99)
– CPU performance improved ~8000X !!!
What will be the need for performance in the coming 20 years?
What will be the technology progress in the coming 20 years? 10 years? 5 years?
Statistics courtesy of Gartner Dataquest, U.S. News & World Report, Jupiter Internet Population Model, and NUA Internet Surveys
Future
-
Vlsi_03_2005.ppt/April_2005 54
Windows XP examples that need excessive performance:
- Movie Maker Video Indexing
- Video smoothing
Example 1: Movie Maker Video Indexing
320*240, 30fps: 4X slower than real time on Centrino™ @1.6GHz
1980x1080, 30fps: ~100X over Centrino™ @1.6GHz
Future
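One plausible reading of the ~100X figure for video indexing is pixel-count scaling. The sketch below assumes the work grows linearly with the number of pixels per frame at the same frame rate; that assumption and the helper name are not from the slide.

```python
# Rough scaling estimate for Example 1 (video indexing).
# Assumption (not from the slide): work scales linearly with pixels per frame.

def pixel_ratio(w_hi: int, h_hi: int, w_lo: int, h_lo: int) -> float:
    return (w_hi * h_hi) / (w_lo * h_lo)

ratio = pixel_ratio(1980, 1080, 320, 240)   # high resolution vs. 320*240
slowdown_at_low_res = 4                     # "4X slower than real time"
needed_speedup = ratio * slowdown_at_low_res

print(f"pixel ratio    ~{ratio:.1f}X")
print(f"needed speedup ~{needed_speedup:.0f}X (slide quotes ~100X)")
```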
-
Vlsi_03_2005.ppt/April_2005 55
[Figure: 5 FPS (Jerky) vs. 30 FPS (Enhanced)]
Example 2: Emulation of:
- Video smoothing
- Video Enhancement
Video smoothing
352*240 pixels: CPU usage 70% of Centrino™ @1.6GHz
1980x1080, 30fps: ~21X over Centrino™ @1.6GHz
Future
-
Vlsi_03_2005.ppt/April_2005 56
The Need
Future
-
Vlsi_03_2005.ppt/April_2005 57
The need: Build a Panorama
M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003
Performance: >30min P4 3GHz
Simplified capabilities at Microsoft Digital Image Suite 10 ($129.95)
Future
-
Vlsi_03_2005.ppt/April_2005 58
Future
-
Vlsi_03_2005.ppt/April_2005 59
CPU Usage: an example (measured on IBM X20)
Streaming vs. General purpose
General purpose usage: Excel and Outlook --> Burst need for MIPS
Streaming Processing: 6 low resolution videos --> Continuous need for MIPS
[Figure: two CPU usage traces over windows of about 5 min each, each with its average (Avg.) marked]
Future
-
Vlsi_03_2005.ppt/April_2005 60
Process trend – the theory (cont)
Performance driven era vs. Power aware era
[Figure: Power ratio of process generation p vs. p-1 @ same die area; “old” processes n, n+1, n+2, n+3 (1µ to 0.35µ) and “new” processes m, m+1, m+2, m+3 (0.25µ to 0.09µ)]
– 1.4X frequency
– 0.9X voltage
– 0.7X capacitance/transistor
– 1X area
– 2X transistors
– Leakage 30%
Power increase/generation: 1.6X
Power aware era (performance within power envelope)
– 1.4X frequency
– 0.75X voltage
– 0.7X capacitance/transistor
– 1X area
– 2X transistors
– Leakage
-
Vlsi_03_2005.ppt/April_2005 61
Processor roadmap trend – real life (cont)
Extension of Pollack’s Rule (Micro32, 1999)
[Figure: Growth (X, 0 to 4) in Power and Performance for processor generation k vs. k-1, compacted @ the same process technology, across Technology Generation 1.5, 1, 0.7, 0.5, 0.35, 0.18]
Perf/power delta ratio 1 : >3
! Perf/power delta ratio 3 : 1
Future
-
Vlsi_03_2005.ppt/April_2005 62
solution 1: CMP (Chip Multi-Processor)
[Figure: Performance vs. Power; Conventional design curve: 1% performance for 3% in power; CMP* points: One processor, 2 CMP, 3 CMP, 4 CMP; penalty: MP]
*CMP = Symmetric General Purpose (GP) cores
Future
-
Vlsi_03_2005.ppt/April_2005 63
solution 2: ACCMP (Asymmetric Cluster CMP)
[Figure: Performance vs. Power for ACCMP, annotated "~1% performance for 1% in power" and ">3% performance for 1% in power"; penalty: Specialized MIPS]
Future
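The performance-for-power ratios quoted on the CMP and ACCMP slides can be compared side by side. The sketch below treats each ratio as a linear, small-change rule of thumb. Mapping "~1% for 1%" to CMP and ">3% for 1%" to ACCMP is this sketch's reading of the charts, and the 3.0 used for ACCMP is only the lower bound of the ">3%" figure.

```python
# Marginal performance return on extra power, using the ratios quoted on
# the slides as linear rules of thumb for small changes.
# The design-to-ratio mapping is an assumption of this sketch.

RETURN_PER_PERCENT_POWER = {
    "conventional design": 1 / 3,  # 1% performance for 3% in power
    "CMP": 1.0,                    # ~1% performance for 1% in power
    "ACCMP": 3.0,                  # >3% performance for 1% in power (lower bound)
}

def performance_gain_percent(design: str, extra_power_percent: float) -> float:
    return RETURN_PER_PERCENT_POWER[design] * extra_power_percent

for design in RETURN_PER_PERCENT_POWER:
    gain = performance_gain_percent(design, 3)
    print(f"{design:20s}: about {gain:.0f}% performance for 3% more power")
```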
-
Vlsi_03_2005.ppt/April_2005 64
ACCMP
What is the ACCMP?
– On Die Asymmetric Clusters of cores
– Efficient specialized MIPS clusters with >3-4X performance/power over GP cores
– Compatible ISA?
Penalties
– Multi-Processing (tasks or threads)
– Specialized MIPS
ACCMP is a solution that allows Moore’s performance law to continue (for a while) within the power envelope
Future
-
Vlsi_03_2005.ppt/April_2005 65
ACCMP
[Figure: block diagram: Host Cluster (General Purpose MIPS: two Host cores, L2 $), Specialized MIPS A Cluster, Specialized MIPS B Cluster, Interconnect, Ext. Bus]
Future
-
Vlsi_03_2005.ppt/April_2005 66
Future - Processors
• applications need
• Specialized MIPS
• Detached from the CPU core
• Different engines
• Mixture of Programmable and fixed function
• ?