Faster unicores are still needed
![Page 1: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/1.jpg)
1
Faster unicores are still needed
André Seznec
INRIA/IRISA
![Page 2: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/2.jpg)
2
DAL: Defying Amdahl’s Law
• ERC advanced grant to A. Seznec (2011-2016)
DAL objective:
« Given that Amdahl’s Law is forever, propose (impact) the microarchitecture of the 2020 general-purpose manycore »
![Page 3: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/3.jpg)
3
Multicores are everywhere
• Multicores in servers, desktops, laptops: 2-4-8-12 out-of-order cores
• Multicores in smartphones, tablets: 2-4 (not that simple) cores
• Manycores for niche markets: 48-80-100 simple cores (Tilera, Intel Phi)
![Page 4: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/4.jpg)
4
Multicore/multithread for everyone
• End user: improved usage comfort (can surf the web and listen to MP3s at the same time)
• Parallel performance for the masses? Very few (scalable) mainstream parallel apps: graphics, niche market segments
![Page 5: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/5.jpg)
5
No parallel software bonanza
in the near future
• Inheritance of sequential legacy code
• Parallelism is not cost-effective for most apps
• Sequential programming will remain dominant
![Page 6: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/6.jpg)
6
Inheritance of sequential legacy code
• Software is more resilient than hardware: apps survive and evolve for years, often decades; very few parallel apps now
• Redevelopment of parallel apps from scratch is unlikely
• Compute-intensive sections will be parallelized, but significant code sections will remain sequential
![Page 7: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/7.jpg)
7
Parallelism is not cost-effective
for most apps
• Why parallelism? Only for performance
• But it is costly: difficult, time-consuming, error-prone, and poorly portable (both functionality and performance)
![Page 8: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/8.jpg)
8
Sequential programming will
remain dominant
Just easier: the « Joe » programmer, portability, maintenance, debugging
+ compiler to parallelize
+ parallel libraries
+ software components (developed by experts)
![Page 9: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/9.jpg)
9
Looking backwards
![Page 10: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/10.jpg)
10
2002: The End of the Uniprocessor Road
• Power and temperature walls: stopped the frequency increase
• 2x transistors: 5%? 10%? more performance (if any)
Economic logic: buy smaller chips!
IC industry needs to sell new (expensive) chips:
Marketing:
« You need hyperthreading, 2, 4, 8 cores »
![Page 11: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/11.jpg)
11
Marketing multicores to the masses: 2002 - ..
[Figure labels: SMT, Dual-core SMT, Quad-core SMT; reaction: « GREAT !! »]
![Page 12: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/12.jpg)
12
And now ?
The end user is not such a fool ..
![Page 13: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/13.jpg)
13
Following the trend: 2020
• Silicon area, power envelope ≈ 100 Nehalem class cores
or
≈ 1,000 simple cores
(VLIW, in-order superscalar)
![Page 14: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/14.jpg)
14
Amdahl’s Law
“Cannot run faster than the sequential part”
[Figure: execution time split into a sequential part and a parallel part]
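The bound quoted on this slide can be made concrete with the usual statement of Amdahl's law. The sketch below is illustrative only: the function names and the 1% sequential fraction in the usage note are assumptions, not taken from the slides.

```python
def amdahl_speedup(seq_fraction: float, processors: int) -> float:
    """Speedup over one processor when only the parallel part scales (linearly)."""
    if not 0.0 <= seq_fraction <= 1.0:
        raise ValueError("seq_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    parallel_fraction = 1.0 - seq_fraction
    return 1.0 / (seq_fraction + parallel_fraction / processors)


def amdahl_limit(seq_fraction: float) -> float:
    """Upper bound on the speedup as the processor count grows without limit."""
    return float("inf") if seq_fraction == 0.0 else 1.0 / seq_fraction
```

For example, with a hypothetical 1% sequential fraction, `amdahl_speedup(0.01, 1000)` stays below the bound `amdahl_limit(0.01)`: the sequential part alone caps the speedup, whatever the number of cores.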
![Page 15: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/15.jpg)
15
OK, parallel applications do not scale
• Our recent study on parallel application scaling:
  [Formula not recoverable; its labels were: execution time, input set, processor number]
• In general: bp > -1: sublinear scaling
• Sometimes: bs > 0: the sequential part increases
![Page 16: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/16.jpg)
16
But let us use a naive (overoptimistic) model
• A parallel application:
  Parallel section: can use 1000 processors, with linear speed-up
  Sequential section: runs on a single processor
  SEQ: constant fraction of sequential code
![Page 17: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/17.jpg)
17
Complex cores against simple cores
• CC: 100 complex cores vs SC: 1000 simple cores, with a complex core 2X faster than a simple core
If SEQ > 0.8% then CC > SC
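The 0.8% threshold follows from the naive model if execution time is written as SEQ / (sequential speed) + (1 - SEQ) / (parallel throughput), with a simple core as the unit of speed. The sketch below solves for the break-even SEQ in closed form; the function and parameter names are illustrative assumptions.

```python
def break_even_seq(seq_speed_a: float, throughput_a: float,
                   seq_speed_b: float, throughput_b: float) -> float:
    """SEQ fraction at which configurations A and B take the same time.

    Time model (naive, linear speed-up on the parallel section):
        time = SEQ / seq_speed + (1 - SEQ) / throughput
    A is assumed to have the faster sequential core and the lower parallel
    throughput, so A wins for every SEQ above the returned value.
    """
    seq_gain = 1.0 / seq_speed_b - 1.0 / seq_speed_a    # what A saves per unit of SEQ
    par_loss = 1.0 / throughput_a - 1.0 / throughput_b  # what A loses per unit of (1 - SEQ)
    return par_loss / (seq_gain + par_loss)


# CC: 100 complex cores, each 2x a simple core -> sequential speed 2, throughput 200.
# SC: 1000 simple cores                        -> sequential speed 1, throughput 1000.
CC_VS_SC = break_even_seq(2.0, 200.0, 1.0, 1000.0)
```

`CC_VS_SC` evaluates to 0.004 / 0.504, i.e. about 0.79%, which rounds to the 0.8% quoted on the slide.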
![Page 18: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/18.jpg)
18
And hybrid SC + CC ?
CC_SC: 50 complex + 500 simple cores
If SEQ > 0.2% then CC_SC > SC
![Page 19: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/19.jpg)
19
And if ..
• Use a huge amount of resources for a single core:
  10X the area of the complex core
  10X the power of the complex core
• Use all the uniprocessor techniques: very wide issue (8-16?), ultimate frequency (« heat and run »), helper threads, value prediction
• Invent new techniques
Ultra complex cores
![Page 20: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/20.jpg)
20
DAL architecture proposition
• Heterogeneous architecture: A few ultra complex cores
to enable performance on sequential codes and/or critical sections
A « sea » of simple cores for parallel sections
![Page 21: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/21.jpg)
21
For the naive model
« DAL » : UC_SC
5 ultra complex cores + 500 simple cores
• If SEQ > 0.13 % then « DAL » > SC
• « DAL » always better than UC, CC, CC_SC
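The thresholds quoted for the hybrid configurations can be reproduced with the same naive time model, under two assumptions that the slides do not state explicitly: (1) in a hybrid chip the parallel section runs on the simple cores only, and (2) an ultra complex core is 4x faster than a simple core (2x a complex core). With these assumptions the model gives 0.2% for CC_SC and 0.13% for « DAL ». The class and names below are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    seq_speed: float       # speed of the core running the sequential section (simple core = 1)
    par_throughput: float  # throughput on the parallel section, in simple-core units

    def time(self, seq: float) -> float:
        """Normalized execution time for a sequential fraction `seq`."""
        return seq / self.seq_speed + (1.0 - seq) / self.par_throughput


SC = Config("SC", 1.0, 1000.0)       # 1000 simple cores
CC = Config("CC", 2.0, 200.0)        # 100 complex cores, 2x each
CC_SC = Config("CC_SC", 2.0, 500.0)  # ASSUMPTION: parallel part on the 500 simple cores only
DAL = Config("DAL", 4.0, 500.0)      # ASSUMPTION: ultra complex core = 4x a simple core


def break_even_vs_sc(cfg: Config) -> float:
    """SEQ fraction above which `cfg` beats the all-simple-cores chip."""
    seq_gain = 1.0 / SC.seq_speed - 1.0 / cfg.seq_speed
    par_loss = 1.0 / cfg.par_throughput - 1.0 / SC.par_throughput
    return par_loss / (seq_gain + par_loss)
```

Under these assumptions `break_even_vs_sc(CC_SC)` is about 0.20% and `break_even_vs_sc(DAL)` about 0.13%, and `DAL.time(s)` is below both `CC.time(s)` and `CC_SC.time(s)` for any non-zero sequential fraction `s`.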
![Page 22: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/22.jpg)
22
Need for research on faster unicores
• Silicon area is a 2nd-order issue: can use the area of 10 complex cores
• Power/energy is a 2nd-order issue: can use the power of 10 complex cores
![Page 23: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/23.jpg)
23
Ongoing work: Revisiting Value Prediction
with Arthur Pérais
![Page 24: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/24.jpg)
24
Value prediction? (Lipasti et al.; Gabbay and Mendelson, 1996)
Basic idea: eliminate (some) true data dependencies by predicting instruction results
[Figure: dependence graph over instructions I0, I1, I3, I4, I5, with edges labelled +1, +2, +3]
![Page 25: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/25.jpg)
25
Value Prediction:
• Large body of research, 1996-2002
• Quite efficient: surprisingly high number of predictable instructions
• Not implemented so far:
  High cost: is it still relevant now?
  High penalty on misprediction: don’t lose all the benefit
![Page 26: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/26.jpg)
26
Last Value Predictor
• Just predict the last produced value
• Set-associative table
• Use confidence counters
Analogy with PC-based branch prediction
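A minimal sketch of a last value predictor with confidence counters is given below. It simplifies the slide's design in ways that are assumptions of this example: the table is direct-mapped rather than set-associative, the counter is a plain saturating counter, and a prediction is only used when the counter is saturated. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    tag: int = -1        # -1 marks an empty entry (real tags are non-negative)
    value: int = 0       # last value produced by the instruction
    confidence: int = 0  # saturating confidence counter


class LastValuePredictor:
    def __init__(self, index_bits: int = 10, conf_max: int = 7):
        self.index_bits = index_bits
        self.conf_max = conf_max
        self.table = [Entry() for _ in range(1 << index_bits)]

    def _locate(self, pc: int):
        index = pc & ((1 << self.index_bits) - 1)
        tag = pc >> self.index_bits
        return self.table[index], tag

    def predict(self, pc: int) -> Optional[int]:
        """Return the last value only when the entry matches and is fully confident."""
        entry, tag = self._locate(pc)
        if entry.tag == tag and entry.confidence == self.conf_max:
            return entry.value
        return None

    def update(self, pc: int, actual: int) -> None:
        """Train with the value actually produced by the instruction at `pc`."""
        entry, tag = self._locate(pc)
        if entry.tag != tag:            # allocate on a tag miss
            entry.tag, entry.value, entry.confidence = tag, actual, 0
        elif entry.value == actual:     # same value again: gain confidence
            entry.confidence = min(entry.confidence + 1, self.conf_max)
        else:                           # value changed: start over
            entry.value, entry.confidence = actual, 0
```

An instruction that keeps producing the same value becomes predictable once its counter saturates; a single different value resets the counter and silences the predictor for that instruction.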
![Page 27: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/27.jpg)
27
Stride value predictor
• Predict last value + stride (the last difference)
[Figure: table indexed by PC feeding an adder]
Analogy with stride prefetcher, but also with loop predictor
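The stride scheme can be sketched as follows. This is an illustration under simplifying assumptions: an unbounded table keyed by the full PC, a saturating confidence counter that gates predictions, and no handling of in-flight (speculative) instances of the same instruction. Names are hypothetical.

```python
from typing import Dict, List, Optional


class StridePredictor:
    def __init__(self, conf_max: int = 7):
        self.conf_max = conf_max
        # pc -> [last_value, stride, confidence]
        self.entries: Dict[int, List[int]] = {}

    def predict(self, pc: int) -> Optional[int]:
        """Predict last value + stride, but only at full confidence."""
        entry = self.entries.get(pc)
        if entry is None or entry[2] < self.conf_max:
            return None
        return entry[0] + entry[1]

    def update(self, pc: int, actual: int) -> None:
        """Train with the committed value of the instruction at `pc`."""
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = [actual, 0, 0]
            return
        last, stride, confidence = entry
        new_stride = actual - last
        if new_stride == stride:   # the stride repeated: gain confidence
            confidence = min(confidence + 1, self.conf_max)
        else:                      # new stride: relearn from scratch
            stride, confidence = new_stride, 0
        self.entries[pc] = [actual, stride, confidence]
```

A loop induction variable that advances by a constant amount is the typical case this captures; a constant value is the special case of stride 0.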
![Page 28: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/28.jpg)
28
Finite Context Method predictors
Use the history of the last values produced by the instruction
[Figure: predictor tables indexed by PC]
Analogy with local history branch predictor
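A finite context method predictor can be sketched as a two-level structure: a per-instruction history of recent values, and a second level that maps each history to the value that followed it. The example below is a simplified illustration: it keys the second level by the exact (PC, history) pair instead of a hashed index into a finite table, and the `order` and `conf_max` parameters are assumptions.

```python
from typing import Dict, List, Optional, Tuple


class FCMPredictor:
    def __init__(self, order: int = 4, conf_max: int = 7):
        self.order = order        # number of past values forming the context
        self.conf_max = conf_max
        self.histories: Dict[int, Tuple[int, ...]] = {}               # level 1: pc -> last values
        self.patterns: Dict[Tuple[int, Tuple[int, ...]], List[int]] = {}  # level 2: context -> [value, confidence]

    def predict(self, pc: int) -> Optional[int]:
        history = self.histories.get(pc, ())
        if len(history) < self.order:
            return None
        entry = self.patterns.get((pc, history))
        if entry is None or entry[1] < self.conf_max:
            return None
        return entry[0]

    def update(self, pc: int, actual: int) -> None:
        history = self.histories.get(pc, ())
        if len(history) == self.order:
            key = (pc, history)
            entry = self.patterns.get(key)
            if entry is None:
                self.patterns[key] = [actual, 0]
            elif entry[0] == actual:
                entry[1] = min(entry[1] + 1, self.conf_max)
            else:
                entry[0], entry[1] = actual, 0
        # Shift the new value into the local history.
        self.histories[pc] = (history + (actual,))[-self.order:]
```

Unlike the last value and stride schemes, this captures repeating sequences such as 1, 2, 3, 1, 2, 3, at the price of needing the most recent local values before it can predict.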
![Page 29: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/29.jpg)
29
And global value history
• Global value history: just no sense! Needs the values of the last instructions, which are available too late!!
• But global branch history!?! ITTAGE is the state-of-the-art indirect branch predictor!! And it predicts values (branch targets)!
![Page 30: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/30.jpg)
30
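ITTAGE / VTAGE
[Figure: a tagless base predictor indexed by pc, plus tagged components indexed by pc hashed with global history h[0:L1], h[0:L2], h[0:L3]; entries hold 32-bit values; tag comparisons (=?) select the prediction]
Longest matching component provides the prediction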
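The selection rule (longest matching component provides the prediction) can be illustrated with a heavily simplified sketch. It is not the VTAGE design itself: confidence counters and usefulness bits are omitted, the hash function, table sizes and history lengths are arbitrary assumptions, and allocation is reduced to "on a misprediction, allocate in the next longer component".

```python
from typing import Dict, List, Optional, Tuple


class VTAGESketch:
    def __init__(self, history_lengths=(1, 4, 8), index_bits: int = 8, tag_bits: int = 10):
        self.history_lengths = tuple(history_lengths)  # shortest to longest
        self.index_bits = index_bits
        self.index_mask = (1 << index_bits) - 1
        self.tag_mask = (1 << tag_bits) - 1
        self.base: Dict[int, int] = {}                 # tagless base predictor
        self.tagged: List[Dict[int, Tuple[int, int]]] = [{} for _ in self.history_lengths]
        self.ghist = 0                                 # global branch history, newest bit is bit 0

    def push_branch(self, taken: bool) -> None:
        limit = (1 << self.history_lengths[-1]) - 1
        self.ghist = ((self.ghist << 1) | int(taken)) & limit

    def _slot(self, pc: int, comp: int) -> Tuple[int, int]:
        """Index and tag for component `comp` (arbitrary illustrative hash)."""
        hist = self.ghist & ((1 << self.history_lengths[comp]) - 1)
        mixed = (pc * 0x9E3779B1) ^ (hist * 0x85EBCA6B)
        return mixed & self.index_mask, (mixed >> self.index_bits) & self.tag_mask

    def _provider(self, pc: int) -> Tuple[Optional[int], int]:
        """Longest-history component whose tag matches, else the base table."""
        for comp in reversed(range(len(self.tagged))):
            index, tag = self._slot(pc, comp)
            entry = self.tagged[comp].get(index)
            if entry is not None and entry[0] == tag:
                return comp, index
        return None, pc & self.index_mask

    def predict(self, pc: int) -> Optional[int]:
        comp, index = self._provider(pc)
        if comp is None:
            return self.base.get(index)
        return self.tagged[comp][index][1]

    def update(self, pc: int, actual: int) -> None:
        comp, index = self._provider(pc)
        predicted = self.predict(pc)
        if comp is None:
            self.base[index] = actual
        else:
            self.tagged[comp][index] = (self.tagged[comp][index][0], actual)
        if predicted != actual:  # mispredicted: try a longer history next time
            longer = 0 if comp is None else comp + 1
            if longer < len(self.tagged):
                new_index, new_tag = self._slot(pc, longer)
                self.tagged[longer][new_index] = (new_tag, actual)
```

The point of the sketch is that the lookup uses only the PC and the global branch history, both available early in the pipeline, so an instruction whose result depends on the path that led to it can be predicted without knowing its own most recent values.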
![Page 31: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/31.jpg)
31
The repair issue on misprediction
[Figure: instruction sequence I0, I1, I3, I4, I5 with a value misprediction]
![Page 32: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/32.jpg)
32
Pipeline squash
• Handled like an exception or a branch misprediction
• Very high penalty
[Figure: instruction sequence I0, I1, I3, I4, I5]
![Page 33: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/33.jpg)
33
Selective replay
• Cancel all dependent instructions, but save the others
• Very complex to implement: unlimited dependence chains
[Figure: instruction sequence I0, I1, I3, I4, I5]
![Page 34: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/34.jpg)
34
Critical path
• Predicted value is needed late in the pipeline: dispatch time is sufficient
• Except that:
![Page 35: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/35.jpg)
35
A FCM implementation issue
[Figure: predictor indexed by PC, combined with a speculative window]
Must take the last local values
Might be a critical path
![Page 36: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/36.jpg)
36
Critical path on the stride value predictor
[Figure: table indexed by PC, adder, and speculative window]
Stride AND speculative last value must be high confidence
Can be reused on the next cycle
![Page 37: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/37.jpg)
37
Experiments
• 8-way superscalar, deep pipeline
• Use prediction only on high confidence: 3-bit counters, prediction used only when saturated, reset on misprediction
![Page 38: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/38.jpg)
38
Squashing
![Page 39: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/39.jpg)
39
Selective replay
![Page 40: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/40.jpg)
40
High confidence through probabilistic counters
• Need for very high confidence: 95% accuracy is insufficient, >> 99% needed
TRADING ACCURACY AGAINST COVERAGE
• Saturation with only very low probability: 1/32, 1/256
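One way to read "saturation with only very low probability" is a small counter whose increments succeed only with a fixed probability, while any misprediction resets it. The sketch below follows that reading; applying the same probability to every step, the 3-bit width, and the injectable random source are assumptions of this example.

```python
import random
from typing import Optional


class ProbabilisticCounter:
    def __init__(self, bits: int = 3, increment_probability: float = 1 / 32,
                 rng: Optional[random.Random] = None):
        self.max_value = (1 << bits) - 1
        self.increment_probability = increment_probability
        self.rng = rng if rng is not None else random.Random()
        self.value = 0

    @property
    def confident(self) -> bool:
        """Predictions are used only when the counter is saturated."""
        return self.value == self.max_value

    def update(self, correct: bool) -> None:
        if not correct:
            self.value = 0  # reset on any misprediction
        elif (self.value < self.max_value
              and self.rng.random() < self.increment_probability):
            self.value += 1  # increment only with low probability
```

With 3 bits and probability 1/32, a counter needs on average 7 × 32 = 224 correct predictions in a row to saturate, so it behaves like a much wider counter while storing only 3 bits per entry. Fewer instructions reach the confident state (lower coverage), but those that do are very rarely mispredicted.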
![Page 41: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/41.jpg)
41
Squashing
![Page 42: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/42.jpg)
42
And hybrids
![Page 43: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/43.jpg)
43
Current status
• All value predictors are amenable to very high confidence: no complex selective repair needed
• No need for local value prediction: no complex critical path in the local value predictor
![Page 44: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/44.jpg)
44
Ongoing work: Selective Prediction of Predicated Instructions
with Nathanael Prémillieu
![Page 45: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/45.jpg)
45
Who cares about predicated instructions?
• CMOV in all ISAs
• ARM, Itanium: all instructions are predicated
• Out-of-order execution: just a nightmare
![Page 46: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/46.jpg)
46
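The multiple definition problem
Before renaming:
I1: R1 ← R2, R3 (p)
I2: R4 ← R1, R2
After renaming (through the mapping table):
I1: P1 ← P15, P22 (p)
I2: P13 ← ???, P15
The source R1 of I2 is P1 if the predicate p is true, and the previous mapping of R1 otherwise.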
![Page 47: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/47.jpg)
47
Expansion/Serialization
After renaming:
I1a: P1 ← P15, P22
I1b: P27 ← (p) ? P1 : P11
I2: P13 ← P27, P15
• Create an extra instruction
• Force I1b → I2 dependency
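The expansion scheme can be sketched as a toy renamer that splits each predicated instruction into a compute micro-op and a select micro-op. This is an illustration only: the instruction tuple format, the free-list order, and the assumption that R1 was previously mapped to P11 (suggested by the P11 operand on the slide) are choices made for this example.

```python
from typing import Dict, List, Optional, Tuple

# (name, destination register, source registers, predicate or None)
Instr = Tuple[str, str, List[str], Optional[str]]


def rename_with_expansion(instructions: List[Instr],
                          mapping: Dict[str, int],
                          free_regs: List[int]) -> List[str]:
    """Rename in program order; predicated instructions get an extra select micro-op."""
    mapping = dict(mapping)  # architectural -> physical register
    free = list(free_regs)
    renamed = []
    for name, dest, srcs, predicate in instructions:
        sources = ", ".join(f"P{mapping[s]}" for s in srcs)
        if predicate is None:
            new = free.pop(0)
            renamed.append(f"{name}: P{new} ← {sources}")
            mapping[dest] = new
        else:
            computed = free.pop(0)  # result if the predicate is true
            selected = free.pop(0)  # unique definition seen by later readers
            previous = mapping[dest]
            renamed.append(f"{name}a: P{computed} ← {sources}")
            renamed.append(f"{name}b: P{selected} ← ({predicate}) ? P{computed} : P{previous}")
            mapping[dest] = selected
    return renamed


example = rename_with_expansion(
    [("I1", "R1", ["R2", "R3"], "p"), ("I2", "R4", ["R1", "R2"], None)],
    mapping={"R1": 11, "R2": 15, "R3": 22},  # ASSUMPTION: initial mapping table
    free_regs=[1, 27, 13],                   # ASSUMPTION: free-list order
)
```

With this mapping and free list, `example` reproduces the three renamed instructions of the slide: I2 reads P27, so it has a single definition to wait for, at the cost of one extra micro-op and a serialized I1b → I2 dependency.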
![Page 48: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/48.jpg)
48
Aggressive serialization
I1: P18 ← (p) ? (op P15, P22) : P23
I2: P13 ← P18, P15
• No expansion, but an extra operand on I1: complexity on register file, issue logic, bypass network
• Force I1 → I2 dependency
![Page 49: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/49.jpg)
49
Predicting the predicates
• Use branch history or branch+predicate history to predict the predicates
• Eliminates multiple definitions
• Predicate mispredictions become branch mispredictions
![Page 50: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/50.jpg)
50
Not that convincing !
![Page 51: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/51.jpg)
51
• Filter the predicate prediction
• Replay the mispredicted predicates at rename time
![Page 52: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/52.jpg)
52
![Page 53: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/53.jpg)
53
• Predicate prediction + filtering allows:
Better performance
Without aggressive out-of-order implementation
• Current compilers are « shy » about using predication: might be worth reconsidering
![Page 54: Faster unicores are still needed](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814e95550346895dbc3dd2/html5/thumbnails/54.jpg)
54
Conclusion
Faster cores are needed:
Amdahl’s law,
Uniprocessor workload
Silicon, power, etc are available:
Just grab the resource from the rest of the system
Do research as if area and power were not constraints:
Then, take into account the constraints
(or somebody else will manage to do it)