Faster unicores are still needed

André Seznec

INRIA/IRISA

DAL: Defying Amdahl’s Law

• ERC advanced grant to A. Seznec (2011-2016)

DAL objective:

« Given that Amdahl’s Law is Forever propose (impact) the microarchitecture of the 2020

General Purpose manycore »

Multicores are everywhere

• Multicores in servers, desktop, laptops 2-4-8-12 O-O-O cores

• Multicores in smart phones, tablets 2-4-(not that simple) cores

• Manycores for niche markets 48-80-100 simple cores

Tilera, Intel Phi

4Multicore/multithread for everyone

• End-user : improved usage comfort Can surf on the web and hear MP3

• Parallel performance for the masses? Very few (scalable) mainstream // apps

Graphics Niche market segments

No parallel software bonanza

in the near future

• Inheritage of sequential legacy codes

• Parallelism is not cost-effective for most apps

• Sequential programming will remain dominant

6 Inheritage of sequential legacy codes

• Software is more resilient than hardware Apps are surviving/evolving for years, often

decades Very few parallel apps now

• Unlikely redevelopment of parallel apps from scratch

• Computing intensive sections will be parallelized But significant code sections will remain sequential

Parallelism is not cost-effective

for most apps

• Why parallelism ? Only for performance

• But costly: Difficult, man-time consuming, error prone Poorly portable: functionality and

performance

Sequential programming will

remain dominant

Just easier The « Joe » programmer Portability, maintenance, debug

+ compiler to parallelize + parallel libraries + software components (developped by

experts)

Looking backwards

102002: The End of the Uniprocessor Road

• Power and temperature walls: Stopped the frequency increase

• 2x transistors: 5 %? 10 % ? perf. (if any)

economical logic : buy smaller chips !

IC industry needs to sell new (expensive) chips:

Marketing:

« You need hyperthreading, 2, 4, 8 cores »

Marketing multicores to the masses2002- ..

GREAT !!

SMT Dual-core

Quad-core

And now ?

The end user is not such a fool ..

Following the trend: 2020

• Silicon area, power envelope ≈ 100 Nehalem class cores

≈ 1,000 simple cores

(VLIW, in-order superscalar)

Amdahl’s Law“Cannot run faster than sequential part”

seq. parallel

15OK, parallel applications do not scale

• Our recent study on parallel application scaling:

• In general: bp> -1 : sublinear scaling

• Sometimes: bs > 0 : sequential part increases

Execution time Input set Processor number

16But let us use a naive (overoptimistic) model

• A parallel application:

Parallel section: can use 1000 processors

Sequential section: run on a single processor

SEQ: constant fraction of sequential code

linear speed-up

17Complex cores against simple cores

• CC: 100 complex vs SC :1000 simple cores

with complex 2X faster than simple

if SEQ > 0.8 % then CC > SC

And hybrid SC + CC ?

CC_SC: 50 complex 500 simple

if SEQ> 0.2% then CC_SC > SC

And if ..

• Use a huge amount of resource for a single core:

10X the area of the complex core

10X the power of the complex core

Use all the uniprocessor techniques Very wide issue (8 – 16 ?), Ultimate frequency ( « heat

and run »), Helper threads, Value prediction

Invent new techniquesUltra Complex cores

DAL architecture proposition

• Heterogeneous architecture: A few ultra complex cores

to enable performance on sequential codes and/or critical sections

A « sea » of simple cores for parallel sections

For the naive model

« DAL » : UC_SC

5 ultra complex cores + 500 simple cores

• If SEQ > 0.13 % then « DAL » > SC

• « DAL » always better than UC, CC, CC_SC

22Need for research on faster unicores

• Silicon area is 2nd order issue can use the area of 10 complex cores

• Power/energy is 2nd order issue

can use the power of 10 complex cores

On going work:Revisiting Value Prediction

with Arthur Pérais

Value prediction ?Lipasti et al, Gabbay and Mendelson 1996

Basic idea: Eliminate (some) true data dependencies through

predicting instruction results

I0 I1 I3 +2

+3 +1I4 I5

25Value Prediction:

• Large body of research 96-02

• Quite efficient: Surprisingly high number of predictable

instructions

• Not implemented so far: High cost : is it still relevant now ? High penalty on misp.: don’t lose all the

benefit

Last Value Predictor

• Just predict the last produced value

Set Associative Table Use confidence counters

Analogy with PC-based branch prediction

Stride value predictor

• Add last value + (last difference)

Analogy with stride prefetcher, but also with loop predictor

Finite Context Method predictors

Use history of the last values by the instruction

Analogy with local history branch predictor

And global value history

• Just no sense ! Need the history of the last instructions

Too late !!

• But global branch history !?! ITTAGE is the state-of-the-art indirect

branch predictor !! And it predicts values !

branch

ITTAGE

pc h[0:L1]

=? =? =?

prediction

pc pc h[0:L2] pc h[0:L3]

3232 1 32 1 32 1

32Tagless base Predictor

Longest matching component provides the prediction

The repair issue on misprediction

I0 I1 I3 I4 I5

misprediction

Pipeline squash

• Acts as on exception, branch misprediction

• Very high penalty

I0 I1 I3 I4 I5

Selective replay

• Cancel all dependent instructions, but save the others

• Very complex to implement: Unlimited dependence chains

I0 I1 I3 I4 I5

Critical path

• Predicted value needed late in the pipeline: Disptach time is sufficient

• Except that:

A FCM implementation issue

Must take the last local values

Might be a critical path

36Critical path on the stride value predictor

Stride AND spec. last valuemust be high confidence

Can be reused on the next cycle

Experiments

• 8-way superscalar, deep pipeline

• Use prediction only on high confidence 3-bit counters + saturated + reset

Squashing

Selective replay

40High confidence through probabilistic counters

• Need for very high confidence: 95 % accuracy unsufficient >> 99 % needed

TRADING ACCURACY AGAINST COVERAGE

• Saturation with only very low probability 1/32, 1/256

Squashing

And hybrids

Current status

• All value predictors amenable to very high confidence No complex selective repair needed

• No need for local value prediction No complex critical path in the local

value predictor

On going work:Selective Prediction of Predicated

Instructions

with Nathanael Prémillieu

45Who cares about predicated instructions ?

• CMOV in all ISA

• ARM, Itanium : All instructions are predicated

out-of-order execution: just a nightmare

Mapping Table

I1: R1 R2, R3 (p)

I2: R4 R1, R2

Before renaming:

After renaming:

I1: P1 P15, P22 (p)

I2: P13 ???, P15

The multiple definition problem

After renaming:

I1a: P1 P15, P22

I1b: P27 (p) ? P1, P11

I2: P13 P27, P15

Expansion/Serialization

• Create an extra instruction

• Force I1bI2 dependency

Aggressive serialization

I1: P18 (p) ? (op P15, P22) : P23

I2: P13 P18, P15

• No expansion, but an extra operand on I1: • complexity on register file, issue logic, bypass network

• Force I1I2 dependency

Predicting the predicates

• branch history or branch+predicate history to predict the predicates

Eliminate multiple definitions

Predicate mispredictions become branch mispredictions

Not that convincing !

• Filter the predicate prediction

• Replay at rename time the mispredicted predicates

• Predicate prediction + filtering allows:

Better performance

Without aggressive out-of-order implementation

• Current compilers « shy » on predication usage

might be worth to reconsider

Conclusion

Faster cores are needed:

Amdahl’s law,

Uniprocessor workload

Silicon, power, etc are available:

Just grab the resource from the rest of the system

Do research as if (area, power) was not a constraint:

Then, take into account the constraints

(or somebody else will manage to do it)

Faster unicores are still needed

Documents

Transcript of Faster unicores are still needed

Still Faster, Still Stronger, Still More Reliable!

Faster Faster Faster! Datamarts with Hive at Yahoo

The Agricultural Revolution Britain needed more food Britain needed more food Farms were still run on the medieval strip system Farms were still run on.

HNP DISCUSSION PAPER Immunization in Indiasiteresources.worldbank.org/HEALTHNUTRITIONANDPOPULATION/... · 2006-04-05 · where improvements are still needed. In other words, immunization

The quest for financial discipline at local government ... · attempt to generate urgently needed cash. Despite this cash injection, the council still needed another R 100 million

IS THIS LAW STILL NEEDED AFTER 75 YRS OF INDEPENDENCE: …

Websites are Still Needed in the Social Media World!

Are Printed Bulletins Still Needed? - Effective Church Communications

Are web managers still needed when everyone is a web 'expert'?

Enrollment Items still needed - bethel.k12.oh.us · NEW ELEMENTARY STUDENT ENROLLMENT CHECKLIST . NAME . Thank you for taking time to register your child. We still need the following

Faster! Faster!

The Thames Gateway: still needed, but still a government priority? Ros Dunn Chief Executive

needed? buildsystems still Are embedded [ ] Maybe [ ] No ...

download.microsoft.comdownload.microsoft.com/.../Files/710000003963/ABM_W… · Web viewABM needed to find ways to deploy IT services faster, anywhere in the world, for less money.

The Lamplighter - BivocationalCall Jim Wyble at 225-939-6842 for projects still needed and dates and times people are needed. Right now they are sanding sheetrock and prepping for

Written procedures for ELEOT feedback still needed....Written procedures for ELEOT feedback still needed. (Plan developed for ELEOT observation schedule of every teacher every month.

La Tourangelle - Apparel Softwaresoftengine.com/wp-content/uploads/Customer_Sucess_Story_LaTourangelle... · In addition to SAP Business One La Tourangelle still needed Softengine

PERUVIAN PRIMARY EDUCATION: IMPROVEMENT STILL …lasa.international.pitt.edu/Lasa2001/HuntBarbara.pdf · PERUVIAN PRIMARY EDUCATION: IMPROVEMENT STILL NEEDED ... at primary level.1

SPECIAL MUSIC IS NEEDED - stjohnucclincolnillinois.com … · Special music is still needed for our worship services in August. If you would like to provide that Special Music or

Getting to insight faster - Deloitte US...Getting to insight faster | The impetus for change The impetus for change Dealing with so much data is still a relatively new endeavour, and