Transcript of «Multicore – The Future of Computing»
Chief Engineer Terje Mathisen
Moore’s Law
«The number of transistors we can put on a chip will double every two years»
– Originally stated in 1965, revised in 1975
– Up to around the turn of the century this meant a doubling in performance every 18 months.
– Power has become the worst problem.
– Bipolar transistors->NMOS->CMOS->(lots of tweaks)->3D
– Voltage scaling
– Today, leakage current is a limiter
– Even CMOS transistors leak when they get really tiny
Moore's Law has held for 40 years
[Chart: transistors per chip vs. year, 1975–2015; log scale, 10^4 to 10^10]
Haswell: 5.6e9 transistors, 22 nm
What could we use all the transistors for? Increase scalar performance.
Increasingly complicated CPUs:
Multiple cycles/instruction
– 8088 (29K), 80286 (134K), 80386 (275K)
Pipelined, one cycle/instruction
– 80486 (1.2M)
Superscalar: multiple instructions/cycle
– Pentium (3.1M) (two in-order pipelines)
Out-of-order/superscalar/multithreaded
– Pentium Pro/Pentium III/Pentium 4/Core/etc. (5.5M -> 5.6B)
Pentium 4 had the fastest pipeline ever, 3 GHz clock
– The inner core ran at 2x, i.e. 6 GHz
– Only simple instructions, like ADD/SUB/AND/OR
Guessing at branches
– if (a > b) {...} else {...}
Mistakes were very costly, both in time and power
– 10 to 200 wasted instructions each time the CPU guessed wrong!
Core 2: multiple complicated cores
Running two independent processes in parallel causes fewer wasted instructions and leads to more power-efficient computing.
– Shorter pipelines are better at branching
– Object-oriented programming uses many branches
Every two years: double the number of cores
– Core 2 -> Core 2 Duo -> Core 2 Quad
– The latest server CPUs have up to 18 cores, using 5.6e9 transistors
Vector operations
SIMD: work on more data with each instruction
– SSE uses 16-byte vectors (4 float / 2 double)
– AVX uses 32-byte vectors (8 float / 4 double)
Each core can do two SSE operations/cycle
– A quad-core CPU does 4*2*4 = 32 fp operations/cycle
– 64 Gflops @ 2 GHz, 100 Gflops @ 3+ GHz
– A high-end AVX implementation doubles this; 12-18 cores add another multiplier
Other CPU architectures
Sun Sparc
– 2005: Niagara: 8 cores, 4 threads/core, low clock speed
– Multithreaded server workloads
Oracle Sparc M7
– 2014: 32 cores, 8 threads/core
– Optimized for DB operations
IBM/Sony Cell
– 2005: Playstation 3
– 1 PPE + 7-8 SPE cores, each capable of 25 Gflops
– Works on 16-byte vectors (4 float/2 double)
– ~200 Gflops SP -> 14 Gflops DP
– Special HPC version with 100+ Gflops DP
GPGPU
– Graphics cards with semi-general fp pipelines
Intel Larrabee / Many Integrated Core / Xeon Phi
Project started 2003
– Architecture review Oct 2006
Announced 2007
– 64-bit
– x86 compatible
Similar to Pentium
– Dual in-order pipelines
– More flexible mixing of instructions
Special graphics instructions, incl. scatter/gather
– S/G are very useful for HPC applications
LRB cont.: even longer vectors
– Works with 64-byte blocks (16 float / 8 double)
– Combined FMUL/FADD (FMA) instruction
More than 50 cores on the first product
– 4 threads/core
– 16*2*51 = 1632 flops/cycle
– 1.3 GHz core -> ~2 Tflops (the seismic cluster is ~10 Tflops)
The first product will be a graphics coprocessor card
It will use the same 125 watts (max) as a single Pentium 4
New name: Many Integrated Core (MIC) / Knights Corner / Xeon Phi
Future directions
Heterogeneous CPUs:
– Maybe 2-4 Core 2 cores + 20-60 Larrabee cores?
– Run single-threaded applications on the Core cores, multithreaded/vector-based work on Xeon Phi. (2013: the fastest computer in the world combined Ivy Bridge + Xeon Phi)
– OS threads without fp operations can also use the simple in-order LRB cores
Power-efficient processing
– Both laptops/mobiles and servers are limited by power use
– Simpler/slower cores with mostly in-order processing can use 80% less power
Conclusion
Multicore will give us an extra factor of ~10 increase in fp processing power
– Most current forms of simulation become possible on a single workstation with 2-4 CPUs
MIPS/Watt is crucial
– Easier to make many simple cores than one complex one
– Less wasted work
– Server farms and laptops
What are the consequences?
High performance requires multithreading
– Currently this is mostly server workloads
– Games are next; today they use 2-4 threads
High performance requires vector programming
– Can we work on 4, 16 or more variables simultaneously?
Many programs (and most programmers) don't care!
– If it is fast enough today, it will surely be OK in the future as well?
Not necessarily, because
– Data grows exponentially!
HPC applications
Seismic processing
– PC with a complete model of small fields
– Reduced-resolution test runs for larger fields
– Deskside server with nearly the same capability as the current 2048-cpu seismic cluster
Crash simulation
– Everything could fit on a laptop in 2012-2015
Financial modelling, incl. Monte Carlo risk analysis
Dynamic global process control
From current Unix cluster…
… to deskside workstation in 5 years?
Summary
Multicore will give us an extra factor of ~10 increase in fp processing power
Moore's law will go on
MIPS/Watt is crucial
Evry is at the leading edge of this development
Thank you!
Do we have the required programmers? Will we get them from the universities in the future?
– Possibly
– Today, most graduates learn only Java, which isn't very suitable
There's hope:
– LRB is on the NTNU CS curriculum today
– Similar situation at most universities
Can our standard vendors deliver updated SW?
– Eclipse, GeoFrame, Sismage, Ansys, Finite Element
Smaller transistors & slightly larger chips
[Chart: transistor feature size (nm) vs. year, 1975–2010; log scale, 10 to 10000 nm; exponential regression: f(x) = 1.47E+131 · 0.86^x]