Transcript of «Multicore – The Future of Computing»
Chief Engineer Terje Mathisen
Moore’s Law
«The number of transistors we can put on a chip will double every two years»
– Originally stated in 1965, revised in 1975
– Up to around the turn of the century this meant a doubling in performance every 18 months.
– Power has become the worst problem.
– Bipolar transistors->NMOS->CMOS->(lots of tweaks)->3D
– Voltage scaling
– Today, leakage current is a limiter
– Even CMOS transistors leak when they get really tiny
Moore's Law has held for 40 years
[Chart: transistors per chip vs. year, 1975–2015; log scale, 10^4 to 10^10]
Haswell: 5.6e9 transistors, 22 nm
What could we use all the transistors for? Increase scalar performance.
Increasingly complicated CPUs:
Multiple cycles/instruction
– 8088 (29K), 80286 (134K), 80386 (275K)
Pipelined, one cycle/instruction
– 80486 (1.2M)
Superscalar: multiple instructions/cycle
– Pentium (3.1M) (two in-order pipelines)
Out-of-order/superscalar/multithreaded
– Pentium Pro/Pentium III/Pentium 4/Core/etc. (5.5M -> 5.6B)
Pentium 4 had the fastest pipeline ever, 3 GHz clock
– The inner core ran at 2x, i.e. 6 GHz
– Only simple instructions, like ADD/SUB/AND/OR
Guessing at branches
– if (a > b) {...} else {...}
Mistakes were very costly, both in time and power
– 10 to 200 wasted instructions each time the CPU guessed wrong!
Core 2: multiple complicated cores
Running two independent processes in parallel causes fewer wasted instructions and leads to more power-efficient computing.
– Shorter pipelines are better at branching
– Object-oriented programming uses many branches
Every two years: double the number of cores
– Core 2 -> Core 2 Duo -> Core 2 Quad
– The latest server CPUs have up to 18 cores, using 5.6e9 transistors
Vector operations
SIMD: work on more data with each instruction
– SSE uses 16-byte vectors (4 float / 2 double)
– AVX uses 32-byte vectors (8 float / 4 double)
Each core can do two SSE operations/cycle
– A quad-core CPU does 4*2*4 = 32 fp operations/cycle
– 64 Gflops @ 2 GHz, 100 Gflops @ 3+ GHz
– A high-end AVX implementation doubles this; 12-18 cores add another multiplier
Other CPU architectures
Sun Sparc
– 2005: Niagara: 8 cores, 4 threads/core, low clock speed
– Multithreaded server workloads
Oracle Sparc M7
– 2014: 32 cores, 8 threads/core
– Optimized for DB operations
IBM/Sony Cell
– 2005: Playstation 3
– 1 PPE + 7-8 SPE cores, each capable of 25 Gflops
– Works on 16-byte vectors (4 float/2 double)
– ~200 Gflops SP -> 14 Gflops DP
– Special HPC version with 100+ Gflops DP
GPGPU
– Graphics cards with semi-general fp pipelines
Intel Larrabee / Many Integrated Core / Xeon Phi
Project started 2003
– Architecture review Oct 2006
Announced 2007
– 64-bit
– x86 compatible
Similar to Pentium
– Dual in-order pipelines
– More flexible mixing of instructions
Special graphics instructions, incl. scatter/gather
– S/G are very useful for HPC applications
LRB cont.: even longer vectors
– Works with 64-byte blocks (16 float / 8 double)
– Combined FMUL/FADD (FMA) instruction
More than 50 cores on the first product
– 4 threads/core
– 16*2*51 = 1632 flops/cycle
– 1.3 GHz core -> ~2 Tflops (the seismic cluster is ~10 Tflops)
The first product will be a graphics coprocessor card
It will use the same 125 watts (max) as a single Pentium 4
New name: Many Integrated Core (MIC) / Knights Corner / Xeon Phi
Future directions
Heterogeneous CPUs:
– Maybe 2-4 Core 2 cores + 20-60 Larrabee cores?
– Run single-threaded applications on the Core cores, multithreaded/vector-based work on Xeon Phi. (2013: the fastest computer in the world combined Ivy Bridge + Xeon Phi)
– OS threads without fp operations can also use the simple in-order LRB cores
Power-efficient processing
– Both laptops/mobiles and servers are limited by power use
– Simpler/slower cores with mostly in-order processing can use 80% less power
Conclusion
Multicore will give us an extra factor of ~10 increase in fp processing power
– Most current forms of simulation become possible on a single workstation with 2-4 CPUs
MIPS/Watt is crucial
– Easier to make many simple cores than one complex one
– Less wasted work
– Server farms and laptops
What are the consequences?
High performance requires multithreading
– Currently this is mostly server workloads
– Games are next; today they use 2-4 threads
High performance requires vector programming
– Can we work on 4, 16 or more variables simultaneously?
Many programs (and most programmers) don't care!
– If it is fast enough today, it will surely be OK in the future as well?
Not necessarily, because
– Data grows exponentially!
HPC applications
Seismic processing
– PC with a complete model of small fields
– Reduced-resolution test runs for larger fields
– Deskside server with nearly the same capability as the current 2048-cpu seismic cluster
Crash simulation
– Everything could fit on a laptop in 2012-2015
Financial modelling, incl. Monte Carlo risk analysis
Dynamic global process control
From current Unix cluster…
… to deskside workstation in 5 years?
Summary
Multicore will give us an extra factor of ~10 increase in fp processing power
Moore's law will go on
MIPS/Watt is crucial
Evry is at the leading edge of this development
Thank you!
Do we have the required programmers? Will we get them from the universities in the future?
– Possibly
– Today, most graduates learn only Java, which isn't very suitable
There's hope:
– LRB is on the NTNU CS curriculum today
– Similar situation at most universities
Can our standard vendors deliver updated SW?
– Eclipse, GeoFrame, Sismage, Ansys, Finite Element
Smaller transistors & slightly larger chips
[Chart: transistor feature size (nm) vs. year, 1975–2010; log scale, 10 to 10000 nm; exponential regression: f(x) = 1.47E+131 · 0.86^x]