1
Multi-Core Architectures for Numerical Simulation
Lehrstuhl für Informatik 10 (Systemsimulation)
Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Siemens Simulation Center
11. November 2009
H. Köstler, J. Habich, J. Götz, M. Stürmer, S. Donath, T. Gradl, D. Ritter,
C. Feichtinger, K. Iglberger (LSS Erlangen and RRZE)
U. Rüde (LSS Erlangen, [email protected])
In collaboration with RRZE and many more
2
Overview
Intro
Who we are
How fast are computers today?
Technological Trends
GPUs, Cell, and others
Example: Flow Simulation with Lattice Boltzmann Methods
Computational Haemodynamics using the PlayStation
Conclusions
3
The LSS Mission
Development and Analysis of Computer Methods for Applications in Science and Engineering
Applications from Physical and Engineering Sciences
Computer Science
Mathematics
LSS
4
Who is at LSS (and does what?)
• C. Feichtinger
• S. Donath
• C. Mihoubi
• J. Götz
• S. Ganguly
• K. Pickl
• S. Bogner
• !"##"#$%"&'(")"#'$
Alumni
Prof. G. Horton (Univ. of Magdeburg)
Prof. El Mostafa Kalmoun (Cadi Ayyad University, Morocco)
Dr. M. Kowarschik (Siemens Health Care)
Dr. M. Mohr (Geophysik, TU München)
Dr. F. Hülsemann (EDF, Paris)
Dr. B. Bergen (Los Alamos, USA)
Dr. N. Thürey (ETH Zürich)
Dr. J. Härdtlein (Bosch GmbH)
C. Möller (Navigon)
Dr. U. Fabricius (Elektrobit)
Dr. Th. Pohl (Siemens Health Care)
J. Treibig (RRZE)
C. Freundl (YAGER Development)
Laser Simulation
Prof. Dr. C. Pflaum
Supercomputing
J. Götz
Numerical Algorithms
H. Köstler
Complex Flows
K. Iglberger
B. Berneker
M. Wohlmuth
C. Jandl
Kai Hertel
J. Werner
T. Gradl
M. Stürmer
F. Deserno
D. Ritter
B. Gmeiner
S. Geißelsöder
• T. Dreher
• Dr. W. Degen
• T. Preclik
• D. Bartuschat
• S. Strobl
• Li Yi
5
How much is a PetaFlops?
10^6 = 1 MegaFlops: Intel 486, 33 MHz PC (~1989)
10^9 = 1 GigaFlops: Intel Pentium III, 1 GHz (~2000)
If every person on earth does one operation every 6 seconds, all humans together have 1 GigaFlops performance (less than a current laptop)
10^12 = 1 TeraFlops: HLRB-I, 1344 proc., ~2000
10^15 = 1 PetaFlops: >100,000 proc. cores, Roadrunner/Los Alamos, June 2008
• If every person on earth runs a 486 PC, we all together have an aggregate performance of 6 PetaFlops.
HLRB-II: 63 TFlops
HLRB-I: 2 TFlops
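A quick sanity check of the arithmetic behind these comparisons (assuming a world population of roughly 6 to 7 billion, which the slides do not state explicitly):
6 × 10^9 people × (1 operation / 6 s) = 10^9 operations/s = 1 GigaFlops
6 × 10^9 people × 10^6 Flops (one 486 PC each) = 6 × 10^15 Flops = 6 PetaFlops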
6
Where Does Computer Architecture Go?
Computer architects have capitulated: It may not be possible anymore to exploit progress in semiconductor technology for automatic performance improvements
Even today a single core CPU is a highly parallel system:
superscalar execution, complex pipeline, ... and additional tricks
Internal parallelism is a major reason for the performance increases until now, but ...
There is a limited amount of parallelism that can be exploited automatically
Multi-core systems concede the architects' defeat:
Architects fail to build faster single core CPUs given more transistors
Clock rate increases only slowly (due to power considerations)
Therefore architects have started to put several cores on a chip:
programmers must use them directly
7
What are the consequences?
For the application developers “the free lunch is over”
Without explicitly parallel algorithms, the performance potential cannot be used any more
it will become increasingly important to use instruction level parallelism (such as vector units)
For performance-critical applications:
CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores - maybe sooner than we are ready for this
In the high end we will have to deal with systems with millions of cores
8
Trends in Computer Architecture
On Chip Parallelism for everyone
instruction level
SIMD-like vectorization
multicore (with caches or local memory)
Off Chip parallelism
for large scale parallel systems
Accelerator hardware
GPUs
Cell processor
Limits to clock rate
Limits to memory bandwidth
Limits to memory latency
Multi-Core Activities at LSS
Architectures
IBM Cell
GPU
• Nvidia
• AMD/ATI
Conventional multi-core architectures (Intel, AMD)
Applications: finite elements - PDEs - multigrid methods - flow solvers (LBM methods)
Image processing, medical engineering applications
3D real-time simulation for industrial control
See papers, reports, Master's and Bachelor's theses:
http://www10.informatik.uni-erlangen.de/Publications/
9
10
Multi Core Architectures
IBM-Sony-Toshiba Cell Processor
GPU: Nvidia or AMD/ATI
11
The STI Cell Processor
hybrid multicore processor based on IBM Power architecture
(simplified) PowerPC core
runs operating system
controls execution of programs
multiple co-processors (8, on Sony PS3 only 6 available)
operate on fast, private on-chip memory
optimized for computation
vectorization: „float4“ data type
DMA controller copies data from/to main memory
• multi-buffering can hide main memory latencies completely for streaming-like applications
• loading local copies has low and known latencies
memory with multiple channels and banks can be exploited if many memory transactions are in-flight
12
IBM Cell Processor
Available Cell systems:
Roadrunner
Blades
Playstation 3
Cell Architecture: 9 cores on a chip
13
14
GPUs
massively parallel SIMD-like execution on several hundred compute units
typical performance values (Nvidia Fermi, soon):
2.7 TFlop single precision possible
630 Gflop double precision
4+x GByte memory „on board“
150+x GByte/sec memory bandwidth
additionally vectorization in „warps“ (16 floats)
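To make this execution model concrete, here is a minimal, illustrative CUDA sketch (not from the talk) of the same A[] = A[]*c task used in the Cell example later: one thread per array element, launched in blocks.

#include <cuda_runtime.h>

// one thread scales one element; blockIdx/blockDim/threadIdx give the global index
__global__ void scale_kernel(float *a, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard threads beyond the end of the array
        a[i] *= factor;
}

// host side (illustrative): copy to the GPU, launch enough blocks to cover n elements
void scale_on_gpu(float *host_a, float factor, int n)
{
    float *dev_a;
    cudaMalloc(&dev_a, n * sizeof(float));
    cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(dev_a, factor, n);
    cudaMemcpy(host_a, dev_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
}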
ATI Radeon HD 4870
Costs: 150 €
Interface: PCI-E 2.0 x16
Shader Clock: 750 MHz
Memory Clock: 900 MHz
Memory Bandwidth: 115 GB/s
FLOPS: 1200 GFLOPS
Max Power Draw: 160 W
Framebuffer: 1024 MB
Memory Bus: 256 bit
Shader Processors: 800
15
Nvidia GeForce GTX 295
Costs: 450 €
Interface: PCI-E 2.0 x16
Shader Clock: 1242 MHz
Memory Clock: 999 MHz
Memory Bandwidth: 2x112 GB/s
FLOPS: 2x894 GFLOPS
Max Power Draw: 289 W
Framebuffer: 2x896 MB
Memory Bus: 2x448 bit
Shader Processors: 2x240
16
GPU: AMD Stream Processor
17
AMD Stream Architecture (cont‘d)
18
ATI Radeon 3870 (RV670) / Firestream 9170
19
Example 1: Flow Simulation on Cell
20
LBM Optimized for Cell
memory layout
optimized for DMA transfers
information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange
code optimization
kernels hand-optimized in assembly language
SIMD-vectorized streaming and collision
branch-free handling of bounce-back boundary conditions
21
Simulation of Metal Foams
Free Surface Flows
Applications:
Engineering: metal foam simulations
Computer graphics: special effects
Based on LBM:
Mesoscopic approach to solving the NS equations
Good for complex boundary conditions
Details: D3Q19 model, BGK collision and grid compression
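For reference, the BGK collision and streaming update underlying these simulations can be written in its standard textbook form (notation not taken from the slides):
f_i(x + e_i Δt, t + Δt) = f_i(x, t) - (1/τ) [ f_i(x, t) - f_i^eq(ρ, u) ],   i = 0, ..., 18 for D3Q19
where ρ and u are the local density and velocity, τ is the relaxation time, and f_i^eq is the equilibrium distribution.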
22
Performance Results
[Bar chart: LBM performance on a single core (8x8x8 channel flow) for Xeon 5160, PPE, and SPE*, with straight-forward C code vs. SIMD-optimized assembly; reported values 2.0, 4.8, 10.4, and 49.0 on a scale from 0 to 50. *on Local Store without DMA transfers]
23
Performance Results
[Bar chart: LBM performance over 1 to 6 SPEs: 42, 81, 93, 94, 94, 95 on a scale from 30 to 100]
24
Performance Results
[Bar chart: Xeon 5160* vs. Playstation 3, each for 1 core and 1 CPU; values 9.1, 11.7, 21.1, and 43.8 on a scale from 0 to 50. *performance-optimized code by LB-DC]
25
Programming the Cell-BE
the hard way
control SPEs using management libraries
issue DMAs by language extensions
do address calculations manually
exchange main memory addresses, array sizes etc.
synchronization using mailboxes, signals or libraries
frameworks
Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM
Rapidmind SDK
accelerated libraries
single-source-compiler
IBM’s xlc-cbe-sse, uses OpenMP
26
Naive SPU implementation: A[] = A[]*c

volatile vector float ls_buffer[8] __attribute__((aligned(128)));

void scale( unsigned long long gs_buffer,   // main memory address of vector
            int number_of_chunks,           // number of chunks of 32 floats
            float factor ) {                // scaling factor
    vector float v_fac = spu_splats(factor);              // create SIMD vector with all
                                                          // four elements being factor
    for ( int i = 0 ; i < number_of_chunks ; ++i ) {
        mfc_get( ls_buffer , gs_buffer , 128 , 0 ,0,0);   // DMA reading i-th chunk
        mfc_write_tag_mask( 1 << 0 );                     // wait for DMA...
        mfc_read_tag_status_all();                        // ...to complete
        for ( int j = 0 ; j < 8 ; ++j )
            ls_buffer[j] = spu_mul( ls_buffer[j] , v_fac ); // scale local copy using SIMD
        mfc_put( ls_buffer , gs_buffer , 128 , 0 ,0,0);   // DMA writing i-th chunk
        mfc_write_tag_mask( 1 << 0 );                     // wait for DMA...
        mfc_read_tag_status_all();                        // ...to complete
        gs_buffer += 128;                                 // incr. global store pointer
    }
}
27
Remove latencies using multi-buffering

volatile vector float ls_buffer[3][8]
    __attribute__((aligned(128)));
...
mfc_get( ls_buffer[0] , gs_buffer , 128 , 0 ,0,0);            // request first chunk
for (int i = 0; i < number_of_chunks; ++i) {
    int cur  = ( i ) % 3;   // buffer no. and DMA tag for i-th chunk
    int next = (i+1) % 3;   // buffer no. and DMA tag for (i-2)-th and (i+1)-th chunk
    if (i < number_of_chunks-1) {
        mfc_write_tag_mask( 1 << next );                      // make sure the (i-2)-th chunk...
        mfc_read_tag_status_all();                            // ...has been stored
        mfc_get( ls_buffer[next] , gs_buffer+128 , 128 , next ,0,0); // request (i+1)-th chunk
    }
    mfc_write_tag_mask( 1 << cur );                           // wait until i-th chunk...
    mfc_read_tag_status_all();                                // ...is available
    for (int j = 0; j < 8; ++j) ls_buffer[cur][j] = spu_mul(ls_buffer[cur][j],v_fac);
    mfc_put( ls_buffer[cur] , gs_buffer , 128 , cur ,0,0);    // store i-th chunk
    gs_buffer += 128;
}
mfc_write_tag_mask( 1 | 2 | 4 );                              // wait for any...
mfc_read_tag_status_all();                                    // outstanding DMA
28
Example 2: LBM on Graphics Cards
Johannes Habich, M.Sc., [email protected]
29
OpenMP vs. CUDA
Alignment constraints must be met
No cache and cache lines must be considered
[Diagram: CUDA divides the domain into small pieces: cells 0-5 are split across block 1 and block 2, with threads 0-2 of each block handling individual cells. OpenMP divides the domain into huge chunks: cells 0-5 are split into a few large contiguous chunks, one per thread (threads 0-2).]
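A sketch of the CUDA decomposition described above (illustrative only, not the LBM kernel from the talk): each block covers a small tile of the domain and each thread updates exactly one cell, whereas an OpenMP version would split the same loop nest into a few large chunks, one per CPU thread.

// each thread handles one cell of an nx x ny domain; a block covers a 16x16 tile
__global__ void update_cells(float *dst, const float *src, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)
        dst[y * nx + x] = src[y * nx + x];       // placeholder for the actual cell update
}

void launch_update(float *dst, const float *src, int nx, int ny)
{
    dim3 threads(16, 16);                                // the small piece handled by one block
    dim3 blocks((nx + 15) / 16, (ny + 15) / 16);         // the grid of blocks tiles the domain
    update_cells<<<blocks, threads>>>(dst, src, nx, ny);
}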
30
Performance Results for GPUs
Up to 6 times the CPU performance if well implemented
Less than CPU performance if implemented straightforwardly
Use padding to circumvent a performance breakdown (80% loss)
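One way to apply the padding mentioned above (a hedged sketch, not the code from the talk) is cudaMallocPitch, which pads every row of a 2D field so that each row starts on an aligned address and accesses stay coalesced:

#include <cuda_runtime.h>

// allocate an nx x ny field of floats with padded rows; pitch returns the
// padded row length in bytes and must be used for all row addressing
float *alloc_padded_field(int nx, int ny, size_t *pitch)
{
    float *d_field = 0;
    cudaMallocPitch((void **)&d_field, pitch, nx * sizeof(float), ny);
    return d_field;
}

// inside a kernel, row y is then addressed via the pitch, not via nx:
// float *row = (float *)((char *)d_field + y * pitch);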
Arbitrary Geometries
31
Part IV
Conclusions
Computer Science X - System Simulation Group Markus Stürmer ([email protected])
Evolution of processors: Improvements
[Table: which of the following improvements are available on a standard CPU (Std.-CPU), the CBEA, and a GPU: pipelining, superscalar execution, out-of-order execution, wider buses, SIMD, multithreading, multiprocessing, caches, hardware prefetcher, local storage, resource virtualization (row groups: instruction, data, thread, transfer)]
33
Conclusions
There is no way around Multi-Core architectures in the foreseeable future
Multi-Core Accelerators have excellent performance potential, but ....
they are not suitable for all algorithms
they are tricky to program
many results published in the literature are too optimistic
• people tend to make unfair comparisons
accelerators and their programming environments are changing very quickly
what about robustness, numerical accuracy, etc?
there is a good chance that we will see these technologies in the future, but likely in different guise
34
35
Thank you for your attention
Questions?
Slides, papers, reports, theses, and animations available for
download at: www10.informatik.uni-erlangen.de