Intel® Xeon Phi™ Coprocessor - Software...
Transcript of Intel® Xeon Phi™ Coprocessor - Software...
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel® Xeon Phi™ Coprocessor
1
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Agenda
• Introduction
• Intel® Xeon Phi™ Architecture
• Programming Models
• Outlook
• Summary
2
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel® Many Integrated Core
Architecture (Intel® MIC) Intel® Multicore
Architecture
Performance and performance/watt
optimized for highly parallelized
compute workloads
Common software tools with Xeon
enabling efficient application readiness
and performance tuning
IA extension to Manycore
Many cores/threads with wide SIMD
Foundation of HPC Performance
Suited for full scope of workloads
Industry leading performance and
performance/watt for serial & parallel
workloads
Focus on fast single core/thread
performance with “moderate” number of
cores
3
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel Architecture Multicore and Manycore More cores. Wider vectors. Co-Processors.
Intel® Xeon® processor
64-bit
Intel Xeon
processor
5100 series
Intel Xeon processor
5500 series
Intel Xeon processor
5600 series
Intel Xeon
processor E5 Product Family
Intel Xeon
processor code
name
Ivy Bridge
Intel Xeon
processor code
name
Haswell
Intel® Xeon Phi™
Coprocessor
Core(s) 1 2 4 6 8 12 TBD
61
244 Threads 2 2 8 12 16 24
Intel® Xeon Phi™ Coprocessor extends established CPU architecture and programming concepts to highly parallel applications
Images do not reflect actual die sizes. Actual production die may differ from images.
4
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Introducing Intel® Xeon Phi™ Coprocessors Highly-parallel Processing for Unparalleled Discovery
5
Groundbreaking: differences
Up to 61 IA cores/1.1 GHz/ 244 Threads
Up to 16GB memory with up to 352 GB/s bandwidth
512-bit SIMD vector instructions
Linux operating system, IP addressable
Standard programming languages and tools
Leading to Groundbreaking results
Over 1 TeraFlop/s double precision peak performance1 Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server.2
Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server. 3
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
MPSS - MIC Software Stack
MIC operating system is Linux ! Host OS: Linux;Windows* to be added No need to deal with the low-level APIs !
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
3xxx Family Outstanding Parallel Computing Solution
Performance/$ leadership
Intel® Xeon Phi™ Coprocessors Introduced Full Portfolio June 17 @ ISC
3120P 3120A
5xxx Family Optimized for High Density Environments
Performance/watt leadership
5110P 5120D
7xxx Family Highest Level of Features
Performance leadership
7120P 7120X
16GB GDDR5
352 GB/s
> 1.2 TFlops DPT
8GB GDDR5
>300 GB/s
>1 TFlops DP
6GB GDDR5
240 GB/s
>1 TFlops DP
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Announced at ISC’13
• Tianhe-2 (“MilkyWay 2”): Largest supercomputer of the world installed in Guangzhou, China
– First in Top500 list
– 55 peta flops peak performance
8
• 30000 Intel® Xeon™ processors E5-2600-V2
– Based on 3rd Generation Intel Core architecture code name Ivy Bridge
• 48000 Intel® Xeon Phi™ processors !
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Agenda
• Introduction
• Intel® Xeon Phi™ Architecture
• Programming Models
• Outlook
• Summary
9
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel® Xeon Phi™ Architecture Overview
10
Up to 61 core s, at 1.1 GHz
in-order, support 4 threads
512 bit Vector Processing Unit
32 native registers
Reliability Features
Parity on L1 Cache, ECC on memory
CRC on memory IO, CAP on memory IO
High-speed bi-directional
ring interconnect
Fully Coherent L2 Cache
8 memory controllers
16 Channel GDDR5 MC
PCIe GEN2
Up to ~350GB/sec BW
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Core Architecture Overview
Scalar Unit based on Intel® Pentium® processor:
Two pipelines
Dual issue with vector instructions
Pipelined one-per-clock scalar throughput
SIMD Vector Processing Engine:
4 hardware threads per core
4 clock latency, hidden by round-robin scheduling of threads
Cannot issue back to back inst in same thread !!
Coherent 512KB L2 Cache per core
11
Ring
Scalar
Registers
Vector
Registers
512K L2 Cache
32K L1 I-cache 32K L1 D-cache
Instruction Decode
Vector Unit
Scalar Unit
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Vector Processing Unit Extends the Scalar IA Core
12
Pipe 0 (u-pipe) Pipe 1 (v-pipe)
Decoder uCode
L1 TLB and L1 instruction cache
32KB
X87 RF Scalar RF VPU RF
VPU 512b SIMD
L1 TLB and L1 Data Cache 32 KB
X87 ALU 0 ALU 1
TLB Miss Handler
L2 TLB
CRI
512KB L2 Cache
HWP for L2
Thread 0 IP
Thread 1 IP
Thread 2 IP
Thread 3 IP
D2 PPF PF D0 D1 WB E
On-Die Interconnect
Instruction Cache Miss
TLB miss
16B/cycle ( 2 IPC)
TLB miss
Data Cache Miss
4 threads in-order
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Agenda
• Introduction
• Intel® Xeon Phi™ Architecture
• Programming Models
• Outlook
• Summary
13
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
One Source Base, Tuned to many Targets
Multicore
Source
Many-core Cluster
Compilers, Libraries, Parallel Models
Multicore CPU
Multicore CPU
Intel® MIC Architecture
Multicore Cluster
Multicore and Many-core Cluster
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Phase Product Feature Benefit
Build
Intel® Advisor XE
Threading design assistant (Studio products only)
• Simplifies, demystifies, and speeds parallel application design
Intel® Composer XE
• C/C++ and Fortran compilers • Intel® Threading Building Blocks • Intel® Cilk™ Plus • Intel® Integrated Performance
Primitives • Intel® Math Kernel Library
• Enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core
Intel® MPI Library†
High Performance Message Passing (MPI) Library
• Enabling High Performance Scalability, Interconnect Independence, Runtime Fabric Selection, and Application Tuning Capability
Verify & Tune
Intel® VTune™
Amplifier XE
Performance Profiler for optimizing application performance and scalability
• Remove guesswork, saves time, makes it easier to find performance and scalability bottlenecks
Intel® Inspector XE
Memory & threading dynamic analysis for code quality
Static Analysis for code quality
• Increased productivity, code quality, and lowers cost, finds memory, threading , and security defects before they happen
Intel® Trace Analyzer & Collector†
MPI Performance Profiler for understanding application correctness & behavior
• Analyze performance of MPI programs and visualize parallel application behavior and communications patterns to identify hotspots
Intel® Developer Products - Intel® Cluster Studio XE 2013
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel® AVX Vector size: 256 bit Data types: • 32 and 64 bit float VL: 4, 8, 16
Intel® MIC Vector size: 512 bit Data types: • 32 and 64 bit integer • 32 and 64 bit float VL: 8,16
X4
Y4
X4◦Y4
X3
Y3
X3◦Y3
X2
Y2
X2◦Y2
X1
Y1
X1◦Y1
0
X8
Y8
X8◦Y8
X7
Y7
X7◦Y7
X6
Y6
X6◦Y6
X5
Y5
X5◦Y5
255
X4
Y4
X4◦Y4
X3
Y3
X3◦Y3
X2
Y2
X2◦Y2
X1
Y1
X1◦Y1
0
X8
Y8
X8◦Y8
X7
Y7
X7◦Y7
X6
Y6
X6◦Y6
X5
Y5
X5◦Y5
X16
Y16
X16◦Y16
…
...
…
511
Illustrations: Xi, Yi & results 32 bit float
Data Parallelism of Intel® Processors (2)
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel® Cilk™ Plus Array Notation
<array base> [<lower bound>:<length>[:<stride>]]+
A[:] // All of vector A
B[2:6] // Elements 2 to 7 of vector B
C[:][5] // Column 5 of matrix C
D[0:3:2] // Elements 0,2,4 of vector D
+ + + + + + + +
if (a[:] > b[:])
c[:] = d[:] * e[:];
else
c[:] = d[:] * 2;
A simple and elegant solution:
a language construct for vector level parallelism
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Vectorization for MIC: A new Target In
pu
t: C
/C+
+/FO
RTR
AN
so
urc
e co
de
Vectorizer: Map vector parallelism
to vector ISA
Intel® SSE Intel® AVX Intel® MIC
Express/expose vector parallelism
Array Notation
SIMD pragma
Vectorization Hints (ivdep/vector pragmas)
Fully Automatic Analysis
Elemental Function
Optimize and Code Gen
Dat
a p
aral
lel p
art
of
Inte
l® C
ilk™
Plu
s ex
ten
sio
n
Vectorizer makes
retargeting easy!
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Many-Core Hosted – “Native” Model
• Enabled by –mmic compiler switch
• Fully supported by compiler vectorization, Intel® MKL, OpenMP*, Intel® TBB, Intel® Cilk Plus, Intel® MPI, …
• Might be an option for some applications:
– Needs to fit into memory !
– Should be highly parallel code
• Serial parts are slower on MIC than on host !
– Limited access to external environment like I/O
• Native MIC file system exists in memory only !
• NFS allows external I/O but …
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Many-Core Hosted – with MPI
• MPI ranks on Intel® Xeon PhiTM coprocessors(only)
• All messages into/out of Intel® Xeon PhiTM coprocessors
• Programmed as homogenous network of many-core CPUs:
Xeon
MIC
Xeon
MIC
Xeon
MIC
Xeon
MIC
Network
Data
Data
Data
Data
MPI
20
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Heterogeneous Programming
Fortran (CAF)
MKL
TBB
OpenCL
Cilk Plus
C++
Tools
OpenMP
Fortran (CAF)
TBB
OpenCl
Cilk Plus
C++
MKL
Parallel programming is the same on MIC and CPU
OpenMP
Tools
PC
Ie
CPU Executable MIC Native
Executable
Pa
ralle
l
Co
mp
ute
Pa
ralle
l
Co
mp
ute
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Heterogeneous Programming
Fortran (CAF)
MKL
TBB
OpenCL
Cilk Plus
C++
Tools
OpenMP
Fortran (CAF)
TBB
OpenCl
Cilk Plus
C++
MKL
Parallel programming is the same on MIC and CPU
OpenMP
Tools
PC
Ie
CPU Executable MIC Native
Executable
Pa
ralle
l
Co
mp
ute
Pa
ralle
l
Co
mp
ute
Offload Directives (Non-Shared Model)
Offload Keywords (Virtual Shared-Memory)
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Choices for Offloading Application Code
Two Intel-specific models for offloading are supported ( Intel® Composer 2013 ):
LEO: Language Extensions for Offload for C/C++ and Fortran
Explicit data transfer by compiler directives
MYO: Mine-Your-Ours for C/C++
“Shared virtual memory” – implicit offload controlled by language extensions for variable declaration etc
Offloading and parallelism is orthogonal
Offloading only transfers control to the MIC devices
Parallelism needs to be exploited by a second model (e.g. OpenMP*)
23
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
LEO: Explicit Offload
Completely realized by directives and ‘attributes’
Ignored by other compilers ( might result in a warning )
Requires ‘bit-wise’ copyable data objects
Programmer designates variables that need to be copied between host and card in the offload directive
C/C++ Example: #pragma offload target(mic) in(data:length(size))
Fortran Example: !dir$ offload target(mic in(a1:length(size))
• Very much influenced accelerator (“target”) extension of coming OpenMP* 4.0 standard #pragma omp target map(to(b:count)) map(from(a:count))
24
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
C/C++ Extensions for explicit Offload
C/C++ Syntax Semantics
Offload pragma #pragma offload <clauses>
<statement block>
Allow next statement block to execute on Intel® MIC Architecture or host CPU
Keyword for variable & function definitions
__attribute__((target(mic))) Compile function for, or allocate variable on, both CPU and Intel® MIC Architecture
Entire blocks of code #pragma offload_attribute(push,
target(mic))
#pragma offload_attribute(pop)
Mark entire files or large blocks of code for generation on both host CPU and Intel® MIC Architecture
Data transfer #pragma offload_transfer
target(mic)<clauses> Initiates asynchronous data transfer, or initiates and completes synchronous data transfer
Intel® Many Integrated Core Architecture 25
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Example
26
float reduction(float *data, int numberOf)
{
float ret = 0.f;
#pragma offload target(mic) in(data:length(numberOf))
{
#pragma omp parallel for reduction(+:ret)
for (int i=0; i < numberOf; ++i)
ret += data[i];
}
return ret;
}
Note: copies numberOf elements to the coprocessor, not numberOf*sizeof(float) bytes – the compiler knows data’s type
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Example: Call Intel® MKL on Coprocessor
int main{
// initialize variables …
#pragma offload target(mic) \
in(transa,transb, N, alpha, beta) \
in(A:length(matrix_elements)) \
in(B:length(matrix_elements)) \
inout(C:length(matrix_elements)) \
sgemm(&transa, &transb, &N, &N, &N, &alpha,
A, &N, B, &N, &beta, C, &N);
}
sgemm performs C=beta*C+alpha*A*B, transa and transb regulate the transposition of A and B and the Ns define the sizes of the matrices (see documentation). C is input and output, all others are input only.
MKL will automatically make optimal use of MIC
27
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
MYO: Implicit Offload
Real language extensions
Not accepted by other compilers
Alternative model for non-compact data objects like a linked list
Programmer marks variables that should be (virtually) shared between host and card
Run-time system automatically maintains coherence at boundary of offloaded code regions
Sample:
_Cilk_shared double foo;
_Offload func(y);
28
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
MYO Memory Model
Section of memory maintained at the same (!!) virtual address on both the host and coprocessor
Reserving same address range on both devices allows
Seamless sharing of complex pointer-containing data structures
Elimination of user marshaling and data management
Use of simple language extensions to C/C++
Host Memory
MIC
Memory
Offload code
C/C++ executable
Host
Intel® MIC
Same address range
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
MYO: _Cilk_shared for Data & Routines
Intel® Many Integrated Core Architecture 30
What Syntax Semantics
Function int _Cilk_shared f(int x)
{ return x+1; }
Versions generated for both CPU and card; may be called from either side
Global _Cilk_shared int x = 0; Visible on both sides
File/Function static static _Cilk_shared int
x;
Visible on both sides, only to code within the file/function
Class class _Cilk_shared x {…}; Class methods, members, and and operators are available on both sides
Pointer to shared data int _Cilk_shared *p; p is local (not shared), can point to shared data
A shared pointer int *_Cilk_shared p; p is shared; should only point at shared data
Entire blocks of code #pragma offload_attribute(
push, _Cilk_shared) #pragma
offload_attribute(pop)
Mark entire files or large blocks of code _Cilk_shared using this pragma
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Example // Shared variable declaration for pi
_Cilk_shared float pi;
int main()
{
int count = 10000;
// Initialize shared global
// variables
pi = 0.0f;
// Compute pi on target
_Offload compute_pi(count);
pi /= count;
}
31
_Cilk_shared void compute_pi(int count)
{
int i;
#pragma omp parallel for \
reduction(+:pi)
for (i=0; i<count; i++)
{
float t = (float)((i+0.5f)/count);
pi += 4.0f/(1.0f+t*t);
}
}
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Offload Models plus MPI
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of processors
• Offload models used to accelerate MPI ranks
• Homogenous network of hybrid nodes:
Xeon
MIC
Xeon
MIC
Xeon
MIC
Xeon
MIC
Network
Data
Data
Data
Data
MPI
Offload
32
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
LLVM* Standard Passes
LLVM* Vectorizer: • Scalarizer • Divergence Analysis • Predicator • Packetizer • bypasses
LLVM* OpenCL Passes: • Barriers • Builtins • Kernel Arguments
LLVM* Standard Passes
LLVM* IR
LLVM* IR
LLVM* IR
Intel® Xeon Phi™ Coprocessor OpenCL* Compiler
OpenCL* LLVM Compiler Optimizer
Clang*
Code Generator
Xeon Phi Code
LLVM* IR
OpenCL*
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel® VTune™ Amplifier XE 2013 for Intel® Xeon Phi™
34 7/23/2013
Beginning with update 4, the methodology in this presentation is implemented in the “General
Exploration” profile
Select which problem areas you want to analyze
Events to collect will be configured automatically
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
MIC Debugging with GDB* • Run Host-MIC GDB* on your localhost (you can’t use default
host gdb!)
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
Start gdbserver on the Intel® Xeon Phi™Coprocessor
• To remote debug using pipe to ssh (gdb) target extended-remote | ssh –T mic0 ~/gdbserver –multi IP:port
• To remote debug using stdio (gdb) target extended-remote | ssh -T mic0 ~/gdbserver –multi -
To attach to a running application via the process-id (pid)
(gdb) file /local/path/to/application
(gdb) attach <remote-pid>
To run an application directly from GDB* (gdb) file /local/path/to/application
(gdb) set remote exec-file /target/path/to/application
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Extension to Microsoft Windows* OS
• Microsoft Window being added as second host operating system
– Windows 2008 Server
– No change for operating system of coprocessor – remains Linux
• Developer environment same as for Linux
– Same programming models
• In beta testing today
• To be released H2/2013
36
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Future Product Line: Knights Landing
• Knights Landing is the code name for the 2nd generation product for the Intel® Many Integrated Core Architecture
• Knights Landing targets Intel’s 14nm manufacturing process
• Kights Landing will be productized as a processor (running the host OS) and as a coprocessor ( a PCI end-point device )
• Knights Landing will feature on-package high-bandwidth memory
37
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Agenda
• Introduction
• Intel® Xeon Phi™ Architecture
• Programming Models
• Outlook
• Summary
38
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Summary Intel® Xeon Phi™ coprocessor is a real product now !
The flexibility of the programming models offer a solution for all needs: From a single MIC card in a workstation up to thousands of MIC cards attached to HPC cluster nodes
Compilers and other tools make it easy to develop or port code to run applications natively on the coprocessors, heterogeneously on host CPU + Intel® MIC systems and collections of these connected in a compute cluster
No new parallel programming models needed: All Intel-supported models available for MIC too – including innovative models like Coarray-Fortran
Using MIC is a simple extension of CPU programming
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Get Educated on Intel® Xeon Phi™ Coprocessors • http://software.intel.com/mic-developer
Intel® Xeon Phi™ Coprocessor (codename Knights Corner): http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
• Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
• An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors: http://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf
• Intel® Manycore Platform Software Stack (MPSS): http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
• Intel® Xeon Phi™ Coprocessor Developer's Quick Start Guide: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide
• Intel® Xeon Phi™ Coprocessor System Software Developers Guide: http://software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemsoftwaredevelopersguide.pdf
• Intel and Third Party Tools and Libraries available with support for Intel® Xeon Phi™ Coprocessor: http://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm
• Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors - Part 1: Optimization Essentials: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization
• Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2: Understanding and Using Hardware Events: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
Tools That Enable You to be Ready!
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others. 41
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Vector Instruction Performance
VPU contains 16 SP ALUs, 8 DP ALUs,
Most VPU instructions have a latency of 4 cycles and TPT 1 cycle
Load/Store/Scatter have 7-cycle latency
Convert/Shuffle have 6-cycle latency
VPU instruction are issued in u-pipe
Certain instructions can go to v-pipe also
Vector Mask, Vector Store, Vector Packstore, Vector Prefetch, Scalar
43
© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Spectrum of Execution Models
General purpose serial and parallel
computing
Codes with highly- parallel phases
Highly-parallel codes
Codes with balanced needs
CPU-Centric Intel® MIC-Centric
Manycore Hosted Symmetric Offload Multi-core Hosted
Main( ) Foo( ) MPI_*()
Foo( )
Main( ) Foo( ) MPI_*()
Main() Foo( ) MPI_*()
Main( ) Foo( ) MPI_*()
Main( ) Foo( ) MPI_*()
Multi-core
Many-core
Intel® Many Integrated Core (MIC)
Codes with serial phases
Reverse Offload
Main( ) Foo( ) MPI_*()
Foo( )
PCIe
Intel® Xeon Processor
Supported with Intel Tools
|---- Heterogenous Models -----|