Intel® Xeon Phi™ Coprocessor - Software...

© 2013, Intel Corporation. All righ ts reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.

and/or other countries. *Other names and brands may be claimed as the property of others.

Intel® Xeon Phi™ Coprocessor

1

http://software.intel.com/en-us/articles/optimization-notice



Agenda

• Introduction

• Intel® Xeon Phi™ Architecture

• Programming Models

• Outlook

• Summary

2




Intel® Many Integrated Core

Architecture (Intel® MIC) Intel® Multicore

Architecture

Performance and performance/watt

optimized for highly parallelized

compute workloads

Common software tools with Xeon

enabling efficient application readiness

and performance tuning

IA extension to Manycore

Many cores/threads with wide SIMD

Foundation of HPC Performance

Suited for full scope of workloads

Industry leading performance and

performance/watt for serial & parallel

workloads

Focus on fast single core/thread

performance with “moderate” number of

cores

3




Intel Architecture Multicore and Manycore More cores. Wider vectors. Co-Processors.

Intel® Xeon® processor

64-bit

Intel Xeon

processor

5100 series

Intel Xeon processor

5500 series

Intel Xeon processor

5600 series

Intel Xeon

processor E5 Product Family

Intel Xeon

processor code

name

Ivy Bridge

Intel Xeon

processor code

name

Haswell

Intel® Xeon Phi™

Coprocessor

Core(s) 1 2 4 6 8 12 TBD

61

244 Threads 2 2 8 12 16 24

Intel® Xeon Phi™ Coprocessor extends established CPU architecture and programming concepts to highly parallel applications

Images do not reflect actual die sizes. Actual production die may differ from images.

4




Introducing Intel® Xeon Phi™ Coprocessors Highly-parallel Processing for Unparalleled Discovery

5

Groundbreaking: differences

Up to 61 IA cores/1.1 GHz/ 244 Threads

Up to 16GB memory with up to 352 GB/s bandwidth

512-bit SIMD vector instructions

Linux operating system, IP addressable

Standard programming languages and tools

Leading to Groundbreaking results

Over 1 TeraFlop/s double precision peak performance1 Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server.2

Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server. 3




MPSS - MIC Software Stack

MIC operating system is Linux ! Host OS: Linux;Windows* to be added No need to deal with the low-level APIs !




3xxx Family Outstanding Parallel Computing Solution

Performance/$ leadership

Intel® Xeon Phi™ Coprocessors Introduced Full Portfolio June 17 @ ISC

3120P 3120A

5xxx Family Optimized for High Density Environments

Performance/watt leadership

5110P 5120D

7xxx Family Highest Level of Features

Performance leadership

7120P 7120X

16GB GDDR5

352 GB/s

> 1.2 TFlops DPT

8GB GDDR5

>300 GB/s

>1 TFlops DP

6GB GDDR5

240 GB/s

>1 TFlops DP




Announced at ISC’13

• Tianhe-2 (“MilkyWay 2”): Largest supercomputer of the world installed in Guangzhou, China

– First in Top500 list

– 55 peta flops peak performance

8

• 30000 Intel® Xeon™ processors E5-2600-V2

– Based on 3rd Generation Intel Core architecture code name Ivy Bridge

• 48000 Intel® Xeon Phi™ processors !




Agenda

• Introduction



• Outlook

• Summary

9




Intel® Xeon Phi™ Architecture Overview

10

Up to 61 core s, at 1.1 GHz

in-order, support 4 threads

512 bit Vector Processing Unit

32 native registers

Reliability Features

Parity on L1 Cache, ECC on memory

CRC on memory IO, CAP on memory IO

High-speed bi-directional

ring interconnect

Fully Coherent L2 Cache

8 memory controllers

16 Channel GDDR5 MC

PCIe GEN2

Up to ~350GB/sec BW




Core Architecture Overview

Scalar Unit based on Intel® Pentium® processor:

Two pipelines

Dual issue with vector instructions

Pipelined one-per-clock scalar throughput

SIMD Vector Processing Engine:

4 hardware threads per core

4 clock latency, hidden by round-robin scheduling of threads

Cannot issue back to back inst in same thread !!

Coherent 512KB L2 Cache per core

11

Ring

Scalar

Registers

Vector

Registers

512K L2 Cache

32K L1 I-cache 32K L1 D-cache

Instruction Decode

Vector Unit

Scalar Unit




Vector Processing Unit Extends the Scalar IA Core

12

Pipe 0 (u-pipe) Pipe 1 (v-pipe)

Decoder uCode

L1 TLB and L1 instruction cache

32KB

X87 RF Scalar RF VPU RF

VPU 512b SIMD

L1 TLB and L1 Data Cache 32 KB

X87 ALU 0 ALU 1

TLB Miss Handler

L2 TLB

CRI

512KB L2 Cache

HWP for L2

Thread 0 IP

Thread 1 IP

Thread 2 IP

Thread 3 IP

D2 PPF PF D0 D1 WB E

On-Die Interconnect

Instruction Cache Miss

TLB miss

16B/cycle ( 2 IPC)

TLB miss

Data Cache Miss

4 threads in-order




Agenda

• Introduction



• Outlook

• Summary

13




One Source Base, Tuned to many Targets

Multicore

Source

Many-core Cluster

Compilers, Libraries, Parallel Models

Multicore CPU

Multicore CPU

Intel® MIC Architecture

Multicore Cluster

Multicore and Many-core Cluster




Phase Product Feature Benefit

Build

Intel® Advisor XE

Threading design assistant (Studio products only)

• Simplifies, demystifies, and speeds parallel application design

Intel® Composer XE

• C/C++ and Fortran compilers • Intel® Threading Building Blocks • Intel® Cilk™ Plus • Intel® Integrated Performance

Primitives • Intel® Math Kernel Library

• Enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core

Intel® MPI Library†

High Performance Message Passing (MPI) Library

• Enabling High Performance Scalability, Interconnect Independence, Runtime Fabric Selection, and Application Tuning Capability

Verify & Tune

Intel® VTune™

Amplifier XE

Performance Profiler for optimizing application performance and scalability

• Remove guesswork, saves time, makes it easier to find performance and scalability bottlenecks

Intel® Inspector XE

Memory & threading dynamic analysis for code quality

Static Analysis for code quality

• Increased productivity, code quality, and lowers cost, finds memory, threading , and security defects before they happen

Intel® Trace Analyzer & Collector†

MPI Performance Profiler for understanding application correctness & behavior

• Analyze performance of MPI programs and visualize parallel application behavior and communications patterns to identify hotspots

Intel® Developer Products - Intel® Cluster Studio XE 2013




Intel® AVX Vector size: 256 bit Data types: • 32 and 64 bit float VL: 4, 8, 16

Intel® MIC Vector size: 512 bit Data types: • 32 and 64 bit integer • 32 and 64 bit float VL: 8,16

X4

Y4

X4◦Y4

X3

Y3

X3◦Y3

X2

Y2

X2◦Y2

X1

Y1

X1◦Y1

0

X8

Y8

X8◦Y8

X7

Y7

X7◦Y7

X6

Y6

X6◦Y6

X5

Y5

X5◦Y5

255

X4

Y4

X4◦Y4

X3

Y3

X3◦Y3

X2

Y2

X2◦Y2

X1

Y1

X1◦Y1

0

X8

Y8

X8◦Y8

X7

Y7

X7◦Y7

X6

Y6

X6◦Y6

X5

Y5

X5◦Y5

X16

Y16

X16◦Y16

…

...

…

511

Illustrations: Xi, Yi & results 32 bit float

Data Parallelism of Intel® Processors (2)




Intel® Cilk™ Plus Array Notation

<array base> [<lower bound>:<length>[:<stride>]]+

A[:] // All of vector A

B[2:6] // Elements 2 to 7 of vector B

C[:][5] // Column 5 of matrix C

D[0:3:2] // Elements 0,2,4 of vector D

+ + + + + + + +

if (a[:] > b[:])

c[:] = d[:] * e[:];

else

c[:] = d[:] * 2;

A simple and elegant solution:

a language construct for vector level parallelism




Vectorization for MIC: A new Target In

pu

t: C

/C+

+/FO

RTR

AN

so

urc

e co

de

Vectorizer: Map vector parallelism

to vector ISA

Intel® SSE Intel® AVX Intel® MIC

Express/expose vector parallelism

Array Notation

SIMD pragma

Vectorization Hints (ivdep/vector pragmas)

Fully Automatic Analysis

Elemental Function

Optimize and Code Gen

Dat

a p

aral

lel p

art

of

Inte

l® C

ilk™

Plu

s ex

ten

sio

n

Vectorizer makes

retargeting easy!




Many-Core Hosted – “Native” Model

• Enabled by –mmic compiler switch

• Fully supported by compiler vectorization, Intel® MKL, OpenMP*, Intel® TBB, Intel® Cilk Plus, Intel® MPI, …

• Might be an option for some applications:

– Needs to fit into memory !

– Should be highly parallel code

• Serial parts are slower on MIC than on host !

– Limited access to external environment like I/O

• Native MIC file system exists in memory only !

• NFS allows external I/O but …




Many-Core Hosted – with MPI

• MPI ranks on Intel® Xeon PhiTM coprocessors(only)

• All messages into/out of Intel® Xeon PhiTM coprocessors

• Programmed as homogenous network of many-core CPUs:

Xeon

MIC

Xeon

MIC

Xeon

MIC

Xeon

MIC

Network

Data

Data

Data

Data

MPI

20




Heterogeneous Programming

Fortran (CAF)

MKL

TBB

OpenCL

Cilk Plus

C++

Tools

OpenMP

Fortran (CAF)

TBB

OpenCl

Cilk Plus

C++

MKL

Parallel programming is the same on MIC and CPU

OpenMP

Tools

PC

Ie

CPU Executable MIC Native

Executable

Pa

ralle

l

Co

mp

ute

Pa

ralle

l

Co

mp

ute




Heterogeneous Programming

Fortran (CAF)

MKL

TBB

OpenCL

Cilk Plus

C++

Tools

OpenMP

Fortran (CAF)

TBB

OpenCl

Cilk Plus

C++

MKL

Parallel programming is the same on MIC and CPU

OpenMP

Tools

PC

Ie

CPU Executable MIC Native

Executable

Pa

ralle

l

Co

mp

ute

Pa

ralle

l

Co

mp

ute

Offload Directives (Non-Shared Model)

Offload Keywords (Virtual Shared-Memory)




Choices for Offloading Application Code

Two Intel-specific models for offloading are supported ( Intel® Composer 2013 ):

LEO: Language Extensions for Offload for C/C++ and Fortran

Explicit data transfer by compiler directives

MYO: Mine-Your-Ours for C/C++

“Shared virtual memory” – implicit offload controlled by language extensions for variable declaration etc

Offloading and parallelism is orthogonal

Offloading only transfers control to the MIC devices

Parallelism needs to be exploited by a second model (e.g. OpenMP*)

23




LEO: Explicit Offload

Completely realized by directives and ‘attributes’

Ignored by other compilers ( might result in a warning )

Requires ‘bit-wise’ copyable data objects

Programmer designates variables that need to be copied between host and card in the offload directive

C/C++ Example: #pragma offload target(mic) in(data:length(size))

Fortran Example: !dir$ offload target(mic in(a1:length(size))

• Very much influenced accelerator (“target”) extension of coming OpenMP* 4.0 standard #pragma omp target map(to(b:count)) map(from(a:count))

24




C/C++ Extensions for explicit Offload

C/C++ Syntax Semantics

Offload pragma #pragma offload <clauses>

<statement block>

Allow next statement block to execute on Intel® MIC Architecture or host CPU

Keyword for variable & function definitions

__attribute__((target(mic))) Compile function for, or allocate variable on, both CPU and Intel® MIC Architecture

Entire blocks of code #pragma offload_attribute(push,

target(mic))

#pragma offload_attribute(pop)

Mark entire files or large blocks of code for generation on both host CPU and Intel® MIC Architecture

Data transfer #pragma offload_transfer

target(mic)<clauses> Initiates asynchronous data transfer, or initiates and completes synchronous data transfer

Intel® Many Integrated Core Architecture 25




Example

26

float reduction(float *data, int numberOf)

{

float ret = 0.f;

#pragma offload target(mic) in(data:length(numberOf))

{

#pragma omp parallel for reduction(+:ret)

for (int i=0; i < numberOf; ++i)

ret += data[i];

}

return ret;

}

Note: copies numberOf elements to the coprocessor, not numberOf*sizeof(float) bytes – the compiler knows data’s type




Example: Call Intel® MKL on Coprocessor

int main{

// initialize variables …

#pragma offload target(mic) \

in(transa,transb, N, alpha, beta) \

in(A:length(matrix_elements)) \

in(B:length(matrix_elements)) \

inout(C:length(matrix_elements)) \

sgemm(&transa, &transb, &N, &N, &N, &alpha,

A, &N, B, &N, &beta, C, &N);

}

sgemm performs C=beta*C+alpha*A*B, transa and transb regulate the transposition of A and B and the Ns define the sizes of the matrices (see documentation). C is input and output, all others are input only.

MKL will automatically make optimal use of MIC

27




MYO: Implicit Offload

Real language extensions

Not accepted by other compilers

Alternative model for non-compact data objects like a linked list

Programmer marks variables that should be (virtually) shared between host and card

Run-time system automatically maintains coherence at boundary of offloaded code regions

Sample:

_Cilk_shared double foo;

_Offload func(y);

28




MYO Memory Model

Section of memory maintained at the same (!!) virtual address on both the host and coprocessor

Reserving same address range on both devices allows

Seamless sharing of complex pointer-containing data structures

Elimination of user marshaling and data management

Use of simple language extensions to C/C++

Host Memory

MIC

Memory

Offload code

C/C++ executable

Host

Intel® MIC

Same address range




MYO: _Cilk_shared for Data & Routines

Intel® Many Integrated Core Architecture 30

What Syntax Semantics

Function int _Cilk_shared f(int x)

{ return x+1; }

Versions generated for both CPU and card; may be called from either side

Global _Cilk_shared int x = 0; Visible on both sides

File/Function static static _Cilk_shared int

x;

Visible on both sides, only to code within the file/function

Class class _Cilk_shared x {…}; Class methods, members, and and operators are available on both sides

Pointer to shared data int _Cilk_shared *p; p is local (not shared), can point to shared data

A shared pointer int *_Cilk_shared p; p is shared; should only point at shared data

Entire blocks of code #pragma offload_attribute(

push, _Cilk_shared) #pragma

offload_attribute(pop)

Mark entire files or large blocks of code _Cilk_shared using this pragma




Example // Shared variable declaration for pi

_Cilk_shared float pi;

int main()

{

int count = 10000;

// Initialize shared global

// variables

pi = 0.0f;

// Compute pi on target

_Offload compute_pi(count);

pi /= count;

}

31

_Cilk_shared void compute_pi(int count)

{

int i;

#pragma omp parallel for \

reduction(+:pi)

for (i=0; i<count; i++)

{

float t = (float)((i+0.5f)/count);

pi += 4.0f/(1.0f+t*t);

}

}




Offload Models plus MPI

• MPI ranks on Intel® Xeon® processors (only)

• All messages into/out of processors

• Offload models used to accelerate MPI ranks

• Homogenous network of hybrid nodes:

Xeon

MIC

Xeon

MIC

Xeon

MIC

Xeon

MIC

Network

Data

Data

Data

Data

MPI

Offload

32




LLVM* Standard Passes

LLVM* Vectorizer: • Scalarizer • Divergence Analysis • Predicator • Packetizer • bypasses

LLVM* OpenCL Passes: • Barriers • Builtins • Kernel Arguments

LLVM* Standard Passes

LLVM* IR

LLVM* IR

LLVM* IR

Intel® Xeon Phi™ Coprocessor OpenCL* Compiler

OpenCL* LLVM Compiler Optimizer

Clang*

Code Generator

Xeon Phi Code

LLVM* IR

OpenCL*




Intel® VTune™ Amplifier XE 2013 for Intel® Xeon Phi™

34 7/23/2013

Beginning with update 4, the methodology in this presentation is implemented in the “General

Exploration” profile

Select which problem areas you want to analyze

Events to collect will be configured automatically




MIC Debugging with GDB* • Run Host-MIC GDB* on your localhost (you can’t use default

host gdb!)

/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb

Start gdbserver on the Intel® Xeon Phi™Coprocessor

• To remote debug using pipe to ssh (gdb) target extended-remote | ssh –T mic0 ~/gdbserver –multi IP:port

• To remote debug using stdio (gdb) target extended-remote | ssh -T mic0 ~/gdbserver –multi -

To attach to a running application via the process-id (pid)

(gdb) file /local/path/to/application

(gdb) attach <remote-pid>

To run an application directly from GDB* (gdb) file /local/path/to/application

(gdb) set remote exec-file /target/path/to/application




Extension to Microsoft Windows* OS

• Microsoft Window being added as second host operating system

– Windows 2008 Server

– No change for operating system of coprocessor – remains Linux

• Developer environment same as for Linux

– Same programming models

• In beta testing today

• To be released H2/2013

36




Future Product Line: Knights Landing

• Knights Landing is the code name for the 2nd generation product for the Intel® Many Integrated Core Architecture

• Knights Landing targets Intel’s 14nm manufacturing process

• Kights Landing will be productized as a processor (running the host OS) and as a coprocessor ( a PCI end-point device )

• Knights Landing will feature on-package high-bandwidth memory

37




Agenda

• Introduction



• Outlook

• Summary

38




Summary Intel® Xeon Phi™ coprocessor is a real product now !

The flexibility of the programming models offer a solution for all needs: From a single MIC card in a workstation up to thousands of MIC cards attached to HPC cluster nodes

Compilers and other tools make it easy to develop or port code to run applications natively on the coprocessors, heterogeneously on host CPU + Intel® MIC systems and collections of these connected in a compute cluster

No new parallel programming models needed: All Intel-supported models available for MIC too – including innovative models like Coarray-Fortran

Using MIC is a simple extension of CPU programming




Get Educated on Intel® Xeon Phi™ Coprocessors • http://software.intel.com/mic-developer

Intel® Xeon Phi™ Coprocessor (codename Knights Corner): http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

• Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

• An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors: http://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf

• Intel® Manycore Platform Software Stack (MPSS): http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss

• Intel® Xeon Phi™ Coprocessor Developer's Quick Start Guide: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide

• Intel® Xeon Phi™ Coprocessor System Software Developers Guide: http://software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemsoftwaredevelopersguide.pdf

• Intel and Third Party Tools and Libraries available with support for Intel® Xeon Phi™ Coprocessor: http://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm

• Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors - Part 1: Optimization Essentials: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization

• Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2: Understanding and Using Hardware Events: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

Tools That Enable You to be Ready!


http://software.intel.com/mic-developer




http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

















http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding



























http://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf































http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss















http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide



















http://software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemsoftwaredevelopersguide.pdf











http://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm































http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization























































and/or other countries. *Other names and brands may be claimed as the property of others. 41




Vector Instruction Performance

VPU contains 16 SP ALUs, 8 DP ALUs,

Most VPU instructions have a latency of 4 cycles and TPT 1 cycle

Load/Store/Scatter have 7-cycle latency

Convert/Shuffle have 6-cycle latency

VPU instruction are issued in u-pipe

Certain instructions can go to v-pipe also

Vector Mask, Vector Store, Vector Packstore, Vector Prefetch, Scalar

43




Spectrum of Execution Models

General purpose serial and parallel

computing

Codes with highly- parallel phases

Highly-parallel codes

Codes with balanced needs

CPU-Centric Intel® MIC-Centric

Manycore Hosted Symmetric Offload Multi-core Hosted

Main( ) Foo( ) MPI_*()

Foo( )


Main() Foo( ) MPI_*()



Multi-core

Many-core

Intel® Many Integrated Core (MIC)

Codes with serial phases

Reverse Offload


Foo( )

PCIe

Intel® Xeon Processor

Supported with Intel Tools

|---- Heterogenous Models -----|


Intel® Xeon Phi™ Coprocessor - Software...

Documents

Transcript of Intel® Xeon Phi™ Coprocessor - Software...