
1 | Stanford HPC Workshop 2011

Agenda

• AMD Opteron 6200 and 4200 Series Processor Overview

• AMD HPC Ecosystem and Developer Information

• Resources for Developers and Customer Benchmarking

2 | Stanford HPC Workshop 2011

INTRODUCING THE NEW AMD OPTERON 6200 AND 4200 SERIES PROCESSORS

AMD Opteron™ 6200 Series Processor (“Interlagos”): the world’s first x86 16-core processor¹

• Scalability: up to 4 sockets with up to 16 cores

• Memory: 4 memory channels, up to 1600 MHz memory

• Cache: L1 - 16KB data per core + 64KB instruction per module; L2 - 1MB per core; L3 - 16MB per socket

• I/O: four x16 HyperTransport™ technology 3.0 links @ up to 6.4GT/s per link

• Power: 85W to 140W TDP (consistent with AMD Opteron 6100 Series)

AMD Opteron™ 4200 Series Processor (“Valencia”): the world’s lowest x86 power-per-core²

• Scalability: up to 2 sockets with up to 8 cores

• Memory: 2 memory channels, up to 1600 MHz memory

• Cache: L1 - 16KB data per core + 64KB instruction per module; L2 - 1MB per core; L3 - 8MB per socket

• I/O: three x16 HyperTransport™ technology 3.0 links @ up to 6.4GT/s per link

• Power: 35W to 95W TDP (consistent with AMD Opteron 4100 Series)

Both series

• Frequency: up to 3.3 GHz base frequency and up to 3.7 GHz frequency using AMD Turbo CORE technology*
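The memory figures above translate into a theoretical peak bandwidth per socket. The sketch below is a back-of-the-envelope calculation only; it assumes that "1600 MHz memory" means DDR3-1600 (1600 mega-transfers per second) and that each channel is 64 bits wide, neither of which is stated on the slide. The function name is illustrative.

```python
# Assumption: each DDR3 channel moves 64 bits (8 bytes) per transfer.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak memory bandwidth per socket in GB/s (10^9 bytes/s)."""
    bytes_per_second = channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER
    return bytes_per_second / 1e9


# Channel counts come from the slide; 1600 MT/s is the assumed DDR3-1600 rate.
opteron_6200 = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=1600)
opteron_4200 = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=1600)

print(f"6200 Series: {opteron_6200:.1f} GB/s per socket")
print(f"4200 Series: {opteron_4200:.1f} GB/s per socket")
```

These are upper bounds from the bus arithmetic, not measured throughput; sustained bandwidth on a real workload will be lower.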

¹ Based on 16-core AMD Opteron 6200 Series processor compared to 6-core Intel Xeon 5600 Series and 10-core Intel Xeon E7 processors.

² As of Nov 1, 2011, AMD Opteron™ processor Models 4200 EE have the lowest known power per core of any x86 server processor, at 35W TDP (35W / 8 = 4.375W/core). Intel's lowest power per core server processor, L5630, is 40W TDP (40W / 4 = 10W/core). See http://www.intel.com/Assets/PDF/prodbrief/323501.pdf. Previous record held by AMD Opteron processor Models 4100 EE at 35W TDP / 6 cores = 5.83W/core.
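The power-per-core figures in the footnote are simple divisions of TDP by core count. A minimal sketch that reproduces them, using only the TDP and core counts quoted above (the function and dictionary names are illustrative):

```python
def watts_per_core(tdp_watts: float, cores: int) -> float:
    """Return the TDP divided evenly across all cores, in watts per core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return tdp_watts / cores


# (TDP in watts, core count) exactly as quoted in the footnote.
parts = {
    "AMD Opteron 4200 EE": (35, 8),
    "Intel Xeon L5630": (40, 4),
    "AMD Opteron 4100 EE": (35, 6),
}

for name, (tdp, cores) in parts.items():
    print(f"{name}: {watts_per_core(tdp, cores):.3f} W/core")
```

Note that TDP per core is a design-envelope metric; it says nothing about measured power draw under a given workload.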

3 | Stanford HPC Workshop 2011

“BULLDOZER” MODULE TECHNOLOGY

Full Performance From Each Core

Leadership Multi-Threaded Micro-Architecture

Shared Double-sized FPU

Amortizes very powerful 256-bit unit across both cores

Improved IPC

Micro-architecture and ISA enhancements

SSE4.1/4.2, AVX, FMA4, SSSE3, XOP

Virtualization Enhancements

Faster switching between VMs

AMD-V extended migration support

High Frequency / Low-Power Design

Core Performance Boost

“Boosts” frequency of cores when available power allows

No idle core requirement

Power efficiency enhancements

Significantly reduced leakage power

More aggressive dynamic power management

Dedicated execution units per core

No shared execution units as with SMT

[Diagram legend: Dedicated Components; Shared at the module level; Shared at the chip level]

4 | Stanford HPC Workshop 2011

FLEX FP 256-BIT FPU AND NEW "BULLDOZER" INSTRUCTIONS

• A flexible floating point unit shared between 2 integer cores

• Simultaneously executes two 128-bit instructions or one 256-bit instruction

• Saves die space and conserves power for majority of non-FP applications

• Dedicated floating point scheduler, which minimizes latency for floating point applications

Instructions and Applications/Use Cases

SSSE3, SSE4.1, SSE4.2 (AMD and Intel)

• Video encoding and transcoding

• Biometrics algorithms

• Text-intensive applications

AESNI, PCLMULQDQ (AMD and Intel)

• Applications using AES encryption

• Secure network transactions

• Disk encryption (MSFT BitLocker)

• Database encryption (Oracle)

• Cloud security

AVX (AMD and Intel): floating point intensive applications

• Signal processing / Seismic

• Multimedia

• Scientific simulations

• Financial analytics

• 3D modeling

FMA4 (AMD Unique)

• Vector and matrix multiplications

• Polynomial evaluations

• Chemistry, physics, quantum mechanics and digital signal processing

XOP (AMD Unique)

• Numeric applications

• Multimedia applications

• Algorithms used for audio/radio
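The FMA4 use cases listed above (vector and matrix multiplications, polynomial evaluations) share one pattern: a chain of d = a * b + c steps. The sketch below shows that pattern in plain Python purely as an illustration. Python does not emit FMA4 instructions, and `fused_multiply_add` is a hypothetical stand-in that rounds twice, whereas a hardware fused multiply-add rounds once.

```python
def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Stand-in for d = a * b + c, the operation an FMA instruction performs.

    Plain Python rounds the product and the sum separately; this helper only
    marks where a compiler could emit a single fused instruction.
    """
    return a * b + c


def horner(coeffs, x):
    """Evaluate a polynomial with Horner's rule; coeffs are highest degree first."""
    acc = 0.0
    for c in coeffs:
        acc = fused_multiply_add(acc, x, c)  # one multiply-add per coefficient
    return acc


def dot(u, v):
    """Dot product as an accumulation of multiply-adds."""
    acc = 0.0
    for ui, vi in zip(u, v):
        acc = fused_multiply_add(ui, vi, acc)
    return acc


print(horner([3.0, 2.0, 1.0], 2.0))          # 3x^2 + 2x + 1 at x = 2
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

In compiled C or Fortran, loops of this shape are the ones a compiler targeting “Bulldozer” can turn into FMA4 instructions.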

5 | Stanford HPC Workshop 2011

DESIGNED TO DRIVE DOWN POWER REQUIREMENTS

More Low Power Memory Choices: Low and Ultra Low Voltage Memory

• 1.35v DIMMs reduce voltage by 10%; 1.25v DIMMs reduce voltage by 16%¹

Reduces Idle CPU Power By Up to 46%²: C6 power state

• Shuts down clocks and power to idle cores

Intelligent Circuit Design: All New Design

• Minimizes the number of active transistors for lower power and better performance

Enables More Power Control for IT: TDP Power Cap

• Flexibility to set power limits without capping frequency

Up to 56% better power-per-core than Xeon³

¹ Regular voltage = 1.5v, low voltage = 1.35v, ultra-low voltage = 1.25v.

² Based on internal testing as of 8/2011: AMD Opteron™ processor model 6174 (12-core 2.2GHz) consumes 11.7W in active idle C1E power state, while AMD Opteron™ processor model 6276 (16-core 2.3GHz) consumes only 6.4W in the active idle C1E power state with new C6 power gating employed. System configuration: “Drachma” reference design kit, 32GB (8 x 4GB DDR3-1333) memory, 500GB SATA disk drive, Microsoft® Windows Server® 2008 x64 Enterprise Edition R2. SVR-60.

³ Based on AMD Opteron 4200 Series processor with 8 cores at 35W TDP versus lowest wattage, highest core Intel Xeon processor with 6 cores at 60W TDP.
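The voltage and power-per-core percentages on this slide follow from the numbers in its footnotes. A small sketch of that arithmetic, using only the quoted voltages, TDPs and core counts (the helper name is illustrative):

```python
def percent_reduction(baseline: float, reduced: float) -> float:
    """Percentage by which `reduced` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - reduced) / baseline * 100.0


# DIMM voltages from footnote 1.
REGULAR_V, LOW_V, ULTRA_LOW_V = 1.5, 1.35, 1.25
low_v_saving = percent_reduction(REGULAR_V, LOW_V)
ultra_low_v_saving = percent_reduction(REGULAR_V, ULTRA_LOW_V)

# TDP and core counts from footnote 3.
opteron_w_per_core = 35 / 8
xeon_w_per_core = 60 / 6
per_core_saving = percent_reduction(xeon_w_per_core, opteron_w_per_core)

print(f"1.35v vs 1.5v: {low_v_saving:.1f}% lower voltage")
print(f"1.25v vs 1.5v: {ultra_low_v_saving:.1f}% lower voltage")
print(f"Power per core vs Xeon: {per_core_saving:.2f}% lower")
```

The ultra-low-voltage figure works out to roughly 16.7%, which the slide states as 16%.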

6 | Stanford HPC Workshop 2011

AMD TURBO CORE TECHNOLOGY

[Diagram: three frequency states: base frequency with TDP headroom; all core boost activated (up to 500MHz); max turbo activated (up to 1GHz+, half cores)]

All Core Boost: When there is TDP headroom in a given workload, AMD Turbo CORE technology is automatically activated and can increase clock speeds by 300-500 MHz* across all cores.

Max Turbo Boost: When a lightly threaded workload sends half the “Bulldozer” modules into C6 sleep state but also requests max performance, AMD Turbo CORE technology can increase clock speeds by up to 1 GHz+* across half the cores.

* Based on AMD Opteron™ 6200 Series processors with up to 300 MHz in P1 boost state and up to 1 GHz+ in P0 boost state over base P2 clock frequency.
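The two boost modes described above can be read as a simple decision rule. The sketch below is a deliberately simplified model, not AMD's actual power-management algorithm: the function name, the boolean headroom flag, and the 2000 MHz base clock are all hypothetical, and the boost offsets are the "up to" ceilings quoted on the slide rather than rated frequencies of any specific part.

```python
# Ceilings taken from the slide's "up to" figures; real parts may boost less.
ALL_CORE_BOOST_MHZ = 500   # "All core boost activated (up to 500MHz)"
MAX_TURBO_MHZ = 1000       # "Max turbo activated (up to 1GHz+, half cores)"


def frequency_ceiling_mhz(base_mhz: int, tdp_headroom: bool,
                          fraction_modules_in_c6: float) -> int:
    """Upper bound on core clock under a simplified Turbo CORE model.

    Assumption: boost requires TDP headroom, and max turbo additionally
    requires at least half of the modules to be in the C6 sleep state.
    """
    if not 0.0 <= fraction_modules_in_c6 <= 1.0:
        raise ValueError("fraction_modules_in_c6 must be between 0 and 1")
    if not tdp_headroom:
        return base_mhz                       # stay at the base (P2) clock
    if fraction_modules_in_c6 >= 0.5:
        return base_mhz + MAX_TURBO_MHZ       # max turbo on the active half
    return base_mhz + ALL_CORE_BOOST_MHZ      # all-core boost


BASE_MHZ = 2000  # hypothetical base clock, for illustration only
print(frequency_ceiling_mhz(BASE_MHZ, tdp_headroom=False, fraction_modules_in_c6=0.0))
print(frequency_ceiling_mhz(BASE_MHZ, tdp_headroom=True, fraction_modules_in_c6=0.0))
print(frequency_ceiling_mhz(BASE_MHZ, tdp_headroom=True, fraction_modules_in_c6=0.5))
```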

7 | Stanford HPC Workshop 2011

THE HPC LEADER

¹⁻³ See complete benchmark data on slides 27-29.

Customer Requirements:

Scalable performance

Strong floating point performance

High memory throughput

More cores for highly threaded apps

Wide range of technical instructions

Linux OS

Open64

GCC

PGI Compilers

Superior Performance¹: 24-88% better performance at significantly lower price¹

Greatest FLOPs per Sq. Foot: With almost twice the FLOPs per sq. ft. with “Interlagos”, it would take Intel almost 2 racks to match AMD in density and performance²

[Chart: HPC benchmark comparison, Xeon 5670 vs. AMD Opteron 6276, across SPECFP, STREAM, LINPACK, LAMMPS, NAMD and WRF; vertical axis up to 200]

73GB/s memory throughput³

73% more memory bandwidth than Intel³

Maximum cores per rack²

More FLOPs per sq. foot²

33% lower cost per core⁴

HPC ISV ECOSYSTEM & DEVELOPER INFORMATION

Scot Schultz, Sr. Strategic Alliance Manager, HPC


9 | Stanford HPC Workshop 2011

AMD STRATEGIC HPC ISV COMMERCIAL PARTNER ECOSYSTEM INCLUDES LEADING GLOBAL FORTUNE 500 COMPANIES…

10 | Stanford HPC Workshop 2011

DESIGN/SIMULATION – EVERY CUSTOMER IS UNIQUE

Ansys Mechanical

Workbench

Algor

Moldflow

CFDesign

NX Mechatronics

NX Nastran

CATIA/DELMIA SIMULIA

Abaqus/Standard

MSC Nastran

LS-DYNA Implicit

LS-DYNA Explicit

Marc, Dytran, Adams

STAR products

HyperWorks RADIOSS Acusolve

PAM-series

PowerFLOW

XFlow

Fluent CFX

COMSOL Multiphysics

CoreTech Moldex3D

DEM Solutions EDEM

Flow Science Flow3D

Impetus AFEA

Metariver Technology SAMADAII

Wolfram Mathematica

From conceptual design to simulation – customers depend on many ISV packages

Adoption of AMD technology ensures optimized multi-disciplinary, concurrent workflows and accelerated design practices.

11 | Stanford HPC Workshop 2011

HPC ISV ECOSYSTEM ENGAGEMENT STRATEGY

AMD is engaged with Commercial HPC ISV partners and Open Source Software applications world-wide

• Focus is on initial performance, tuning and identifying optimization opportunities with ISV receiving development platforms

AMD is engaged with interconnect partners, such as Mellanox Technologies ConnectX® technology

• Focus is on ensuring drivers and OFED releases are interoperable and optimized

AMD is engaged with community of middleware, MPI, job schedulers and virtual SMP technology companies

• Focus is on ensuring successfully tuned and optimized solutions on AMD hardware

12 | Stanford HPC Workshop 2011

OPTIMIZING HPC APPLICATIONS… START HERE!

Application optimization by recompiling with optimized compiler and tuning for new architecture

Application optimization by linking to ACML 5.x (AMD Core Math Library)

Open Source Software optimization by recompiling with optimized compiler

13 | Stanford HPC Workshop 2011

APPLICATION OPTIMIZATION BY LINKING TO ACML

ACML 5.0 Overview

• Full implementation of Level 1, 2 and 3 Basic Linear Algebra Subprograms (BLAS), with key routines optimized for high performance on AMD Opteron™ processors.

• A full suite of Linear Algebra (LAPACK) routines. As well as taking advantage of the highly-tuned BLAS kernels, a key set of LAPACK routines has been further optimized to achieve considerably higher performance than standard LAPACK implementations.

• A comprehensive suite of Fast Fourier Transforms (FFTs) in single-, double-, single-complex and double-complex data types.

• Random Number Generators in both single- and double-precision.

Compiler Support

• Absoft Pro Fortran

• GFORTRAN

• Intel Fortran (Linux, Windows)

• NAG Fortran

• Open64

• PGI Fortran (Linux, Windows)

For more information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx
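The three BLAS levels mentioned in the overview differ in what they operate on: vectors, a matrix and a vector, or two matrices. The sketch below spells out one representative operation per level in plain Python. It is a reference for the mathematics only and is not the ACML API; the function signatures are simplified and the routines are unoptimized.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha * A @ x + beta * y."""
    return [
        alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
        for row, yi in zip(a, y)
    ]


def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha * A @ B + beta * C."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a row-column dot product
    return [
        [
            alpha * sum(x * y for x, y in zip(row, col)) + beta * cij
            for col, cij in zip(cols_b, c_row)
        ]
        for row, c_row in zip(a, c)
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(axpy(2, [1, 2], [3, 4]))
print(gemv(1, A, [1, 1], 0, [0, 0]))
print(gemm(1, A, B, 0, [[0, 0], [0, 0]]))
```

Level 3 routines such as GEMM do the most arithmetic per byte of data moved, which is why tuned libraries concentrate their optimization effort there.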

14 | Stanford HPC Workshop 2011

APPLICATION OPTIMIZATION BY LINKING TO ACML

ACML 5.0 (Aug 2011)

• Linear Algebra: SGEMM (single precision); DGEMM (double precision); L1 BLAS

• Fast Fourier Transforms (FFT): Complex-to-Complex (C-C) single precision FFTs

• Others: Random Number Generators; AVX compiler switch for Fortran

• Compiler Support: Absoft; GCC 4.6; Open64 4.2.5; PGI 11.8, 11.9; ICC 12; Cray to begin deployment of ACML with their compiler with ACML 5.0

ACML 5.1 (Dec 2011)

• Linear Algebra: CGEMM (complex single precision); ZGEMM (complex double precision)

• Fast Fourier Transforms (FFT): Real-to-complex (R-C) single precision FFTs; double precision C-C and R-C FFTs

• Compiler Support: All compilers listed for ACML 5.0 will be supported

For additional information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx
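The routine names on this slide (SGEMM, DGEMM, CGEMM, ZGEMM) follow the standard BLAS naming convention, in which the first letter encodes the data type. A minimal sketch of that convention; the helper name is illustrative and is not part of ACML:

```python
# Standard BLAS type prefixes.
BLAS_PREFIXES = {
    "S": "single-precision real",
    "D": "double-precision real",
    "C": "single-precision complex",
    "Z": "double-precision complex",
}


def describe_routine(name: str) -> str:
    """Split a BLAS routine name into its operation and data type."""
    prefix, operation = name[0].upper(), name[1:].upper()
    if prefix not in BLAS_PREFIXES:
        raise ValueError(f"unknown BLAS type prefix: {prefix!r}")
    return f"{operation}: {BLAS_PREFIXES[prefix]}"


for routine in ("SGEMM", "DGEMM", "CGEMM", "ZGEMM"):
    print(routine, "->", describe_routine(routine))
```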

15 | Stanford HPC Workshop 2011

APPLICATION OPTIMIZATION BY RECOMPILE

“Bulldozer” compiler optimizations enabled by -march=bdver1*

• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA, and XOP)

• Automatically selects instructions to improve performance (intrinsics and inline)

• Automatic calls to libM (math library) functions that use these new instructions

• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes

• Adjusted to take advantage of the improved hardware prefetcher

• Improvements in code layout and alignment to take advantage of shared compute unit, e.g. “dispatch scheduling”

Production quality code generation tool designed for high performance parallel computing workloads and enabling the developer to build and optimize C, C++, and Fortran applications targeting x86 Linux platforms

* Additional information: http://developer.amd.com/tools/open64/Documents/open64.html

16 | Stanford HPC Workshop 2011

APPLICATION OPTIMIZATION BY RECOMPILING

[Chart: Relative improvements on the Polyhedron 2011 benchmark using the Open64 4.2.5.2 compiler and AMD Opteron™ Model 6272 processor*; series O3 and Ofast; configurations 4.2.3, barcelona, +bdver1, +AVX, +FMA4; vertical axis 95% to 125%]

More info on Polyhedron 2011: http://www.polyhedron.com

More info on Open64: http://developer.amd.com/open64

17 | Stanford HPC Workshop 2011

APPLICATION OPTIMIZATION BY RECOMPILING

“Bulldozer” compiler optimizations enabled by -march=bdver1

• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA, and XOP)

• Automatically selects instructions to improve performance (intrinsics and inline)

• Scalar and vector libm calls available with AMD Libm

• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes

• Memset/Memcpy inliner heuristics

• Defaults to 128-bit vectorization

• Improvements in code layout and alignment

Available from the Free Software Foundation (FSF) and offering support for the latest AMD processor-based platforms. GCC 4.6 includes support for AMD Opteron 4200 and 6200 Series processors.

Additional information: http://developer.amd.com/tools/gnu/pages/default.aspx
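Both the Open64 and GCC slides enable the “Bulldozer” optimizations with the same -march=bdver1 switch. The sketch below only assembles an example command line and does not run a compiler. It assumes the C driver names `gcc` and `opencc`; the source and output file names are hypothetical, and the optimization levels are the O3 and Ofast settings named in the Polyhedron chart.

```python
import shlex

# Assumed C driver names for each compiler suite.
DRIVERS = {"gcc": "gcc", "open64": "opencc"}


def bulldozer_compile_command(compiler, source, output, opt_level="-O3"):
    """Build (but do not execute) a compile command targeting bdver1."""
    if compiler not in DRIVERS:
        raise ValueError(f"unsupported compiler: {compiler!r}")
    return [DRIVERS[compiler], opt_level, "-march=bdver1", source, "-o", output]


# "kernel.c" and "kernel" are placeholder file names.
print(shlex.join(bulldozer_compile_command("gcc", "kernel.c", "kernel")))
print(shlex.join(bulldozer_compile_command("open64", "kernel.c", "kernel", "-Ofast")))
```

Keeping the command as a list of arguments makes it straightforward to hand to a build script later without shell-quoting problems.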

18 | Stanford HPC Workshop 2011

APPLICATION OPTIMIZATION BY RECOMPILING

Compiler Support Summary

Columns: Compiler; Status; SSSE3; SSE4.1-.2; AVX; FMA4; XOP; Auto Generates Code; Comments

• GCC 4.6.2: Available. GCC 4.4 is included in RHEL 6.0 distribution and should be updated to GCC 4.6.2 for optimized support.

• Microsoft Visual Studio 2010 SP1: Available. Auto generates code: No. Supports new instructions but does not auto generate code.

• Open64 4.2.5: Available. http://developer.amd.com/open64

• Open64 4.5: Planned for Dec 2011. Will provide incremental performance and functionality improvements.

• PGI 11.9: Available. PGI Unified Binary™ technology combines into a single executable or object file code optimized for multiple AMD and Intel processors.

Compiler Optimization Quick Guide: http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf
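The GCC entry above implies a quick pre-flight check: a distribution's default GCC 4.4 predates the support that the slides attribute to GCC 4.6. A minimal sketch of that check, assuming the version string is supplied as plain dotted digits (for example, as printed by `gcc -dumpversion`); the function names are illustrative.

```python
# The slides state that GCC 4.6 includes support for the 4200/6200 Series.
MIN_GCC_FOR_BDVER1 = (4, 6)


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '4.6.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def gcc_supports_bdver1(version_text: str) -> bool:
    """True if the given GCC version is at least 4.6."""
    return parse_version(version_text)[:2] >= MIN_GCC_FOR_BDVER1


for version in ("4.4.7", "4.6.2"):
    status = "ok" if gcc_supports_bdver1(version) else "upgrade recommended"
    print(f"GCC {version}: {status}")
```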

19 | Stanford HPC Workshop 2011

AMD DEVELOPER CENTRAL // CODE FASTER, FASTER CODE

The central resource for tools, technologies, best practices, and expert guidance to optimize your software solution performance on AMD platforms.

RESOURCES FOR ISV DEVELOPERS AND CUSTOMER BENCHMARKING

Scot Schultz, Sr. Strategic Alliance Manager, HPC


21 | Stanford HPC Workshop 2011

LATEST AVAILABLE AMD OPTERON™ CLUSTER RESOURCES

Vesta

(11) Dell™ PowerEdge R815 Compute Nodes, AMD Opteron™ 6276 Compute Nodes

704 CPU Cores, 128GB DDR3-1333/node, Dual Mellanox ConnectX®-2 40Gb/s IB

Request access at http://www.hpcadvisorycouncil.com

Dodecas

(8) AMD 6000 Series Platforms, AMD Opteron™ 6276 Compute Nodes + AMD ATI FirePro 7800 GPU

256 CPU Cores, 32GB DDR3-1066/node, Mellanox ConnectX® 40Gb/s InfiniBand

Request access at http://www.hpcadvisorycouncil.com

Mercury

(6) Dell™ C6145 Compute Nodes, AMD Opteron™ 6276 Compute Nodes

384 CPU Cores, 128GB DDR3-1333/node, Dual Mellanox ConnectX®-2 40Gb/s IB

Request access at http://www.hpcadvisorycouncil.com
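The node and core counts listed for the three clusters can be cross-checked against the 16-core Opteron 6276. The sketch below derives cores per node and the implied socket count per node; the socket counts are inferred from the arithmetic and are not stated on the slide.

```python
# (node count, total CPU cores) as listed for each cluster.
CLUSTERS = {
    "Vesta": (11, 704),
    "Dodecas": (8, 256),
    "Mercury": (6, 384),
}
CORES_PER_SOCKET = 16  # the Opteron 6276 is described earlier as a 16-core part


def cores_per_node(nodes: int, total_cores: int) -> int:
    """Cores in each node, assuming all nodes are identical."""
    per_node, remainder = divmod(total_cores, nodes)
    if remainder:
        raise ValueError("total cores do not divide evenly across nodes")
    return per_node


for name, (nodes, total) in CLUSTERS.items():
    per_node = cores_per_node(nodes, total)
    sockets = per_node // CORES_PER_SOCKET  # inferred, not from the slide
    print(f"{name}: {per_node} cores/node, {sockets} sockets/node (inferred)")
```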

22 | Stanford HPC Workshop 2011

REFERENCES

x86 Compiler Quick Reference Guide for “Bulldozer” processors

http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf

Using the x86 Open64 Compiler Suite

http://developer.amd.com/tools/open64/Documents/open64.html

x86 Open64 4.2.5.2 Release Notes

http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt

ACML 5.0 Information

http://developer.amd.com/libraries/acml/features/pages/default.aspx

Software Optimization Guide for “Bulldozer” processors

http://support.amd.com/us/Processor_TechDocs/47414.pdf

AMD64 Architecture Programmer’s Manual: 128-Bit and 256-Bit XOP and FMA4 Instructions

http://support.amd.com/us/Embedded_TechDocs/43479.pdf

23 | Stanford HPC Workshop 2011

Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2011 Advanced Micro Devices, Inc. All rights reserved.