Transcript of "Performance Analysis of Computer Systems" – Architecture and performance analysis of High Performance Computers
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Performance Analysis of Computer Systems
Introduction
Organization
Lecture: Every Wednesday in INF E001 from 13:00 to 14:30
Labs: Every Thursday in INF E046 from 13:00 to 14:30
First Exercise: October 21st, guided tour through all machine rooms at ZIH
– Meeting point: Trefftz-Bau, below the overbridge
All slides will be in English
Ten minute summary of last lecture at the beginning of each lecture
List of attendees
Slide 2 LARS: Introduction and Motivation
Class Material on the Web
Slides will be put on the web prior to or shortly after each class
The slides from last year are still online.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws0910/lars
Be aware of updates for this term.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1011/lars
Slide 3 LARS: Introduction and Motivation
Class Outline
15 lectures with 14 corresponding exercises
Class structure
– Introduction and motivation
– Performance requirements, metrics, and common evaluation mistakes
– Workload types, selection, and characterization
– Commonly used benchmarks
– Monitoring techniques
– Capacity planning for future systems
– Performance data presentation
– Summarizing measured data
– Regression models
– Experimental design
– Performance simulation and prediction
– Introduction to queuing theory
Slide 4 LARS: Introduction and Motivation
Literature
Raj Jain: The Art of Computer Systems Performance Analysis
John Wiley & Sons, Inc., 1991 (ISBN: 0-471-50336-3)
Rainer Klar, Peter Dauphin, Frank Hartleb, Richard Hofmann, Bernd Mohr, Andreas Quick, Markus Siegle: Messung und Modellierung paralleler und verteilter Rechensysteme. B.G. Teubner Verlag, Stuttgart, 1995 (ISBN: 3-519-02144-7)
Dongarra, Gentzsch, Eds.: Computer Benchmarks. Advances in Parallel Computing 8, North Holland, 1993 (ISBN: 0-444-81518-X)
Slide 5 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Introduction and Motivation
Why is Performance Analysis Important?
Overview
Development of hardware performance
Implications on application performance
Compute power at Technische Universität Dresden
Research at ZIH
Some advertising
Slide 7 LARS: Introduction and Motivation
Moore’s Law: 2X Transistors / “year”
“Cramming More Components onto Integrated Circuits”
Gordon Moore, Electronics, 1965
# of transistors per cost-effective integrated circuit doubles every N months (18 ≤ N ≤ 24)
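The doubling rule above can be sketched numerically. The 4004's 1971 transistor count and the 18-to-24-month range come from this lecture; the 37-year projection target is purely illustrative:

```python
def projected_transistors(initial_count, months_elapsed, doubling_months):
    """Project a transistor count forward under Moore's law:
    the count doubles every `doubling_months` months."""
    return initial_count * 2 ** (months_elapsed / doubling_months)

# Intel 4004 (1971): 2,300 transistors. Project 37 years (444 months)
# ahead with a 24-month doubling period (the slow end of 18 <= N <= 24).
estimate = projected_transistors(2_300, 37 * 12, 24)
print(f"{estimate:.3g}")  # on the order of 10**8 to 10**9 transistors
```

With N = 24 the projection lands within a factor of two of the Core i7's actual 2008 transistor count, which is about as much as a two-point exponential fit can promise.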
Slide 8 LARS: Introduction and Motivation
Performance Development in TOP500
Slide 9 LARS: Introduction and Motivation
John Shalf (NERSC, LBNL)
Slide 10 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
[Chart: total number of processor cores in the Top15 systems, June 1993 to December 2008; y-axis 0 to 1,200,000 processors]
Slide 11 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
Slide 12 LARS: Introduction and Motivation
Cray XT5 (Jaguar) at Oak Ridge National Laboratory
Slide 13 LARS: Introduction and Motivation
Dawning Nebulae at NSCS
Number two in TOP 500 (June 2010)
Installed at National Supercomputing Centre in Shenzhen (China)
Specification not published
Hybrid architecture
Presumably: 4640 nodes, each with
– Two Intel Xeon X5650 processors (10.64 GFLOPS)
– One Nvidia C2050 GPU
Total number of cores:
– 4640 nodes × (12 processor cores + 14 shader clusters) = 120,640 cores
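The core count is simple arithmetic over the presumed node configuration; a quick check, using only the figures from this slide:

```python
nodes = 4640
cpu_cores_per_node = 12        # two 6-core Intel Xeon X5650 per node
shader_clusters_per_node = 14  # one Nvidia C2050 GPU per node

total_cores = nodes * (cpu_cores_per_node + shader_clusters_per_node)
print(total_cores)  # 120640
```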
Slide 14 LARS: Introduction and Motivation
IBM Roadrunner at Los Alamos National Laboratory
First computer to surpass the 1 Petaflop (10^15 FLOPS) barrier
Installed at Los Alamos National Laboratories
Hybrid Architecture
13,824 AMD Opteron cores
116,640 IBM PowerXCell 8i cores
Cost: $120 million
Slide 15 LARS: Introduction and Motivation
IBM BlueGene/P (JUGENE) at Research Centre Jülich
Number five in TOP 500
Installed at Forschungszentrum Jülich
72 racks with 32 node cards × 32 compute cards each (73,728 total)
294,912 PowerPC 450, 850 MHz
144 TB main memory
Slide 16 LARS: Introduction and Motivation
What Kind of Know-How is Required for HPC?
Algorithms and methods
Performance Analysis
Programming (Paradigms and details of implementations)
Operation of supercomputers (network, infrastructure, service, support)
Slide 17 LARS: Introduction and Motivation
Challenges
Languages
– Fortran 95, C/C++, Java, …
– Also scripting languages!
Parallelization:
– MPI, OpenMP
Network
– Ethernet, Infiniband, Myrinet, …
Scheduling
– Distributed components, job scheduling, process scheduling
System architecture
– Processors, memory hierarchy
Slide 18 LARS: Introduction and Motivation
From Modeling to Execution
Slide 20 LARS: Introduction and Motivation
Short History of X86 CPUs
CPU          Year  Bit Width  #Transistors  Clock        Structure  L1 / L2 / L3
4004         1971  4          2,300         740 kHz      10 µm
8008         1972  8          3,500         500 kHz      10 µm
8086         1978  16         29,000        10 MHz       3 µm
80286        1982  16         134,000       25 MHz       1.5 µm
80386        1985  32         275,000       33 MHz       1 µm
80486        1989  32         1,200,000     50 MHz       0.8 µm     8K
Pentium I    1994  32         3,100,000     66 MHz       0.8 µm     8K
Pentium II   1997  32         7,500,000     300 MHz      0.35 µm    16K/512K*
Pentium III  1999  32         9,500,000     600 MHz      0.25 µm    16K/512K*
Pentium IV   2000  32         42,000,000    1.5 GHz      0.18 µm    8K/256K
P IV F       2005  64                       2.8–3.8 GHz  90 nm      16K/2MB
Core i7      2008  64         781,000,000   3.2 GHz      45 nm      32K/256K/8MB
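The table's first and last rows are enough for a sanity check of the 18-to-24-month doubling range from the Moore's law slide. A small sketch, using only the 4004 (1971) and Core i7 (2008) transistor counts:

```python
import math

years = 2008 - 1971                          # 37 years of x86 history
doublings = math.log2(781_000_000 / 2_300)   # about 18.4 doublings
months_per_doubling = years * 12 / doublings

print(f"{months_per_doubling:.1f}")  # about 24 months, the slow end of 18..24
```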
Slide 21 LARS: Introduction and Motivation
Intel Nehalem
Released 2008
4 cores
781,000,000 transistors
45 nm technology
32K L1 data cache, 32K L1 instruction cache
256K L2 cache
8 MB shared L3 cache
Hyperthreading
3.2 GHz*4 cores*4 FLOPS/cycle = 51.2 Gflop/s peak
Integrated memory controller
QPI between processors
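The peak figure on this slide follows directly from clock rate, core count, and the floating point operations each core can retire per cycle; a minimal sketch:

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Theoretical peak: every core retires `flops_per_cycle`
    floating point operations on every clock cycle."""
    return clock_ghz * cores * flops_per_cycle

# Nehalem: 3.2 GHz, 4 cores, 4 FLOPS/cycle (2-wide SSE add + 2-wide SSE mul)
print(peak_gflops(3.2, 4, 4))  # 51.2 GFLOP/s
```

Note that this is an upper bound; a later slide lists the bandwidth and latency limits that keep real applications well below it.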
Slide 22 LARS: Introduction and Motivation
Nehalem Core
[Die diagram: Nehalem core functional blocks: execution units; out-of-order scheduling & retirement; L2 cache & interrupt servicing; instruction fetch & L1 cache; branch prediction; instruction decode & microcode; paging; L1 data cache; memory ordering & execution]
Slide 23 LARS: Introduction and Motivation
Potential factors limiting performance
“Peak performance”
Floating point units
Integer units
… any other feature of micro architecture
Bandwidth (L1,L2,L3, main memory, other cores, other nodes)
Latency (L1,L2,L3, main memory, other cores, other nodes)
Slide 24 LARS: Introduction and Motivation
Performance development in TOP500
Slide 25 LARS: Introduction and Motivation
Does the Rest of the System Keep Pace with the CPU?
[Chart: processor vs. DRAM performance over time, 1980 to 2000, log scale 1 to 1000. µProc: 60%/yr (2X/1.5 yr, "Moore's Law"); DRAM: 9%/yr (2X/10 yrs). The processor-DRAM memory gap (latency) grows by about 50% per year.]
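The widening gap follows from compounding the two growth rates quoted on this slide:

```python
cpu_growth = 1.60   # microprocessor performance: +60% per year
dram_growth = 1.09  # DRAM performance: +9% per year

# Relative widening per year; this is the ~50%/yr gap growth on the slide
gap_per_year = cpu_growth / dram_growth
print(f"{gap_per_year:.2f}")  # about 1.47x per year

# Compounded over the chart's 20-year span (1980-2000)
gap_20yr = gap_per_year ** 20
print(f"{gap_20yr:.0f}")
```

This compounding is why memory hierarchy effects dominate so many of the measurement results later in this course.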
Slide 26 LARS: Introduction and Motivation
Performance Trends measured by SPECint
Source: Hennessy, Patterson: "Computer Architecture: A Quantitative Approach".
Slide 27 LARS: Introduction and Motivation
CPUint2006 development 2005 - 2009
Slide 28 LARS: Introduction and Motivation
Performance Trends measured by SPECint
2009
23%
Slide 29 LARS: Introduction and Motivation
CPUfp2006 development 1991 - 2009
CPU 95
Released 1995
602 results between 3/1991 and 1/2001
CPUfp2000
Released 2000
1385 results between 10/1996 and 2/2007
CPUfp2006
Released 2006
1217 results between 4/1997 and 4/2009
42%
33%
30%
Slide 30 LARS: Introduction and Motivation
Performance Trends over a 20 years life cycle
Slide 31 LARS: Introduction and Motivation
Performance Trends over a 20 years life cycle
Where is your application?
Slide 32 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and HPC
A short introduction
HPC in Germany
Slide 34 LARS: Introduction and Motivation
Responsibilities of ZIH
Providing infrastructure and qualified service for TU Dresden and Saxony
Research topics
– Architecture and performance analysis of High Performance Computers
– Programming methods and techniques for HPC systems
– Software tools to support programming and optimization
– Modeling algorithms of biological processes
– Mathematical models, algorithms, and efficient implementations
Role of mediator between vendors, developers, and users
Pick up and preparation of new concepts, methods, and techniques
Teaching and Education
Slide 35 LARS: Introduction and Motivation
Compute Server Infrastructure
[Diagram: HPC component (6.5 TB main memory) and PC farm, each with its own SAN (68 TB disk capacity each) and a petabyte tape archive (1 PB capacity); link bandwidths 8 GB/s, 4 GB/s, 4 GB/s, and 1.8 GB/s]
HPC component
– SGI® Altix® 4700
– 2048 Montecito cores
– 6.5 TByte main memory
PC farm
– System from Linux Networx
– AMD Opteron CPUs (dual core, 2.6 GHz)
– 728 boards with 2592 cores
– Infiniband network between the nodes
Slide 36 LARS: Introduction and Motivation
HPC-System: SGI Altix 4700 (Mars)
32 × 42U racks
1024 sockets with Itanium 2 Montecito dual-core CPUs (1.6 GHz / 9 MB L3 cache)
13 TFlop/s peak performance
11.9 TFlop/s linpack
6.5 TB shared memory
Slide 37 LARS: Introduction and Motivation
Linux Networx PC-Farm (Deimos)
– 26 water-cooled racks (Knürr)
– 1296 AMD Opteron x85 dual-core CPUs (2.6 GHz)
– 728 compute nodes with 2 (384), 4 (232), or 8 (112) cores
– 2 master and 11 Lustre servers
– 2 GB memory per core
– 68 TB SAN disk (RAID 6)
– Local scratch disks (70, 150, 290 GB)
– Two 4x Infiniband fabrics (MPI + I/O)
– OS: SuSE SLES 10
– Batch system: LSF
– Compilers: Pathscale, PGI, Intel, GNU
– ISV codes: Ansys100, CFX, Fluent, Gaussian, LS-DYNA, Matlab, MSC
Slide 38 LARS: Introduction and Motivation
Computer Rooms – Extension to the Building
Slide 39 LARS: Introduction and Motivation
Performance of Supercomputers at ZIH
Slide 40 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Research at ZIH
Selected Projects and Activities
Research Areas at ZIH
Software tools to support programming and optimization
Programming methods and techniques for high performance computers
Grid computing
Mathematical methods, algorithms, and efficient implementations
Architecture and performance analysis of high performance computers
Algorithms and methods for modeling biological processes
Slide 42 LARS: Introduction and Motivation
Software Tools …
Vampir
– Visualization and analysis of parallel applications
Marmot
– Detection of incorrect usage of the MPI communication library
ParBench
– Analysis of multiprogramming properties
BenchIT
– Execution/archiving/presentation of benchmarks and their results
Screenshots: Marmot for Windows
Slide 43 LARS: Introduction and Motivation
Vampir: Framework
Slide 44 LARS: Introduction and Motivation
Vampir: Timelines
Slide 45 LARS: Introduction and Motivation
Vampir: Summaries
Slide 46 LARS: Introduction and Motivation
BenchIT
BenchIT measurement core
Command line interface
GUI
Website
Slide 47 LARS: Introduction and Motivation
Cluster Challenge 2008
Challenge:
– 6 students
– 44 hours
– 1 (self-assembled) cluster with at most 3.1 kW power consumption
– 5 scientific applications
Goal:
– Maximum job throughput within the competition time
Field of competitors:
Purdue University with SiCortex, University of Alberta with SGI, TUD/IU with IBM & Myricom, Taiwan with HP, Arizona State with Cray/MS, Colorado with Aspen Systems, MIT with Dell
Slide 48 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 49 LARS: Introduction and Motivation
Cluster Challenge 2008
Hardware optimizations
– 10G Myrinet interconnect (~120 W for switch + host adapters)
– Optimal DIMM configuration for the applications (16 GB per node)
– Booting from USB sticks, using the local disks only when necessary
– Measuring the applications' power consumption profiles to choose the "right" total number of nodes
Software optimizations
– Where sensible, use of commercial compilers (significant effort)
– Tracing the applications to understand and optimize their communication
Throughput optimizations
– Using the power consumption and runtime estimates for optimal utilization of the cluster
Result: 1st place
Slide 50 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 51 LARS: Introduction and Motivation
Infrastructure
High performance computers:
Workstations:
Slide 53 LARS: Introduction and Motivation
International Collaboration
Tracing
ParMA
VI HPS
Open MPI
Slide 54 LARS: Introduction and Motivation
Future Prospects
In the many-core era, parallel computing is becoming ever more important
Contacts with international partners
Industry contacts: IBM, SUN, Cray, SGI, NEC, Intel, AMD, …
Possible stays abroad or industry internships
– Examples of stays abroad:
• LLNL, CA, U.S.A.
• BSC, Barcelona, Spain
• Eugene, OR, U.S.A.
– Examples of internships:
• Cray
• IBM
Slide 55 LARS: Introduction and Motivation
Evaluation of the GCC Plug-in Interface
Topic: Evaluation of GCC's new plug-in interface with regard to the instrumentation of HPC programs
Questions:
– What innovations and advantages does the plug-in mechanism offer?
– How can GCC plug-ins be used to instrument HPC programs?
– Is efficient filtering at runtime possible?
– Comparison with conventional instrumentation
Advisor: Bert Wesarg ([email protected])
Slide 57 LARS: Introduction and Motivation
Program Trace Analysis with Signal Processing
Topic: Evaluation of analysis methods from signal processing with regard to program traces
Questions:
– How can program traces sensibly be mapped onto signals?
– To what extent are signal processing methods (sampling, wavelet transformation, correlation) suited to more efficient processing of performance data from program traces?
– Is automatic pattern recognition and grouping of the data possible?
Advisor: Matthias Weber ([email protected])
Slide 58 LARS: Introduction and Motivation
Performance Analysis for Speedstep Architectures
Topic: "Improving performance analysis for multicore architectures and systems with Speedstep capabilities"
Questions:
– Investigating the options under Linux for determining which CPU core executes a given process
– Integrating this information into program traces
– Finding a portable and non-intrusive way to record clock frequency changes of CPU cores
– Based on this, normalizing time intervals in program traces
Advisor: Jens Doleschal ([email protected])
Slide 59 LARS: Introduction and Motivation
Performance Analysis and Software Development
Topic: Performance analysis as an integral part of software development
Questions:
– Integration of performance analysis (Vampir) into an IDE (Eclipse)
– Suitable abstraction and presentation of "performance summaries"
– Integration of parallel performance analysis into the software development process
– Advisors: Matthias Mueller, Andreas Knüpfer
Slide 60 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Thank you!
Hope to see you next time…