Transcript of "Performance Analysis of Computer Systems" – Architecture and performance analysis of High Performance Computers
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Performance Analysis of Computer Systems
Introduction
Organization
Lecture: Every Wednesday in INF E001 from 13:00 to 14:30
Labs: Every Thursday in INF E046 from 13:00 to 14:30
First Exercise: October 21st, guided tour through all machine rooms at ZIH
– Meeting point: Trefftz-Bau, below the overbridge
All slides will be in English
Ten minute summary of last lecture at the beginning of each lecture
List of attendees
Slide 2 LARS: Introduction and Motivation
Class Material on the Web
Slides will be put on the web prior to or shortly after each class
The slides from last year are still online.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws0910/lars
Be aware of updates for this term.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1011/lars
Slide 3 LARS: Introduction and Motivation
Class Outline
15 lectures with 14 corresponding exercises
Class structure
– Introduction and motivation
– Performance requirements, metrics, and common evaluation mistakes
– Workload types, selection, and characterization
– Commonly used benchmarks
– Monitoring techniques
– Capacity planning for future systems
– Performance data presentation
– Summarizing measured data
– Regression models
– Experimental design
– Performance simulation and prediction
– Introduction to queuing theory
Slide 4 LARS: Introduction and Motivation
Literature
Raj Jain: The Art of Computer Systems Performance Analysis
John Wiley & Sons, Inc., 1991 (ISBN: 0-471-50336-3)
Rainer Klar, Peter Dauphin, Frank Hartleb, Richard Hofmann, Bernd Mohr, Andreas Quick, Markus Siegle: Messung und Modellierung paralleler und verteilter Rechensysteme. B.G. Teubner Verlag, Stuttgart, 1995 (ISBN: 3-519-02144-7)
Dongarra, Gentzsch, Eds.: Computer Benchmarks. Advances in Parallel Computing 8, North Holland, 1993 (ISBN: 0-444-81518-X)
Slide 5 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Introduction and Motivation
Why is Performance Analysis Important?
Overview
Development of hardware performance
Implications on application performance
Compute power at Technische Universität Dresden
Research at ZIH
Some advertising
Slide 7 LARS: Introduction and Motivation
Moore’s Law: 2X Transistors / “year”
“Cramming More Components onto Integrated Circuits”
Gordon Moore, Electronics, 1965
# of transistors per cost-effective integrated circuit doubles every N months (18 ≤ N ≤ 24)
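The doubling rule above can be sketched numerically. The 4004's 1971 transistor count and the 18-to-24-month range come from this lecture; the 37-year projection target is purely illustrative:

```python
def projected_transistors(initial_count, months_elapsed, doubling_months):
    """Project a transistor count forward under Moore's law:
    the count doubles every `doubling_months` months."""
    return initial_count * 2 ** (months_elapsed / doubling_months)

# Intel 4004 (1971): 2,300 transistors. Project 37 years (444 months)
# ahead with a 24-month doubling period (the slow end of 18 <= N <= 24).
estimate = projected_transistors(2_300, 37 * 12, 24)
print(f"{estimate:.3g}")  # on the order of 10**8 to 10**9 transistors
```

With N = 24 the projection lands within a factor of two of the Core i7's actual 2008 transistor count, which is about as much as a two-point exponential fit can promise.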
Slide 8 LARS: Introduction and Motivation
Performance Development in TOP500
Slide 9 LARS: Introduction and Motivation
John Shalf (NERSC, LBNL)
Slide 10 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
[Chart: total number of processor cores in the Top15 systems, June 1993 to December 2008; y-axis 0 to 1,200,000 processors]
Slide 11 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
Slide 12 LARS: Introduction and Motivation
Cray XT5 (Jaguar) at Oak Ridge National Laboratory
Slide 13 LARS: Introduction and Motivation
Dawning Nebulae at NSCS
Number two in TOP 500 (June 2010)
Installed at National Supercomputing Centre in Shenzhen (China)
Specification not published
Hybrid architecture
Presumably: 4640 nodes, each with
– Two Intel Xeon X5650 processors (10.64 GFLOPS)
– One Nvidia C2050 GPU
Total number of cores:
– 4640 nodes × (12 processor cores + 14 shader clusters) = 120,640 cores
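The core count is simple arithmetic over the presumed node configuration; a quick check, using only the figures from this slide:

```python
nodes = 4640
cpu_cores_per_node = 12        # two 6-core Intel Xeon X5650 per node
shader_clusters_per_node = 14  # one Nvidia C2050 GPU per node

total_cores = nodes * (cpu_cores_per_node + shader_clusters_per_node)
print(total_cores)  # 120640
```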
Slide 14 LARS: Introduction and Motivation
IBM Roadrunner at Los Alamos National Laboratory
First computer to surpass the 1 Petaflop (10^15 FLOPS) barrier
Installed at Los Alamos National Laboratories
Hybrid Architecture
13,824 AMD Opteron cores
116,640 IBM PowerXCell 8i cores
Cost: $120 million
Slide 15 LARS: Introduction and Motivation
IBM BlueGene/P (JUGENE) at Research Centre Jülich
Number five in TOP 500
Installed at Forschungszentrum Jülich
72 racks with 32 node cards × 32 compute cards each (73,728 total)
294,912 PowerPC 450, 850 MHz
144 TB main memory
Slide 16 LARS: Introduction and Motivation
What Kind of Know-How is Required for HPC?
Algorithms and methods
Performance Analysis
Programming (Paradigms and details of implementations)
Operation of supercomputers (network, infrastructure, service, support)
Slide 17 LARS: Introduction and Motivation
Challenges
Languages
– Fortran 95, C/C++, Java, …
– Also scripting languages!
Parallelization:
– MPI, OpenMP
Network
– Ethernet, Infiniband, Myrinet, …
Scheduling
– Distributed components, job scheduling, process scheduling
System architecture
– Processors, memory hierarchy
Slide 18 LARS: Introduction and Motivation
From Modeling to Execution
Slide 20 LARS: Introduction and Motivation
Short History of X86 CPUs
CPU          Year  Bit Width  #Transistors  Clock        Structure  L1 / L2 / L3
4004         1971  4          2,300         740 kHz      10 µm
8008         1972  8          3,500         500 kHz      10 µm
8086         1978  16         29,000        10 MHz       3 µm
80286        1982  16         134,000       25 MHz       1.5 µm
80386        1985  32         275,000       33 MHz       1 µm
80486        1989  32         1,200,000     50 MHz       0.8 µm     8K
Pentium I    1994  32         3,100,000     66 MHz       0.8 µm     8K
Pentium II   1997  32         7,500,000     300 MHz      0.35 µm    16K/512K*
Pentium III  1999  32         9,500,000     600 MHz      0.25 µm    16K/512K*
Pentium IV   2000  32         42,000,000    1.5 GHz      0.18 µm    8K/256K
P IV F       2005  64                       2.8–3.8 GHz  90 nm      16K/2MB
Core i7      2008  64         781,000,000   3.2 GHz      45 nm      32K/256K/8MB
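The table's first and last rows are enough for a sanity check of the 18-to-24-month doubling range from the Moore's law slide. A small sketch, using only the 4004 (1971) and Core i7 (2008) transistor counts:

```python
import math

years = 2008 - 1971                          # 37 years of x86 history
doublings = math.log2(781_000_000 / 2_300)   # about 18.4 doublings
months_per_doubling = years * 12 / doublings

print(f"{months_per_doubling:.1f}")  # about 24 months, the slow end of 18..24
```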
Slide 21 LARS: Introduction and Motivation
Intel Nehalem
Released 2008
4 cores
781,000,000 transistors
45 nm technology
32K L1 data cache, 32K L1 instruction cache
256K L2 cache
8 MB shared L3 cache
Hyperthreading
3.2 GHz*4 cores*4 FLOPS/cycle = 51.2 Gflop/s peak
Integrated memory controller
QPI between processors
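The peak figure on this slide follows directly from clock rate, core count, and the floating point operations each core can retire per cycle; a minimal sketch:

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Theoretical peak: every core retires `flops_per_cycle`
    floating point operations on every clock cycle."""
    return clock_ghz * cores * flops_per_cycle

# Nehalem: 3.2 GHz, 4 cores, 4 FLOPS/cycle (2-wide SSE add + 2-wide SSE mul)
print(peak_gflops(3.2, 4, 4))  # 51.2 GFLOP/s
```

Note that this is an upper bound; a later slide lists the bandwidth and latency limits that keep real applications well below it.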
Slide 22 LARS: Introduction and Motivation
Nehalem Core
[Die diagram: Nehalem core functional blocks: execution units; out-of-order scheduling & retirement; L2 cache & interrupt servicing; instruction fetch & L1 cache; branch prediction; instruction decode & microcode; paging; L1 data cache; memory ordering & execution]
Slide 23 LARS: Introduction and Motivation
Potential factors limiting performance
“Peak performance”
Floating point units
Integer units
… any other feature of micro architecture
Bandwidth (L1,L2,L3, main memory, other cores, other nodes)
Latency (L1,L2,L3, main memory, other cores, other nodes)
Slide 24 LARS: Introduction and Motivation
Performance development in TOP500
Slide 25 LARS: Introduction and Motivation
Does the Rest of the System Keep Pace with the CPU?
[Chart: processor vs. DRAM performance over time, 1980 to 2000, log scale 1 to 1000. µProc: 60%/yr (2X/1.5 yr, "Moore's Law"); DRAM: 9%/yr (2X/10 yrs). The processor-DRAM memory gap (latency) grows by about 50% per year.]
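The widening gap follows from compounding the two growth rates quoted on this slide:

```python
cpu_growth = 1.60   # microprocessor performance: +60% per year
dram_growth = 1.09  # DRAM performance: +9% per year

# Relative widening per year; this is the ~50%/yr gap growth on the slide
gap_per_year = cpu_growth / dram_growth
print(f"{gap_per_year:.2f}")  # about 1.47x per year

# Compounded over the chart's 20-year span (1980-2000)
gap_20yr = gap_per_year ** 20
print(f"{gap_20yr:.0f}")
```

This compounding is why memory hierarchy effects dominate so many of the measurement results later in this course.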
Slide 26 LARS: Introduction and Motivation
Performance Trends measured by SPECint
Source: Hennessy, Patterson: "Computer Architecture: A Quantitative Approach".
Slide 27 LARS: Introduction and Motivation
CPUint2006 development 2005 - 2009
Slide 28 LARS: Introduction and Motivation
Performance Trends measured by SPECint
2009
23%
Slide 29 LARS: Introduction and Motivation
CPUfp2006 development 1991 - 2009
CPU 95
Released 1995
602 results between 3/1991 and 1/2001
CPUfp2000
Released 2000
1385 results between 10/1996 and 2/2007
CPUfp2006
Released 2006
1217 results between 4/1997 and 4/2009
42%
33%
30%
Slide 30 LARS: Introduction and Motivation
Performance Trends over a 20 years life cycle
Slide 31 LARS: Introduction and Motivation
Performance Trends over a 20 years life cycle
Where is your application?
Slide 32 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and HPC
A short introduction
HPC in Germany
Slide 34 LARS: Introduction and Motivation
Responsibilities of ZIH
Providing infrastructure and qualified service for TU Dresden and Saxony
Research topics
– Architecture and performance analysis of High Performance Computers
– Programming methods and techniques for HPC systems
– Software tools to support programming and optimization
– Modeling algorithms of biological processes
– Mathematical models, algorithms, and efficient implementations
Role of mediator between vendors, developers, and users
Pick up and preparation of new concepts, methods, and techniques
Teaching and Education
Slide 35 LARS: Introduction and Motivation
Compute Server Infrastructure
[Diagram: HPC component (6.5 TB main memory) and PC farm, each with its own SAN (68 TB disk capacity each) and a petabyte tape archive (1 PB capacity); link bandwidths 8 GB/s, 4 GB/s, 4 GB/s, and 1.8 GB/s]
HPC component
– SGI® Altix® 4700
– 2048 Montecito cores
– 6.5 TByte main memory
PC farm
– System from Linux Networx
– AMD Opteron CPUs (dual core, 2.6 GHz)
– 728 boards with 2592 cores
– Infiniband network between the nodes
Slide 36 LARS: Introduction and Motivation
HPC-System: SGI Altix 4700 (Mars)
32 × 42U racks
1024 sockets with Itanium 2 Montecito dual-core CPUs (1.6 GHz / 9 MB L3 cache)
13 TFlop/s peak performance
11.9 TFlop/s linpack
6.5 TB shared memory
Slide 37 LARS: Introduction and Motivation
Linux Networx PC-Farm (Deimos)
– 26 water-cooled racks (Knürr)
– 1296 AMD Opteron x85 dual-core CPUs (2.6 GHz)
– 728 compute nodes with 2 (384), 4 (232), or 8 (112) cores
– 2 master and 11 Lustre servers
– 2 GB memory per core
– 68 TB SAN disk (RAID 6)
– Local scratch disks (70, 150, 290 GB)
– Two 4x Infiniband fabrics (MPI + I/O)
– OS: SuSE SLES 10
– Batch system: LSF
– Compilers: Pathscale, PGI, Intel, GNU
– ISV codes: Ansys100, CFX, Fluent, Gaussian, LS-DYNA, Matlab, MSC
Slide 38 LARS: Introduction and Motivation
Computer Rooms – Extension to the Building
Slide 39 LARS: Introduction and Motivation
Performance of Supercomputers at ZIH
Slide 40 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Research at ZIH
Selected Projects and Activities
Research Areas at ZIH
Software tools to support programming and optimization
Programming methods and techniques for high performance computers
Grid computing
Mathematical methods, algorithms, and efficient implementations
Architecture and performance analysis of high performance computers
Algorithms and methods for modeling biological processes
Slide 42 LARS: Introduction and Motivation
Software Tools …
Vampir
– Visualization and analysis of parallel applications
Marmot
– Detection of incorrect usage of the MPI communication library
ParBench
– Analysis of multiprogramming properties
BenchIT
– Execution/archiving/presentation of benchmarks and their results
Screenshots: Marmot for Windows
Slide 43 LARS: Introduction and Motivation
Vampir: Framework
Slide 44 LARS: Introduction and Motivation
Vampir: Timelines
Slide 45 LARS: Introduction and Motivation
Vampir: Summaries
Slide 46 LARS: Introduction and Motivation
BenchIT
BenchIT measurement core
Command line interface
GUI
Website
Slide 47 LARS: Introduction and Motivation
Cluster Challenge 2008
Challenge:
– 6 students
– 44 hours
– 1 (self-assembled) cluster with at most 3.1 kW power consumption
– 5 scientific applications
Goal:
– Maximum job throughput within the competition time
Field of competitors:
Purdue University with SiCortex, University of Alberta with SGI, TUD/IU with IBM & Myricom, Taiwan with HP, Arizona State with Cray/MS, Colorado with Aspen Systems, MIT with Dell
Slide 48 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 49 LARS: Introduction and Motivation
Cluster Challenge 2008
Hardware optimizations
– 10G Myrinet interconnect (~120 W for switch + host adapters)
– Optimal DIMM configuration for the applications (16 GB per node)
– Booting from USB sticks, using the local disks only when necessary
– Measuring the applications' power consumption profiles to choose the "right" total number of nodes
Software optimizations
– Where sensible, use of commercial compilers (significant effort)
– Tracing the applications to understand and optimize their communication
Throughput optimizations
– Using the power consumption and runtime estimates for optimal utilization of the cluster
Result: 1st place
Slide 50 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 51 LARS: Introduction and Motivation
Infrastructure
High performance computers:
Workstations:
Slide 53 LARS: Introduction and Motivation
International Collaboration
Tracing
ParMA
VI HPS
Open MPI
Slide 54 LARS: Introduction and Motivation
Future Prospects
In the many-core era, parallel computing is becoming ever more important
Contacts with international partners
Industry contacts: IBM, SUN, Cray, SGI, NEC, Intel, AMD, …
Possible stays abroad or industry internships
– Examples of stays abroad:
• LLNL, CA, U.S.A.
• BSC, Barcelona, Spain
• Eugene, OR, U.S.A.
– Examples of internships:
• Cray
• IBM
Slide 55 LARS: Introduction and Motivation
Evaluation of the GCC Plug-in Interface
Topic: Evaluation of GCC's new plug-in interface with regard to the instrumentation of HPC programs
Questions:
– What innovations and advantages does the plug-in mechanism offer?
– How can GCC plug-ins be used to instrument HPC programs?
– Is efficient filtering at runtime possible?
– Comparison with conventional instrumentation
Advisor: Bert Wesarg ([email protected])
Slide 57 LARS: Introduction and Motivation
Program Trace Analysis with Signal Processing
Topic: Evaluation of analysis methods from signal processing with regard to program traces
Questions:
– How can program traces sensibly be mapped onto signals?
– To what extent are signal processing methods (sampling, wavelet transformation, correlation) suited to more efficient processing of performance data from program traces?
– Is automatic pattern recognition and grouping of the data possible?
Advisor: Matthias Weber ([email protected])
Slide 58 LARS: Introduction and Motivation
Performance Analysis for Speedstep Architectures
Topic: "Improving performance analysis for multicore architectures and systems with Speedstep capabilities"
Questions:
– Investigating the options under Linux for determining which CPU core executes a given process
– Integrating this information into program traces
– Finding a portable and non-intrusive way to record clock frequency changes of CPU cores
– Based on this, normalizing time intervals in program traces
Advisor: Jens Doleschal ([email protected])
Slide 59 LARS: Introduction and Motivation
Performance Analysis and Software Development
Topic: Performance analysis as an integral part of software development
Questions:
– Integration of performance analysis (Vampir) into an IDE (Eclipse)
– Suitable abstraction and presentation of "performance summaries"
– Integration of parallel performance analysis into the software development process
– Advisors: Matthias Mueller, Andreas Knüpfer
Slide 60 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Thank you!
Hope to see you next time…