Copyright 2008, University of Alberta
Introduction to High Performance Computing
Jon Johansson, Academic ICT
University of Alberta
Copyright 2007, University of Alberta
Agenda
• What is High Performance Computing?
• What is a “supercomputer”?
• is it a mainframe?
• Supercomputer architectures
• Who has the fastest computers?
• Speedup
• Programming for parallel computing
• The GRID??
High Performance Computing
• HPC is the field that concentrates on developing supercomputers and software to run on supercomputers
• a main area of this discipline is developing parallel processing algorithms and software
• programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors
High Performance Computing
• HPC is about “big problems”, i.e. problems that need:
• lots of memory
• many cpu cycles
• big hard drives
• no matter what field you work in, perhaps your research would benefit by making problems “larger”
• 2d → 3d
• finer mesh
• increase number of elements in the simulation
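To see why “larger” quickly demands HPC resources, a rough back-of-the-envelope sketch (the mesh sizes here are hypothetical, chosen only to illustrate the growth):

```python
def grid_points(n, dims):
    """Number of points in a cubic mesh with n points per dimension."""
    return n ** dims

# moving from 2d to 3d at the same resolution multiplies the work:
p2d = grid_points(1000, 2)   # 1,000,000 points
p3d = grid_points(1000, 3)   # 1,000,000,000 points

# halving the mesh spacing in 3d multiplies the point count by 8
refined = grid_points(2000, 3)
print(p3d, refined // p3d)
```

The same arithmetic applies to memory and to cpu cycles per time step, which is why refining a mesh or adding a dimension so often pushes a problem off the desktop.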
Grand Challenges
• weather forecasting
• economic modeling
• computer-aided design
• drug design
• exploring the origins of the universe
• searching for extra-terrestrial life
• computer vision
• nuclear power and weapons simulations
Grand Challenges – Protein
To simulate the folding of a 300 amino acid protein in water:
• # of atoms: ~32,000
• folding time: 1 millisecond
• # of FLOPs: ~3×10^22
• machine speed: 1 PetaFLOP/s
• simulation time: 1 year
(Source: IBM Blue Gene Project)
IBM’s answer: The Blue Gene Project
US$ 100 M of funding to build a 1 PetaFLOP/s computer
Ken Dill and Kit Lau’s protein folding model.
Charles L Brooks III, Scripps Research Institute
Grand Challenges - Nuclear
• National Nuclear Security Administration
• http://www.nnsa.doe.gov/
• use supercomputers to run three-dimensional codes to simulate instead of test
• address critical problems of materials aging
• simulate the environment of the weapon and try to gauge whether the device continues to be usable
• stockpile science, molecular dynamics and turbulence calculations
http://archive.greenpeace.org/comms/nukes/fig05.gif
Grand Challenges - Nuclear
• March 7, 2002: first full-system three-dimensional simulations of a nuclear weapon explosion
• simulation used more than 480 million cells (grid: 780x780x780, if the grid is a cube)
• 1,920 processors on IBM ASCI White at the Lawrence Livermore National Laboratory
• 2,931 wall-clock hours or 122.5 days
• 6.6 million CPU hours
ASCI White
Test shot “Badger”
Nevada Test Site – Apr. 1953, Yield: 23 kilotons
http://nuclearweaponarchive.org/Usa/Tests/Upshotk.html
Grand Challenges - Nuclear
• Advanced Simulation and Computing Program (ASC)
• http://www.llnl.gov/asc/asc_history/asci_mission.html
Agenda
• What is High Performance Computing?
• What is a “supercomputer”?
• is it a mainframe?
• Supercomputer architectures
• Who has the fastest computers?
• Speedup
• Programming for parallel computing
• The GRID??
What is a “Mainframe”?
• large and reasonably fast machines
• the speed isn't the most important characteristic
• high-quality internal engineering and resulting proven reliability
• expensive but high-quality technical support
• top-notch security
• strict backward compatibility for older software
What is a “Mainframe”?
• these machines can, and do, run successfully for years without interruption (long uptimes)
• repairs can take place while the mainframe continues to run
• the machines are robust and dependable
• IBM coined a term to advertise the robustness of their mainframe computers:
• Reliability, Availability and Serviceability (RAS)
What is a “Mainframe”?
• Introducing IBM System z9 109
• Designed for the On Demand Business
• IBM is delivering a holistic approach to systems design
• Designed and optimized with a total systems approach
• Helps keep your applications running with enhanced protection against planned and unplanned outages
• Extended security capabilities for even greater protection capabilities
• Increased capacity with more available engines per server
What is a Supercomputer??
• at any point in time the term “Supercomputer” refers to the fastest machines currently available
• a supercomputer this year might be a mainframe in a couple of years
• a supercomputer is typically used for scientific and engineering applications that must do a great amount of computation
What is a Supercomputer??
• the most significant difference between a supercomputer and a mainframe:
• a supercomputer channels all its power into executing a few programs as fast as possible
• if the system crashes, restart the job(s) – no great harm done
• a mainframe uses its power to execute many programs simultaneously
• e.g. – a banking system
• must run reliably for extended periods
What is a Supercomputer??
• to see the world’s “fastest” computers look at
• http://www.top500.org/
• measure performance with the Linpack benchmark
• http://www.top500.org/lists/linpack.php
• solve a dense system of linear equations
• the performance numbers give a good indication of peak performance
Terminology
• combining a number of processors to run a program is called variously:
• multiprocessing
• parallel processing
• coprocessing
Terminology
• parallel computing – harnessing a bunch of processors on the same machine to run your computer program
• note that this is one machine
• generally a homogeneous architecture
• same processors, memory, operating system
• all the machines in the Top 500 are in this category
Terminology
• distributed computing - harnessing a bunch of processors on different machines to run your computer program
• heterogeneous architecture
• different operating systems, cpus, memory
• the terms “parallel” and “distributed” computing are often used interchangeably
• the work is divided into sections so each processor does a unique piece
Terminology
• some distributed computing projects are built on BOINC (Berkeley Open Infrastructure for Network Computing):
• SETI@home – Search for Extraterrestrial Intelligence
• Proteins@home – deduces DNA sequence, given a protein
• Hydrogen@home – enhance clean energy technology by improving hydrogen production and storage (this is beta now)
Quantify Computer Speed
• we want a way to compare computer speeds
• count the number of “floating point operations” required to solve the problem
• + - x /
• results of the benchmark are so many Floating point Operations Per Second (FLOPS)
• a supercomputer is a machine that can provide a very large number of FLOPS
[figure: multiplying two N×N matrices – each element of the result is a sum of N products, C(i,j) = A(i,1)B(1,j) + A(i,2)B(2,j) + … + A(i,N)B(N,j)]
Floating Point Operations
• multiply 2 1000x1000 matrices
• for each resulting array element
• 1000 multiplies
• 999 adds
• do this 1,000,000 times
• ~10^9 operations needed
• increasing array size has the number of operations increasing as O(N^3)
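The operation count on the slide can be checked directly; a small sketch of the arithmetic:

```python
def matmul_flops(n):
    """Floating point operations for a naive n x n matrix multiply:
    each of the n*n output elements needs n multiplies and n-1 adds."""
    return n * n * (n + (n - 1))

# 1000x1000 matrices: on the order of 10^9 operations
ops = matmul_flops(1000)
print(ops)  # 1999000000

# doubling n multiplies the work by ~8, consistent with O(N^3) growth
print(matmul_flops(2000) / matmul_flops(1000))
```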
Agenda
• What is High Performance Computing?
• What is a “supercomputer”?
• is it a mainframe?
• Supercomputer architectures
• Who has the fastest computers?
• Speedup
• Programming for parallel computing
• The GRID??
High Performance Computing
• supercomputers use many CPUs to do the work
• note that all supercomputing architectures have
• processors and some combination of cache
• some form of memory and I/O
• the processors are separated from the other processors by some distance
• there are major differences in the way that the parts are connected
• some problems fit into different architectures better than others
High Performance Computing
• increasing computing power available to researchers allows
• increasing problem dimensions
• adding more particles to a system
• increasing the accuracy of the result
• improving experiment turnaround time
Flynn’s Taxonomy
• Michael J. Flynn (1972)
• classified computer architectures based on the number of concurrent instructions and data streams available
• single instruction, single data (SISD) – basic old PC
• multiple instruction, single data (MISD) – redundant systems
• single instruction, multiple data (SIMD) – vector (or array) processor
• multiple instruction, multiple data (MIMD) – shared or distributed memory systems: symmetric multiprocessors and clusters
• common extension:
• single program (or process), multiple data (SPMD)
Architectures
• we can also classify supercomputers according to how the processors and memory are connected
• couple processors to a single large memory address space
• couple computers, each with its own memory address space
Architectures
• Symmetric Multiprocessing (SMP)
• Uniform Memory Access (UMA)
• multiple CPUs, residing in one cabinet, share the same memory
• processors and memory are tightly coupled
• the processors share memory and the I/O bus or data path
Architectures
• SMP
• a single copy of the operating system is in charge of all the processors
• SMP systems range from two to as many as 32 or more processors
Architectures
• SMP
• "capability computing"
• one CPU can use all the memory
• all the CPUs can work on a little memory
• whatever you need
Architectures
• UMA-SMP negatives
• as the number of CPUs gets large the buses become saturated
• long wires cause latency problems
Architectures
• Non-Uniform Memory Access (NUMA)
• NUMA is similar to SMP - multiple CPUs share a single memory space
• hardware support for shared memory
• memory is separated into close and distant banks
• basically a cluster of SMPs
• memory on the same processor board as the CPU (local memory) is accessed faster than memory on other processor boards (shared memory)
• hence "non-uniform"
• NUMA architecture scales much better to higher numbers of CPUs than SMP
Architectures
[photos: University of Alberta SGI Origin and SGI NUMA cables]
Architectures
• Cache Coherent NUMA (ccNUMA)
• each CPU has an associated cache
• ccNUMA machines use special-purpose hardware to maintain cache coherence
• typically done by using inter-processor communication between cache controllers to keep a consistent memory image when the same memory location is stored in more than one cache
• ccNUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession
Architectures
Distributed Memory Multiprocessor (DMMP)
• each computer has its own memory address space
• looks like NUMA but there is no hardware support for remote memory access
• the special purpose switched network is replaced by a general purpose network such as Ethernet or more specialized interconnects:
• Infiniband
• Myrinet
Lattice: Calgary’s HP ES40 and ES45 cluster – each node has 4 processors
Architectures
• Massively Parallel Processing (MPP)
• cluster of commodity PCs
• processors and memory are loosely coupled
• "capacity computing"
• each CPU contains its own memory and copy of the operating system and application
• each subsystem communicates with the others via a high-speed interconnect
• in order to use MPP effectively, a problem must be breakable into pieces that can all be solved simultaneously
Architectures
• lots of “how to build a cluster” tutorials on the web – just Google:
• http://www.beowulf.org/
• http://www.cacr.caltech.edu/beowulf/tutorial/building.html
Architectures
• Vector Processor or Array Processor
• a CPU design that is able to run mathematical operations on multiple data elements simultaneously
• a scalar processor operates on data elements one at a time
• vector processors formed the basis of most supercomputers through the 1980s and into the 1990s
• “pipeline” the data
Architectures
• Vector Processor or Array Processor
• operate on many pieces of data simultaneously
• consider the following add instruction:
• C = A + B
• on both scalar and vector machines this means:
• add the contents of A to the contents of B and put the sum in C
• on a scalar machine the operands are numbers
• on a vector machine the operands are vectors and the instruction directs the machine to compute the pair-wise sum of each pair of vector elements
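The scalar-versus-vector distinction can be sketched in a few lines, with plain Python lists standing in for hardware vector registers:

```python
# scalar machine: one add instruction per pair of numbers
def scalar_add(a, b):
    return a + b

# vector machine: one instruction computes the pair-wise sum of
# every pair of elements in the operand vectors
def vector_add(A, B):
    return [a + b for a, b in zip(A, B)]

A = [1.0, 2.0, 3.0, 4.0]
B = [10.0, 20.0, 30.0, 40.0]
C = vector_add(A, B)   # [11.0, 22.0, 33.0, 44.0]
```

On real vector hardware the pair-wise sums are pipelined through the functional units rather than looped over one at a time, which is where the speed comes from.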
Architectures
• University of Victoria has 4 NEC SX-6/8A vector processors
• in the School of Earth and Ocean Sciences
• each has 32 GB of RAM
• 8 vector processors in the box
• peak performance is 72 GFLOPS
Agenda
• What is High Performance Computing?
• What is a “supercomputer”?
• is it a mainframe?
• Supercomputer architectures
• Who has the fastest computers?
• Speedup
• Programming for parallel computing
• The GRID??
BlueGene/L
• the fastest on the Nov. 2007 Top 500 list:
• http://www.top500.org/
• installed at the Lawrence Livermore National Laboratory (LLNL) (US Department of Energy)
• Livermore, California
http://www.llnl.gov/asc/platforms/bluegenel/photogallery.html
BlueGene/L
• processors: 212,992
• memory: 72 TB
• 104 racks – each has 2048 processors
• the first 64 had 512 GB of RAM (256 MB/processor)
• the 40 new racks have 1 TB of RAM (512 MB/processor)
• a Linpack performance of 478.2 TFlop/s
• in Nov 2005 it was the only system ever to exceed the 100 TFlop/s mark
• there are now 10 machines over 100 TFlop/s
The Fastest Six

Site | Computer | Processors | Year | Rmax (Gflops) | Rpeak (Gflops)
DOE/NNSA/LLNL, United States | BlueGene/L – eServer Blue Gene Solution (IBM) | 212992 | 2007 | 478200 | 596378
Forschungszentrum Juelich (FZJ), Germany | JUGENE – Blue Gene/P Solution (IBM) | 65536 | 2007 | 167300 | 222822
SGI/New Mexico Computing Applications Center (NMCAC), United States | SGI Altix ICE 8200, Xeon quad core 3.0 GHz (SGI) | 14336 | 2007 | 126900 | 172032
Computational Research Laboratories, TATA SONS, India | EKA – Cluster Platform 3000 BL460c, Xeon 53xx 3GHz, Infiniband (Hewlett-Packard) | 14240 | 2007 | 117900 | 170880
Government Agency, Sweden | Cluster Platform 3000 BL460c, Xeon 53xx 2.66GHz, Infiniband (Hewlett-Packard) | 13728 | 2007 | 102800 | 146430
NNSA/Sandia National Laboratories, United States | Red Storm – Sandia/Cray Red Storm, Opteron 2.4 GHz dual core (Cray Inc.) | 26569 | 2007 | 102200 | 127531
# of Processors with Time
The number of processors in the fastest machines has increased by about a factor of 200 in the last 15 years
# of Gflops Increase with Time
Machine speed has increased by more than a factor of 5000 in the last 15 years.
Future BlueGene
Agenda
• What is High Performance Computing?
• What is a “supercomputer”?
• is it a mainframe?
• Supercomputer architectures
• Who has the fastest computers?
• Speedup
• Programming for parallel computing
• The GRID??
Speedup
• how can we measure how much faster our program runs when using more than one processor?
• define Speedup S as:
• the ratio of 2 program execution times
• constant problem size
• T1 is the execution time for the problem on a single processor (use the “best” serial time)
• TP is the execution time for the problem on P processors
S = T1 / TP
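The definition translates directly into code; the example timings below are the MatCrunch numbers quoted later in this talk:

```python
def speedup(t1, tp):
    """Speedup S = T1 / TP for a fixed problem size."""
    return t1 / tp

# e.g. a section that takes 40.27 s on one processor and 20.25 s on two
S = speedup(40.27, 20.25)
print(round(S, 2))  # 1.99
```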
Speedup
• Linear speedup
• the time to execute the problem decreases by the number of processors
• if a job requires 1 week with 1 processor it will take less than 10 minutes with 1024 processors
Speedup
• Sublinear speedup
• the usual case
• there are generally some limitations to the amount of speedup that you get
• communication
Speedup
• Superlinear speedup
• very rare
• memory access patterns may allow this for some algorithms
Speedup
• why do a speedup test?
• it’s hard to tell how a program will behave
• e.g. “Strange” is actually fairly common behaviour for un-tuned code
• in this case:
• linear speedup to ~10 cpus
• after 24 cpus speedup is starting to decrease
Speedup
• to use more processors efficiently change this behaviour
• change loop structure
• adjust algorithms
• ??
• run jobs with 10-20 processors so the machines are used efficiently
Speedup
• one class of jobs that have linear speedup are called “embarrassingly parallel”
• a better name might be “perfectly” parallel
• doesn’t take much effort to turn the problem into a bunch of parts that can be run in parallel:
• parameter searches
• rendering the frames in a computer animation
• brute force searches in cryptography
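A parameter search is perfectly parallel because every evaluation is independent of the others. A minimal sketch with Python's standard library (the objective function here is made up for illustration; a thread pool stands in for the pool of processors, and on a cluster each evaluation would be a separate job or process):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(x):
    # hypothetical objective: each parameter is scored independently,
    # so every evaluation could run on a different processor
    return (x - 3) ** 2

params = list(range(10))

# map the independent tasks over the worker pool
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, params))

best = params[scores.index(min(scores))]
print(best)  # 3
```

No worker ever needs another worker's data, which is exactly why this kind of job scales linearly.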
Speedup
• we have been discussing Strong Scaling
• the problem size is fixed and we increase the number of processors
• decrease computational time (Amdahl scaling)
• the amount of work available to each processor decreases as the number of processors increases
• eventually, the processors are doing more communication than number crunching and the speedup curve flattens
• difficult to have high efficiency for large numbers of processors
Speedup
• we are often interested in Weak Scaling
• double the problem size when we double the number of processors
• constant computational time (Gustafson scaling)
• the amount of work for each processor stays roughly constant
• parallel overhead is (hopefully) small compared to the real work the processor does
• e.g. weather prediction
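Weak scaling is usually summarized by Gustafson's law, which the slide alludes to as Gustafson scaling: if a fraction f of the run stays serial, the scaled speedup on P processors is S = f + (1 − f)·P, i.e. P − f·(P − 1). A quick sketch:

```python
def gustafson_speedup(f, p):
    """Scaled speedup when the problem grows with the processor count.
    f: serial fraction of the run, p: number of processors."""
    return p - f * (p - 1)

# with a 5% serial fraction, 64 processors still give ~61x
# on the correspondingly larger problem
print(round(gustafson_speedup(0.05, 64), 2))  # 60.85
print(gustafson_speedup(0.0, 64))             # 64.0 -- perfectly parallel
```

Contrast this with the fixed-size (Amdahl) picture: growing the problem keeps each processor busy with real work, so the speedup keeps climbing instead of saturating.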
Amdahl’s Law
• Gene Amdahl: 1967
• parallelize some of the program – some must remain serial
• f is the fraction of the calculation that is serial
• 1-f is the fraction of the calculation that is parallel
• the maximum speedup that can be obtained by using P processors is:
Smax = 1 / (f + (1 − f)/P)
[diagram: a program as a serial fraction f followed by a parallel fraction 1−f]
Amdahl’s Law
• if 25% of the calculation must remain serial the best speedup you can obtain is 4
• need to parallelize as much of the program as possible to get the best advantage from multiple processors
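Amdahl's bound is easy to evaluate; this sketch reproduces the 25%-serial example above:

```python
def amdahl_speedup(f, p):
    """Maximum speedup on p processors when a fraction f is serial."""
    return 1.0 / (f + (1.0 - f) / p)

# with f = 0.25 the speedup saturates near 4 no matter how many
# processors are added
for p in (2, 16, 1024, 10**6):
    print(p, round(amdahl_speedup(0.25, p), 2))
```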
Agenda
• What is High Performance Computing?
• What is a “supercomputer”?
• is it a mainframe?
• Supercomputer architectures
• Who has the fastest computers?
• Speedup
• Programming for parallel computing
• The GRID??
Parallel Programming
• need to do something to your program to use multiple processors
• need to incorporate commands into your program which allow multiple threads to run
• one thread per processor
• each thread gets a piece of the work
• several ways (APIs) to do this …
Parallel Programming
• OpenMP
• introduce statements into your code
• in C: #pragma
• in FORTRAN: C$OMP or !$OMP
• can compile serial and parallel executables from the same source code
• restricted to shared memory machines
• not clusters!
• www.openmp.org
Parallel Programming
• OpenMP
• demo: MatCrunch
• mathematical operations on the elements of an array
• introduce 2 OMP directives before a loop
• #pragma omp parallel // define a parallel section
• #pragma omp for // loop is to be parallel
• serial section: 4.03 sec
• parallel section – 1 cpu: 40.27 secs
• parallel section – 2 cpu: 20.25 secs
• speedup = 1.99 // not bad for adding 2 lines
Parallel Programming
• for a larger number of processors the speedup for MatCrunch is not linear
• need to do the speedup test to see how your program will behave
Parallel Programming
• MPI (Message Passing Interface)
• a standard set of communication subroutine libraries
• works for SMPs and clusters
• programs written with MPI are highly portable
• information and downloads
• http://www.mpi-forum.org/
• MPICH: http://www-unix.mcs.anl.gov/mpi/mpich/
• LAM/MPI: http://www.lam-mpi.org/
• Open MPI: http://www.open-mpi.org/
Parallel Programming
• MPI (Message Passing Interface)
• supports the SPMD, single program multiple data model
• all processors use the same program
• each processor has its own data
• think of a cluster – each node is getting a copy of the program but running a specific portion of it with its own data
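The index arithmetic behind “each processor has its own data” can be sketched without any actual MPI calls; in a real MPI program the rank and size would come from MPI_Comm_rank and MPI_Comm_size, and this sketch only shows the partitioning every rank computes for itself:

```python
def my_chunk(n, size, rank):
    """Half-open range [start, stop) of array indices that process
    `rank` of `size` works on, spreading any remainder over the
    lowest-numbered ranks."""
    base, rem = divmod(n, size)
    start = rank * base + min(rank, rem)
    stop = start + base + (1 if rank < rem else 0)
    return start, stop

# 10 elements over 4 processes: every index is owned by exactly one rank
print([my_chunk(10, 4, r) for r in range(4)])
# [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Because every copy of the program runs the same function with its own rank, no communication is needed just to decide who works on what.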
Parallel Programming
• it’s possible to combine OpenMP and MPI for running on clusters of SMP machines
• the trick in parallel programming is to keep all the processors
• working (“load balancing”)
• working on data that no other processor needs to touch (so there aren’t any cache conflicts)
Agenda
• What is High Performance Computing?
• What is a “supercomputer”?
• is it a mainframe?
• Supercomputer architectures
• Who has the fastest computers?
• Speedup
• Programming for parallel computing
• The GRID??
Grid Computing
• A computational grid:
• is a large-scale distributed computing infrastructure
• composed of geographically distributed, autonomous resource providers
• lots of computers joined together
• requires excellent networking that supports resource sharing and distribution
• offers access to all the resources that are part of the grid
• compute cycles
• storage capacity
• visualization/collaboration
• is intended for integrated and collaborative use by multiple organizations
Grids
• Ian Foster (the “Father of the Grid”) says that to be a Grid three points must be met
• computing resources are not administered centrally
• many sites connected
• open standards are used
• not a proprietary system
• non-trivial quality of service is achieved
• it is available most of the time
• CERN says a Grid is “a service for sharing computer power and data storage capacity over the Internet”
Canadian Academic Computing Sites in 2000
Canadian Grids
• Some sites in Canada have tied their resources together to form 7 Canadian Grid Consortia:
• ACENET – Atlantic Computational Excellence Network
• CLUMEQ – Consortium Laval UQAM McGill and Eastern Quebec for High Performance Computing
• SCINET – University of Toronto
• HPCVL – High Performance Computing Virtual Laboratory
• RQCHP – Réseau Québécois de calcul de haute performance
• SHARCNET – Shared Hierarchical Academic Research Computing Network
• WESTGRID – Alberta, British Columbia
WestGrid
[map: WestGrid sites in Edmonton, Calgary, the UBC campus and the SFU campus]
Grids
• the ultimate goal of the Grid idea is to have a system that you can submit a job to, so that:
• your job uses resources that fit requirements that you specify
• 128 nodes on an SMP with 200 GB of RAM
• or 256 nodes on a PC cluster with 1 GB/processor
• when done the results come back to you
• you don’t care where the job runs
• Vancouver or St. John’s or in between
Sharing Resources
• HPC resources are not available quite as readily as your desktop computer
• the resources must be shared fairly
• the idea is that each person gets as much of the resource as necessary to run their job for a “reasonable” time
• if the job can’t finish in the allotted time the job needs to “checkpoint”
• save enough information to begin running again from where it left off
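A minimal checkpoint/restart sketch (the file name and state layout are made up for illustration; real codes save far richer state, but the pattern is the same: persist enough to resume, then pick up where you left off):

```python
import json
import os
import tempfile

def run(total_steps, ckpt, budget):
    """Advance a toy computation; checkpoint and stop when the
    per-session step budget runs out, and resume on the next call."""
    state = {"step": 0, "total": 0}
    if os.path.exists(ckpt):                 # resume from checkpoint
        with open(ckpt) as f:
            state = json.load(f)
    done = 0
    while state["step"] < total_steps:
        if done == budget:                   # allotted time is up:
            with open(ckpt, "w") as f:       # save enough to restart
                json.dump(state, f)
            return None
        state["total"] += state["step"]      # the "real work"
        state["step"] += 1
        done += 1
    return state["total"]

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = run(10, ckpt, budget=4)      # runs out of time, checkpoints
second = run(10, ckpt, budget=100)   # resumes from the file, finishes
print(first, second)  # None 45
```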
Sharing Resources
• Portable Batch System (Torque)
• submit a job to PBS
• job is placed in a queue with other users’ jobs
• jobs in the queue are prioritized by a scheduler
• your job executes at some time in the future
[diagram: An HPC Site – a desktop logs in and submits jobs to the head node of a facility or site; the head node and execution nodes share a file system]
Sharing Resources
• When connecting to a Grid we need a layer of “middleware” tools to securely access the resources
• Globus is one example
• http://www.globus.org/
[diagram: A Grid of HPC Sites – a desktop at a home site logs in and submits jobs; WestGrid execution sites and a file store site carry out the work]
Questions? Many details in other sessions of this workshop!