Introduction to High Performance Computing:
Parallel Computing, Distributed Computing, Grid Computing and More

Dr. Jay Boisseau
Director, Texas Advanced Computing Center
boisseau@tacc.utexas.edu
December 3, 2001
The University of Texas at Austin
Texas Advanced Computing Center
Outline
Preface
What is High Performance Computing?
Parallel Computing
Distributed Computing, Grid Computing, and More
Future Trends in HPC
Purpose
Purpose of this workshop: to educate researchers about the value and impact of high performance computing (HPC) techniques and technologies in conducting computational science and engineering
Purpose of this presentation: to educate researchers about the techniques and tools of parallel computing, and to show them the possibilities presented by distributed computing and Grid computing
Goals
Goals of this presentation are to help you:
1. understand the big picture of high performance computing
2. develop a comprehensive understanding of parallel computing
3. begin to understand how Grid and distributed computing will further enhance computational science capabilities
Content and Context
This material is an introduction and an overview. It is not a comprehensive HPC survey, so further reading (much more!) is recommended.
This presentation is followed by additional speakers with detailed presentations on specific HPC and science topics.
Together, these presentations will help prepare you to use HPC in your scientific discipline.
Background - me
Director of the Texas Advanced Computing Center (TACC) at the University of Texas
Formerly at San Diego Supercomputer Center (SDSC) and the Arctic Region Supercomputing Center
10+ years in HPC
Have known Luis for 4 years - plan to develop a strong relationship between TACC and CeCalCULA
Background - TACC
Mission: to enhance the academic research capabilities of the University of Texas and its affiliates through the application of advanced computing resources and expertise
TACC activities include:
Resources
Support
Development
Applied research
TACC Activities
TACC resources and support include:
HPC systems
Scientific visualization resources
Data storage/archival systems
TACC research and development areas:
HPC
Scientific Visualization
Grid Computing
Current HPC Systems
[System diagram: CRAY SV1 (16 CPUs, 16 GB memory), CRAY T3E (256+ processors, 128 MB/processor, 500 GB disk), IBM SP (64+ processors, 256 MB/processor, 300 GB disk), and a 640 GB archive system; hosts named aurora, golden, and azure; connected by FDDI, HiPPI, and an Ascend router.]
New HPC Systems
Four IBM p690 HPC servers
16 Power4 processors per server
1.3 GHz: 5.2 Gflops per processor, 83.2 Gflops per server
16 GB shared memory, >200 GB/s memory bandwidth!
144 GB disk
1 TB disk to partition across servers
Will configure as a single system (1/3 Tflop) with a single GPFS file system (1 TB) in 2Q02
New HPC Systems
IA64 Cluster:
20 2-way nodes, Itanium (800 MHz) processors
2 GB memory/node
72 GB disk/node
Myrinet 2000 switch
180 GB shared disk
IA32 Cluster:
32 2-way nodes, Pentium III (1 GHz) processors
1 GB memory/node
18.2 GB disk/node
Myrinet 2000 switch
750 GB IBM GPFS parallel file system for both clusters
World-Class Vislab
SGI Onyx2: 24 CPUs, 6 Infinite Reality 2 graphics pipelines
24 GB memory, 750 GB disk
Front and rear projection systems
3x1 cylindrically-symmetric Power Wall
5x2 large-screen, 16:9 panel Power Wall
Matrix switch between systems, projectors, rooms
7/30/2019 Pendahuluan Paralel Komputer
13/167Introduction to High Performance Computing
More Information
URL: www.tacc.utexas.edu
E-mail Addresses:
General Information: admin@tacc.utexas.edu
Technical assistance: remark@tacc.utexas.edu
Telephone Numbers:
Main Office: (512) 475-9411
Facsimile transmission: (512) 475-9445
Operations Room: (512) 475-9410
Outline
Preface
What is High Performance Computing?
Parallel Computing
Distributed Computing, Grid Computing, and More
Future Trends in HPC
Supercomputing
First HPC systems were vector-based systems (e.g., Cray)
named 'supercomputers' because they were an order of magnitude more powerful than commercial systems
Now, 'supercomputer' has little meaning
large systems are now just scaled-up versions of smaller systems
However, 'high performance computing' has many meanings
HPC Defined
High performance computing can mean high flop count:
per processor
totaled over many processors working on the same problem
totaled over many processors working on related problems
It can also mean faster turnaround time:
more powerful system
scheduled to first available system(s)
using multiple systems simultaneously
My Definitions
HPC: any computational technique that solves a large problem faster than is possible using single, commodity systems:
Custom-designed, high-performance processors(e.g. Cray, NEC)
Parallel computing
Distributed computing
Grid computing
My Definitions
Parallel computing: single systems with many processors working on the same problem
Distributed computing: many systems loosely coupled by a scheduler to work on related problems
Grid computing: many systems tightly coupled by software and networks to work together on single problems or on related problems
Importance of HPC
HPC has had tremendous impact on all areas of computational science and engineering in academia, government, and industry.
Many problems have been solved with HPC techniques that were impossible to solve with individual workstations or personal computers.
Outline
Preface
What is High Performance Computing?
Parallel Computing
Distributed Computing, Grid Computing, and More
Future Trends in HPC
Parallel vs. Serial Computers
Two big advantages of parallel computers:
1. total performance
2. total memory
Parallel computers enable us to solve problems that:
benefit from, or require, fast solution
require large amounts of memory
example that requires both: weather forecasting
Parallel vs. Serial Computers
Some benefits of parallel computing include:
more data points: bigger domains, better spatial resolution, more particles
more time steps: longer runs, better temporal resolution
faster execution: faster time to solution, more solutions in the same time, larger simulations in real time
Serial Processor Performance
[Graph: single-processor performance vs. time (years).]
Although Moore's Law predicts that single-processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached.
Types of Parallel Computers
The simplest and most useful way to classify modern parallel computers is by their memory model:
shared memory
distributed memory
Shared vs. Distributed Memory
[Diagrams: processors (P) sharing one memory over a bus; processor/memory (P/M) pairs connected by a network.]
Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)
Shared Memory: UMA vs. NUMA
Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs. (Sun E10000)
Non-uniform memory access (NUMA): time for memory access depends on the location of data. Local access is faster than non-local access. Easier to scale than SMPs. (SGI Origin)
[Diagrams: one bus and memory shared by all processors (UMA); two processor/bus/memory units joined by a network (NUMA).]
Distributed Memory: MPPs vs. Clusters
Processor-memory nodes are connected by some type of interconnect network
Massively Parallel Processor (MPP): tightly integrated, single system image
Cluster: individual computers connected by s/w
[Diagram: CPU/MEM nodes connected by an interconnect network.]
Processors, Memory, & Networks
Both shared and distributed memory systems have:
1. processors: now generally commodity RISC processors
2. memory: now generally commodity DRAM
3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
We will now begin to describe these pieces in detail, starting with definitions of terms.
Processor-Related Terms
Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on the design of the processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of frequency (MHz).
Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
Register: a small, extremely fast location for storing data or instructions in the processor.
Processor-Related Terms
Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
Pipeline: technique enabling multiple instructions to be overlapped in execution.
Superscalar: multiple instructions are possible per clock period.
Flops: floating point operations per second.
Processor-Related Terms
Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to the functional units so the processor can execute more instructions more rapidly.
Translation-Lookaside Buffer (TLB): keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).
Memory-Related Terms
SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.
DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper).
Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.
Interconnect-Related Terms
Latency:
Networks: how long does it take to start sending a "message"? Measured in microseconds.
Processors: how long does it take to output the results of some operations, such as floating point add, divide, etc., which are pipelined?
Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec.
Interconnect-Related Terms
Topology: the manner in which the nodes are connected.
The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons.
Instead, processors are arranged in some variation of a grid, torus, or hypercube.
[Diagrams: 3-d hypercube, 2-d mesh, 2-d torus.]
Processor-Memory Problem
Processors issue instructions roughly every nanosecond.
DRAM can be accessed roughly every 100 nanoseconds (!).
DRAM cannot keep processors busy! And the gap is growing:
processors getting faster by 60% per year
DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)
Processor-Memory Performance Gap
[Graph (from D. Patterson, CS252, Spring 1998, UCB): performance on a log scale, 1980-2000. CPU performance ("Moore's Law") grows 60%/yr, DRAM performance grows 7%/yr, and the processor-memory performance gap grows ~50%/yr.]
Processor-Memory Performance Gap
The problem becomes worse when remote (distributed or NUMA) memory is needed:
network latency is roughly 1000-10000 nanoseconds (roughly 1-10 microseconds)
networks are getting faster, but not fast enough
Therefore, cache is used in all processors:
almost as fast as processors (same circuitry)
sits between processors and local memory
expensive, can only use small amounts
must design system to load cache effectively
Processor-Cache-Memory
[Diagram: cache sits between the CPU and main memory.]
Cache is much smaller than main memory, and hence there is a mapping of data from main memory to cache.
Memory Hierarchy
[Diagram: CPU, then cache, then local memory, then remote memory; moving away from the CPU, speed decreases while size and cost per bit... size increases and cost per bit decreases.]
Cache-Related Terms
ICACHE: instruction cache
DCACHE (L1): data cache closest to registers
SCACHE (L2): secondary data cache
data from SCACHE has to go through DCACHE to registers
SCACHE is larger than DCACHE
not all processors have SCACHE
Cache Benefits
The data cache was designed with two key concepts in mind:
Spatial locality: when an element is referenced, its neighbors will be referenced too
cache lines are fetched together
work on consecutive data elements in the same cache line
Temporal locality: when an element is referenced, it might be referenced again soon
arrange code so that data in cache is reused often (see the loop-ordering sketch below)
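A minimal sketch of spatial locality in Fortran (my example, not from the slides): Fortran stores arrays column by column, so making the first index the inner loop walks consecutive memory locations and uses every element of each fetched cache line.

      program loops
      parameter (n=1000)
      real a(n,n), b(n,n)
      integer i, j
c     inner loop over the first index strides through memory
c     contiguously, exploiting spatial locality in the cache
      do j = 1, n
         do i = 1, n
            a(i,j) = 2.0 * b(i,j)
         enddo
      enddo
      end

Swapping the two loops touches memory with stride n and wastes most of each cache line.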
Direct-Mapped Cache
Direct-mapped cache: a block from main memory can go in exactly one place in the cache. This is called direct-mapped because there is a direct mapping from any block address in memory to a single location in the cache.
[Diagram: each main memory block maps to one fixed cache location.]
Fully Associative Cache
Fully associative cache: a block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.
[Diagram: each main memory block may map to any cache location.]
Set Associative Cache
Set associative cache: the middle range of designs between direct-mapped cache and fully associative cache is called set-associative cache. In an N-way set-associative cache, a block from main memory can go into N (N > 1) locations in the cache.
[Diagram: 2-way set-associative cache.]
Cache-Related Terms
Least Recently Used (LRU): cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block.
Random Replace: cache replacement strategy for set associative caches. A cache block is randomly replaced.
Example: CRAY T3E Cache
The CRAY T3E processors can execute:
2 floating point ops (1 add, 1 multiply) and
2 integer/memory ops (includes 2 loads or 1 store)
To help keep the processors busy:
on-chip 8 KB direct-mapped data cache
on-chip 8 KB direct-mapped instruction cache
on-chip 96 KB 3-way set associative secondary data cache with random replacement
Putting the Pieces Together
Recall:
Shared memory architectures:
Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000
Non-Uniform Memory Access (NUMA): most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA), systems. Ex: SGI Origin 2000
Distributed memory architectures:
Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP
Clusters: commodity nodes connected by interconnect. Example: Beowulf clusters
Symmetric Multiprocessors (SMPs)
SMPs connect processors to global shared memory using one of:
bus
crossbar
Provides a simple programming model, but has problems:
buses can become saturated
crossbar size must increase with # of processors
Problem grows with number of processors, limiting maximum size of SMPs
Shared Memory Programming
Programming models are easier since message passing is not necessary. Techniques:
autoparallelization via compiler options
loop-level parallelism via compiler directives
OpenMP
pthreads
More on programming models later.
Massively Parallel Processors
Each processor has its own memory:
memory is not shared globally
adds another layer to the memory hierarchy (remote memory)
Processor/memory nodes are connected by an interconnect network:
many possible topologies
processors must pass data via messages
communication overhead must be minimized
Communications Networks
Custom:
many vendors have custom interconnects that provide high performance for their MPP system
CRAY T3E interconnect is the fastest for MPPs: lowest latency, highest bandwidth
Commodity:
used in some MPPs and all clusters
Myrinet, Gigabit Ethernet, Fast Ethernet, etc.
Types of Interconnects
Fully connected: not feasible
Array and torus: Intel Paragon (2D array), CRAY T3E (3D torus)
Crossbar: IBM SP (8 nodes)
Hypercube, fat tree: SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)
Combinations of some of the above:
IBM SP (crossbar & fully connected for 80 nodes)
IBM SP (fat tree for > 80 nodes)
Clusters
Similar to MPPs:
commodity processors and memory
processor performance must be maximized
memory hierarchy includes remote memory
no shared memory--message passing
communication overhead must be minimized
Different from MPPs:
all commodity, including interconnect and OS
multiple independent systems: more robust
separate I/O systems
Cluster Pros and Cons
Pros:
inexpensive
fastest processors first
potential for true parallel I/O
high availability
Cons:
less mature software (programming and system)
more difficult to manage (changing slowly)
lower performance interconnects: not as scalable to large numbers (but have almost caught up!)
Distributed Memory Programming
Message passing is most efficient:
MPI
MPI-2
active/one-sided messages
vendor libraries: SHMEM (T3E), LAPI (SP)
coming in MPI-2
Shared memory models can be implemented in software, but are not as efficient.
More on programming models in the next section.
Distributed Shared Memory
More generally called cc-NUMA (cache coherent NUMA)
Consists of m SMPs with n processors each in a global address space:
each processor has some local memory (SMP)
all processors can access all memory: extra directory hardware on each SMP tracks values stored in all SMPs
hardware guarantees cache coherency
access to memory on other SMPs is slower (NUMA)
Distributed Shared Memory
Easier to build because of slower access to remote memory (no expensive bus/crossbar)
Similar cache problems
Code writers should be aware of data distribution
Load balance: minimize access of far memory
DSM Rationale and Realities
Rationale: combine the ease of SMP programming with the scalability of MPP programming, at much the cost of an MPP
Reality: NUMA introduces additional layers in the memory hierarchy relative to SMPs, so scalability is limited if programmed as an SMP
Reality: performance and high scalability require programming to the architecture
Clustered SMPs
Simpler than DSMs:
composed of nodes connected by a network, like an MPP or cluster
each node is an SMP
processors on one SMP do not share memory on other SMPs (no directory hardware in SMP nodes)
communication between SMP nodes is by message passing
Ex: IBM Power3-based SP systems
Clustered SMP Diagram
[Diagram: two SMP nodes (each with four processors and shared memory on a bus) connected by a network.]
Reasons for Clustered SMPs
Natural extension of SMPs and clusters:
SMPs offer great performance up to their crossbar/bus limit
connecting nodes is how memory and performance are increased beyond SMP levels
can scale to a larger number of processors with a less scalable interconnect
Maximum performance:
optimize at SMP level - no communication overhead
optimize at MPP level - fewer messages necessary for the same number of processors
Clustered SMP Drawbacks
Clustering SMPs has drawbacks:
no shared memory access over the entire system, unlike DSMs
has other disadvantages of DSMs:
extra layer in memory hierarchy
performance requires more effort from the programmer than SMPs or MPPs
However, clustered SMPs provide a means for obtaining very high performance and scalability.
Clustered SMP: NPACI Blue Horizon
IBM SP system:
Power3 processors: good peak performance (~1.5 Gflops)
better sustained performance (highly superscalar and pipelined) than many other processors
SMP nodes have 8 Power3 processors
system has 144 SMP nodes (1,152 processors total)
Programming Clustered SMPs
NSF: most users use only MPI, even for intra-node messages
DoE: most applications are being developed with MPI (between nodes) and OpenMP (intra-node)
MPI+OpenMP programming is more complex, but might yield maximum performance
Active messages and pthreads would theoretically give maximum performance
Types of Parallelism
Data parallelism: each processor performs the same task on different sets or sub-regions of data
Task parallelism: each processor performs a different task
Most parallel applications fall somewhere on the continuum between these two extremes.
Data vs. Task Parallelism
Example of data parallelism:
in a bottling plant, we see several processors, or bottle cappers, applying bottle caps concurrently on rows of bottles
Example of task parallelism:
in a restaurant kitchen, we see several chefs, or processors, working simultaneously on different parts of different meals
A good restaurant kitchen also demonstrates load balancing and synchronization--more on those topics later.
Example: Master-Worker Parallelism
A common form of parallelism used in developing applications years ago (especially in PVM) was Master-Worker parallelism:
a single processor is responsible for distributing data and collecting results (task parallelism)
all other processors perform the same task on their portion of the data (data parallelism)
Parallel Programming Models
The primary programming models in current use are:
Data parallelism - operations are performed in parallel on collections of data structures. A generalization of array operations.
Message passing - processes possess local memory and communicate with other processes by sending and receiving messages.
Shared memory - each processor has access to a single shared pool of memory.
Parallel Programming Models
Most parallelization efforts fall under the following categories:
codes can be parallelized using message-passing libraries such as MPI
codes can be parallelized using compiler directives such as OpenMP
codes can be written in new parallel languages
Programming Models and Architectures
Natural mappings:
data parallel - CM-2 (SIMD machine)
message passing - IBM SP (MPP)
shared memory - SGI Origin, Sun E10000
Implemented mappings:
HPF (a data parallel language) and MPI (a message passing library) have been implemented on nearly all parallel machines
OpenMP (a set of directives, etc. for shared memory programming) has been implemented on most shared memory systems
SPMD
All current machines are MIMD systems (Multiple Instruction, Multiple Data) and are capable of either data parallelism or task parallelism.
The primary paradigm for programming parallel machines is the SPMD paradigm: Single Program, Multiple Data.
each processor runs a copy of the same source code
enables data parallelism (through data decomposition) and task parallelism (through intrinsic functions that return the processor ID), as in the sketch below
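A minimal SPMD sketch (my example; the MPI calls are standard, everything else is illustrative). Every processor runs this same program; the rank returned by MPI_COMM_RANK lets each one choose its own work:

      program spmd
      implicit none
      include 'mpif.h'
      integer ierr, myid, nprocs
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      if (myid .eq. 0) then
c        task parallelism: rank 0 could do coordination work
         write(*,*) 'coordinator of', nprocs, 'processes'
      else
c        data parallelism: other ranks work on their own data
         write(*,*) 'worker', myid
      endif
      call MPI_FINALIZE(ierr)
      end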
OpenMP - Shared Memory Standard
OpenMP is a new standard for shared memory programming: SMPs and cc-NUMAs.
OpenMP provides a standard set of directives, run-time library routines, and environment variables for parallelizing code under a shared memory model.
Very similar to Cray PVP autotasking directives, but with much more functionality. (Cray now supports OpenMP.)
See http://www.openmp.org for more information.
OpenMP Example

Fortran 77:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

Fortran 77 + OpenMP:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
!$OMP PARALLEL DO
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

The !$OMP PARALLEL DO directive specifies that the loop is executed in parallel: each processor executes a subset of the loop iterations.
MPI - Message Passing Standard
MPI has emerged as the standard for message passing in both C and Fortran programs. No longer need to know MPL, PVM, TCGMSG, etc.
MPI is both large and small:
MPI is large, since it contains 125 functions which give the programmer fine control over communications
MPI is small, since message passing programs can be written using a core set of just six functions (sketched below)
MPI Example - Send and Receive
MPI messages are two-way: they require a send and a matching receive.
PE 0 calls MPI_SEND to pass the real variable x to PE 1; PE 1 calls MPI_RECV to receive it into the real variable y:

      if(myid.eq.0) then
         call MPI_SEND(x,1,MPI_REAL,1,100,MPI_COMM_WORLD,ierr)
      endif
      if(myid.eq.1) then
         call MPI_RECV(y,1,MPI_REAL,0,100,MPI_COMM_WORLD,
     &                 status,ierr)
      endif
MPI Example - Global Operations
MPI also has global operations to broadcast and to reduce (collect) information.
PE 5 broadcasts the single (1) integer value n to all other processors:

      call MPI_BCAST(n,1,MPI_INTEGER,5,
     &               MPI_COMM_WORLD,ierr)

PE 6 collects the single (1) integer value n from all other processors and puts the sum (MPI_SUM) into allsum:

      call MPI_REDUCE(n,allsum,1,MPI_INTEGER,MPI_SUM,6,
     &                MPI_COMM_WORLD,ierr)
MPI Implementations
MPI is typically implemented on top of the highest-performance native message passing library for every distributed memory machine.
MPI is a natural model for distributed memory machines (MPPs, clusters).
MPI offers higher performance on DSMs beyond the size of an individual SMP.
MPI is useful between SMPs that are clustered.
MPI can be implemented on shared memory machines.
Extensions to MPI: MPI-2
A standard for MPI-2 has been developed which extends the functionality of MPI. New features include:
one-sided communications - eliminates the need to post matching sends and receives; similar in functionality to the SHMEM PUT and GET on the CRAY T3E (most systems have an analogous library)
support for parallel I/O
extended collective operations
No full implementation yet - it is difficult for vendors.
MPI vs. OpenMP
There is no single best approach to writing a parallel code. Each has pros and cons:
MPI - powerful, general, and universally available message passing library which provides very fine control over communications, but forces the programmer to operate at a relatively low level of abstraction
OpenMP - conceptually simple approach for creating parallel codes on shared memory machines, but not applicable to distributed memory platforms
MPI vs. OpenMP
MPI is the most general (problem types) and portable (platforms, although not efficient for SMPs).
The architecture and the problem type often make the decision for you.
Parallel Libraries
Finally, there are parallel mathematics libraries that enable users to write (serial) codes, then call parallel solver routines:
ScaLAPACK is for solving dense linear systems of equations, eigenvalue and least squares problems. Also see PLAPACK.
PETSc is for solving linear and non-linear partial differential equations (includes various iterative solvers for sparse matrices).
Many others: check NETLIB for a complete survey: http://www.netlib.org
Hurdles in Parallel Computing
There are some hurdles in parallel computing:
Scalar performance: fast parallel codes require efficient use of the underlying scalar hardware
Parallel algorithms: not all scalar algorithms parallelize well; may need to rethink the problem
Communications: need to minimize the time spent doing communications
Load balancing: all processors should do roughly the same amount of work
Amdahl's Law: fundamental limit on parallel computing
Scalar Performance
Underlying every good parallel code is a good scalar code.
If a code scales to 256 processors but only gets 1% of peak performance, it is still a bad parallel code.
Good news: everything that you know about serial computing will be useful in parallel computing!
Bad news: it is difficult to get good performance out of the processors and memory used in parallel machines. Need to use cache effectively.
Serial Performance
[Graph: time vs. number of processors.] In this case, the parallel code achieves perfect scaling, but does not match the performance of the serial code until 32 processors are used.
Use Cache Effectively
[Diagram: a simplified memory hierarchy - a small & fast cache between the CPU and big & slow main memory.]
The data cache was designed with two key concepts in mind:
Spatial locality - the cache is loaded an entire line (4-32 words) at a time, to take advantage of the fact that if a location in memory is required, nearby locations will probably also be required.
Temporal locality - once a word is loaded into cache it remains there until the cache line is needed to hold another word of data.
Non-Cache Issues
There are other issues to consider to achieve good serial performance:
force reductions, e.g., replacement of divisions with multiplications-by-inverse (see the sketch below)
evaluate and replace common sub-expressions
push loops inside subroutines to minimize subroutine call overhead
force function inlining (compiler option)
perform interprocedural analysis to eliminate redundant operations (compiler option)
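A small sketch of the first item (my example): on most processors a floating point divide costs many more cycles than a multiply, so a loop-invariant divisor can be inverted once outside the loop.

      program recip
      parameter (n=1000)
      real x(n), y(n), s, rinv
      integer i
      s = 3.0
c     instead of x(i) = y(i) / s inside the loop,
c     precompute the inverse once and multiply
      rinv = 1.0 / s
      do i = 1, n
         y(i) = real(i)
         x(i) = y(i) * rinv
      enddo
      end

(The two forms can differ in the last bit of rounding, which is why compilers only do this automatically at relaxed optimization settings.)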
Parallel Algorithms
The algorithm must be naturally parallel!
Certain serial algorithms do not parallelize well.
Developing a new parallel algorithm to replace a serial algorithm can be one of the most difficult tasks in parallel computing.
Keep in mind that your parallel algorithm may involve additional work or a higher floating point operation count.
Parallel Algorithms
Keep in mind that the algorithm should:
need the minimum amount of communication (Monte Carlo algorithms are excellent examples)
balance the load among the processors equally
Fortunately, a lot of research has been done in parallel algorithms, particularly in the area of linear algebra. Don't reinvent the wheel; take full advantage of the work done by others:
use parallel libraries supplied by the vendor whenever possible!
use ScaLAPACK, PETSc, etc. when applicable
Load Balancing
[Timelines: PE 0 and PE 1, showing busy time, idle time, and synchronization points.]
The figures show the timeline for parallel codes run on two processors. In both cases, the total amount of work done is the same, but in the second case the work is distributed more evenly between the two processors, resulting in a shorter time to solution.
Communications
Two key parameters of the communications network are:
Latency: time required to initiate a message. This is the critical parameter in fine-grained codes, which require frequent interprocessor communications. Can be thought of as the time required to send a message of zero length.
Bandwidth: steady-state rate at which data can be sent over the network. This is the critical parameter in coarse-grained codes, which require infrequent communication of large amounts of data.
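A common first-order model (mine, not on the slide) ties the two parameters together. With latency t_0 and bandwidth B, the time to send a message of n bytes is roughly

    t(n) = t_0 + n/B

For example, with t_0 = 10 microseconds and B = 100 Mbytes/sec, a 1 KB message takes about 10 + 10 = 20 microseconds - half of it pure latency - while a 1 MB message takes about 10,010 microseconds, essentially all bandwidth-limited.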
Latency and Bandwidth Example
Bucket brigade: the old style of fighting fires, in which the townspeople formed a line from the well to the fire and passed buckets of water down the line.
latency - the delay until the first bucket arrives at the fire
bandwidth - the rate at which buckets arrive at the fire
More on Communications
Time spent performing communications is considered overhead. Try to minimize the impact of communications:
minimize the effect of latency by combining large numbers of small messages into small numbers of large messages
communications and computation do not have to be done sequentially; can often overlap communication and computations
Sequential: t = t(comp) + t(comm)
Overlapped: t = t(comp) + t(comm) - t(overlap)
Combining Small Messages into Larger Ones
The following examples of "phoning home" illustrate the value of combining many small messages into a single larger one.
[Cartoon, many small messages: dial, "Hi mom", hang up; dial, "How are things?", hang up; dial, "in the U.S.?", hang up; dial... At this point many mothers would not pick up the next call.]
[Cartoon, one large message: dial, "Hi mom. How are things in the U.S.? Yak, yak...", hang up.]
By transmitting a single large message, I only have to pay the price for the dialing latency once. I transmit more information in less time.
Overlapping Communications and Computations
In the following example, a stencil operation is performed on a 10 x 10 array that has been distributed over two processors. Assume periodic boundary conditions.

Stencil operation: y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

[Diagram: PE0 and PE1 each hold half the array; boundary elements require data from the neighboring processor, interior elements do not.]

The recipe (sketched in code below):
Initiate communications
Perform computations on interior elements
Wait till communications are finished
Perform computations on boundary elements
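A sketch of that recipe using MPI nonblocking calls (MPI_ISEND, MPI_IRECV, and MPI_WAITALL are standard; the array names and two-PE setup are my illustration):

      program overlap
      implicit none
      include 'mpif.h'
      integer ierr, myid, nbr, i, req(2)
      integer stats(MPI_STATUS_SIZE,2)
      real xedge(10), xghost(10)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
c     run on two PEs: each one's neighbor is the other
      nbr = 1 - myid
      do i = 1, 10
         xedge(i) = real(myid)
      enddo
c     initiate communications (nonblocking)
      call MPI_IRECV(xghost, 10, MPI_REAL, nbr, 7,
     &               MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(xedge, 10, MPI_REAL, nbr, 7,
     &               MPI_COMM_WORLD, req(2), ierr)
c     ... perform computations on interior elements here ...
c     wait till communications are finished
      call MPI_WAITALL(2, req, stats, ierr)
c     ... perform computations on boundary elements here ...
      call MPI_FINALIZE(ierr)
      end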
Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

    t_N = (f_p/N + f_s) t_1    (effect of multiple processors on run time)
    S = 1/(f_s + f_p/N)        (effect of multiple processors on speedup)

where:
    f_s = serial fraction of code
    f_p = parallel fraction of code = 1 - f_s
    N = number of processors
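A quick worked example (numbers mine, not from the slide): with f_s = 0.01 and N = 100,

    S = 1/(0.01 + 0.99/100) = 1/0.0199 = 50.3

so a code that is 99% parallel gets only about half the ideal 100x speedup, and even as N goes to infinity S can never exceed 1/f_s = 100.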
Illustration of Amdahl's Law
[Graph: speedup vs. number of processors for several serial fractions.]
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.
[Graph: speedup vs. number of processors (0-250) for f = 0.99; the measured curve falls increasingly below the Amdahl's Law curve.]
More on Amdahl's Law
Amdahl's Law can be generalized to any two processes with different speeds.
Ex.: apply it to f_processor and f_memory: the growing processor-memory performance gap will undermine our efforts at achieving maximum possible speedup!
Generalized Amdahl's Law
Amdahl's Law can be further generalized to handle an arbitrary number of processes of various speeds. (The total fractions representing each process must still sum to 1.)

    R_avg = 1 / ( sum_{i=1..N} f_i / R_i )

This is a weighted harmonic mean. Application performance is limited by the performance of the slowest component as much as it is determined by the fastest.
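For instance (numbers mine): if half of the work runs at rate R = 10 and half at R = 1,

    R_avg = 1/(0.5/10 + 0.5/1) = 1/0.55 = 1.8

far closer to the slow component than to the fast one.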
Gustafson's Law
Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.
There is a way around this: increase the problem size.
bigger problems mean bigger grids or more particles: bigger arrays
number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases
The 1st Question to Ask Yourself
Before You Parallelize Your Code
Is it worth my time?
Do the CPU requirements justify parallelization?
Do I need a parallel machine in order to get enough aggregate memory?
Will the code be used just once, or will it be a major production code?
Your time is valuable, and it can be very time consuming to write, debug, and test a parallel code. The more time you spend writing a parallel code, the less time you have to spend doing your research.
The 2nd Question to Ask Yourself
Before You Parallelize Your Code
How should I decompose my problem?
Do the computations consist of a large number of small, independent problems - trajectories, parameter space studies, etc.? May want to consider a scheme in which each processor runs the calculation for a different set of data.
Does each computation have large memory or CPU requirements? Will probably have to break up a single problem across multiple processors.
Distributing the Data
Decisions on how to distribute the data should consider these issues:
Load balancing: often implies an equal distribution of data, but more generally means an equal distribution of work
Communications: want to minimize the impact of communications, taking into account both the size and the number of messages
Physics: the choice of distribution will depend on the processes that are being modeled in each direction
(A sketch of a simple block distribution follows.)
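A minimal sketch of the most common choice, a block distribution (my example): each of nprocs processors owns one contiguous range of an n-point domain, so data and work are spread evenly.

      program ranges
      integer myid, nprocs, n, ilo, ihi
      n = 100
      nprocs = 8
c     in an SPMD code each PE would compute only its own range,
c     with myid coming from, e.g., MPI_COMM_RANK
      do myid = 0, nprocs-1
         ilo = myid * n / nprocs + 1
         ihi = (myid + 1) * n / nprocs
         write(*,*) 'PE', myid, ' owns points', ilo, ' to', ihi
      enddo
      end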
A Data Distribution Example
[Diagram: the domain split into blocks in both directions.] A good distribution if the physics of the problem is the same in both directions; it minimizes the amount of data that must be communicated between processors.
[Diagram: the domain split into slabs along one direction.] If expensive global operations need to be carried out in the x-direction (e.g., FFTs), this is probably a better choice.
A More Difficult Example
Imagine that we are doing a simulation in which more work is required for the grid points covering the shaded object.
[Diagram: a grid with an irregular shaded object covering some grid points.]
Neither data distribution from the previous example will result in good load balancing.
May need to consider an irregular grid or a different data structure.
Choosing a Resource
The following factors should be taken into account when choosing a resource:
What is the granularity of my code?
Are there any special hardware features that I need or can take advantage of?
How many processors will the code be run on?
What are my memory requirements?
By carefully considering these points, you can make the right choice of computational platform.
Choosing a Resource: Granularity
Granularity is a measure of the amount of work done by each processor between synchronization events.
[Timelines: PE 0 and PE 1 synchronizing frequently (low-granularity application) vs. rarely (high-granularity application).]
Generally, latency is the critical parameter for low-granularity codes, while processor performance is the key factor for high-granularity applications.
Choosing a Resource: Special Hardware Features
Various HPC platforms have different hardware features that your code may be able to take advantage of. Examples include:
hardware support for divide and square root operations (IBM SP)
parallel I/O file system (IBM SP)
data streams (CRAY T3E)
control over cache alignment (CRAY T3E)
E-registers for bypassing the cache hierarchy (CRAY T3E)
Importance of Parallel Computing
High performance computing has become almost synonymous with parallel computing.
Parallel computing is necessary to solve big problems (high resolution, lots of timesteps, etc.) in science and engineering.
Developing and maintaining efficient, scalable parallel applications is difficult. However, the payoff can be tremendous.
Importance of Parallel Computing
Before jumping in, think about:
whether or not your code truly needs to be parallelized
how to decompose your problem
Then choose a programming model based on your problem and your available architecture.
Take advantage of the resources that are available - compilers, libraries, debuggers, performance analyzers, etc. - to help you write efficient parallel code.
Useful References
Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach.
Patterson, D. A. and Hennessy, J. L., Computer Organization and Design: The Hardware/Software Interface.
Dowd, K., High Performance Computing.
Kuck, D., High Performance Computing. Oxford U. Press (New York), 1996.
Culler, D. and Singh, J. P., Parallel Computer Architecture.
Outline
Preface
What is High Performance Computing?
Parallel Computing
Distributed Computing, Grid Computing, and More
Future Trends in HPC
Distributed Computing
Concept has been used for two decades
Basic idea: run a scheduler across systems to run processes on the least-used systems first
maximize utilization
minimize turnaround time
Have to load executables and input files onto the selected resource
shared file system
file transfers upon resource selection
Examples of Distributed Computing
Workstation farms, Condor flocks, etc.
generally share a file system
SETI@home, Entropia, etc.
only one source code; a central server copies the correct binary code and input data to each system
Napster, Gnutella: file/data sharing
NetSolve: runs numerical kernels on any of multiple independent systems, much like a Grid solution
SETI@home: Global Distributed Computing
Running on 500,000 PCs, ~1000 CPU Years per Day
485,821 CPU Years so far
Sophisticated Data & Signal Processing Analysis
Distributes Datasets from Arecibo Radio Telescope
Distributed vs. Parallel Computing
Different:
distributed computing executes independent (but possibly related) applications on different systems; jobs do not communicate with each other
parallel computing executes a single application across processors, distributing the work and/or data but allowing communication between processes
Non-exclusive: can distribute parallel applications to parallel computing systems
Grid Computing
Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, or trust relationships.
Resources (HPC systems, visualization systems & displays, storage systems, sensors, instruments, people) are integrated via middleware to facilitate use of all resources.
Why Grids?
Resources have different functions, but multiple classes of resources are necessary for most interesting problems.
The power of any single resource is small compared to aggregations of resources.
Network connectivity is increasing rapidly in bandwidth and availability.
Large problems require teamwork and computation.
Network Bandwidth Growth
Network vs. computer performance:
computer speed doubles every 18 months
network speed doubles every 9 months
difference = order of magnitude per 5 years
1986 to 2000: computers x 500, networks x 340,000
2001 to 2010 (projected): computers x 60, networks x 4000
[Graph: Moore's Law vs. storage improvements vs. optical improvements. From Scientific American (Jan-2001), by Cleo Vilett; source Vinod Khosla, Kleiner Perkins Caufield & Byers.]
Grid Possibilities
A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
1,000 physicists worldwide pool resources for petaflop analyses of petabytes of data
Civil engineers collaborate to design, execute, & analyze shake table experiments
Climate scientists visualize, annotate, & analyze terabyte simulation datasets
An emergency response team couples real-time data, weather model, population data
Some Grid Usage Models
Distributed computing: job scheduling on Grid resources with secure, automated data transfer
Workflow: synchronized scheduling and automated data transfer from one system to the next in a pipeline (e.g., HPC system to visualization lab to storage system)
Coupled codes, with pieces running on different systems simultaneously
Meta-applications: parallel apps spanning multiple systems
Grid Usage Models
Some models are similar to models already being used, but are much simpler due to:
single sign-on
automatic process scheduling
automated data transfers
But Grids can encompass new resources like sensors and instruments, so new usage models will arise.
Selected Major Grid Projects
Access Grid - www.mcs.anl.gov/FL/accessgrid (DOE, NSF): create & deploy group collaboration systems using commodity technologies
BlueGrid - IBM: Grid testbed linking IBM laboratories
DISCOM - www.cs.sandia.gov/discom (DOE Defense Programs): create operational Grid providing access to resources at three U.S. DOE weapons laboratories
DOE Science Grid - sciencegrid.org (DOE Office of Science): create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities
Earth System Grid (ESG) - earthsystemgrid.org (DOE Office of Science): delivery and analysis of large climate model datasets for the climate research community
European Union (EU) DataGrid - eu-datagrid.org (European Union): create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics
Selected Major Grid Projects
EuroGrid, Grid Interoperability (GRIP) - eurogrid.org (European Union): create technologies for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus
Fusion Collaboratory - fusiongrid.org (DOE Office of Science): create a national computational collaboratory for fusion research
Globus Project - globus.org (DARPA, DOE, NSF, NASA, Microsoft): research on Grid technologies; development and support of the Globus Toolkit; application and deployment
GridLab - gridlab.org (European Union): Grid technologies and applications
GridPP - gridpp.ac.uk (U.K. eScience): create & apply an operational grid within the U.K. for particle physics research
Grid Research Integration Dev. & Support Center - grids-center.org (NSF): integration, deployment, support of the NSF Middleware Infrastructure for research & education
Selected Major Grid Projects
Grid Application Dev. Software - hipersoft.rice.edu/grads (NSF): research into program development technologies for Grid applications
Grid Physics Network - griphyn.org (NSF): technology R&D for data analysis in physics experiments: ATLAS, CMS, LIGO, SDSS
Information Power Grid - ipg.nasa.gov (NASA): create and apply a production Grid for aerosciences and other NASA missions
International Virtual Data Grid Laboratory - ivdgl.org (NSF): create international Data Grid to enable large-scale experimentation on Grid technologies & applications
Network for Earthquake Eng. Simulation Grid - neesgrid.org (NSF): create and apply a production Grid for earthquake engineering
Particle Physics Data Grid - ppdg.net (DOE Science): create and apply production Grids for data analysis in high energy and nuclear physics experiments
Selected Major Grid Projects
TeraGrid (new) - teragrid.org (NSF): U.S. science infrastructure linking four major resource sites at 40 Gb/s
UK Grid Support Center - grid-support.ac.uk (U.K. eScience): support center for Grid projects within the U.K.
Unicore (BMBFT): technologies for remote access to supercomputers
There are also many technology R&D projects: e.g., Globus, Condor, NetSolve, Ninf, NWS, etc.
Example Application Projects
Earth Systems Grid: environment (US DOE)
EU DataGrid: physics, environment, etc. (EU)
EuroGrid: various (EU)
Fusion Collaboratory (US DOE)
GridLab: astrophysics, etc. (EU)
Grid Physics Network (US NSF)
MetaNEOS: numerical optimization (US NSF)
NEESgrid: civil engineering (US NSF)
Particle Physics Data Grid (US DOE)
Some Grid Requirements - Systems/Deployment Perspective
Identity & authentication
Authorization & policy
Resource discovery
Resource characterization
Resource allocation
(Co-)reservation, workflow
Distributed algorithms
Remote data access
High-speed data transfer
Performance guarantees
Monitoring
Adaptation
Intrusion detection
Resource management
Accounting & payment
Fault management
System evolution
Etc.
The Systems Challenges: Resource Sharing Mechanisms That...
address security and policy concerns of resource owners and users
are flexible enough to deal with many resource types and sharing modalities
scale to large numbers of resources, many participants, many program components
operate efficiently when dealing with large amounts of data & computation
The Security Problem
Resources being used may be extremely valuable & the problems being solved extremely sensitive
Resources are often located in distinct administrativedomains
Each resource may have its own policies & procedures
The set of resources used by a single computation may be large, dynamic, and/or unpredictable
 Not just client/server
Any solution must be broadly available & applicable
 Standard, well-tested, well-understood protocols
 Integration with a wide variety of tools
The Resource Management Problem
Enabling secure, controlled remote access to computational resources and management of remote computation
Authentication and authorization
Resource discovery & characterization
Reservation and allocation
Computation monitoring and control
Grid Systems Technologies
Systems and security problems addressed by new protocols & services. E.g., Globus:
 Grid Security Infrastructure (GSI) for security
 Globus Metadata Directory Service (MDS) for discovery
 Globus Resource Allocation Manager (GRAM) protocol as a basic building block
 Resource brokering & co-allocation services
 GridFTP for data movement
The Programming Problem
How does a user develop robust, secure, long-lived applications for dynamic, heterogeneous Grids?
Presumably need:
 Abstractions and models to add to the speed/robustness/etc. of development
 Tools to ease application development and diagnose common problems
 Code/tool sharing to allow reuse of code components developed by others
Grid Programming Technologies
Grid applications are incredibly diverse (data, collaboration, computing, sensors, ...)
 Seems unlikely there is one solution
Most applications have been written from scratch, with or without Grid services
Application-specific libraries have been shown to provide significant benefits
No new language, programming model, etc., has yet emerged that transforms things
 But certainly still quite possible
Examples of Grid Programming Technologies
MPICH-G2: Grid-enabled message passing
CoG Kits, GridPort: portal construction, based on N-tier architectures
GDMP, Data Grid Tools, SRB: replica management, collection management
Condor-G: simple workflow management
Legion: object models for Grid computing
Cactus: Grid-aware numerical solver framework
Note the tremendous variety and application focus
MPICH-G2: A Grid-Enabled MPI
A complete implementation of the Message Passing Interface (MPI) for heterogeneous, wide-area environments
 Based on the Argonne MPICH implementation of MPI (Gropp and Lusk)
Uses Globus services for authentication, resource allocation, executable staging, output, etc.
Programs run in the wide area without change!
See also: MetaMPI, PACX, STAMPI, MAGPIE
www.globus.org/mpi
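
As an illustration (not from the original slides), here is a minimal sketch of the kind of standard MPI program MPICH-G2 targets; the point above is that code like this can run across geographically distributed, heterogeneous resources without source changes, with Globus handling authentication, allocation, and staging behind the scenes.

    /* hello_mpi.c -- a minimal, standard MPI program. Under MPICH-G2 the
     * same source runs unchanged in a wide-area Grid environment. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* Each process may be running at a different site on the Grid. */
        printf("Process %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }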
Grid Events
Global Grid Forum: working meeting
 Meets 3 times/year, alternates U.S.-Europe, with the July meeting as the major event
HPDC: major academic conference
HPDC-11 in Scotland with GGF-8, July 2002
Other meetings include
IPDPS, CCGrid, EuroGlobus, Globus Retreats
www.gridforum.org, www.hpdc.org
Useful References
Book (Morgan Kaufmann): www.mkp.com/grids
Perspective on Grids: "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," IJSA, 2001
 www.globus.org/research/papers/anatomy.pdf
All URLs in this section of the presentation, especially: www.gridforum.org, www.grids-center.org, www.globus.org
Outline
Preface
What is High Performance Computing?
Parallel Computing
Distributed Computing, Grid Computing, and More
Future Trends in HPC
Value of Understanding Future Trends
Monitoring and understanding future trends in HPC is important:
 users: applications should be written to be efficient on current and future architectures
 developers: tools should be written to be efficient on current and future architectures
 computing centers: system purchases are expensive and should have upgrade paths
The Next Decade
1980s and 1990s:
 academic and government requirements strongly influenced parallel computing architectures
 academic influence was greatest in developing parallel computing software (for science & eng.)
 commercial influence grew steadily in the late 1990s
In the next decade:
 commercialization will become dominant in determining the architecture of systems
 academic/research innovations will continue to drive the development of HPC software
Commercialization
Computing technologies (including HPC) are now propelled by profits, not sustained by subsidies
Web servers, databases, transaction processing, and especially multimedia applications drive the need for computational performance
Most HPC systems are scaled-up commercial systems, with ever less additional hardware and software beyond the commercial versions
It's not engineering, it's economics
Processors and Nodes
Easy predictions:
 microprocessor performance increase continues at ~60% per year (Moore's Law) for 5+ years
 total migration to 64-bit microprocessors
 use of even more cache, more memory hierarchy
 increased emphasis on SMPs
Tougher predictions:
 resurgence of vectors in microprocessors? Maybe
 dawn of multithreading in microprocessors? Yes
Building Fat Nodes: SMPs
More processors are faster, of course
SMPs are the simplest form of parallel system
 efficient if not limited by memory bus contention: small numbers of processors
Commercial market for high-performance servers at low cost drives the need for SMPs
HPC market for highest performance and ease of programming drives the development of SMPs
Building Fat Nodes: SMPs
Trends are to:
 build bigger SMPs
 attempt to share memory across SMPs (cc-NUMA)
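
To make the SMP programming model concrete, here is a minimal illustrative sketch (not from the original slides): on a shared-memory node, OpenMP directives alone can split a loop across the node's processors, and the shared memory bus is the resource whose contention eventually limits scaling.

    /* saxpy_omp.c -- illustrative OpenMP loop for a shared-memory (SMP) node.
     * All threads read and write the same arrays directly; no messages needed.
     * Requires an OpenMP-capable compiler (e.g., a flag such as -fopenmp). */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float x[N], y[N];
        float a = 2.0f;

        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Work is split across the SMP's processors; all memory traffic
         * crosses the shared bus, which is why contention limits scaling
         * to small processor counts. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f using up to %d threads\n", y[0], omp_get_max_threads());
        return 0;
    }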
Resurgence of Vectors
Vectors keep functional units busy
 vector registers are very fast
 vectors are more efficient for loops of any stride (see the sketch after this list)
 vectors are great for many science & eng. apps
Possible resurgence of vectors:
 SGI/Cray has built the SV1ex and is building the SV2
 NEC continues building (CMOS) parallel-vector, Cray-like systems
 Microprocessors (Pentium4, G4) have added vector-like functionality for multimedia purposes
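
As a hedged illustration of the stride point above (example mine, not from the slides): a vector machine's load/store units sustain loops like the one below at full speed for any constant stride, while a cache-based microprocessor fetches whole cache lines and wastes most of each line once the stride grows.

    /* stride_sum.c -- illustrative strided reduction. A candidate vector
     * loop: vector hardware handles any constant stride efficiently;
     * cache-line-based microprocessors do not. */
    #include <stdio.h>

    #define N (1 << 20)

    float strided_sum(const float *a, int n, int stride)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i += stride)
            s += a[i];
        return s;
    }

    int main(void)
    {
        static float a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0f;
        printf("stride 1: %f\n", strided_sum(a, N, 1));
        printf("stride 8: %f\n", strided_sum(a, N, 8));
        return 0;
    }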
Dawn of Multithreading?
Memory speed will always be a bottleneck
Must overlap computation with memory accesses: tolerate latency (a software sketch follows below)
 requires an immense amount of parallelism
 requires processors with multiple streams and compilers that can define multiple threads
Multithreading Diagram (figure)
Multithreading
Tera MTA was the first multithreaded HPC system
scientific success, production failure
MTA-2 will be delivered in a few months.
Multithreading will be implemented (in more limited fashion) in commercial processors.
Networks
Commercial network bandwidth and latency are approaching custom performance.
Dramatic performance increases likely
the network is the computer (Sun slogan)
more companies, more competition
no severe physical, economic limits
Implications of faster networks
more clusters
collaborative, visual supercomputing
Grid computing
Commodity Clusters
Clusters provide some real advantages:
 computing power: leverage workstations and PCs
 high availability: replace one at a time
 inexpensive: leverage existing competitive market
 simple path to installing a parallel computing system
Major disadvantages were robustness of hardware and software, but both have improved
NCSA has huge clusters in production based on Pentium III and Itanium.
Clustering SMPs
Inevitable (already here!):
 leverages SMP nodes effectively, for the same reasons clusters leverage individual processors
Commercial markets drive the need for SMPs
Combine advantages of SMPs and clusters:
 more powerful nodes through multiprocessing
 more powerful nodes -> more powerful cluster
 interconnect scalability requirements reduced for the same number of processors
Continued Linux Growth in HPC
Linux popularity growing due to price and availability of source code
Major players now supporting Linux, esp. IBM
Head start on Intel Itanium
Programming Tools
However, programming tools will continue to lag behind hardware and OS capabilities:
 Researchers will continue to drive the need for the most powerful tools to create the most efficient applications on the largest systems
 Such technologies will look more like MPI than the Web, maybe worse due to multi-tiered clusters of SMPs (MPI + OpenMP; active messages + threads?); a hybrid sketch follows this list
 Academia will continue to play a large role in HPC software development.
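
To show what the "MPI + OpenMP" combination means in practice, here is a minimal hybrid sketch (illustrative; the compile command is an assumption that varies by installation): MPI connects the cluster's SMP nodes while OpenMP uses the processors within each node.

    /* hybrid.c -- illustrative MPI + OpenMP sketch for a cluster of SMPs.
     * MPI ranks map to nodes; OpenMP threads use processors within a node.
     * Build with an MPI wrapper plus an OpenMP flag, e.g.:
     *   mpicc -fopenmp hybrid.c -o hybrid   (flags vary by installation) */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, size;
        static double a[N];
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank (node) owns a contiguous block of the array. */
        int chunk = N / size;
        int lo = rank * chunk;
        int hi = (rank == size - 1) ? N : lo + chunk;

        for (int i = lo; i < hi; i++) a[i] = 1.0;

        /* Within the node, OpenMP threads share the block. */
        #pragma omp parallel for reduction(+:local)
        for (int i = lo; i < hi; i++)
            local += a[i] * a[i];

        /* Between nodes, MPI combines the partial results. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("sum = %f\n", total);
        MPI_Finalize();
        return 0;
    }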
Grid Computing
Parallelism will continue to grow in the form of:
 SMPs
 clusters
 clusters of SMPs (and maybe DSMs)
Grids provide the next level
connects multiple computers into virtual systems
Already here: IBM, other vendors supporting Globus
SC2001 dominated by Grid technologies
Many major government awards (>$100M in past year)
Emergence of Grids
But Grids enable much more than apps running on multiple computers (which can be achieved with MPI alone)
 virtual operating system: provides a global workspace/address space via a single login
 automatically manages files, data, accounts, and security issues
 connects other resources (archival data facilities, instruments, devices) and people (collaborative environments)
Grids Are Inevitable
Inevitable (at least in HPC):
 leverages computational power of all available systems
 manages resources as a single system, which is easier for users
 provides the most flexible resource selection and management, load sharing
 researchers' desire to solve bigger problems will always outpace performance increases of single systems; just as multiple processors are needed, multiple multiprocessors will be deemed so
Grid-Enabled Software
Commercial applications on single parallel systems and Grids will require that:
 underlying architectures must be invisible: no parallel computing expertise required
 usage must be simple
 development must not be too difficult
Developments in ease-of-use will benefit scientists as users (not as developers)
Web-based interfaces: transparent supercomputing (MPIRE, Meta-MEME, etc.).
Grid-Enabled Collaborative and Visual Supercomputing
Commercial world demands:
multimedia applications
real-time data processing
online transaction processing
rapid prototyping and simulation in engineering, chemistry, and biology
interactive, remote collaboration
3D graphics, animation and virtual reality
visualization
Grid-Enabled Collaborative, Visual Supercomputing
Academic world will leverage the resulting Grids linking computing and visualization systems via high-speed networks:
 collaborative post-processing of data (already here)
 simulations will be visualized in 3D, virtual worlds in real time
 such simulations can then be steered
 multiple scientists can participate in these visual simulations
 the "time to insight" (SGI slogan) will be reduced
Web-based Grid Computing
Web currently used mostly for content delivery
Web servers on HPC systems can execute applications
Web servers on Grids can launch applications, move/store/retrieve data, display visualizations, etc.
NPACI HotPage already enables single sign-on to NPACI Grid resources
Summary of Expectations
HPC systems will grow in performance but probably change little in design (5-10 years):
 HPC systems will be larger versions of smaller commercial systems, mostly large SMPs and clusters of inexpensive nodes
 Some processors will exploit vectors, as well as more/larger caches.
 The best HPC systems will have been designed top-down instead of bottom-up, but all will have been designed to make the bottom profitable.
 Multithreading is the only likely near-term major architectural change.
Summary of Expectations
Using HPC systems will change much more:
 Grid computing will become widespread in HPC and in commercial computing
 Visual supercomputing and collaborative simulation will be commonplace.
 WWW interfaces to HPC resources will make transparent supercomputing commonplace.
But programming the most powerful resources most effectively will remain difficult.
Caution
Change is difficult to predict (and I am an astrophysicist, not an astrologer):
 Accuracy of linear-extrapolation predictions degrades over long times (like weather forecasts)
 Entirely new ideas can change everything: the WWW is an excellent example; Grid computing is probably the next
 Eventually, something truly different will replace CMOS technology (nanotechnology? molecular computing? DNA computing?)
Final Prediction
"The thing about change is that things will be different afterwards."
Alan McMahon (Cornell University)