Shared-Memory Hardware Parallel Programming Concepts Winter Term 2013 / 2014
Dr. Peter Tröger, M.Sc. Frank Feinbube
Shared-Memory Hardware
■ Hardware architecture: Processor(s), memory system(s), data path(s)
□ Each component may become the performance bottleneck
□ Each component can be replicated
□ Each parallelization target must be handled separately
■ Modern processor
□ Multiple instructions in the same cycle
□ Multiple concurrent instruction streams per functional unit
□ Multiple functional units (cores)
■ Combination of multiple processors with one shared memory
■ Logical hardware setup as seen by the software; the physical hardware organization typically differs
Shared-Memory Hardware
3
[Figure: instruction cycle - fetch instruction and update program counter (PC), decode instruction, execute instruction, write back result]
Shared-Memory Hardware
■ Central Processing Units (CPUs) + volatile memory + I/O devices
■ Fetch instruction and execute it - typically memory access, computation, and / or I/O
4 [Stallings]
■ I/O devices and memory controller may interrupt the instruction processing
■ Improve processor utilization by asynchronous operations
RISC vs. CISC
■ RISC - Reduced Instruction Set Computer
□ MIPS, ARM, DEC Alpha, Sparc, IBM 801, Power, etc.
□ Small number of instructions
□ Few data types in hardware
□ Constant instruction size, few addressing modes
□ Relies on optimization in software
■ CISC - Complex Instruction Set Computer
□ VAX, Intel x86, IBM 360/370, etc.
□ Large number of complex instructions, which may take multiple cycles
□ Variable-length instructions
□ Smaller code size
□ Focus on optimization in hardware
■ RISC designs lend themselves to the exploitation of instruction-level parallelism
Shared-Memory Hardware
■ Major constraints of memory are amount, speed, and costs
□ Faster access time results in greater costs per bit
□ Greater capacity results in smaller costs per bit
□ Greater capacity results in slower access
■ Going down the memory hierarchy
□ Decreasing costs per bit
□ Increasing capacity for fixed costs
□ Increasing access time
■ I/O devices provide non-volatile memory on the lower levels, which is an additional advantage
6
[http://tjliu.myweb.hinet.net/]
Shared-Memory Hardware
■ Principle of Locality
□ Memory referenced by a processor (program and data) tends to cluster (e.g. loops, subroutines)
□ Operations on tables and arrays involve access to clustered data sets
□ Temporal locality: If a memory location is referenced, it will tend to be referenced again soon
□ Spatial locality: If a memory location is referenced, locations whose addresses are close by will tend to be referenced soon (a traversal sketch follows below)
■ Data should be organized so that the percentage of accesses to lower levels is substantially less than to the level above
■ Typically implemented by the caching concept
7
[Stallings]
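As an illustration of spatial and temporal locality (a sketch not taken from the slides; the function names and the row-major layout are assumptions for the example), traversal order alone decides how well a loop exploits the cache:

#include <stddef.h>

/* Summing an n x n matrix stored in row-major order. The row-wise loop
   walks consecutive addresses and exploits spatial locality; the
   column-wise loop jumps n elements per access and causes far more
   cache misses for large n. */
double sum_row_major(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)        /* rows    */
        for (size_t j = 0; j < n; j++)    /* columns */
            sum += a[i * n + j];          /* consecutive addresses */
    return sum;
}

double sum_column_major(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];          /* stride of n elements */
    return sum;
}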
Shared-Memory Hardware
■ Caching
□ Offers a portion of lower-level memory as a copy in the faster, smaller memory
□ Leverages the principle of locality
□ Processor caches work in hardware, but must be considered by the operating system
8
[Stallings]
Shared-Memory Hardware
■ Conflicting caching design goals
□ Cache size per level
□ Number of cache levels
□ Block size exchanged with lower-level memory
□ Replacement algorithm
□ Mapping function
□ Write policy for modified cache lines
■ All decisions are made by the hardware vendor, but must be considered by software
■ Cache-optimized software is needed when parallelization improvements start to depend on memory bottlenecks
9
[Stallings]
Parallel Processing
■ Inside the processor
□ Instruction-level parallelism (ILP)
□ Multicore
□ Shared memory
■ With multiple processing elements in one machine
□ Multiprocessing
□ Shared memory
■ With multiple processing elements in many machines
□ Multicomputer
□ Shared nothing (in terms of a globally accessible memory)
10
Instruction-Level Parallelism
■ Hardware optimizes sequential instruction stream execution
□ Pipelining architecture
◊ Sub-steps of sequential instructions are overlapped in their execution to increase throughput
◊ Traditional concept in processor hardware design
◊ Relies on mechanisms such as branch prediction or out-of-order execution of instructions
□ Superscalar architecture
◊ Execution of multiple instructions in parallel, based on redundant functional units of the processor
◊ Very Long Instruction Word (VLIW)
◊ Explicitly Parallel Instruction Computing (EPIC)
◊ SIMD vectorization support with special instructions
11
Pipelining
12
Pipelining
■ Pipelining overlaps the stages of instruction execution
■ Fetch, decode, and execute happen in parallel
□ Increases instruction throughput at the same clock speed
■ Analogous to the assembly line concept
■ Pipelining hazards
□ Temporal dependencies between sub-steps
□ Influence on speedup
□ Structural hazard: Multiple instructions access the same resource
□ Data hazard: An instruction needs the result of the previous instruction
□ Control hazard: The instruction result changes the control flow (interrupt, branch)
13
Multi-Cycle Pipelining
14
Pipelining Conflicts
■ Conflict resolution strategies
□ Replicate commonly needed hardware units
□ Insert NOPs to change the timing
□ Reorder the instructions to change the timing (see the sketch below)
◊ Write-after-read effect: An instruction uses a register value that is overwritten by a subsequent instruction
◊ Write-after-write effect: Two subsequent instructions write into the same register
◊ Read-after-write effect: Reliance on the previous result
□ Stall the pipeline until the conflict is resolved
■ Control conflicts are targeted by branch prediction
□ Static branch prediction (forward never - backward ever) vs. dynamic branch prediction (based on previous jumps)
■ Issues typically targeted by compiler and processor hardware
15
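As an illustrative sketch (not from the slides; the function names and values are made up), the following C fragment shows a read-after-write dependency and how an independent operation can be scheduled between the producer and the consumer so the pipeline does not have to stall; compilers perform this kind of reordering automatically:

/* dependent back-to-back: 'a' is consumed immediately after it is produced */
int dependent(int x, int y, int z)
{
    int a = x * y;      /* produces a                        */
    int b = a + z;      /* consumes a at once -> potential stall */
    int c = x - y;      /* independent work                  */
    return b + c;
}

/* reordered: the independent subtraction fills the slot between the
   multiply and the add that needs its result */
int reordered(int x, int y, int z)
{
    int a = x * y;      /* produces a                        */
    int c = x - y;      /* independent, executes meanwhile   */
    int b = a + z;      /* a is ready (or nearly ready) now  */
    return b + c;
}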
Superscalar Architectures
16
Superscalar Architectures
■ Good for problems with a high degree of regularity, such as graphics / image processing
■ Typically exploits data parallelism
■ Today: GPGPU computing, Cell processor, SSE, AltiVec
17
[Figures: ILLIAC IV (1974), Cray Y-MP, Thinking Machines CM-2 (1985), Fermi GPU]
Superscalar Architectures
■ Vector instructions for high-level operations on data sets
■ Became famous with the Cray architecture in the 70's
■ Today, vector instructions are part of the standard instruction set
□ AltiVec
□ Streaming SIMD Extensions (SSE)
■ Example: Vector addition
18
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

movaps xmm0, address-of-v1
(xmm0 = v1.w | v1.z | v1.y | v1.x)
addps xmm0, address-of-v2
(xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x)
movaps address-of-vec_res, xmm0
Streaming SIMD Extensions (SSE)
■ Introduced by Intel with the Pentium III (1999)
■ Specifically designed for floating-point and vector operations
■ New 128-bit registers can be packed with four 32-bit scalars
□ An operation is performed simultaneously on all of them
■ Typical operations
□ Move data between SSE registers and 32-bit registers / memory
□ Add, subtract, multiply, divide, square root, maximum, minimum, reciprocal, compare, bitwise AND / OR / XOR
■ Available as compiler intrinsics (see the sketch below)
□ Functions known by the compiler that map to assembler
□ Better performance than with a linked library
19
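As a sketch of such an intrinsic (not from the slides; the function name vec4_add is made up), the four-element vector addition from the previous slide can be written with SSE compiler intrinsics, which map directly to movups / addps / movups instructions:

#include <xmmintrin.h>   /* SSE intrinsics */

void vec4_add(const float *v1, const float *v2, float *vec_res)
{
    __m128 a = _mm_loadu_ps(v1);          /* load four packed floats      */
    __m128 b = _mm_loadu_ps(v2);
    __m128 r = _mm_add_ps(a, b);          /* four additions in one op     */
    _mm_storeu_ps(vec_res, r);            /* store four packed floats     */
}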
Other Instruction Set Extensions
■ Fused Multiply-Add instructions (FMA)
□ Supported in different variations by all processors
□ Floating-point multiply-add operation performed in one step
□ Improves speed and accuracy of product accumulation (see the sketch below)
◊ Scalar product
◊ Matrix multiplication
◊ Efficient software implementation of square root and division
■ Intel Advanced Vector Extensions (AVX)
□ Extension of the SSE instruction set
□ Introduced with the Sandy Bridge architecture (2011)
□ Registers are now 256 bit wide
□ 512-bit support announced for the 2015 version of Xeon Phi
20
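A minimal sketch of product accumulation with fused multiply-add (not from the slides; the dot() helper is made up), using the standard C99 fma() function, which compilers can lower to an FMA hardware instruction: each step computes a[i]*b[i] + sum with a single rounding, which is exactly what improves the accuracy of a scalar product.

#include <math.h>    /* fma() from C99 */

double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum = fma(a[i], b[i], sum);   /* sum = a[i]*b[i] + sum, fused */
    return sum;
}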
Very Long Instruction Word (VLIW)
■ Very Long Instruction Word (VLIW), Fisher et al., 1980's
□ The compiler identifies instructions to be executed in parallel
□ One VLIW instruction encodes several operations (at least one for each redundant execution unit)
□ Less hardware complexity, higher compiler complexity
□ VLIW processors are typically designed with multiple RISC execution units
□ Very popular in the embedded market and in GPU hardware
■ Explicitly Parallel Instruction Computing (EPIC)
□ Coined by the HP-Intel alliance since 1997
□ Foundational concept of the Intel Itanium architecture
□ Extended version of the VLIW concept
□ Turned out to be extremely difficult for compilers
21
EPIC
■ 64-bit register-rich explicitly-parallel architecture ■ Implements predication, speculation, and branch prediction
□ Hardware register renaming for parameter passing
□ Parallel execution of loops
■ Speculation, prediction, and renaming controlled by compiler
□ Each 128-bit instruction word contains three instructions
□ Stop-bits control parallel execution
□ Processor can execute six instructions per clock cycle
□ Thirty execution units for subsets of instruction set in eleven groups
□ Each unit executes at a rate of one instruction per cycle (unless stalled)
□ Common instructions can be executed in multiple units
22
Itanium – 30 Functional Units
23
Simultaneous Multi-Threading (SMT)
■ Reasons for bad performance in superscalar architectures depend on application
■ Dynamically schedule the functional unit usage
■ Support multiple instruction streams in one pipeline
24
[Tullsen et al., 1995]
Hyperthreading
■ Intel's implementation of simultaneous multi-threading (SMT)
■ Allows an execution core to function as two logical processors
■ Main goal is to reduce the number of related instructions being in the pipeline at the same time
■ Works nicely on a cache miss, branch misprediction, or data dependencies in one of the threads
■ Most core hardware resources are shared
□ Caches, execution units, buses
■ Each logical processor has its own architectural state
□ The register bank is mirrored
■ Mainly enables very fast thread context switches purely in hardware
■ More than two logical threads per core would saturate the memory connection and pollute the caches
25
Hyperthreading
[Intel]
Parallel Processing
■ Inside the processor
□ Instruction-level parallelism (ILP)
□ Multicore
□ Shared memory
■ With multiple processing elements in one machine
□ Multiprocessing
□ Shared memory
■ With multiple processing elements in many machines
□ Multicomputer
□ Shared nothing (in terms of a globally accessible memory)
27
Chip Multi-Processing
■ One integrated circuit die (socket) contains multiple computational engines (cores)
□ Called many-core or multi-core architecture
□ Cores share some / all cache levels and the memory connection
□ All other parts are dedicated per core (pipeline, registers, ...)
■ Increasing core counts lead to a resource contention problem with caches and memory
■ Besides Intel / AMD, also available with ARM, MIPS, PPC
■ Multi-core vs. SMP
□ SMP demands more replicated hardware (fans, bus, ...)
□ SMP is a choice, multi-core is given by default
□ Cores typically have a lower clock frequency
■ Multi-core and SMP programming problems are very similar
■ Recent trend towards heterogeneous cores
28
Many-Core / Multi-Core
Intel Core i7 SPARC64 VIIIfx
Parallel Processing
■ Inside the processor
□ Instruction-level parallelism (ILP)
□ Multicore
□ Shared memory
■ With multiple processing elements in one machine
□ Multiprocessing
□ Shared memory
■ With multiple processing elements in many machines
□ Multicomputer
□ Shared nothing (in terms of a globally accessible memory)
30
Multiprocessor: Flynn's Taxonomy (1966)
■ Classifies multiprocessor architectures along the instruction and data processing dimensions
31
[Figures (C) Blaise Barney: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); Multiple Instruction, Multiple Data (MIMD)]
Multiprocessor Systems
■ Symmetric Multiprocessing (SMP)
□ Set of equal processors in one system (more SM-MIMD than SIMD)
□ Traditionally memory bus, today on-chip network
□ Demands synchronization and operating system support
■ Asymmetric multiprocessing (ASMP)
□ Specialized processors for I/O, interrupt handling or operating system (DEC VAX 11, OS-360, IBM Cell processor)
□ Typically master processor with main memory access and slaves
32
Symmetric Multi-Processing
■ Two or more processors in one system, can perform the same operations (symmetric)
■ Processors share the same main memory and all devices
■ Increased performance and scalability for multi-tasking
■ No master, any processor can cause another to reschedule
■ Challenges for an SMP operating system:
□ Reentrant kernel, scheduling policies, synchronization, memory re-use, ...
[Stallings]
Shared Memory
■ All processors act independently and use the same global address space; changes in one memory location are visible to all others
■ Uniform memory access (UMA) system
□ Equal load and store access for all processors to all memory
□ Default approach for SMP systems of the past
■ Non-uniform memory access (NUMA) system
□ Groups of physical processors (called "nodes") that have local memory, connected by some interconnect
□ Still an SMP system (e.g. any processor can access all of memory), but node-local memory is faster (see the first-touch sketch below)
□ The OS tries to schedule related activities on the same node
□ Became the default model in shared-memory architectures
□ Cache-coherent NUMA (CC-NUMA) in hardware
34
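A hedged sketch of NUMA-aware allocation (not from the slides; the function name is made up, and a first-touch page placement policy is assumed, as is the default on Linux): the thread that first writes a page determines its home node, so initializing the array with the same parallel loop layout that later processes it keeps most accesses node-local. OpenMP is used for brevity; compile with -fopenmp.

#include <stdlib.h>

double *numa_friendly_alloc(long n)
{
    double *a = malloc((size_t)n * sizeof(double));
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;      /* first touch happens on the thread (and node) that will use it */
    return a;
}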
UMA Example
35
Shared-Memory with UMA
■ Two dual-core chips (2 cores/socket)
■ P = processor core; L1D = level 1 cache - data (fastest); L2 = level 2 cache (fast); Memory = main memory (slow); Chipset = enforces cache coherence and mediates connections to memory
■ UMA systems use a 'flat memory model': Latencies and bandwidth are the same for all processors and all memory locations
■ Also called Symmetric Multiprocessing (SMP)
[3] Introduction to High Performance Computing for Scientists and Engineers
NUMA Example
36
Shared-Memory with ccNUMA
■ Eight cores (4 cores/socket); L3 = level 3 cache
■ Memory interface = establishes a coherent link to enable one 'logical' single address space over 'physically distributed memory'
■ ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems)
■ Network logic makes the aggregated memory appear as one single address space
[3] Introduction to High Performance Computing for Scientists and Engineers
NUMA Example: Intel Nehalem
37
[Figure: four Nehalem sockets, each with four cores, a shared L3 cache, an integrated memory controller with local memory, and I/O; the sockets are connected by QPI links]
CC-NUMA
38
[Schöne et al.]
■ Central crossbar ("uncore") for the interaction of cores, memory controller, and other processors via QPI
■ Similar approach by other vendors
■ Extended versions of the MESI cache coherence protocol are used for L3 management
CC-NUMA
39
■ Cache coherency in a multi-core multi-socket system
□ Extended version of the traditional cache coherency problem in multi-socket SMP systems
■ Application of the extended MESI cache coherence protocol in QPI (a false-sharing sketch follows below)
□ Each cache line has one state
◊ Modified - written by the local core
◊ Exclusive - first read by the local core
◊ Shared - read by two cores (cache hit)
● A write attempt in this state leads to cache invalidation
● The new state is Modified
◊ Invalid - the cache line contains no valid data (read miss)
◊ Forwarding (new) - direct L3 exchange of data
□ Can be optimized by snooping into other caches
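An illustrative sketch of false sharing, a direct consequence of MESI-style coherence (not from the slides; the struct and function names are made up, and a 64-byte cache line is assumed): two threads that update different counters located on the same cache line force the line to bounce between caches in the Modified state; padding each counter to its own line avoids the ping-pong. The caller must pass one padded_counter per thread.

#include <omp.h>

#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* keep each counter on its own cache line */
};

void count_events(struct padded_counter *counters, long iterations)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < iterations; i++)
            counters[t].value++;           /* no invalidations caused by other threads */
    }
}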
Hypertransport
■ Specification of an I/O interconnect, originally developed by AMD, Alpha and API Networks in 2001
■ Point-to-point unidirectional links between components
□ At least one host device (typically a processor)
□ Bridge functionality to PCI, PCI-X, PCI Express, ...
□ Tunnel devices connect a link to other HT devices
■ Extremely low overhead, suitable for inter-processor communication in SMP hardware
40
[hypertransport.org]
Hypertransport
41
[hypertransport.org]
Quick Path Interconnect (QPI)
■ Competing technology from Intel, since 2008 ■ Result of a continuous improvement in Intel processor
interconnect technology
42
[From "An Introduction to the Intel® QuickPath Interconnect", p. 7:]
Figure 3. Shared front-side bus, up until 2004 (up to 4.2 GB/s platform bandwidth)
To further increase the bandwidth of front-side-bus based platforms, the single shared bus approach evolved into dual independent buses (DIB), as depicted in Figure 4. DIB designs essentially doubled the available bandwidth. However, all snoop traffic had to be broadcast on both buses, and if left unchecked, would reduce effective bandwidth. To minimize this problem, snoop filters were employed in the chipset to cache snoop information, thereby significantly reducing bandwidth loading.
Figure 4. Dual independent buses, circa 2005 (up to 12.8 GB/s platform bandwidth)
The DIB approach was extended to its logical conclusion with the introduction of dedicated high-speed interconnects (DHSI), as shown in Figure 5. DHSI-based platforms use four FSBs, one for each processor in the platform. Again, snoop filters were employed to achieve bandwidth scaling.
Figure 5. Dedicated high-speed interconnects, 2007 (up to 34 GB/s platform bandwidth)
Traditional Shared Frontside Bus (until 2004)
Dual Independent Buses (until 2005)
[intel.com]
Quick Path Interconnect (QPI)
43
Dedicated Interconnects (until 2007)
Quick Path Interconnect
[From "An Introduction to the Intel® QuickPath Interconnect", p. 8:]
With the production of processors based on next-generation, 45-nm Hi-k Intel® Core™ microarchitecture, the Intel® Xeon® processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using Intel® QuickPath Interconnects. This configuration is shown in Figure 6. With its narrow uni-directional links based on differential signaling, the Intel® QuickPath Interconnect is able to achieve substantially higher signaling rates, thereby delivering the processor interconnect bandwidth necessary to meet the demands of future processor generations.
Figure 6. Intel® QuickPath Interconnect (four processors, each with its own memory interface, connected to each other and to two I/O chipsets by uni-directional links)
Interconnect Overview
The Intel® QuickPath Interconnect is a high-speed point-to-point interconnect. Though sometimes classified as a serial bus, it is more accurately considered a point-to-point link, as data is sent in parallel across multiple lanes and packets are broken into multiple parallel transfers. It is a contemporary design that uses some techniques similar to other point-to-point interconnects, such as PCI Express* and Fully-Buffered DIMMs. There are, of course, some notable differences between these approaches, which reflect the fact that these interconnects were designed for different applications. Some of these similarities and differences will be explored later in this paper.
Figure 7 shows a schematic of a processor with external Intel® QuickPath Interconnects. The processor may have one or more cores. When multiple cores are present, they may share caches or have separate caches. The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel® QuickPath Interconnect port (a port contains a pair of uni-directional links).
Figure 7. Block diagram of a processor with Intel® QuickPath Interconnects (processor cores, integrated memory controller(s) with memory interface, and a crossbar router / non-routing global links interface attached to the QuickPath interconnects)
[intel.com]
Scalable Coherent Interface
■ ANSI / IEEE standard for a NUMA interconnect, used in the HPC world
□ 64-bit global address space, translation by the SCI bus adapter (I/O window)
■ Used as a 2D / 3D torus
44
[Figure: two SMP nodes, one with processors A and B and one with processors C and D, each processor with a cache and each node with local memory; each node attaches through an SCI cache and SCI bridge to the SCI interconnect, ...]
Theoretical Models for Parallel Hardware
■ Better to use a simplified parallel machine model than a real hardware specification for parallelization optimization
□ Allows theoretical investigation of algorithms
□ Allows generic optimization, regardless of products
□ Should improve algorithm robustness by avoiding optimizations for hardware layout specialties (e.g. network topology)
□ Became popular in the 70's and 80's, due to the large diversity in parallel hardware design
■ The resulting computational model is independent from the programming model used for the implementation
■ Vast body of theoretical research results
■ Typically, formal models adapt to hardware developments
45
(Parallel) Random Access Machine
■ RAM assumptions: Constant memory access time, unlimited memory
■ PRAM assumptions: Non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
■ Alternative models: BSP, LogP
46
[Figures: RAM - a single CPU connected to input, memory, and output; PRAM - multiple CPUs attached to a shared bus, connected to input, memory, and output]
PRAM Extensions
■ Rules for memory interaction to classify the hardware support of a PRAM algorithm
■ Memory access is assumed to be in lockstep (synchronous PRAM)
■ Concurrent Read, Concurrent Write (CRCW)
□ Multiple tasks may read from / write to the same location at the same time; can be simulated with EREW
■ Concurrent Read, Exclusive Write (CREW)
□ Only one task may write to a given memory location at any time
■ Exclusive Read, Concurrent Write (ERCW)
□ Only one task may read from a given memory location at any time
■ Exclusive Read, Exclusive Write (EREW)
□ Only one task may read from / write to a memory location at any time; memory management must know the concurrency
47
PRAM Extensions
■ The concurrent write scenario needs further specification by the algorithm
□ Ensure that the same value is written
□ Selection of an arbitrary value from the parallel write attempts
□ Priority of the written value derived from the processor ID
□ Store the result of a combining operation (e.g. sum) into memory
■ A PRAM algorithm can act as the starting point for a real implementation
□ Unlimited resource assumption
□ Allows mapping 'logical' PRAM processors to a restricted number of physical processors
□ Enables the design of scalable algorithms based on the unlimited memory assumption
□ Focus only on concurrency opportunities; synchronization and communication come later
48
Example: Parallel Sum
■ The general parallel sum operation works with any associative and commutative combining operation
□ Multiplication, maximum, minimum, logical operations, ...
■ PRAM solution
□ Build a binary tree, with the input data items as leaf nodes
□ Internal nodes hold partial sums, the root node the global sum
□ Additions on one level are independent from each other
■ PRAM algorithm
◊ One processor per leaf node, in-place summation
◊ Computation in O(log2 n)
49
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += A[i];
}
Example: Parallel Sum
■ Example, n = 8:
□ l=1: Partial sums in X[1], X[3], X[5], X[7]
□ l=2: Partial sums in X[3] and X[7]
□ l=3: Parallel sum result in X[7]
■ Correctness relies on the PRAM lockstep assumption (no synchronization); a shared-memory sketch follows after the pseudocode
50
for all l levels (1..log2 n) {
    for all i items (0..n-1) {
        if (((i+1) mod 2^l) == 0) then
            X[i] := X[i - 2^(l-1)] + X[i]
    }
}
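A shared-memory sketch of the tree summation above (not the PRAM model itself: real threads and the implicit barrier at the end of each parallel loop replace the lockstep processors). It assumes n is a power of two and uses OpenMP for brevity; the function name is made up.

#include <omp.h>

long tree_sum(long *X, long n)
{
    /* At level with the given stride, every (2*stride)-th element adds the
       partial sum that sits 'stride' positions to its left. The inner loop
       iterations are independent, mirroring the PRAM assumption. */
    for (long stride = 1; stride < n; stride *= 2) {
        #pragma omp parallel for
        for (long i = 2 * stride - 1; i < n; i += 2 * stride)
            X[i] += X[i - stride];
    }
    return X[n - 1];    /* global sum ends up in the last element */
}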
Bulk-Synchronous Parallel (BSP) Model
■ Leslie G. Valiant, A Bridging Model for Parallel Computation, 1990
■ Success of the von Neumann model
□ Bridge between hardware and software
□ High-level languages can be efficiently compiled for this model
□ Hardware designers can optimize the realization of this model
■ Similar model for parallel machines
□ Should be neutral about the number of processors
□ Programs should be written for v virtual processors that are mapped to p physical ones
□ When v >> p, the compiler has options
■ A BSP computation consists of a series of supersteps (see the sketch below):
□ 1.) Concurrent computation on all processors
□ 2.) Exchange of data between all processes
□ 3.) Barrier synchronization
51
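A rough shared-memory approximation of the superstep structure (not from the slides; BSP targets message passing, so this only illustrates the three phases). OpenMP barriers stand in for the global synchronization, and compute() and exchange() are placeholder stubs.

#include <omp.h>

static void compute(int p, int s)  { (void)p; (void)s; /* local work of process p in superstep s */ }
static void exchange(int p, int s) { (void)p; (void)s; /* deliver data produced in this superstep */ }

void bsp_style_run(int supersteps)
{
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        for (int s = 0; s < supersteps; s++) {
            compute(p, s);           /* 1.) concurrent computation        */
            #pragma omp barrier      /* make local results visible        */
            exchange(p, s);          /* 2.) exchange of data              */
            #pragma omp barrier      /* 3.) barrier ends the superstep    */
        }
    }
}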
Bulk-Synchronous Parallel (BSP) Model
■ The costs of a superstep depend on
□ The cost of the slowest local computation
□ The cost of communication between all processes
□ The cost of the barrier synchronization
■ Algorithm costs relate to the sum of all superstep costs (a common cost formula is sketched below)
■ Synchronization may only happen for some of the processes
□ Long-running serial tasks are not slowed down from the model perspective
■ Recent industrial uptake with Pregel and the ML language
■ The Apache Hama project implements BSP on top of Hadoop
52
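In the usual notation for BSP cost analysis (not introduced on the slides), with $w_i$ the local work of processor $i$ in a superstep, $h$ the maximum number of messages sent or received per processor, $g$ the per-message communication cost, and $l$ the barrier cost, the cost of one superstep and of the whole algorithm over $S$ supersteps can be written as:

$T_{\text{superstep}} = \max_i w_i + h \cdot g + l, \qquad T_{\text{total}} = \sum_{s=1}^{S} \left( \max_i w_i^{(s)} + h^{(s)} \cdot g + l \right)$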
Bulk-Synchronous Parallel (BSP) Model
■ A bulk-synchronous parallel computer (BSPC) is defined by:
□ Components, each performing processing and / or memory functions
□ A router that delivers messages between pairs of components
□ Facilities to synchronize components at regular intervals of L time units (the periodicity)
■ A computation consists of a number of supersteps
□ Every L units, a global check is made whether the superstep is completed
■ The router concept separates computation from communication aspects, and models memory / storage access explicitly
■ L is controlled by the application, even at run-time
53
LogP [Culler et al., 1993]
■ Criticism of the over-simplification in PRAM-based approaches, which encourages the exploitation of 'formal loopholes' (e.g. communication)
■ Trend towards multicomputer systems with large local memories
■ Characterization of a parallel machine by:
□ P: Number of processors
□ g (gap): Minimum time between two consecutive transmissions
◊ Its reciprocal corresponds to the per-processor communication bandwidth
□ L (latency): Upper bound on the messaging time
□ o (overhead): Exclusive processor time needed for a send / receive operation
■ L, o, and g are measured in multiples of processor cycles
54
LogP Architecture Model
55
LogP
■ The algorithm must produce correct results under all message interleavings, and prove the space and time demands of the processors
■ Simplifications
□ With infrequent communication, the bandwidth limit (g) is not relevant
□ With streaming communication, the latency (L) may be disregarded
■ Convenient approximation: increase the overhead (o) to be as large as the gap (g)
■ Encourages careful scheduling of computation, and overlapping of computation and communication
■ Can be mapped to shared-memory and shared-nothing machines
□ Reading a remote location requires 2L + 4o processor cycles (see the breakdown below)
56
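As a worked breakdown of that figure (the slide only states the result): a remote read is a round trip of two messages, and each message costs $o$ cycles on the sender, $L$ cycles in the network, and $o$ cycles on the receiver, so

$T_{\text{remote read}} = \underbrace{(o + L + o)}_{\text{request}} + \underbrace{(o + L + o)}_{\text{reply}} = 2L + 4o$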
LogP
■ Matching the model to real machines
□ Saturation effects: Latency increases as a function of the network load, with a sharp increase at the saturation point - captured by a capacity constraint
□ The internal network structure is abstracted, so 'good' vs. 'bad' communication patterns are not distinguished - can be modeled by multiple g's
□ LogP does not model specialized hardware communication primitives; all are mapped to send / receive operations
□ Separate network processors can be explicitly modeled
■ The model defines a 4-dimensional parameter space of machines
□ A vendor product line can be identified by a curve in this space
57
LogP – Optimal Broadcast Tree
58
LogP – Optimal Summation
59