Scalable Parallel Computing: Technology, Architecture, Programming

K. Hwang and Z. Xu, McGraw-Hill, New York, NY, 1998. ISBN 0-07-031798-4.

Chapter 1: Scalable Computer Platforms and Models (p. 3-50)

Evolution of Computer Architectures: five generations of machines

Scalable Computer Architectures: Functionality and Performance, Scaling in Cost, Compatibility

System Architectures: Shared Nothing, Shared Disk, Shared Memory

Macro-Architecture vs. Micro-Architecture

Dimensions of Scalability: Resource Scalability, Application Scalability, Technology Scalability

Parallel Computer Models (Semantic Attributes): Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model

Performance Attributes: machine size, clock rate, workload, sequential execution, parallel execution, speed, speedup, efficiency, utilization, startup time, asymptotic bandwidth
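For reference, the standard relations behind the speed, speedup, and efficiency attributes (notation assumed here: $W$ is the workload, $T_1$ the sequential execution time, $T_n$ the parallel execution time on $n$ processors):

$$\text{speed} = \frac{W}{T}, \qquad S_n = \frac{T_1}{T_n}, \qquad E_n = \frac{S_n}{n}$$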

Abstract Machine Models:

PRAM: Tcomp and Tload imbalance, simple, shared variable

Bulk Synchronous Parallel: Tcomp, Tload imbalance, Tcommunication, and Tsynchronization; includes interaction overhead; superstep execution: comp, interact, synch
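The BSP terms above are usually combined into the standard per-superstep cost formula (standard BSP notation, not spelled out in this outline): with $w$ the maximum local computation in the superstep, $h$ the largest number of words any processor sends or receives, $g$ the per-word communication cost, and $l$ the barrier synchronization cost,

$$T_{superstep} = w + g \cdot h + l$$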

Phase Parallel: Tcomp, Tload imbalance, Tcommunication, Tsynchronization, and Tparallel; includes all overhead; execution phases: Parallelism Phase, Computation Phase, and Interaction Phase
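Taken together, the phase parallel model charges for every component listed above; in the obvious notation (a summary of the listed terms, not a formula quoted from the book):

$$T \approx T_{comp} + T_{parallel} + T_{load\,imbalance} + T_{communication} + T_{synchronization}$$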

Physical Machine Models:

Parallel Vector Processor (PVP): UMA, crossbar, shared memory

Symmetric Multiprocessor (SMP): UMA, crossbar or bus, shared memory, hard to scale

Massively Parallel Processor (MPP): NORMA, message passing, custom interconnection, “classic” supercomputers

Distributed Shared Memory (DSM): NUMA or NORMA, shared memory (hardware or software based), custom interconnections, possible cache directories

Cluster of Workstations (COW): NORMA, message passing, SSI challenged, commodity processors and interconnection

Basic Concept of Clustering: Cluster Nodes, Single-System-Image (SSI), Internode Connection, Enhanced Availability, Better Performance

Cluster benefits and difficulties: usability, availability, scalability, utilization, and performance/cost ratio

Scalable Design Principles: Independence, Balanced Design, Design for Scalability, Latency Hiding

Chapter 2: Basics of Parallel Programming (p. 59-77)

Comparison of parallel and sequential programming

Programming Components and Considerations

Processes, Tasks, and Threads: Process State and State Table, Process Descriptor, Process Context, Execution Mode (kernel, user)

Parallelism Issues: Homogeneity in Processes, Language Constructs, Static versus Dynamic Parallelism, Process Grouping, Allocation Issues; DOP (degree of parallelism); Granularity (also called grain size)

Interaction/Communication Issues: Communication, Synchronization, Aggregation

Data and Resource Dependence: Flow dependence, Anti-dependence, Output dependence, I/O dependence, Unknown dependence

Bernstein Conditions (statements $S_i$ and $S_j$, with input sets $I_i$, $I_j$ and output sets $O_i$, $O_j$, may execute in parallel when):

$$I_i \cap O_j = \varnothing, \qquad O_i \cap I_j = \varnothing, \qquad O_i \cap O_j = \varnothing$$
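A small illustration of the dependence types and the Bernstein check; the statements and variable names are invented for this sketch, not taken from the book:

```c
/* Hypothetical statements illustrating the dependence types and the
 * Bernstein check; variable names are invented for this sketch. */
#include <stdio.h>

int main(void) {
    int a = 2, b = 3, x, y, u, v;

    x = a + b;   /* S1: I1 = {a,b}, O1 = {x} */
    y = x * 2;   /* S2: I2 = {x},   O2 = {y} -- flow dependence on S1 (S2 reads the x written by S1) */
    x = b - 1;   /* S3: I3 = {b},   O3 = {x} -- anti-dependence on S2 (S3 overwrites the x that S2 read)
                    and output dependence on S1 (S1 and S3 both write x) */
    u = a + 1;   /* S4: I4 = {a},   O4 = {u} */
    v = b * b;   /* S5: I5 = {b},   O5 = {v} */

    /* Bernstein check for S4 and S5: I4 and O5, O4 and I5, and O4 and O5 are
     * all disjoint, so all three conditions hold and S4, S5 may run in parallel. */
    printf("x=%d y=%d u=%d v=%d\n", x, y, u, v);
    return 0;
}
```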

Chapter 3: Performance Metrics and Benchmarks (p. 91-154)

Benchmarks have been defined to focus on specific machine characteristics. Micro benchmarks: specific functions or attributes. Macro benchmarks: functional programs representative of a class of applications.

Performance of Parallel Computers: Computation, Parallelism and Interaction Overhead

Parallelism Overhead: Process Management, Grouping Operations (creation/destruction of groups), Process Inquiry Operations

Interaction Overhead: Synchronization, Communication, Aggregation (broadcast, scatter, gather, total exchange)

Performance Metrics: Sequential Time, Parallel Time, Critical Path Time; Speed, Speedup, Efficiency, Utilization; Total Overhead

Scalability and Speedup Analysis: Amdahl’s Law (fixed problem size), Gustafson’s Law (fixed time), Sun and Ni’s Law (memory/resource bounding), Iso-performance Models
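In the usual notation, with $\alpha$ the sequential fraction of the workload and $n$ the number of processors, the first two laws take their standard forms:

$$S_{Amdahl}(n) = \frac{1}{\alpha + (1-\alpha)/n} \le \frac{1}{\alpha}, \qquad S_{Gustafson}(n) = \alpha + (1-\alpha)\,n = n - \alpha\,(n-1)$$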

Chapter 4: Microprocessors as Building Blocks (p. 155-210)

Instruction Pipeline Design Issues: pipeline cycle or processor cycle, instruction issue latency, cycles per instruction (CPI), instruction issue rate, simple operations, complex operations, resource conflicts
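As a reminder of how these parameters combine (the standard processor performance relation, not a formula specific to this outline): with $I_c$ the instruction count, $CPI$ the average cycles per instruction, and $\tau$ the processor cycle time,

$$T_{exec} = I_c \times CPI \times \tau$$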

Instruction Execution Ordering

From CISC to RISC and beyond: Scalar, Superscalar, Superpipelined, Superscalar-Superpipelined, VLIW, Multimedia Extensions

Future Microprocessors: Multiway Superscalar, Superspeculative Processor, Simultaneous Multithreaded Processor, Trace (multiscalar) Processor, Vector IRAM Processor, Single-chip Multiprocessors, Raw (configurable) Processors

Chapter 5: Distributed Memory and Latency Tolerance (p. 211-272)

Memory Hierarchy: Inclusion Property, Coherence, Contention

Locality of Reference Properties: Temporal, Spatial, Sequential

Memory Planning: Capacity, Average Access Time

Cache Coherency Protocols: sources of incoherence (writes by different processors, process migration, I/O operations); protocol families: Snoopy or Cache Directories

Snoopy Coherency Protocols: must be able to observe memory transfers; write-update vs. write-invalidate; MESI
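As a quick reference for the MESI protocol named above, a minimal sketch of the four cache-line states (standard MESI semantics; the enum names and comments are illustrative, not quoted from the book):

```c
/* The four MESI cache-line states used by a write-invalidate snoopy protocol. */
enum mesi_state {
    MESI_MODIFIED,   /* dirty; this cache holds the only valid copy of the line */
    MESI_EXCLUSIVE,  /* clean; present only in this cache                        */
    MESI_SHARED,     /* clean; other caches may also hold copies                 */
    MESI_INVALID     /* the line holds no valid data                             */
};
/* Example transition: a write hit on a Shared line first broadcasts an
 * invalidate on the bus, then the line moves to Modified. */
```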

Shared Memory Consistency: memory event ordering

Memory Consistency Models: Strict Consistency, Sequential Consistency, Processor Consistency, Weak Consistency, Release Consistency
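A classic litmus test separating sequential consistency from the weaker models; the program is a hypothetical sketch (thread and variable names invented here, plain non-atomic shared variables used deliberately). Under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible, while weaker hardware models (and compiler reordering) can produce it unless fences or atomics are added:

```c
/* Dekker-style litmus test; compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;   /* shared flags            */
int r1, r2;         /* per-thread observations */

void *thread1(void *arg) { x = 1; r1 = y; return NULL; }
void *thread2(void *arg) { y = 1; r2 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread1, NULL);
    pthread_create(&b, NULL, thread2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```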

Distributed Cache/Memory Architectures: UMA, NUMA, COMA, NORMA; SMP is a centralized memory architecture, the others are distributed memory architectures

Cache Coherence Considerations: cache coherent (cc), non cache coherent (ncc), software cache coherent (sc)

Cache Directories

Latency Tolerance Techniques: latency avoidance, reduction, and hiding; Distributed Coherent Caches; Data Prefetching (see the sketch after this list); Relaxed Memory Consistency

Multithreaded Latency Hiding
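To make the data prefetching technique listed above concrete, a tiny sketch using the GCC/Clang __builtin_prefetch intrinsic (a compiler-specific intrinsic assumed here; the prefetch distance is a tuning assumption, not a value from the book):

```c
/* Sum an array while prefetching ahead to hide memory latency. */
#define PREFETCH_DIST 16

double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1); /* read access, low temporal locality */
        s += a[i];
    }
    return s;
}
```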

Chapter 6: System Interconnections and Gigabit Networks (p. 273-342)

Basic Interconnection Networks: Network Components, Network Characteristics, Network Properties

Network Topologies: node degree, network diameter, bisection width
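A standard worked example of these three properties (textbook values, not tied to a specific figure here): a binary hypercube with $N = 2^k$ nodes has node degree $k$, diameter $k$, and bisection width $N/2$; a $k \times k$ 2-D mesh has maximum node degree 4, diameter $2(k-1)$, and bisection width $k$.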

Buses, Crossbar, and Multistage Interconnection Networks (MIN)

Gigabit Network Technology: Ethernet, ATM, Scalable Coherent Interface (SCI)

Chapter 7: Threading, Synchronization, and Communication (p. 343-402)

Software Multithreading (the thread concept): threads, thread states and thread management; Lightweight Process (LWP), LWP states and LWP management; Heavyweight Process; Kernel vs. User Level Processing
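A minimal POSIX threads sketch of the lifecycle implied above (create, run, join); the worker function and its output are illustrative only:

```c
/* Minimal thread lifecycle with POSIX threads; compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);  /* thread becomes runnable */
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);                        /* wait for termination     */
    return 0;
}
```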

Synchronization Mechanisms: synchronization problems faced by users; language constructs employed by the user to solve them (high-level constructs); synchronization primitives available in multiprocessor architectures (low-level constructs); algorithms used to implement the high-level constructs with the low-level constructs available.
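For instance, a high-level mutual-exclusion construct can be built from a low-level atomic primitive; a minimal sketch using C11 atomics (the type and macro names are invented for this illustration):

```c
/* A test-and-set spinlock: a lock built from the atomic test-and-set primitive. */
#include <stdatomic.h>

typedef struct { atomic_flag flag; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)   { while (atomic_flag_test_and_set(&l->flag)) { /* busy-wait */ } }
static void spin_unlock(spinlock_t *l) { atomic_flag_clear(&l->flag); }

/* Usage: spinlock_t lock = SPINLOCK_INIT;
 *        spin_lock(&lock);  ... critical section ...  spin_unlock(&lock); */
```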

The TCP/IP Communication Protocol Suite: OSI and Internet protocol stacks, Network Addressing, TCP, UDP, and IP

Fast and Efficient Communications: Effective Bandwidth, Network Interface Circuitry, Software communication libraries
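Effective bandwidth is usually quantified with the same startup-time/asymptotic-bandwidth model listed among the performance attributes in Chapter 1 (standard notation assumed): with $t_0$ the startup latency, $m$ the message length, and $r_\infty$ the asymptotic bandwidth,

$$t(m) = t_0 + \frac{m}{r_\infty}, \qquad r_{eff}(m) = \frac{m}{t(m)}, \qquad m_{1/2} = t_0\, r_\infty$$

where $m_{1/2}$ is the half-peak message length at which half of $r_\infty$ is achieved.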

Chapter 8: Symmetric and CC-NUMA Multiprocessors (p. 407-452)

SMP and CC-NUMA Technology, comparison criteria: Availability, Bottleneck, Latency, Memory Bandwidth, I/O Bandwidth, Scalability, Programming Advantage

Typical Applications – Commercial SMP Servers

Comparison of CC-NUMA Architectures, criteria: Architecture, Shared Memory Access, Enhanced Scalability, Concerns

Chapter 9: Support of Clusters and Availability (p. 453-504)

Challenges of Clustering: Classification Attributes, Dedicated Cluster, Enterprise Cluster

Cluster Design Issues

Availability Support for Clustering: Reliability, Availability, Serviceability; Types of Failures
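The usual quantitative link between the reliability and availability terms (standard definition, using MTTF for mean time to failure and MTTR for mean time to repair):

$$A = \frac{MTTF}{MTTF + MTTR}$$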

Availability Techniques: Isolated Redundancy; Hot Standby, Mutual Takeover, and Fault-Tolerant configurations; Failover and Recovery Schemes

Checkpointing and Failure Recovery: Methods, Overhead, What to Checkpoint, Consistent Snapshot

Support for Single System Image: Single System (Application, Above Kernel, Kernel/Hardware); Single Control; Use from any entry point; Location Transparent

Job Management in Clusters: Characteristics of Cluster Workload, Job Scheduling Issues