
Multiprocessing

Going Multi-core Helps Energy Efficiency

• Power of a typical integrated circuit ≈ C · V² · f (worked example below)
– C = capacitance, i.e., how well the circuit "stores" a charge
– V = voltage
– f = frequency, i.e., how fast the clock runs (e.g., 3 GHz)

William Holt, HOT Chips 2005

Adapted from UC Berkeley "The Beauty and Joy of Computing"
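As a rough illustration of why going multi-core helps energy efficiency (the voltage and frequency numbers below are illustrative assumptions, not figures from the talk):

```latex
% Dynamic power of a CMOS circuit
P_{\text{dyn}} \approx C \, V^2 f
% One core at full speed (illustrative numbers: V = 1.0, f = 3 GHz):
P_1 \propto 1.0^2 \times 3 = 3
% Two cores at half the frequency, which permits a lower voltage (V = 0.8, f = 1.5 GHz):
P_2 \propto 2 \times 0.8^2 \times 1.5 \approx 1.9
% Comparable aggregate throughput for roughly a third less dynamic power,
% because power falls with the square of voltage but only linearly with frequency.
```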

Processor Parallelism

• Processor Parallelism : the ability to run multiple instruction streams simultaneously

Flynn's Taxonomy

• Categorization of architectures based on
– Number of simultaneous instructions
– Number of simultaneous data items


SISD

• SISD : Single Instruction – Single Data
– One instruction sent to one processing unit to work on one piece of data
– May be pipelined or superscalar


SIMD Roots

• ILLIAC IV
– One instruction issued to 64 processing units

SIMD Roots

• Cray I
– Vector processor
– One instruction applied to all elements of a vector register

Modern SIMD

• x86 Processors
– SSE units : Streaming SIMD Extensions
– Operate on special 128-bit registers (sketch below)
• four 32-bit chunks
• two 64-bit chunks
• sixteen 8-bit chunks
• …
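A minimal sketch of what one SSE instruction does with those 128-bit registers, assuming an x86 compiler with SSE support (the function name add4 is illustrative, not from the slides):

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128 holds four 32-bit floats

// Add two arrays of four floats: a single SIMD add operates on all four
// 32-bit chunks of the 128-bit register at once.
void add4(const float *a, const float *b, float *out) {
    __m128 va   = _mm_loadu_ps(a);      // load 4 floats into a 128-bit register
    __m128 vb   = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb);   // one instruction, four additions
    _mm_storeu_ps(out, vsum);           // store the 4 results back to memory
}
```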

Modern SIMD

• Graphics Cards
– http://www.nvidia.com/object/fermi-architecture.html

• Becoming less and less "S"

Coprocessors

• Graphics Processing : floating-point specialized
– i7 ~ 100 gigaflops
– Kepler GPU ~ 1300 gigaflops

CUDA

• Compute Unified Device Architecture
– Programming model for general-purpose work on GPU hardware
– Streaming Multiprocessors, each with 16-48 CUDA cores

CUDA

• Designed for 1000's of threads (see the sketch below)
– Broken into "warps" of 32 threads
– Entire warp runs on an SM in lock step
– Branch divergence cuts speed
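A minimal CUDA sketch of this model (the kernel name, sizes, and launch configuration are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

// Each thread scales one array element. The hardware groups threads into
// warps of 32 that execute in lock step on a streaming multiprocessor (SM);
// a data-dependent if/else that splits a warp serializes both paths
// (branch divergence), cutting speed.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;               // one million elements
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // Launch thousands of threads: 256 per block, enough blocks to cover n.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```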


MISD

• MISD : Multiple Instruction – Single Data
– Different instructions applied to the same data
– Rare
– Space shuttle : five processors handle fly-by-wire input and vote


MIMD

• MIMD : Multiple Instruction – Multiple Data
– Different instructions working on different data in different processing units
– Most common parallel architecture (sketch below)
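A toy illustration of MIMD on a multicore CPU using C++ threads (the particular tasks are made up for illustration): two independent instruction streams run at the same time on different data.

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> a = {5, 3, 1, 4, 2};
    std::vector<int> b = {10, 20, 30};
    long sum = 0;

    // Two different instruction streams, two different data sets, running
    // concurrently on separate cores when available.
    std::thread t1([&] { std::sort(a.begin(), a.end()); });
    std::thread t2([&] { sum = std::accumulate(b.begin(), b.end(), 0L); });
    t1.join();
    t2.join();

    std::cout << "a sorted, sum of b = " << sum << "\n";
    return 0;
}
```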

Coprocessors

• Coprocessor : Assists main CPU with some part of work


Other Coprocessors

• CPUs used to have floating-point coprocessors
– Intel 80386 & 80387

• Audio cards
• PhysX
• Crypto – SSL encryption for servers

Multiprocessing

• Multiprocessing : many processors, shared memory
– May have local cache/special memory

Homogeneous Multicore

• i7 : homogeneous multicore
– 4 identical cores on one chip
– Separate L2 caches, shared L3

Heterogeneous Multicore

• Different cores for different jobs
– Specialized media processing in mobile devices

• Examples
– Tegra
– PS3 Cell

Multiprocessing & Memory

• Memory conflict demo…
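The slides refer to a live demo; a minimal sketch of the kind of memory conflict such a demo shows (the shared-counter example is an assumption, not the actual demo): two threads update shared memory without coordination and updates get lost.

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared memory visible to both threads

// Each increment is a read-modify-write; with two threads the sequences
// interleave, updates are lost, and the final value is usually far below
// the expected 2,000,000.
void bump() {
    for (int i = 0; i < 1000000; ++i)
        counter++;  // not atomic
}

int main() {
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    std::cout << "counter = " << counter << " (expected 2000000)\n";
    return 0;
}
```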

UMA

• Uniform Memory Access
– Every processor sees every memory using the same addresses
– Same access time for any CPU to any memory word

NUMA

• Non-Uniform Memory Access
– Single memory address space visible to all CPUs
– Some memory local
• Fast
– Some memory remote
• Accessed in the same way, but slower

Connections

• Bus : one communication channel
– Scales poorly

Connections

• Crossbar switched
– Segmented memory
– Any processor can directly link to any memory
– N² switches

Connections

• Other topologies
– Balance complexity, flexibility, and latency

BG/P Compute Cards

• 4 processors per card
• Fully coherent caches
• Connected in a double torus to neighbors

BG/P

• Full system : 72 x 32 x 32 torus of nodes

Titan

• The king : descendant of Redstorm
– http://www.olcf.ornl.gov/titan/


Distributed Systems

• No common memory space
• Pass messages between processors (see the sketch below)
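A minimal message-passing sketch using MPI (MPI is not named on the slides; it is just one common message-passing library, and this assumes an MPI installation): with no shared memory, rank 0 must explicitly send a value that rank 1 explicitly receives.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which process am I?

    int value = 42;
    if (rank == 0) {
        // Processor 0 sends the value to processor 1; no shared address space.
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run under a launcher with at least two processes, e.g. `mpirun -np 2 ./a.out`.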

COW

• Cluster of Workstations

Grid Computing

• Grid Computing
– Multi-computing at internet scale
– Resources owned by multiple parties

• http://folding.stanford.edu/
• Seti@Home

Parallel Algorithms

• Some problems are highly parallel, others are not

• Applications can almost never be completely parallelized; some serial code remains

• Speedup always limited by serial part of program

Speedup Issues : Amdahl's Law

• (Figure: execution time vs. number of cores, 1 through 5; the parallel portion of the runtime shrinks as cores are added, while the serial portion stays fixed)

• Amdahl's law: speedup on P processors = 1 / (s + (1 − s) / P)  (small calculation below)
– s is the serial fraction of the program
– P is the number of processors
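A small calculation of the diminishing returns the next slide complains about (the 10% serial fraction is an illustrative assumption):

```cpp
#include <cstdio>

// Amdahl's law: speedup(P) = 1 / (s + (1 - s) / P),
// where s is the serial fraction and P the number of processors.
static double amdahl(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main() {
    const double s = 0.10;  // assume 10% of the program is serial
    const int cores[] = {1, 2, 4, 8, 64, 1024};
    for (int p : cores)
        std::printf("P = %4d  speedup = %5.2f\n", p, amdahl(s, p));
    // Even with 1024 cores the speedup approaches, but never exceeds, 1/s = 10x.
    return 0;
}
```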

Ouch

• More processors only help with high % of parallelized code

Amdahl's Law is Optimistic

• Each new processor means more
– Load balancing
– Scheduling
– Communication
– Etc…