The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures



ACCU Conference 2008 and London.

Transcript of The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Page 1: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

The University of Hertfordshire

The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Building Futures in Computer Science: empowering people through technology

Jason McGuiness, Computer Science, University of Hertfordshire, UK

Colin Egan, Computer Science, University of Hertfordshire, UK

Page 2: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Parallel processing: pipeline processors, MII architectures, multiprocessors
• Processing In Memory (PIM): Cellular architectures (Cyclops/DIMES and picoChip)
• Code-generation issues arising from massive parallelism
• Possible solutions to this issue: use the compiler or some libraries
• An example implementation of a library, and the issues
• Questions? Ask as we go along, but we'll also leave time for questions at the end of this presentation

Presentation Structure

Page 3: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Parallel Processing

• How can parallel processing be achieved? By exploiting:
– Instruction Level Parallelism (ILP)
– Thread Level Parallelism (TLP)
– Multi-processing
– Data Level Parallelism (DLP)
– Simultaneous Multi-Processing (SMP)
– Concurrent processing
– Etcetera

Page 4: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Exploits ILP by overlapping instructions in different stages of execution:
– ILP is the number of operations in a computer program that can be performed at the same time (simultaneously)
• Improves overall program execution time by increasing throughput:
– It does not improve individual instruction execution time

Pipelining

Page 5: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• A simple 5-stage pipeline

Pipelining

[Figure: the five pipeline stages in order: 1. instruction fetch (instruction cache access; increment program counter; branch prediction); 2. instruction decode and register operand set-up; 3. execution in the ALU; 4. data cache access; 5. write back.]

Page 6: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Pipelining introduces hazards which can severely impact processor performance:
– Data (RAW, WAW and WAR)
– Control (conditional branch instructions)
– Structural (hardware contention)
• To overcome such hazards, complex hardware (dynamic scheduling), complex software (static scheduling), or a combination of both is required

Pipelining hazards

Page 7: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Data hazards
– Read After Write (RAW dependency):
• The result of an operation is required as an operand of a subsequent operation and that result is not yet available in the register file

    add  $1,$2,$3   // $1 = $2 + $3
    addi $4,$1,8    // $4 = $1 + 8: data dependency ($1 is unavailable)
    sub  $5,$2,$1   // $5 = $2 - $1: data dependency ($1 is unavailable)
    ld   $6,8($1)   // $6 = Mem[$1 + 8] ... $1 may be available
    add  $7,$6,$8   // load-use dependency ... data dependency ($6 is unavailable)

Pipelining hazards: Data, Control, Structural

Page 8: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Data hazards
– Write After Write (WAW data dependency):

    sub  $1,$2,$3
    addi $1,$4,8    // the addi cannot complete ahead of the sub

– Write After Read (WAR data dependency):

    sub  $2,$1,$3
    addi $1,$4,8    // the addi cannot complete ahead of the sub

Pipelining hazards: Data, Control, Structural

Page 9: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Data hazards
– Read After Write (RAW data dependency) is a problem in both single instruction issue architectures and Multiple Instruction Issue (MII) architectures
– Write After Write (WAW data dependency) and Write After Read (WAR data dependency) are a problem in MII architectures, not in single instruction issue architectures

Pipelining hazards: Data, Control, Structural

Page 10: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Data hazards
– Read After Write (RAW) data dependencies are True Data Dependencies
• RAW hazards are avoided:
– in single instruction issue architectures, by forwarding (bypassing)
– in MII:
» Superscalar (dynamic): by result broadcast
» VLIW (static): by aggressive instruction (percolation) scheduling
– Write After Write (WAW) and Write After Read (WAR) data dependencies are Artificial Data Dependencies
• WAW/WAR are eliminated by Register Renaming

Pipelining hazards: Data, Control, Structural

Page 11: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Control hazards
– Are caused by (conditional) branch instructions
– These are the only instructions which can alter the smooth flow of instruction execution
– The next instruction to be executed is not known until the condition has been evaluated; in the case that the branch is taken, the address of the next instruction to be executed must also be computed:
• branch resolution:

    beq $1,$2,label  // if $1 == $2 then pc = label; else pc = pc + 4;
    bne $1,$2,label  // if $1 != $2 then pc = label; else pc = pc + 4;

Pipelining hazards: Data, Control, Structural

Page 12: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Control hazards
– Are a major problem:
• Between 1 in 5 and 1 in 8 instructions in the dynamic instruction stream of a general-purpose program are conditional branch instructions
• Approximately 66% of conditional branches are taken (the target path)
• They are a particular problem in superscalar architectures, where 100s of instructions may be in flight
– The impact can be reduced by:
• Branch prediction (dynamic and static)
• Branch removal (scheduling: static)
• Delay region scheduling (static)
• Multithreading (dynamic and static)

Pipelining hazards: Data, Control, Structural

Page 13: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Structural hazards
– More than one instruction attempting to access the same piece of hardware at the same time
– These can be overcome by increasing the amount of hardware:
• For example, use the Harvard Architecture (separate the I-cache from the D-cache)

Pipelining hazards: Data, Control, Structural

Page 14: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• MII
– A processor that is capable of fetching and issuing more than one instruction during each processor cycle
– A program is executed in parallel, but the processor maintains the outward appearance of sequential execution
– The program binary must therefore be regarded as a specification of what was done, not how it was done
– Minimises program execution time by:
• reducing instruction latencies
• exploiting additional ILP

Multiple Instruction Issue

Page 15: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Superscalar Processor:
– An MII processor where the number of instructions issued in each clock cycle is determined dynamically by the hardware at run-time
• Instruction issue may be in-order or out-of-order (Tomasulo's algorithm or equivalent)
• VLIW Processor (Very Long Instruction Word):
– An MII processor where the number of operations (instructions) issued in each clock cycle is fixed, and where the operations are selected statically by the compiler at compile-time

Multiple Instruction Issue

Page 16: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Fetch and Decode multiple instructions in parallel

• Dynamically predict conditional branches

• Initiate instructions for parallel execution based, whenever possible, on the availability of operands

• Frequently issue/execute instructions out-of-order

• Re-sequence instructions into the original program order after completion

Superscalars

Page 17: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Superscalars

1. Instruction fetch strategies: multiple instruction fetch; dynamic branch prediction
2. Detecting True Data Dependencies between registers
3. Initiating or issuing multiple instructions in parallel
4. Resources for parallel execution of instructions: multiple pipelined functional units; efficient memory hierarchies
5. Communication of data values through memory: loads and stores (RISC); allowing for the dynamic, unpredictable behaviour of memory hierarchies
6. Committing the process state in the correct order: maintaining the appearance of sequential instruction execution

Page 18: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Superscalars

• Since there is very little ILP within individual basic blocks (i.e. between branches), branch prediction is essential for high performance

• It is important to maintain an instruction window or window of execution from which ILP can be extracted

Page 19: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Superscalars

• The problems include:
– Conditional branches
– Data path complexity
– The memory wall (the widening gap between CPU and memory access times)
– Limited available parallelism within a finite-sized instruction window

• Is the superscalar approach at a point of diminishing returns?

Page 20: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Attempts to reduce hardware complexity by extracting ILP at compile time instead of run time

• The processor issues a single very long instruction during each machine cycle

• Each very long instruction consists of a large number of independent operations packed into a single instruction word by the compiler

• Each operation requires a small, statically predictable number of cycles to execute

• Each operation is itself pipelined
• Instructions are executed in-order

VLIWs

Page 21: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Problems:
– Code expansion:
• The need to pad out instruction groups with NOPs can double the code size
– Compatibility:
• Code must be recompiled (rescheduled) for every new instance of a particular VLIW architecture

VLIWs

Page 22: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• To sustain multiple instruction fetch, MII architectures require a complex memory hierarchy:
– Caches:
• L1, L2, stream buffers, non-blocking caches
– Virtual memory:
• TLBs
– Caches suffer from:
• Compulsory misses
• Capacity misses
• Collisions

Problems of MII

Page 23: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Compulsory misses:
– The first time the processor requests an address it will not be in cache memory and must be fetched from a slower level of the memory hierarchy:
• Hopefully main (physical) memory
• If not, from virtual memory
• If not, from secondary storage
– This can result in a long delay (latency) due to large access times

Memory Problems

Page 24: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Capacity misses:
– There are more cache block requests than the cache has capacity to hold
• Collisions:
– The processor requests different instructions/data whose addresses map to the same cache block
• For both:
– Blocks therefore have to be replaced
– But a block that has been replaced might be referenced again, resulting in yet more replacements

Memory Problems

Page 25: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• A thread can be considered to be a 'lightweight process':
– A thread consists of a short sequence of code, with its own:
• registers, data, state and so on
• but it shares its process's address space
• TLP is exploited by simultaneously executing different threads on different processors (a minimal sketch follows this slide):
– TLP is therefore exploited by multiprocessors

Thread Level Parallelism
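To make TLP concrete, here is a minimal sketch using POSIX threads (one of the library approaches discussed later in this talk); the worker function and thread count are illustrative, not from the slides:

    // Two threads share the process's address space, but each has its
    // own registers and stack: a 'lightweight process'.
    #include <pthread.h>
    #include <cstdio>

    void *worker(void *arg) {
        std::printf("thread %d running\n", *static_cast<int *>(arg));
        return 0;
    }

    int main() {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;
        pthread_create(&t1, 0, worker, &id1);
        pthread_create(&t2, 0, worker, &id2);
        pthread_join(t1, 0);  // wait for both workers to finish
        pthread_join(t2, 0);
        return 0;
    }

On a multiprocessor the two workers may genuinely execute simultaneously on different processors, which is precisely the TLP described above.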

Page 26: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Multiprocessors

• Should be:
– Easily scalable
– Fault tolerant
– Able to achieve higher performance than a uni-processor
• But ...
– How many processors can we connect?
– How do parallel processors share data?
– How are parallel processors co-ordinated?

Page 27: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Multiprocessors

• Shared Memory Processors (SMP)
– All processors share a single global memory address space

– Communication is through shared variables in memory

– Synchronisation is via locks (hardware) or semaphores (software)

Page 28: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Multiprocessors

• Uniform Memory Access (UMA)
– All memory accesses take the same time
– Does not scale well
• Non-Uniform Memory Access (NUMA)
– Each processor has a private (local) memory
– Global memory access time can vary from processor to processor
– Presents more programming challenges
– Is easier to scale

Page 29: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Multiprocessors

• NUMA
– Communication and synchronisation are achieved through message passing:
• Processors could then, for example, communicate over an interconnection network
• Processors use send and receive primitives

Page 30: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Multiprocessors

• The difficulty is in writing effective parallel programs:
– Parallel programs are inherently harder to develop
– Programmers need to understand the underlying hardware
– Programs tend not to be portable
– Amdahl's law: a very small part of a program that is inherently sequential can severely limit the attainable speedup (see the worked formula below)
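For reference, Amdahl's law expressed in LaTeX, with a worked example (the 95% figure is illustrative, not from the slides):

    % p = parallelisable fraction of the program, N = number of processors
    \[
      \mathrm{speedup}(N) = \frac{1}{(1 - p) + p/N},
      \qquad
      \lim_{N \to \infty} \mathrm{speedup}(N) = \frac{1}{1 - p}
    \]
    % Example: if 95% of a program parallelises perfectly (p = 0.95),
    % the speedup is bounded by 1/0.05 = 20, no matter how many
    % processors are available.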

• “It remains to be seen how many important applications will run on multiprocessors via parallel processing.”

• “The difficulty has been that too few important application programs have been written to complete tasks sooner on multiprocessors.”

Page 31: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Multiprocessors

• Multiprocessors suffer the same memory problems as uni-processors and, in addition:
– The problem of maintaining memory coherence between the processors

Page 32: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• A read operation must return the value of the latest write operation

• But in multiprocessors each processor will (probably) have its own private cache memory:
– There is no guarantee of data consistency between private (local) cache memory and shared (global) memory (a small example follows this slide)

Cache Coherence
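A minimal sketch (not from the slides) of why unsynchronised shared data is unsafe on such machines: two POSIX threads increment a shared counter, the read-modify-write is not atomic, and updates can be lost, so a read need not return the value of the latest write:

    #include <pthread.h>
    #include <cstdio>

    int counter = 0;  // shared between the threads, unprotected

    void *bump(void *) {
        for (int i = 0; i < 100000; ++i)
            ++counter;  // read-modify-write: not atomic
        return 0;
    }

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump, 0);
        pthread_create(&t2, 0, bump, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        std::printf("%d\n", counter);  // rarely the expected 200000
        return 0;
    }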

Page 33: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Today's processors are 10 times faster than processors of only 5 years ago
• But today's processors do not execute programs in one tenth of the time taken by processors from 5 years ago
• CPU speeds double approximately every eighteen months (Moore's Law), while main memory speeds double only about every ten years; doubling every 18 months compounds to roughly tenfold over five years (2^(60/18) ≈ 10), hence the figure above
– This causes a bottleneck in the memory subsystem, which impacts overall system performance
– The "memory wall" or the "von Neumann bottleneck"

The memory wall

Page 34: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• In today's machines, RAM holds roughly 1% of the data stored on a hard disk drive
• And hard disk drives supply data nearly a thousand times slower than a processor can use it
• This means that, for many data requests, a processor is idle while it waits for data to be found and transferred

The memory wall

Page 35: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• IBM's vice president Mark Dean has said:
– "What's needed today is not so much the ability to process each piece of data a great deal, it's the ability to swiftly sort through a huge amount of data."

The memory wall

Page 36: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Mark Dean again:– "The key idea is that instead of focusing on

processor micro-architecture and structure, as in the past, we optimize the memory system's latency and throughput — how fast we can access, search and move data …"

• Leads to the concept of Processing In Memory (PIM)

The memory wall

Page 37: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• The idea of PIM is to overcome the bottleneck between the processor and main memory by combining a processor and memory on a single chip

• The benefits of a PIM architecture are:
– Reduced memory latency
– Increased memory bandwidth
– A simplified memory hierarchy
– Multi-processor scaling capabilities:
• Cellular architectures
– Avoidance of the von Neumann bottleneck

Processing in memory

Page 38: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• This means that:
– Much of the expensive memory hierarchy can be dispensed with
– CPU cores can be replaced with simpler designs
– Less power is used by PIM
– Less silicon space is used by PIM

Processing in memory

Page 39: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• But ...
– Processor speed is reduced
– The amount of available memory is reduced
• However, PIM is easily scaled:
– Multiple PIM chips can be connected together, forming a network of PIM cells
– Such scaled architectures are called Cellular architectures

Processing in memory

Page 40: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Cellular architectures consist of a high number of cells (PIM units):
– With tens of thousands up to one million processors
– Each cell (PIM unit) is small enough to achieve extremely large-scale parallel operations
– To minimise communication time between cells, each cell is connected only to its neighbours

Cellular architectures

Page 41: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Cellular architectures are fault tolerant:
– With so many cells, it is inevitable that some processors will fail
– Cellular architectures simply re-route instructions and data around failed cells
• Cellular architectures rank highly among today's supercomputers:
– IBM BlueGene takes the top slots in the Top 500 list

Cellular architectures

Page 42: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Cellular architectures are threaded:
– Each thread unit:
• Is independent of all other thread units
• Serves as a single in-order issue processor
• Shares computationally expensive hardware, such as floating-point units
– There can be a large number of thread units:
• 1,000s if not 100,000s
• They are therefore massively parallel architectures

Cellular architectures

Page 43: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Cellular architectures are NUMA
– They have irregular memory access:
• Some memory is very close to the thread units and is extremely fast
• Some is off-chip and slow
• Cellular architectures, therefore, use caches and have a memory hierarchy

Cellular architectures

Page 44: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• In Cellular architectures, multiple thread units perform memory accesses independently
• This means that the memory subsystem of a Cellular architecture does in fact require some form of memory access model that permits memory accesses to be served effectively

Cellular architectures

Page 45: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Uses of Cellular architectures:
– Games machines (simple Cellular architectures)
– Bioinformatics (protein folding)
– Imaging:
• Satellite
• Medical
• Etcetera
– Research
– Etcetera

Cellular architectures

Page 46: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Examples:
– BlueGene Project:
• Cyclops (IBM): the next generation after BlueGene/P, called BlueGene/C
– DIMES: a prototype implementation
– Gilgamesh (NASA)
– Shamrock (Notre Dame)
– picoChip (Bath, UK)

Cellular architectures

Page 47: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Developed by IBM at the Thomas J. Watson Research Center
• Also called BlueGene/C, by comparison with the earlier BlueGene/L and BlueGene/P versions

IBM Cyclops or BlueGene/C

Page 48: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• The idea of Cyclops is to provide around one million processors:
– Where each processor can perform a billion operations per second
– Which means that Cyclops will be capable of one petaflop: a thousand trillion (10^15) calculations per second

Cyclops

Page 49: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

[Figure: block diagram of the Cyclops processor, chip and board. A crossbar network connects rows of thread units (TU), each row sharing scratch-pad memories (SP), a floating-point unit (FPU) and shared registers (SR), to multiple on-chip memory banks. Link bandwidths shown: 6 × 4 GB/sec and 4 GB/sec for the crossbar and memory links, 50 MB/sec to an IDE HDD, and 1 Gbit/sec to other chips via a 3D mesh; off-chip memory is also attached.]

Page 50: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• DIMES:
– Is the first hardware implementation of a Cellular architecture
– Is a simplified 'cut-down' version of Cyclops
– Is a hardware validation tool for Cellular architectures
– Emulates Cellular architectures, in particular Cyclops, cycle-by-cycle
– Is implemented on at least one FPGA
– Has been evaluated by Jason

DIMES

Page 51: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• The DIMES implementation that Jason evaluated:
– Supports a P-thread programming model
– Is a dual processor, where each processor has four thread units
– Has 4K of scratch-pad (local) memory per thread unit
– Has two banks of 64K global shared memory
– Has different memory models:
• Scratch-pad memory obeys the program consistency model for all eight thread units
• Global memory obeys the sequential consistency model for all eight thread units
– Is called DIMES/P2

DIMES

Page 52: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

DIMES/P2

Page 53: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Jason's concerns were:
– How to manage a potentially large number of threads
– How to exploit parallelism from the input source code in these threads
– How to manage memory consistency

DIMES

Page 54: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

• Jason tested his concerns by using an "embarrassingly parallel program which generated Mandelbrot sets"
• Jason's approach was to distribute the work-load between threads, and he also implemented a work-stealing algorithm to balance loads between threads (see the sketch after this slide):
– When a thread completed its work-load, rather than remain idle, that thread would 'steal' work from another 'busy' thread
– This meant that he maximised parallelism and improved thread performance, and hence overall program execution time

DIMES
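The following is a minimal sketch of the work-stealing idea described above, not Jason's actual DIMES code; the two-worker layout, queue contents and all names are illustrative:

    // Each worker owns a queue of work items; when its own queue
    // empties, it steals from the other worker's queue instead of idling.
    #include <pthread.h>
    #include <deque>
    #include <cstdio>

    struct worker_t {
        std::deque<int> queue;  // this worker's own work items
        pthread_mutex_t lock;   // protects the queue
    };

    worker_t workers[2];

    bool take(worker_t &w, int &item, bool from_back) {
        // Pop one item under the queue's lock; returns false if empty.
        bool ok = false;
        pthread_mutex_lock(&w.lock);
        if (!w.queue.empty()) {
            item = from_back ? w.queue.back() : w.queue.front();
            if (from_back) w.queue.pop_back(); else w.queue.pop_front();
            ok = true;
        }
        pthread_mutex_unlock(&w.lock);
        return ok;
    }

    void *run(void *arg) {
        const long me = reinterpret_cast<long>(arg);
        int item;
        for (;;) {
            if (take(workers[me], item, true))            // own work first
                std::printf("worker %ld did %d\n", me, item);
            else if (take(workers[1 - me], item, false))  // otherwise steal
                std::printf("worker %ld stole %d\n", me, item);
            else
                break;  // nothing left anywhere
        }
        return 0;
    }

    int main() {
        for (int w = 0; w < 2; ++w) {
            pthread_mutex_init(&workers[w].lock, 0);
            for (int i = 0; i < 8; ++i)
                workers[w].queue.push_back(w * 8 + i);
        }
        pthread_t t[2];
        for (long w = 0; w < 2; ++w)
            pthread_create(&t[w], 0, run, reinterpret_cast<void *>(w));
        for (int w = 0; w < 2; ++w)
            pthread_join(t[w], 0);
        return 0;
    }

Owners pop from the back of their own queue while thieves steal from the front, a common choice that reduces contention between owner and thief.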

Page 55: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

DIMES

[Figure: four snapshots of the Mandelbrot run: shortly after program start; shortly before work stealing; just after work stealing; more work stealing.]

Page 56: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

picoChip

• picoChip is based in Bath:
– "… is dedicated to providing fast, flexible wireless solutions for next generation telecommunications systems."
• picoArray™:
– Is a tiled architecture
– 308 heterogeneous processors connected together
– The interconnects consist of bus switches joined by the picoBus™
– Each processor is connected to the picoBus™ above and below it

Page 57: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

picoChip

• picoArray™ [Figure: diagram of the picoArray™ tiled architecture]

Page 58: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

picoChip and Parallelism

• The parallelism provided by picoChip is synchronous:
– This avoids many of the issues raised by the other architectures, which expose asynchronous parallelism
– But it comes at the cost of the flexibility that asynchronous parallelism provides

Page 59: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Issues arising ...

• Such massively parallel hardware creates a serious problem for the programmer:
– Because it is computer science folklore that writing parallel programs is very hard
• Writing fast and correct massively parallel programs is a major concern

Page 60: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Abstraction of the Parallelism

• This may be done in various ways:
– For example, within the compiler:
• Using trace scheduling, list-based scheduling or other data-flow based means, amongst others
– Using language features:
• HPF, UPC, or the additions to C++ in the IBM Visual Age compiler and Microsoft's additions
– Most commonly, using libraries (a sketch follows this list):
• For example: Posix threads, Win32 threads, OpenMP, boost.thread, home-grown wrappers
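As a flavour of the library/directive approach, a minimal OpenMP sketch (OpenMP is named in the list above; the loop body is illustrative, and the program must be built with the compiler's OpenMP switch):

    #include <vector>

    int main() {
        std::vector<double> v(1000000, 1.0);
        // The directive asks the runtime to spread the independent
        // iterations of this loop across a team of threads.
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(v.size()); ++i)
            v[i] *= 2.0;
        return 0;
    }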

Page 61: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Parallelism using Libraries

• Using libraries has a major advantage over implementing parallelism within the language:
– It requires neither the design of a new language nor the learning of one
– Novel languages are traditionally seen as a burden, and often hinder the adoption of new systems due to the volume of existing source-code
– But libraries are especially prone to mis-use and are traditionally hard to use

Page 62: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Issues of Libraries: part I

• The models exposed are very diverse:
– Message passing, e.g. MPI
– Loop-level parallelism (e.g. "forall …" constructs, as in OpenMP), which exposes very limited parallelism in general-purpose code
– Very low-level primitives, e.g. those exposed in Posix Threads are extremely basic: mutexes, condition variables, and basic thread operations (see the sketch below)
• The libraries require experience to use well, or even correctly
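A minimal sketch of just how basic those Posix Threads primitives are: signalling a single 'ready' flag from one thread to another already needs a mutex, a condition variable and a carefully placed loop (the flag and function names are illustrative):

    #include <pthread.h>

    pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    bool ready = false;

    void *producer(void *) {
        pthread_mutex_lock(&mtx);
        ready = true;                // publish the state change...
        pthread_cond_signal(&cond);  // ...and wake the waiter
        pthread_mutex_unlock(&mtx);
        return 0;
    }

    int main() {
        pthread_t t;
        pthread_create(&t, 0, producer, 0);
        pthread_mutex_lock(&mtx);
        while (!ready)                       // the loop guards against
            pthread_cond_wait(&cond, &mtx);  // spurious wake-ups
        pthread_mutex_unlock(&mtx);
        pthread_join(t, 0);
        return 0;
    }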

Page 63: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Part II: Non-composability of atomic operations!

• The fact that atomic operations do not compose is a major concern when using libraries!
• The composition of thread-safe data structures does not guarantee thread-safety (illustrated below):
– At each combination of the data structures, more locks are required to ensure correct operation!
– This implies more and more layers of locks, which are slow...
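A minimal sketch (not from the slides) of the problem: every member of this stack is individually thread-safe, yet composing two of those calls is racy:

    #include <pthread.h>
    #include <stack>

    class safe_stack {
        std::stack<int> data;
        mutable pthread_mutex_t mtx;
    public:
        safe_stack() { pthread_mutex_init(&mtx, 0); }
        bool empty() const {
            pthread_mutex_lock(&mtx);
            bool e = data.empty();
            pthread_mutex_unlock(&mtx);
            return e;  // may already be stale when the caller sees it
        }
        void push(int v) {
            pthread_mutex_lock(&mtx);
            data.push(v);
            pthread_mutex_unlock(&mtx);
        }
        void pop() {
            pthread_mutex_lock(&mtx);
            data.pop();
            pthread_mutex_unlock(&mtx);
        }
    };

    void racy(safe_stack &s) {
        if (!s.empty())  // another thread may pop between this test...
            s.pop();     // ...and this call: the composition is unsafe
    }

The only cure at the library level is another, outer lock around the empty()/pop() pair: exactly the 'more and more layers of locks' described above.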

Page 64: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Parallelism in the Compiler

• Given the concerns with regard to libraries, what about parallelising compilers?
• The fact is that auto-parallelising compilers exist; e.g. the list-based scheduling implemented in the EARTH-C compiler (circa 1999) has been proven to be optimal for that architecture
• Data-flow compilers have existed for years:
– Why aren't they used?

Page 65: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Industrial Parallelising Compilers

• Microsoft is introducing OpenMP-based constructs into their compiler, e.g. “forall”

• IBM Visual Age has similar functionality

• Java has a thread library

• C++0x: much work has been done regarding standardisation with respect to multiple threads

Page 66: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

C++ as an example

• The uses of C++ make it an interesting target for parallelisation:
– Although imperative, so arguably flawed for implementing parallelism
– It has great market penetration, therefore there is much demand for using parallelism
– It is commonly used in high-performance, multi-threaded systems
– Its general-purpose nature and quality libraries are increasing its appeal to super-computers

Page 67: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Parallelism support in C++

• Libraries exist beyond the usual C libraries:
– boost.thread: exists now, requires standards-compliant compilers
– C++0x: details of the threading support are becoming apparent, and appear to include:
• Atomic operations (memory consistency), exceptions; a machine model underpins the approach
• Threading models: thread-as-a-class, lock objects
• Thread pools: coming later, probably
• More details on the web, or at the ACCU; still in flux

Page 68: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Experience using C++

• Recall DIMES:
– A prototype of massively parallel hardware
– A Posix-threads-style library implementing threads
– A C++ thread-as-a-class wrapper was implemented
• Summary of experience:
– Hardly object-orientated: no separation in the design of the Mandelbrot application from the work-stealing algorithm and thread pool
– The software insufficiently separated the details of the hardware features from the design

Page 69: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Further experiences using C++

• From this work and other experiences, I developed a more interesting thread library:
– Traits abstract the underlying OS thread API from the wrapper library
– It therefore has hardware abstractions too
– It provides higher-level threading models:
• Primarily based on futures and thread pools
– The use of thread pools and futures creates a singly-rooted tree of threads:
• Trivially deadlock-free: a holy grail!

Page 70: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

C++0x threads now? No!

• Included in libjmmcg:
– Relies upon non-standard behaviour and broken optimisers! For example:
• Problems with global code-motion moving apparently const objects past locks
• The exception stack is unique to a thread, not global to the program, and is currently unspecified
• The implementation of std::list
– The DSEL has syntax limitations due to the design of C++
– Doesn't use boost ...
– Has example code and test cases
– Isn't complete! (e.g. the Posix & sequential specialisations are incomplete; there are some inefficiencies)
– But get it from libjmmcg.sourceforge.net, under the LGPL

Page 71: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Trivial example usage

    struct res_t { int i; };

    struct work_type {
        typedef res_t result_type;
        void process(result_type &) {}
    };

    pool_type pool(2);
    async_work work(creator_t::it(work_type(1)));
    execution_context context(pool << joinable() << time_critical() << work);
    pool.erase(context);
    context->i;

• The devil is in the omitted details: the typedefs for:
– pool_type, async_work, execution_context, joinable, time_critical
• The library requires that the work to be mutated has the items shown above (result_type and process) defined

Page 72: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Explanation of the example

• The concept is:
– that asynchronous work (async_work) that should be mutated (process) to the specified output type (result_type) is transferred into a thread pool (pool_type) of some kind
• This transfer (<<) may, optionally (joinable), return a future (execution_context):
– Which can be used to communicate (->) the result of the mutation, executed at kernel priority (time_critical), back to the caller

• The future also allows exceptions to be propagated

Page 73: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

More details regarding the example

• The thread pool (pool_type) has many traits:
– Master-slave or work-stealing
– The thread API (Win32, Posix or sequential)
– The thread API costs, in very rough terms
– It implies a work schedule that is a FIFO baker's-ticket schedule; an implementation of GSS(k) is in progress
• The library effectively implements a software simulation of data-flow
• Wrapping a function call, plus its parameters, in a class converts Kevlin's threading model to this one (a sketch follows)
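A hedged sketch of that last point, reusing res_t and the work concept (result_type plus process) from the earlier example; the wrapped function f is hypothetical:

    int f(int x) { return x * 2; }  // the plain function to be run

    // Wrapping the call and its captured parameter in a class whose
    // process() member performs the call makes it acceptable work
    // for the thread pool.
    struct wrapped_call {
        typedef res_t result_type;
        int arg;  // the captured parameter
        explicit wrapped_call(int a) : arg(a) {}
        void process(result_type &r) { r.i = f(arg); }
    };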

Page 74: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Time for controversy ... what faces programmers ...

• Large-scale parallelism is here, now:
– Blade frames at work:
• 4 cores × 4 CPUs × 20 frames per rack = 320 thread units, in a NUMA architecture
• The hardware is expensive!
• But so is the software ...
– It must be programmed, economically
– The programs must be maintained ...
• Or it will be an expensive failure?

Page 75: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Talk summary

• In this talk we have looked at parallel processing and justified the reasons for Processing In Memory (PIM) and therefore cellular architectures

• We have briefly looked at two example architectures:

– Cyclops and picoChip

• Jason has worked on DIMES, the first implementation of a (cut-down) version of a cellular architecture

• The issues of programming for these massively parallel architectures have been described

• We focussed on the future of threading in C++

Page 76: The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

The University of Hertfordshire

The Challenges facing Libraries and Imperative Languages from Massively Parallel Architectures

Questions?

Building Futures in Computer Science: empowering people through technology

Jason McGuiness, Computer Science, University of Hertfordshire, UK

Colin Egan, Computer Science, University of Hertfordshire, UK