Chapter VIII Parallel Processor Organizations

PARALLEL PROCESSOR ORGANIZATIONS Jehan-François Pâris [email protected]



  • Chapter Organization
    • Overview
    • Writing parallel programs
    • Multiprocessor organizations
    • Hardware multithreading
    • Alphabet soup (SISD, SIMD, MIMD, ...)
    • Roofline performance model

  • OVERVIEW

  • The hardware side
    • Many parallel processing solutions
    • Multiprocessor architectures
      • Two or more microprocessor chips
      • Multiple architectures
    • Multicore architectures
      • Several processors on a single chip

  • The software side
    • Two ways for software to exploit the parallel processing capabilities of the hardware
    • Job-level parallelism
      • Several sequential processes run in parallel
      • Easy to implement (OS does the job!)
    • Process-level parallelism
      • A single program runs on several processors at the same time

  • WRITING PARALLEL PROGRAMS

  • Overview
    • Some problems are embarrassingly parallel
      • Many computer graphics tasks
      • Brute-force searches in cryptography or password guessing
    • Much more difficult for other applications
      • Communication overhead among sub-tasks
      • Amdahl's law
      • Balancing the load

  • Amdahl's Law
    • Assume a sequential process takes
      • tp seconds to perform operations that could be performed in parallel
      • ts seconds to perform purely sequential operations
    • The maximum speedup will be (tp + ts) / ts
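The bound above is the limit, as the number of processors n grows, of the speedup (tp + ts) / (ts + tp / n). A minimal C sketch follows, assuming the parallel part divides perfectly among the processors and that there is no communication overhead. The function names are hypothetical.

```c
/* Speedup on n processors, assuming the parallel part (tp seconds)
   splits evenly and the sequential part (ts seconds) does not shrink. */
double amdahl_speedup(double tp, double ts, int n)
{
    return (tp + ts) / (ts + tp / n);
}

/* Upper bound reached as n grows without limit: (tp + ts) / ts. */
double amdahl_max_speedup(double tp, double ts)
{
    return (tp + ts) / ts;
}
```

For example, with tp = 90 and ts = 10, ten processors give 100 / 19, or about 5.26, while the bound for any number of processors is 10.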

  • Balancing the load
    • Must ensure that the workload is equally divided among all the processors
    • Worst case is when one of the processors does much more work than all the others

  • Example (I)
    • Computation partitioned among n processors
    • One of them does 1/m of the work with m < n
    • That processor becomes a bottleneck
    • Maximum expected speedup: n
    • Actual maximum speedup: m

  • Example (II)
    • Computation partitioned among 64 processors
    • One of them does 1/8 of the work
    • Maximum expected speedup: 64
    • Actual maximum speedup: 8
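The two examples follow from one observation: the run ends only when the most heavily loaded processor finishes. A minimal C sketch follows, assuming identical processors, no communication overhead, work shares that sum to 1, and at least one processor. The function name is hypothetical.

```c
#include <stddef.h>

/* Speedup over a single processor when processor i performs the
   fraction share[i] of the total work. The slowest processor sets
   the finish time, so the speedup is 1 / (largest share). */
double imbalanced_speedup(const double *share, size_t n)
{
    double largest = 0.0;
    for (size_t i = 0; i < n; i++)
        if (share[i] > largest)
            largest = share[i];
    return 1.0 / largest;
}
```

With 64 processors, one holding 1/8 of the work and the other 63 sharing the remaining 7/8 equally (1/72 each), the function returns 8 rather than 64.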

  • A last issue
    • Humans like to address issues one after the other
      • We have meeting agendas
      • We do not like to be interrupted
      • We write sequential programs

  • René Descartes
    • Seventeenth-century French philosopher
    • Invented
      • Cartesian coordinates
      • Methodical doubt: "never to accept anything for true which I did not clearly know to be such"
    • Proposed a scientific method based on four precepts

  • Method's third rule
    "The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence."

  • MULTIPROCESSOR ORGANIZATIONS

  • Shared memory multiprocessors
    [Figure: processors connected through an interconnection network to shared RAM and I/O]

  • Shared memory multiprocessor
    • Can offer
      • Uniform memory access to all processors (UMA)
        • Easiest to program
      • Non-uniform memory access to all processors (NUMA)
        • Can scale up to larger sizes
        • Offers faster access to nearby memory

  • Computer clusters
    [Figure: separate computers linked by an interconnection network]

  • Computer clusters
    • Very easy to assemble
    • Can take advantage of high-speed LANs
      • Gigabit Ethernet, Myrinet, ...
    • Data exchanges must be done through message passing

  • Message passing (I)
    • If processor P wants to access data in the main memory of processor Q, it must
      • Send a request to Q
      • Wait for a reply
    • For this to work, processor Q must have a thread
      • Waiting for messages from other processors
      • Sending them replies
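The request/reply exchange can be sketched as a single-process C simulation. This is only an illustration: the one-slot mailboxes stand in for the interconnection network, Q's service thread is modeled as a function that P's wait loop calls, and all names and message formats are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define Q_MEM_WORDS 8

/* Hypothetical message formats for a remote read. */
struct request { bool pending; size_t index; };
struct reply   { bool ready;   int value;    };

/* Q's private memory: P cannot access it directly. */
static int q_memory[Q_MEM_WORDS] = {10, 11, 12, 13, 14, 15, 16, 17};

/* One-slot mailboxes standing in for the interconnection network. */
static struct request q_inbox;
static struct reply   p_inbox;

/* One step of Q's service thread: take a pending request, send a reply. */
void q_service_step(void)
{
    if (!q_inbox.pending)
        return;
    p_inbox.value = q_memory[q_inbox.index % Q_MEM_WORDS];
    p_inbox.ready = true;
    q_inbox.pending = false;
}

/* P's side: send a request to Q, then wait for the reply. */
int p_remote_read(size_t index)
{
    q_inbox.index = index;
    q_inbox.pending = true;      /* send the request */
    while (!p_inbox.ready)       /* wait for the reply */
        q_service_step();        /* simulation only: lets Q run */
    p_inbox.ready = false;
    return p_inbox.value;
}
```

On a real cluster the two sides would run on different machines and the mailboxes would be network messages, which is why every remote access costs a round trip.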

  • Message passing (II)
    • In a shared memory architecture, each processor can directly access all data
    • A proposed solution
      • Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
      • Still has performance issues

  • When things do not add up
    • Memory capacity is very important for big computing applications
    • If the data can fit into main memory, the computation will run much faster

  • A problem
    • A company replaced
      • A single shared-memory computer with 32 GB of RAM
      • By four clustered computers with 8 GB of RAM each
    • Result: more I/O than ever
    • What happened?

  • The explanation
    • Assume the OS occupies 1 GB of RAM
      • The old shared-memory computer still had 31 GB of free RAM
      • Each of the clustered computers has 7 GB of free RAM
    • The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
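The arithmetic generalizes: every node runs its own copy of the OS, so the overhead is paid once per node. A minimal C sketch follows, assuming the OS footprint is the same on every node. The function name is hypothetical.

```c
/* RAM left for applications when each of `nodes` machines has
   `ram_per_node_gb` GB and runs its own OS copy using `os_gb` GB. */
int usable_ram_gb(int nodes, int ram_per_node_gb, int os_gb)
{
    return nodes * (ram_per_node_gb - os_gb);
}
```

Calling usable_ram_gb(1, 32, 1) gives 31 for the old machine, and usable_ram_gb(4, 8, 1) gives 28 for the cluster.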

  • Grid computing
    • The computers are distributed over a very large network
    • Sometimes computer time is donated
      • Volunteer computing
      • SETI@home
    • Works well with embarrassingly parallel workloads
      • Searches in an n-dimensional space

  • HARDWARE MULTITHREADING

  • General idea
    • Let the processor switch to another thread of computation while the current one is stalled
    • Motivation: increased cost of cache misses

  • Implementation
    • Entirely controlled by the hardware
      • Unlike multiprogramming
    • Requires a processor capable of
      • Keeping track of the state of each thread
        • One set of registers, including the PC, for each concurrent thread
      • Quickly switching among concurrent threads
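The two hardware requirements can be modeled in C as a per-thread context plus a selection rule. This is a software sketch of what the hardware does, not an actual processor design. The register count, the round-robin policy and all names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 32   /* assumed register count, for illustration only */

/* State the hardware keeps for each concurrent thread. */
struct hw_context {
    uint64_t pc;               /* program counter */
    uint64_t regs[NUM_REGS];   /* private register set */
    bool     stalled;          /* e.g. waiting on a cache miss */
};

/* Pick the next thread to run: scan round-robin starting with the
   thread after `current` and return the first one that is not stalled.
   Returns -1 when every thread is stalled. */
int next_ready_thread(const struct hw_context *ctx, int nthreads, int current)
{
    for (int step = 1; step <= nthreads; step++) {
        int candidate = (current + step) % nthreads;
        if (!ctx[candidate].stalled)
            return candidate;
    }
    return -1;
}
```

Because each thread owns its registers and PC, a switch only changes which context is selected. Nothing has to be saved to or restored from memory, which is what makes the switch fast.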

  • Approaches
    • Fine-grained multithreading
      • Switches between threads for each instruction
      • Provides highest throughputs
      • Slows down execution of individual threads

  • Approaches
    • Coarse-grained multithreading
      • Switches between threads whenever a long stall is detected
      • Easier to implement
      • Cannot eliminate all stalls

  • Approaches
    • Simultaneous multithreading
      • Takes advantage of the ability of modern hardware to perform different tasks in parallel for instructions of different threads
      • Best solution

  • ALPHABET SOUP

  • Overview
    • Used to describe processor organizations where
      • Same instructions can be applied to multiple data instances
    • Encountered in
      • Vector processors in the past
      • Graphics processing units (GPUs)
      • x86 multimedia extensions

  • Classification
    • SISD: Single instruction, single data
      • Conventional uniprocessor architecture
    • MIMD: Multiple instructions, multiple data
      • Conventional multiprocessor architecture

  • Classification
    • SIMD: Single instruction, multiple data
      • Performs the same operations on a set of similar data
      • Think of adding two vectors

    for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];

  • Vector computing
    • Kind of SIMD architecture
    • Used by Cray computers
    • Pipelines multiple executions of a single instruction with different data (vectors) through the ALU
    • Requires
      • Vector registers able to store multiple values
      • Special vector instructions: say lv, addv, ...
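To make the role of vector registers and vector instructions concrete, here is a plain C model of a vector addition done in register-sized chunks. It only imitates the semantics: on vector hardware each helper would be one instruction. The names lv and addv come from the slide. The store helper sv, the register length MVL = 64 and the struct layout are assumptions.

```c
#include <stddef.h>

#define MVL 64   /* assumed maximum vector length of one register */

/* A vector register: holds up to MVL values at once. */
struct vreg { double elem[MVL]; };

/* lv: load `len` consecutive values from memory into a vector register. */
void lv(struct vreg *dst, const double *src, size_t len)
{
    for (size_t i = 0; i < len; i++) dst->elem[i] = src[i];
}

/* addv: element-wise addition of two vector registers. */
void addv(struct vreg *dst, const struct vreg *x, const struct vreg *y, size_t len)
{
    for (size_t i = 0; i < len; i++) dst->elem[i] = x->elem[i] + y->elem[i];
}

/* sv (hypothetical store counterpart of lv): write a register to memory. */
void sv(double *dst, const struct vreg *src, size_t len)
{
    for (size_t i = 0; i < len; i++) dst[i] = src->elem[i];
}

/* sum = a + b for n elements, processed in chunks of at most MVL. */
void vector_add(double *sum, const double *a, const double *b, size_t n)
{
    struct vreg v1, v2, v3;
    for (size_t base = 0; base < n; base += MVL) {
        size_t len = (n - base < MVL) ? n - base : MVL;
        lv(&v1, a + base, len);
        lv(&v2, b + base, len);
        addv(&v3, &v1, &v2, len);
        sv(sum + base, &v3, len);
    }
}
```

Under these assumptions, adding two 100-element vectors takes two passes through the outer loop (64 elements, then 36) instead of 100 scalar iterations.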

  • Benchmarking
    • Two factors to consider
      • Memory bandwidth
        • Depends on the interconnection network
      • Floating-point performance
        • Best known benchmark is LINPACK

  • Roofline model
    • Takes into account
      • Memory bandwidth
      • Floating-point performance
    • Introduces arithmetic intensity
      • Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
      • Measured in floating-point operations per byte (FLOPs/byte)

  • Roofline model
    • Attainable GFLOPs/s = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
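The formula translates directly into a small C function. This is a sketch with hypothetical parameter names, taking bandwidth in GB/s, intensity in FLOPs per byte and peak performance in GFLOPs/s.

```c
/* Attainable performance in GFLOPs/s under the roofline model.
   peak_bw_gb_s: peak memory bandwidth in GB/s
   intensity:    arithmetic intensity in FLOPs per byte
   peak_gflops:  peak floating-point performance in GFLOPs/s */
double roofline_gflops(double peak_bw_gb_s, double intensity, double peak_gflops)
{
    double memory_bound = peak_bw_gb_s * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For an illustrative machine with 16 GB/s of bandwidth and a 16 GFLOPs/s peak, an intensity of 0.25 yields 4 GFLOPs/s, and any intensity of 1 or more yields the full 16.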

  • Roofline model
    [Figure: attainable GFLOPs/s plotted against arithmetic intensity. On the sloped part of the roofline, floating-point performance is limited by memory bandwidth; the flat part is the peak floating-point performance.]

    Chart data:

    Arithmetic intensity    Attainable GFLOPs/s
    0.125                   2
    0.25                    4
    0.5                     8
    1                       16
    2                       16
    4                       16
    8                       16
    16                      16