
Counting and Distributed Coordination
BASED ON CHAPTER 12 IN «THE ART OF MULTIPROCESSOR PROGRAMMING»
LECTURE BY: SAMUEL AMAR

INTRO
What we will discuss today:
The shared counting problem.
Data structures: Combining Trees, Counting Networks, Diffraction Trees.

HOW DO WE MEASURE PERFORMANCE?
Latency - the time for an individual method call to complete.
Throughput - the overall rate at which method calls complete.

EXAMPLE: POOLS
A pool is a data structure with Put() and Get() methods.
Problem: the lock protecting Get() and Put() is a bottleneck.
Solution: a cyclic array and two counters!

YET MORE PROBLEMS
How do we prevent memory contention?
How do we parallelize the counter++?
We need a way to build parallel counters that spread the handed-out indexes as evenly as possible.

COMBINING TREE
A combining tree is a binary tree of nodes.
The counter is in the root.
Each thread is assigned a leaf.
At most two threads share a leaf.

ALGORITHM OVERVIEW
Each thread that calls GetAndIncrement() climbs the tree toward the root, applies count++ there, and returns the counter's previous value.
If two threads arrive at a node simultaneously, the active thread continues up the tree and updates the counter with the combined value, while the passive thread waits for the active thread to come back with its result.

EXAMPLE
(Slide diagrams: threads A and B meet at a shared leaf and combine their increments; the root counter advances from 3 to 5, B returns 3 and A returns 4.)

ADVANTAGES AND DISADVANTAGES
Advantages: good throughput - p concurrent calls complete in O(log p) time on a combining tree, versus O(p) time on a lock-based counter; and the tree can combine any function applied at the root, not just increment.
Disadvantage: bad latency - an individual call takes O(log p) time, versus O(1) on a lock-based counter.

NODE IMPLEMENTATION
Each node has six fields (sketched in code below):
Locked - true if the node is locked.
FirstValue - the value of the active thread.
SecondValue - the value of the passive thread.
Result - the final combined value.
Parent - a pointer to the node's parent.
CStatus - the node's combining status.

CSTATUS
IDLE - the node is not in use.
FIRST - one thread has visited.
SECOND - a second thread has visited.
RESULT - both threads' operations have completed.
ROOT - the node is the root.
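To make the node state concrete, here is a minimal Java sketch of the fields listed above and of the precombining phase, in the spirit of the combining-tree implementation described in the textbook chapter. Only one phase is shown; the combining, operation, and distribution phases follow the same per-node locking discipline. The exception used for an unexpected state is an illustrative choice.

```java
// Sketch of a combining-tree node, assuming Java intrinsic locks (synchronized / wait).
enum CStatus { IDLE, FIRST, SECOND, RESULT, ROOT }

class Node {
    CStatus cStatus;
    boolean locked = false;       // true while the node is locked
    int firstValue, secondValue;  // values deposited by the active / passive thread
    int result;                   // combined result handed back down the tree
    final Node parent;            // null at the root

    Node() { cStatus = CStatus.ROOT; parent = null; }                   // root node
    Node(Node myParent) { cStatus = CStatus.IDLE; parent = myParent; }  // internal node

    // Precombining phase: called while climbing from the leaf toward the root.
    // Returns true if the caller is the first (active) thread at this node and
    // should keep climbing, false if it must stop here (second thread, or root).
    synchronized boolean precombine() throws InterruptedException {
        while (locked) wait();                 // wait until the node is unlocked
        switch (cStatus) {
            case IDLE:
                cStatus = CStatus.FIRST;       // first to arrive: keep climbing
                return true;
            case FIRST:
                locked = true;                 // second to arrive: lock the node;
                cStatus = CStatus.SECOND;      // the active thread will combine for us
                return false;
            case ROOT:
                return false;                  // reached the root: stop climbing
            default:
                throw new IllegalStateException("unexpected node state " + cStatus);
        }
    }
}
```

A thread that stops at a non-root node later deposits its combined value there and waits for the active thread to bring back a result, as the worked example that follows illustrates.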
ADVANCED EXAMPLE
(Slide diagrams: a step-by-step walkthrough of the four phases - initialization, precombining, combining, and distribution - on a small tree. The root counter starts at 3; four increments are combined, the root advances to 7, and the participating threads return 3, 4, 5, and 6.)

ALL THAT FOR COUNT++?!

PERFORMANCE REVIEW
The tree is optimal when threads arrive at the leaves at just the right times and combining is maximized.
What happens when contention is low? How long do we wait for another thread to come and combine?

ROBUSTNESS
An algorithm is robust if it performs well in the presence of large fluctuations in request arrival times.
Is the combining tree a robust algorithm? No!

MOTIVATION
We need an algorithm that counts tokens regardless of their arrival times or order.

INDEX DISTRIBUTION
Given a stream of incoming tokens and w shared counters, how would we like to distribute the tokens among the exits? Ideally, counter i hands out the values i, i + w, i + 2w, and so on, so that between them the counters hand out consecutive indexes.

THE STEP PROPERTY
No matter how token arrivals are distributed among the input wires, the output distribution is balanced across the output wires, with the top output wires filled first.
A network with this property balances the tokens perfectly.

BALANCER
A component with two entries and two exits.
It contains a toggle that points alternately up and down.
Every token exits on the wire indicated by the toggle and flips it.
A balancer satisfies the step property for w = 2.

COUNTING NETWORK
A counting network of width k:
is constructed only from balancers,
has k input and k output lines,
and satisfies the step property.

COUNTING NETWORK EXAMPLE
(Slide diagram: an example counting network of small width.)

BITONIC[2K] COUNTING NETWORK
A counting network of width 2k, defined inductively for k a power of 2:
Bitonic[2] is a single balancer.
For larger widths, Bitonic[2k] passes the outputs of two Bitonic[k] networks into a Merger[2k] network.

MERGER[2K]
Used to merge the outputs of two Bitonic[k] networks. Defined inductively for k a power of 2:
Merger[2] is a single balancer.
For larger widths, Merger[2k] feeds the even and odd wires into two half-width Merger[k] networks and passes their outputs through a final layer of k balancers.
The Bitonic network fulfills the step property!

BITONIC NETWORK IMPLEMENTATION
A straightforward implementation in which the tokens are the threads themselves.
A Balancer contains a simple toggle switch with four pointers (two entries and two exits).
A Merger contains a two-element array of lower-order mergers and an array holding its final layer of balancers.
A Bitonic contains a two-element array of lower-order Bitonic networks and a larger merger.
All classes provide a traverse(i) method; a compact sketch follows below.
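Here is a compact Java sketch of these three classes, assuming each balancer is a synchronized toggle and that the "tokens" are the calling threads. The recursive wiring follows the inductive definitions above; the exact field names and decomposition are an illustration, not a fixed API.

```java
// Balancer: a toggle with two output wires (0 = top, 1 = bottom).
class Balancer {
    private boolean toggle = true;                  // true: next token exits on the top wire

    synchronized int traverse() {
        boolean up = toggle;
        toggle = !toggle;                           // flip for the next token
        return up ? 0 : 1;
    }
}

// Merger[width]: merges two width/2 input sequences that each have the step property.
class Merger {
    private final int width;
    private final Merger[] half;                    // two half-width mergers (null when width == 2)
    private final Balancer[] layer;                 // final layer of width/2 balancers

    Merger(int width) {
        this.width = width;
        layer = new Balancer[width / 2];
        for (int i = 0; i < width / 2; i++) layer[i] = new Balancer();
        half = (width > 2) ? new Merger[]{ new Merger(width / 2), new Merger(width / 2) } : null;
    }

    int traverse(int input) {                       // maps an input wire to an output wire
        int output = 0;
        if (width > 2) {
            // Even wires of the top half and odd wires of the bottom half go through half[0],
            // the remaining wires through half[1]; each enters its half-width merger at input/2.
            if (input < width / 2) {
                output = half[input % 2].traverse(input / 2);
            } else {
                output = half[1 - (input % 2)].traverse(input / 2);
            }
        }
        // Output i of each half-width merger meets the other's output i at balancer i.
        return (2 * output) + layer[output].traverse();
    }
}

// Bitonic[width]: two half-width Bitonic networks feeding a Merger[width].
class Bitonic {
    private final int width;
    private final Bitonic[] half;                   // null when width == 2
    private final Merger merger;

    Bitonic(int width) {
        this.width = width;
        merger = new Merger(width);
        half = (width > 2) ? new Bitonic[]{ new Bitonic(width / 2), new Bitonic(width / 2) } : null;
    }

    int traverse(int input) {
        int output = 0;
        int subnet = input / (width / 2);           // 0 = top half, 1 = bottom half
        if (width > 2) {
            output = half[subnet].traverse(input % (width / 2));
        }
        return merger.traverse(subnet * (width / 2) + output);
    }
}
```

Note that Bitonic[2] degenerates to its Merger[2], that is, a single balancer, matching the base case of the definition.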
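To turn the network into a shared counter, a local counter can be attached to every output wire so that wire i hands out the values i, i + width, i + 2*width, and so on, as described under INDEX DISTRIBUTION above. The wrapper below, building on the Bitonic sketch above, is a hypothetical illustration of that idea; the class name NetworkCounter and the choice of input wire by thread id are not from the slides.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper: a Bitonic network plus one local counter per output wire.
class NetworkCounter {
    private final int width;
    private final Bitonic network;
    private final AtomicInteger[] wire;

    NetworkCounter(int width) {
        this.width = width;
        network = new Bitonic(width);
        wire = new AtomicInteger[width];
        for (int i = 0; i < width; i++) wire[i] = new AtomicInteger(i);  // wire i starts at i
    }

    int getAndIncrement() {
        int in = (int) (Thread.currentThread().getId() % width);  // pick an input wire for this thread
        int out = network.traverse(in);                           // wire on which the token exits
        return wire[out].getAndAdd(width);                        // wire i returns i, i+width, i+2*width, ...
    }
}
```

Contention is spread over many balancers and per-wire counters instead of one hot counter, at the cost of traversing the whole network on every call.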
BITONIC NETWORK DEPTH
The depth of Bitonic[w] is d(w) = (log w)(log w + 1) / 2 = O(log^2 w).

PERFORMANCE REVIEW
Throughput is optimal when the number of threads is roughly the number of balancers, so that every balancer is kept busy.
Performance improves as more threads are added, until it plateaus and then degrades.
The network is wait-free or lock-free, depending on the balancer implementation.
But is this really a counting network?

PERIODIC COUNTING NETWORK
An alternative construction: Periodic[2k] is a sequence of log(2k) identical Block[2k] networks.

BLOCK[2K]
Defined inductively, for k a power of 2:
Block[2] is a single balancer.
For larger widths, Block[2k] is built from two Block[k] networks whose corresponding outputs are merged through a layer of k balancers.
It fulfills the step property!

DIFFERENCE
(Slide diagram illustrating the difference between the two constructions.)

MOTIVATION / DIFFRACTION BALANCER
Let's consider a new type of balancer with only one input. It works the same way as before, sending successive tokens to output wires 0 and 1 alternately.

TREE[2K]
A binary tree defined inductively, for k a power of 2:
Tree[2] is a single diffraction balancer.
For larger widths, Tree[2k] joins two Tree[k] networks under one new root diffraction balancer; the top tree provides the even-numbered outputs and the bottom tree the odd-numbered ones.

STEP PROPERTY
But do we fulfill the step property? Yes!

PARTIAL PROOF
We prove inductively that the outputs are filled from top to bottom, mod w:
For k = 2, this is just a diffraction balancer.
For k > 2, assume each Tree[k] has the step property; the outputs of Tree[2k] are a perfect shuffle of the two Tree[k] outputs.

REVIEW SO FAR
Advantage: the depth is now only O(log k)!
Disadvantage: the root node is a bottleneck.

ATTEMPTED SOLUTION
Observation: if an even number of tokens pass through a balancer, the outputs are evenly balanced between the top and bottom wires, and the balancer's state remains unchanged.
How can we use this to our advantage?

EXCHANGER
A data structure that allows two threads to exchange values.
An exchange attempt has a timeout.

PRISM
Essentially an array of exchangers.
The array is accessed at random slots through the visit() method.
visit() returns a Boolean describing the exchange that was made, or throws a TimeoutException.

PRISMS
Each thread calls visit() and proposes its thread ID.
If an exchange is made, the thread with the higher ID takes the top wire and its partner the bottom wire; neither touches the toggle.
Otherwise, the thread falls back to toggling the balancer.
(A code sketch of this mechanism appears at the end of the section.)

PERFORMANCE REVIEW
Performance depends on two major parameters:
Timeout - too small = misses; too big = wasted time.
Prism size - too small = missed opportunities; too big = misses.
What are the best parameters? Set them dynamically according to contention!
Under optimal parameters, diffraction trees are believed to outperform counting networks and combining trees.
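To make the prism mechanism concrete, here is a minimal sketch of a diffracting balancer built from java.util.concurrent.Exchanger objects. The class name, the fixed timeout, and the use of thread ids as the exchanged values are illustrative choices; as noted above, the timeout and prism size would ideally be tuned dynamically.

```java
import java.util.concurrent.Exchanger;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a diffracting balancer: try to pair off ("diffract") with another thread
// through a random exchanger in the prism; fall back to the toggle on timeout.
class DiffractingBalancer {
    private static final long TIMEOUT_NS = 100_000;   // tuning knob, see the discussion above
    private final Exchanger<Long>[] prism;            // the prism: an array of exchangers
    private boolean toggle = true;                    // fallback toggle, as in a plain balancer

    @SuppressWarnings("unchecked")
    DiffractingBalancer(int prismSize) {
        prism = new Exchanger[prismSize];
        for (int i = 0; i < prismSize; i++) prism[i] = new Exchanger<>();
    }

    // Returns the output wire (0 = top, 1 = bottom) for the calling thread.
    int traverse() throws InterruptedException {
        long me = Thread.currentThread().getId();
        int slot = ThreadLocalRandom.current().nextInt(prism.length);
        try {
            // Diffraction: two paired threads take opposite wires without touching the toggle.
            long other = prism[slot].exchange(me, TIMEOUT_NS, TimeUnit.NANOSECONDS);
            return me > other ? 0 : 1;                // the higher id takes the top wire
        } catch (TimeoutException e) {
            // No partner showed up in time: fall back to the shared toggle.
            synchronized (this) {
                boolean up = toggle;
                toggle = !toggle;
                return up ? 0 : 1;
            }
        }
    }
}
```

A Tree[2k] built from such balancers sends each diffracted pair to opposite subtrees, so under high contention the shared toggles are touched far less often.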