Parallel Sorting · liacs.leidenuniv.nl/~wijshoffhag/PPI2017_2018/Lecture_10.pdf

Parallel Sorting

A jungle

Illustration

https://www.youtube.com/watch?v=kPRA0W1kECg

(Sequential) Sorting

• Bubble Sort, Insertion Sort: O(n^2)

• Merge Sort, Heap Sort, QuickSort: O(n log n); QuickSort is best on average

• Optimal parallel time complexity: O(n log n)/P; if P = n then O(log n)

Insertion Sort

Insertion_Sort (A)
  for i from 1 to |A| - 1
    j = i
    while j > 0 and A[j-1] > A[j]
      swap A[j] and A[j-1]
      j = j - 1
  Return ( A )

Inherently sequential, so hard to parallelize!!!!
→ Only through pipelining can a speedup be realized

Pipelined Insertion Sort

T_pipelined = 2n with n processors, so the maximal speedup ≈ n/4 - 3/4
(worst-case sequential time = (n-1)(n-2)/2 = n^2/2 - 3n/2 + 1)
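The slide's figure is not available, so the following is only a sketch of one common pipelining scheme, assumed here for illustration: a linear chain of n processors in which each processor keeps the smallest value it has seen and passes the larger one to its right neighbour. The function name and the step counter are hypothetical; the simulation shows that all values settle within about 2n steps.

```python
def pipelined_insertion_sort(values):
    """Simulate a linear pipeline; processor k ends up holding the k-th smallest value."""
    n = len(values)
    held = [None] * n             # value currently kept by each processor
    arriving = [None] * (n + 1)   # arriving[k]: value reaching processor k in this step
    fed = 0                       # number of inputs already pushed into the pipeline
    steps = 0
    while fed < n or any(v is not None for v in arriving):
        if fed < n:               # one new input enters at processor 0 per step
            arriving[0] = values[fed]
            fed += 1
        nxt = [None] * (n + 1)
        for k in range(n):        # conceptually, all processors act at the same time
            v = arriving[k]
            if v is None:
                continue
            if held[k] is None:
                held[k] = v       # first value seen: just keep it
            else:
                # keep the smaller value, pass the larger one to the right
                held[k], nxt[k + 1] = min(held[k], v), max(held[k], v)
        arriving = nxt
        steps += 1
    return held, steps
```

Calling `pipelined_insertion_sort([5, 2, 4, 1, 3])` returns the sorted values together with the number of pipeline steps, which stays below 2n.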

Parallel Merge Sort

Merge_Sort (A)
  n = |A|
  if n <= 1 then return A
  halfway = floor(n/2)
  DOINPARALLEL
    Merge_Sort (A[1]…A[halfway])
    Merge_Sort (A[halfway+1]…A[n])
  j = 1; current = 1
  for i from 1 to halfway
    while j ≤ n-halfway and A[halfway + j] < A[i]
      X[current] = A[halfway + j]
      j = j + 1; current = current + 1
    X[current] = A[i]
    current = current + 1
  while j ≤ n-halfway   # copy what is left of the second half
    X[current] = A[halfway + j]
    j = j + 1; current = current + 1
  Return ( X )
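As a rough illustration of the DOINPARALLEL step, the sketch below sorts the two halves concurrently using Python threads. The names `merge`, `parallel_merge_sort` and the `depth` parameter (how many recursion levels spawn threads) are assumptions made for this example. In CPython the threads mainly show the structure; they do not give a real speedup for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    """Sequentially merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])    # at most one of these two tails is non-empty
    out.extend(right[j:])
    return out

def parallel_merge_sort(a, depth=2):
    """Sort a; the two halves are sorted concurrently for the first `depth` levels."""
    if len(a) <= 1:
        return list(a)
    halfway = len(a) // 2
    if depth > 0:
        with ThreadPoolExecutor(max_workers=2) as pool:
            lf = pool.submit(parallel_merge_sort, a[:halfway], depth - 1)
            rf = pool.submit(parallel_merge_sort, a[halfway:], depth - 1)
            left, right = lf.result(), rf.result()
    else:
        left = parallel_merge_sort(a[:halfway], 0)
        right = parallel_merge_sort(a[halfway:], 0)
    return merge(left, right)
```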

[Figure: array A with the positions i, halfway, halfway + j and n marked]

In a picture

Notes Merge Sort

• Collects the sorted list onto one processor, merging as items come together

• Maps well to a tree structure: sorting locally on the leaves, then merging up the tree

• As items approach the root of the tree, processors are dropped, limiting parallelism

• O(n) if P = n: (1 + 2 + 4 + … + n/2 + n) = n(1 + 1/2 + 1/4 + …) ≤ 2n

Parallel QuickSort

QuickSort (A)
  if |A| <= 1 then return A
  i = rand_int (|A|)
  p = A[i]
  DOINPARALLEL
    L = QuickSort({a ∈ A | a < p})
    E = {a ∈ A | a = p}
    G = QuickSort({a ∈ A | a > p})
  Return ( L || E || G )
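A minimal sketch of the QuickSort pseudocode in Python, assuming threads stand in for DOINPARALLEL. The `depth` parameter is a hypothetical addition that limits how many recursion levels spawn threads; below that depth the recursion is sequential.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_quicksort(a, depth=2):
    """Three-way QuickSort; L and G are sorted concurrently for the first `depth` levels."""
    if len(a) <= 1:
        return list(a)
    p = a[random.randrange(len(a))]          # random pivot
    less = [x for x in a if x < p]           # L
    equal = [x for x in a if x == p]         # E needs no further sorting
    greater = [x for x in a if x > p]        # G
    if depth > 0:
        with ThreadPoolExecutor(max_workers=2) as pool:
            lf = pool.submit(parallel_quicksort, less, depth - 1)
            gf = pool.submit(parallel_quicksort, greater, depth - 1)
            less, greater = lf.result(), gf.result()
    else:
        less = parallel_quicksort(less, 0)
        greater = parallel_quicksort(greater, 0)
    return less + equal + greater            # L || E || G
```

With a poorly balanced pivot one of the two tasks does nearly all the work, which is the load-balance concern raised in the notes.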

If we assume that the pivots are chosen such that L and G are about equal in size, then

Sequential: T(n) = 2T(n/2) + O(n) = O(n log n). In fact it can be proven that, with randomly chosen pivots, this bound holds in expectation!

For parallel execution the choice of i is crucial for load balance. Even more importantly, we would like to choose multiple pivots (p-1) at the same time, so that each time we get p partitions which can be executed in parallel.

P partitions

• For a given p (number of partitions, so p-1 pivots) and s (oversampling rate), first select at random p*s candidate pivots

  for i from 1 to p*s
    Cand[i] = A[rand_int (|A|)]

• Sort the list of candidate pivots: Cand[i]
• Choose Cand[s], Cand[2*s], …, Cand[(p-1)*s]

Find a good value for the oversampling rate: s > 1,
→ s should not lead to very long sorting times
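The sampling steps above can be sketched as follows. The helper names `choose_pivots` and `partition`, and the use of `bisect` to find the bucket of each element, are assumptions for illustration; candidates are drawn as values of A, and the p buckets could then be sorted independently.

```python
import random
from bisect import bisect_right

def choose_pivots(a, p, s, rng=random):
    """Draw p*s random candidates from a, sort them, and keep every s-th one."""
    cand = sorted(rng.choice(a) for _ in range(p * s))
    # Cand[s], Cand[2*s], ..., Cand[(p-1)*s] in the 1-based notation of the notes
    return [cand[k * s - 1] for k in range(1, p)]

def partition(a, pivots):
    """Split a into len(pivots) + 1 buckets; bucket b holds values between pivots b-1 and b."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect_right(pivots, x)].append(x)
    return buckets
```

A larger s tends to give more evenly sized buckets, at the cost of sorting a longer candidate list.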

Parallel Radix Sort

Instead of comparing values: COMPARE DIGITS

Radix_Sort (A, b)   # Assume binary representations of keys
  for i from 0 to b-1
    FLAGS = { (a>>i) mod 2 | a ∈ A }
    NOTFLAGS = { 1-FLAGS[a] | a ∈ A }
    R_0 = SCAN (NOTFLAGS)
    s_0 = SUM (NOTFLAGS)
    R_1 = SCAN (FLAGS)
    R = { if FLAGS[j] == 0 then R_0[j] else R_1[j] + s_0 | j ∈ [0…|A|-1] }
    A = A sorted by R
  Return ( A )

(a>>i) mod 2: right shift i times, so e.g. (01101>>2) mod 2 = 00011 mod 2 = 1

So (a>>i) mod 2 equals the (i+1)th rightmost bit of a
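The Radix_Sort pseudocode can be sketched in Python as below, assuming non-negative integer keys smaller than 2^b and taking SCAN to be an exclusive prefix sum (entry j is the sum of all earlier entries). The helper name `exclusive_scan` is hypothetical; here the scans run sequentially, whereas in a parallel setting they would be the part that is parallelised.

```python
from itertools import accumulate

def exclusive_scan(xs):
    """Prefix sums where entry j is the sum of xs[0..j-1]."""
    if not xs:
        return []
    return [0] + list(accumulate(xs))[:-1]

def radix_sort(a, b):
    """LSD radix sort of non-negative integers below 2**b, one bit per pass."""
    a = list(a)
    for i in range(b):
        flags = [(x >> i) & 1 for x in a]        # bit i of every key
        notflags = [1 - f for f in flags]
        r0 = exclusive_scan(notflags)            # target slots for keys with bit 0
        s0 = sum(notflags)                       # number of keys with bit 0
        r1 = exclusive_scan(flags)               # offsets for keys with bit 1
        rank = [r0[j] if flags[j] == 0 else r1[j] + s0 for j in range(len(a))]
        out = [None] * len(a)
        for j, x in enumerate(a):                # "A sorted by R": a stable scatter
            out[rank[j]] = x
        a = out
    return a
```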

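LSD/MSD Radix Sort

Instead of (a>>i) mod 2

one can also implement Radix Sort with: (a<<i) div 2^(b-1)
(with a held in a b-bit word, so that bits shifted out on the left are dropped)

The first implementation is called least significant digit Radix Sort or LSD Radix Sort. The latter one is MSD (most significant digit) Radix Sort.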
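A small sketch comparing the two bit selections. Python integers are unbounded, so the b-bit word is modelled with an explicit `mod 2^b`; that reduction is an assumption made explicit here, and the function names are illustrative only.

```python
def lsd_bit(a, i):
    """(i+1)-th rightmost bit of a: (a >> i) mod 2."""
    return (a >> i) % 2

def msd_bit(a, i, b):
    """(i+1)-th leftmost bit of the b-bit word a: (a << i) div 2^(b-1), truncated to b bits."""
    return ((a << i) % (1 << b)) // (1 << (b - 1))
```

For the 5-bit key 01101, `lsd_bit` walks through the bits from right to left as i grows, while `msd_bit` walks through them from left to right.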

Notes Radix Sort

• Sequential time complexity: T(n) = O(b·n): b iterations, each iteration O(n)
• Note that b ≈ log n, so a total of O(n log n)
• Instead of single digits, a block of r digits can be taken each time, resulting in b/r iterations
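To illustrate the last note, here is a hedged sketch that consumes r bits per pass using a counting sort on each r-bit digit. This counting-based pass is a standard substitute chosen for the example, not necessarily the formulation used in the lecture; the function name is hypothetical.

```python
def radix_sort_blocks(a, b, r):
    """LSD radix sort on b-bit keys, r bits per pass, so about b/r passes."""
    a = list(a)
    mask = (1 << r) - 1
    for shift in range(0, b, r):
        counts = [0] * (1 << r)
        for x in a:
            counts[(x >> shift) & mask] += 1
        # exclusive prefix sum of the counts: first output slot of each digit value
        start, total = [], 0
        for c in counts:
            start.append(total)
            total += c
        out = [None] * len(a)
        for x in a:                       # stable: equal digits keep their order
            d = (x >> shift) & mask
            out[start[d]] = x
            start[d] += 1
        a = out
    return a
```

Larger r means fewer passes but a counts array of size 2^r, so r is a trade-off between iteration count and per-pass overhead.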

Illustration (LSD Radix Sort)

Sorting of each selected digit in Radix Sort, with Prefix Sum Based Sorting

Each element i of the prefix sum array has the SUM of all elements whose index is smaller than i

What is the relationship with sorting?

• All bits which are equal to 0 are flagged with a 1
• Compute the Prefix Sum of this flag array
• Store all flagged (1) entries of x[k] in the location indicated by the prefix sum

Second stage

• All bits which are equal to 1 are flagged with a 1
• Compute the Prefix Sum of this flag array
• Store all flagged (1) entries of x[k] in the next locations indicated by the prefix sum

What about parallel execution?

• Computationally, the sorting algorithm is reduced to computing the prefix sum arrays for each bit ranking.

• However, computing these prefix sum arrays seems to be inherently sequential. Or not?

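Parallel Execution of Prefix Sums

Prefix_Sum (X)   # X an array of n bits
  for index from 0 to (log n) - 1
    DOINPARALLEL forall k   # all reads use the values from before this iteration
      if k >= 2^index then X[k] = X[k] + X[k-2^index]
  X >> 1   # Shift all entries one position to the right (X[0] becomes 0)
  Return ( X )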
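A sequential simulation of the Prefix_Sum pseudocode, as a sketch. Each pass builds a new list from the old one, which mimics all k reading their operands before any of them writes. The variable `stride` plays the role of 2^index, and the final shift turns the inclusive sums into the exclusive prefix sums used for sorting.

```python
def parallel_prefix_sum(x):
    """Exclusive prefix sums computed in about log2(n) data-parallel passes."""
    x = list(x)
    n = len(x)
    stride = 1                                   # 2^index
    while stride < n:
        # one parallel step: every k >= stride adds the element `stride` positions back
        x = [x[k] + x[k - stride] if k >= stride else x[k] for k in range(n)]
        stride *= 2
    return [0] + x[:-1] if n else []             # shift all entries to the right
```

For the flag array `[1, 0, 1, 1, 0, 1, 0, 0]` this gives `[0, 1, 1, 2, 3, 3, 4, 4]`: entry i is the number of flagged positions before i.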

Illustration of parallel Prefix Sums

Improving Cache Performance

• The parallel prefix sum algorithm requires the whole array to be fetched at each iteration
• Bad cache performance
• Through Tiling Techniques the X array can be cut into slices (tiles)
• Once every number of iterations: re-tile!!
• A CUDA implementation of the overall alg. can be found on https://github.com/debdattabasu/amp-radix-sort

[Figure: array X cut into tiles assigned to processors P1, P2 and P3, with the stride 2^index marked]
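The exact re-tiling scheme of the slides cannot be recovered from the text, so the sketch below swaps in a different, widely used blocked formulation of the scan to convey the idea of working on tiles: each tile is scanned locally, a small scan over the tile totals gives per-tile offsets, and the offsets are added back. The function name and the `tile` parameter are assumptions.

```python
from itertools import accumulate

def tiled_prefix_sum(x, tile):
    """Inclusive prefix sums computed tile by tile (blocked scan)."""
    tiles = [x[s:s + tile] for s in range(0, len(x), tile)]
    local = [list(accumulate(t)) for t in tiles]      # independent work per tile
    totals = [t[-1] for t in local]                   # sum of each tile
    offsets = [0] + list(accumulate(totals))[:-1]     # small scan over the tile totals
    return [v + off for t, off in zip(local, offsets) for v in t]
```

Each tile is touched in one contiguous sweep, so a tile that fits in cache is loaded once per phase rather than once per doubling step.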

Bitonic Sorting

Based on bitonic sequences:

A[1], A[2], …, A[n-1], A[n] is bitonic iff there are a j and a k such that
• A[1] … A[j] is monotonic increasing,
• A[j] … A[k] is monotonic decreasing,
• A[k] … A[n], A[1] (!!) is monotonic increasing

OR vice versa

A “better” definition of Bitonic Sequence

A bitonic sequence is a sequence with A[1] <= A[2] <= … <= A[k] >= … >= A[n-1] >= A[n]

for some k (1 <= k <= n), or a circular shift of such a sequence.
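One way to test this definition in code, offered as a sketch: walk around the sequence as a circle and count how often the direction (up or down) changes, ignoring equal neighbours. A circular shift of an up-then-down sequence changes direction at most twice. The function name is hypothetical.

```python
def is_bitonic(a):
    """True if a is a circular shift of a non-decreasing then non-increasing sequence."""
    n = len(a)
    signs = []                          # direction of each circular step, ties skipped
    for i in range(n):
        d = a[(i + 1) % n] - a[i]
        if d != 0:
            signs.append(1 if d > 0 else -1)
    # signs[i - 1] wraps to the last entry for i == 0, so the count is circular
    changes = sum(1 for i in range(len(signs)) if signs[i] != signs[i - 1])
    return changes <= 2
```

A sequence with two peaks changes direction four times around the circle and is therefore rejected.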

In a picture

[Figure: an example of a bitonic sequence, and an example of a sequence that is not bitonic (if rotated: two peaks)]

A[1] >= A[2] >= … >= A[k] <= … <= A[n-1] <= A[n] leads to the same definition

Bitonic “Merge”

Bitonic_Merge (A)   # A is a bitonic sequence
  n = |A|
  if n == 1 then return A
  half_n = floor(n/2)
  for i from 1 to half_n
    c[i] = min(A[i], A[i+half_n])
    d[i] = max(A[i], A[i+half_n])
  DOINPARALLEL
    Bitonic_Merge (c[1]…c[half_n])
    Bitonic_Merge (d[1]…d[half_n])
  Return ( c || d )
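A recursive Python sketch of Bitonic_Merge, assuming the input length is a power of two and the input is bitonic. The `ascending` flag is an addition for illustration: it swaps the roles of min and max so the same routine can also produce a descending result.

```python
def bitonic_merge(a, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    half = n // 2
    c = [min(a[i], a[i + half]) for i in range(half)]   # every c value <= every d value
    d = [max(a[i], a[i + half]) for i in range(half)]
    if not ascending:
        c, d = d, c                                     # interchange min and max
    # the two halves are independent and could be merged in parallel
    return bitonic_merge(c, ascending) + bitonic_merge(d, ascending)
```

For example, `bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2])` returns `[1, 2, 3, 4, 5, 6, 7, 8]`.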

Notes Bitonic Merge

• Each c and d sequence is a bitonic sequence again

• For all i: c[i] <= d[i] (in fact, every element of c is <= every element of d)
• At the end we have sorted bitonic sequences of length 1, hence a sorted sequence

Bitonic Merge always yields bitonic sequences

[Figures: Bitonic Merge Network, parts (1) to (3)]

Parallel Bitonic Sort

Bitonic_Sort (A)
  n = |A|
  if n == 1 then return A
  for i from 0 to log(n)
    DOINPARALLEL forall k = m·2^i, k < n
      Bitonic_Merge (A[k]…A[k+2^i-1]) *
  Return ( A )

* For odd values of m, interchange min and max
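The sketch below writes Bitonic_Sort as an iterative compare-exchange network rather than with recursive merges; this is an equivalent formulation chosen for the example. It assumes the length is a power of two, and it also counts the parallel compare-exchange steps (the `steps` counter is an addition for illustration).

```python
def bitonic_sort(a):
    """Bitonic sorting network; returns (sorted list, number of parallel steps)."""
    a = list(a)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    steps = 0
    size = 2
    while size <= n:                      # merge stage: sorted runs of length `size`
        stride = size // 2
        while stride >= 1:                # one compare-exchange step of the network
            for i in range(n):            # all pairs within a step are independent
                partner = i ^ stride
                if partner > i:
                    # block index m = i // size even: ascending, odd: descending
                    ascending = (i & size) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            stride //= 2
            steps += 1
        size *= 2
    return a, steps
```

For n = 8 the network uses 1 + 2 + 3 = 6 steps, each of which compares n/2 independent pairs.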

Notes Bitonic Sort

• Each iteration creates longer and longer bitonic sequences

• In the last iteration the whole sequence is bitonic and the final bitonic merge creates a sorted list

Bitonic Sort Network

[Figure: bitonic sort network; four bitonic lists of length 2 constituting 2 bitonic lists of length 4; stages labelled "4 Bitonic Merge Networks" and "2 Bitonic Merge Networks"]

Why alternating max/min?

Note that at the start of each Bitonic Merge Network we have two Bitonic Sequences which constitute one Bitonic Sequence!!!

If one of these sequences is (monotonic) increasing and the other is (monotonic) decreasing, then this is always the case. If both are increasing or both are decreasing, this is not necessarily the case, i.e.

[Figure: example sequence] is not bitonic

Notes Bitonic Sort Network

• Assume n = 2^k
• The bitonic merge stages have 1, 2, 3, …, k steps each, so the time to sort is

T(n) = 1 + 2 + … + k = k(k+1)/2 = O(k^2) = O(log^2 n)

• Each step requires n/2 processors, so the total number of processors is O((n/2) log^2 n)

• The network can handle multiple pipelined lists, producing a sorted list each time step
