Parallel Sorting
Source: liacs.leidenuniv.nl/~wijshoffhag/PPI2017_2018/Lecture_10.pdf
A jungle
Illustration
https://www.youtube.com/watch?v=kPRA0W1kECg
(Sequential) Sorting
• Bubble Sort, Insertion Sort – O(n^2)
• Merge Sort, Heap Sort, QuickSort – O(n log n) – QuickSort best on average
• Optimal parallel time complexity – O(n log n)/P – if P = n then O(log n)
Insertion Sort
Insertion_Sort (A)
    for i from 1 to |A| - 1
        j = i
        while j > 0 and A[j-1] > A[j]
            swap A[j] and A[j-1]
            j = j - 1
    Return ( A )
Inherently sequential, so hard to parallelize!
⇒ Only through pipelining can a speedup be realized
Pipelined Insertion Sort
T_pipelined = 2n with n processors, so the maximal speedup ≈ n/4 - 3/4
(worst-case sequential time = (n-1)(n-2)/2 = n^2/2 - 3n/2 + 1)
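The pipelining idea can be illustrated with a small simulation. This is only a sketch under assumptions that are not on the slide: a linear chain of n processors, each keeping the smallest value it has seen and passing the larger one to its right neighbour, one hop per time step. The name `pipelined_insertion_sort` is hypothetical.

```python
def pipelined_insertion_sort(values):
    """Simulate a linear pipeline of n processors, one value entering per step."""
    n = len(values)
    held = [None] * n       # value kept by each processor
    arriving = [None] * n   # value reaching each processor in the current step
    pos = 0                 # next input value to feed into processor 0
    steps = 0
    while pos < n or any(v is not None for v in arriving):
        if pos < n:
            arriving[0] = values[pos]
            pos += 1
        nxt = [None] * n
        for i in range(n):              # all processors act "simultaneously"
            v = arriving[i]
            if v is None:
                continue
            if held[i] is None:
                held[i] = v             # first value seen: keep it
            else:
                if v < held[i]:
                    held[i], v = v, held[i]   # keep the smaller one
                nxt[i + 1] = v          # pass the larger one to the right
        arriving = nxt
        steps += 1
    return held, steps


if __name__ == "__main__":
    result, steps = pipelined_insertion_sort([4, 3, 2, 1])
    print(result, steps)
```

In this model the last value enters at step n and needs at most n - 1 further hops, which is consistent with the 2n bound quoted above.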
Parallel Merge Sort
Merge_Sort (A)
    n = |A|
    if n == 1 then return A
    halfway = floor(n/2)
    DO IN PARALLEL
        Merge_Sort (A[1] … A[halfway])
        Merge_Sort (A[halfway+1] … A[n])
    j = 1; current = 1
    for i from 1 to halfway
        while j ≤ n-halfway and A[halfway + j] < A[i]
            X[current] = A[halfway + j]
            j = j + 1; current = current + 1
        X[current] = A[i]
        current = current + 1
    while j ≤ n-halfway    # copy what is left of the second half
        X[current] = A[halfway + j]
        j = j + 1; current = current + 1
    Return ( X )
[Figure: array A with index markers i, halfway, halfway + j and n]
In a picture
Notes Merge Sort
• Collects the sorted list onto one processor, merging as items come together
• Maps well to a tree structure: sorting locally on the leaves, then merging up the tree
• As items approach the root of the tree, processors are dropped, limiting parallelism
• O(n) if P = n: 1 + 2 + 4 + … + n/2 + n = n(1 + 1/2 + 1/4 + …) ≈ 2n
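The tree structure described in these notes can be sketched in Python. This is a hedged illustration, not the lecture's implementation: the `depth` parameter and the use of `ThreadPoolExecutor` are assumptions, and in CPython the threads mainly show the structure rather than deliver a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def merge(left, right):
    """Sequential merge of two sorted lists (the step done at each tree node)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])    # copy whatever remains of either half
    out.extend(right[j:])
    return out


def merge_sort(a, depth=2):
    """Sort a; the top `depth` levels of the recursion run their halves in parallel."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    if depth > 0:
        # Each call owns a private two-worker pool, so nested waits cannot deadlock.
        with ThreadPoolExecutor(max_workers=2) as pool:
            f_left = pool.submit(merge_sort, a[:half], depth - 1)
            f_right = pool.submit(merge_sort, a[half:], depth - 1)
            left, right = f_left.result(), f_right.result()
    else:
        left, right = merge_sort(a[:half], 0), merge_sort(a[half:], 0)
    return merge(left, right)


if __name__ == "__main__":
    print(merge_sort([5, 2, 9, 1, 5, 6, 0]))
```

Note how the merge at the root is executed by a single worker over all n items, which is the loss of parallelism the notes point out.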
Parallel QuickSort
QuickSort (A)
    if |A| ≤ 1 then return A
    i = rand_int (|A|)
    p = A[i]
    DO IN PARALLEL
        L = QuickSort({a ∈ A | a < p})
        E = {a ∈ A | a = p}
        G = QuickSort({a ∈ A | a > p})
    Return ( L || E || G )
If we assume that the pivots are chosen such that L and G are about equal in size, then
Sequential: T(n) = 2T(n/2) + O(n) = O(n log n)
In fact it can be proven that this bound holds in expectation when the pivots are chosen at random!
For parallel execution the choice of i is crucial for load balance. Even more importantly, we would like to choose multiple pivots (p-1) at the same time, so that each time we get p partitions which can be executed in parallel.
P partitions
• For a given p (number of partitions, so p-1 pivots) and s (oversampling rate), first select at random p*s candidate pivots
    for i from 1 to p*s
        Cand[i] = A[rand_int (|A|)]
• Sort the list of candidate pivots: Cand[i]
• Choose Cand[s], Cand[2*s], …, Cand[(p-1)*s]
Find a good value for the oversampling rate: s > 1,
⇒ s should not lead to very long sorting times
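The pivot selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `choose_pivots`, `partition` and `sample_sort` are hypothetical, candidates are drawn with replacement, and the p partitions are sorted one after another here, whereas in a parallel setting each would go to its own processor.

```python
import random
from bisect import bisect_right


def choose_pivots(a, p, s, rng):
    """Draw p*s random candidates, sort them, and keep every s-th one (p-1 pivots)."""
    candidates = sorted(rng.choice(a) for _ in range(p * s))
    # Cand[s], Cand[2s], ..., Cand[(p-1)s] in 1-based indexing
    return [candidates[k * s - 1] for k in range(1, p)]


def partition(a, pivots):
    """Split a into len(pivots)+1 buckets; bucket j holds pivots[j-1] <= x < pivots[j]."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect_right(pivots, x)].append(x)
    return buckets


def sample_sort(a, p=4, s=8, seed=0):
    """Partition with oversampled pivots, sort each bucket, and concatenate."""
    if len(a) <= 1:
        return list(a)
    rng = random.Random(seed)
    buckets = partition(a, choose_pivots(a, p, s, rng))
    # Each bucket is independent, so these sorts could run in parallel.
    return [x for bucket in buckets for x in sorted(bucket)]


if __name__ == "__main__":
    data = [random.randrange(1000) for _ in range(40)]
    print(sample_sort(data) == sorted(data))
```

A larger s tends to give more evenly sized buckets, at the price of sorting a longer candidate list first.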
Parallel Radix Sort
Instead of comparing values: COMPARE DIGITS
Radix_Sort (A, b)    # Assume binary representations of keys
    for i from 0 to b-1
        FLAGS = { (a>>i) mod 2 | a ∈ A }
        NOTFLAGS = { 1-FLAGS[a] | a ∈ A }
        R_0 = SCAN (NOTFLAGS)
        s_0 = SUM (NOTFLAGS)
        R_1 = SCAN (FLAGS)
        R = { if FLAGS[j] == 0 then R_0[j] else R_1[j] + s_0 | j ∈ [0 … |A|-1] }
        A = A sorted by R
    Return ( A )
(a>>i) mod 2: right shift i times, so e.g. (01101>>2) mod 2 = 00011 mod 2 = 1
So (a>>i) mod 2 equals the (i+1)th rightmost bit of a
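A sequential Python rendering of Radix_Sort may help to see how the scans turn bit flags into destination indices. It is a sketch under two assumptions: SCAN is read as an exclusive prefix sum, and "A sorted by R" is read as moving A[j] to position R[j]. The helper name `exclusive_scan` is hypothetical.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[j] is the sum of xs[0..j-1]."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out


def radix_sort(a, b):
    """LSD radix sort of non-negative integers that fit in b bits."""
    a = list(a)
    for i in range(b):
        flags = [(x >> i) & 1 for x in a]       # (i+1)th rightmost bit
        notflags = [1 - f for f in flags]
        r0 = exclusive_scan(notflags)           # slots for keys whose bit is 0
        s0 = sum(notflags)                      # number of keys whose bit is 0
        r1 = exclusive_scan(flags)              # slots for keys whose bit is 1
        dest = [r0[j] if flags[j] == 0 else r1[j] + s0 for j in range(len(a))]
        out = [None] * len(a)
        for j, d in enumerate(dest):            # stable scatter: "A sorted by R"
            out[d] = a[j]
        a = out
    return a


if __name__ == "__main__":
    print(radix_sort([13, 2, 7, 0, 5, 13, 8], b=4))
```

Every step inside the loop except the scans is an independent per-element operation, so the scans are the only part that needs a dedicated parallel algorithm.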
LSD/MSD Radix Sort
Instead of (a>>i) mod 2
one can also implement Radix Sort with: (a<<i) div 2^(b-1)
(taking a<<i in b-bit arithmetic, i.e. modulo 2^b)
The first implementation is called least significant digit Radix Sort or LSD Radix Sort
The latter one is MSD Radix Sort
Notes Radix Sort
• Sequential time complexity: T(n) = O(b·n),
  b iterations, each iteration O(n)
• Note that b ≈ log n, so a total of O(n log n)
• Instead of single digits, a block of r digits can be taken each time, resulting in b/r iterations
Illustration (LSD Radix Sort)
Sorting of each selected digit in Radix Sort, with Prefix Sum Based Sorting
Each element i of the prefix sum array holds the SUM of all elements whose index is smaller than i
What is the relationship with sorting?
• All bits which are equal to 0 are flagged with a 1
• Compute the Prefix Sum of this flag array
• Store all flagged (1) entries of x[k] in the location indicated by the prefix sum
Second stage
• All bits which are equal to 1 are flagged with a 1
• Compute the Prefix Sum of this flag array
• Store all flagged (1) entries of x[k] in the next locations indicated by the prefix sum
What about parallel execution?
• Computationally the sorting algorithm is reduced to computing the prefix sum arrays for each bit ranking.
• However, computing these prefix sum arrays seems to be inherently sequential. Or not?
Parallel Execution of Prefix Sums
Prefix_Sum (X)    # X an array of n bits
    for index from 0 to log n
        DO IN PARALLEL forall k
            if k >= 2^index then
                X[k] = X[k] + X[k-2^index]
    X >> 1    # Shift all entries to the right
    Return ( X )
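The doubling scheme of Prefix_Sum can be simulated in Python. This sketch makes two assumptions: each "DO IN PARALLEL" step reads the old array and writes a fresh one (so all k appear to update simultaneously), and the final right shift inserts a 0 at the front, giving the exclusive prefix sum.

```python
def parallel_prefix_sum(x):
    """Simulate the log-step parallel prefix sum; returns the exclusive scan of x."""
    n = len(x)
    if n == 0:
        return []
    x = list(x)
    offset = 1                      # offset = 2^index
    while offset < n:
        # One parallel step: every k reads the OLD array, hence the new list.
        x = [x[k] + x[k - offset] if k >= offset else x[k] for k in range(n)]
        offset *= 2
    return [0] + x[:-1]             # shift all entries one place to the right


if __name__ == "__main__":
    flags = [1, 0, 1, 1, 0, 1, 0, 0]
    print(parallel_prefix_sum(flags))
```

With n processors each of the roughly log n steps takes constant time, at the cost of O(n log n) additions in total instead of the n - 1 of the sequential loop.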
Illustration of parallel Prefix Sums
Improving Cache Performance
• The parallel prefix sum algorithm requires the whole array to be fetched at each iteration
• Bad cache performance
• Through Tiling Techniques the X array can be cut into slices (tiles)
• Once every number of iterations: re-tile!
• A CUDA implementation of the overall algorithm can be found on
https://github.com/debdattabasu/amp-radix-sort
[Figure: array X cut into tiles for processors P1, P2 and P3, with offset 2^index]
Bitonic Sorting
Based on bitonic sequences:
A[1], A[2], …, A[n-1], A[n] is bitonic iff there is a j and k such that
• A[1] … A[j] is monotonic increasing,
• A[j] … A[k] is monotonic decreasing,
• A[k] … A[n], A[1] (!!) is monotonic increasing
OR vice versa
A “better” definition of Bitonic Sequence
A bitonic sequence is a sequence with A[1] <= A[2] <= … <= A[k] >= … >= A[n-1] >= A[n]
for some k (1 <= k <= n), or a circular shift of such a sequence.
In a picture
Bitonic:
Not Bitonic
If rotated: Two Peaks
A[1] >= A[2] >= … >= A[k] <= … <= A[n-1] <= A[n] leads to the same definition
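The circular-shift definition can be checked mechanically. The sketch below rests on one observation that is not spelled out on the slides: walking around the sequence as a cycle, a bitonic sequence changes direction at most twice (one rising run and one falling run). The function name `is_bitonic` is hypothetical.

```python
def is_bitonic(a):
    """True if a is a circular shift of an increasing-then-decreasing sequence."""
    n = len(a)
    # Direction of every cyclic step, skipping steps between equal neighbours.
    signs = []
    for i in range(n):
        d = a[(i + 1) % n] - a[i]
        if d != 0:
            signs.append(d > 0)
    # Count direction changes around the cycle.
    changes = sum(signs[i] != signs[(i + 1) % len(signs)] for i in range(len(signs)))
    return changes <= 2


if __name__ == "__main__":
    print(is_bitonic([1, 3, 5, 4, 2]))   # rises, then falls
    print(is_bitonic([4, 2, 1, 3, 5]))   # a circular shift of the above
    print(is_bitonic([1, 3, 2, 4]))      # two peaks when rotated
```

Because the check is cyclic, the "decreasing first, then increasing" variant is accepted as well, matching the remark that it leads to the same definition.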
Bitonic “Merge”
Bitonic_Merge (A)    # A is a bitonic sequence
    n = |A|
    if n == 1 then return A
    half_n = floor(n/2)
    for i from 1 to half_n
        c[i] = min(A[i], A[i+half_n])
        d[i] = max(A[i], A[i+half_n])
    DO IN PARALLEL
        Bitonic_Merge (c[1] … c[half_n])
        Bitonic_Merge (d[1] … d[half_n])
    Return ( c || d )
Notes Bitonic Merge
• Each c and d sequence is a bitonic sequence again
• For all i and j: c[i] <= d[j] (in particular c[i] <= d[i])
• At the end we have sorted bitonic sequences of length 1, hence a sorted sequence
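A direct Python transcription of Bitonic_Merge shows these properties at work. It is a sketch that assumes the input is bitonic and that its length is a power of two; the two recursive calls are independent and are simply run one after the other here.

```python
def bitonic_merge(a):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    # Compare-exchange element i with element i + half.
    c = [min(a[i], a[i + half]) for i in range(half)]
    d = [max(a[i], a[i + half]) for i in range(half)]
    # c and d are bitonic again and nothing in c exceeds anything in d,
    # so the two halves can be merged independently (in parallel).
    return bitonic_merge(c) + bitonic_merge(d)


if __name__ == "__main__":
    print(bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2]))
```

For example, the first compare-exchange step on 1 4 6 8 7 5 3 2 yields c = 1 4 3 2 and d = 7 5 6 8: both are bitonic, and every element of c is below every element of d.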
Bitonic Merge always yields bitonic sequences
Bitonic Merge Network
Bitonic Merge Network (2)
Bitonic Merge Network (3)
Parallel Bitonic Sort
Bitonic_Sort (A)
    n = |A|
    if n == 1 then return A
    for i from 0 to log(n)
        DO IN PARALLEL forall k = m·2^i, k < n
            Bitonic_Merge (A[k] … A[k+2^i-1]) *
    Return ( A )
* For odd values of m, interchange min and max
Notes Bitonic Sort
• Each iteration creates longer and longer bitonic sequences
• In the last iteration the whole sequence is bitonic and the final bitonic merge creates a sorted list
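The growth of the bitonic sequences can be followed in a small Python sketch. It assumes that the length is a power of two and models "interchange min and max" with a hypothetical `ascending` flag, so that blocks with odd m come out in descending order; the merges within one iteration are independent but are run sequentially here.

```python
def bitonic_merge(a, ascending=True):
    """Sort a bitonic sequence (power-of-two length) in the requested direction."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    lo = [min(a[i], a[i + half]) for i in range(half)]
    hi = [max(a[i], a[i + half]) for i in range(half)]
    if not ascending:
        lo, hi = hi, lo             # interchange min and max
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(a):
    """Bitonic sort for lists whose length is a power of two."""
    a = list(a)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2                        # block length 2^i
    while size <= n:
        # All blocks of one iteration are independent (DO IN PARALLEL).
        for m, k in enumerate(range(0, n, size)):
            a[k:k + size] = bitonic_merge(a[k:k + size], ascending=(m % 2 == 0))
        size *= 2
    return a


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

After the iteration with block length 2^i, neighbouring blocks are sorted in opposite directions, so each pair forms one bitonic sequence of length 2^(i+1) for the next iteration.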
Bitonic Sort Network
four bitonic lists of length 2 constituting 2 bitonic lists of length 4
2 Bitonic Merge Networks
4 Bitonic Merge Networks
Why alternating max/min?
Note that at the start of each Bitonic Merge Network we have two Bitonic Sequences which constitute One Bitonic Sequence!!!
If one of these sequences is (monotonic) increasing and the other is (monotonic) decreasing, then this is always the case. If both are increasing or both are decreasing, this is not necessarily the case, i.e.
[Figure: two sequences sorted in the same direction, placed side by side] is not bitonic
Notes Bitonic Sort Network
• Assume n = 2^k
• The bitonic merge stages have 1, 2, 3, …, k steps each, so the time to sort is
T(n) = 1 + 2 + … + k = k(k+1)/2 = O(k^2) = O(log^2 n)
• Each step requires n/2 processors, so the total number of processors is O((n/2) log^2 n)
• The network can handle multiple pipelined lists, producing a sorted list each time step
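The counts in these notes are easy to tabulate. The helper below is a hypothetical illustration that assumes n = 2^k and one compare-exchange unit per pair of wires in every step.

```python
def bitonic_network_size(n):
    """Return (depth, comparators) of a bitonic sort network on n = 2^k inputs."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    k = n.bit_length() - 1          # n = 2^k
    depth = k * (k + 1) // 2        # 1 + 2 + ... + k steps
    comparators = depth * (n // 2)  # n/2 compare-exchange units per step
    return depth, comparators


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 1024):
        print(n, bitonic_network_size(n))
```

For n = 8 this gives 6 steps and 24 compare-exchange units, that is (n/2)·k(k+1)/2 with k = 3.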