Radix and Suffix Sorting
Philip Bille
courses.compute.dtu.dk/02282/2021/suffixsorting/suffix...

• Radix Sort
• Suffix Sort


Radix Sort

• Sorting small universes. Given a sequence of n integers from a universe U = {0, 1, ..., u-1}.
• How fast can we sort the sequence if the universe is not too big?

Example: n = 10, U = {0, 1}.

Input:  0 0 1 0 0 1 0 1 0 1
Sorted: 0 0 0 0 0 0 1 1 1 1

• Algorithm. Count 0s and 1s.
• Time. O(n).
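The counting idea can be sketched in a few lines of Python. The function name sort_bits is a hypothetical choice for illustration, not something from the slides.

```python
def sort_bits(bits):
    """Sort a sequence of 0s and 1s by counting, in O(n) time."""
    zeros = sum(1 for b in bits if b == 0)        # one pass to count the 0s
    return [0] * zeros + [1] * (len(bits) - zeros)

print(sort_bits([0, 0, 1, 0, 0, 1, 0, 1, 0, 1]))
# prints [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```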

Example: n = 10, U = {0, 1, ..., n-1 = 9}.

Input:  7 5 1 2 2 4 0 4 9 7
Sorted: 0 1 2 2 4 4 5 7 7 9

[Figure: array of linked lists indexed by key; list 2 holds 2 2, list 4 holds 4 4, list 7 holds 7 7, and lists 0, 1, 5 and 9 hold one element each.]

• Algorithm. Insert each element into an array of linked lists, then traverse the array of linked lists.
• Time. O(n + u) = O(n), since u = n here.
• Sorting can be stable.
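A minimal sketch of this bucket approach, assuming Python lists as a stand-in for the linked lists. The name bucket_sort and the key parameter are illustrative assumptions.

```python
def bucket_sort(items, u, key=lambda x: x):
    """Stable sort of items whose keys lie in {0, ..., u-1}, in O(n + u) time."""
    buckets = [[] for _ in range(u)]     # Python lists stand in for linked lists
    for x in items:
        buckets[key(x)].append(x)        # appending keeps the input order within a bucket
    return [x for bucket in buckets for x in bucket]   # traverse buckets 0, 1, ..., u-1

print(bucket_sort([7, 5, 1, 2, 2, 4, 0, 4, 9, 7], u=10))
# prints [0, 1, 2, 2, 4, 4, 5, 7, 7, 9]
```

Because elements with equal keys leave a bucket in the order they entered it, the sort is stable, which is what radix sort relies on.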

Example: n = 10, U = {0, ..., n^3 - 1 = 999}.

Input:                        170 045 075 090 802 024 002 066 200 905
After sorting on last digit:  170 090 200 802 002 024 045 075 905 066
After sorting on middle digit: 200 802 002 905 024 045 066 170 075 090
After sorting on first digit: 002 024 045 066 075 090 170 200 802 905

• Radix Sort. Sort on each digit from right to left using a stable sort.
• Time. O(n + n + n) = O(n).

• Radix Sort [Hollerith 1887]. Sort a sequence of n integers from U = {0, ..., n^k - 1}.
  • Write each element in the sequence as a base-n integer x = (x_1, x_2, ..., x_k).
  • Sort the sequence according to each digit from right to left. Sorting should be stable.
• Time. O(nk).
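The general scheme can be sketched as follows. The function name radix_sort is an assumption for illustration, and each pass is a stable bucket sort on one base-n digit.

```python
def radix_sort(values, k):
    """Sort n integers from {0, ..., n**k - 1} with k stable passes, O(nk) time."""
    n = max(len(values), 2)              # base n; at least 2 so digits are well defined
    for d in range(k):                   # d = 0 is the rightmost (least significant) digit
        buckets = [[] for _ in range(n)]
        for x in values:
            buckets[(x // n**d) % n].append(x)   # stable within each bucket
        values = [x for bucket in buckets for x in bucket]
    return values

data = [170, 45, 75, 90, 802, 24, 2, 66, 200, 905]
print(radix_sort(data, k=3))
# prints [2, 24, 45, 66, 75, 90, 170, 200, 802, 905]
```

With n = 10 and k = 3 the passes operate on decimal digits, matching the three-digit example.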


Suffix Sort

• Suffix sorting. Given a string S of length n over alphabet Σ, compute the sorted lexicographic order of all suffixes of S.
• Theorem [Kasai et al. 2001]. Given the sorted lexicographic order of the suffixes of S, we can construct the suffix tree for S in linear time.

Example: S = yabbadabbado$ (positions 0 to 12).

Rank of each suffix   Suffixes in sorted order   Start position
0                     $                          12
1                     abbadabbado$               1
2                     abbado$                    6
3                     adabbado$                  4
4                     ado$                       9
5                     badabbado$                 3
6                     bado$                      8
7                     bbadabbado$                2
8                     bbado$                     7
9                     dabbado$                   5
10                    do$                        10
11                    o$                         11
12                    yabbadabbado$              0
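As a baseline, the sorted order and the ranks for this example can be computed by sorting the suffixes directly. This is only a sketch: suffix_array_naive is a hypothetical name, and the comparison-based sort makes it much slower than the solutions discussed in these slides (up to O(n^2 log n) character comparisons).

```python
def suffix_array_naive(s):
    """Return (sa, rank): sa[r] is the start of the rank-r suffix, rank[i] the rank of suffix i."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # compare suffixes as plain strings
    rank = [0] * len(s)
    for r, i in enumerate(sa):
        rank[i] = r
    return sa, rank

sa, rank = suffix_array_naive("yabbadabbado$")
print(sa)    # prints [12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
print(rank)  # prints [12, 1, 7, 5, 3, 9, 2, 8, 6, 4, 10, 11, 0]
```

The example relies on '$' comparing smaller than the letters, which holds for Python's code-point ordering.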

• Suffix trees and sorting. The lexicographic order of the suffixes is the same as the order of the suffixes in the leaves of the suffix tree.
• Suffix array. The array of the sorted order of the suffixes.

[Figure: suffix tree for yabbadabbado$, with each leaf labeled by the start position of its suffix.]

Suffix array: 12 1 6 4 9 3 8 2 7 5 10 11 0

• Goal. Compute the lexicographic order of all suffixes of S fast.
• For simplicity assume |Σ| = O(n).
• Solution in 3 steps.
  • Solution 1: Radix sorting
  • Solution 2: Prefix doubling
  • Solution 3: Difference cover sampling

Solution 1: Radix Sort

• Radix Sort.
  • Generate all suffixes (pad with $).
  • Radix sort.

Padded suffixes of yabbadabbado$:

yabbadabbado$
abbadabbado$$
bbadabbado$$$
badabbado$$$$
adabbado$$$$$
dabbado$$$$$$
abbado$$$$$$$
bbado$$$$$$$$
bado$$$$$$$$$
ado$$$$$$$$$$
do$$$$$$$$$$$
o$$$$$$$$$$$$
$$$$$$$$$$$$$

• Time. O(n^2), since there are n padded suffixes of length n each.
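A sketch of Solution 1, assuming the string ends with a unique '$' that is smaller than every other character. The name suffix_sort_radix and the explicit character-to-code mapping are illustrative choices.

```python
def suffix_sort_radix(s, pad="$"):
    """Suffix sort by radix sorting all suffixes padded to length n: O(n^2) time and space."""
    n = len(s)
    padded = [s[i:] + pad * i for i in range(n)]       # every padded suffix has length n
    alphabet = sorted(set(s) | {pad})
    code = {c: r for r, c in enumerate(alphabet)}      # character -> bucket index
    order = list(range(n))
    for col in range(n - 1, -1, -1):                   # rightmost column first
        buckets = [[] for _ in alphabet]
        for i in order:
            buckets[code[padded[i][col]]].append(i)    # stable pass on one column
        order = [i for bucket in buckets for i in bucket]
    return order

print(suffix_sort_radix("yabbadabbado$"))
# prints [12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

Each of the n passes touches all n padded suffixes, which is where the quadratic bound comes from.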

Solution 2: Prefix Doubling

• Prefix doubling [Manber and Myers 1990]. Sort substrings (padded with $) of lengths 1, 2, 4, 8, ..., n. Each step uses radix sort on pairs of ranks from the previous step.

Example for S = yabbadabbado$. For each position i, the table lists the substring of length 1, 2 and 4 starting at i, the pair of ranks of its two halves, and its resulting rank.

i    len 1  rank   len 2  pair    rank   len 4  pair    rank
0    y      5      ya     (5,1)   8      yabb   (8,4)   10
1    a      1      ab     (1,2)   1      abba   (1,3)   1
2    b      2      bb     (2,2)   4      bbad   (4,2)   6
3    b      2      ba     (2,1)   3      bada   (3,5)   4
4    a      1      ad     (1,3)   2      adab   (2,1)   2
5    d      3      da     (3,1)   5      dabb   (5,4)   7
6    a      1      ab     (1,2)   1      abba   (1,3)   1
7    b      2      bb     (2,2)   4      bbad   (4,2)   6
8    b      2      ba     (2,1)   3      bado   (3,6)   5
9    a      1      ad     (1,3)   2      ado$   (2,7)   3
10   d      3      do     (3,4)   6      do$$   (6,0)   8
11   o      4      o$     (4,0)   7      o$$$   (7,0)   9
12   $      0      $$     (0,0)   0      $$$$   (0,0)   0

• Time. O(n log n): O(log n) doubling steps, each an O(n) radix sort.
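The doubling step can be sketched as below. This is an illustration under stated assumptions, not Manber and Myers' original formulation: the string must end with a unique '$' that is smaller than every other character (so padding has rank 0), and the names sort_by and suffix_sort_doubling are hypothetical.

```python
def sort_by(items, key, universe):
    """Stable bucket sort of items by an integer key in {0, ..., universe-1}."""
    buckets = [[] for _ in range(universe)]
    for x in items:
        buckets[key(x)].append(x)
    return [x for bucket in buckets for x in bucket]

def suffix_sort_doubling(s):
    """Suffix sort by prefix doubling; assumes s ends with a unique smallest '$'."""
    n = len(s)
    code = {c: r for r, c in enumerate(sorted(set(s)))}
    rank = [code[c] for c in s]                      # ranks of substrings of length 1
    k = 1
    while True:
        second = lambda i: rank[i + k] if i + k < n else 0   # padding has rank 0
        # Radix sort on pairs (rank[i], rank[i+k]): second component first, then first.
        order = sort_by(range(n), second, n)
        order = sort_by(order, lambda i: rank[i], n)
        new_rank, r = [0] * n, 0
        for prev, cur in zip(order, order[1:]):
            if (rank[prev], second(prev)) != (rank[cur], second(cur)):
                r += 1                               # a new, larger pair starts here
            new_rank[cur] = r
        rank = new_rank                              # ranks of substrings of length 2k
        if r == n - 1 or k >= n:                     # all ranks distinct: fully sorted
            return order
        k *= 2

print(suffix_sort_doubling("yabbadabbado$"))
# prints [12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

On the example the loop stops after the length-8 round, once all 13 ranks are distinct.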

Solution 3: Difference Cover Sampling

• DC3 Algorithm [Kärkkäinen et al. 2003]. Sort suffixes in three steps:
  • Step 1. Sort sample suffixes.
    • Sample all suffixes starting at positions i = 1 mod 3 and i = 2 mod 3.
    • Recursively sort sample suffixes.
  • Step 2. Sort non-sample suffixes.
    • Sort the remaining suffixes (starting at positions i = 0 mod 3).
  • Step 3. Merge.
    • Merge sample and non-sample suffixes.

Step 1: Sort Sample Suffixes

Example: S = yabbadabbado$, padded to yabbadabbado$$.

• Triples starting at positions i = 1 mod 3 (1, 4, 7, 10): abb ada bba do$
• Triples starting at positions i = 2 mod 3 (2, 5, 8, 11): bba dab bad o$$
• Concatenate the two sequences to abbadabbado$bbadabbado$$ and sort its suffixes recursively, treating each triple as a single character.
• Resulting ranks of the sample suffixes:

Position   1   2   4   5   7   8   10   11
Rank       1   4   2   6   5   3   7    8

Step 2: Sort Non-Sample Suffixes

Represent each non-sample suffix i (i = 0 mod 3) by the pair (S[i], rank of sample suffix i+1) and sort the pairs. In the example, with sample ranks 1 4, 2 6, 5 3, 7 8 at positions 1 2, 4 5, 7 8, 10 11:

Position   Pair   Rank among non-sample suffixes
12         $0     0
6          a5     1
9          a7     2
3          b2     3
0          y1     4

Step 3: Merge

Merge the sorted sample suffixes (positions 1, 4, 8, 2, 7, 5, 10, 11) with the sorted non-sample suffixes (positions 12, 6, 9, 3, 0). Each comparison uses a short key: one or two characters followed by the rank of a sample suffix, so it takes constant time. The suffix $ at position 12 gets rank 0; the remaining comparisons shown on the slides are:

Rank   Non-sample key (position)   Sample key (position)   Output position
1      a5  (6)                     a4  (1)                 1
2      a5  (6)                     a6  (4)                 6
3      a7  (9)                     a6  (4)                 4
4      ad8 (9)                     ba7 (8)                 9
5      ba6 (3)                     ba7 (8)                 3
6      ya4 (0)                     ba7 (8)                 8
7      ya4 (0)                     bb2 (2)                 2
8      y1  (0)                     b3  (7)                 7
9      ya4 (0)                     da5 (5)                 5
10     y1  (0)                     d8  (10)                10
11     y1  (0)                     o0  (11)                11
12     y1  (0)                     (none left)             0

Result: 12 1 6 4 9 3 8 2 7 5 10 11 0
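The three steps can be put together in a compact sketch. Several things here are assumptions made for illustration rather than the slides' or the authors' implementation: the name suffix_array_dc3, the integer-list input with 0 as padding, the dummy sample position used when n = 1 mod 3, and Python's built-in sorted in place of radix sort (which costs an extra log factor but keeps the code short). In the merge, a sample suffix i = 1 mod 3 is compared using pairs (S[i], rank of i+1), and a sample suffix i = 2 mod 3 using triples (S[i], S[i+1], rank of i+2).

```python
def suffix_array_dc3(text):
    """Suffix array of a list of positive integers via difference cover sampling."""
    n = len(text)
    if n == 0:
        return []
    s = list(text) + [0, 0, 0]                     # 0 plays the role of the padding $
    # Step 1: sample positions i = 1 mod 3, then i = 2 mod 3.
    # If n = 1 mod 3, a dummy sample at position n keeps the two groups separated.
    extra = 1 if n % 3 == 1 else 0
    sample = list(range(1, n + extra, 3)) + list(range(2, n, 3))
    triples = [tuple(s[i:i + 3]) for i in sample]
    name = {t: r + 1 for r, t in enumerate(sorted(set(triples)))}
    reduced = [name[t] for t in triples]           # one character per triple
    if len(name) == len(sample):                   # all triples distinct: no recursion
        order = sorted(range(len(sample)), key=lambda j: reduced[j])
    else:
        order = suffix_array_dc3(reduced)          # recurse on about 2n/3 characters
    rank = [0] * (n + 3)                           # rank 0 for positions past the end
    for r, j in enumerate(order):
        rank[sample[j]] = r + 1
    sorted_sample = [sample[j] for j in order if sample[j] < n]
    # Step 2: sort non-sample suffixes by (first character, rank of next suffix).
    sorted_rest = sorted(range(0, n, 3), key=lambda i: (s[i], rank[i + 1]))
    # Step 3: merge, comparing a sample suffix i with a non-sample suffix j.
    def sample_first(i, j):
        if i % 3 == 1:
            return (s[i], rank[i + 1]) < (s[j], rank[j + 1])
        return (s[i], s[i + 1], rank[i + 2]) < (s[j], s[j + 1], rank[j + 2])
    result, a, b = [], 0, 0
    while a < len(sorted_sample) and b < len(sorted_rest):
        if sample_first(sorted_sample[a], sorted_rest[b]):
            result.append(sorted_sample[a]); a += 1
        else:
            result.append(sorted_rest[b]); b += 1
    return result + sorted_sample[a:] + sorted_rest[b:]

print(suffix_array_dc3([ord(c) for c in "yabbadabbado$"]))
# prints [12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

Because of the dummy sample and the 1-based triple names, the intermediate ranks in this sketch are shifted by one relative to the slide example; the relative order, and hence the final suffix array, is the same.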

Solution 3: Difference Cover Sampling

• DC3 Algorithm. Sort suffixes in three steps:
  • Step 1. Sort sample suffixes.
    • Sample all suffixes starting at positions i = 1 mod 3 and i = 2 mod 3. Time: O(n).
    • Recursively sort sample suffixes. Time: T(2n/3).
  • Step 2. Sort non-sample suffixes. Time: O(n).
    • Sort the remaining suffixes (starting at positions i = 0 mod 3).
  • Step 3. Merge. Time: O(n).
    • Merge sample and non-sample suffixes.
• T(n) = time to suffix sort a string of length n over an alphabet of size n.
• Time. T(n) = T(2n/3) + O(n) = O(n).
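The recurrence solves to O(n) because the work per level shrinks geometrically. A short derivation, assuming the O(n) term is at most cn for some constant c (a symbol introduced here for illustration):

```latex
\begin{aligned}
T(n) &\le T\!\left(\tfrac{2n}{3}\right) + cn \\
     &\le cn + c\left(\tfrac{2}{3}\right) n + c\left(\tfrac{2}{3}\right)^{2} n + \cdots \\
     &\le cn \sum_{i=0}^{\infty} \left(\tfrac{2}{3}\right)^{i} = 3cn = O(n).
\end{aligned}
```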

• Theorem. We can suffix sort a string of length n over an alphabet Σ of size n in O(n) time.
• Theorem. We can suffix sort a string of length n over an alphabet Σ in O(sort(n, |Σ|)) time, where sort(n, |Σ|) is the time to sort n characters from Σ.
