GPU Accelerated Adaptive Compression Framework for Genomics Data
GuiXin Guo, Shuang Qiu, ZhiQiang Ye, BingQiang Wang (BGI Research, Shenzhen, China)
Mian Lu (Institute of HPC, A*STAR, Singapore)
Simon See (BGI-NVIDIA Joint Innovation Lab, Shenzhen, China)
GTC2014, March 24-27, 2014, San Jose, CA
Contact: [email protected] [email protected]
Outline
Ø Introduction
Ø Adaptive Compression Framework
Ø Implementation of Compression Algorithms
Ø Results
Ø Conclusion
Genomics Data: Exponential Growth
(a) The cost of sequencing per megabase of DNA has dropped from nearly $6000 in 2001 to slightly more than $0.10 in 2011.
(b) The total number of completed sequenced genomes has grown exponentially with decreasing sequencing costs.
Boyle, Nanette R., and Gill, Ryan T. "Tools for genome-wide strain design and construction." Current Opinion in Biotechnology 23.5 (2012): 666-671.
Moore's Law for Chips: 2x performance per 18 months
Moore's Law for Genomics: 10x data output per 18 months
• Challenges in storing and processing huge volumes of genomics data
• BGI as an example
– Tens of TBs of data generated per day
– Tens of PBs of storage (several sites)
– Tenfold growth in the not-too-far future
• Observation
– Computation in genomics features a much lower computation/IO ratio than classical HPC workloads
– IO (or data movement) becomes more expensive than computation
Can Compression Help?
Compression
• Benefits of compression
– Reduce storage capacity (especially for archiving)
– Reduce IO bandwidth (more balanced computing system architecture)
– And, of course, save $$$
• Compression is NOT for free – Squeeze more, compute more
– Squeeze less, compute less
• Can GPU help?
Take a Look at Genomics Data Files
• Two common characteristics of genomics data files
1. The file is a table with multiple rows and columns
2. Data in the same column share similar characteristics
Example of a FASTQ file containing two sequences:
@SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51          ← sequence identifier
GAATAAAGAAAAAATGGAAAACGAAGATGTTGAAATTTTTAATGATTATA       ← sequence bases
+SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51          ← sequence identifier
I>I:1III9?9&I+II.6*,:'*1.?I%-&&67I0(1.",&$%2,+I4)+       ← quality scores
@SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51
GTATACGTATTATGAATATACTGATTATATAAGCATAAATAAATAAAATA
+SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51
IIIIIIIIIIIIIIIDIAI8%I-7II9I3I8@(%/EIA/>;G=DI9=8#6
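The column-major transform behind this view can be sketched in a few lines. The following Python sketch (function name and dictionary layout are ours, not the framework's) splits FASTQ records into four per-column streams so that each stream can be compressed with its own scheme:

```python
def fastq_to_columns(text):
    """Transform FASTQ text from row-major (record after record) to
    column-major: one stream per column."""
    lines = text.strip().split("\n")
    cols = {"id": [], "bases": [], "plus": [], "qual": []}
    for i in range(0, len(lines), 4):  # one FASTQ record = 4 lines
        cols["id"].append(lines[i])
        cols["bases"].append(lines[i + 1])
        cols["plus"].append(lines[i + 2])
        cols["qual"].append(lines[i + 3])
    return cols

record = "@r1\nGATTACA\n+r1\nIIIIIII\n@r2\nACGTACG\n+r2\nIIHHIII\n"
cols = fastq_to_columns(record)
```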
Workflow of Adaptive Compression Framework (figure): each input block #i is transformed to a column-major table view (sequence identifiers, sequence bases, quality scores as separate columns). The column-major compression engine then tests compression schemes (combinations of GPU-optimized compression algorithms, applied in multiple rounds of processing) against each column, applies the best scheme to each column, and emits the compressed output.
Algorithms Optimized (So Far)
All GPU accelerated: BWT, MTF, Markov Transform, Huffman, LZ77
• Commonly used compression schemes: transformational (BWT, MTF), substitutional (LZ77), and statistical model-based (Huffman)
• Novel compression algorithm for quality scores (FASTQ): the Markov Transform, a first-order Markov model that sorts frequencies of character pairs
Four Schemes for Different Data (figure): raw genomics data is split into columns, and each data type is compressed with a scheme built from typical basic algorithms:
• Data with many similar strings (e.g. sequence ID): LZ77 + Huffman
• Data with limited alphabet (e.g. DNA sequence): BWT + MTF + Huffman (transformational scheme)
• Locally similar data (e.g. quality scores): Markov transform + Huffman (statistical scheme)
• Randomly distributed or text-like data: Huffman

Generic compression methods? Not efficient.
Domain-specific methods? Work only on limited data formats.
The problem still remains: serial algorithms are too slow.
Column-major compression:
• Flexible for new file formats
• Extensible for new algorithms
• Tested for best performance
Optimization Techniques
Ø Data parallel (figure): the input data is split into data blocks 1..n; each block independently passes through the pipeline of algorithms 1..k, producing transformed and then compressed data per block, and the compressed blocks are merged. Decompression splits the stream again and applies the reverse algorithms per block. This is a simple but efficient scheme to parallelize MTF and its reverse.
Ø Increase the parallelism of selected algorithms
• (Slightly) alternate implementations of the algorithms to reduce data dependency
Ø Optimize the implementation on GPU
• Embrace state-of-the-art, high-performance libraries (e.g. b40c)
• Better utilization of constant memory and shared memory
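The block-splitting scheme for MTF can be sketched as follows: a minimal serial Python reference (on the GPU each block would go to its own thread block; function names are ours):

```python
def mtf_encode(block, alphabet):
    """Move-to-Front: emit the index of each symbol in a self-organizing
    list, then move that symbol to the front."""
    table = list(alphabet)
    out = []
    for ch in block:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(codes, alphabet):
    """Reverse MTF: index into the list, then move the symbol to front."""
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table.pop(i)
        table.insert(0, ch)
        out.append(ch)
    return "".join(out)

def blockwise_mtf(data, block_size, alphabet):
    """Split the input into independent blocks; each block carries its own
    MTF table state, so blocks can be encoded and decoded in parallel."""
    return [mtf_encode(data[i:i + block_size], alphabet)
            for i in range(0, len(data), block_size)]

blocks = blockwise_mtf("banana", 3, "abn")
decoded = "".join(mtf_decode(b, "abn") for b in blocks)
```

Because every block restarts from the same initial table, no state flows between blocks, which is exactly what makes the data-parallel split above work.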
Parallel Huffman Encoding and Decoding
Encoding: build a single-side growing Huffman tree; the codewords C0, C1, ..., Ck and code lengths L0, L1, ..., Lk are stored in constant memory. A position array derived from the code lengths gives each input character its bit offset in the encoded data, so characters are encoded in parallel.
Decoding: the Huffman tree, with its fixed relation between codeword and code length, is kept in shared memory as auxiliary tables. The encoded r-bit string S = h1 h2 ... hr is decoded in parallel with d GPU threads per character, where d is the depth of the Huffman tree: threads 1..d each test one candidate code length and the matching thread generates the output character. This memory-efficient parallel decoding removes the serial dependency of classical Huffman decoding.
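The position-array idea for parallel encoding can be sketched as below: given codeword and code-length tables (the constant-memory tables of the talk), a prefix sum over code lengths gives every character its bit offset, after which each character's bits could be written independently. The function name and the toy code table are ours, not from the talk:

```python
from itertools import accumulate

def huffman_encode_with_positions(data, codeword, codelen):
    """Table-driven Huffman encoding: the position array (exclusive
    prefix sum of code lengths) tells each character where its codeword
    starts, so on a GPU one thread per character can write its bits
    independently of all other threads."""
    lens = [codelen[c] for c in data]
    pos = [0] + list(accumulate(lens[:-1]))  # exclusive prefix sum
    bits = ["0"] * sum(lens)
    for i, c in enumerate(data):  # on GPU: one thread per character
        cw = format(codeword[c], "0{}b".format(codelen[c]))
        bits[pos[i]:pos[i] + codelen[c]] = cw
    return "".join(bits)

# toy prefix-free code: a -> 0, b -> 10, c -> 11
cw = {"a": 0b0, "b": 0b10, "c": 0b11}
cl = {"a": 1, "b": 2, "c": 2}
encoded = huffman_encode_with_positions("abca", cw, cl)
```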
Markov Transform
Each auxiliary table entry (A,B): A stores the frequency of a pair, B represents the second character.
1. Use the adjacent characters in the input to form character pairs and count the frequency of each pair.
2. Use the frequencies to sort each row in the table.
3. Look up the table: use the previous character as the row index, search the current character in that row, and take its column index as the coding value.

Pair frequency table (row = first character of the pair):
     0      1      2      3
0  (0,0)  (0,1)  (0,2)  (1,3)
1  (0,0)  (0,1)  (3,2)  (3,3)
2  (1,0)  (0,1)  (4,2)  (2,3)
3  (0,0)  (5,1)  (1,2)  (0,3)

Each row sorted by descending frequency:
0  (1,3)  (0,0)  (0,1)  (0,2)
1  (3,2)  (3,3)  (0,0)  (0,1)
2  (4,2)  (2,3)  (1,0)  (0,1)
3  (5,1)  (1,2)  (0,0)  (0,3)

Stored lookup table (only the characters are kept):
0:  3 0 1 2
1:  2 3 0 1
2:  2 3 0 1
3:  1 2 0 3

String:        1 3 2 3 1 2 0 3 1 3 1 2 2 3 1 3 1 2 2 2 2 ...
Coded string:  1 1 1 1 0 0 2 0 0 1 0 0 0 1 0 1 0 0 0 0 0

Parallelism on GPU: pair counting uses atomicAdd, one row is sorted by one block of threads, and each character can be processed in parallel during lookup.
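The full transform can be sketched in Python; running it on the slide's example reproduces the coded string above. The function name and the pass-through of the first symbol are our reading of the example; on the GPU the pair counts use atomicAdd and one thread block sorts each row:

```python
from collections import defaultdict

def markov_transform(s, alphabet):
    """First-order Markov transform: count adjacent-pair frequencies,
    sort each row of the table by descending frequency, then replace
    each symbol by its column index in the row of its predecessor.
    Frequent transitions become small values that Huffman codes well."""
    freq = defaultdict(int)
    for prev, cur in zip(s, s[1:]):       # GPU: atomicAdd per pair
        freq[(prev, cur)] += 1
    # lookup table: per row, characters sorted by descending pair
    # frequency (GPU: one thread block sorts one row)
    table = {p: sorted(alphabet, key=lambda c: -freq[(p, c)])
             for p in alphabet}
    coded = [s[0]] + [table[prev].index(cur)
                      for prev, cur in zip(s, s[1:])]
    return coded, table

string = [1, 3, 2, 3, 1, 2, 0, 3, 1, 3, 1, 2, 2, 3, 1, 3, 1, 2, 2, 2, 2]
coded, table = markov_transform(string, [0, 1, 2, 3])
```

Python's `sorted` is stable, so ties keep alphabet order, which matches the slide's lookup table exactly.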
We propose the Markov transform for quality scores: they are locally alike, with high data redundancy, and hard to compress; the transform is lightweight and highly parallel, a good solution.
(Figure: compression rate in MB/s and compression ratio in % versus compression level 1-9 for gzip, bzip2, lzip, and lzo on a FASTQ file.)
bzip2: good and stable compression ratio, but low compression rate. Its data dependency leads to difficulties for GPU acceleration!
bzip2: Challenge for GPU Acceleration
Patel, R. A., Zhang, Y., Mak, J., Davidson, A., & Owens, J. D. (2012). Parallel lossless data compression on the GPU (pp. 1-9). IEEE.

Bzip2-like compression pipeline:
Data (represented as a string of length N)
→ Burrows-Wheeler Transform → BW-transformed string of length N (most time-consuming!)
→ Move-to-Front Transform → N-sized byte array of indices into the MTF list
→ Huffman Coding → M-sized bit string of encoded data
→ Compressed Data
Decompression runs the same stages in reverse.

Example:
Input:         a b a b a c a b a c
After BWT:     c c b b b a a a a a
After MTF:     99 0 99 0 0 99 0 0 0 0
After Huffman: 1 0 1 0 0 1 0 0 0 0
bzip2 Workflow: Rotation Sorting
All N rotations of the string are generated and sorted lexicographically; the BWT string is the last column of the sorted rotation matrix, plus the index of the row holding the original string (Index 0).
String:     a b a b a c a b a c
BWT string: c c b b b a a a a a
Rotation sorting is the most compute-intensive step.
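The rotation-sorting definition can be written directly as a quadratic reference version (not the GPU one; the function name is ours):

```python
def bwt_by_rotation_sort(s):
    """BWT via rotation sorting: sort all cyclic rotations of s and take
    the last column; also return the index of the row that holds the
    original string (Index 0 on the slide), needed for the reverse."""
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])  # sort rotations
    last = "".join(s[(i - 1) % n] for i in order)          # last column
    return last, order.index(0)

bwt, idx = bwt_by_rotation_sort("ababacabac")
```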
Increase Parallelism of BWT
Burrows-Wheeler Transformation via prefix doubling* and radix sorting: rotations are ranked by their first 1, 2, 4, 8, ... characters; each round uses the high-performance sorting library b40c to radix sort the rank array, then the rank array is transformed to the suffix array in parallel, and the result of the BWT is obtained in parallel.

Position: 0 1 2 3 4 5 6 7 8 9
S:        a b a b a c a b a c
R1:       0 1 0 1 0 2 0 1 0 2
R2:       0 2 0 2 1 3 0 2 1 3
R4:       0 3 1 4 2 5 1 4 2 5
R8:       0 5 2 7 4 9 1 6 3 8
SA:       0 6 2 8 4 1 7 3 9 5

* Sun, Weidong, and Zongmin Ma. "Parallel lexicographic names construction with CUDA." Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on. IEEE, 2009.
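A serial Python sketch of the prefix-doubling ranking (b40c's radix sort replaced by Python's `sorted`; rotations are ranked cyclically, matching the R1/R2/R4/R8 rows above; the function name is ours):

```python
def rotation_ranks_prefix_doubling(s):
    """Prefix doubling: rank every rotation by its first 1, 2, 4, ...
    characters. Each round sorts (rank[i], rank[(i+k) % n]) pairs (a
    radix sort with b40c on the GPU) and assigns new ranks. Returns the
    final rank array; inverting it gives the sorted-rotation array SA."""
    n = len(s)
    rank = [sorted(set(s)).index(c) for c in s]  # R1: single-char ranks
    k = 1
    while k < n:
        key = lambda i: (rank[i], rank[(i + k) % n])
        order = sorted(range(n), key=key)
        new = [0] * n
        for j in range(1, n):
            new[order[j]] = new[order[j - 1]] + (key(order[j]) != key(order[j - 1]))
        rank = new
        k *= 2
    return rank

rank = rotation_ranks_prefix_doubling("ababacabac")
sa = [0] * len(rank)
for i, r in enumerate(rank):
    sa[r] = i  # invert the rank array to get SA
```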
Improve Parallelism of BWT Reverse
The BWT reverse is a backward reconstruction, one character after another from a start position: serial in nature!
Solution: more indices are stored during the BWT process, so that different threads can start the BWT reverse at different positions simultaneously. Sorting plays its role: a stable sort of the BWT string recovers the first column of the rotation matrix and the position mapping between the two columns.

BWT string: c c b b b a a a a a
SA:         0 6 2 8 4 1 7 3 9 5
Sorted:     a a a a a b b b c c
            5 6 7 8 9 2 3 4 0 1

With stored indices Index 0 and Index 5, two threads reconstruct "ababa" and "cabac" simultaneously and concatenate them into "ababacabac".
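The multi-start reverse can be sketched as below (serial Python; each iteration over `starts` stands in for one GPU thread; the function name is ours, and this sketch assumes n is divisible by the thread count):

```python
def inverse_bwt_multistart(last, index, num_threads):
    """Inverse BWT via backward reconstruction. A stable sort of the BWT
    string gives the position mapping; the reconstruction chain is split
    into equal segments whose start indices (stored during the BWT in
    the real scheme) let each 'thread' rebuild its part independently."""
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])  # stable sort
    inv = [0] * n
    for k, i in enumerate(order):
        inv[i] = k
    seg = n // num_threads
    starts, pos = [], index
    for t in range(n):              # stored during compression in reality
        if t % seg == 0:
            starts.append(pos)
        pos = inv[pos]
    out = [""] * n
    for s_id, start in enumerate(starts):  # each start: one GPU thread
        pos = start
        for j in range(seg):
            out[n - 1 - s_id * seg - j] = last[pos]
            pos = inv[pos]
    return "".join(out)

text = inverse_bwt_multistart("ccbbbaaaaa", 0, 2)
```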
Radix Sort on GPU
k-bit keys are sorted in r = k/d rounds, d bits per round: each round counts and scatters all n keys into 2^d buckets (count 1, count 2, ..., count 2^d), so every round reads and writes every key, for (2n+n)·r memory accesses in total.
The radix sort implemented by b40c is memory-bandwidth bound. Sorting is still the bottleneck!
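An LSD radix sort matching the figure: k-bit keys, d bits per round, 2^d buckets, r = k/d rounds. This plain Python sketch shows the digit-wise counting/scatter pattern that b40c's GPU kernels perform far faster:

```python
def radix_sort(keys, k_bits, d_bits):
    """LSD radix sort: r = k_bits / d_bits rounds; each round scatters
    all n keys into 2**d_bits buckets by the current d-bit digit, then
    concatenates the buckets in order. Every round touches every key,
    which is why the (2n+n)*r traffic makes it bandwidth-bound on GPUs."""
    mask = (1 << d_bits) - 1
    for shift in range(0, k_bits, d_bits):
        buckets = [[] for _ in range(1 << d_bits)]
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        keys = [key for bucket in buckets for key in bucket]
    return keys

keys = [15, 12, 5, 9, 2, 17, 13, 8, 11, 3, 6, 18, 7, 1, 14, 10, 16, 4]
result = radix_sort(keys, 6, 2)  # 6-bit keys, 2 bits per round: 3 rounds
```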
Sample Sort on GPU*
1. Select the samples from the n keys
2. Sort the samples
3. Select pivots & scatter the keys to the buckets
4. Locally sort each bucket (efficient utilization of shared memory)
Memory reads and writes: 2n + 2n

Example: keys 15 12 5 9 2 17 13 8 11 3 6 18 7 1 14 10 16 4; selected samples 15 9 13 3 7 10; sorted samples 3 7 9 10 13 15; the pivots split the keys into buckets (5 2 3 6 1 4 | 12 9 8 11 7 10 | 15 17 13 18 14 16); locally sorting each bucket yields 1 2 3 ... 18.

* Leischner, N., Osipov, V., & Sanders, P. (2010, April). GPU sample sort. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on (pp. 1-10). IEEE.
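The four steps can be sketched directly in Python. The `random` oversampling and `bisect` pivot search are our choices, and each bucket's local sort stands in for the shared-memory sort on the GPU:

```python
import bisect
import random

def sample_sort(keys, num_buckets, oversample=4):
    """Sample sort: (1) select random samples, (2) sort them, (3) pick
    pivots and scatter every key into its bucket, (4) locally sort each
    bucket. Only two passes over the data: ~2n reads + 2n writes."""
    if len(keys) <= num_buckets * oversample:
        return sorted(keys)
    samples = sorted(random.sample(keys, num_buckets * oversample))
    pivots = samples[oversample::oversample][:num_buckets - 1]
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[bisect.bisect_right(pivots, key)].append(key)
    out = []
    for bucket in buckets:           # on GPU: sorted in shared memory
        out.extend(sorted(bucket))
    return out

keys = [15, 12, 5, 9, 2, 17, 13, 8, 11, 3, 6, 18, 7, 1, 14, 10, 16, 4]
result = sample_sort(keys, 3, oversample=2)
```

Because buckets are concatenated in pivot order and each is sorted locally, the output is fully sorted regardless of which samples were drawn.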
Performance of GPU-Accelerated Algorithms (speedup over CPU)
Compression method | Transform | Reverse transform
BWT                |    4x     |   31x
MTF                |    8x     |    8x
Markov             |   18x     |   N/A
Huffman            |    2x     |    5x
The 31x reflects the improved parallelism of the BWT reverse; the 18x reflects the intrinsic parallelism of the newly designed Markov transform.
CPU: Intel Xeon E5630 @ 2.53 GHz; GPU: Tesla M2050 / 3 GB; CUDA: 4.0
Compression Performance of FASTQ File
Compression method | Compression rate (MB/s) M2050 / K20c | Decompression rate (MB/s) M2050 / K20c | Compression ratio (%)
bzip2 (CPU)        |   8.24          |  26.64          | 24.31
gzip (CPU)         |   8.45          | 114.23          | 29.46
BWT+MTF+Huffman    |  16.04 /  21.95 |  73.90 /  83.66 | 31.06
Markov+Huffman     | 115.60 / 204.86 |  77.23 /  90.87 | 36.68
Huffman            | 179.83 / 215.85 | 128.39 / 142.16 | 60.50
This work          |  77.80 /  97.26 | 124.37 / 127.78 | 24.77
A similar compression ratio to bzip2, with an 11.8x speedup for compression and a 4.8x speedup for decompression (improvement possible with more work).
Compression Performance of SAM File
Compression method | Compression rate (MB/s) M2050 / K20c | Decompression rate (MB/s) M2050 / K20c | Compression ratio (%)
bzip2 (CPU)        |   6.27          |  24.63          | 26.71
gzip (CPU)         |   7.73          | 106.41          | 32.26
BWT+MTF+Huffman    |  15.99 /  21.78 |  74.14 /  80.04 | 32.55
Markov+Huffman     | 116.71 / 206.18 |  87.09 /  90.14 | 39.66
Huffman            | 177.11 / 222.22 | 127.48 / 144.49 | 57.69
This work          |  87.45 /  98.14 | 139.93 / 149.68 | 26.46
A similar compression ratio to bzip2, with a 15.6x speedup for compression and a 6.1x speedup for decompression (improvement possible with more work).
Markov Transform for Quality Scores
Compression method | Compression rate (MB/s) M2050 / K20c | Decompression rate (MB/s) M2050 / K20c | Compression ratio (%)
bzip2 (CPU)        |   8.93          |  22.61          | 38.47
gzip (CPU)         |   9.83          |  93.59          | 42.35
BWT+MTF+Huffman    |   9.14 /  12.38 |  57.01 /  75.18 | 43.86
Huffman            | 185.32 / 231.16 | 121.66 / 129.28 | 49.42
Markov+Huffman     |  97.75 / 176.65 |  83.67 /  88.10 | 42.01
Comparison to Domain-Specific Methods
Compression method | Compression rate (MB/s) | Decompression rate (MB/s) | Compression ratio (%)
gzip                           |  12.2 |  45.4 | 35.35
bzip2                          |   7.0 |  13.0 | 29.05
SCALCE                         |   7.8 |  13.1 | 25.72
DSRC                           |  13.5 |  32.2 | 24.77
quip                           |   8.3 |  10.9 | 22.19
fastqz                         |   4.6 |   3.8 | 21.95
fqzcomp                        |   8.2 |   8.3 | 21.72
Seqsqueeze1                    |   0.6 |   0.6 | 21.87
Column major block compression | 111.0 | 104.4 | 29.46
Conclusion
• We presented a GPU-accelerated adaptive compression framework for genomics data, which works very well
• Column-major compression
• A novel algorithm for data like quality scores
• Generic and extensible
• Compression on GPU is not easy; sorting is still a bottleneck
Contact: [email protected]