GPU Accelerated Adaptive Compression Framework for Genomics Data
GuiXin Guo, Shuang Qiu, ZhiQiang Ye, BingQiang Wang (BGI Research, Shenzhen, China)
Mian Lu (Institute of HPC, A*STAR, Singapore)
Simon See (BGI-NVIDIA Joint Innovation Lab, Shenzhen, China)
GTC2014, March 24-27, 2014, San Jose, CA
Contact: [email protected] [email protected]
Outline
Ø Introduction
Ø Adaptive Compression Framework
Ø Implementation of Compression Algorithms
Ø Results
Ø Conclusion
Genomics Data: Exponential Growth
(a) The cost of sequencing per megabase of DNA has dropped from nearly $6000 in 2001 to slightly more than $0.10 in 2011.
(b) The total number of completed sequenced genomes has grown exponentially with decreasing sequencing costs.
Boyle, Nanette R., and Gill, Ryan T. "Tools for genome-wide strain design and construction." Current Opinion in Biotechnology 23.5 (2012): 666-671.
Moore's Law for Chips: 2x performance per 18 months
Moore's Law for Genomics: 10x data output per 18 months
• Challenges in storing and processing huge volumes of genomics data
• BGI as an example
– Tens of TBs of data generated per day
– Tens of PBs of storage (several sites)
– Tenfold growth in the not-too-far future
• Observation
– Computation in genomics features a much lower computation/IO ratio than classical HPC workloads
– IO (or data movement) becomes more expensive than computation
Can Compression Help?
Compression
• Benefits of compression
– Reduce storage capacity (especially for archiving)
– Reduce IO bandwidth (more balanced computing system architecture)
– And, of course, save $$$
• Compression is NOT for free – Squeeze more, compute more
– Squeeze less, compute less
• Can GPU help?
Take a Look at Genomics Data Files
• Two common characteristics of genomics data files
1. The file is a table with multiple rows and columns
2. Data in the same column share similar characteristics
Example of a FASTQ file containing two sequences:
@SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51          ← sequence identifier
GAATAAAGAAAAAATGGAAAACGAAGATGTTGAAATTTTTAATGATTATA       ← sequence bases
+SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51          ← sequence identifier
I>I:1III9?9&I+II.6*,:'*1.?I%-&&67I0(1.",&$%2,+I4)+       ← quality scores
@SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51
GTATACGTATTATGAATATACTGATTATATAAGCATAAATAAATAAAATA
+SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51
IIIIIIIIIIIIIIIDIAI8%I-7II9I3I8@(%/EIA/>;G=DI9=8#6
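The column-major transform behind this view can be sketched in a few lines. The following Python sketch (function name and dictionary layout are ours, not the framework's) splits FASTQ records into four per-column streams so that each stream can be compressed with its own scheme:

```python
def fastq_to_columns(text):
    """Transform FASTQ text from row-major (record after record) to
    column-major: one stream per column."""
    lines = text.strip().split("\n")
    cols = {"id": [], "bases": [], "plus": [], "qual": []}
    for i in range(0, len(lines), 4):  # one FASTQ record = 4 lines
        cols["id"].append(lines[i])
        cols["bases"].append(lines[i + 1])
        cols["plus"].append(lines[i + 2])
        cols["qual"].append(lines[i + 3])
    return cols

record = "@r1\nGATTACA\n+r1\nIIIIIII\n@r2\nACGTACG\n+r2\nIIHHIII\n"
cols = fastq_to_columns(record)
```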
Workflow of Adaptive Compression Framework (figure): each input block #i is transformed to a column-major table view (sequence identifiers, sequence bases, quality scores as separate columns). The column-major compression engine then tests compression schemes (combinations of GPU-optimized compression algorithms, applied in multiple rounds of processing) against each column, applies the best scheme to each column, and emits the compressed output.
Algorithms Optimized (So Far)
All GPU accelerated: BWT, MTF, Markov Transform, Huffman, LZ77
• Commonly used compression schemes: transformational (BWT, MTF), substitutional (LZ77), and statistical model-based (Huffman)
• Novel compression algorithm for quality scores (FASTQ): the Markov Transform, a first-order Markov model that sorts frequencies of character pairs
Four Schemes for Different Data (figure): raw genomics data is split into columns, and each data type is compressed with a scheme built from typical basic algorithms:
• Data with many similar strings (e.g. sequence ID): LZ77 + Huffman
• Data with limited alphabet (e.g. DNA sequence): BWT + MTF + Huffman (transformational scheme)
• Locally similar data (e.g. quality scores): Markov transform + Huffman (statistical scheme)
• Randomly distributed or text-like data: Huffman

Generic compression methods? Not efficient.
Domain-specific methods? Work only on limited data formats.
The problem still remains: serial algorithms are too slow.
Column-major compression:
• Flexible for new file formats
• Extensible for new algorithms
• Tested for best performance
Optimization Techniques
Ø Data parallel (figure): the input data is split into data blocks 1..n; each block independently passes through the pipeline of algorithms 1..k, producing transformed and then compressed data per block, and the compressed blocks are merged. Decompression splits the stream again and applies the reverse algorithms per block. This is a simple but efficient scheme to parallelize MTF and its reverse.
Ø Increase the parallelism of selected algorithms
• (Slightly) alternate implementations of the algorithms to reduce data dependency
Ø Optimize the implementation on GPU
• Embrace state-of-the-art, high-performance libraries (e.g. b40c)
• Better utilization of constant memory and shared memory
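The block-splitting scheme for MTF can be sketched as follows: a minimal serial Python reference (on the GPU each block would go to its own thread block; function names are ours):

```python
def mtf_encode(block, alphabet):
    """Move-to-Front: emit the index of each symbol in a self-organizing
    list, then move that symbol to the front."""
    table = list(alphabet)
    out = []
    for ch in block:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(codes, alphabet):
    """Reverse MTF: index into the list, then move the symbol to front."""
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table.pop(i)
        table.insert(0, ch)
        out.append(ch)
    return "".join(out)

def blockwise_mtf(data, block_size, alphabet):
    """Split the input into independent blocks; each block carries its own
    MTF table state, so blocks can be encoded and decoded in parallel."""
    return [mtf_encode(data[i:i + block_size], alphabet)
            for i in range(0, len(data), block_size)]

blocks = blockwise_mtf("banana", 3, "abn")
decoded = "".join(mtf_decode(b, "abn") for b in blocks)
```

Because every block restarts from the same initial table, no state flows between blocks, which is exactly what makes the data-parallel split above work.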
Parallel Huffman Encoding and Decoding
Encoding: build a single-side growing Huffman tree; the codewords C0, C1, ..., Ck and code lengths L0, L1, ..., Lk are stored in constant memory. A position array derived from the code lengths gives each input character its bit offset in the encoded data, so characters are encoded in parallel.
Decoding: the Huffman tree, with its fixed relation between codeword and code length, is kept in shared memory as auxiliary tables. The encoded r-bit string S = h1 h2 ... hr is decoded in parallel with d GPU threads per character, where d is the depth of the Huffman tree: threads 1..d each test one candidate code length and the matching thread generates the output character. This memory-efficient parallel decoding removes the serial dependency of classical Huffman decoding.
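The position-array idea for parallel encoding can be sketched as below: given codeword and code-length tables (the constant-memory tables of the talk), a prefix sum over code lengths gives every character its bit offset, after which each character's bits could be written independently. The function name and the toy code table are ours, not from the talk:

```python
from itertools import accumulate

def huffman_encode_with_positions(data, codeword, codelen):
    """Table-driven Huffman encoding: the position array (exclusive
    prefix sum of code lengths) tells each character where its codeword
    starts, so on a GPU one thread per character can write its bits
    independently of all other threads."""
    lens = [codelen[c] for c in data]
    pos = [0] + list(accumulate(lens[:-1]))  # exclusive prefix sum
    bits = ["0"] * sum(lens)
    for i, c in enumerate(data):  # on GPU: one thread per character
        cw = format(codeword[c], "0{}b".format(codelen[c]))
        bits[pos[i]:pos[i] + codelen[c]] = cw
    return "".join(bits)

# toy prefix-free code: a -> 0, b -> 10, c -> 11
cw = {"a": 0b0, "b": 0b10, "c": 0b11}
cl = {"a": 1, "b": 2, "c": 2}
encoded = huffman_encode_with_positions("abca", cw, cl)
```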
Markov Transform
Each auxiliary table entry (A,B): A stores the frequency of a pair, B represents the second character.
1. Use the adjacent characters in the input to form character pairs and count the frequency of each pair.
2. Use the frequencies to sort each row in the table.
3. Look up the table: use the previous character as the row index, search the current character in that row, and take its column index as the coding value.

Pair frequency table (row = first character of the pair):
     0      1      2      3
0  (0,0)  (0,1)  (0,2)  (1,3)
1  (0,0)  (0,1)  (3,2)  (3,3)
2  (1,0)  (0,1)  (4,2)  (2,3)
3  (0,0)  (5,1)  (1,2)  (0,3)

Each row sorted by descending frequency:
0  (1,3)  (0,0)  (0,1)  (0,2)
1  (3,2)  (3,3)  (0,0)  (0,1)
2  (4,2)  (2,3)  (1,0)  (0,1)
3  (5,1)  (1,2)  (0,0)  (0,3)

Stored lookup table (only the characters are kept):
0:  3 0 1 2
1:  2 3 0 1
2:  2 3 0 1
3:  1 2 0 3

String:        1 3 2 3 1 2 0 3 1 3 1 2 2 3 1 3 1 2 2 2 2 ...
Coded string:  1 1 1 1 0 0 2 0 0 1 0 0 0 1 0 1 0 0 0 0 0

Parallelism on GPU: pair counting uses atomicAdd, one row is sorted by one block of threads, and each character can be processed in parallel during lookup.
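The full transform can be sketched in Python; running it on the slide's example reproduces the coded string above. The function name and the pass-through of the first symbol are our reading of the example; on the GPU the pair counts use atomicAdd and one thread block sorts each row:

```python
from collections import defaultdict

def markov_transform(s, alphabet):
    """First-order Markov transform: count adjacent-pair frequencies,
    sort each row of the table by descending frequency, then replace
    each symbol by its column index in the row of its predecessor.
    Frequent transitions become small values that Huffman codes well."""
    freq = defaultdict(int)
    for prev, cur in zip(s, s[1:]):       # GPU: atomicAdd per pair
        freq[(prev, cur)] += 1
    # lookup table: per row, characters sorted by descending pair
    # frequency (GPU: one thread block sorts one row)
    table = {p: sorted(alphabet, key=lambda c: -freq[(p, c)])
             for p in alphabet}
    coded = [s[0]] + [table[prev].index(cur)
                      for prev, cur in zip(s, s[1:])]
    return coded, table

string = [1, 3, 2, 3, 1, 2, 0, 3, 1, 3, 1, 2, 2, 3, 1, 3, 1, 2, 2, 2, 2]
coded, table = markov_transform(string, [0, 1, 2, 3])
```

Python's `sorted` is stable, so ties keep alphabet order, which matches the slide's lookup table exactly.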
We propose the Markov transform for quality scores: they are locally alike, with high data redundancy, and hard to compress; the transform is lightweight and highly parallel, a good solution.
(Figure: compression rate in MB/s and compression ratio in % versus compression level 1-9 for gzip, bzip2, lzip, and lzo on a FASTQ file.)
bzip2: good and stable compression ratio, but low compression rate. Its data dependency leads to difficulties for GPU acceleration!
bzip2: Challenge for GPU Acceleration
Patel, R. A., Zhang, Y., Mak, J., Davidson, A., & Owens, J. D. (2012). Parallel lossless data compression on the GPU (pp. 1-9). IEEE.

Bzip2-like compression pipeline:
Data (represented as a string of length N)
→ Burrows-Wheeler Transform → BW-transformed string of length N (most time-consuming!)
→ Move-to-Front Transform → N-sized byte array of indices into the MTF list
→ Huffman Coding → M-sized bit string of encoded data
→ Compressed Data
Decompression runs the same stages in reverse.

Example:
Input:         a b a b a c a b a c
After BWT:     c c b b b a a a a a
After MTF:     99 0 99 0 0 99 0 0 0 0
After Huffman: 1 0 1 0 0 1 0 0 0 0
bzip2 Workflow: Rotation Sorting
All N rotations of the string are generated and sorted lexicographically; the BWT string is the last column of the sorted rotation matrix, plus the index of the row holding the original string (Index 0).
String:     a b a b a c a b a c
BWT string: c c b b b a a a a a
Rotation sorting is the most compute-intensive step.
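The rotation-sorting definition can be written directly as a quadratic reference version (not the GPU one; the function name is ours):

```python
def bwt_by_rotation_sort(s):
    """BWT via rotation sorting: sort all cyclic rotations of s and take
    the last column; also return the index of the row that holds the
    original string (Index 0 on the slide), needed for the reverse."""
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])  # sort rotations
    last = "".join(s[(i - 1) % n] for i in order)          # last column
    return last, order.index(0)

bwt, idx = bwt_by_rotation_sort("ababacabac")
```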
Increase Parallelism of BWT
Burrows-Wheeler Transformation via prefix doubling* and radix sorting: rotations are ranked by their first 1, 2, 4, 8, ... characters; each round uses the high-performance sorting library b40c to radix sort the rank array, then the rank array is transformed to the suffix array in parallel, and the result of the BWT is obtained in parallel.

Position: 0 1 2 3 4 5 6 7 8 9
S:        a b a b a c a b a c
R1:       0 1 0 1 0 2 0 1 0 2
R2:       0 2 0 2 1 3 0 2 1 3
R4:       0 3 1 4 2 5 1 4 2 5
R8:       0 5 2 7 4 9 1 6 3 8
SA:       0 6 2 8 4 1 7 3 9 5

* Sun, Weidong, and Zongmin Ma. "Parallel lexicographic names construction with CUDA." Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on. IEEE, 2009.
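A serial Python sketch of the prefix-doubling ranking (b40c's radix sort replaced by Python's `sorted`; rotations are ranked cyclically, matching the R1/R2/R4/R8 rows above; the function name is ours):

```python
def rotation_ranks_prefix_doubling(s):
    """Prefix doubling: rank every rotation by its first 1, 2, 4, ...
    characters. Each round sorts (rank[i], rank[(i+k) % n]) pairs (a
    radix sort with b40c on the GPU) and assigns new ranks. Returns the
    final rank array; inverting it gives the sorted-rotation array SA."""
    n = len(s)
    rank = [sorted(set(s)).index(c) for c in s]  # R1: single-char ranks
    k = 1
    while k < n:
        key = lambda i: (rank[i], rank[(i + k) % n])
        order = sorted(range(n), key=key)
        new = [0] * n
        for j in range(1, n):
            new[order[j]] = new[order[j - 1]] + (key(order[j]) != key(order[j - 1]))
        rank = new
        k *= 2
    return rank

rank = rotation_ranks_prefix_doubling("ababacabac")
sa = [0] * len(rank)
for i, r in enumerate(rank):
    sa[r] = i  # invert the rank array to get SA
```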
Improve Parallelism of BWT Reverse
The BWT reverse is a backward reconstruction, one character after another from a start position: serial in nature!
Solution: more indices are stored during the BWT process, so that different threads can start the BWT reverse at different positions simultaneously. Sorting plays its role: a stable sort of the BWT string recovers the first column of the rotation matrix and the position mapping between the two columns.

BWT string: c c b b b a a a a a
SA:         0 6 2 8 4 1 7 3 9 5
Sorted:     a a a a a b b b c c
            5 6 7 8 9 2 3 4 0 1

With stored indices Index 0 and Index 5, two threads reconstruct "ababa" and "cabac" simultaneously and concatenate them into "ababacabac".
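The multi-start reverse can be sketched as below (serial Python; each iteration over `starts` stands in for one GPU thread; the function name is ours, and this sketch assumes n is divisible by the thread count):

```python
def inverse_bwt_multistart(last, index, num_threads):
    """Inverse BWT via backward reconstruction. A stable sort of the BWT
    string gives the position mapping; the reconstruction chain is split
    into equal segments whose start indices (stored during the BWT in
    the real scheme) let each 'thread' rebuild its part independently."""
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])  # stable sort
    inv = [0] * n
    for k, i in enumerate(order):
        inv[i] = k
    seg = n // num_threads
    starts, pos = [], index
    for t in range(n):              # stored during compression in reality
        if t % seg == 0:
            starts.append(pos)
        pos = inv[pos]
    out = [""] * n
    for s_id, start in enumerate(starts):  # each start: one GPU thread
        pos = start
        for j in range(seg):
            out[n - 1 - s_id * seg - j] = last[pos]
            pos = inv[pos]
    return "".join(out)

text = inverse_bwt_multistart("ccbbbaaaaa", 0, 2)
```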
Radix Sort on GPU
k-bit keys are sorted in r = k/d rounds, d bits per round: each round counts and scatters all n keys into 2^d buckets (count 1, count 2, ..., count 2^d), so every round reads and writes every key, for (2n+n)·r memory accesses in total.
The radix sort implemented by b40c is memory-bandwidth bound. Sorting is still the bottleneck!
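An LSD radix sort matching the figure: k-bit keys, d bits per round, 2^d buckets, r = k/d rounds. This plain Python sketch shows the digit-wise counting/scatter pattern that b40c's GPU kernels perform far faster:

```python
def radix_sort(keys, k_bits, d_bits):
    """LSD radix sort: r = k_bits / d_bits rounds; each round scatters
    all n keys into 2**d_bits buckets by the current d-bit digit, then
    concatenates the buckets in order. Every round touches every key,
    which is why the (2n+n)*r traffic makes it bandwidth-bound on GPUs."""
    mask = (1 << d_bits) - 1
    for shift in range(0, k_bits, d_bits):
        buckets = [[] for _ in range(1 << d_bits)]
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        keys = [key for bucket in buckets for key in bucket]
    return keys

keys = [15, 12, 5, 9, 2, 17, 13, 8, 11, 3, 6, 18, 7, 1, 14, 10, 16, 4]
result = radix_sort(keys, 6, 2)  # 6-bit keys, 2 bits per round: 3 rounds
```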
Sample Sort on GPU*
1. Select the samples from the n keys
2. Sort the samples
3. Select pivots & scatter the keys to the buckets
4. Locally sort each bucket (efficient utilization of shared memory)
Memory reads and writes: 2n + 2n

Example: keys 15 12 5 9 2 17 13 8 11 3 6 18 7 1 14 10 16 4; selected samples 15 9 13 3 7 10; sorted samples 3 7 9 10 13 15; the pivots split the keys into buckets (5 2 3 6 1 4 | 12 9 8 11 7 10 | 15 17 13 18 14 16); locally sorting each bucket yields 1 2 3 ... 18.

* Leischner, N., Osipov, V., & Sanders, P. (2010, April). GPU sample sort. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on (pp. 1-10). IEEE.
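The four steps can be sketched directly in Python. The `random` oversampling and `bisect` pivot search are our choices, and each bucket's local sort stands in for the shared-memory sort on the GPU:

```python
import bisect
import random

def sample_sort(keys, num_buckets, oversample=4):
    """Sample sort: (1) select random samples, (2) sort them, (3) pick
    pivots and scatter every key into its bucket, (4) locally sort each
    bucket. Only two passes over the data: ~2n reads + 2n writes."""
    if len(keys) <= num_buckets * oversample:
        return sorted(keys)
    samples = sorted(random.sample(keys, num_buckets * oversample))
    pivots = samples[oversample::oversample][:num_buckets - 1]
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[bisect.bisect_right(pivots, key)].append(key)
    out = []
    for bucket in buckets:           # on GPU: sorted in shared memory
        out.extend(sorted(bucket))
    return out

keys = [15, 12, 5, 9, 2, 17, 13, 8, 11, 3, 6, 18, 7, 1, 14, 10, 16, 4]
result = sample_sort(keys, 3, oversample=2)
```

Because buckets are concatenated in pivot order and each is sorted locally, the output is fully sorted regardless of which samples were drawn.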
Performance of GPU-Accelerated Algorithms (speedup over CPU)
Compression method | Transform | Reverse transform
BWT                |    4x     |   31x
MTF                |    8x     |    8x
Markov             |   18x     |   N/A
Huffman            |    2x     |    5x
The 31x reflects the improved parallelism of the BWT reverse; the 18x reflects the intrinsic parallelism of the newly designed Markov transform.
CPU: Intel Xeon E5630 @ 2.53 GHz; GPU: Tesla M2050 / 3 GB; CUDA: 4.0
Compression Performance of FASTQ File
Compression method | Compression rate (MB/s) M2050 / K20c | Decompression rate (MB/s) M2050 / K20c | Compression ratio (%)
bzip2 (CPU)        |   8.24          |  26.64          | 24.31
gzip (CPU)         |   8.45          | 114.23          | 29.46
BWT+MTF+Huffman    |  16.04 /  21.95 |  73.90 /  83.66 | 31.06
Markov+Huffman     | 115.60 / 204.86 |  77.23 /  90.87 | 36.68
Huffman            | 179.83 / 215.85 | 128.39 / 142.16 | 60.50
This work          |  77.80 /  97.26 | 124.37 / 127.78 | 24.77
A similar compression ratio to bzip2, with an 11.8x speedup for compression and a 4.8x speedup for decompression (improvement possible with more work).
Compression Performance of SAM File
Compression method | Compression rate (MB/s) M2050 / K20c | Decompression rate (MB/s) M2050 / K20c | Compression ratio (%)
bzip2 (CPU)        |   6.27          |  24.63          | 26.71
gzip (CPU)         |   7.73          | 106.41          | 32.26
BWT+MTF+Huffman    |  15.99 /  21.78 |  74.14 /  80.04 | 32.55
Markov+Huffman     | 116.71 / 206.18 |  87.09 /  90.14 | 39.66
Huffman            | 177.11 / 222.22 | 127.48 / 144.49 | 57.69
This work          |  87.45 /  98.14 | 139.93 / 149.68 | 26.46
A similar compression ratio to bzip2, with a 15.6x speedup for compression and a 6.1x speedup for decompression (improvement possible with more work).
Markov Transform for Quality Scores
Compression method | Compression rate (MB/s) M2050 / K20c | Decompression rate (MB/s) M2050 / K20c | Compression ratio (%)
bzip2 (CPU)        |   8.93          |  22.61          | 38.47
gzip (CPU)         |   9.83          |  93.59          | 42.35
BWT+MTF+Huffman    |   9.14 /  12.38 |  57.01 /  75.18 | 43.86
Huffman            | 185.32 / 231.16 | 121.66 / 129.28 | 49.42
Markov+Huffman     |  97.75 / 176.65 |  83.67 /  88.10 | 42.01
Comparison to Domain-Specific Methods
Compression method | Compression rate (MB/s) | Decompression rate (MB/s) | Compression ratio (%)
gzip                           |  12.2 |  45.4 | 35.35
bzip2                          |   7.0 |  13.0 | 29.05
SCALCE                         |   7.8 |  13.1 | 25.72
DSRC                           |  13.5 |  32.2 | 24.77
quip                           |   8.3 |  10.9 | 22.19
fastqz                         |   4.6 |   3.8 | 21.95
fqzcomp                        |   8.2 |   8.3 | 21.72
Seqsqueeze1                    |   0.6 |   0.6 | 21.87
Column major block compression | 111.0 | 104.4 | 29.46
Conclusion
• We presented a GPU-accelerated adaptive compression framework for genomics data, which works very well
• Column-major compression
• A novel algorithm for data like quality scores
• Generic and extensible
• Compression on GPU is not easy; sorting is still a bottleneck
Contact: [email protected]