GPU Accelerated Genomics Data Compression · 2014. 4. 10. ·...

GPU Accelerated Adaptive Compression Framework for Genomics Data

GuiXin Guo, Shuang Qiu, ZhiQiang Ye, BingQiang Wang (BGI Research, Shenzhen, China)
Mian Lu (Institute of HPC, A*STAR, Singapore)
Simon See (BGI-NVIDIA Joint Innovation Lab, Shenzhen, China)

GTC 2014, March 24-27, 2014, San Jose, CA

Contact: [email protected] [email protected]


Page 2: Outline

• Introduction
• Adaptive Compression Framework
• Implementation of Compression Algorithms
• Results
• Conclusion

Page 3: Genomics Data: Exponential Growth

(a) The cost of sequencing per megabase of DNA sequence has dropped from nearly $6000 in 2001 to slightly more than $0.10 in 2011.

(b) The total number of completed genome sequences has grown exponentially with decreasing sequencing costs.

Source: Boyle, Nanette R., and Gill, Ryan T. "Tools for genome-wide strain design and construction." Current Opinion in Biotechnology 23.5 (2012): 666-671.

• Moore's Law for chips: 2x performance per 18 months
• Moore's Law for genomics: 10x data output per 18 months

Page 4

• Storing and processing huge volumes of genomics data poses challenges

• BGI as an example
  – Tens of TBs of data generated per day
  – Tens of PBs of storage (across several sites)
  – Tenfold growth in the (not too far) future

• Observation
  – Computation in genomics features a much lower computation/IO ratio than classical HPC workloads
  – IO (or data movement) becomes more expensive than computation

Can compression help?

Page 5: Compression

• Benefits of compression
  – Reduce storage capacity (especially for archiving)
  – Reduce IO bandwidth (more balanced computing system architecture)
  – And, of course, save $$$

• Compression is NOT free
  – Squeeze more, compute more
  – Squeeze less, compute less

• Can GPU help?

Page 6: Take a Look at Genomics Data Files

• Two common characteristics of genomics data files
  1. The file is a table that contains multiple rows and columns
  2. Data in the same column have similar characteristics

@SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51
GAATAAAGAAAAAATGGAAAACGAAGATGTTGAAATTTTTAATGATTATA
+SRR003092.1.1 3046HAAXX:2:1:933:35.1 length=51
I>I:1III9?9&I+II.6*,:'*1.?I%-&&67I0(1.",&$%2,+I4)+
@SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51
GTATACGTATTATGAATATACTGATTATATAAGCATAAATAAATAAAATA
+SRR003092.2.1 3046HAAXX:2:1:942:57.1 length=51
IIIIIIIIIIIIIIIDIAI8%I-7II9I3I8@(%/EIA/>;G=DI9=8#6

Example of a FASTQ file containing two sequences. Each record consists of four lines: sequence identifier, sequence bases, sequence identifier, quality scores.

Column major table view: the same four fields (sequence identifier, sequence bases, sequence identifier, quality scores), with each field of all records stored together as one column.
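The column-major view can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `fastq_to_columns`, the column names, and the two shortened demo records are all hypothetical.

```python
from typing import Dict, List


def fastq_to_columns(text: str) -> Dict[str, List[str]]:
    """Regroup 4-line FASTQ records into four per-field columns."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-empty lines")
    # Hypothetical column names; the slide only names the four fields.
    columns: Dict[str, List[str]] = {"id": [], "bases": [], "id2": [], "quality": []}
    for i in range(0, len(lines), 4):
        ident, bases, ident2, quality = lines[i:i + 4]
        if not ident.startswith("@") or not ident2.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        columns["id"].append(ident)
        columns["bases"].append(bases)
        columns["id2"].append(ident2)
        columns["quality"].append(quality)
    return columns


# Made-up, shortened records (not the reads shown on the slide).
demo = "@r1\nGATTACA\n+r1\nIIII9?9\n@r2\nTTAGGCA\n+r2\nI>I:1II\n"
cols = fastq_to_columns(demo)
```

Each list in `cols` now holds data of one kind only, which is what lets a compressor pick a different scheme per column.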

Page 7: Workflow of Adaptive Compression Framework

[Figure: workflow diagram. Labels: Input; Block #i; Column Major Compression Engine (transform to column major; test and apply the best scheme against each column; compress each column); Compression Schemes (combination of algorithms); GPU Optimized Compression Algorithms; multiple rounds of processing; Output.]
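The "test and apply the best scheme against each column" step can be illustrated as follows. This is only a sketch under stated assumptions: the three CPU codecs from the Python standard library stand in for the GPU schemes, "best" is assumed to mean smallest output, and the names `SCHEMES`, `pick_scheme`, `compress_block` and `decompress_block` are hypothetical.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Tuple

Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]

# Stand-in schemes: standard-library CPU codecs, NOT the GPU schemes of the talk.
SCHEMES: Dict[str, Codec] = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def pick_scheme(column: bytes) -> Tuple[str, bytes]:
    """Try every scheme on one column and keep the smallest output."""
    results = {name: enc(column) for name, (enc, _) in SCHEMES.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]


def compress_block(columns: Dict[str, bytes]) -> Dict[str, Tuple[str, bytes]]:
    """Compress each column of one block with its own best scheme."""
    return {name: pick_scheme(data) for name, data in columns.items()}


def decompress_block(packed: Dict[str, Tuple[str, bytes]]) -> Dict[str, bytes]:
    """Undo compress_block using the scheme name stored with each column."""
    return {name: SCHEMES[scheme][1](payload)
            for name, (scheme, payload) in packed.items()}


# Made-up column contents for illustration.
columns = {"bases": b"GATTACA" * 200, "quality": bytes(range(33, 74)) * 30}
packed = compress_block(columns)
restored = decompress_block(packed)
```

Storing the chosen scheme name next to each compressed column is what makes the choice reversible at decompression time.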

Page 8: Algorithms Optimized (Till Now)

Typical basic algorithms (GPU accelerated): BWT, MTF, Markov Transform, Huffman, LZ77

Commonly used compression schemes:
• Transformational compression schemes
• Substitutional compression schemes
• Statistical model-based compression schemes

Markov Transform: novel compression algorithm for quality scores (FASTQ)
• First-order Markov model
• Sorting frequencies of character pairs
• Statistical scheme & transformational scheme

Page 9: Four Schemes for Different Data

• Generic compression methods? Not efficient.
• Domain-specific methods? They work only on limited data formats.
• Problem still remains: serial algorithms are too slow.

[Figure: raw genomics data is divided into classes of data, each handled by its own scheme.
 Data classes: data with many similar strings; data with limited alphabet; randomly distributed data; text-like data; ...
 Examples given: quality scores, sequence ID, DNA sequence.
 Schemes (tested for best performance): LZ77 + Huffman; BWT + MTF + Huffman; Markov transform + Huffman; Huffman.]

Column-major compression:
• Flexible for new file formats
• Extensible for new algorithms

Page 10: Optimization Techniques

• Data parallel
  [Figure: the input data is split into data blocks 1..n. Each block passes through algorithm 1 ... algorithm k (transformed data 1..n) to give compressed data 1..n. In the other direction, reverse algorithm k ... reverse algorithm 1 are applied per block and the blocks are merged into the output data.]
  – Simple but efficient scheme to parallelize MTF and its reverse

• Increase the parallelism of selected algorithms
  – (Slightly) alter the implementation of the algorithms to reduce data dependency

• Optimize the implementation on GPU
  – Embrace state-of-the-art, high-performance libraries (e.g. b40c)
  – Better utilization of constant memory and shared memory
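The data-parallel scheme (split, process each block independently, merge) can be sketched on the CPU. The assumptions here are loud ones: `zlib` stands in for the chain "algorithm 1 ... algorithm k", a thread pool stands in for GPU parallelism, and the function names and the block size are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data: bytes, block_size: int = 1 << 16) -> List[bytes]:
    """Compress every block independently; zlib is only a stand-in codec."""
    with ThreadPoolExecutor() as pool:
        # pool.map keeps the input order, so block i stays at index i.
        return list(pool.map(zlib.compress, split_blocks(data, block_size)))


def decompress_blocks(blocks: List[bytes]) -> bytes:
    """Reverse each block independently, then merge the blocks in order."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, blocks))


data = bytes(range(256)) * 1000          # 256000 bytes of made-up input
blocks = compress_blocks(data, block_size=4096)
roundtrip = decompress_blocks(blocks)
```

Because no block depends on another, the same structure applies to transforms such as MTF whose state would otherwise be carried serially through the whole input; the price is that each block restarts from a fresh state.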

Page 11: Parallel Huffman Encoding and Decoding

Encoding
• A single-side growing Huffman tree is generated from the input data
• Codewords (C0, C1, C2, ..., Ck) and code lengths (L0, L1, L2, ..., Lk) are stored in constant memory
• A position array is used to encode characters in parallel into the encoded data

Decoding
• Huffman decoding is serial in nature
• The single-side growing Huffman tree gives a fixed relation between codeword and code length
• Auxiliary tables are stored in shared memory (memory efficient)
• Parallel Huffman decoding: the encoded r-bit string S (h1 h2 ... hr) is decoded in parallel with d GPU threads for each character (thread 1, thread 2, ..., thread d), where d = the depth of the Huffman tree

[Figure: Huffman tree with leaves a, b, c, d.]
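One common way to encode characters in parallel is to give each character its bit offset in the output through a prefix sum over code lengths. The sketch below assumes that this is what the position array holds, which the slide does not spell out. It also uses an ordinary Huffman tree rather than the single-side growing tree, and a plain sequential decoder; all function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import accumulate
from typing import Dict


def huffman_code(data: str) -> Dict[str, str]:
    """Build an ordinary Huffman code (symbol -> bit string)."""
    freq = Counter(data)
    if len(freq) == 1:                       # degenerate one-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


def encode(data: str, code: Dict[str, str]) -> str:
    lengths = [len(code[ch]) for ch in data]
    # Position array (assumed meaning): exclusive prefix sum of code lengths,
    # i.e. the bit offset at which each character's codeword starts.
    positions = [0] + list(accumulate(lengths))[:-1]
    out = [""] * sum(lengths)
    # Every iteration writes a disjoint slice of `out`, so on a GPU each
    # character could be handled by its own thread.
    for i, ch in enumerate(data):
        for j, bit in enumerate(code[ch]):
            out[positions[i] + j] = bit
    return "".join(out)


def decode(bits: str, code: Dict[str, str]) -> str:
    """Plain sequential prefix-code decoder, used here only to check encode."""
    inverse = {c: s for s, c in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)


text = "ababacabac"
code = huffman_code(text)
bits = encode(text, code)
```

The prefix sum is itself a standard parallel primitive, so both steps of `encode` map onto data-parallel hardware.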

Page 12: Markov Transform

Quality scores: locally alike, high data redundancy, hard to compress.
We propose the Markov transform: lightweight, high parallelism, a good solution.

String:        1 3 2 3 1 2 0 3 1 3 1 2 2 3 1 3 1 2 2 2 2 ...
Coded string:  1 1 1 1 0 0 2 0 0 1 0 0 0 1 0 1 0 0 0 0 0

Step 1. Use the adjacent characters in the input to form character pairs and count the frequency of each pair (parallelism on GPU by using atomicAdd). In each entry (A,B), A stores the frequency of a pair and B represents the second character; the row index is the first character.

       0      1      2      3
  0  (0,0)  (0,1)  (0,2)  (1,3)
  1  (0,0)  (0,1)  (3,2)  (3,3)
  2  (1,0)  (0,1)  (4,2)  (2,3)
  3  (0,0)  (5,1)  (1,2)  (0,3)

Step 2. Use the frequency to sort each row in the table (one row is sorted by one block of threads).

       0      1      2      3
  0  (1,3)  (0,0)  (0,1)  (0,2)
  1  (3,2)  (3,3)  (0,0)  (0,1)
  2  (4,2)  (2,3)  (1,0)  (0,1)
  3  (5,1)  (1,2)  (0,0)  (0,3)

Step 3. Lookup table: only the characters are stored.

       0  1  2  3
  0    3  0  1  2
  1    2  3  0  1
  2    2  3  0  1
  3    1  2  0  3

Step 4. Look up the table: use the previous character as the row index, search for the current character to get its column index, and take that index as its coding value (each character can be processed in parallel).
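The four steps can be checked against the slide's own numbers with a short CPU sketch. Two details are assumptions inferred from the tables rather than stated in the text: ties in frequency are broken by the smaller character value, and the first character is emitted unchanged. The function names are hypothetical, and the counting loop is sequential where the slide uses atomicAdd on the GPU.

```python
from typing import List


def build_lookup(symbols: List[int], alphabet_size: int) -> List[List[int]]:
    """Steps 1-3: count adjacent pairs, then sort each row by frequency."""
    freq = [[0] * alphabet_size for _ in range(alphabet_size)]
    for prev, cur in zip(symbols, symbols[1:]):
        freq[prev][cur] += 1
    # Descending frequency; ties broken by character value (assumption).
    return [sorted(range(alphabet_size), key=lambda c, row=row: (-row[c], c))
            for row in freq]


def markov_encode(symbols: List[int], lookup: List[List[int]]) -> List[int]:
    """Step 4: replace each character by its column index in the previous
    character's row. The first character is kept as-is (assumption)."""
    if not symbols:
        return []
    coded = [symbols[0]]
    for prev, cur in zip(symbols, symbols[1:]):
        coded.append(lookup[prev].index(cur))
    return coded


def markov_decode(coded: List[int], lookup: List[List[int]]) -> List[int]:
    """Inverse lookup; sequential, since each step needs the previous output."""
    if not coded:
        return []
    out = [coded[0]]
    for idx in coded[1:]:
        out.append(lookup[out[-1]][idx])
    return out


string = [1, 3, 2, 3, 1, 2, 0, 3, 1, 3, 1, 2, 2, 3, 1, 3, 1, 2, 2, 2, 2]
lookup = build_lookup(string, alphabet_size=4)
coded = markov_encode(string, lookup)
```

The coded string is dominated by small values, which is what makes the following Huffman stage effective. Encoding is independent per character because every lookup only needs the original neighbours, whereas `markov_decode` as written here is serial.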

Page 13: bzip2: Challenge for GPU Acceleration

[Figure: two plots for a FASTQ file, comparing gzip, bzip2, lzip and lzo. One shows compression rate (MB/s) versus compression level (1-9); the other shows compression ratio (%) versus compression level (1-9).]

• bzip2: good and stable compression ratio, but low compression rate
• Data dependency leads to difficulties for GPU acceleration!

Patel, R. A., Zhang, Y., Mak, J., Davidson, A., & Owens, J. D. (2012). Parallel lossless data compression on the GPU (pp. 1-9). IEEE.

Page 14: bzip2: Workflow (Bzip2-like Compression Method Pipeline)

Stage                        Output of the stage                          Example
Data                         string of length N                           a b a b a c a b a c
Burrows-Wheeler Transform    BW-transformed string of length N            c c b b b a a a a a
Move-to-Front Transform      N-sized byte array of indices to MTF list    99 0 99 0 0 99 0 0 0 0
Huffman Coding               M-sized bit string of encoded data           1 0 1 0 0 1 0 0 0 0
Compressed Data

Compression runs the stages from top to bottom; decompression runs them in the opposite direction.

Callout on the pipeline: "Most time-consuming!!!"
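The first two stages of the pipeline are small enough to reproduce on the CPU. The sketch below is a naive reference version, not the GPU implementation: `bwt` sorts all rotations directly, and `mtf` starts from the list of byte values 0..255, which is the choice that yields the indices shown in the example. The function names are hypothetical.

```python
from typing import List, Tuple


def bwt(s: str) -> Tuple[str, int]:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last
    column and the row index of the original string."""
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)   # s[-1] wraps around for i == 0
    return last, order.index(0)


def mtf(data: bytes) -> List[int]:
    """Move-to-front over the byte alphabet, starting from 0..255."""
    table = list(range(256))
    out = []
    for byte in data:
        idx = table.index(byte)
        out.append(idx)
        table.insert(0, table.pop(idx))       # move the symbol to the front
    return out


def inverse_mtf(indices: List[int]) -> bytes:
    """Undo mtf by replaying the same list updates."""
    table = list(range(256))
    out = bytearray()
    for idx in indices:
        byte = table.pop(idx)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)


last, index = bwt("ababacabac")
codes = mtf(last.encode("ascii"))
```

After BWT groups equal characters together, MTF turns each run into zeros, leaving only two distinct values here (99 and 0) for the Huffman stage to encode.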

Page 15: Increase Parallelism of BWT (Burrows-Wheeler Transformation)

String: a b a b a c a b a c

Rotation                 Sorting (most compute intensive)
a b a b a c a b a c      a b a b a c a b a c    <- Index 0
c a b a b a c a b a      a b a c a b a b a c
a c a b a b a c a b      a b a c a b a c a b
b a c a b a b a c a      a c a b a b a c a b
a b a c a b a b a c      a c a b a c a b a b
c a b a c a b a b a      b a b a c a b a c a
a c a b a c a b a b      b a c a b a b a c a
b a c a b a c a b a      b a c a b a c a b a
a b a c a b a c a b      c a b a b a c a b a
b a b a c a b a c a      c a b a c a b a b a

BWT string (last column of the sorted rotations): c c b b b a a a a a

Prefix doubling*

index  0 1 2 3 4 5 6 7 8 9
S      a b a b a c a b a c
R1     0 1 0 1 0 2 0 1 0 2
R2     0 2 0 2 1 3 0 2 1 3
R4     0 3 1 4 2 5 1 4 2 5
R8     0 5 2 7 4 9 1 6 3 8
SA     0 6 2 8 4 1 7 3 9 5

• Radix sorting: use the high-performance sorting library b40c to sort the rank array
• Transform the rank array to the suffix array in parallel
• Get the result of BWT in parallel

* Sun, Weidong, and Zongmin Ma. "Parallel lexicographic names construction with CUDA." Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on. IEEE, 2009.
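The rank arrays R1, R2, R4 and R8 can be reproduced with a short prefix-doubling sketch over cyclic rotations. It is a CPU illustration only: Python's built-in sort replaces the b40c radix sort, the function name is hypothetical, and the final suffix array is only meaningful when all rotations of the string are distinct (an assumption that holds for this example).

```python
from typing import Dict, List, Tuple


def prefix_doubling(s: str) -> Tuple[Dict[int, List[int]], List[int]]:
    """Return the rank arrays keyed by prefix length, and the suffix array."""
    n = len(s)
    alphabet = sorted(set(s))
    rank = [alphabet.index(ch) for ch in s]            # R1: rank by 1 character
    history = {1: rank[:]}
    h = 1
    while h < n and max(rank) < n - 1:
        # Rank by 2h characters = pair (rank of first h, rank of next h).
        keys = [(rank[i], rank[(i + h) % n]) for i in range(n)]
        index = {k: r for r, k in enumerate(sorted(set(keys)))}
        rank = [index[k] for k in keys]
        h *= 2
        history[h] = rank[:]
    # Ranks are now a permutation (assuming distinct rotations): invert it.
    sa = [0] * n
    for pos, r in enumerate(rank):
        sa[r] = pos
    return history, sa


s = "ababacabac"
history, sa = prefix_doubling(s)
bwt_string = "".join(s[p - 1] for p in sa)             # character before each rotation
```

Each round only sorts fixed-size integer pairs, and both the inversion of the rank array and the final gather of `bwt_string` are independent per element, which is why these steps parallelize well.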

Page 16: Improve Parallelism of BWT Reverse

BWT string:         c c b b b a a a a a
SA (suffix array):  0 6 2 8 4 1 7 3 9 5    (Index 0 and Index 5 are stored)

BWT reverse
• Backward reverse: one character after another from a start position
• Serial in nature!
  [Figure: the serial reverse, recovering one character at a time.]

Solution
• Sort the BWT string (sorting plays its role!)
    BWT string with positions:      c c b b b a a a a a
                                    0 1 2 3 4 5 6 7 8 9
    Sorted, positions carried:      a a a a a b b b c c
                                    5 6 7 8 9 2 3 4 0 1
• More indices are stored in the BWT process to parallelize the BWT reverse
• Different threads start at different positions for the BWT reverse simultaneously

    Thread 1:  a   ab   aba   abab   ababa
    Thread 2:  c   ca   cab   caba   cabac
    Result:    ababacabac
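A CPU sketch of the idea: the forward transform additionally records, for every `chunk`-th text position, the row of the sorted rotation matrix at which that position starts, and the reverse then decodes each chunk independently from its stored row. The function names and the chunk parameter are hypothetical, and the sketch assumes all rotations of the input are distinct.

```python
from typing import List, Tuple


def bwt_with_checkpoints(s: str, chunk: int) -> Tuple[str, List[int]]:
    """Naive BWT that also stores one start row per chunk of the text."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in sa)
    row_of = {pos: row for row, pos in enumerate(sa)}
    starts = [row_of[p] for p in range(0, n, chunk)]   # rows for positions 0, chunk, ...
    return last, starts


def inverse_bwt_chunked(last: str, starts: List[int], chunk: int) -> str:
    n = len(last)
    # Stable sort of positions by character: nxt[row] is the row whose
    # rotation starts one text position later than the rotation in `row`.
    nxt = sorted(range(n), key=lambda i: last[i])
    first = sorted(last)                               # first column of the matrix
    pieces = []
    # Each iteration is independent of the others, so on a GPU every chunk
    # could be decoded by a different thread at the same time.
    for k, row in enumerate(starts):
        length = min(chunk, n - k * chunk)
        out = []
        for _ in range(length):
            out.append(first[row])
            row = nxt[row]
        pieces.append("".join(out))
    return "".join(pieces)


last, starts = bwt_with_checkpoints("ababacabac", chunk=5)
nxt_example = sorted(range(len(last)), key=lambda i: last[i])
restored = inverse_bwt_chunked(last, starts, chunk=5)
```

With `chunk=5` the two stored rows correspond to text positions 0 and 5, and the two chunks decode to "ababa" and "cabac" as in the thread trace above. `nxt_example` equals the position row obtained by sorting the BWT string.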

Page 17: Radix Sort on GPU

[Figure: radix sort of n keys. Each k-bit key is split into d-bit digits. In each of rounds 1..r, the keys are counted per digit value (count 1, count 2, ..., count 2^d) and scattered to bucket 1, bucket 2, ..., bucket 2^d.]

• Rounds of sorting: r = k/d
• Memory reads and writes: (2n + n) * r
• Radix sort as implemented by b40c is memory-bandwidth bound
• Sorting is still the bottleneck!
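To put the formula in perspective, the memory traffic of radix sort can be compared with the 2n + 2n figure that the talk gives for sample sort. The key width and digit width used below (32 and 8 bits) are made-up illustration values, not parameters reported for b40c, and the function names are hypothetical.

```python
def radix_sort_traffic(n: int, key_bits: int, digit_bits: int) -> int:
    """Key reads and writes for radix sort: (2n + n) * r with r = k/d rounds."""
    rounds = -(-key_bits // digit_bits)      # ceiling division, equals k/d when d divides k
    return (2 * n + n) * rounds


def sample_sort_traffic(n: int) -> int:
    """Key reads and writes for sample sort as stated in the talk: 2n + 2n."""
    return 2 * n + 2 * n


n = 1000
radix = radix_sort_traffic(n, key_bits=32, digit_bits=8)   # hypothetical k and d
sample = sample_sort_traffic(n)
```

Under these assumed parameters radix sort moves three times as many keys as sample sort, and the gap grows with the number of rounds, which matters when the sort is memory-bandwidth bound.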

Page 18: Sample Sort on GPU*

n keys (Part 0, Part 1, Part 2):   15 12 5 9 2 17 13 8 11 3 6 18 7 1 14 10 16 4
Select the samples:                15 9 13 3 7 10
Sort the samples:                  3 7 9 10 13 15
Select pivots and scatter keys to the buckets:
    Bucket 0: 5 2 3 6 1 4     Bucket 1: 12 9 8 11 7 10     Bucket 2: 15 17 13 18 14 16
Locally sort each bucket:
    Bucket 0: 1 2 3 4 5 6     Bucket 1: 7 8 9 10 11 12     Bucket 2: 13 14 15 16 17 18

• Efficient utilization of the shared memory
• Memory reads and writes: 2n + 2n

* Leischner, N., Osipov, V., & Sanders, P. (2010, April). GPU sample sort. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on (pp. 1-10). IEEE.
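The example can be replayed with a small CPU sketch. The slide does not state how the samples and pivots are chosen, so the rules below are assumptions picked because they reproduce the numbers shown: every third key is sampled, and the 2nd and 5th sorted samples (7 and 13) serve as pivots. The function name is hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def sample_sort(keys: List[int], pivots: List[int]) -> Tuple[List[List[int]], List[int]]:
    """Scatter keys into len(pivots) + 1 buckets, then sort each bucket locally."""
    buckets: List[List[int]] = [[] for _ in range(len(pivots) + 1)]
    for key in keys:
        # Bucket j receives the keys with pivots[j-1] <= key < pivots[j].
        buckets[bisect_right(pivots, key)].append(key)
    # The buckets are independent, so each could be sorted by its own thread block.
    result = [k for bucket in buckets for k in sorted(bucket)]
    return buckets, result


keys = [15, 12, 5, 9, 2, 17, 13, 8, 11, 3, 6, 18, 7, 1, 14, 10, 16, 4]
samples = keys[::3]                          # assumed sampling rule
sorted_samples = sorted(samples)
pivots = [sorted_samples[1], sorted_samples[4]]   # assumed pivot rule: 7 and 13
buckets, result = sample_sort(keys, pivots)
```

Every key is read once to be scattered and once more to be sorted inside its bucket, which matches the 2n + 2n traffic figure when the per-bucket sort fits in fast on-chip memory.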

Page 19: Performance of GPU-Accelerated Algorithms

Compression method    Transform    Reverse transform
BWT                   4x           31x
MTF                   8x           8x
Markov                18x          N/A
Huffman               2x           5x

Callouts:
• The improvement of the parallelism in BWT reverse
• The intrinsic parallelism of the newly designed Markov transform

Platform: CPU: Intel Xeon E5630 @ 2.53 GHz; GPU: Tesla M2050 / 3 GB; CUDA: 4.0

Page 20: Compression Performance of FASTQ File

                      Compression rate (MB/s)    Decompression rate (MB/s)    Compression
Compression method    M2050        K20c          M2050        K20c            ratio (%)
bzip2 (CPU)           8.24                       26.64                        24.31
gzip (CPU)            8.45                       114.23                       29.46
BWT+MTF+Huffman       16.04        21.95         73.90        83.66           31.06
Markov+Huffman        115.60       204.86        77.23        90.87           36.68
Huffman               179.83       215.85        128.39       142.16          60.50
This work             77.80        97.26         124.37       127.78          24.77

For the CPU rows, the single rate shown was measured on the CPU, not on either GPU.

• Similar compression ratio
• 11.8x speedup for compression (this work on K20c versus bzip2 on CPU: 97.26 / 8.24)
• 4.8x speedup for decompression (127.78 / 26.64)
• (Improvement possible with more work)

Page 21: Compression Performance of SAM File

                      Compression rate (MB/s)    Decompression rate (MB/s)    Compression
Compression method    M2050        K20c          M2050        K20c            ratio (%)
bzip2 (CPU)           6.27                       24.63                        26.71
gzip (CPU)            7.73                       106.41                       32.26
BWT+MTF+Huffman       15.99        21.78         74.14        80.04           32.55
Markov+Huffman        116.71       206.18        87.09        90.14           39.66
Huffman               177.11       222.22        127.48       144.49          57.69
This work             87.45        98.14         139.93       149.68          26.46

For the CPU rows, the single rate shown was measured on the CPU, not on either GPU.

• Similar compression ratio
• 15.6x speedup for compression (this work on K20c versus bzip2 on CPU: 98.14 / 6.27)
• 6.1x speedup for decompression (149.68 / 24.63)
• (Improvement possible with more work)

Page 22: Markov Transform for Quality Scores

                      Compression rate (MB/s)    Decompression rate (MB/s)    Compression
Compression method    M2050        K20c          M2050        K20c            ratio (%)
bzip2 (CPU)           8.93                       22.61                        38.47
gzip (CPU)            9.83                       93.59                        42.35
BWT+MTF+Huffman       9.14         12.38         57.01        75.18           43.86
Huffman               185.32       231.16        121.66       129.28          49.42
Markov+Huffman        97.75        176.65        83.67        88.10           42.01

For the CPU rows, the single rate shown was measured on the CPU, not on either GPU.

Page 23: Comparison to Domain-Specific Methods

Compression method                Compression rate (MB/s)    Decompression rate (MB/s)    Compression ratio (%)
gzip                              12.2                       45.4                         35.35
bzip2                             7.0                        13.0                         29.05
SCALCE                            7.8                        13.1                         25.72
DSRC                              13.5                       32.2                         24.77
quip                              8.3                        10.9                         22.19
fasqz                             4.6                        3.8                          21.95
fqzcomp                           8.2                        8.3                          21.72
Seqsqueeze1                       0.6                        0.6                          21.87
Column major block compression    111.0                      104.4                        29.46

Page 24: Conclusion

• We presented an adaptive compression framework for genomics data, accelerated by GPU, which works very well
  – Column-major compression
  – Novel algorithm for data like quality scores
  – Generic and extensible
• Compression on GPU is not easy; sorting is still a bottleneck

Page 25

Contact:    [email protected]