1 Tien Huynh 1, Michalis Vlachos 2, Isidore Rigoutsos 3 EDBT 2010 March 22-26 2010 Anchoring...

Post on 17-Dec-2015


1

Tien Huynh¹, Michalis Vlachos², Isidore Rigoutsos³

EDBT 2010, March 22-26, 2010

Anchoring Millions of Distinct Reads on the Human Genome Within Seconds

¹ IBM TJ Watson Research, NY, USA
² IBM Zurich Research Laboratory, CH
³ Thomas Jefferson University, USA

2

Introduction - Past
• Bioinformatics Sequencing Evolution

Human Genome
• 3.3B nucleotides
• 825MB raw data
• 20MB compressed
• 25,000 genes
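The 825MB figure is consistent with packing each nucleotide into 2 bits. That packing is an assumption about how the number was derived, not something the slide states; a quick check under that assumption:

```python
# Back-of-the-envelope check, assuming 2 bits per nucleotide.
nucleotides = 3_300_000_000      # 3.3B nucleotides, from the slide
bits_per_nucleotide = 2          # assumption: A, C, G, T fit in 2 bits
raw_bytes = nucleotides * bits_per_nucleotide // 8
raw_megabytes = raw_bytes / 1_000_000
print(raw_megabytes)
```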

3

Introduction – Past, Present, Future

It took 13 years for teams of scientists around the globe to first read the human genome – completing the project in 2001.

In 2007, it took 2 months to sequence the genome of DNA-co-discoverer James Watson.

By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg.

“Science 2009”

4

Question from industry: within 24 hours, can we find, FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome?

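What we want to do

[Figure: a new-generation sequencer produces millions of short reads of 20-75 nts each (e.g. C G T T A C G, or A C G ? A G T A with a wildcard), which must be located on the reference genome (e.g. A C G T T A C G).]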

5

Applications
• Cancer research: help isolate cancer-initiating mutations
• Better design of siRNAs (short interfering RNAs)
  – RNA sequences of ~21 nucleotides -> RNA interference
  – Therapeutic gene silencing
  – Cancer therapy by tumor-related gene targeting
• Junk DNA analysis
• Re-sequencing
• etc.

6

How can we achieve that?
• Solve as a hash join between the reference and the query database
  – Reference genome/database (e.g. human genome)
    • Static, built one time
  – Query database (produced fragments)
    • Dynamic, recreated on every experiment
  – How fast can we achieve this on a commodity desktop?
• We call the technique QPick (= Quick Pick)
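The hash-join idea can be illustrated with a toy example. This is a minimal sketch, not QPick itself: it assumes fixed-length reads, exact matches only, and plain string keys. The names `build_index` and `anchor` are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Build side of the join: map each k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def anchor(reads, index):
    """Probe side of the join: look up every read and collect its positions."""
    return {read: index.get(read, []) for read in reads}

reference = "ACGTTACGACGTTACG"   # toy stand-in for a reference genome
index = build_index(reference, k=8)
hits = anchor(["ACGTTACG", "CGTTACGA", "TTTTTTTT"], index)
print(hits)
```

The reference index is built once and reused, while the set of reads changes with every experiment, which mirrors the static/dynamic split described above.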

7

[Figure: QPick pipeline.
Target database: genomic sequence (the database) -> k-mer extraction -> head and tail separation -> key extraction & bit packetization -> hashtable.
Query database: short-read generator -> output (with wildcards) -> head & tail separation -> query set expansion (wildcards + reverse strand) -> bit packetization / hashtable construction.
The target database and the query database are then combined with a hashtable join.]

8

QPick - Advantages

• Disk based, small memory requirements
  – Other techniques run out of memory on large datasets…
  – 3GB RAM is sufficient to
    a) Index
    b) Search the human genome
    c) Query millions of short fragments

9

QPick - Advantages

• Highly parallelizable
  – A single CPU/core is sufficient.

10

QPick - Advantages

• Flexibility
  – Rigid and wildcard matches

[Figure: a query containing wildcards (A C G ? A G T A ?) is matched against the reference genome; several candidate matches are shown along the reference.]

11

QPick - Advantages

• Completeness of results
  – Various competitors failed to return all matches

[Figure: short reads, including one with a wildcard (A C G ? A G T A), anchored on the reference genome A C G T T A C G.]

QPick does not miss any matches

12

Technical Highlights
• Data compression
  – Exploit the small DNA alphabet
• Fast bitwise operations
  – Take advantage of 64-bit word comparisons
• Simple implementation (hash based)
• Data pruning

[Figure: the nucleotides A C G T packed as bits (…0011100011…) into a 64-bit word.]

13

Data Representation

[Figure: a window of length Lmax slides along the reference genome, extracting one subsequence per position.]

14

Data Representation - Head

Codebook (2 bits per nucleotide):
A  00
G  01
T  10
C  11

[Figure: a window over the reference genome A G T C A G T C A G T C … is split into a head and a tail.]

Head:    A  G  T  C  A  G  T  C  A  G  T  C  A  G
         00 01 10 11 00 01 10 11 00 01 10 11 00 01
Binary:  0001101100011011000110110001
Decimal: 28,422,577

Hashtable position 28,422,577 -> tail
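The head encoding can be reproduced in a few lines. This sketch uses the codebook from the slide; the function name `encode_head` is hypothetical, and packing the first symbol into the most significant bits is inferred from the binary string shown.

```python
# 2-bit codebook from the slide: A=00, G=01, T=10, C=11.
HEAD_CODEBOOK = {"A": 0b00, "G": 0b01, "T": 0b10, "C": 0b11}

def encode_head(head):
    """Pack a head into an integer, 2 bits per symbol, first symbol most significant."""
    key = 0
    for symbol in head:
        key = (key << 2) | HEAD_CODEBOOK[symbol]
    return key

head = "AGTCAGTCAGTCAG"          # the 14-symbol head from the slide
key = encode_head(head)
print(format(key, "028b"), key)
```

A 14-symbol head needs 28 bits, so every possible head maps to a distinct integer below 2^28, which is what allows it to serve directly as the hashtable key.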

15

Data Representation

Advantages
– Head = key for the hashtable
– Reduction in dataset size
  • We don't have to explicitly store the head

16

Data Representation - Tail

Codebook (4 bits per symbol):
A  0001
G  0010
T  0100
C  1000
.  0000

[Figure: the window over the reference genome is split into head and tail; the tail is padded with '.' to the proper length.]

Tail: 16 nts, 16 x 4 bits = one 64-bit word

Stored entry: 0100 1000 etc. … 0000, followed by the position of the window (how many shifts we have made since the beginning of the string), here 1

Matching: pattern p matches query q if [p & q = q]
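The tail test `p & q = q` can be tried out directly. This is a minimal sketch assuming the 4-bit codebook above and left-aligned tails padded with '.'; the names `encode_tail` and `matches` are hypothetical.

```python
# 4-bit one-hot codebook from the slide; '.' (padding / wildcard) is 0000.
TAIL_CODEBOOK = {"A": 0b0001, "G": 0b0010, "T": 0b0100, "C": 0b1000, ".": 0b0000}
TAIL_LENGTH = 16                 # 16 symbols x 4 bits = one 64-bit word

def encode_tail(tail):
    """Pad a tail with '.' to 16 symbols and pack it into a 64-bit integer."""
    if len(tail) > TAIL_LENGTH:
        raise ValueError("tail longer than 16 symbols")
    word = 0
    for symbol in tail.ljust(TAIL_LENGTH, "."):
        word = (word << 4) | TAIL_CODEBOOK[symbol]
    return word

def matches(p, q):
    """A stored pattern p matches a query q when every bit set in q is set in p."""
    return (p & q) == q

p = encode_tail("AAAAAAAATT")                    # tail stored for the reference
print(matches(p, encode_tail("AAAA.AAATT")))     # wildcard in the query
print(matches(p, encode_tail("AAAAGAAATT")))     # one mismatching nucleotide
```

Because each nucleotide sets exactly one bit, a wildcard nibble of 0000 in the query is satisfied by anything, while a query nucleotide aligned with padding in the pattern fails the test.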

17

Example

Reference genome (24 symbols): ATAGACTAAAAAAAAAAAAAAATT
Window: 30 symbols (14-symbol head + 16-symbol tail), padded with '.'

position  head            tail
1         atagactaaaaaaa  AAAAAAAATT......
2         tagactaaaaaaaa  AAAAAAATT.......
……
8         aaaaaaaaaaaaaa  ATT.............
9         aaaaaaaaaaaaaa  TT..............
10        aaaaaaaaaaaaat  T...............
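The windows in this example can be regenerated with a short script. It is a sketch under one assumption not stated on the slide: a window is emitted only while at least one real symbol remains for the tail, which yields the 10 positions shown. The name `windows` is hypothetical.

```python
HEAD_LENGTH = 14
TAIL_LENGTH = 16

def windows(genome):
    """Yield (position, head, tail) for each 30-symbol window, padding tails with '.'."""
    # Assumption: stop once no real symbol is left for the tail.
    for start in range(len(genome) - HEAD_LENGTH):
        split = start + HEAD_LENGTH
        head = genome[start:split].lower()
        tail = genome[split:split + TAIL_LENGTH].ljust(TAIL_LENGTH, ".")
        yield start + 1, head, tail          # positions are 1-based, as on the slide

genome = "ATAGACT" + "A" * 15 + "TT"          # the 24-symbol example sequence
rows = list(windows(genome))
for position, head, tail in rows:
    print(position, head, tail)
```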

18

Up to now we have seen how to encode the reference genome

Now we show how to encode the query sequences

20

Encoding the query DB

Now, instead of one long sequence, we have many shorter ones (millions).

[Figure: query sequences of different lengths, some ending in '.' symbols, e.g. A G T C A G T C . and A G C .]

We do similar extractions into heads and tails:
• head: 14 symbols = key
• tail: 16 symbols = 64 bits

22

Encoding the query DB

Differences between target db and query db:
1. Query sequences may contain wildcards (in the db they are used only for padding)
2. Query sequences may have variable length (compared to the fixed-size n-grams extracted from the target db)

[Figure: example query sequences of different lengths, e.g. A G T C A G T C . . and A G C . and A G T C G]

24

Encoding the query DB - wildcards

Wildcards are treated differently depending on where they appear:
1. Head. Expanded to the possible symbols
2. Tail. Encoded as binary wildcard 0000

[Figure: the head A G T C . is expanded into four heads: A G T C A, A G T C G, A G T C T and A G T C C.]
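Head expansion amounts to a Cartesian product over the wildcard positions. The sketch below is one way to write it; the name `expand_head` is hypothetical and '.' is assumed to be the wildcard symbol.

```python
from itertools import product

NUCLEOTIDES = "AGTC"             # same symbol order as the codebook

def expand_head(head, wildcard="."):
    """Replace each wildcard in a head by every nucleotide, returning all concrete heads."""
    slots = [NUCLEOTIDES if symbol == wildcard else symbol for symbol in head]
    return ["".join(choice) for choice in product(*slots)]

expanded = expand_head("AGTC.")
print(expanded)
```

Each wildcard multiplies the number of hashtable probes by four, so a head with two wildcards expands to 16 keys. Wildcards in the tail avoid this cost because the 0000 encoding is handled by the bitwise test.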

25

Handling Forward & Reverse DNA strands

[Figure: the reference genome shown as a forward strand and a backward strand.]

If we explicitly encoded the reverse strand, we would be indexing twice as much data.

28

Handling Forward & Reverse DNA strands

[Figure: the reference genome shown as a forward strand and a backward strand.]

Form complementary sequences

[Figure: the sequence A A A G G C C and its complementary sequence G G C C T T T; likewise C C T paired with G G A.]
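The pair AAAGGCC / GGCCTTT on this slide is a reverse complement, which can be computed on the query side instead of indexing the second strand. This is a minimal sketch; the name `reverse_complement` is hypothetical, and leaving wildcard symbols untouched is an assumption.

```python
# Watson-Crick pairing: A<->T, C<->G. Other symbols (e.g. wildcards) are left unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the sequence as read on the opposite strand."""
    return sequence.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAAGGCC"))
```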

29

Overall search process….

32

Was our hash key function a good one? … A bad key function would have led to many collisions…

YES! 98.5% of buckets contain fewer than 10 entries

33

Experiments
• Comparison with other short sequence anchoring tools
• QPick performance
  – Number of wildcards
  – Index creation time
  – Query time
• Datasets
  – Homo sapiens: ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/
  – Mus musculus: ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/

34

COMPARISON: Short Sequence Anchoring Tools

Up to 60x faster …

Tool      Hits        Misses     Time (sec)  Improvement
QPick     10,611,858  0          79          -
BWA       10,611,858  0          149         1.8x
fetchGWI  10,611,856  2          186         2.3x
Bowtie    10,611,858  0          331         4.1x
SOAP      10,573,528  38,330     4728        59.84x
Eland     10,559,799  5,014,059  555         7.0x

36

Comparison with Hash-Based techniques

QPick – 16x faster than FetchGWI

37

Varying the wildcards
• 5-6 seconds retrieval time for 0-1 wildcards
• <60 sec for up to 4 wildcards

38

Full Human-Genome Search

• Time to index
  – Reference genome
  – Query sequences
• Time to search
  – 1M-10M queries (short DNA fragments)

39

Full Genome Search – Index Time

~ 7 hours (on single core CPU)

< 50 sec for query hashtables

40

Full Genome Search – Search Time

• ~130sec to find 1M matches (single core)

41

Conclusions
• QPick: fast and complete search for short sequence fragments
• Takes advantage of:
  – Small DNA alphabet
  – Bit packetization
  – Hash joins
• Up to 60x faster than competing techniques
• Applications for ….
• Future…
