1 Tien Huynh 1, Michalis Vlachos 2, Isidore Rigoutsos 3 EDBT 2010 March 22-26 2010 Anchoring...

Anchoring Millions of Distinct Reads on the Human Genome Within Seconds

Tien Huynh (1), Michalis Vlachos (2), Isidore Rigoutsos (3)

(1) IBM TJ Watson Research, NY, USA
(2) IBM Zurich Research Laboratory, CH
(3) Thomas Jefferson University, USA

EDBT 2010, March 22-26, 2010

Introduction - Past
• Bioinformatics Sequencing Evolution

Human Genome
• 3.3B nucleotides
• 825MB raw data
• 20MB compressed
• 25,000 genes

Introduction – Past, Present, Future

It took 13 years for teams of scientists around the globe to first read the human genome – completing the project in 2001.

In 2007, it took 2 months to sequence the genome of DNA-co-discoverer James Watson.

By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg.

“Science 2009”

Question from industry: Within 24 hours, can we find, FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome?

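What we want to do

[Figure: a new-generation sequencer produces millions of short reads (20-75 nts each, some containing wildcards "?"), e.g. C G T T A C G and A C G ? A G T A, which must be located on the reference genome, e.g. A C G T T A C G.]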

Applications
• Cancer research: help isolate cancer-initiating mutations
• Better design of siRNAs (short interfering RNAs)
  – RNA sequences of ~21 nucleotides -> RNA interference
  – Therapeutic gene silencing
  – Cancer therapy by tumor-related gene targeting
• Junk DNA analysis
• Re-sequencing
• etc.

How can we achieve that?
• Solve as hash join between the reference and the query database
  – Reference Genome/Database (e.g. human genome)
    • Static, one time
  – Query Database (produced fragments)
    • Dynamic, recreated on every experiment
  – How fast can we achieve this on a commodity desktop?
• We call the technique QPick (= Quick Pick)
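The hash-join idea above can be sketched on plain strings. This is only an illustration of the build/probe pattern, not the QPick implementation: the function names `build_reference_index` and `anchor_reads`, the fixed fragment length `k`, and the restriction to exact matches are all assumptions made for the example.

```python
from collections import defaultdict


def build_reference_index(genome: str, k: int) -> dict:
    """Build side (static, done once): map every k-mer to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def anchor_reads(index: dict, reads: list) -> dict:
    """Probe side (dynamic, redone per experiment): look each read up in the index."""
    return {read: index.get(read, []) for read in reads}


genome = "ACGTTACGTTACG"  # toy reference, not real data
index = build_reference_index(genome, k=5)
hits = anchor_reads(index, ["ACGTT", "TTACG", "GGGGG"])
print(hits)  # {'ACGTT': [0, 5], 'TTACG': [3, 8], 'GGGGG': []}
```

The reference side is indexed once and reused, while the probe side is rebuilt for every new batch of fragments, which mirrors the static/dynamic split on the slide.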

[Figure: QPick pipeline overview.
Target database side: genomic sequence (the database); k-mer extraction; split into head and tail; key extraction & bit packetization; hashtable key.
Query database side: short-read generator; output (with wildcards); head & tail separation; query set expansion (wildcards + reverse strand); bit packetization / hashtable construction.
The target database and the query database are combined with a hashtable join.]

QPick - Advantages
• Disk based. Small memory requirements
  – Other techniques run out of memory on large datasets…
  – 3GB RAM is sufficient to
    a) Index
    b) Search Human Genome
    c) Query Millions of Short Fragments
• Highly parallelizable
  – A single CPU/core is sufficient.

QPick - Advantages
• Flexibility
  – Rigid and wildcard matches

[Figure: a query containing two wildcards, A C G ? A G T A ?, matched against the reference genome.]

QPick - Advantages
• Completeness of results
  – Various competitors failed to return all matches
  – QPick does not miss any matches

[Figure: reads such as C G T T A C G and A C G ? A G T A anchored on the reference genome A C G T T A C G.]

Technical Highlights
• Data compression
  – Exploit small DNA alphabet
• Fast bitwise operations
  – Take advantage of 64-bit word comparisons
• Simple implementation (hash based)
• Data pruning

[Figure: the symbols A C G T packed as bits (…0011100011…) into 64-bit words.]

Data Representation

[Figure: a window of length Lmax slides over the reference genome.]

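Data Representation - Head

Codebook: A = 00, C = 01, G = 10, T = 11

[Figure: a window over the reference genome A C G T A C G T A C G T A C …; the first 14 symbols of the window form the head, the rest the tail.]

Head:    00 01 10 11 00 01 10 11 00 01 10 11 00 01
Binary:  0001101100011011000110110001
Decimal: 28,422,577

[Figure: the head value 28,422,577 is shown with the position and the tail of the window.]

Advantages
– Head = key for hashtable
– Reduction in dataset size
  • We don't have to explicitly store the head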
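The head encoding on this slide can be reproduced in a few lines. The sketch below assumes the 2-bit codebook A = 00, C = 01, G = 10, T = 11 read off the slide; the function name `encode_head` is hypothetical.

```python
# 2-bit codebook as read off the slide (A=00, C=01, G=10, T=11).
CODEBOOK = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_head(head: str) -> int:
    """Pack a nucleotide string into an integer, 2 bits per symbol."""
    key = 0
    for symbol in head.upper():
        key = (key << 2) | CODEBOOK[symbol]
    return key


head = "ACGTACGTACGTAC"  # the 14-symbol head from the slide
key = encode_head(head)
print(format(key, "028b"))  # 0001101100011011000110110001
print(key)                  # 28422577
```

Because the integer itself identifies the 14 symbols, using it as the hashtable key means the head never has to be stored alongside the tail.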

Data Representation - Tail

Codebook: 4 bits per symbol; the padding symbol "." is 0000

• tail: 16 nts, padded to proper length with "."
• 16 x 4 bits = one 64-bit word, e.g. 0100 1000 etc … 0000
• Stored together with the position of the window (how many shifts we have made since the beginning of the string…)
• Matching a pattern p against a query q: [p & q = q]
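A minimal sketch of the tail encoding and the `p & q = q` test follows. Only the wildcard/padding code 0000 and the nibbles 0100 and 1000 are visible on the slide; the one-hot assignment of nibbles to A, C, G and T used here is an assumption, and the names `encode_tail` and `tail_matches` are hypothetical.

```python
# Assumed one-hot nibbles; only "." = 0000 is stated explicitly on the slide.
TAIL_CODEBOOK = {".": 0b0000, "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
TAIL_LEN = 16  # 16 symbols x 4 bits = one 64-bit word


def encode_tail(tail: str) -> int:
    """Pad the tail with '.' to 16 symbols and pack it into one 64-bit word."""
    if len(tail) > TAIL_LEN:
        raise ValueError("tail longer than 16 symbols")
    word = 0
    for symbol in tail.upper().ljust(TAIL_LEN, "."):
        word = (word << 4) | TAIL_CODEBOOK[symbol]
    return word


def tail_matches(p: int, q: int) -> bool:
    """Database pattern p matches query q when every bit set in q is set in p."""
    return (p & q) == q


p = encode_tail("GTACGTACGTAC")
print(tail_matches(p, encode_tail("GT.C")))  # True: the wildcard contributes no bits
print(tail_matches(p, encode_tail("GTTC")))  # False: third symbol differs
```

With this encoding a whole 16-symbol tail, wildcards included, is compared with a single AND and a single equality test on one machine word.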

Example

reference genome (24 symbols): ATAGACTAAAAAAAAAAAAAAATT

Each window is 30 symbols: a 14-symbol head and a 16-symbol tail, with "." as padding where the genome ends.

head            tail              position
atagactaaaaaaa  AAAAAAAATT......  1
tagactaaaaaaaa  AAAAAAATT.......  2
……
aaaaaaaaaaaaaa  ATT.............
aaaaaaaaaaaaaa  TT..............
aaaaaaaaaaaaat  T...............
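The windows in this example can be generated mechanically. The sketch below assumes a 14-symbol head, a 16-symbol tail padded with ".", and 1-based positions as shown on the slide; stopping once a full head no longer fits is an assumption, and `extract_windows` is a hypothetical name.

```python
HEAD_LEN = 14
TAIL_LEN = 16


def extract_windows(genome: str, head_len: int = HEAD_LEN, tail_len: int = TAIL_LEN):
    """Yield (position, head, tail) per window; the tail is padded with '.'."""
    for shift in range(len(genome) - head_len + 1):
        head = genome[shift:shift + head_len]
        tail = genome[shift + head_len:shift + head_len + tail_len]
        yield shift + 1, head.lower(), tail.ljust(tail_len, ".")


# The 24-symbol reference from the slide: ATAGACT, fifteen A's, TT.
genome = "ATAGACT" + "A" * 15 + "TT"
windows = list(extract_windows(genome))
for position, head, tail in windows[:2]:
    print(head, tail, position)
# atagactaaaaaaa AAAAAAAATT...... 1
# tagactaaaaaaaa AAAAAAATT....... 2
```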

Up to now we have seen how to encode the reference genome

Now we show how to encode the query sequences

Encoding the query DB
Now, instead of one long sequence, many shorter ones (millions)

[Figure: example query sequences of different lengths, some ending in wildcards ".".]

We do similar extractions in heads and tails
• head: 14 symbols: key
• tail: 16 symbols = 64 bits

Encoding the query DB
Differences between target db and query db:
1. Query sequences may contain wildcards (in the db only used for padding)
2. Query sequences may have variable length (compared to the fixed size n-grams extracted from the target db)

[Figure: example query sequences of different lengths, some containing wildcards ".".]

Encoding the query DB - wildcards

Treated differently depending where they appear
1. Head. Expanded to the possible symbols
2. Tail. Encoded as binary wildcard 0000

[Figure: a wildcard "." in the head is expanded to each of the symbols A, C, G, T.]
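Head wildcards have to be expanded because the head is used as an exact hashtable key. A minimal sketch of that expansion, assuming "." marks a wildcard and using the hypothetical name `expand_head`:

```python
from itertools import product

ALPHABET = "ACGT"


def expand_head(head: str, wildcard: str = ".") -> list:
    """Replace each wildcard in the head by every possible nucleotide."""
    choices = [ALPHABET if symbol == wildcard else symbol for symbol in head]
    return ["".join(combo) for combo in product(*choices)]


print(expand_head("AC.T"))       # ['ACAT', 'ACCT', 'ACGT', 'ACTT']
print(len(expand_head("A..T")))  # 16
```

Each head wildcard multiplies the number of keys to probe by four, whereas a wildcard in the tail costs nothing extra because it is absorbed by the bitwise comparison.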

Handling Forward & Reverse DNA strands

If we explicitly encode the reverse strand we would be indexing twice as much data

Form complementary sequences

[Figure: the reference genome with its forward strand and backward strand; the sequence G G C C T T T and its complementary sequence A A A G G C C.]
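One way to read this slide, together with the "query set expansion (wildcards + reverse strand)" step of the pipeline, is that each query is also searched in its complementary form so that the reference is indexed only once. The sketch below forms the reverse complement of a sequence; the name `reverse_complement` is hypothetical, and wildcard symbols are left unchanged.

```python
# Watson-Crick pairing: A<->T, C<->G. Symbols outside ACGT (e.g. wildcards) are kept.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence read on the opposite strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("GGCCTTT"))  # AAAGGCC
```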

Overall search process….

Was our hash key function a correct one?
…a bad key function would have led to many collisions…

YES! 98.5% of buckets contain fewer than 10 entries
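A bucket-occupancy figure of this kind can be computed from the list of hash keys. The sketch below uses made-up toy keys, not the genome data, and the function name `fraction_small_buckets` is hypothetical.

```python
from collections import Counter


def fraction_small_buckets(keys, threshold: int = 10) -> float:
    """Fraction of non-empty buckets holding fewer than `threshold` entries."""
    bucket_sizes = Counter(keys)
    small = sum(1 for size in bucket_sizes.values() if size < threshold)
    return small / len(bucket_sizes)


# Toy keys: bucket 1 has 12 entries, bucket 2 has 3, bucket 3 has 1, bucket 4 has 9.
keys = [1] * 12 + [2] * 3 + [3] + [4] * 9
print(fraction_small_buckets(keys))  # 0.75
```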

Experiments
• Comparison with other Short Sequence Anchoring Tools
• QPick Performance
  – Number of Wildcards
  – Index Creation Time
  – Query Time
• Datasets
  – Mus Musculus: ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/
  – Homo sapiens: ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/

Up to 60x faster …

COMPARISON: Short Sequence Anchoring Tools

Tool      Hits        Misses     Time (sec)  Improvement
QPick     10,611,858  0          79          -
BWA       10,611,858  0          149         1.8x
fetchGWI  10,611,856  2          186         2.3x
Bowtie    10,611,858  0          331         4.1x
SOAP      10,573,528  38,330     4728        59.84x
Eland     10,559,799  5,014,059  555         7.0x
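The Improvement column appears to be each tool's running time divided by QPick's 79 seconds, with the result truncated rather than rounded; that reading is an inference from the numbers, not something the slide states. A short check using the times from the table:

```python
# Running times in seconds, copied from the comparison table.
times_sec = {
    "QPick": 79,
    "BWA": 149,
    "fetchGWI": 186,
    "Bowtie": 331,
    "SOAP": 4728,
    "Eland": 555,
}

baseline = times_sec["QPick"]
factors = {tool: t / baseline for tool, t in times_sec.items() if tool != "QPick"}

for tool, factor in factors.items():
    print(f"{tool}: {factor:.3f}x slower than QPick")
```

The largest ratio, for SOAP, is the source of the "up to 60x faster" headline.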

Comparison with Hash-Based techniques

QPick – 16x faster than FetchGWI

Varying the wildcards
• 5-6 seconds retrieval time for 0-1 wildcards
• <60 sec for up to 4 wildcards

Full Human-Genome Search
• Time to index
  – Reference Genome
  – Query sequences
• Time to Search
  – 1M-10M queries (short DNA fragments)

Full Genome Search – Index Time

~ 7 hours (on single core CPU)

< 50 sec for query hashtables

Full Genome Search – Search Time

• ~130sec to find 1M matches (single core)

Conclusions
• QPick: Fast and complete search for short sequence fragments
• Takes advantage of:
  – Small DNA alphabet
  – Bit packetization
  – Hash joins
• Up to 60x faster than competing techniques
• Applications for ….
• Future…
