Jimmy Lin, Michael Schatz, and Ben Langmead, University of Maryland. Wednesday, June 10, 2009


Page 1

Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science

Jimmy Lin, Michael Schatz, and Ben Langmead

University of Maryland

Wednesday, June 10, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Page 2

Cloud Computing @ Maryland

Teaching

Cloud computing course (version 1.0): Spring 2008. Part of the Google/IBM Academic Cloud Computing Initiative

Cloud computing course (version 2.0): Fall 2008. Sponsored by Amazon Web Services through a teaching grant

Research

Web-scale text processing

Statistical machine translation

Bioinformatics

Page 3

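Learning Translation Models

[Figure: word alignment between the Spanish words (Maria, no dio, una, bofetada, bruja, verde) and the English words (Mary, did not, a, slap, witch, green), with the phrase pair "green witch" / "bruja verde"]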

Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.

Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.

Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.

These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.

We built systems for “learning” translation models in Hadoop… sort of like the word count example, but with more math
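The word-count analogy can be sketched as follows. This is a minimal illustration, not the systems described on the slide: the function names and the two toy sentence pairs are hypothetical. The map step emits a count of 1 for every (source word, target word) pair in an aligned sentence pair, and the reduce step sums the counts per pair. Real translation-model training estimates probabilities on top of counts like these, which is presumably where the "more math" comes in.

```python
from collections import Counter
from itertools import product


def map_pair(src_sentence, tgt_sentence):
    """Map step: emit ((source word, target word), 1) for every co-occurring pair."""
    for s, t in product(src_sentence.split(), tgt_sentence.split()):
        yield (s, t), 1


def reduce_counts(emitted):
    """Reduce step: sum the emitted values per key, exactly as in word count."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals


# Toy parallel corpus (hypothetical, for illustration only).
corpus = [
    ("bruja verde", "green witch"),
    ("una bruja", "a witch"),
]
emitted = [kv for src, tgt in corpus for kv in map_pair(src, tgt)]
cooccurrence = reduce_counts(emitted)
```

In this toy corpus "bruja" co-occurs with "witch" in both sentence pairs but with "green" in only one, which is the kind of signal a translation model builds on.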

Page 4

Translation as a “Tiling” Problem

Maria no dio una bofetada a la bruja verde

[Figure: candidate English phrases tiled under the source sentence: Mary not / give a slap to the witch green / did not / no / did not give / slap / a slap / to the / to / the / green witch / the witch / by]

Example from Koehn (2006)

Page 5

From Text to DNA Sequences

Text processing: [0-9A-Za-z]+

DNA sequence processing: [ATCG]+

Easier, right? (Nope, not really)
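A small sketch of why the smaller alphabet does not make things easier, using Python's standard `re` module with the two patterns from the slide. The sample strings are taken from elsewhere in this deck; the variable names are illustrative.

```python
import re

TEXT_TOKEN = re.compile(r"[0-9A-Za-z]+")
DNA_TOKEN = re.compile(r"[ATCG]+")

# Text falls apart into words at spaces and punctuation.
text_tokens = TEXT_TOKEN.findall(
    "It was the best of times, it was the worst of times"
)

# A DNA read has no delimiters, so the pattern matches it as one long token.
dna_tokens = DNA_TOKEN.findall("ATCTGATAAGTCCCAGGACTTCAGT")
```

The text yields 12 word tokens, while the DNA string comes back as a single token: there are no word boundaries to exploit.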

Michael Schatz (Ph.D. student, Computer Science; Spring 2008)

Ben Langmead (M.S. student, Computer Science; Fall 2008)

Page 6

Analogy (And two disclaimers)

Page 7

Strangely-Formatted Manuscript

Dickens: A Tale of Two Cities

Text written on a long spool

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Page 8

… With Duplicates

Dickens: A Tale of Two Cities

“Backup” on four more copies

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Page 9

Shredded Book Reconstruction

Dickens accidentally shreds the manuscript

How can he reconstruct the text?

5 copies x 138,656 words / 5 words per fragment = 138k fragments

The short fragments from every copy are mixed together

Some fragments are identical
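The fragment count and the shredding itself can be sketched in a few lines. The `shred` function and its `offset` parameter are hypothetical; the assumption here is that each copy is cut at a different starting point, which is what makes fragments from different copies overlap.

```python
def shred(text, words_per_fragment=5, offset=0):
    """Cut text into fragments of words_per_fragment words.

    A nonzero offset makes the first fragment shorter, so that different
    copies end up cut at different places (an assumption of this sketch).
    """
    words = text.split()
    fragments = [words[:offset]] if offset else []
    fragments += [
        words[i:i + words_per_fragment]
        for i in range(offset, len(words), words_per_fragment)
    ]
    return [" ".join(f) for f in fragments]


opening = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness,"
)
copy_a = shred(opening)            # cut starting at the first word
copy_b = shred(opening, offset=1)  # cut one word later

# Arithmetic from the slide: 5 copies x 138,656 words / 5 words per fragment.
estimated_fragments = 5 * 138_656 // 5
```

With these inputs `copy_a` starts with "It was the best of" and `copy_b` contains "was the best of times,", two fragments that share four words.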

[Figure: five copies of "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …" cut into short fragments]

Page 10

Overlaps

Generally prefer longer overlaps to shorter overlaps

In the presence of error, we might allow the overlapping fragments to differ by a small amount

It was the best of

of times, it was the

best of times, it was

times, it was the worst

was the best of times,

the best of times, it

it was the worst of

was the worst of times,

of times, it was the

times, it was the age

it was the age of

was the age of wisdom,

the age of wisdom, it

age of wisdom, it was

of wisdom, it was the

it was the age of

was the age of foolishness,

the worst of times, it

It was the best of / was the best of times, : 4 word overlap

It was the best of / of times, it was the : 1 word overlap

It was the best of / of wisdom, it was the : 1 word overlap
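The overlap lengths above can be computed with a short function. This is a minimal sketch that handles exact overlaps only; tolerating small differences between fragments, as the slide suggests for noisy data, would need an approximate comparison instead of `==`. The function name is illustrative.

```python
def overlap(left, right):
    """Number of words in the longest suffix of `left` that is a prefix of `right`."""
    a, b = left.split(), right.split()
    # Try the longest possible overlap first, so longer overlaps are preferred.
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


four = overlap("It was the best of", "was the best of times,")
one_a = overlap("It was the best of", "of times, it was the")
one_b = overlap("It was the best of", "of wisdom, it was the")
```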

Page 11

Greedy Assembly

The repeated sequence makes the correct reconstruction ambiguous

It was the best of

of times, it was the

best of times, it was

times, it was the worst

was the best of times,

the best of times, it

of times, it was the

times, it was the age

It was the best of

of times, it was the

best of times, it was

times, it was the worst

was the best of times,

the best of times, it

it was the worst of

was the worst of times,

of times, it was the

times, it was the age

it was the age of

was the age of wisdom,

the age of wisdom, it

age of wisdom, it was

of wisdom, it was the

it was the age of

was the age of foolishness,

the worst of times, it
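A minimal sketch of greedy assembly on word fragments, assuming exact overlaps: repeatedly merge the pair of fragments with the longest overlap until nothing overlaps any more. The function names are hypothetical, and ties are broken by whichever pair is found first, which is exactly where repeats cause trouble.

```python
def overlap(a, b):
    """Longest suffix of word list `a` that is also a prefix of word list `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(fragments):
    """Merge the best-overlapping pair of fragments until no overlaps remain."""
    contigs = [f.split() for f in fragments]
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return [" ".join(c) for c in contigs]


assembled = greedy_assemble(
    ["It was the best of", "was the best of times,", "the best of times, it"]
)

# The repeated fragment overlaps two different continuations equally well.
repeat = "of times, it was the".split()
ambiguous = [
    overlap(repeat, "times, it was the worst".split()),
    overlap(repeat, "times, it was the age".split()),
]
```

The three fragments assemble into one contig, but the repeated fragment "of times, it was the" has a 4-word overlap with both "times, it was the worst" and "times, it was the age", so a greedy choice between them is arbitrary.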

Page 12

The Real Problem (The easier version)

Page 13

GATGCTTACTATGCGGGCCCCCGGTCTAATGCTTACTATGC

GCTTACTATGCGGGCCCCTTAATGCTTACTATGCGGGCCCCTT

TAATGCTTACTATGCAATGCTTAGCTATGCGGGC

AATGCTTACTATGCGGGCCCCTTAATGCTTACTATGCGGGCCCCTT

CGGTCTAGATGCTTACTATGCAATGCTTACTATGCGGGCCCCTTCGGTCTAATGCTTAGCTATGC

ATGCTTACTATGCGGGCCCCTT

[Figure: Subject genome (?) → Sequencer → Reads]

Page 14

DNA Sequencing

ATCTGATAAGTCCCAGGACTTCAGT

GCAAGGCAAACCCGAGCCCAGTTT

TCCAGTTCTAGAGTTTCACATGATC

GGAGTTAGTAAAAGTCCACATTGAG

The genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G

Bacteria: ~5 million bp

Humans: ~3 billion bp

Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)

Shorter reads, but much higher throughput

Per-base error rate estimated at 1-2% (Simpson, et al., 2009)

Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads

~144 GB of compressed sequence data
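A quick back-of-the-envelope check of these numbers. The read counts, read length, and genome size are from the slide; the coverage figure is derived here and is an assumption of this sketch, as is the idea that the slide's "~144" corresponds to 4.0 billion reads x 36 bp (the slide does not say how it was obtained).

```python
READ_LENGTH_BP = 36
HUMAN_GENOME_BP = 3e9  # "~3 billion bp" from the slide

reads_per_study = {
    "Wang et al., 2008": 3.3e9,
    "Bentley et al., 2008": 4.0e9,
}

# Total sequenced bases per study.
total_bases = {
    name: reads * READ_LENGTH_BP for name, reads in reads_per_study.items()
}

# Average number of times each genome position is covered by a read.
coverage = {name: bases / HUMAN_GENOME_BP for name, bases in total_bases.items()}
```

Under these assumptions the two studies sequenced roughly 119 and 144 billion bases, or about 40x and 48x the length of the genome.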

Page 15

How do we put Humpty Dumpty back together?

Page 16

Human Genome

11 years, cost $3 billion… your tax dollars at work!

A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project

Page 17

Alignment

Reference sequence: CGGTCTAGATGCTTAGCTATGCGGGCCCCTT

Subject reads (bracketed T: the read has T where the reference has G):

GCTTA[T]CTAT

TTA[T]CTATGC

A[T]CTATGCGG

A[T]CTATGCGG

GCTTA[T]CTAT

TCTAGATGCT

CTATGCGGGC

CTAGATGCTT

A[T]CTATGCGG

CTATGCGGGC

A[T]CTATGCGG

Page 18

Reference sequence: CGGTCTAGATGCTTATCTATGCGGGCCCCTT

Subject reads:

GCTTATCTAT

TTATCTATGC

ATCTATGCGG

ATCTATGCGG

GCTTATCTAT

GGCCCCTT

GCCCCTT

CCTT

CGG

CGGTC

CGGTCT

CGGTCTAG

TCTAGATGCT

CTATGCGGGC

CTAGATGCTT

CTT

ATGCGGGCCC

Page 19

Reference: ATGAACCACGAACACTTTTTTGGCAACGATTTAT…

Query: ATGAACAAAGAACACTTTTTTGGCCACGATTTAT…

Insertion

Deletion

Mutation
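For the reference and query shown here, the differences can be found with a position-by-position comparison. This is a minimal sketch that detects substitutions only; insertions and deletions shift the rest of the sequence and need an edit-distance style alignment instead. The function name is illustrative.

```python
def mismatches(reference, query):
    """Return (position, reference base, query base) wherever the two differ."""
    return [
        (i, r, q)
        for i, (r, q) in enumerate(zip(reference, query))
        if r != q
    ]


reference = "ATGAACCACGAACACTTTTTTGGCAACGATTTAT"
query = "ATGAACAAAGAACACTTTTTTGGCCACGATTTAT"
diffs = mismatches(reference, query)
```

On these two strings the comparison reports three substituted bases: two C to A changes near the start and one A to C change further along.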

Page 20

CloudBurst

1. Map: Catalog K-mers
• Emit every k-mer in the genome and non-overlapping k-mers in the reads
• Non-overlapping k-mers sufficient to guarantee an alignment will be found

2. Shuffle: Coalesce Seeds
• Hadoop internal shuffle groups together k-mers shared by the reads and the reference
• Conceptually build a hash table of k-mers and their occurrences

3. Reduce: End-to-end alignment
• Locally extend alignment beyond seeds by computing “match distance”
• If read aligns end-to-end, record the alignment

[Figure: Human chromosome 1, Read 1, and Read 2 pass through Map, Shuffle, and Reduce]

Read 1, Chromosome 1, 12345-12365

Read 2, Chromosome 1, 12350-12370
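The three steps can be simulated in memory without Hadoop. This is a minimal sketch, not CloudBurst's actual code: the function names, the tuple layout, and the use of a plain mismatch count as the distance are all assumptions. The seed length k = read length // (mismatches + 1) follows the pigeonhole argument: a read with at most m mismatches, cut into m + 1 non-overlapping pieces, has at least one piece that matches the reference exactly.

```python
from collections import defaultdict


def map_reference(name, sequence, k):
    """Map: emit every (overlapping) k-mer of the reference with its position."""
    for pos in range(len(sequence) - k + 1):
        yield sequence[pos:pos + k], ("ref", name, pos)


def map_read(name, sequence, k):
    """Map: emit the non-overlapping k-mers of a read with their offsets."""
    for offset in range(0, len(sequence) - k + 1, k):
        yield sequence[offset:offset + k], ("read", name, offset)


def shuffle(pairs):
    """Shuffle: group values by k-mer, like a hash table of occurrences."""
    groups = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups


def reduce_seeds(groups, reference, reads, max_mismatches):
    """Reduce: extend each shared seed to an end-to-end alignment of the read."""
    alignments = set()
    for values in groups.values():
        ref_hits = [v for v in values if v[0] == "ref"]
        read_hits = [v for v in values if v[0] == "read"]
        for _, ref_name, ref_pos in ref_hits:
            for _, read_name, offset in read_hits:
                read = reads[read_name]
                start = ref_pos - offset
                if start < 0 or start + len(read) > len(reference):
                    continue  # the read would hang off the reference
                window = reference[start:start + len(read)]
                distance = sum(a != b for a, b in zip(window, read))
                if distance <= max_mismatches:
                    # End coordinate is exclusive (half-open interval).
                    alignments.add(
                        (read_name, ref_name, start, start + len(read), distance)
                    )
    return sorted(alignments)


# Reference and reads borrowed from the alignment slide earlier in the deck.
reference = "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"
reads = {"read1": "GCTTATCTAT", "read2": "TCTAGATGCT"}
max_mismatches = 1
k = 10 // (max_mismatches + 1)  # seed length for 10 bp reads

pairs = list(map_reference("chr", reference, k))
for name, seq in reads.items():
    pairs.extend(map_read(name, seq, k))
alignments = reduce_seeds(shuffle(pairs), reference, reads, max_mismatches)
```

Here `read1` aligns at position 10 with one mismatch and `read2` aligns at position 3 exactly; `read2` is found through both of its seeds, and the set removes the duplicate.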

Page 21

[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads; y-axis: Runtime (s); legend: 0, 1, 2, 3, 4]

[Figure: Running Time vs Number of Reads on Chr 22. x-axis: Millions of Reads; y-axis: Runtime (s); legend: 0, 1, 2, 3, 4]

Results from a small, 24-core cluster, with different numbers of mismatches

Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

Page 22

[Figure: Running Time on EC2 High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s)]

CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2

Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

Page 23

What’s Next? (Michael Schatz’s Ph.D. dissertation)

Page 24

Wait, no reference?

Page 25

de Bruijn Graph Construction

Dk = (V, E)

V = All length-k subfragments (k < l)

E = Directed edges between consecutive subfragments

Nodes overlap by k-1 words

Locally constructed graph reveals the global sequence structure

Overlaps implicitly computed

Original Fragment: It was the best of

Directed Edge: It was the best → was the best of

de Bruijn, 1946

Idury and Waterman, 1995

Pevzner, Tang, Waterman, 2001
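The construction can be sketched on word fragments as follows. This is a minimal illustration with hypothetical names: nodes are the k-word windows of each fragment and a directed edge joins each window to the next one, so the overlaps never have to be computed pairwise. Successors are stored in a set, which drops edge multiplicities; a full assembler would keep them.

```python
from collections import defaultdict


def de_bruijn(fragments, k):
    """Build a word-level de Bruijn graph: node -> set of successor nodes."""
    edges = defaultdict(set)
    for fragment in fragments:
        words = fragment.split()
        # All length-k windows of this fragment, in order.
        windows = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
        # Consecutive windows overlap by k-1 words and are joined by an edge.
        for left, right in zip(windows, windows[1:]):
            edges[left].add(right)
    return edges


graph = de_bruijn(["It was the best of", "was the best of times,"], k=4)
```

The 5-word fragment "It was the best of" contributes the single edge from "It was the best" to "was the best of", matching the example on the slide.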

Page 26

de Bruijn Graph Assembly

the age of foolishness

It was the best

best of times, it

was the best of

the best of times,

of times, it was

times, it was the

it was the worst

was the worst of

worst of times, it

the worst of times,

it was the age

was the age of

the age of wisdom,

age of wisdom, it

of wisdom, it was

wisdom, it was the

Page 27

Compressed de Bruijn Graph

Unambiguous non-branching paths replaced by single nodes

An Eulerian traversal of the graph spells a compatible reconstruction of the original text

There may be many traversals of the graph

Different sequences can have the same string graph

It was the best of times, it was the worst of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

of times, it was the

It was the best of times, it

it was the age of

the age of wisdom, it was the

it was the worst of times, it

the age of foolishness
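Path compression can be sketched as follows, assuming a word-level graph with 4-word nodes built from the opening sentence. The function names are hypothetical. A node is absorbed into its predecessor when it has exactly one predecessor and that predecessor has exactly one successor; cycles with no branching at all have no starting node and are ignored by this sketch. The Eulerian traversal itself is not implemented here.

```python
from collections import defaultdict


def de_bruijn(text, k):
    """Edges between consecutive k-word windows of text (node -> successors)."""
    words = text.split()
    windows = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    edges = defaultdict(set)
    for left, right in zip(windows, windows[1:]):
        edges[left].add(right)
    return edges


def compress(edges):
    """Replace each unambiguous non-branching path with a single node."""
    nodes = set(edges) | {w for succ in edges.values() for w in succ}
    preds = defaultdict(set)
    for v, succ in edges.items():
        for w in succ:
            preds[w].add(v)

    def extends_predecessor(v):
        # True when v has one predecessor whose only successor is v.
        if len(preds[v]) != 1:
            return False
        (p,) = preds[v]
        return len(edges.get(p, ())) == 1

    label_of_start, last_node = {}, {}
    for start in nodes:
        if extends_predecessor(start):
            continue  # v belongs to the middle of some other path
        path = [start]
        while True:
            succ = edges.get(path[-1], set())
            if len(succ) != 1:
                break
            (nxt,) = succ
            if not extends_predecessor(nxt):
                break
            path.append(nxt)
        # Merged label: the first node plus the last word of each later node.
        label = " ".join(start.split() + [n.split()[-1] for n in path[1:]])
        label_of_start[start] = label
        last_node[label] = path[-1]

    return {
        label: {label_of_start[w] for w in edges.get(last, ())}
        for label, last in last_node.items()
    }


text = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness"
)
compressed = compress(de_bruijn(text, k=4))
```

On this input the 17 original nodes collapse into the six nodes listed on the slide, and the node "of times, it was the" keeps its two outgoing edges, which is where the traversal becomes ambiguous.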

Page 28

Hadoopification… (Stay tuned!)

Page 29

Cloud worthy?

Page 30

How much data?

Page 31

Bottom Line: Bioinformatics

Great use case of Hadoop

Interesting computer science problems

Help unravel life’s mysteries?

Page 32

Questions? Comments?

Thanks to the organizations who support our work: