Genome Reconstruction: A Puzzle with a Billion Pieces

Phillip Compeau & Pavel Pevzner
University of California-San Diego

Outline

1. Introduction to Genome Sequencing

2. The Newspaper Problem

3. DNA Chips: A First Shot at Sequencing with Short Reads

4. Two Mathematical Detours

5. Introduction to Graph Theory

6. Euler’s Theorem

7. ECP vs. HCP and Algorithmic Complexity

8. From Euler and Hamilton to Fragment Assembly

9. De Bruijn and a Final Solution to Fragment Assembly

10. Generalizing Fragment Assembly

Section 1: Introduction to Genome Sequencing

What Is Genome Sequencing?

• A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A, T, G, and C.

• A human genome has roughly 3 billion nucleotides.

• Genome sequencing is the process of determining the sequence of nucleotides that make up a genome.

...CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATATAGCCGAGCGGCTACGATGATGCTAGCTGTACAGCTGATGATCTAGCTATCGATGCGATCGATGCGCGAGTGCGATCGATCACTTCGAGCTAGCTGATCGATCGATGCTAGCTAGCTGACTGATCATGGCGTTAGCTAGCTAGCTGATCGTCGATCGTACGTAGCTGATTACGATCGTCCGATCGTGCTATGACGTACGAGGCGGCTACGTAGCATGCTAGCTGACTGATGTAGCTAGCTATACGATACTATATATTCGATCGATTTATTACCATGACTGACGCGCATCGCTGTACACGTACTAGCTGATCGATGCTAGTCGATCGATCGATCATGTTATATATCGCGGCGCATCGATCGACTGCTCGATTATCGATACGTCGATCGCTGTATATACGTCTTTATAGCTAGGAGCATAGCGACGCGCTATCGATCGATCGTCTAGTCGACTGATCGTACTAGCTGACGCTGACGACTAGCTAGCTATCGACGATCGTAGTGCGATTACTAGCTAGGATCCTACTGTACGTCAGTCAGTCTGATCGATAGCGAGGAAAGCGAGACTGATCGTTCTCTAGATGTAGCTGATGTGACTACTATACTACTGGCAGCGATCGGGA…

• Different people have slightly different genomes: all humans share 99.9% of the same genetic code.

• The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc.

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGTGACTATTATCGACTACAGATGAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

Species Sequencing vs. Individual Genome Sequencing

• Species Sequencing: Determine the “consensus genome” of an entire species.

• Individual Sequencing: Determine how an individual differs from its species.

Why Would We Want to Sequence a Genome?

• Species genome sequencing:
  • Compare various species (e.g. human and chimpanzee) to understand how their genes function (e.g. which genes are important for brain development).
  • Reveal evolutionary relationships between species.
  • Determine the genetic makeup of our evolutionary ancestors.

• Individual genome sequencing:
  • Unearth the genetic basis of many diseases.
  • Forensics applications.

• Example: In 2010, 6-year-old Nicholas Volker became the first human being to be saved because of genome sequencing.
  • Doctors could not diagnose his condition, which caused strange infections; he went through nearly 100 surgeries.
  • Genome sequencing revealed a rare mutation in a gene linked to a defect in his immune system.
  • This led doctors to use advanced immunotherapy, which saved the child.

Brief History of Genome Sequencing

• Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods.

• 1980: They share the Nobel Prize in Chemistry.

• Still, their sequencing methods were too expensive for large genomes: with a $1 per nucleotide cost, it would cost $3 billion to sequence the human genome.

Walter Gilbert

Frederick Sanger

• 1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome.

• 1997: Craig Venter founds Celera Genomics, a private firm, with the same goal.

Francis Collins

Craig Venter

Brief History of Mammalian Genome Sequencing

• 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.

• 2000s: Many more mammalian genomes are sequenced.

The Arrival of Personal Genomics

• 2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude.

• 2010: The market for sequencing machines takes off.
  • Illumina reduces the cost of sequencing an individual human genome from $3 billion to $10,000.
  • Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month.
  • Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center.
  • 23andMe offers partial genome sequencing for $499.
  • Many universities introduce new courses in which students study their own genomes.

The Future of Genome Sequencing

• 2010s?: Genome sequencing will hopefully continue to bloom.
  • The $1,000 human genome may arrive as early as 2012.
  • Hopefully, sequencing an individual genome will soon become as routine as an X-ray.

What Makes Genome Sequencing So Difficult?

• When we read a book, we can read the entire book one letter at a time from the beginning to the end.

• However, modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces.
  • Thus, we can identify very short fragments of DNA (~100 nucleotides long), called reads.
  • But we have no idea which genomic positions these reads come from!
  • We must figure out how to put the reads back together to assemble a genome.

Section 2: The Newspaper Problem and Genome Sequencing

The Newspaper Problem

The Newspaper Problem as an “Overlap Puzzle”

• The newspaper problem is not the same as a jigsaw puzzle:
  • We have multiple copies of the same edition of a newspaper.
  • Plus, some pieces of paper got blown to bits in the explosion.

• Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said.

• This gives us a giant overlap puzzle!

Sequencing Is Harder than the Newspaper Problem

• In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper).

• However, the “language” of DNA remains largely unknown.

• There are lots of repeated substrings in every genome (50% of the human genome is formed by repeats).
  • Example: GCTT occurs 5 times in the following:

AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG

• Analogy: The Triazzle puzzle contains lots of repeated figures. This makes it very difficult to solve (even with just 16 pieces).
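
The repeat example above can be checked directly. The following is a minimal Python sketch (the function name find_occurrences is chosen here for illustration) that lists every position at which GCTT starts in the example string, counting the first occurrence as well as each repetition:

```python
def find_occurrences(text, pattern):
    """Return every 0-based start position of pattern in text (overlaps included)."""
    width = len(pattern)
    return [i for i in range(len(text) - width + 1) if text[i:i + width] == pattern]


example = "AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG"
positions = find_occurrences(example, "GCTT")
print(positions)        # start positions of GCTT
print(len(positions))   # total number of occurrences
```

In a real genome the repeats are far longer than 4 nucleotides, which is what makes overlapping reads ambiguous.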

Sequencing a Genome: Lab + Computation

• Read Generation (Experimental): Generate many reads from multiple copies of the same genome.

• Fragment Assembly (Computational): Use these reads to algorithmically put the genome back together.
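
To get a feel for the input that fragment assembly receives, read generation can be imitated in a few lines of Python. This is only a toy sketch under stated assumptions: the function name generate_reads, the short genome string, and the parameters are all made up for illustration, and real sequencing machines also introduce errors that are ignored here.

```python
import random


def generate_reads(genome, read_length, copies, reads_per_copy, seed=0):
    """Cut reads of a fixed length at random positions from several copies of a genome."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    reads = []
    for _ in range(copies):
        for _ in range(reads_per_copy):
            start = rng.randrange(len(genome) - read_length + 1)
            reads.append(genome[start:start + read_length])
    rng.shuffle(reads)  # the positions the reads came from are lost
    return reads


toy_genome = "GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC"  # hypothetical toy genome
reads = generate_reads(toy_genome, read_length=10, copies=5, reads_per_copy=4)
for read in reads:
    print(read)
```

The assembler sees only the shuffled list of reads, never the start positions.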

Sequencing a Genome: Illustration

Multiple (Unsequenced) Genome Copies

Read Generation

Reads

Fragment Assembly

Sequenced Genome

…GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…

Section 3: DNA Chips: A First Shot at Sequencing with Short Reads

DNA Chips: From an Idea to a New Industry

• 1989: Radoje Drmanac, Andrey Mirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation.

• Key Idea: Generate all k-mers (see below) from the genome in the hope that they can be assembled to reconstruct the genome.

• 1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.”

• 2000: Arrays are a multi-billion dollar industry.

Southern

Mirzabekov

Drmanac

k-mer: A string of length k (in an alphabet of 4 nucleotides)
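
The k-mer definition can be made concrete with a short Python sketch. The function names all_kmers and kmers_in are chosen here for illustration: the first lists every possible k-mer over the 4-letter alphabet, the second lists the distinct k-mers that occur in a given string.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Every string of length k over the alphabet (4**k strings for DNA)."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


def kmers_in(text, k):
    """The distinct k-mers occurring in text, in alphabetical order."""
    return sorted({text[i:i + k] for i in range(len(text) - k + 1)})


print(len(all_kmers(3)))      # number of distinct 3-mers
print(kmers_in("ATGGCG", 3))  # 3-mers of a short example string
```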

DNA Chips: Implementation

1. Synthesize a distinct k-mer in each of 4^k cells in the array.

2. Cover the array with multiple copies of a fluorescently-labeled unknown DNA fragment.

3. DNA will hybridize with a k-mer if it contains the complement of that k-mer.

4. Use a spectroscope to determine which sites emit light. The complements of these sites reveal the k-mers in the unknown DNA fragment: these are our reads!
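
The four steps above can be imitated in software. The sketch below is an idealized model, not a description of real array chemistry: it assumes a single-stranded fragment and perfect, error-free hybridization, and it takes "complement" to mean the reverse complement, since paired DNA strands run in opposite directions (for instance, the probe CAT pairs with ATG). The fragment CAATGGCGTGCA and all function names are hypothetical choices made for illustration.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(kmer):
    """Complement each letter and reverse the order."""
    return "".join(COMPLEMENT[letter] for letter in reversed(kmer))


def lit_probes(fragment, k):
    """Probes (cells of the array) whose complement occurs in the fragment."""
    present = {fragment[i:i + k] for i in range(len(fragment) - k + 1)}
    probes = ("".join(letters) for letters in product("ACGT", repeat=k))
    return sorted(p for p in probes if reverse_complement(p) in present)


def reads_from_probes(probes):
    """Step 4: the complements of the lit probes are the reads."""
    return sorted(reverse_complement(p) for p in probes)


fragment = "CAATGGCGTGCA"  # hypothetical unknown fragment
probes = lit_probes(fragment, 3)
print(probes)
print(reads_from_probes(probes))
```

Note that the array reports only which k-mers occur, not how many times or in what order, which is why an assembly step is still needed.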

DNA Chips: Illustration

DNA Chips: Example

• What are our reads?

AAA AGA CAA CGA GAA GGA TAA TGA
AAC AGC CAC CGC GAC GGC TAC TGC
AAG AGG CAG CGG GAG GGG TAG TGG
AAT AGT CAT CGT GAT GGT TAT TGT
ACA ATA CCA CTA GCA GTA TCA TTA
ACC ATC CCC CTC GCC GTC TCC TTC
ACG ATG CCG CTG GCG GTG TCG TTG
ACT ATT CCT CTT GCT GTT TCT TTT

• Sites on the array that emit light:

CAC CGC TGC CAT CCA GCA GCC ACG TTG ATT

• What are our reads?

CAT
|||
ATG

• So 3-mer ATG must occur in the genome!

Red 3-mers Must Occur in the Genome

• What are our reads? Each lit 3-mer on the array (left) is paired with its complement (right):
• CAC GTG
• CGC GCG
• CAT ATG
• TGC GCA
• ACG CGT
• ATT AAT
• CCA TGG
• GCA TGC
• GCC GGC
• TTG CAA

From Biological Data to Computational Problem

GTG GCG GCA ATG TGG TGC GGC CGT CAA AAT

• Aim: Construct a shortest possible genome containing all our reads.

• This is now a computational problem!
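
For ten 3-mers the aim above can be met by brute force. The following Python sketch is an illustration, not the method developed later in the slides. It assumes the reads are distinct and all of the same length k (with k at least 2). A string that contains n distinct k-mers must have at least n + k - 1 letters, so the sketch searches for strings of exactly that length by chaining reads that overlap in k - 1 letters. The function name assemble is hypothetical.

```python
def assemble(reads):
    """Return every string of length len(reads) + k - 1 that uses each read exactly once,
    with consecutive reads overlapping in k - 1 letters (empty list if none exists)."""
    k = len(reads[0])
    results = []

    def extend(text, remaining):
        if not remaining:
            results.append(text)
            return
        for read in sorted(remaining):
            # The next read must start with the last k - 1 letters of the text so far.
            if text[-(k - 1):] == read[:k - 1]:
                extend(text + read[-1], remaining - {read})

    read_set = frozenset(reads)
    for first in sorted(read_set):
        extend(first, read_set - {first})
    return results


reads = ["GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"]
genomes = assemble(reads)
for genome in genomes:
    print(genome)
```

The search finds more than one shortest string for these reads, so the reads alone do not pin down a single genome. Trying every ordering also becomes hopeless as the number of reads grows, which motivates the graph-based view in the following sections.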

Section 4: Two Mathematical Detours

The Bridges of Königsberg

• The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks.

• They wondered if they could walk through the city, cross each bridge (blue) exactly once, and return where they started.

• 1735: Leonhard Euler develops an approach to answer this question for any city, even for a “city” with a million islands.

• We will soon discuss Euler’s method as well as how it applies to genome sequencing.

Leonhard Euler

The Icosian Game

• Over a century passes…

• 1857: Irish mathematician William Hamilton designs a game consisting of a board representing 20 “islands” connected by “bridges.”

• Goal: find a walk that visits every island exactly once and returns to where it started.

William Hamilton

Icosian Game

Similar Problems with Very Different Fates

• These two stories have something in common:
  • Find a walk that uses every bridge once (Königsberg Bridges Problem)
  • Find a walk that visits every island once (Hamilton game)

• However, while Euler solved the first problem (even for a city with a million bridges), mathematicians still do not know how to solve the second problem, even for a city with a thousand islands.

• But where are the genomes???

Section 5: Introduction to Graph Theory

Graphs

• A graph is a network composed of two sets of objects:
  • Vertices: each vertex is represented by a point.
  • Edges: each edge is represented by a segment connecting two vertices.

• Graph theory can be applied to all kinds of different problems.
  • Transportation networks
  • Disease epidemics
  • Computer viruses spreading through the internet
  • And, yes…genome sequencing!

Königsberg Bridges Graph

• For the Königsberg Bridge Problem, we create a graph:
  • Vertices = 4 land masses of the city
  • Edges = 7 bridges connecting land areas

Note: We don’t need to worry about the exact placement of vertices or the shape of bridges.
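
A graph this small can be written down and explored exhaustively. In the Python sketch below, the labels A to D are hypothetical names for the four land masses (A is the central island, B and C are the two river banks, D is the second island), and the bridge list follows the usual historical layout. The sketch counts the bridges at each land mass and then simply tries every order of crossing the bridges. This brute-force search is only for illustration and is not Euler's method.

```python
from collections import Counter

# Each pair is one bridge; repeated pairs are parallel bridges.
BRIDGES = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]


def degrees(edges):
    """Number of bridge ends at each land mass."""
    count = Counter()
    for u, v in edges:
        count[u] += 1
        count[v] += 1
    return dict(count)


def has_closed_walk_using_every_edge_once(edges):
    """Try every way of crossing the edges, each exactly once, ending at the start."""
    def walk(start, current, used):
        if len(used) == len(edges):
            return current == start
        for i, (u, v) in enumerate(edges):
            if i in used:
                continue
            if u == current and walk(start, v, used | {i}):
                return True
            if v == current and walk(start, u, used | {i}):
                return True
        return False

    vertices = sorted({x for edge in edges for x in edge})
    return any(walk(s, s, frozenset()) for s in vertices)


print(degrees(BRIDGES))
print(has_closed_walk_using_every_edge_once(BRIDGES))
```

Exhaustive search is fine for 7 bridges but would be hopeless for a "city" with a million bridges.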

Icosian Game Graph

• For the Icosian Game, we create a graph:
  • Vertices = islands
  • Edges = bridges connecting the islands

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.
  • If the ant returns where it started, the result of its walk forms a cycle of G.

• Two questions:

1. Is there a cycle of G in which the ant walks through each edge exactly once? (Eulerian cycle)

2. Is there a cycle of G in which the ant walks through each vertex exactly once? (Hamiltonian cycle)

“I wish someone would name a cycle after me…I’m the one doing all the walking here!”
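To make the two questions concrete, here is a minimal sketch that checks whether a given closed walk answers either one. It assumes an undirected graph stored as a list of vertex pairs and a walk stored as a list of vertices; the function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter

def edge_key(u, v):
    # Undirected edge: the order of the endpoints does not matter.
    return frozenset((u, v))

def is_eulerian_cycle(walk, edges):
    """True if the closed walk uses every edge of the graph exactly once."""
    if len(walk) < 2 or walk[0] != walk[-1]:
        return False
    used = Counter(edge_key(a, b) for a, b in zip(walk, walk[1:]))
    available = Counter(edge_key(u, v) for u, v in edges)
    return used == available

def is_hamiltonian_cycle(walk, edges):
    """True if the closed walk follows edges and visits every vertex once."""
    if len(walk) < 2 or walk[0] != walk[-1]:
        return False
    available = {edge_key(u, v) for u, v in edges}
    vertices = {x for e in edges for x in e}
    steps_ok = all(edge_key(a, b) in available for a, b in zip(walk, walk[1:]))
    interior = walk[:-1]   # drop the repeated starting vertex
    return steps_ok and len(interior) == len(vertices) and set(interior) == vertices

# A square with one diagonal: the walk around the square visits every
# vertex once but never uses the diagonal.
square_with_diagonal = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
around = [1, 2, 3, 4, 1]
print(is_hamiltonian_cycle(around, square_with_diagonal))
print(is_eulerian_cycle(around, square_with_diagonal))
```

Checking a proposed cycle is easy in both cases; the difficulty discussed in these slides lies in finding one.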

Eulerian Cycles

• An Eulerian cycle is a cycle that traverses each edge exactly once.
  • A graph containing such a cycle is called Eulerian.

• If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

[Figure: the Königsberg graph with two added edges, numbered 1 through 9 in the order an Eulerian cycle traverses them.]

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that visits each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

• For example, the graph corresponding to the Icosian game is Hamiltonian.

• This means that the Icosian game has a solution!

[Figure: the Icosian game graph, with vertices numbered 1 through 20 in the order a Hamiltonian cycle visits them.]

Finding Eulerian Cycles vs Hamiltonian Cycles

• Given a graph G, we now have two questions that we can program a computer to answer about G.

• Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.

• Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G is not Hamiltonian.
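For the HCP, the obvious program simply tries every ordering of the vertices. The sketch below does that for an undirected graph; it is a brute-force illustration written for these notes (the function name is hypothetical), not an efficient method.

```python
from itertools import permutations

def find_hamiltonian_cycle(vertices, edges):
    """Brute force: try every ordering of the vertices (undirected graph).

    Returns a closed walk as a list of vertices, or None if none exists.
    """
    vertices = list(vertices)
    if not vertices:
        return None
    adjacent = {frozenset(e) for e in edges}
    first, rest = vertices[0], vertices[1:]
    # Fixing the first vertex leaves (n - 1)! orderings to examine.
    for order in permutations(rest):
        cycle = [first, *order, first]
        if all(frozenset((a, b)) in adjacent for a, b in zip(cycle, cycle[1:])):
            return cycle
    return None

square = [(1, 2), (2, 3), (3, 4), (4, 1)]
star = [(1, 2), (1, 3), (1, 4)]
print(find_hamiltonian_cycle([1, 2, 3, 4], square))
print(find_hamiltonian_cycle([1, 2, 3, 4], star))
```

With n vertices this loop may examine (n - 1)! orderings. For the 20 vertices of the Icosian game that is 19!, roughly 1.2 × 10^17, which is why brute force stops being practical long before a graph has a thousand vertices.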

Section 6: Euler’s Theorem

Euler’s Theorem

• We will now discuss how Euler solved the Königsberg Bridge Problem.
  • You might guess: He used graph theory!
  • This is not entirely accurate. A better statement would be: He invented graph theory!

Directed Graphs

• Directed Graph: A graph in which each edge has a direction (represented by an arrow).
  • You might like to think of directed edges as “one-way bridges.”

[Figure: an undirected graph and a directed graph.]

Eulerian Cycles in Directed Graphs

• An Eulerian cycle in a directed graph is simply a cycle that travels down all the edges in the correct direction.

• A directed graph is Eulerian if it contains an Eulerian cycle.

• Is this graph Eulerian? Why?

Balanced Graphs

• indegree(v) = the number of edges leading into vertex v.
• outdegree(v) = the number of edges leading out of v.

• A graph is balanced if indegree(v) = outdegree(v) for every vertex v.

• Label each vertex v with (indegree(v), outdegree(v)).

• This graph isn’t balanced since some vertices don’t have equal indegree and outdegree.

[Figure: a directed graph whose seven vertices are labeled (1, 2), (2, 1), (1, 0), (2, 1), (1, 1), (0, 2), (1, 1).]

• Adding some edges makes the graph balanced.

[Figure: the same graph with edges added; the vertex labels are now (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1).]
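The labeling described above is straightforward to compute. The following sketch assumes a directed graph stored as a list of (u, v) pairs, each meaning an edge from u to v. The small example graphs are hypothetical and are not the graph pictured on the slide.

```python
from collections import defaultdict

def degree_labels(vertices, edges):
    """Return {v: (indegree(v), outdegree(v))} for a directed graph."""
    indegree = defaultdict(int)
    outdegree = defaultdict(int)
    for u, v in edges:          # edge directed from u to v
        outdegree[u] += 1
        indegree[v] += 1
    return {v: (indegree[v], outdegree[v]) for v in vertices}

def is_balanced(vertices, edges):
    """True if indegree(v) == outdegree(v) for every vertex v."""
    return all(i == o for i, o in degree_labels(vertices, edges).values())

# Hypothetical example: a directed path, then the same path closed up.
vertices = ["a", "b", "c"]
path = [("a", "b"), ("b", "c")]
triangle = path + [("c", "a")]

print(degree_labels(vertices, path))
print(is_balanced(vertices, path), is_balanced(vertices, triangle))
```

Adding the single edge from c back to a plays the same role as the added edges on the slide: it repairs the two unbalanced vertices.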

Euler’s Theorem

• Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced.
  • A graph is connected if for every pair of vertices {u, v}, an ant can travel either from u to v or from v to u.

[Figure: two example graphs. One is labeled “Not Connected”; the other, with vertex labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1), is labeled “Connected + Balanced = Eulerian”.]

Section 7: ECP vs. HCP and Algorithmic Complexity

Solving the ECP

• By Euler’s Theorem, to determine whether a connected graph G contains an Eulerian cycle, we only need to check if G is balanced.

• So we simply go to each vertex and perform this simple check:
  • If every vertex is balanced, then G must contain an Eulerian cycle.
  • If some vertex is not balanced, then G cannot contain an Eulerian cycle.
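A minimal sketch of this check is given below, assuming a directed graph stored as a list of (u, v) pairs in which every vertex appears in at least one edge. Because Euler’s Theorem also requires connectivity, the sketch adds a breadth-first search; it ignores edge directions, which should be sufficient once every vertex is known to be balanced. The function name is hypothetical.

```python
from collections import defaultdict, deque

def has_eulerian_cycle(edges):
    """Decide whether a directed graph (list of u -> v pairs) is Eulerian.

    Assumption: every vertex appears in at least one edge.
    """
    if not edges:
        return False
    surplus = defaultdict(int)     # outdegree(v) - indegree(v)
    neighbors = defaultdict(set)   # adjacency with directions ignored
    for u, v in edges:
        surplus[u] += 1
        surplus[v] -= 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    if any(value != 0 for value in surplus.values()):
        return False               # some vertex is not balanced
    # Breadth-first search to confirm the graph is in one piece.
    start = edges[0][0]
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in neighbors[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) == len(neighbors)

triangle = [("a", "b"), ("b", "c"), ("c", "a")]
two_triangles = triangle + [("x", "y"), ("y", "z"), ("z", "x")]
print(has_eulerian_cycle(triangle), has_eulerian_cycle(two_triangles))
```

Each edge is examined a constant number of times, so the work grows in proportion to the size of the graph. This is the contrast with the Hamiltonian Cycle Problem, for which no comparably simple test is known.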

Connected + Balanced = Eulerian

• Recall our example directed graph from before.

• Here the graph is not balanced, and so it clearly isn’t Eulerian.

[Figure: the example graph with vertex labels (1, 2), (2, 1), (1, 0), (2, 1), (1, 1), (0, 2), (1, 1).]

• Adding the edges to make the graph balanced will mean that an Eulerian cycle must exist.

[Figure: the balanced graph with vertex labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1); its edges are numbered 1 through 11 in the order an Eulerian cycle traverses them.]

• One vital question remains: Where did this Eulerian cycle come from?

Making an Eulerian Cycle from a Balanced Graph

• Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes.
  • The ant cannot walk along any edge that has been previously traversed.
  • The ant must always walk along edges in the legal direction.

• At each step, we update the remaining indegree and outdegree of each vertex.

• Cycle! But not Eulerian yet…

[Figure: the vertex labels start as (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (1, 1), (2, 2); once the ant has returned to its starting vertex they read (2, 2), (2, 2), (1, 1), (0, 0), (1, 1), (1, 1), (0, 0).]

• Let’s cut out the cycle that the ant has found.

• Next delete vertices that are no longer connected to anything.

[Figure: with the cycle removed and the two (0, 0) vertices deleted, the five remaining vertices are labeled (2, 2), (2, 2), (1, 1), (1, 1), (1, 1).]

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(1, 1)(1, 1)

• Again, let the ant walk through the graph however it chooses.

Page 140: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means thatthe ant can never “get stuck”at a vertex along the way,because it will always have anedge leading out of anyvertex that it enters.

Making an Eulerian Cycle from a Balanced Graph

(1, 2)

(2, 2)

(1, 1)

(1, 0)(1, 1)

Page 141: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way, because it will always have an edge leading out of any vertex that it enters.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 2)

(1, 1)

(1, 0)(1, 1)

Page 142: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way, because it will always have an edge leading out of any vertex that it enters.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)

(0, 1)

(1, 0)(1, 1)

“I really don’t see how this is going to give us an Eulerian cycle in the original graph…I knew I shouldn’t have left the house this morning!”

Page 143: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way, because it will always have an edge leading out of any vertex that it enters.

• Cycle! But still not Eulerian…

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)

(0, 0)

(0, 0)(1, 1)

Page 144: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)

(0, 0)

(0, 0)(1, 1)

• Let’s trim out this cycle one more time.

Page 145: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)(1, 1)

Page 146: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)(1, 1)

“Hmph! Dragged halfway across the screen…I guess I don’t have any say in the matter…”

Page 147: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that the ant can walk through the graph.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)(1, 1)

Page 148: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that the ant can walk through the graph.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(0, 1)(1, 0)

Page 149: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that the ant can walk through the graph.

Making an Eulerian Cycle from a Balanced Graph

(0, 1)

(0, 0)(1, 0)

Page 150: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that the ant can walk through the graph.

• Cycle! And Eulerian to boot… so we have run out of edges.

Making an Eulerian Cycle from a Balanced Graph

(0, 0)

(0, 0)(0, 0)

Page 151: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that the ant can walk through the graph.

• Cycle! And Eulerian to boot… so we have run out of edges.

• What do we do now?

Making an Eulerian Cycle from a Balanced Graph

(0, 0)

(0, 0)(0, 0)

“Yes! What DO we do now?”

Page 152: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s bring back our original graph.

Making an Eulerian Cycle from a Balanced Graph

Page 153: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s bring back our original graph.

• Highlight the three cycles that the ant found.

Making an Eulerian Cycle from a Balanced Graph

Page 154: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

Page 155: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

Page 156: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

Page 157: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

Page 158: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

Page 159: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

Page 160: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5

Page 161: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

Page 162: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

Page 163: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

• Cycle formed; however, we now have no new edges to follow!

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

8

“???”

Page 164: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

8

“Backtracking? But I’m not evolved to walk backwards!”

Page 165: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

Page 166: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

Page 167: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

7

Page 168: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

7

8

Page 169: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

• Rejoin the blue cycle…

7

89

Page 170: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

• Rejoin the blue cycle…

7

89

10

“I smell something good!”

Page 171: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

• Rejoin the blue cycle…

• And we have the same Eulerian cycle from before!

7

89

10

11

“Yay! Now can I go home please?”

Page 172: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

What’s the Big Deal?

• The great thing about this method is that it can be easily generalized to any balanced graph to give an Eulerian cycle.

• “Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?”

• Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy…

1

2

3

4

5 6

7

89

10

11

Page 173: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

What’s the Big Deal?

• More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer.

• Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute!

1

2

3

4

5 6

7

89

10

11
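The ant's procedure can be sketched in code. Below is a minimal Python sketch of one standard way to implement it (often called Hierholzer's algorithm), which uses a stack to fold the "walk, get stuck, backtrack to a vertex with an unused edge" steps into a single loop. The function name, the adjacency-dictionary format, and the five-vertex example graph are assumptions made for illustration; the graph is not the one drawn on the slides.

```python
def eulerian_cycle(adjacency, start):
    """Return an Eulerian cycle as a list of vertices.

    Assumes `adjacency` maps each vertex to a list of its out-neighbours
    and describes a balanced, connected directed graph.
    """
    unused = {v: list(ws) for v, ws in adjacency.items()}  # edges not yet walked
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if unused.get(v):
            # The ant still has an unused edge leaving v: walk along it.
            stack.append(unused[v].pop())
        else:
            # Stuck at v: backtrack one step and record v in the final cycle.
            cycle.append(stack.pop())
    return cycle[::-1]


# Hypothetical balanced graph: the cycles 0->1->2->0 and 1->3->4->1 share vertex 1.
graph = {0: [1], 1: [2, 3], 2: [0], 3: [4], 4: [1]}
print(eulerian_cycle(graph, 0))
```

Each edge is pushed onto the stack once and popped once, so the running time of this sketch grows in proportion to the number of edges.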

Page 174: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

What’s the Big Deal?

• “Yeah, but computers are supermachines! They don’t really need 300-year-old mathematics to help them solve problems. Aren’t they going to take over the world anyway?”

• So let’s examine the case of finding a Hamiltonian cycle…

Page 175: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Searching for an Efficient Algorithm for HCP

• Key Point: No one has ever found a similar efficient test to determine whether a graph is Hamiltonian.

• Of course, we could examine every possible (ant) walk through the graph to solve the HCP.

• However, this brute force approach is just not efficient: there are more walks through a graph on just 1,000 vertices than there are atoms in the universe!
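To make the brute force idea concrete, here is a minimal Python sketch that tries every ordering of the vertices and checks whether consecutive vertices are joined by edges. The function name and input format are hypothetical. The final lines count the orderings for 1,000 vertices, assuming the walk is pinned to a fixed starting vertex, and compare that count with 10^80, a commonly quoted rough estimate for the number of atoms in the observable universe.

```python
import math
from itertools import permutations


def has_hamiltonian_cycle(vertices, edges):
    """Brute force: test every ordering of the vertices (only feasible for tiny graphs)."""
    vertices = list(vertices)
    edge_set = set(edges)
    first, rest = vertices[0], vertices[1:]
    for order in permutations(rest):
        tour = [first, *order, first]  # visit every vertex once, then return home
        if all((a, b) in edge_set for a, b in zip(tour, tour[1:])):
            return True
    return False


# With 1,000 vertices and a fixed start there are 999! orderings to try.
orderings = math.factorial(999)
print(len(str(orderings)))   # number of decimal digits in 999!
print(orderings > 10 ** 80)  # compare with the rough atom-count estimate
```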

Page 176: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

• In fact, the HCP has been classified as NP-Complete.

• In layman’s terms, this means that the HCP belongs to a collection containing thousands of computational problems for which no one knows a method that solves large input data sets quickly.

• NP-Complete problems are all equivalent to each other: find an efficient solution to one, and you have an efficient solution to them all.

Page 177: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

“I can't find an efficient algorithm, I guess I'm just too dumb.”

From Garey and Johnson. Computers and Intractability. 1979

• Attempting to solve any NP-Complete problem is difficult.

Page 178: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

“I can't find an efficient algorithm, because no such algorithm is possible.”

• Attempting to solve any NP-Complete problem is difficult.

• The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist.

From Garey and Johnson. Computers and Intractability. 1979

Page 179: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

“I can't find an efficient algorithm, but neither can all these famous people.”

• Attempting to solve any NP-Complete problem is difficult.

• The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist.

• The present state of affairs is somewhere in between.

From Garey and Johnson. Computers and Intractability. 1979

Page 180: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

The NP-Completeness of the HCP

• The question of whether or not NP-Complete problems (including the HCP) can be solved efficiently is one of seven Millennium Problems in mathematics.

• Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will get $1 million.

• However, if you become a mathematician, odds are that you are not in it for the $$$... recently, Grigory Perelman solved one of these problems but turned down the prize.

Grigory Perelman, True Legend

Page 181: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 8: From Euler and Hamilton to Fragment Assembly

Page 182: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Simplifying Assumptions for Fragment Assembly

1. Every k-mer occurring in the genome is generated by some read.

2. Reads are error-free.

3. Every k-mer occurring in the genome occurs exactly once.

4. The underlying genome consists of a single circular-shaped chromosome.

• Note: In the final section, we will relax these assumptions.

Page 183: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

GTGGCG GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

Page 184: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

GTGGCG GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG

Page 185: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

GCG GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG GCG

Page 186: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG GCGGCA

Page 187: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG GCGGCAATG

Page 188: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

TGG TGC

GGC

CGT CAA

AAT

GTG GCGGCAATG TGG

Page 189: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

TGC

GGC

CGT CAA

AAT

GTG GCGGCAATG TGG TGC

Page 190: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

GGC

CGT CAA

AAT

GTG GCGGCAATG TGG TGCGGC

Page 191: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

CGT CAA

AAT

GTG GCGGCAATG TGG TGCGGCCGT

Page 192: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

CAA

AAT

GTG GCGGCAATG TGG TGCGGCCGT CAA

Page 193: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

AAT

GTG GCGGCAATG TGG TGCGGCCGT CAAAAT

Page 194: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every read detected by our array.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 195: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every k-mer detected by our array.

• Prefix: First k – 1 nucleotides of a k-mer (CA in CAA)

• Suffix: Last k – 1 nucleotides of a k-mer (AA in CAA)

• Different 3-mers may share a prefix/suffix: TG is the suffix of ATG and CTG and the prefix of TGA.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 196: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

Page 197: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

Page 198: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 199: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 200: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 201: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 202: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 203: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 204: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 205: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 206: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG
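The edge rule for H can be written down in a few lines. The Python sketch below builds H for the ten 3-mers shown on the slides by comparing the last k – 1 letters of each vertex with the first k – 1 letters of every other vertex. The function name and the dictionary-of-lists representation are assumptions made for illustration.

```python
def overlap_graph(kmers):
    """Map each k-mer v to the k-mers w whose prefix equals the suffix of v."""
    # v[1:] is the suffix of v (its last k - 1 letters);
    # w[:-1] is the prefix of w (its first k - 1 letters).
    return {v: [w for w in kmers if v[1:] == w[:-1]] for v in kmers}


# The ten 3-mers from the slides.
kmers = ["ATG", "CGT", "GGC", "AAT", "GTG", "TGG", "TGC", "CAA", "GCA", "GCG"]
H = overlap_graph(kmers)
for v, ws in H.items():
    print(v, "->", ", ".join(ws))
```

This sketch compares every pair of k-mers, which is fine for ten reads but would be far too slow for real data; indexing the k-mers by prefix is one common way to avoid the all-pairs comparison.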

Page 207: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 208: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 209: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 210: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG GGC

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 211: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG GGC GCG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 212: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG GGC GCG CGT

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 213: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG GGC GCG CGT GTG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 214: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG GGC GCG CGT GTG TGC

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 215: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG GGC GCG CGT GTG TGC GCA

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 216: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

• ATG TGG GGC GCG CGT GTG TGC GCA CAA

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 217: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 218: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Page 219: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

Page 220: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG

ATGGenome:

T

G

A

Page 221: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG

ATGGGenome:

T

G

G

A

Page 222: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC

ATGGCGenome:

T

G

G

C

A

Page 223: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG

ATGGCGGenome:

T

G

G

CG

A

Page 224: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT

ATGGCGTGenome:

T

G

G

CG

T

A

Page 225: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG

ATGGCGTG Genome:

T

G

G

CG

T

G

A

Page 226: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC

ATGGCGTGC Genome:

T

G

G

CG

T

G

C

A

Page 227: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA

ATGGCGTGCAGenome:

T

G

G

CG

T

G

C

AA

Page 228: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA

ATGGCGTGCAAGenome:

AT

G

G

CG

T

G

C

A

Page 229: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT

ATGGCGTGCAATGenome:

AT

G

G

CG

T

G

C

A

Page 230: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATG

Genome:

AT

G

G

CG

T

G

C

A

Page 231: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATG

Genome:

AT

G

G

CG

T

G

C

A

Page 232: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATG

Genome:

AT

G

G

CG

T

G

C

A

Page 233: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATGGenome:

AT

G

G

CG

T

G

C

A

Page 234: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCA

Genome:

AT

G

G

CG

T

G

C

A
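
The spelling step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `spell_circular_genome` is hypothetical, and the input is assumed to be a closed walk whose last 3-mer repeats the first.

```python
def spell_circular_genome(cycle):
    """Spell a circular genome from a closed walk of k-mers.

    `cycle` lists k-mers so that each one overlaps the next in k - 1 letters,
    and the last entry repeats the first to close the cycle.
    """
    for left, right in zip(cycle, cycle[1:]):
        if left[1:] != right[:-1]:
            raise ValueError(f"{left} and {right} do not overlap")
    if cycle[0] != cycle[-1]:
        raise ValueError("cycle must end where it starts")
    # Every k-mer except the repeated last one contributes its first letter.
    return "".join(kmer[0] for kmer in cycle[:-1])


cycle = "ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATG".split()
genome = spell_circular_genome(cycle)
print(genome)
```

Taking one letter per 3-mer works because each 3-mer shares its last two letters with its successor.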

Problem with H

• Ultimately, we must solve the HCP on H in order to find a candidate DNA sequence…

• This idea motivated the method used for assembling the human genome from 50 million (long and expensive) reads in 2000, but the computational strain was overwhelming: sequencing the human genome took several computers months of work around the clock.

• For that matter, newer sequencing technologies produce billions of (short and inexpensive) reads: we need a new idea.

Second Try: The Graph E

• Form a different graph E as follows:
• Create a vertex for each distinct prefix/suffix from reads.
• Connect vertex v to vertex w with a directed edge if there is a read whose prefix is v and whose suffix is w.

Reads: TGC GGC CGT CAA AAT GTG GCG GCA ATG TGG

Vertices: CA GC CG TG GT GG AT AA

Edges (one per read):
ATG: AT → TG
TGG: TG → GG
GGC: GG → GC
GCG: GC → CG
CGT: CG → GT
GTG: GT → TG
TGC: TG → GC
GCA: GC → CA
CAA: CA → AA
AAT: AA → AT
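
The two construction rules for E can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_graph_e` and the adjacency-list representation are assumptions, and the prefix and suffix of a read are taken to be its first and last k - 1 letters.

```python
def build_graph_e(reads):
    """Return {vertex: [successors]} with one directed edge per read."""
    graph = {}
    for read in reads:
        prefix, suffix = read[:-1], read[1:]
        # Edge prefix -> suffix, labelled implicitly by the read itself.
        graph.setdefault(prefix, []).append(suffix)
        # Make sure the suffix is a vertex even if it starts no read.
        graph.setdefault(suffix, [])
    return graph


reads = "TGC GGC CGT CAA AAT GTG GCG GCA ATG TGG".split()
graph_e = build_graph_e(reads)
for vertex in sorted(graph_e):
    print(vertex, "->", graph_e[vertex])
```

Building E takes one pass over the reads, whereas H needs an overlap test between pairs of reads.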

Eulerian Cycles in E

• We have an Eulerian cycle in E (edges listed in the order they are traversed, 1 through 10):
ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT

• This is the same sequence of 3-mers that we had in H!

• Thus we will obtain the same sequenced genome as before.

Genome: ATGGCGTGCA
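
One way to find such a cycle by computer is Hierholzer's algorithm. The slides do not name an algorithm, so this choice is an assumption, as are the function name `eulerian_cycle` and the hard-coded adjacency lists for E. The sketch assumes the graph is balanced and connected.

```python
def eulerian_cycle(graph, start):
    """Return the vertex sequence of an Eulerian cycle (Hierholzer's algorithm).

    Assumes `graph` is balanced and connected. Works on a copy of the edges.
    """
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            # Follow any unused edge out of v.
            stack.append(remaining[v].pop())
        else:
            # No unused edges left at v: it is finished, so emit it.
            cycle.append(stack.pop())
    return cycle[::-1]


graph_e = {
    "AT": ["TG"], "TG": ["GG", "GC"], "GG": ["GC"], "GC": ["CG", "CA"],
    "CG": ["GT"], "GT": ["TG"], "CA": ["AA"], "AA": ["AT"],
}
walk = eulerian_cycle(graph_e, "AT")
reads_in_order = [v + w[-1] for v, w in zip(walk, walk[1:])]
genome = "".join(v[0] for v in walk[:-1])
print(" ".join(reads_in_order))
print(genome)
```

E has more than one Eulerian cycle, so the cycle printed here may differ from the one shown on the slide. Every such cycle uses each of the ten reads exactly once.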

Analysis of E

• Good News: We now only have to find an Eulerian cycle in the graph E, which could be done on this computer.

• Bad News:

1. There may be more than one Eulerian cycle in E.
• We won’t discuss this issue here, but it can be resolved.

2. How do we know that E even has an Eulerian cycle?
• By Euler’s Theorem, we only need to show that E is a balanced graph.
• To do this, we need one more piece of mathematical history…

Section 9: De Bruijn and Fragment Assembly

De Bruijn’s Question

• 1946: The Dutch mathematician Nicolaas de Bruijn asks: can we design a circular superstring of minimal length that contains every binary string of length k?

• Example for k = 3: the circular superstring ‘00011101’ contains all eight binary strings of length 3. For instance, ‘000’ occupies positions 1 to 3 and ‘110’ occupies positions 5 to 7.

[Photo: Nicolaas de Bruijn]
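
The k = 3 example can be checked mechanically. The helper name `circular_kmers` is hypothetical; the sketch reads every window of length k off the string, wrapping around the end because the superstring is circular.

```python
from itertools import product


def circular_kmers(s, k):
    """Return the k-mers starting at each position of the circular string s."""
    wrapped = s + s[: k - 1]
    return [wrapped[i:i + k] for i in range(len(s))]


superstring = "00011101"
k = 3
found = circular_kmers(superstring, k)
all_binary = {"".join(bits) for bits in product("01", repeat=k)}

print(found)
print("contains every binary 3-mer:", set(found) == all_binary)
```

The string has 8 positions and there are 2^3 = 8 binary 3-mers, so each 3-mer appears exactly once and no shorter circular string could contain them all.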

De Bruijn’s Question

• De Bruijn introduced a special class of graph B(n, k):
• Vertices = all n^(k – 1) possible (k – 1)-mers in an n-letter alphabet.
• An edge connects v to w if there is a k-mer whose prefix = v and whose suffix = w.

• The figure shows B(2, 4), assuming that our alphabet contains 0 and 1: it has 2^3 = 8 vertices and 2^4 = 16 edges.
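
The definition of B(n, k) translates directly into code. The following is a sketch under assumptions: the function name `de_bruijn_graph` is hypothetical, the alphabet is passed as a string of letters, and the graph is stored as adjacency lists.

```python
from itertools import product


def de_bruijn_graph(alphabet, k):
    """Return {(k-1)-mer: [successor (k-1)-mers]} with one edge per k-mer."""
    graph = {"".join(p): [] for p in product(alphabet, repeat=k - 1)}
    for p in product(alphabet, repeat=k):
        kmer = "".join(p)
        # The k-mer is an edge from its prefix to its suffix.
        graph[kmer[:-1]].append(kmer[1:])
    return graph


b24 = de_bruijn_graph("01", 4)
for vertex, successors in b24.items():
    print(vertex, "->", successors)
```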

De Bruijn’s Question

• For any choice of n and k, B(n, k) must be balanced/Eulerian.

• Why? Because the indegree and the outdegree of every vertex are both equal to the size of the alphabet (n): every (k – 1)-mer occurs as the prefix of n different k-mers and as the suffix of n different k-mers.

• Red numbers show the order of edges in an Eulerian cycle.
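
The degree argument can be confirmed by brute force for small cases. The helper name `degrees` is hypothetical; the sketch enumerates every k-mer and counts how often each (k – 1)-mer is a prefix (outdegree) and a suffix (indegree).

```python
from collections import Counter
from itertools import product


def degrees(alphabet, k):
    """Return (indegree, outdegree) counters for the vertices of B(n, k)."""
    indeg, outdeg = Counter(), Counter()
    for p in product(alphabet, repeat=k):
        kmer = "".join(p)
        outdeg[kmer[:-1]] += 1  # edge leaves the prefix
        indeg[kmer[1:]] += 1    # edge enters the suffix
    return indeg, outdeg


for alphabet, k in [("01", 4), ("ACGT", 3)]:
    indeg, outdeg = degrees(alphabet, k)
    n = len(alphabet)
    balanced = all(indeg[v] == n and outdeg[v] == n for v in outdeg)
    print(f"B({n}, {k}): {len(outdeg)} vertices, balanced = {balanced}")
```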

Page 290: Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).
  • We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!
  • The indegree and outdegree of any (k – 1)-mer vertex both equal the number of times this (k – 1)-mer appears in the genome.

[Figure: graph E for the genome ATGGCGTGCA (k = 3). Vertices: AT, TG, GG, GC, CG, GT, CA, AA. Edges: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT. The edges are numbered 1–10.]
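The degree claim can be checked directly on the example genome. This is a minimal sketch that treats the genome as circular (as assumed throughout the talk); the helper names are ours.

```python
from collections import Counter

def circular_kmers(s, k):
    """Every length-k window of s, reading s as a circle."""
    doubled = s + s[:k - 1]
    return [doubled[i:i + k] for i in range(len(s))]

genome, k = "ATGGCGTGCA", 3
edges = circular_kmers(genome, k)                     # one edge per 3-mer
outdeg = Counter(kmer[:-1] for kmer in edges)         # edge leaves its prefix
indeg = Counter(kmer[1:] for kmer in edges)           # edge enters its suffix
occurrences = Counter(circular_kmers(genome, k - 1))  # 2-mer counts in the genome

for v in sorted(occurrences):
    print(v, indeg[v], outdeg[v], occurrences[v])
```

Every row shows three equal numbers; for instance TG and GC each occur twice in the genome and have indegree and outdegree 2.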

Section 10: Generalizing Fragment Assembly

Simplifying Assumptions for Fragment Assembly

• Recall the assumptions we have already made:

1. Every k-mer occurring in the genome is generated by some read.

2. Reads are error-free.

3. Every k-mer occurring in the genome occurs exactly once.

4. The underlying genome consists of a single circular-shaped chromosome.

• Our aim is to relax each of these assumptions and determine how the problem changes.

Assumption 1: Generating (nearly) all k-mers

• 100-nucleotide reads generated by Illumina sequencing technology capture only a small fraction of 100-mers from the genome (even for high-coverage sequencing projects), thus violating this key assumption of the de Bruijn graph approach.

• However, if we break these reads into shorter k-mers, the resulting k-mers often represent nearly all k-mers from the genome for sufficiently small k.

• For example, modern assemblers often break every 100-nucleotide read into 46 overlapping 55-mers and further assemble the resulting 55-mers using de Bruijn graphs.
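The count of 46 follows from simple arithmetic: a read of length L contains L – k + 1 overlapping k-mers. A minimal sketch, using a made-up placeholder read rather than real sequencing data:

```python
def split_into_kmers(read, k):
    """Return all overlapping k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

read = "ACGT" * 25                   # placeholder 100-nucleotide read
kmers = split_into_kmers(read, 55)
print(len(read), len(kmers))         # 100 - 55 + 1 = 46 overlapping 55-mers
```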

Assumption 1: Generating (nearly) all k-mers

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

Genome: ATGCAAGCTAGCT

Reads: ATGCAA, CAAGCT, CTAGCT, CTATGC (the last read wraps around the end of the circular genome)

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.
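The coverage claim in this example can be tested with a few lines. Note one assumption: the fourth read is taken to be CTATGC, the read that wraps around the end of the circular genome, which is our reading of the slide's figure. Function names are illustrative.

```python
def kmers(s, k):
    """All overlapping k-mers of a linear string."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def circular_kmers(s, k):
    """All k-mers of s read as a circle."""
    doubled = s + s[:k - 1]
    return [doubled[i:i + k] for i in range(len(s))]

genome = "ATGCAAGCTAGCT"
# Fourth read assumed to be the wrap-around read CT|ATGC.
reads = ["ATGCAA", "CAAGCT", "CTAGCT", "CTATGC"]

def covered(k):
    """True if every k-mer of the genome appears in some read."""
    from_reads = {km for r in reads for km in kmers(r, k)}
    return set(circular_kmers(genome, k)) <= from_reads

print(covered(6), covered(3))
```

With k = 6 the reads miss 6-mers such as TGCAAG, while with k = 3 every 3-mer of the genome is recovered from the pieces.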

Assumption 2: Handling Errors in Reads

• What happens to the graph E when some reads have errors?

• Example: Say our graph E for genome ATGGCGTGCAATG should look like this.

• If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead.

• This is called a bulge in the graph E.
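To see where the bulge comes from, compare the 3-mers of the correct and the erroneous read. This is only a sketch of the effect, not a bulge-removal method; the helper names are ours.

```python
def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def vertex_path(read, k):
    """The (k-1)-mer vertices visited when the read's k-mers are followed as edges."""
    kms = kmers(read, k)
    return [kms[0][:-1]] + [km[1:] for km in kms]

correct, erroneous, k = "TGGCGTG", "TGGAGTG", 3

good = kmers(correct, k)
bad = kmers(erroneous, k)
extra_edges = [km for km in bad if km not in good]   # edges created by the error

print(extra_edges)
print(vertex_path(correct, k))
print(vertex_path(erroneous, k))
```

A single wrong nucleotide creates k = 3 spurious edges. The two vertex paths agree up to GG, split (GC, CG versus GA, AG), and rejoin at GT; that pair of parallel paths is the bulge.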

Assumption 2: Handling Errors in Reads

• Most reads have errors, resulting in millions of bulges in E.

• 2004: Pevzner et al. provide an algorithm for bulge removal.

Assumption 3: Handling Repeated k-mers

• The genome ACGTACGT has only four 3-mers: ACG, CGT, GTA, and TAC.

• We would obtain the graph E below and reconstruct this genome as: ACGT

• In other words, we can’t represent repeated k-mers in the genome!

[Figure: graph E with vertices AC, CG, GT, TA and edges ACG, CGT, GTA, TAC forming a single cycle.]
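The loss of information is easy to reproduce: counting 3-mers shows each one occurring twice, but a graph with one edge per distinct 3-mer only sees four of them. A small sketch, treating the genome as circular; names are illustrative.

```python
from collections import Counter

def circular_kmers(s, k):
    doubled = s + s[:k - 1]
    return [doubled[i:i + k] for i in range(len(s))]

genome = "ACGTACGT"
counts = Counter(circular_kmers(genome, 3))

print(sorted(counts.items()))   # each 3-mer with its number of occurrences
print(len(counts))              # distinct 3-mers = edges in E
```

With only four distinct edges, the Eulerian cycle spells a genome of length 4 (ACGT) instead of the true length 8.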

Assumption 3: Handling Repeated k-mers

• Define the multiplicity of a k-mer as the number of times it occurs in a genome.

• We will add edges to E in order to form a new graph E* for which the number of edges connecting two vertices represents the multiplicity of the k-mer on that edge.

• An Eulerian cycle in E* still gives a candidate genome.

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:
  • Multiplicity 1: ATG, GGC, AAT, TGG, CAA, GCA
  • Multiplicity 2: GCG, CGT, GTG, TGC

• We reflect multiplicities as multiple edges.

• Candidate genome: ATGCGTGGCGTGCA

• E* is balanced because indegree(v) and outdegree(v) still equal the number of times v appears.

[Figure: graph E* with vertices AT, TG, GG, GC, CG, GT, CA, AA. The edges GCG, CGT, GTG, TGC are drawn twice; the edges ATG, TGG, GGC, GCA, CAA, AAT are drawn once.]
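Both claims on this slide (E* is balanced, and the candidate genome uses every edge as often as its multiplicity) can be verified with counters. A minimal sketch; the dictionary `multiplicity` simply transcribes the slide's table, and the variable names are ours.

```python
from collections import Counter

def circular_kmers(s, k):
    doubled = s + s[:k - 1]
    return [doubled[i:i + k] for i in range(len(s))]

multiplicity = {
    "ATG": 1, "GGC": 1, "AAT": 1, "TGG": 1, "CAA": 1, "GCA": 1,
    "GCG": 2, "CGT": 2, "GTG": 2, "TGC": 2,
}

# Each 3-mer contributes `m` parallel edges from its prefix to its suffix.
indeg, outdeg = Counter(), Counter()
for kmer, m in multiplicity.items():
    outdeg[kmer[:-1]] += m
    indeg[kmer[1:]] += m
balanced = indeg == outdeg

candidate = "ATGCGTGGCGTGCA"
uses_every_edge = Counter(circular_kmers(candidate, 3)) == Counter(multiplicity)

print(balanced, uses_every_edge)
```

Other Eulerian cycles in E* may spell different strings with the same 3-mer multiplicities, which is why the slide speaks of a candidate genome.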

Determining k-mer multiplicities

• How can we find the multiplicity of a k-mer in the genome?

• The multiplicity of a k-mer will be directly related to the frequency with which that k-mer occurs in our reads.

• So a k-mer that appears 5 times in the genome is expected to occur about 5 times as often in our reads as a k-mer that appears once.

[Figure: graph with vertices AT, TG, GG, GC, CG, GT, CA, AA and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT.]
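One simple way to turn read counts into multiplicities is to divide each count by the count expected for a single-copy k-mer. The sketch below estimates that unit as the median count, which is our own heuristic (it assumes most k-mers occur once) and not a method from the slides; the observed counts are invented for illustration.

```python
from statistics import median

def estimate_multiplicities(kmer_counts):
    """Estimate genome multiplicity of each k-mer from its count in the reads."""
    unit = median(kmer_counts.values())   # assumed count of a single-copy k-mer
    return {kmer: max(1, round(c / unit)) for kmer, c in kmer_counts.items()}

# Made-up read counts at roughly 10x coverage.
observed = {
    "ATG": 9, "TGG": 10, "GGC": 11, "GCA": 10, "CAA": 10, "AAT": 9,
    "GCG": 21, "CGT": 19, "GTG": 20, "TGC": 22,
}
estimated = estimate_multiplicities(observed)
print(estimated)
```

Real data is noisier, so this kind of rounding is only a starting point.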

Assumption 4: From Circular to Linear Genomes

• The genomes for all complex organisms are split across a number of linear chromosomes (46 in humans).

• So in order to sequence the human genome, geneticists simply sequenced all of these linear chromosomes.

• Question: How do we sequence a linear segment of DNA?

Assumption 4: From Circular to Linear Genomes

• Say our linear DNA segment is ATGCGTGGCGTGCA.

• Then the 3-mers from this segment are the same as for the circular segment before, but the segment doesn’t “wrap around,” so we will lose two 3-mers:
  • CAA
  • AAT
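The two lost 3-mers are exactly the windows that span the junction between the end and the start of the string. A short sketch comparing the circular and linear 3-mer collections (helper names are ours):

```python
from collections import Counter

def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def circular_kmers(s, k):
    doubled = s + s[:k - 1]
    return [doubled[i:i + k] for i in range(len(s))]

segment = "ATGCGTGGCGTGCA"
circular = Counter(circular_kmers(segment, 3))
linear = Counter(kmers(segment, 3))

lost = sorted((circular - linear).elements())
print(lost)    # the k - 1 = 2 windows that needed the wrap-around
```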

• Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*.

• Get rid of the vertex AA as well.

• So to sequence our segment ATGCGTGGCGTGCA, we need to find a path through E* that starts at AT, ends at CA, and uses every edge exactly once.

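[Figure: the graph E* after deleting edges CAA and AAT and vertex AA; remaining vertices AT, CA, CG, GC, GG, GT, TG and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA]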

• An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once.
• So an Eulerian path is just like an Eulerian cycle, except that we don’t have to start and end at the same vertex.

• Luckily, Euler’s Theorem generalizes, so we can efficiently determine whether a graph has an Eulerian path and then find this path.

• Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all vertices are balanced or exactly two vertices are not balanced, one with one more outgoing edge than incoming edges and the other with one more incoming edge than outgoing edges.
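The degree condition in the theorem is easy to test by machine. The sketch below treats each k-mer as an edge from its first k-1 letters to its last k-1 letters and checks the condition in its usual precise form: either every vertex is balanced, or exactly one vertex has one extra outgoing edge and exactly one has one extra incoming edge. The function names are assumptions for this illustration, and connectivity is assumed rather than checked.

```python
from collections import Counter


def unbalanced_vertices(kmers):
    """Map each unbalanced vertex to (outgoing edges - incoming edges)."""
    diff = Counter()
    for kmer in kmers:
        diff[kmer[:-1]] += 1  # edge leaves its prefix
        diff[kmer[1:]] -= 1   # edge enters its suffix
    return {v: d for v, d in diff.items() if d != 0}


def degree_condition_holds(kmers):
    """True if the degrees allow an Eulerian path (graph assumed connected)."""
    diffs = sorted(unbalanced_vertices(kmers).values())
    return diffs in ([], [-1, 1])


segment = "ATGCGTGGCGTGCA"
kmers = [segment[i:i + 3] for i in range(len(segment) - 2)]

print(unbalanced_vertices(kmers))            # {'AT': 1, 'CA': -1}
print(degree_condition_holds(kmers))         # True
print(degree_condition_holds(["AB", "AB"]))  # False
```

The last line shows why the size of the imbalance matters: with two parallel edges from A to B, exactly two vertices are unbalanced, yet no path can use both edges.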

• So E* must contain an Eulerian path, because AT and CA (the endpoints of our segment) are the only two vertices that aren’t balanced: AT has one extra outgoing edge and CA has one extra incoming edge.

• Hence in every case we have solved our giant puzzle!

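To see the whole pipeline at work on our small example, here is a minimal sketch that finds an Eulerian path and spells out the segment it encodes. It uses a standard stack-based version of Hierholzer's algorithm, which is one common way to find such a path and is not necessarily the method the slides have in mind; the function names are assumptions, and the input is assumed to satisfy Euler's Theorem II.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """Return the vertices of an Eulerian path, one edge per k-mer."""
    graph = defaultdict(list)  # vertex -> list of successor vertices
    diff = defaultdict(int)    # vertex -> outgoing minus incoming edges
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        diff[kmer[:-1]] += 1
        diff[kmer[1:]] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, d in diff.items() if d == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: vertex is finished
    return path[::-1]


def spell(path):
    """Glue overlapping vertices back into one string."""
    return path[0] + "".join(v[-1] for v in path[1:])


segment = "ATGCGTGGCGTGCA"
kmers = [segment[i:i + 3] for i in range(len(segment) - 2)]
path = eulerian_path(kmers)
print(spell(path))
```

A graph can have more than one Eulerian path. Here ATGGCGTGCGTGCA has exactly the same 3-mers as our segment, so in general the output is a string with the right 3-mer composition rather than guaranteed to be the original segment.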

What’s Next?

Personal Genomics: Millions of Human Genomes

• Personal genome sequencing started from sequencing the genomes of a few scientists in 2009 and will soon expand to millions of individuals.

• Thousands of cancer genomes have already been sequenced, and genome sequencing will soon become a routine technique in medicine.

• At the heart of this revolution are bioinformaticians, who must harness precise methods in order to analyze the growing data.

[Image: 10 scientists and entrepreneurs who made their genomes publicly available in 2009]

Genome 10K and Beyond

• 2010: Scientists launch an ambitious project to sequence the genomes of 10,000 species.

• 201x?: We will hopefully be able to reconstruct the “tree of life” and uncover the genomes of ancestors that lived millions of years ago.

• 20xx?: Maybe, just maybe, we will be able to discover why giraffes grew necks and humans grew brains.