Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion...

Post on 26-Dec-2015

223 views 0 download

Tags:

Transcript of Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion...

Genome Reconstruction: A Puzzle With a Billion Pieces

Genome Reconstruction: A Puzzle with a Billion

Pieces

Phillip Compeau & Pavel PevznerUniversity of California-San Diego

Genome Reconstruction: A Puzzle With a Billion Pieces

Outline

1. Introduction to Genome Sequencing

2. The Newspaper Problem

3. DNA Chips: A First Shot at Sequencing with Short Reads

4. Two Mathematical Detours

5. Introduction to Graph Theory

6. Euler’s Theorem

7. ECP vs. HCP and Algorithmic Complexity

8. From Euler and Hamilton to Fragment Assembly

9. De Bruijn and a Final Solution to Fragment Assembly

10. Generalizing Fragment Assembly

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 1: Introduction to Genome Sequencing

Genome Reconstruction: A Puzzle With a Billion Pieces

What Is Genome Sequencing?

• A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A,T,G, and C.• A human genome has roughly 3 billion nucleotides.

• Genome sequencing is the process of determining the sequence of nucleotides that make up a genome.

...CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATATAGCCGAGCGGCTACGATGATGCTAGCTGTACAGCTGATGATCTAGCTATCGATGCGATCGATGCGCGAGTGCGATCGATCACTTCGAGCTAGCTGATCGATCGATGCTAGCTAGCTGACTGATCATGGCGTTAGCTAGCTAGCTGATCGTCGATCGTACGTAGCTGATTACGATCGTCCGATCGTGCTATGACGTACGAGGCGGCTACGTAGCATGCTAGCTGACTGATGTAGCTAGCTATACGATACTATATATTCGATCGATTTATTACCATGACTGACGCGCATCGCTGTACACGTACTAGCTGATCGATGCTAGTCGATCGATCGATCATGTTATATATCGCGGCGCATCGATCGACTGCTCGATTATCGATACGTCGATCGCTGTATATACGTCTTTATAGCTAGGAGCATAGCGACGCGCTATCGATCGATCGTCTAGTCGACTGATCGTACTAGCTGACGCTGACGACTAGCTAGCTATCGACGATCGTAGTGCGATTACTAGCTAGGATCCTACTGTACGTCAGTCAGTCTGATCGATAGCGAGGAAAGCGAGACTGATCGTTCTCTAGATGTAGCTGATGTGACTACTATACTACTGGCAGCGATCGGGA…

Genome Reconstruction: A Puzzle With a Billion Pieces

What Is Genome Sequencing?

• Different people have slightly different genomes: all humans share 99.9% of the same genetic code.

• The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc.

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGTGACTATTATCGACTACAGATGAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Species Sequencing vs. Individual Genome Sequencing

• Species Sequencing: Determine the “consensus genome” of an entire species.

Genome Reconstruction: A Puzzle With a Billion Pieces

Species Sequencing vs. Individual Genome Sequencing

• Individual Sequencing: Determine how an individual differs from its species.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Species genome sequencing:• Compare various species (e.g. human and chimpanzee) to

understand how their genes function (e.g. which genes are important for braindevelopment).

• Reveal evolutionaryrelationships betweenspecies.

• Determine the geneticmakeup of ourevolutionary ancestors.

Why Would We Want to Sequence a Genome?

Genome Reconstruction: A Puzzle With a Billion Pieces

Why Would We Want to Sequence a Genome?

• Individual genome sequencing:• Unearth the genetic basis of many diseases.• Forensics applications.

• Example: In 2010, 6-year old Nicholas Volker became the first human being to be saved because of genome sequencing.• Doctors could not diagnose his condition, which caused

strange infections; he went through nearly 100 surgeries. • Genome sequencing revealed a rare

mutation in a gene linked to a defect inhis immune system.

• This led doctors to use advancedimmunotherapy, which saved the child.

Genome Reconstruction: A Puzzle With a Billion Pieces

Brief History of Genome Sequencing

• Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods.

• 1980: They share the Nobel Prize in Chemistry.

• Still, their sequencing methods were too expensive for large genomes: with a $1 per nucleotide cost, it would cost $3 billion to sequence the human genome.

Walter Gilbert

Frederick Sanger

Genome Reconstruction: A Puzzle With a Billion Pieces

Brief History of Genome Sequencing

• 1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome.

• 1997: Craig Venter founds Celera Genomics, a private firm, with the same goal.

Francis Collins

Craig Venter

Genome Reconstruction: A Puzzle With a Billion Pieces

Brief History of Mammalian Genome Sequencing

• 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.

Genome Reconstruction: A Puzzle With a Billion Pieces

Brief History of Mammalian Genome Sequencing

• 2000s: Many more mammalian genomes are sequenced.

Genome Reconstruction: A Puzzle With a Billion Pieces

The Arrival of Personal Genomics

• 2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude.

• 2010: The market for sequencing machines takes off.• Illumina reduces the cost of sequencing an individual human

genome from $3 billion to $10,000.• Complete Genomics builds a genomic factory in Silicon

Valley that sequences hundreds of genomes per month.• Beijing Genome Institute orders hundreds of sequencing

machines, becoming the world’s largest sequencing center.• 23andMe offers partial genome sequencing for $499.• Many universities introduce new courses in which students

study their own genomes.

Genome Reconstruction: A Puzzle With a Billion Pieces

The Future of Genome Sequencing

• 2010s?: Genome sequencing will hopefully continue to bloom. • The $1,000 human genome may arrive as early as in 2012.• Hopefully, sequencing an individual genome will soon

become as routine as an X-ray.

Genome Reconstruction: A Puzzle With a Billion Pieces

What Makes Genome Sequencing So Difficult?

• When we read a book, we can read the entire book one letter at a time from the beginning to the end.

• However, modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces. • Thus, we can identify very short fragments of DNA (~100

nucleotides long), called reads.• But we have no idea which genomic positions these reads

come from!• We must figure out how to put the reads back together to

assemble a genome.

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 2: The Newspaper Problem and

Genome Sequencing

Genome Reconstruction: A Puzzle With a Billion Pieces

The Newspaper Problem

Genome Reconstruction: A Puzzle With a Billion Pieces

The Newspaper Problem

Genome Reconstruction: A Puzzle With a Billion Pieces

The Newspaper Problem

Genome Reconstruction: A Puzzle With a Billion Pieces

The Newspaper Problem

Genome Reconstruction: A Puzzle With a Billion Pieces

The Newspaper Problem

Genome Reconstruction: A Puzzle With a Billion Pieces

The Newspaper Problem

Genome Reconstruction: A Puzzle With a Billion Pieces

The Newspaper Problem as an “Overlap Puzzle”

• The newspaper problem is not the same as a jigsaw puzzle:• We have multiple copies of the same

edition of a newspaper.• Plus, some pieces of paper got blown to

bits in the explosion.

• Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said.

• This gives us a giant overlap puzzle!

Genome Reconstruction: A Puzzle With a Billion Pieces

• In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper.)

• However, the “language” of DNA remains largely unknown.

Sequencing is Harder than Newspaper Problem

Genome Reconstruction: A Puzzle With a Billion Pieces

Sequencing is Harder than Newspaper Problem

• There are lots of repeated substrings in every genome (50% of human genome is formed by repeats). • Example: GCTT is repeated 4 times in the following:

AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG

• Analogy: The Triazzle puzzle contains lots of repeated figures. This makes it very difficult tosolve (even with just 16 pieces).

Genome Reconstruction: A Puzzle With a Billion Pieces

Sequencing a Genome: Lab + Computation

• Read Generation (Experimental):Generate many reads from multiplecopies of the same genome.

• Fragment Assembly (Computational):Use these reads to algorithmicallyput the genome back together.

Genome Reconstruction: A Puzzle With a Billion Pieces

Sequencing a Genome: Illustration

Multiple (Unsequenced) Genome Copies

Genome Reconstruction: A Puzzle With a Billion Pieces

Sequencing a Genome: Illustration

Multiple (Unsequenced) Genome Copies

Read Generation

Genome Reconstruction: A Puzzle With a Billion Pieces

Sequencing a Genome: Illustration

Multiple (Unsequenced) Genome Copies

Reads

Read Generation

Genome Reconstruction: A Puzzle With a Billion Pieces

Sequencing a Genome: Illustration

Multiple (Unsequenced) Genome Copies

Reads

Read Generation

Fragment Assembly

Genome Reconstruction: A Puzzle With a Billion Pieces

Sequencing a Genome: Illustration

Multiple (Unsequenced) Genome Copies

Reads

Sequenced Genome

…GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…

Read Generation

Fragment Assembly

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 3: DNA Chips: A First Shot at Sequencing

with Short Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: From an Idea to a New Industry

• 1989: Radoje Drmanac, Andrey Mirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation.

• Key Idea: Generate all k-mers (see below) from the genome in the hope that they can be assembled to reconstruct the genome.

• 1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.”• 2000: Arrays are a multi-billion dollar industry Southern

Mirzabekov

Drmanac

k-mer: A string of length k (in an alphabet of 4 nucleotides)

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Implementation

1. Synthesize a distinct k-mer in each of 4k cells in the array.

2. Cover the array with multiple copies of a fluorescently-labeled unknown DNA fragment.

3. DNA will hybridize witha k-mer if it contains thecomplement of that k-mer.

4. Use a spectroscope todetermine which sites emitlight …the complementsof these sites will reveal thek-mers in the unknownDNA fragment = our reads!

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Illustration

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Example

• What are our reads? AAA

AGA

CAA

CGA

GAA

GGA

TAA

TGA

AAC

AGC

CAC

CGC

GAC

GGC

TAC

TGC

AAG

AGG

CAG

CGG

GAG

GGG

TAG

TGG

AAT

AGT

CAT

CGT

GAT

GGT

TATTGT

ACA

ATA

CCA

CTA

GCA

GTA

TCA

TTA

ACC

ATC

CCC

CTC

GCC

GTC

TCC

TTC

ACG

ATG

CCG

CTG

GCG

GTG

TCG

TTG

ACT

ATT

CCT

CTT

GCT

GTT

TCT

TTT

Genome Reconstruction: A Puzzle With a Billion Pieces

CAC

CGC

TGC

CAT

CCA

GCA

GCC

ACG

TTG

ATT

DNA Chips: Example

• What are our reads?

CAT

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Example

• What are our reads?

CAT|||

ATG

CAC

CGC

TGC

CAT

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Example

• What are our reads?

CAT

ATG

CAC

CGC

TGC

CAT

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Example

• What are our reads?

CAT

ATG

CAC

CGC

TGC

CAT

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Example

• What are our reads?

CAT

ATG

CAC

CGC

TGC

CAT

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Example

• What are our reads?

CAT

ATG

CAC

CGC

TGC

CAT

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

DNA Chips: Example

• What are our reads?

• So 3-mer ATG mustoccur in the genome!

ATG

CAC

CGC

TGC

ATG

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

• What are our reads?

CAC GTGCGC GCG•CAT ATGTGC GCAACG CGTATT AATCCA TGGGCA TGCGCC GGCTTG CAA

CAC

CGC

TGC

ATG

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

• What are our reads?• CACCGC GCG• CAT ATG

CAC

CGC

TGC

ATG

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

• What are our reads?• CAC GTGCGC GCG• CAT ATG

GTG

CGC

TGC

ATG

CCA

GCA

GCC

ACG

TTG

ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

CGC

TGC

ATG

CCA

GCA

GCC

ACG

TTG

ATT

• What are our reads?• CAC GTG• CGC• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA TGC• GCC GGC• TTG CAA

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

TGC

ATG

CCA

GCA

GCC

ACG

TTG

ATT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA TGC• GCC GGC• TTG CAA

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

TGC

ATG

CCA

GCA

GCC

ACG

TTG

ATT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

CCA

GCA

GCC

ACG

TTG

ATT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

CCA

GCA

GCC

ACG

TTG

ATT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

CCA

GCA

GCC

CGT

TTG

ATT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

CCA

GCA

GCC

CGT

TTG

ATT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

CCA

GCA

GCC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

CCA

GCA

GCC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

TGG

GCA

GCC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

TGG

GCA

GCC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

TGG

TGC

GCC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA TGC

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

TGG

TGC

GCC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA TGC• GCC

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

TGG

TGC

GGC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA TGC• GCC GGC

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

TGG

TGC

GGC

CGT

TTG

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA TGC• GCC GGC• TTG

Genome Reconstruction: A Puzzle With a Billion Pieces

Red 3-mers Must Occur in the Genome

GTG

GCG

GCA

ATG

TGG

TGC

GGC

CGT

CAA

AAT

• What are our reads?• CAC GTG• CGC GCG• CAT ATG• TGC GCA• ACG CGT• ATT AAT• CCA TGG• GCA TGC• GCC GGC• TTG CAA

Genome Reconstruction: A Puzzle With a Billion Pieces

From Biological Data to Computational Problem

GTG

GCG

GCA

ATG

TGG

TGC

GGC

CGT

CAA

AAT

• Aim: Construct ashortest possible genomecontaining all our reads.

• This is now acomputational problem!

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 4: Two Mathematical Detours

Genome Reconstruction: A Puzzle With a Billion Pieces

The Bridges of Königsberg

• The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks.

Genome Reconstruction: A Puzzle With a Billion Pieces

The Bridges of Königsberg

• They wondered if they could walk through the city, cross each bridge (blue) exactly once, and return where they started.

Genome Reconstruction: A Puzzle With a Billion Pieces

The Bridges of Königsberg

• 1735: Leonhard Euler develops an approach to answer this question for any city, even for a “city” with a million islands.

• We will soon discuss Euler’s method as well as how it applies to genome sequencing. Leonhard Euler

Genome Reconstruction: A Puzzle With a Billion Pieces

The Icosian Game

• Over a century passes…

• 1857: Irish mathematician William Hamilton designs a game consisting of a board representing 20 “islands” connected by “bridges.”

• Goal: find a walk that visits

every island exactly once and returns back where it started.

William Hamilton

Icosian Game

Genome Reconstruction: A Puzzle With a Billion Pieces

Similar Problems with Very Different Fates

• These two stories have something in common: • Find a walk that uses every bridge once

(Konigsberg Bridges Problem) • Find a walk that visits every island once (Hamilton

game)

• However, while Euler solved the first problem (even for a city with a million bridges), mathematicians still do not know how to solve the second problem, even for a city with a thousand islands.

• But where are the genomes???

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 5: Introduction to Graph Theory

Genome Reconstruction: A Puzzle With a Billion Pieces

Graphs

• A graph is a network composed of two sets of objects:• Vertices: each vertex is represented by a point. • Edges: each edge is represented by a

segment connecting two vertices.

• Graph theory can be applied to allkinds of different problems.• Transportation networks• Disease epidemics• Computer viruses spreading through the internet.• And, yes…genome sequencing!

Genome Reconstruction: A Puzzle With a Billion Pieces

Königsberg Bridges Graph

• For the Königsberg Bridge Problem, we create a graph:• Vertices = 4 land masses of the city• Edges = 7 bridges connecting land areas

Note: We don’t need to worry about the exact placement of vertices or the shape of bridges.

Genome Reconstruction: A Puzzle With a Billion Pieces

Icosian Game Graph

• For the Icosian Game, we create a graph:• Vertices = islands• Edges = bridges connecting the islands

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.• If the ant returns where it started, the result of its walk

forms a cycle of G.

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.• If the ant returns where it started, the result of its walk

forms a cycle of G.

“Here I go!”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.• If the ant returns where it started, the result of its walk

forms a cycle of G.

“…He wakes up in the morning…”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.• If the ant returns where it started, the result of its walk

forms a cycle of G.

“…goes to visit his mommy…”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.• If the ant returns where it started, the result of its walk

forms a cycle of G.

“…when all the little ants are marching…”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.• If the ant returns where it started, the result of its walk

forms a cycle of G. “…they all do it the same way…”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Now, consider an ant standing on a vertex of a graph G.

• The ant can walk from vertex to vertex along the edges of G.• If the ant returns where it started, the result of its walk

forms a cycle of G.

“Oh no! I’m back where I started!”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Two questions:

1. Is there a cycle of G in which the ant walks through each edge exactly once?

2. Is there a cycle of G in which the ant walks through each vertex exactly once?

“???!!!”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian and Hamiltonian Cycles

• Two questions:

1. Is there a cycle of G in which the ant walks through each edge exactly once? Eulerian cycle

2. Is there a cycle of G in which the ant walks through each vertex exactly once? Hamiltonian cycle

“I wish someone would name a cycle after me…I’m the one doing all the walking here!”

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

2

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

23

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

23

4

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

23

45

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

23

45

6

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

23

45

6

7

Genome Reconstruction: A Puzzle With a Billion Pieces

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Eulerian Cycles

1

23

45

6

78

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles

1

23

45

6

78

9

• An Eulerian cycle is a cycle that travels to each edge exactly once.• A graph containing such a cycle is called Eulerian.

• If there were a solution to the KönigsbergBridge Problem, then we could find anEulerian cycle in this graph.

• However, no such cycle exists.

• If we add two more edges, there will be such a cycle; see it?

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

• For example, the graph correspondingto the Icosian game is Hamiltonian.

• This means that the Icosian gamehas a solution!

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 2

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

4

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

12

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

15

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

15

16

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

15

1617

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

15

1617

18

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

15

1617

1819

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

15

1617

1819

20

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles

• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.

• A graph containing such a cycle is called Hamiltonian.

1 23

45

6

7

8

9

10

11

1213

14

15

1617

1819

20

Genome Reconstruction: A Puzzle With a Billion Pieces

Finding Eulerian Cycles vs Hamiltonian Cycles

• Given a graph G, we now have two questions that we can program a computer to answer about G.

• Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.

• Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G is not Hamiltonian.

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 6: Euler’s Theorem

Genome Reconstruction: A Puzzle With a Billion Pieces

Euler’s Theorem

• We will now discuss how Euler solved the Königsberg Bridge Problem.• You might guess: He used graph theory!• This is not entirely accurate. A better statement would be:

He invented graph theory!

Genome Reconstruction: A Puzzle With a Billion Pieces

Directed Graphs

• Directed Graph: A graph in which each edge has a direction (represented by an arrow).• You might like to think of directed edges as “one-way

bridges.”

Undirected Graph Directed Graph

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in Directed Graphs

• An Eulerian cycle in a directed graph is simply a cycle that travels down all the edges in the correct direction.

• A directed graph is Eulerian if it contains an Eulerian cycle.

• Is this graph Eulerian? Why?

Genome Reconstruction: A Puzzle With a Billion Pieces

• indegree(v) = the number of edges leading into vertex v.• outdegree(v) = the number of edges leading out of v.

• A graph is balanced if indegree(v) = outdegree(v) for every vertex v.

• Label each vertex v with(indegree(v), outdegree(v))

• This graph isn’t balanced sincesome vertices don’t have equalindegree and outdegree.

Balanced Graphs

(1, 2)

(2, 1)

(1, 0)

(2, 1)

(1, 1)

(0, 2)(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• indegree(v) = the number of edges leading into vertex v.• outdegree(v) = the number of edges leading out of v.

• A graph is balanced if indegree(v) = outdegree(v) for every vertex v.

• Label each vertex v with(indegree(v), outdegree(v))

• Adding some edges makesthe graph balanced.

Balanced Graphs

(2, 2)

(2, 2)

(1, 1)

(2, 2)

(1, 1)

(2, 2)(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

Euler’s Theorem

• Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced.• A graph is connected if for every pair of vertices {u, v}, an

ant can travel either from u to v or from v to u.(2, 2)

(2, 2)

(1, 1)

(2, 2)

(1, 1)

(2, 2)(1, 1)

Not Connected Connected+ Balanced= Eulerian

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 7: ECP vs. HCP and Algorithmic Complexity

Genome Reconstruction: A Puzzle With a Billion Pieces

Solving the ECP

• By Euler’s Theorem, to determine whether G contains an Eulerian cycle, we only need to check if G is balanced.

• So we simply go to each vertex and perform this simple check:• If every vertex is balanced, then G must contain an Eulerian

cycle.• If some vertex is not balanced, then G cannot contain an

Eulerian cycle.

Genome Reconstruction: A Puzzle With a Billion Pieces

Connected + Balanced = Eulerian

(1, 2)

(2, 1)

(1, 0)

(1, 1)

(0, 2)(1, 1)

• Recall our example directed graph from before.

• Here the graph is not balanced, and so it clearly isn’t Eulerian.

(2, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Recall our example directed graph from before.

• Here the graph is not balanced, and so it clearly isn’t Eulerian.

• Adding the edges to make thegraph balanced will mean thatan Eulerian cyclemust exist.

Connected + Balanced = Eulerian

(2, 2)

(2, 2)

(1, 1)

(1, 1)

(2, 2)(1, 1)

1

2

3

7

65

4

89

10

11

(2, 2)

Genome Reconstruction: A Puzzle With a Billion Pieces

Connected + Balanced = Eulerian

• Recall our example directed graph from before.

• Here the graph is not balanced, and so it clearly isn’t Eulerian.

• Adding the edges to make thegraph balanced will mean thatan Eulerian cyclemust exist.

• One vital question remains:Where did this Eulerian cyclecome from?

(2, 2)

(2, 2)

(1, 1)

(2, 2)(1, 1)

1

2

7

65

4

89

10

11(1, 1)

(2, 2)3

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

• Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes.• The ant cannot walk along any edge that has been

previously traversed.• The ant must always walk along

edges in the legal direction.

(2, 2)

(2, 2)

(1, 1)

(2, 2)(1, 1)

(1, 1)

(2, 2)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes.• The ant cannot walk along any edge that has been

previously traversed.• The ant must always walk along

edges in the legal direction.

• At each step, we updatethe remaining indegree andoutdegree of each vertex.

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(0, 1)

(2, 1)(1, 1)

(1, 1)

(2, 2)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes.• The ant cannot walk along any edge that has been

previously traversed.• The ant must always walk along

edges in the legal direction.

• At each step, we updatethe remaining indegree andoutdegree of each vertex.

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 2)

(0, 0)

(2, 1)(1, 1)

(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes.• The ant cannot walk along any edge that has been

previously traversed.• The ant must always walk along

edges in the legal direction.

• At each step, we updatethe remaining indegree andoutdegree of each vertex.

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(0, 0)

(2, 1)(1, 1)

(0, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes.• The ant cannot walk along any edge that has been

previously traversed.• The ant must always walk along

edges in the legal direction.

• At each step, we updatethe remaining indegree andoutdegree of each vertex.

• Cycle! But not Eulerian yet…

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(0, 0)

(1, 1)(1, 1)

(0, 0)

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(0, 0)

(1, 1)(1, 1)

(0, 0)

• Let’s cut out the cycle that the ant has found.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(1, 1)(1, 1)

• Let’s cut out the cycle that the ant has found.

(0, 0)

(0, 0)

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(1, 1)(1, 1)

• Let’s cut out the cycle that the ant has found.

• Next delete vertices that are no longer connected to anything.

(0, 0)

(0, 0)

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(1, 1)(1, 1)

• Let’s cut out the cycle that the ant has found.

• Next delete vertices that are no longer connected to anything.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

(2, 2)

(2, 2)

(1, 1)

(1, 1)(1, 1)

• Again, let the ant walk through the graph however it chooses.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means thatthe ant can never “get stuck”at a vertex along the way,because it will always have anedge leading out of anyvertex that it enters.

Making an Eulerian Cycle from a Balanced Graph

(1, 2)

(2, 2)

(1, 1)

(1, 0)(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means thatthe ant can never “get stuck”at a vertex along the way,because it will always have anedge leading out of anyvertex that it enters.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 2)

(1, 1)

(1, 0)(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means thatthe ant can never “get stuck”at a vertex along the way,because it will always have anedge leading out of anyvertex that it enters.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)

(0, 1)

(1, 0)(1, 1)

“I really don’t see how this is going to give us an Eulerian cycle in the original graph…I knew I shouldn’t have left the house this morning!”

Genome Reconstruction: A Puzzle With a Billion Pieces

• Again, let the ant walk through the graph however it chooses.

• We always start with a balanced graph, which means thatthe ant can never “get stuck”at a vertex along the way,because it will always have anedge leading out of anyvertex that it enters.

• Cycle! But still not Eulerian…

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)

(0, 0)

(0, 0)(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)

(0, 0)

(0, 0)(1, 1)

• Let’s trim out this cycle one more time.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)(1, 1)

“Hmph! Dragged halfway across the screen…I guess I don’t have any say in the matter…”

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that theant can walk through the graph.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(1, 1)(1, 1)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that theant can walk through the graph.

Making an Eulerian Cycle from a Balanced Graph

(1, 1)

(0, 1)(1, 0)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that theant can walk through the graph.

Making an Eulerian Cycle from a Balanced Graph

(0, 1)

(0, 0)(1, 0)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that theant can walk through the graph.

• Cycle! And Eulerian to boot…sowe have run out of edges.

Making an Eulerian Cycle from a Balanced Graph

(0, 0)

(0, 0)(0, 0)

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s trim out this cycle one more time.

• The ant is stranded, so let’s move it to a vertex.

• Now there’s only one way that theant can walk through the graph.

• Cycle! And Eulerian to boot…sowe have run out of edges.

• What do we do now?

Making an Eulerian Cycle from a Balanced Graph

(0, 0)

(0, 0)(0, 0)

“Yes! What DO we do now?”

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s bring back our original graph.

Making an Eulerian Cycle from a Balanced Graph

Genome Reconstruction: A Puzzle With a Billion Pieces

• Let’s bring back our original graph.

• Highlight the three cycles that the ant found.

Making an Eulerian Cycle from a Balanced Graph

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

Genome Reconstruction: A Puzzle With a Billion Pieces

• Start at the ant’s original position, and follow the green cycle.

• Cycle formed: we can continue along the blue cycle.

• Cycle formed; however, wenow have no new edgesto follow!

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

8

“???”

Genome Reconstruction: A Puzzle With a Billion Pieces

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

8

“Backtracking? But I’m not evolved to walk backwards!”

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

7

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.7

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.7

8

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

• Rejoin the blue cycle…

7

89

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

• Rejoin the blue cycle…

7

89

10

“I smell something good!”

Genome Reconstruction: A Puzzle With a Billion Pieces

Making an Eulerian Cycle from a Balanced Graph

1

2

3

4

5 6

• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.

• Success! Now let’s follow the orange cycle.

• Rejoin the blue cycle…

• And we have the sameEulerian cycle from before!

7

89

10

11

“Yay! Now can I go home please?”

Genome Reconstruction: A Puzzle With a Billion Pieces

What’s the Big Deal?

• The great thing about this method is that it can be easily generalized to any balanced graph to give an Eulerian cycle.

• “Yeah, but this Eulerian cycle wasn’t that hardto find anyway! So why shouldwe care about the method?”

• Think about trying toeyeball an Eulerian cyclein a graph containingbillions of edges. Not so easy…

1

2

3

4

5 6

7

89

10

11

Genome Reconstruction: A Puzzle With a Billion Pieces

What’s the Big Deal?

• More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer.

• Example: A modern computer canfind an Eulerian cycle in abalanced graph containingbillions of edges in undera minute!

1

2

3

4

5 6

7

89

10

11

Genome Reconstruction: A Puzzle With a Billion Pieces

What’s the Big Deal?

• “Yeah, but computers are supermachines! They don’t really need 300-year old mathematics to help them solve problems. Aren’t they going to take over the world anyway?”

• So let’s examine the case of finding a Hamiltonian cycle…

Genome Reconstruction: A Puzzle With a Billion Pieces

Searching for an Efficient Algorithm for HCP

• Key Point: No one has ever founda similar efficient test to determinewhether a graph is Hamiltonian.

• Of course, we could examine everypossible (ant) walk through thegraph to solve the HCP.

• However, this brute force approachis just not efficient: there are morewalks through a graph on just 1,000vertices than there are atoms in the universe!

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

• In fact, the HCP has been classified as NP-Complete.

• In laymen’s terms, this means that the HCP belongs to a collection containing thousands of computational problems that cannot be solved quickly for large input data sets.

• NP-Complete problems are all equivalent to each other: find an efficient solution to one, and you have an efficient solution to them all.

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

“I can't find an efficient algorithm, I guess I'm just too dumb.”

From Garey and Johnson. Computers and Intractability. 1979

• Attempting to solve any NP-Complete problem is difficult.

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

“I can't find an efficient algorithm, because no such algorithm is possible.”

• Attempting to solve any NP-Complete problem is difficult.

• The hope is that you could verify that you failed because an efficient algorithm to an NP-Complete problem doesn’t exist.

From Garey and Johnson. Computers and Intractability. 1979

Genome Reconstruction: A Puzzle With a Billion Pieces

NP-Complete Problems

“I can't find an efficient algorithm, but neither can all these famous people.”

• Attempting to solve any NP-Complete problem is difficult.

• The hope is that you could verify that you failed because an efficient algorithm to an NP-Complete problem doesn’t exist.

• The present state of affairs is somewhere in between.

From Garey and Johnson. Computers and Intractability. 1979

Genome Reconstruction: A Puzzle With a Billion Pieces

The NP-Completeness of the HCP

• The question of whether or not NP-Complete problems (including the HCP) can be solved efficiently is one of seven Millennium Problems in mathematics.

• Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will get $1 million.

• However, if you become amathematician, odds are that you arenot in it for the $$$...recently, GrigoryPerelman solved one of theseproblems but turned down the prize.

Grigory Perelman, True Legend

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 8: From Euler and Hamilton to Fragment

Assembly

Genome Reconstruction: A Puzzle With a Billion Pieces

Simplifying Assumptions for Fragment Assembly

1. Every k-mer occurring in the genome is generated by some read.

2. Reads are error-free.

3. Every k-mer occurring in the genome occurs exactly once.

4. The underlying genome consists of a single circular-shaped chromosome.

• Note: In the final section, we will relax these assumptions.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

GTGGCG GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

GTGGCG GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

GCG GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

GCA

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG GCGGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

ATG

TGG TGC

GGC

CGT CAA

AAT

GTG GCGGCAATG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

TGG TGC

GGC

CGT CAA

AAT

GTG GCGGCAATG TGG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

TGC

GGC

CGT CAA

AAT

GTG GCGGCAATG TGG TGC

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

GGC

CGT CAA

AAT

GTG GCGGCAATG TGG TGCGGC

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

CGT CAA

AAT

GTG GCGGCAATG TGG TGCGGCCGT

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

CAA

AAT

GTG GCGGCAATG TGG TGCGGCCGT CAA

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

AAT

GTG GCGGCAATG TGG TGCGGCCGT CAAAAT

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every readdetected by our array.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• Create a vertex for every k-merdetected by our array.• Prefix: First k – 1 nucleotides of a k-mer (CAA)• Suffix: Last k – 1 nucleotides of a k-mer (CAA)

• Different 3-mers may share a prefix/suffix: ATG, TGA, CTG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

First Try: The Graph H

• As for the edges of H, connect vertex v to vertex w with a directed edge if suffix of v matches the prefix of w.

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG

ATGGenome:

T

G

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG

ATGGGenome:

T

G

G

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC

ATGGCGenome:

T

G

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG

ATGGCGGenome:

T

G

G

CG

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT

ATGGCGTGenome:

T

G

G

CG

T

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG

ATGGCGTG Genome:

T

G

G

CG

T

G

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC

ATGGCGTGC Genome:

T

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA

ATGGCGTGCAGenome:

T

G

G

CG

T

G

C

AA

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA

ATGGCGTGCAAGenome:

AT

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT

ATGGCGTGCAATGenome:

AT

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATG

Genome:

AT

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATG

Genome:

AT

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATG

Genome:

AT

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATGGenome:

AT

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Hamiltonian Cycles in H

• Here we have a Hamiltonian cycle in H:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT ATG

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCA

Genome:

AT

G

G

CG

T

G

C

A

Genome Reconstruction: A Puzzle With a Billion Pieces

Problem with H

• Ultimately, we must solve the HCP on H in order to find a candidate DNA sequence…

• This idea motivated the method usedfor assembling the human genomefrom 50 million (long and expensive)reads in 2000, but the computational strain was overwhelming: sequencing the human genome took several computers a period of months, working around the clock.

• For that matter, newer sequencing technologies produce billions of (short and inexpensive) reads: we need a new idea.

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TGTGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GCTGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CATGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAATTGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAATTGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAATTGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

GT

TG GC

CG

CAAT

GG

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

GTG

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

GCGGTG

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

GCGGTG

GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

GCGGTG

GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

TGGGCGGTG

GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

TGGGCGGTG

TGC GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

TGG GGCGCGGTG

TGC GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAA

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Second Try: The Graph E

• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex

w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.

CAGC

CG

TG

GT

GG

AT

AA

TGCGGCCGTCAAAAT

GTGGCGGCAATGTGG

Reads

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

GCA

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

9

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT• This is the same sequence

of 3-mers that we had in H!ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATG

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT• This is the same sequence

of 3-mers that we had in H!• Thus we will obtain the same

sequenced genome as before.

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCAATGGenome:

A TG

GCGT

G

CA

Genome Reconstruction: A Puzzle With a Billion Pieces

Eulerian Cycles in E

• We have an Eulerian cycle in E:• ATG TGG GGC GCG CGT GTG TGC

GCA CAA AAT• This is the same sequence

of 3-mers that we had in H!• Thus we will obtain the same

sequenced genome as before.

ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATGATGGCGTGCA

Genome:

A TG

GCGT

G

CA

Genome Reconstruction: A Puzzle With a Billion Pieces

Analysis of E

• Good News: We now only have to find an Eulerian cycle in the graph E, which could be done on this computer.

• Bad News:

1. There may be more than one Eulerian cycle in E.• We won’t discuss this issue here, but it can be resolved.

2. How do we know that E even has an Eulerian cycle?• By Euler’s Theorem, we only need to show that E is a

balanced graph.• To do this, we need one more piece of mathematical

history…

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 9: De Bruijn and Fragment Assembly

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• 1946: The Dutch mathematician Nicolaas de Bruijn asks: can we design a circular superstring of minimal length that contains every binary string of length k?

• Example for k = 3. The circular superstring ‘00011101’ contains all eight binary strings of length 3. We illustrate the locations of ‘000’ and ’110’ on the string.

Nicolaas de Bruijn

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• De Bruijn introduced a special class of graph B(n, k):• Vertices = all nk – 1 possible (k – 1)-mers in n-letter alphabet.• An edge connects v to w

if there is a k-merwhose prefix = v andwhose suffix = w.

• At right is B(2, 4),assuming that ouralphabet contains 0and 1.

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• For any choice of n and k, B(n, k) must be balanced/Eulerian.

• Why? Because both the indegreeand the outdegree of everyvertex is equal to the sizeof the alphabet (n), sinceevery (k – 1)-mer willoccur as the prefix orsuffix of n different k-mers.

• Red numbers show the orderof edges in an Eulerian cycle.

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

De Bruijn’s Question

• The graph E we have constructed is contained in the graph B(4, k).• We have n = 4 since there are four possible nucleotides.

• E must be balanced/Eulerian too!• The indegree and outdegree

of any (k – 1)-mer vertexboth equal howmany timesthis (k - 1)-merappears in thegenome.

3

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGC GCA

CAAAAT

1

2

4

5

6

7 8

910

ATGGCGTGCAGenome:

Genome Reconstruction: A Puzzle With a Billion Pieces

Section 10: Generalizing Fragment Assembly

Genome Reconstruction: A Puzzle With a Billion Pieces

Simplifying Assumptions for Fragment Assembly

• Recall the assumptions we have already made:

1. Every k-mer occurring in the genome is generated by some read.

2. Reads are error-free.

3. Every k-mer occurring in the genome occurs exactly once.

4. The underlying genome consists of a single circular-shaped chromosome.

• Our aim is to relax each of these assumptions and determine how the problem changes.

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 1: Generating (nearly) all k-mers

• 100-nucleotide reads generated by Illumina sequencing technology capture only a small fraction of 100-mers from the genome (even for high-coverage sequencing projects), thus violating this key assumption of the de Bruijn graphs.

• However, if we break these reads into shorter k-mers, the resulting k-mers often represent nearly all k-mers from the genome for sufficiently small k.

• For example, modern assemblers often break every 100-nucleotide read into 46 overlapping 55-mers and further assemble the resulting 55-mers using de Bruijn graphs.

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:

• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.

Assumption 1: Generating (nearly) all k-mers

ATGCAAGCTAGCT

ATGCAA CAAGCT CTAGCTATGC CT

Reads

Genome

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 2: Handling Errors in Reads

• What happens to the graph E when some reads have errors?

• Example: Say our graph E for genome ATGGCGTGCAATG should look like this.

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 2: Handling Errors in Reads

• What happens to the graph E when some reads have errors?

• Example: Say our graph E for genome ATGGCGTGCAATG should look like this.• If read TGGCGTG is mistakenly sequenced as TGGAGTG ,

then the graph will look like this instead.• This is called a bulge in the graph E.

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 2: Handling Errors in Reads

• Most reads have errors, resulting in millions of bulges in E.

• 2004: Pevzner et al. provide algorithm for bulge removal.

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• The genome ACGTACGT has only four 3-mers: ACG, CGT, GTA, and TAC.

• We would obtain the graph E below and reconstruct thisgenome as: ACGT

• In other words, we can’t representrepeated k-mers in the genome!

AC CG

GTTA

TAC

ACG

CGT

GTA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Define the multiplicity of a k-mer as the number of times it occurs in a genome.

• We will add edges to E in order to form a new graph E* for which the number of edges connecting two vertices represents the multiplicity of the k-mer on that edge.

• An Eulerian cycle in E* still gives a candidate genome.

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 3: Handling Repeated k-mers

• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA

• Multiplicity 2: GCG, CGT,GTG, TGC

• We reflect multiplicities as

multiple edges • Candidate genome:

• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

ATGCGTGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Determining k-mer multiplicities

• How can we find the multiplicity of a k-mer in the genome?

• The multiplicity of a k-mer willbe directly related to thefrequency with which thatk-mer occurs in our reads.

• So a k-mer thatappears 5 times inthe genome isexpected to occur 5 timesas often in our reads.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• The genomes for all complex organisms are split across a number of linear chromosomes (46 in humans).

• So in order to sequence thehuman genome, geneticistssimply sequenced all of theselinear chromosomes.

• Question: How do we sequencea linear segment of DNA?

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Say our linear DNA segment is ATGCGTGGCGTGCA.

• Then the 3-mers from this segment are the same as for the circular segment before, but the segment doesn’t “wrap around,” so we will lose two 3-mers:• CAA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Say our linear DNA segment is ATGCGTGGCGTGCA.

• Then the 3-mers from this segment are the same as for the circular segment before, but the segment doesn’t “wrap around,” so we will lose two 3-mers:• CAA• AAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

CAAAAT

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*.

• Get rid of the vertex AA as well.

CAGC

CG

TG

GT

GG

AT

AA

ATG

TGG GGCGCG

CGT

GTG

TGCGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*.

• Get rid of the vertex AA as well.

CAGC

CG

TG

GT

GG

ATATG

TGG GGCGCG

CGT

GTG

TGCGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*.

• Get rid of the vertex AA as well.

• So to sequence our segmentATGCGTGGCGTGCA,we need to find apath through E* thatstarts with AT, ends at CA,and uses every edge in between.

CAGC

CG

TG

GT

GG

ATATG

TGG GGCGCG

CGT

GTG

TGCGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once.• So an Eulerian path is just like an Eulerian cycle, except

that we don’t have to start and end at the same vertex.

• Luckily, Euler’s Theorem generalizes to efficiently determine whether a graph has an Eulerian path and then find this path.

• Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all vertices are balanced or exactly two vertices are not balanced.

Genome Reconstruction: A Puzzle With a Billion Pieces

Assumption 4: From Circular to Linear Genomes

• Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all vertices are balanced or exactly two vertices are not balanced.

• So E* must contain anEulerian path, because ATand CA (the endpoints ofour segment) are theonly two verticesthat aren’t balanced.

• Hence in every case we have solved our giant puzzle!

CAGC

CG

TG

GT

GG

ATATG

TGG GGCGCG

CGT

GTG

TGCGCA

Genome Reconstruction: A Puzzle With a Billion Pieces

What’s Next?

Genome Reconstruction: A Puzzle With a Billion Pieces

Personal Genomics: Millions of Human Genomes

• Personal genome sequencing startedfrom sequencing the genomes of afew scientists in 2009 and will soonexpand to millions of individuals.

• Thousands of cancer genomes havealready been sequenced, and genomesequencing will soon become aroutine technique in medicine.

• At the heart of this revolution are bioinformaticians, who must harness precise methods in order to analyze the growing data.

10 scientists and entrepreneurs who made their genomes publicly available in 2009

Genome Reconstruction: A Puzzle With a Billion Pieces

Genome 10K and Beyond

• 2010: Scientists launch anambitious project to sequence10,000 species genomes.

• 201x?: We will hopefullybe able to reconstruct the“tree of life” and uncover thegenomes of ancestors thatlived millions of years ago.

• 20xx?: Maybe, just maybe, we will be able to discover why giraffes grew necks and humans grew brains.