Sebastian Schmeier 2016 - Amazon S3...2016 • De novo genome assembly • The fragment assembly...

Sebastian Schmeier
http://sschmeier.com/bioinf-workshop/
2016

•  De novo genome assembly
•  The fragment assembly problem
•  Shortest superstring problem
•  Seven bridges of Königsberg
•  Assembly as a graph theoretical problem
•  We construct a de Bruijn graph
•  Underlying assumptions of genome assemblies

•  The process of generating a new genome sequence from NGS genome sequence reads based on assembly algorithms

•  Assembly involves joining short sequence fragments together into long pieces – contigs

[Figure: genome → sequencing → reads]

[Figure: the reads ATGCG, GCGTG, GTGGC, and TGGCA drawn as vertices, with edges connecting overlapping reads]

Reads: ATGCG, GCGTG, GTGGC, TGGCA
Genome: ATGCGTGGCA

•  Given: A set of reads (strings) {s1, s2, … , sn}
•  Do: Determine a large string s that “best explains” the reads

•  What do we mean by “best explains”?
•  What assumptions might we require?

•  Objective: Find a string s such that
§  all reads s1, s2, … , sn are substrings of s
§  s is as short as possible

•  Assumptions:
§  Reads are 100% accurate
§  Identical reads must come from the same location on the genome
§  “best” = “simplest”
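The objective above can be illustrated with a small sketch. It uses a greedy merge of the best-overlapping pair, which is a common heuristic and not an exact shortest-superstring solver. The function names `overlap` and `greedy_superstring` are hypothetical, and the sketch assumes error-free reads with no read contained inside another.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Repeatedly merge the pair of strings with the largest overlap."""
    strings = list(dict.fromkeys(reads))  # identical reads are kept once
    while len(strings) > 1:
        # pick the ordered pair (a, b) with the largest suffix/prefix overlap
        n, a, b = max(
            ((overlap(a, b), a, b) for a, b in permutations(strings, 2)),
            key=lambda t: t[0],
        )
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[n:])  # merge, writing the shared part once
    return strings[0]


reads = ["ATGCG", "GCGTG", "GTGGC", "TGGCA"]
result = greedy_superstring(reads)
print(result)
```

For the four example reads the greedy merge happens to recover a string that contains every read. In general a greedy merge is not guaranteed to find the shortest superstring.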

[Figure: the reads ATGCG, GCGTG, GTGGC, and TGGCA joined in order along their overlaps, giving ATGCGTGGCA]

•  The assumption is that all substrings are represented
•  Even modern sequencers that generate 100nt reads do not cover all possible 100-mers

Reads: ATGCG, GCGTG, GTGGC, TGGCA

•  Thus, people generally use k-mers of a certain length
§  Here we use 3-mers by cutting the original reads into reads of length 3

3-mers:
ATG, TGC, GCG
GCG, CGT, GTG
GTG, TGG, GGC
TGG, GGC, GCA

Make them unique: ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA

Draw an edge from x to y where the suffix of x overlaps the prefix of y.

[Figure: graph with the unique 3-mers ATG, TGC, GCG, CGT, GTG, TGG, GGC, and GCA as vertices]

Find a Hamiltonian path, that is, a path that visits every vertex exactly once. Record the first letter of each vertex except the last + all letters of the last vertex.

ATGCGTGGCA
ATGGCGTGCA

•  UNFORTUNATELY: The Hamiltonian path problem is very difficult to solve (NP-complete)
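The steps on this slide can be sketched in a few lines. This is a brute-force illustration that only works for tiny inputs, since it tries every path. The function names are hypothetical, and an edge is assumed to mean that the last k-1 letters of x equal the first k-1 letters of y.

```python
def kmers(read, k):
    """Cut a read into all of its overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def hamiltonian_paths(vertices):
    """Yield every path that visits each vertex exactly once (brute force)."""
    # edge x -> y when the (k-1)-suffix of x equals the (k-1)-prefix of y
    succ = {x: [y for y in vertices if y != x and x[1:] == y[:-1]]
            for x in vertices}

    def extend(path, seen):
        if len(path) == len(vertices):
            yield list(path)
            return
        for y in succ[path[-1]]:
            if y not in seen:
                path.append(y)
                seen.add(y)
                yield from extend(path, seen)
                path.pop()
                seen.remove(y)

    for start in vertices:
        yield from extend([start], {start})


def spell(path):
    """First letter of every vertex except the last, then the whole last vertex."""
    return "".join(v[0] for v in path[:-1]) + path[-1]


reads = ["ATGCG", "GCGTG", "GTGGC", "TGGCA"]
unique = list(dict.fromkeys(km for r in reads for km in kmers(r, 3)))
genomes = sorted(spell(p) for p in hamiltonian_paths(unique))
print(unique)
print(genomes)
```

The search finds two Hamiltonian paths, so even this small example has more than one candidate genome.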

•  In 1735 Leonhard Euler was presented with the following problem:
§  Find a walk through the city that would cross each bridge once and only once
§  He proved that a connected graph with undirected edges contains an Eulerian cycle exactly when every node in the graph has an even number of edges touching it.
§  For the Königsberg Bridge graph, this is not the case because each of the four nodes has an odd number of edges touching it, and so the desired stroll through the city does not exist.

https://en.wikipedia.org/wiki/Leonhard_Euler
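Euler's degree condition can be checked directly. The sketch below encodes the seven bridges as an undirected multigraph. The labels `north`, `south`, `island`, and `east` are made-up names for the four land masses, and the graph is assumed to be connected (it is here).

```python
from collections import Counter

# Each pair is one bridge between two land masses (labels are illustrative).
bridges = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"),
    ("island", "east"),
]

# Count how many bridge ends touch each land mass.
degree = Counter()
for a, b in bridges:
    degree[a] += 1
    degree[b] += 1

odd_nodes = sorted(v for v, d in degree.items() if d % 2 == 1)
has_eulerian_cycle = not odd_nodes  # requires every degree to be even

print(dict(degree))
print(odd_nodes, has_eulerian_cycle)
```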

[Figure: directed graph with vertices A, B, C, and D; labels: Vertex, Directed edge]

•  The degree of a vertex: # of edges connected to it
•  outdegree: # of outgoing edges
•  indegree: # of ingoing edges
•  degree(B)?
•  outdegree(B)?
•  indegree(D)?

•  The case of directed graphs is similar:
§  A graph in which indegrees are equal to outdegrees for all nodes is called 'balanced'.
§  Euler's theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced.

•  Mathematically/computationally, finding an Eulerian path is much easier than finding a Hamiltonian path
→  we need to reformulate our assembly problem
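As a small illustration of the definitions above, indegree, outdegree, and the 'balanced' property can be computed from an edge list. The two example graphs are hypothetical and are not the graph drawn on the slides; connectivity is assumed and not checked.

```python
from collections import Counter


def degrees(edges):
    """Return (indegree, outdegree) counters for a list of directed edges."""
    indeg, outdeg = Counter(), Counter()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return indeg, outdeg


def is_balanced(edges):
    """True when indegree equals outdegree at every vertex."""
    indeg, outdeg = degrees(edges)
    nodes = set(indeg) | set(outdeg)
    return all(indeg[v] == outdeg[v] for v in nodes)


# Hypothetical graphs: a directed 4-cycle, and the same cycle plus one extra edge.
cycle = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
with_chord = cycle + [("B", "D")]

print(is_balanced(cycle))       # every vertex has one edge in and one edge out
print(is_balanced(with_chord))  # B gains an outgoing edge, D an incoming edge
```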

We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers

1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x (e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers: AT, TG, GG, GC, CA, CG, GT

[Figure: de Bruijn graph on these vertices, with edges AT→TG (ATG), TG→GG (TGG), TG→GC (TGC), GT→TG (GTG), GG→GC (GGC), GC→CA (GCA), GC→CG (GCG), CG→GT (CGT)]
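The two construction steps can be written down almost literally. This is a minimal sketch; the function name `de_bruijn_graph` and the choice of a dictionary of labelled edges are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Build nodes ((k-1)-mers) and labelled directed edges (k-mers)."""
    nodes = set()
    edges = defaultdict(list)  # prefix -> list of (suffix, k-mer label)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        nodes.update((prefix, suffix))         # step 1: a node per prefix/suffix
        edges[prefix].append((suffix, kmer))   # step 2: edge prefix -> suffix
    return sorted(nodes), dict(edges)


kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
nodes, edges = de_bruijn_graph(kmers)

print(nodes)
for src, targets in edges.items():
    for dst, label in targets:
        print(f"{src} -> {dst} ({label})")
```

Each k-mer contributes exactly one edge, so the graph has eight edges on seven vertices.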

Can we find a DNA sequence containing all k-mers?
→  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
→  Eulerian path
•  a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1
•  a connected graph has an Eulerian path if and only if it contains at most two semibalanced vertices and all other vertices are balanced

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers: AT, TG, GG, GC, CA, CG, GT

[Figure: the de Bruijn graph for these k-mers, with two Eulerian paths traced]

ATGGCGTGCA
ATGCGTGGCA
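One way to find such a path is Hierholzer's algorithm, sketched below. The slides do not name an algorithm, so this choice is an assumption, as are the function names. The sketch does not verify that an Eulerian path exists; it assumes a connected graph that satisfies the semibalanced condition.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def degree_balance(graph):
    """outdegree minus indegree for every vertex."""
    balance = defaultdict(int)
    for v, targets in graph.items():
        balance[v] += len(targets)
        for w in targets:
            balance[w] -= 1
    return dict(balance)


def eulerian_path(graph):
    """Hierholzer's algorithm: use every edge exactly once."""
    balance = degree_balance(graph)
    # start where one more edge leaves than enters, if such a vertex exists
    start = next((v for v, b in balance.items() if b == 1), next(iter(balance)))
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit vertex
    return path[::-1]


kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = de_bruijn(kmers)
semibalanced = sorted(v for v, b in degree_balance(graph).items() if abs(b) == 1)
path = eulerian_path(graph)
genome = path[0] + "".join(v[-1] for v in path[1:])
print(semibalanced, genome)
```

Which of the two valid sequences is returned depends on the order in which edges are followed.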

•  Four hidden assumptions that do not hold for next-generation sequencing. We took for granted that:
1.  we can generate all k-mers present in the genome
2.  all k-mers are error free
3.  each k-mer appears at most once in the genome
4.  the genome consists of a single chromosome

Assumptions 1 and 2:
•  For these reasons we do NOT choose the longest possible k-mer
•  The smaller the k-mer, the higher the probability that we see all k-mers
•  Errors:

[Figure: the sequence ATGGCGTGCA cut into k-mers. K=3: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA; an error leaves most k-mers unaffected. K=10: ATGGCGTGCA; an error affects 100% of the k-mers]
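The effect of k on error sensitivity can be made concrete by introducing one substitution and counting the k-mers that change. The error position (index 4) and the helper names are assumptions chosen for illustration.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def affected_fraction(seq, pos, k):
    """Fraction of k-mers that change when the base at pos is miscalled."""
    wrong = "A" if seq[pos] != "A" else "C"   # any base different from the true one
    mutated = seq[:pos] + wrong + seq[pos + 1:]
    original, erroneous = kmers(seq, k), kmers(mutated, k)
    changed = sum(a != b for a, b in zip(original, erroneous))
    return changed / len(original)


seq = "ATGGCGTGCA"
small_k = affected_fraction(seq, pos=4, k=3)
large_k = affected_fraction(seq, pos=4, k=10)
print(small_k, large_k)
```

With k=3, only the three 3-mers covering the miscalled base change (3 of 8); with k=10, the single 10-mer is lost.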

Each k-mer appears at most once in the genome → repeats
•  This is most often not true
•  This is known as k-mer multiplicity

k-mers: ATG, GCA, TGC, TGC, TGC, GTG, GTG, GCG, GCG, CGT, CGT

Distinct (k-1)-mers: AT, TG, GC, CA, CG, GT

[Figure: de Bruijn graph on these vertices with edges ATG, TGC, GCG, CGT, GTG, and GCA]

ATGCGTGCGTGCA
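k-mer multiplicity for the example sequence ATGCGTGCGTGCA can be tallied with a counter. This is a minimal sketch; the function name `kmer_counts` is hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count how often each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kmer_counts("ATGCGTGCGTGCA", 3)
repeated = {kmer: n for kmer, n in counts.items() if n > 1}

print(dict(counts))
print(repeated)
```

Any k-mer with a count above one corresponds to several parallel edges between the same pair of vertices, which is what makes repeats ambiguous for the assembler.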

References

How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler. Nature Biotechnology 29, 987–991 (2011). doi:10.1038/nbt.2023. Published online 08 November 2011.

Sequence Assembly. Lecture by Mark Craven. BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011.