Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...

Lecture: Bioinformatics

ENS Sacley, 2018

Lecture Overview

Mathias Weller

[email protected] Bulteau

[email protected]

- Introduction

I Molecular BiologyI SequencingI AssemblyI Scaffolding

- Phylogenetics

. . .

- Genome Rearrangements

. . .

- Scaffold Filling

2/27

[email protected]

[email protected]

Molecular Biology Basics

DNA

- double strand

- nucleotides paired: A–T, C–G

- inside nucleus (eucaryotes)

Polymerase

single strand → double strand

3/27

Molecular Biology Basics

Chromosomes

- haploid = 1 set of chromosomes

- diploid = 2 sets of chromosomes

(usually one from each parent)

- procaryotes one circular chromosome, haploid

- eucaryotes set of chromosomes, diploid

44 90 16 46 46

4/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Mutation

..AATCGCTAA..

..AATCCTAA..


- DNA damage


- insertion,

deletion, substitution

Replication Error

- DNA rearrangement


- duplication,



5/27

Mutation

..AATCCTAA..

..AATCGCTAA..


- DNA damage


- insertion, deletion,

substitution

Replication Error

- DNA rearrangement


- duplication,



5/27

Mutation

..AATCGCTAA..

..AATCACTAA..


- DNA damage



Replication Error

- DNA rearrangement


- duplication,



5/27

Mutation


- DNA damage



Replication Error

- DNA rearrangement


- duplication,



5/27

Mutation


- DNA damage



Replication Error

- DNA rearrangement


- duplication, deletion (loss),

translocation,


5/27

Mutation


- DNA damage



Replication Error

- DNA rearrangement


- duplication, deletion (loss), translocation,


5/27

Mutation


- DNA damage



Replication Error

- DNA rearrangement



inversion,

crossover

5/27

Mutation


- DNA damage



Replication Error

- DNA rearrangement




5/27

mRNA, Introns, Exons, Genes

mRNA

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene

Gene = START. . . STOP (including

introns) (103-105bp)

6/27

mRNA, Introns, Exons, Genes

6/27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27


CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing






Problem





7 /27



GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*

GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing






Problem





7 /27




GGA*GGACCTGCCCA*

GGACCTGCCCAGTCTGTA*

Sanger Sequencing






Problem





7 /27




GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing






Problem





7 /27

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27


ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG












8/27


ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC


TGGTACTCA......ACCTCTCAGACCTCTCAG












8/27


ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTC


TGGTACTCA......ACCTCTCAGACCTCTCAG












8/27


ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAG


ACCTCTCAG












8/27


ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA











8/27


ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA











8/27

Third-Generation Sequencing

Single Molecule Real Time Sequencing

1. fix a polymerase enzyme under a microscope

2. attach fluorescent molecule to each nucleotide

3. polymerase clips off fluorescent molecule when attaching a base

4. observe change in fluorescence identify base

Nanopores

1. “pore” of diameter ≈ 1nm

2. only unfolded DNA molecule can pass through

3. detect fluctuation in electric field as bases pass through pore

9/27

Conclusion: Sequencing

Sanger

cost: 20ct/kbp

reads: 102-103kbp

errors: 0.01-1%

2nd Gen

cost: 4ct/kbp

reads: 50-500PE

errors: 0.1-1%

3rd Gen

cost: 0.4ct/kbp

reads: 104-106

errors: 5-10%

10 /27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTCACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

11 / 27


TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC




11 / 27



ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC





Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27



ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC






11 / 27











11 / 27









Problem 3: Shortest Common Superstring is NP-hard

Heuristic Assembly:

- Overlap-Layout-Consensus

- DeBruijn-Graph

11 / 27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27




Naive Overlap

AGGAGTCGAGTCCA→

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps


Problems





12 /27




Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps


Problems





12 /27




Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case



Problems





12 /27




Suffix Trees



Problems





12 /27




Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps


Problems





12 /27




Suffix Trees

example: BANANA�

0

BANANA�

1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA




Problems





12 /27




Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA




Problems





12 /27




Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA�

4

NA

�NA�5

A

�NA




Problems





12 /27




Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�

5

A

�NA




Problems





12 /27




Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA




Problems





12 /27




Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA


can be improved to linear time & space



Problems





12 /27




Suffix Trees

AGGAGTC�GAGTCCA.

Exercise

Time

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps


Problems





12 /27




Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps


Problems





12 /27




Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)



Problems





12 /27




Fuzzy Overlap – Edit Distance

AGGAGTCGGTCTCA→

edit distance = # of insertions, deletions, and substitutions



Problems





12 /27





AGGAGTCGGTCTCA




Problems





12 /27





AGGAGTCGAGTCTCA




Problems





12 /27





dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5 6

7

G

5 4 3 4 3 4 5

6

T

6 5 4 3 3 3 4

5

C

6 5 4 3 2 2 3

4

T

6 5 4 3 2 1 2

3

C

6 5 4 4 3 2 1

2

A

6 5 4 3 3 2 1

1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps


Problems





12 /27







= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

6 7

G

5 4 3 4 3 4

5 6

T

6 5 4 3 3 3

4 5

C

6 5 4 3 2 2

3 4

T

6 5 4 3 2 1

2 3

C

6 5 4 4 3 2

1 2

A

6 5 4 3 3 2

1 1

7 6 5 4 3 2 1 0




Problems





12 /27







= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

6 7

G

5 4 3 4 3 4

5 6

T

6 5 4 3 3 3

4 5

C

6 5 4 3 2 2

3 4

T

6 5 4 3 2 1

2 3

C

6 5 4 4 3 2

1 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0




Problems





12 /27







= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 4 3 4 4 4 5 6 7

G 5 4 3 4 3 4 5 6

T 6 5 4 3 3 3 4 5

C 6 5 4 3 2 2 3 4

T 6 5 4 3 2 1 2 3

C 6 5 4 4 3 2 1 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0




Problems





12 /27







= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2 1

0

G

4 3 2 1 0 1 1

0

T

5 4 3 2 1 0 1

0

C

5 4 3 2 1 1 0

0

T

5 4 3 2 1 0 1

0

C

6 5 4 3 2 1 0

0

A

6 5 4 3 3 2 1

0

7 6 5 4 3 2 1 0


best overlaps with k errors in O(#reads2 · read-length2)

Exercise

Time



Problems





12 /27







= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2

1 0

G

4 3 2 1 0 1

1 0

T

5 4 3 2 1 0

1 0

C

5 4 3 2 1 1

0 0

T

5 4 3 2 1 0

1 0

C

6 5 4 3 2 1

0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0




Problems





12 /27







= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0




Problems





12 /27







= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0


best overlaps with k errors in O(#reads2 · read-length2)



Problems





12 /27






Problems





12 /27





Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems





12 /27





Overlap Graph

reads = vertices


GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

transitive reduction



Problems





12 /27





Overlap Graph

reads = vertices


GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT



Problems





12 /27





Overlap Graph

reads = vertices


GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT


only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT


Problems





12 /27






Problems





12 /27






TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC


TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACProblems





12 /27






TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC


TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Problems





12 /27






Problems





12 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27









k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Exercise

Time

..GAACTTCGCT..

.GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG..

.CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27









k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27









k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27







3. find Eulerian walk


k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27









k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Running Time

#k-mers = O(#reads · read-length)

1. O(1) per k-mer

2. O(1) per k-mer

3. O(size of graph) = O(#k-mers)

Note: edges can be weighted by #occurances

13 /27









k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Problems

- choose k well

I k too small small repeats become problemsI k too big miss smaller overlaps

- Eulerian walk not neccessarily unique

- some paths in DeBruijn graph inconsistent with reads

- read-errors problematic

error-correction step before assembling

- same problem with repeats as OLC

13 /27

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C

1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27


Example


9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C1

1

1


14 /27


Example


9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C

1

1

1


14 /27

Correcting Read Errors in DBG

Idea

“faulty” k-mers occur less often

than correct ones

build k-mer count histogram

De Bruijn

have to treat errors before

building the graph

k=30 & 1% error 1/4 faulty

15 /27

Correcting k-mer Errors

Example

suppose: avg. k-mer count = 10 each k-mer occurs about 10x

GCGTATTACGCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACG 12x

TTACGC 9x

TACGCG 9x

ACGCGT 10x

CGCGTC 11x

GCGTCT 10x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

GCGTATTACTCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACT 1x

TTACTC 2x

TACTCG 2x

ACTCGT 1x

CTCGTC 1x

TCGTCT 1x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

16 /27


17 /27


Problem

now we have an idea where an error is, but how to fix it?

Idea

errors turn frequent k-mers into infrequent ones

correction should turn infrequent k-mers into frequent ones

replace infrequent k-mer by “frequent neighbor”

18 /27

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

20/27


AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG





CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27


AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG





CGTCGCGG

ACTTGGG

GTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy





20/27


AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG





CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy





20/27


AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG





CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy





20/27


AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG





CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp ∈ N

Question: Can M be covered by

≤σp alternating paths

σc alternating cycles

of total weight ≥ k?

20/27


AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG





CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N


≤σp alternating paths &

≤σc alternating cycles


20/27


AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG





CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Exact ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N


σp alternating paths &

σc alternating cycles


20/27

19

10

71

26

22

32

59

71

78

2 60

85

1

38

1

17

19

7

25

21 /27

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

21 /27

1

63

94

25

9

28

29

55

2

17

1

67

62

1

5

71

26

24

47

83

71

34

71

68

72

1

51

19

79

54

151

67

6

66

34

68

59

2

23

11

12

1

5

1

2

6

1

49

13

71

28

71

5

6

11

33

67

24

88

82

2

42

25

1

39

14

65

5

2

70

24

89

46

1

1

1

80

1

73

64

36

72

33

73

67

59

1

14

1

1

70

3

1

33

53

2

2

1

135

4

2980

74

80

84

11

31

1

1

62

73

38

31

33

77

108

28

13

1 4

1

70

8

5

1

47

3

31

2

1

18

56

49

45

2

2

21 /27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D

2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

22/27


Recall: Scaffolding



Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M

3. “slide” down all arrow tips & ignore directions

22/27


Recall: Scaffolding



Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

22/27


Recall: Scaffolding



Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G

22/27


Recall: Scaffolding



Lemma


single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh

alternating X covers M X

22/27


Recall: Scaffolding



Lemma


single alternating path in G“⇐”: contract each matching edge in the covering alternating path

hits all vertices exactly once X is valid directed path X

22/27


Recall: Scaffolding



Theorem

Scaffolding is NP-hard, even restricted to

• bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

22/27


Recall: Scaffolding



Theorem


• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary


sufficiently dense graph class.

22/27


Recall: Scaffolding



Theorem



• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary


sufficiently dense graph class.22/27


Recall: Scaffolding



Theorem

Exact Scaffolding is NP-hard, even restricted to


• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Exact Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.22/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding

1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27


σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight


possible

23/27


σp = 1, σc = 1?



possible

Proof

Result S∗ is a valid solution X

Note: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27


σp = 1, σc = 1?



possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27


σp = 1, σc = 1?



possible

Theorem

Scaffolding in complete graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27


σp = 1, σc = 1?



possible

Theorem

Scaffolding in complete (bipartite) graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27


σp = 1, σc = 1?

Approximate Exact Scaffolding

1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27


σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect







cycle-edge

until at most σc cycles remain

24/27


σp = 1, σc = 1?








cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution X

ω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

24/27


σp = 1, σc = 1?








cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution Xω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

24/27


σp = 1, σc = 1?








cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27


σp = 1, σc = 1?








cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete (bipartite) graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

25/27

Scaffolding with Multiplicities

70

5

47

67

70

135

49

3931

5

49

56

80

80

34

6

11

2489

19

25

83

73

82

71

9

14

63

72

64

28

29

33

66

33

68

94

62

62

11

47

18

67

8

6

74

59

12

34

46

88

42

79

77

13

17

26

24

70

65

45

33

73

67

23

71

108

67

36

13

55

5

38

72

51

71

5

28

73

31

71

59

80

54

84

28

29

71

151

24

5

6

33

68

53

25

×2

26/27

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

27/27




3 2 2 1 3 3 5 5 71

2 1 2 1

Proof

27/27




Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27




Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

5

2

22

result is collection of alternating paths & cycles

27/27




Proof



3 3 3

5

2

22


27/27




Proof



3

5

2

22


27/27




Proof



3

3

22


27/27



(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27





Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path

information loss





27/27





Proposals


2. isolate each ambiguous path information loss





27/27





Proposals



3. cut as few ends as possible as hard as Vertex Cover



27/27





Proposals






Multiplicitiesone &

#non-matching adj. to contig

27/27





Proposals






27/27





Proposals




4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...

Documents

Transcript of Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...