Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...

214
Lecture: Bioinformatics ENS Sacley, 2018

Transcript of Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...

Page 1: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Lecture: Bioinformatics

ENS Sacley, 2018

Page 2: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Lecture Overview

Mathias Weller

[email protected] Bulteau

[email protected]

- Introduction

I Molecular BiologyI SequencingI AssemblyI Scaffolding

- Phylogenetics

. . .

- Genome Rearrangements

. . .

- Scaffold Filling

2/27

Page 3: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Molecular Biology Basics

DNA

- double strand

- nucleotides paired: A–T, C–G

- inside nucleus (eucaryotes)

Polymerase

single strand → double strand

3/27

Page 4: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Molecular Biology Basics

DNA

- double strand

- nucleotides paired: A–T, C–G

- inside nucleus (eucaryotes)

Polymerase

single strand → double strand

3/27

Page 5: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Molecular Biology Basics

Chromosomes

- haploid = 1 set of chromosomes

- diploid = 2 sets of chromosomes

(usually one from each parent)

- procaryotes one circular chromosome, haploid

- eucaryotes set of chromosomes, diploid

44 90 16 46 46

4/27

Page 6: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Page 7: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

..AATCGCTAA..

..AATCCTAA..

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion,

deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Page 8: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

..AATCCTAA..

..AATCGCTAA..

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion,

substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Page 9: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

..AATCGCTAA..

..AATCACTAA..

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Page 10: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Page 11: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Page 12: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss),

translocation,

inversion, crossover

5/27

Page 13: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss), translocation,

inversion, crossover

5/27

Page 14: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss), translocation,

inversion,

crossover

5/27

Page 15: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss), translocation,

inversion, crossover

5/27

Page 16: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

mRNA, Introns, Exons, Genes

mRNA

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene

Gene = START. . . STOP (including

introns) (103-105bp)

6/27

Page 17: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

mRNA, Introns, Exons, Genes

mRNA

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene

Gene = START. . . STOP (including

introns) (103-105bp)

6/27

Page 18: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

mRNA, Introns, Exons, Genes

mRNA

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene

Gene = START. . . STOP (including

introns) (103-105bp)

6/27

Page 19: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

mRNA, Introns, Exons, Genes

6/27

Page 20: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 21: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 22: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 23: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 24: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 25: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*

GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 26: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*

GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 27: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 28: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 29: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Page 30: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 31: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 32: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 33: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 34: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 35: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 36: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 37: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 38: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 39: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Page 40: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Third-Generation Sequencing

Single Molecule Real Time Sequencing

1. fix a polymerase enzyme under a microscope

2. attach fluorescent molecule to each nucleotide

3. polymerase clips off fluorescent molecule when attaching a base

4. observe change in fluorescence identify base

Nanopores

1. “pore” of diameter ≈ 1nm

2. only unfolded DNA molecule can pass through

3. detect fluctuation in electric field as bases pass through pore

9/27

Page 41: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Third-Generation Sequencing

Single Molecule Real Time Sequencing

1. fix a polymerase enzyme under a microscope

2. attach fluorescent molecule to each nucleotide

3. polymerase clips off fluorescent molecule when attaching a base

4. observe change in fluorescence identify base

Nanopores

1. “pore” of diameter ≈ 1nm

2. only unfolded DNA molecule can pass through

3. detect fluctuation in electric field as bases pass through pore

9/27

Page 42: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Conclusion: Sequencing

Sanger

cost: 20ct/kbp

reads: 102-103kbp

errors: 0.01-1%

2nd Gen

cost: 4ct/kbp

reads: 50-500PE

errors: 0.1-1%

3rd Gen

cost: 0.4ct/kbp

reads: 104-106

errors: 5-10%

10 /27

Page 43: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTCACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

11 / 27

Page 44: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

11 / 27

Page 45: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Page 46: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Page 47: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Page 48: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Page 49: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 3: Shortest Common Superstring is NP-hard

Heuristic Assembly:

- Overlap-Layout-Consensus

- DeBruijn-Graph

11 / 27

Page 50: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 51: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Naive Overlap

AGGAGTCGAGTCCA→

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 52: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 53: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 54: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 55: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 56: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�

1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 57: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 58: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 59: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA�

4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 60: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�

5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 61: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 62: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 63: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 64: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

Exercise

Time

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 65: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 66: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 67: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 68: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 69: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 70: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 71: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGGTCTCA→

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 72: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGGTCTCA

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 73: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGAGTCTCA

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 74: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGAGTCTCA

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 75: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5 6

7

G

5 4 3 4 3 4 5

6

T

6 5 4 3 3 3 4

5

C

6 5 4 3 2 2 3

4

T

6 5 4 3 2 1 2

3

C

6 5 4 4 3 2 1

2

A

6 5 4 3 3 2 1

1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 76: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

6 7

G

5 4 3 4 3 4

5 6

T

6 5 4 3 3 3

4 5

C

6 5 4 3 2 2

3 4

T

6 5 4 3 2 1

2 3

C

6 5 4 4 3 2

1 2

A

6 5 4 3 3 2

1 1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 77: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

6 7

G

5 4 3 4 3 4

5 6

T

6 5 4 3 3 3

4 5

C

6 5 4 3 2 2

3 4

T

6 5 4 3 2 1

2 3

C

6 5 4 4 3 2

1 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 78: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 4 3 4 4 4 5 6 7

G 5 4 3 4 3 4 5 6

T 6 5 4 3 3 3 4 5

C 6 5 4 3 2 2 3 4

T 6 5 4 3 2 1 2 3

C 6 5 4 4 3 2 1 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 79: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2 1

0

G

4 3 2 1 0 1 1

0

T

5 4 3 2 1 0 1

0

C

5 4 3 2 1 1 0

0

T

5 4 3 2 1 0 1

0

C

6 5 4 3 2 1 0

0

A

6 5 4 3 3 2 1

0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)

Exercise

Time

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 80: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2

1 0

G

4 3 2 1 0 1

1 0

T

5 4 3 2 1 0

1 0

C

5 4 3 2 1 1

0 0

T

5 4 3 2 1 0

1 0

C

6 5 4 3 2 1

0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 81: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 82: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 83: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 84: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 85: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

transitive reduction

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 86: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

transitive reduction

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 87: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 88: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 89: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 90: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 91: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACProblems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 92: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 93: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Page 94: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27

Page 95: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Exercise

Time

..GAACTTCGCT..

.GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG..

.CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

Page 96: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

Page 97: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

Page 98: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

Page 99: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27

Page 100: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find Eulerian walk

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27

Page 101: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find Eulerian walk

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Running Time

#k-mers = O(#reads · read-length)

1. O(1) per k-mer

2. O(1) per k-mer

3. O(size of graph) = O(#k-mers)

Note: edges can be weighted by #occurances

13 /27

Page 102: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find Eulerian walk

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Problems

- choose k well

I k too small small repeats become problemsI k too big miss smaller overlaps

- Eulerian walk not neccessarily unique

- some paths in DeBruijn graph inconsistent with reads

- read-errors problematic

error-correction step before assembling

- same problem with repeats as OLC

13 /27

Page 103: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C

1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Page 104: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Page 105: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Page 106: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C

1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Page 107: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting Read Errors in DBG

Idea

“faulty” k-mers occur less often

than correct ones

build k-mer count histogram

De Bruijn

have to treat errors before

building the graph

k=30 & 1% error 1/4 faulty

15 /27

Page 108: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting k-mer Errors

Example

suppose: avg. k-mer count = 10 each k-mer occurs about 10x

GCGTATTACGCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACG 12x

TTACGC 9x

TACGCG 9x

ACGCGT 10x

CGCGTC 11x

GCGTCT 10x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

GCGTATTACTCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACT 1x

TTACTC 2x

TACTCG 2x

ACTCGT 1x

CTCGTC 1x

TCGTCT 1x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

16 /27

Page 109: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting k-mer Errors

17 /27

Page 110: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Correcting k-mer Errors

Problem

now we have an idea where an error is, but how to fix it?

Idea

errors turn frequent k-mers into infrequent ones

correction should turn infrequent k-mers into frequent ones

replace infrequent k-mer by “frequent neighbor”

18 /27

Page 111: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Page 112: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Page 113: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Page 114: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Page 115: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Page 116: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

20/27

Page 117: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 118: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGG

GTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 119: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 120: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 121: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 122: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 123: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 124: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Page 125: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp ∈ N

Question: Can M be covered by

≤σp alternating paths

σc alternating cycles

of total weight ≥ k?

20/27

Page 126: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

Question: Can M be covered by

≤σp alternating paths &

≤σc alternating cycles

of total weight ≥ k?

20/27

Page 127: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Exact ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

Question: Can M be covered by

σp alternating paths &

σc alternating cycles

of total weight ≥ k?

20/27

Page 128: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

19

10

71

26

22

32

59

71

78

2 60

85

1

38

1

17

19

7

25

21 /27

Page 129: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

21 /27

Page 130: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

1

63

94

25

9

28

29

55

2

17

1

67

62

1

5

71

26

24

47

83

71

34

71

68

72

1

51

19

79

54

151

67

6

66

34

68

59

2

23

11

12

1

5

1

2

6

1

49

13

71

28

71

5

6

11

33

67

24

88

82

2

42

25

1

39

14

65

5

2

70

24

89

46

1

1

1

80

1

73

64

36

72

33

73

67

59

1

14

1

1

70

3

1

33

53

2

2

1

135

4

2980

74

80

84

11

31

1

1

62

73

38

31

33

77

108

28

13

1 4

1

70

8

5

1

47

3

31

2

1

18

56

49

45

2

2

21 /27

Page 131: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D

2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

22/27

Page 132: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M

3. “slide” down all arrow tips & ignore directions

22/27

Page 133: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

22/27

Page 134: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G

22/27

Page 135: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh

alternating X covers M X

22/27

Page 136: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh

alternating X covers M X

22/27

Page 137: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇐”: contract each matching edge in the covering alternating path

hits all vertices exactly once X is valid directed path X

22/27

Page 138: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇐”: contract each matching edge in the covering alternating path

hits all vertices exactly once X is valid directed path X

22/27

Page 139: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

22/27

Page 140: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

22/27

Page 141: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.22/27

Page 142: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Exact Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Exact Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.22/27

Page 143: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding

1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 144: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 145: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 146: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 147: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 148: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 149: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 150: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 151: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 152: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 153: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 154: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 155: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

Page 156: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution X

Note: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

Page 157: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

Page 158: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

Page 159: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

Page 160: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

Page 161: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27

Page 162: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27

Page 163: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27

Page 164: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding

1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 165: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 166: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 167: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 168: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 169: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 170: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 171: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

Page 172: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution X

ω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

24/27

Page 173: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution Xω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

24/27

Page 174: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27

Page 175: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete (bipartite) graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27

Page 176: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete (bipartite) graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27

Page 177: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

25/27

Page 178: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

25/27

Page 179: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

25/27

Page 180: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Scaffolding with Multiplicities

70

5

47

67

70

135

49

3931

5

49

56

80

80

34

6

11

2489

19

25

83

73

82

71

9

14

63

72

64

28

29

33

66

33

68

94

62

62

11

47

18

67

8

6

74

59

12

34

46

88

42

79

77

13

17

26

24

70

65

45

33

73

67

23

71

108

67

36

13

55

5

38

72

51

71

5

28

73

31

71

59

80

54

84

28

29

71

151

24

5

6

33

68

53

25

×2

26/27

Page 181: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Scaffolding with Multiplicities

70

5

47

67

70

135

49

3931

5

49

56

80

80

34

6

11

2489

19

25

83

73

82

71

9

14

63

72

64

28

29

33

66

33

68

94

62

62

11

47

18

67

8

6

74

59

12

34

46

88

42

79

77

13

17

26

24

70

65

45

33

73

67

23

71

108

67

36

13

55

5

38

72

51

71

5

28

73

31

71

59

80

54

84

28

29

71

151

24

5

6

33

68

53

25

×2

26/27

Page 182: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

27/27

Page 183: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

27/27

Page 184: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

27/27

Page 185: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

27/27

Page 186: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

3 2 2 1 3 3 5 5 71

2 1 2 1

Proof

27/27

Page 187: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Page 188: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Page 189: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Page 190: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Page 191: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

5

2

22

result is collection of alternating paths & cycles

27/27

Page 192: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3 3 3

5

2

22

result is collection of alternating paths & cycles

27/27

Page 193: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

5

2

22

result is collection of alternating paths & cycles

27/27

Page 194: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

5

2

22

result is collection of alternating paths & cycles

27/27

Page 195: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

3

22

result is collection of alternating paths & cycles

27/27

Page 196: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

3

22

result is collection of alternating paths & cycles

27/27

Page 197: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 198: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 199: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 200: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 201: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 202: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 203: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 204: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Page 205: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Page 206: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Page 207: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Page 208: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Page 209: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Page 210: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Page 211: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Page 212: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Page 213: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Page 214: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27