Optical Mapping Data: Data Generation and Algorithms

Post on 20-Aug-2020


[Figure: sequencing workflow with the stages Sample Preparation, Sequencing, Assembly, and Analysis, producing Fragments, Reads, and Contigs.]

What is an Optical Map?

GGCTTCCGACCACCACAACCGAATTATGAAGGATACCGAA

6,19,35

Optical maps are ordered, genome-wide, high-resolution restriction maps.

- Much longer than reads. For example, the average map size for goat covers 360,000 bp.

- Now commercially available.

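[Figure: optical map generation. Isolated DNA enters a microfluidic device; the DNA is elongated and cleaved on the optical mapping surface and imaged by an epifluorescence microscope with a CCD camera. Example fragment-size maps: 6 3 3 49; 6 3 3 49; 6 3 9 4. Final label: genome-wide optical map.]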

“There is [..] a critical need for the continued development and public release of software tools for processing optical mapping data ..”

-GigaScience 2014

Goal: a tool to align a contig to a segment of an optical map

[Figure: the workflow Sample Preparation, Sequencing, Assembly, Analysis, with the labels contigs, Optical Map Data, and Genome-wide optical map.]

Challenges

• Previous approaches use dynamic programming
• The Burrows-Wheeler Transform (BWT) would improve time efficiency
• Challenges in applying the BWT: (1) sizing error and (2) alphabet size

Actual optical map values: 6 3 9 4
Optical map obtained from experiment: 5 4 9.5 6
SIZING ERROR: 1 1 0.5 2
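The sizing error row can be reproduced as the per-fragment absolute difference between the two maps. This is a minimal sketch; the function name `sizing_errors` is an assumption made for illustration, and only the numbers come from the slide.

```python
# Fragment sizes copied from the slide.
actual = [6, 3, 9, 4]
measured = [5, 4, 9.5, 6]


def sizing_errors(actual, measured):
    """Absolute per-fragment difference between two equal-length maps."""
    if len(actual) != len(measured):
        raise ValueError("maps must have the same number of fragments")
    return [abs(a - m) for a, m in zip(actual, measured)]


print(sizing_errors(actual, measured))  # [1, 1, 0.5, 2]
```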


Alphabet size: unique fragment sizes > 16,000

Twin

[Figure: the workflow Sample Preparation, Sequencing, Assembly, Analysis. Contigs and Optical Map Data are the inputs to the alignment of contigs to the optical map; Contigs 1 to 5 are drawn along the genome-wide optical map.]

Twin Algorithm

1. In silico digest contigs into optical maps.

TTTCCGACCACTTTTCCGAATTATGACCGAA

4,13,24
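An in silico digest can be sketched as a scan for the enzyme's recognition site. The sketch below is an illustration under assumptions, not Twin's code: the function name, the 0-based positions, and the made-up contig are hypothetical, and the default site is the NheI pattern G^CTAGC that appears later in the deck.

```python
def in_silico_digest(sequence, site="GCTAGC", cut_offset=1):
    """Return (cut_positions, fragment_sizes) for one contig.

    cut_offset is the number of bases of the site that lie left of the
    cut (1 for G^CTAGC). Positions are 0-based.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Fragment sizes are the gaps between consecutive boundaries.
    boundaries = [0] + cuts + [len(sequence)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return cuts, fragments


contig = "AAGCTAGCTTTTTGCTAGCAA"  # made-up sequence with two NheI sites
cuts, fragments = in_silico_digest(contig)
print(cuts, fragments)  # [3, 14] [3, 11, 7]
```

The list of fragment sizes is the contig's "optical map" that later steps compare against the genome-wide map.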

Twin Algorithm

1. In silico digest contigs into optical maps.
2. Build an FM-index* and auxiliary data structures on the genome-wide optical map.

*A data structure that allows compression of the input text while still permitting fast substring queries.

BWT and FM-index

A suffix array (SA) of string S is an array of the suffixes of S sorted into alphabetical order; it is stored as the starting positions of those suffixes.

S = acaaacgn

Suffixes in text order:
1 acaaacgn
2 caaacgn
3 aaacgn
4 aacgn
5 acgn
6 cgn
7 gn
8 n

Suffixes in sorted order:
3 aaacgn
4 aacgn
1 acaaacgn
5 acgn
2 caaacgn
6 cgn
7 gn
8 n

The suffix array clusters all the occurrences of every pattern together into a contiguous range!
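The suffix array of the slide's string can be checked with a few lines of Python. This is a naive sketch for small inputs (sorting whole suffixes); the function names are assumptions, and positions are 1-based to match the slide.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_range(s, sa, pattern):
    """Half-open range of suffix-array rows whose suffix starts with pattern."""
    rows = [r for r, i in enumerate(sa) if s[i - 1:].startswith(pattern)]
    return (rows[0], rows[-1] + 1) if rows else (0, 0)


s = "acaaacgn"
sa = suffix_array(s)
print(sa)                      # [3, 4, 1, 5, 2, 6, 7, 8]
print(sa_range(s, sa, "ac"))   # (2, 4)
```

Both occurrences of "ac" (text positions 1 and 5) land in adjacent rows, which is the contiguous-range property the slide points out.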


BWT and FM-index

The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1] (when SA[i] = 1, the last character of S is used).

S = acaaacgn

3 aaacgnac
4 aacgnaca
1 acaaacgn
5 acgnacaa
2 caaacgna
6 cgnacaaa
7 gnacaaac
8 nacaaacg

Extract the last column of the sorted rotations: canaaacg

BWT and FM-index

rank_K(i): return the number of K's in S[1, i]

BWT:  c a n a a a c g
rank: 0 0 0 1 2 3 1 0

rank_a[5] = 2

In this table each rank entry counts the earlier occurrences of the same symbol in the BWT, so the a at position 5 has rank 2.
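The BWT and the rank row of the example can be recomputed directly from the definition BWT[i] = S[SA[i] - 1]. A minimal sketch for small strings; the function names are assumptions, and the rank here follows the table's convention (occurrences of the same symbol strictly before each position).

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def bwt(s):
    """BWT[i] = S[SA[i] - 1], wrapping to the last character when SA[i] = 1."""
    # s[i - 2] is the character before 1-based position i; for i = 1 the
    # Python index -1 wraps around to the last character of s.
    return "".join(s[i - 2] for i in suffix_array(s))


def occurrence_ranks(text):
    """For each position, count earlier occurrences of the same symbol."""
    seen = {}
    ranks = []
    for ch in text:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return ranks


b = bwt("acaaacgn")
print(b)                    # canaaacg
print(occurrence_ranks(b))  # [0, 0, 0, 1, 2, 3, 1, 0]
```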

BWT and FM-index

The FM-index is the compressed version of the BWT and rank.

Twin Algorithm

1. In silico digest contigs into optical maps.
2. Build an FM-index and auxiliary data structures on the genome-wide optical map.
3. Using the FM-index, find all alignments between the optical map and the in silico digested contigs.
   - Modified FM-index Backward Search Algorithm

FM-Index Backward Search

A recursive algorithm for finding substrings using rank and the BWT.

[Figure: backward search steps labeled rank[c] and rank[a].]
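For exact matching, the standard backward search narrows a suffix-array range one pattern symbol at a time, right to left. The sketch below is written iteratively and computes rank naively by counting; it assumes the text ends in a unique terminator symbol (the n in acaaacgn plays that role), and the names `build_index` and `backward_search` are illustrative, not Twin's API.

```python
def build_index(s):
    """Return (bwt, C), where C[c] counts the symbols of s smaller than c."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # 0-based suffix array
    bwt = "".join(s[i - 1] for i in sa)  # i = 0 wraps to the last character
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)
    return bwt, C


def backward_search(bwt, C, pattern):
    """Half-open suffix-array range [lo, hi) of rows prefixed by pattern."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        # rank_c(i) is computed naively as bwt[:i].count(c); a real
        # FM-index answers this from a compressed structure.
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi


bwt, C = build_index("acaaacgn")
print(backward_search(bwt, C, "ac"))  # (2, 4)
print(backward_search(bwt, C, "gg"))  # (0, 0)
```

Each step needs exactly one symbol to look up, which is what breaks down once a fragment size may match a whole range of symbols.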

Modified FM-Index Backward Search

• Sizing error and alphabet size are challenges to overcome
• We cannot afford a brute force enumeration of the alphabet at each step in the backward search
• Novelty for optical maps: Wavelet Tree

Wavelet Tree

A Wavelet Tree converts a string into a balanced binary tree of bitvectors, where a 0 replaces half of the symbols, and a 1 replaces the other half. This definition is applied recursively.

{A,C,G,T} is encoded as {0,0,1,1}

ACGTATATAGGAAGA
001101010110010


Wavelet Tree

No ambiguity!

Wavelet Tree

Root:
ACGTATATAGGAAGA
001101010110010

Left child (bit 0), {A,C} is encoded as {0,1}:
ACAAAAAA
01000000

Right child (bit 1), {G,T} is encoded as {0,1}:
GTTTGGG
0111000
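The bitvectors of the DNA example can be regenerated with a short recursive builder. This is a sketch for illustration: the function name and the nested-dict layout are assumptions, and the `text` field is kept only so the nodes can be printed (a real wavelet tree stores just the bitvectors).

```python
def build_wavelet_tree(text, alphabet=None):
    """Return a nested dict with this node's bits and its two children."""
    if alphabet is None:
        alphabet = sorted(set(text))
    if len(alphabet) <= 1:
        return None  # leaf: a single symbol needs no bitvector
    half = (len(alphabet) + 1) // 2
    left_syms, right_syms = alphabet[:half], alphabet[half:]
    # 0 marks a symbol from the lower half of the alphabet, 1 the upper half.
    bits = "".join("0" if ch in left_syms else "1" for ch in text)
    left_text = "".join(ch for ch in text if ch in left_syms)
    right_text = "".join(ch for ch in text if ch in right_syms)
    return {
        "text": text,
        "bits": bits,
        "left": build_wavelet_tree(left_text, left_syms),
        "right": build_wavelet_tree(right_text, right_syms),
    }


tree = build_wavelet_tree("ACGTATATAGGAAGA")
print(tree["bits"])           # 001101010110010
print(tree["left"]["bits"])   # 01000000
print(tree["right"]["bits"])  # 0111000
```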

Which symbols in {A, G} exist in the input string?

Modified FM-Index Backward Search

To match x we need to find all the substrings within the range x ± y, for tolerance y.

To match 9 we need to find all the substrings within the range [6, 12], for tolerance 3.

Genome-wide optical map (values, then root bits):
2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1

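Modified FM-Index Backward Search

Wavelet tree of the genome-wide optical map (each node lists its values, then its bits):

Root:
2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1

Left child of the root:
2, 10, 3, 5, 10, 9
0, 1, 0, 0, 1, 1

Right child of the root:
11, 23, 53, 14, 11
0, 1, 1, 0, 0

Third level:
2, 3, 5
0, 0, 1

10, 9, 10
0, 1, 0

11, 14, 11
0, 1, 0

23, 53
0, 1

Fourth level:
2, 3
0, 1

5
1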
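The question "which fragment sizes in [6, 12] occur in this part of the map?" can be answered by descending only into subtrees whose value range overlaps the query. The class below is a hedged sketch of that idea over integers, with prefix sums standing in for a succinct rank structure; the class and method names are assumptions and this is not Twin's implementation.

```python
class WaveletNode:
    def __init__(self, values, alphabet=None):
        self.alphabet = sorted(set(values)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) > 1:
            half = (len(self.alphabet) + 1) // 2
            left_syms = self.alphabet[:half]
            right_syms = self.alphabet[half:]
            split = left_syms[-1]  # values <= split get bit 0
            bits = [0 if v <= split else 1 for v in values]
            # ones[p] = number of 1 bits among the first p positions
            self.ones = [0]
            for b in bits:
                self.ones.append(self.ones[-1] + b)
            self.left = WaveletNode([v for v in values if v <= split], left_syms)
            self.right = WaveletNode([v for v in values if v > split], right_syms)

    def symbols_in_range(self, i, j, lo, hi):
        """Distinct values v with lo <= v <= hi among positions [i, j)."""
        if i >= j or self.alphabet[-1] < lo or self.alphabet[0] > hi:
            return []  # empty interval, or no overlap with [lo, hi]
        if len(self.alphabet) == 1:
            return [self.alphabet[0]]
        ones_i, ones_j = self.ones[i], self.ones[j]
        # Positions map to (count of 0s) in the left child and
        # (count of 1s) in the right child.
        return (self.left.symbols_in_range(i - ones_i, j - ones_j, lo, hi)
                + self.right.symbols_in_range(ones_i, ones_j, lo, hi))


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
tree = WaveletNode(optical_map)
print(tree.symbols_in_range(0, len(optical_map), 6, 12))  # [9, 10, 11]
print(tree.symbols_in_range(0, 3, 6, 12))                 # [10, 11]
```

Only the symbols actually present in the interval are visited, so the search avoids enumerating every size between 6 and 12.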


Modified FM-Index Backward Search

A recursive algorithm for finding substrings using rank and the BWT.

[Figure: backward search steps labeled rank[c] and rank[a], with the label Wavelet Tree Query.]

Twin Algorithm

1. In silico digest contigs into optical maps.
2. Build an FM-index and auxiliary data structures on the genome-wide optical map.
3. Using the FM-index, find all alignments between the optical map and the in silico digested contigs.
4. Output the alignments in PSL format.

TWIN Test Datasets

TWIN Results

TWIN: Optical Map Aligner

Twin is the first alignment method that is capable of handling large genome sizes.

• It is the only index-based tool and is orders of magnitude faster than existing approaches (patent pending)
• Pine tree (20 Gb) would take ~84 machine years with SOMA but a couple of hours with Twin

CORRECTING ERRORS IN GENOMES

Mis-assembly in Genomes

Mis-assembly: a significantly large insertion, deletion, inversion, or rearrangement that is the result of decisions made by the assembly program.

[Figure: segment diagrams labeled Correct assembly, Rearrangement, Deletion, and Insertion, drawn as rows of segments: A R R B; A R R B; A R B; A R R B R.]

Extensive vs. Local Mis-assemblies

Extensive Mis-assembly: 1 kbp in size and regions align to different strands or different chromosomes.

Local Mis-assembly: smaller in size and on the same strand and same chromosome.

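De Bruijn Graph of a Genome

Example Genome: ABCDEFGHICDEFGKL

[Figure: de Bruijn graph with the nodes ABC, BCD, CDE, DEF, EFG, FGK, GKL in a row and the nodes FGH, GHI, HIC, ICD forming a loop from EFG back to CDE; parts of the graph are labeled 1, 2, and 3.]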

De Bruijn Graph of a Genome

Example Genome: ABCDEFGHICDEFGKL
Resulting Erroneous Genome: ABCDEFGKL

[Figure: only the path ABC, BCD, CDE, DEF, EFG, FGK, GKL remains, labeled 1.]
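The example can be traced in code: with 3-mers as nodes, the repeat CDEFG makes EFG a branch point, and a walk that goes straight to FGK spells the shortened genome. This is a small sketch under assumptions (3-mers as nodes as drawn on the slide; the helper names `de_bruijn_graph` and `greedy_walk` are hypothetical), not the behavior of any particular assembler.

```python
from collections import defaultdict


def de_bruijn_graph(genome, k=3):
    """Map each k-mer to the set of k-mers that follow it in the genome."""
    graph = defaultdict(set)
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    for a, b in zip(kmers, kmers[1:]):
        graph[a].add(b)
    return graph


def greedy_walk(graph, start, prefer):
    """Follow edges from start; at a branch take a node in `prefer` if any."""
    path, node, seen = [start], start, {start}
    while graph.get(node):
        options = sorted(graph[node])
        node = next((n for n in options if n in prefer), options[0])
        if node in seen:
            break  # stop instead of going around a cycle again
        seen.add(node)
        path.append(node)
    # Spell the sequence: first k-mer plus the last symbol of each later one.
    return path[0] + "".join(n[-1] for n in path[1:])


graph = de_bruijn_graph("ABCDEFGHICDEFGKL")
print(sorted(graph["EFG"]))                # ['FGH', 'FGK']
print(greedy_walk(graph, "ABC", {"FGK"}))  # ABCDEFGKL
```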

[Figure: the workflow Sample Preparation, Sequencing, Assembly, Analysis, producing Fragments, Reads, and Contigs. misSEQuel* takes Reads and Contigs and produces Refined Contigs; in the second version of the figure Optical Map Data is added to the workflow.]

*(Muggli, Puglisi, Ronen, Boucher, ISMB 2015)

misSEQuel Algorithm

1. Align sequence reads to contigs using a standard alignment tool.

GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA

[Figure: sequence reads (GGCTTCCGACCACCACAAATGGATATGAAGGATATATGGATTATGAAGGATATA) drawn above the sequence GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA, with the labels 1 and 9.]

misSEQuel Algorithm

1. Align sequence reads to contigs using a standard alignment tool.
2. Build the red-black positional de Bruijn graph based on the alignment.

Next Generation Sequencing (NGS)

[Figure: Sample Preparation produces Fragments; Sequencing produces Reads.]

Reads:
ACGTAGAATCGACCATG
GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA

ACGTAGAATCGACCATGGGGACGTAGAATACGA

Paired-End Reads / Mate-Pair Reads

Read Mate Pair Concordance

[Figure: fragments from Sample Preparation and Sequencing, with mate-pair alignments over segments A, R, R, B for a correct assembly, a rearrangement, and an inversion.]

Read Depth

[Figure: read depth over segments A, R, and B for a correct assembly, an insertion, and a deletion.]

Red-BlackPositionalDeBruijn Graph

I. Chooseavalueof𝑘andΔ .II. Eachpositional𝑘-mer (sk)isanedgebetweentwo

positional𝑘–mers:prefix andsuffix ofsk.III. Positional𝑘–mers,sk-1 andsk-1’, aregluedifsk-1 andsk-1’

havethesamelabelandtheirdistancesdifferbyatmostΔ.IV. Ask-1 isredifthereaddepthistwostandarddeviationsfrom

themeanorthereisasignificantnumberofdisconcordinatereadalignments;otherwise,itisblack.

Apositional𝑘-mer isa𝑘-mer withanapproximateposition.
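The gluing and coloring rules can be illustrated with a short sketch. Everything here is an assumption made for illustration rather than misSEQuel's implementation: the greedy clustering by sorted position, the strict "more than two standard deviations" test, and the function names. The discordant read-pair criterion is left out.

```python
from statistics import mean, pstdev


def glue(positional_kmers, delta):
    """Group (label, position) pairs that share a label and lie within delta.

    Greedy single pass: a position joins the current group for its label
    when it is at most delta past the group's last position.
    """
    groups = []
    last = {}  # label -> index in groups of that label's most recent group
    for label, pos in sorted(positional_kmers):
        g = last.get(label)
        if g is not None and pos - groups[g][1][-1] <= delta:
            groups[g][1].append(pos)
        else:
            last[label] = len(groups)
            groups.append((label, [pos]))
    return groups


def color(depths, n_sd=2.0):
    """Mark a node red when its depth is more than n_sd SDs from the mean."""
    mu, sd = mean(depths), pstdev(depths)
    return ["red" if sd > 0 and abs(d - mu) > n_sd * sd else "black"
            for d in depths]


kmers = [("ACG", 10), ("ACG", 12), ("ACG", 40), ("CGT", 11)]
print(glue(kmers, delta=3))
# [('ACG', [10, 12]), ('ACG', [40]), ('CGT', [11])]

depths = [10, 11, 9, 10, 10, 11, 9, 10, 10, 40]
print(color(depths)[-1])  # red
```

The two ACG occurrences near position 10 collapse into one node, while the copy at position 40 stays separate, which is what keeps distant repeat copies apart in the graph.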

Positional Red-Black de Bruijn Graph

[Figure: three panels. Reads aligned to contigs; positional k-mers with read depth; the positional red-black de Bruijn graph.]

misSEQuel Algorithm

1. Align sequence reads to contigs using a standard alignment tool.
2. Build the red-black positional de Bruijn graph based on the alignment.
3. Remove all bulges and whirls from the red-black positional de Bruijn graph.


[Figure: correctly assembled contigs and mis-assembled contigs, drawn as rows of segments A, R, and B.]

misSEQuel Algorithm

1. Align sequence reads to contigs using a standard alignment tool.
2. Build the red-black positional de Bruijn graph based on the alignment.
3. Remove all bulges and whirls from the red-black positional de Bruijn graph.
4. Contig refinement using optical map alignment.

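Optical Map Alignment

NheI = G^CTAGC

[Figure: an E. coli optical map segment above a contig with segments A, R, R, B; occurrences of "GCTAGC" in the contig mark the NheI sites.]

Correctly Assembled Contigs Align

[Figure: a contig with segments A, R, R, B aligned to the NheI optical map.]

Mis-assembled Contigs Don't Align

[Figure: a contig with segments A, R, B, R, R that does not align to the NheI optical map.]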
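The align / don't align test can be pictured as comparing fragment sizes within a tolerance. The sketch below uses a naive left-to-right scan as a stand-in for Twin's index-based search; the function names, the contig fragment sizes, and the tolerance are made up for illustration, while the optical map values are the ones used in the wavelet tree example.

```python
def fragments_match(contig_frags, map_frags, tolerance):
    """True when every contig fragment is within tolerance of its partner."""
    return len(contig_frags) == len(map_frags) and all(
        abs(c - m) <= tolerance for c, m in zip(contig_frags, map_frags)
    )


def find_alignments(contig_frags, optical_map, tolerance):
    """Start indices in optical_map where contig_frags matches (naive scan)."""
    n = len(contig_frags)
    return [
        start
        for start in range(len(optical_map) - n + 1)
        if fragments_match(contig_frags, optical_map[start:start + n], tolerance)
    ]


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
good_contig = [9, 24, 50]     # hypothetical in silico digest of a correct contig
bad_contig = [9, 24, 24, 50]  # hypothetical digest with a duplicated fragment

print(find_alignments(good_contig, optical_map, tolerance=3))  # [2]
print(find_alignments(bad_contig, optical_map, tolerance=3))   # []
```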

Results on Tularensis


Results on Pine

Improve Prediction

[Figure: contigs drawn as rows of segments A, R, and B, with the annotation: deletion between two aligned regions.]