Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... ·...

52
Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational Biology Duke University July 23, 2018

Transcript of Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... ·...

Page 1: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

ShortReadAlignmentAlgorithms

RalucaGordân

DepartmentofBiostatisticsandBioinformaticsDepartmentofComputerScience

DepartmentofMolecularGeneticsandMicrobiologyCenterforGenomicandComputationalBiology

DukeUniversity

July23,2018

Page 2: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Sequencingapplications

ChIP-seq: identify and measure significant peaks    GAAATTTGC GGAAATTTG CGGAAATTT CGGAAATTT TCGGAAATT CTATCGGAAA CCTATCGGA TTTGCGGT GCCCTATCG AAATTTGC…CC GCCCTATCG AAATTTGC ATAC……CCATAGGCTATATGCGCCCTATCGGAAATTTGCGGTATAC…

Genotyping: identify variations GGTATAC……CCATAG TATGCGCCC CGGAAATTT CGGTATAC…CCAT CTATATGCG TCGGAAATT CGGTATAC…CCAT GGCTATATG CTATCGGAAA GCGGTATA…CCA AGGCTATAT CCTATCGGA TTGCGGTA C……CCA AGGCTATAT GCCCTATCG TTTGCGGT C……CC AGGCTATAT GCCCTATCG AAATTTGC ATAC……CC TAGGCTATA GCGCCCTA AAATTTGC GTATAC……CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…

RNA-seq – this course

Page 3: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Heterochromatin

Transcripts

Transcription

Translation

Alternative splicing

Euchromatin

Three-dimensionalarrangement ofchromatin5C or ChIA-PET

DNA methylation

MeDIP-seqRRBSMethylC-seq

Nascent transcriptbound to RNA PolII

GRO-SeqNET-seq

Quantification ofmature transcriptsand small RNA

RNA-seq

Transcription factor-binding sites

ChIP-seq

Histonemodifications

ChIP-seq

Open or accessibleregions in euchromatin

FAIREDNAse-seq

lncRNA–chromatininteraction

ChIRP-seq

Cancer

Alternate splice variants

RNA-seq

Translational efficiency

Ribo-seq

ONH

H3C

NH2

N

Biomarkers Personal omics profiling

Repressivehistone marks,e.g., H3K27me3

Activatinghistone marks,e.g., H3K4me3

Activatinghistone marks,e.g., H3K27ac

Figure 2 Sequencing technologies and their uses. Various NGS methods can precisely map and quantify chromatin features, DNA modifications and several specific stepsin the cascade of information from transcription to translation. These technologies can be applied in a variety of medically relevant settings, including uncovering regulatorymechanisms and expression profiles that distinguish normal and cancer cells, and identifying disease biomarkers, particularly regulatory variants that fall outside of protein-coding regions. Together, these methods can be used for integrated personal omics profiling to map all regulatory and functional elements in an individual. Using this basalprofile, dynamics of the various components can be studied in the context of disease, infection, treatment options, and so on. Such studies will be the cornerstone ofpersonalized and predictive medicine.

High-throughput sequencingWW Soon et al

4 Molecular Systems Biology 2013 & 2013 EMBO and Macmillan Publishers Limited

Soon, Hariharan, Snyder Molecular Systems Biology 2012

ATAC-seq

Page 4: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Sequencingtechnologies

https://en.wikipedia.org/wiki/DNA_sequencing

TECHNICALBIASES!

Page 5: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Sequencealignment

Heuristiclocalalignment(BLAST)

•  INPUT:–  Database

–  Query:PSKMQRGIKWLLP•  OUTPUT:

–  sequencessimilartoquery

Global/localalignment(Needleman-Wunsch,Smith-Waterman)

X = x1x2…xm Y = y1y2…yn

•  INPUT:–  Twosequences

•  OUTPUT:–  OptimalalignmentbetweenXandY(orsubstringsofXandY)

Page 6: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Shortreadalignment

•  INPUT:–  Afewmillionshortreads,withcertainerrorcharacteristics(specifictothesequencingplatform)

–  Illumina:fewerrors,mostlysubstitutions–  Areferencegenome

•  OUTPUT:–  Alignmentsofthereadstothereferencegenome

•  CanweuseBLAST?

•  Algorithmsforexactstringmatchingaremoreappropriate

•  AssumingBLASTreturnstheresultforareadin1sec•  For10millionreads:10millionseconds=116days

Page 7: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Algorithmsforexactstringmatching

•  SearchforthesubstringANAinthestringBANANA

BruteForce SuffixArray

PacBioAligner(BLASR);Bowtie

BinarySearch

BLAST,BLAT

Seed-and-extend

Hash(Index)Table

Timecomplexityversusspacecomplexity

Naive

Slow&Easy

Page 8: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

BruteforcesearchforGATTACA

•  WhereisGATTACAinthehumangenome?

No match at offset 1

Match at offset 2

No match at offset 3...

Page 9: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

BruteforcesearchforGATTACA

•  Simple,easytounderstand•  Analysis

–Genomelength=n=3,000,000,000–Querylength=m=7–Comparisons:(n-m+1)*m=21,000,000,000

•  Assumingeachcomparisontakes1/1,000,000ofasecond…•  …thetotalrunningtimeis21,000seconds=0.24days•  …forone7-bpread

Page 10: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

SuffixarraysSuffixarray

•  Preprocessthegenome–  Sortallthesuffixesofthegenome

Split into suffixes Sort suffixes alphabetically

•  Usebinarysearch

Page 11: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-searchforGATTACA

Lo = 1; Hi = 15 Lo

Hi

Middle = Suffix[8] = CC

Compare GATTACA to CC => Higher

Lo = Mid + 1

Mid = (1+15)/2 = 8

Page 12: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-searchforGATTACA

Lo = 9; Hi = 15

Lo

Hi

Middle = Suffix[12] = TACC

Compare GATTACA to TACC => Lower

Hi = Mid - 1

Mid = (9+15)/2 = 12

Page 13: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-searchforGATTACA

Lo = 9; Hi = 11

Lo

Hi

Middle = Suffix[10] = GATTACC

Compare GATTACA to GATTACC => Lower

Hi = Mid - 1

Mid = (9+11)/2 = 10

Page 14: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-searchforGATTACA

Lo = 9; Hi = 9

Lo Hi

Middle = Suffix[9] = GATTACAG…

Compare GATTACA to GATTACAG… => Match

Return: match at position 2

Mid = (9+9)/2 = 9

Page 15: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-analysis

•  Word(query)ofsizem=7•  Genomeofsizen=3,000,000,000•  Bruceforce:

–  approx.mxn=21,000,000,000comparisons•  Suffixarrays:

–  approx.mxlog2(n)=7x32=224comparisons

•  Assumingeachcomparisontakes1/1,000,000ofasecond…•  …thetotalrunningtimeis0.000224secondsforone7-bpread•  Comparedto0.24daysforone7-bpreadinthecaseofbruteforcesearch

•  For10millionreads,thesuffixarraysearchwouldtake2240seconds=37minutes

Page 16: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-analysis

•  Word(query)ofsizem=7•  Genomeofsizen=3,000,000,000

•  For10millionreads,thesuffixarraysearchwouldtake2240seconds=37minutes

•  Problem?

•  Forthehumangenome: 4.5billionbillioncharacters!!!

•  Totalcharactersinallsuffixescombined:1+2+3+…+n=

Timecomplexityversusspacecomplexity

n(n+1)/2

Page 17: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Algorithmsforexactstringmatching

•  SearchforthesubstringANAinthestringBANANA

BruteForce SuffixArray

PacBioAligner(BLASR);Bowtie

BinarySearch

BLAST,BLAT

Seed-and-extend

Hash(Index)Table

Timecomplexityversusspacecomplexity

Naive

Slow&Easy

Page 18: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Hashing

•  WhereisGATTACAinthehumangenome?-Buildaninvertedindexofeveryk-merinthegenome

•  Howdoweaccessthetable?-Wecanonlyusenumberstoindex-Encodesequencesasnumbers Simple:A=0,C=1,G=2,T=3=>GATTACA=2,033,010Smart:A=002,C=012,G=102,T=112=>GATTACA=100011110001002=915610

•  Lookup:veryfast•  Butconstructinganoptimalhashistricky

Page 19: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Hashing

•  Numberofpossiblesequencesoflengthkis4k

•  K=7=>47=16,384(easytostore)•  K=20=>420=1,099,511,627,776(impossibletostore

directlyinRAM)-Thereareonly3B20-mersinthegenome-Evenifwecouldbuildthistable,99.7%willbeempty-Butwedon'tknowwhichcellsareemptyuntilwetry

Page 20: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Hashing

•  Numberofpossiblesequencesoflengthkis4k

•  K=7=>47=16,384(easytostore)•  K=20=>420=1,099,511,627,776(impossibletostore

directlyinRAM)-Thereareonly3B20-mersinthegenome-Evenifwecouldbuildthistable,99.7%willbeempty-Butwedon'tknowwhichcellsareemptyuntilwetry

•  Useahashfunctiontoshrinkthepossiblerange-Mapsanumbernin[0,R]tohin[0,H]

•  Use128bucketsinsteadof16,384-Division:hash(n)=H*n/R;

•  hash(GATTACA)=128*9156/16384=71-Modulo:hash(n)=n%H

•  hash(GATTACA)=9156%128=68

Page 21: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Hashing

•  Byconstruction,multiplekeyshavethesamehashvalue–  Storeelementswiththesamekeyinabucketchainedtogether

•  Agoodhashevenlydistributesthevalues:R/Hhavethesamehashvalue

–  Lookingupavaluescanstheentirebucket

Page 22: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Algorithmsforexactstringmatching

•  SearchforthesubstringGATTACAinthegenome

BruteForce SuffixArray

Highspacecomplexity

Fast(binarysearch)Trickytodevelophashfunction

Fast

Hash(Index)Table

Easy

Slow

Page 23: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

•  BowtieindexesthegenomeusingaschemebasedontheBurrows-Wheelertransform(BWT)andtheFerragina-Manzini(FM)index

Page 24: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Burrows-Wheelertransform

•  TheBWTisareversiblepermutationofthecharactersinatext•  BWT-basedindexingallowslargetextstobesearchedefficientlyin

asmallmemoryfootprint

T Burrows-Wheeler transform of T:

BWT(T)

Burrows-Wheeler matrix of T

All cyclic rotations of T$, sorted lexicographically

Same size as T

Page 25: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Lastfirst(LF)mapping

•  TheBWmatrixhasapropertycalledlastfirst(LF)mapping:TheithoccurrenceofcharacterXinthelastcolumncorrespondstothesametextcharacterastheithoccurrenceofXinthefirstcolumn

•  ThispropertyisatthecoreofalgorithmsthatusetheBWTindextosearchthetext

Rank: 2

LF property implicitly encodes the Suffix Array

Rank: 2

Page 26: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Lastfirst(LF)mapping

WecanrepeatedlyapplyLFmappingtoreconstructTfromBWT(T)UNPERMUTEalgorithm(BurrowsandWheeler,1994)

1 2 3 4 5 6

Page 27: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

LFmappingandexactmatching

EXACTMATCHalgorithm(FerraginaandManzini,2000)-calculatestherangeofmatrixrowsbeginningwithsuccessivelylongersuffixesofthequeryReference:acaacg.Query:aac

the matrix is sorted lexicographically

rows beginning with a given sequence appear consecutively

At each step, the size of the range either shrinks or remains the same

1 2 3 4 5 6

Page 28: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

LFmappingandexactmatching

the matrix is sorted lexicographically

rows beginning with a given sequence appear consecutively

At each step, the size of the range either shrinks or remains the same

Whataboutmismatches?

EXACTMATCHalgorithm(FerraginaandManzini,2000)-calculatestherangeofmatrixrowsbeginningwithsuccessivelylongersuffixesofthequeryReference:acaacg.Query:aac

Page 29: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Mismatches?

•  EXACTMATCHisinsufficientforshortreadalignmentbecausealignmentsmaycontainmismatches

•  Whatarethemaincausesformismatches?-sequencingerrors-differencesbetweenreferenceandqueryorganisms

Page 30: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie–mismatchesandbacktrackingsearch

•  EXACTMATCHisinsufficientforshortreadalignmentbecausealignmentsmaycontainmismatches

•  Bowtieconductsabacktrackingsearchtoquicklyfindalignmentsthatsatisfyaspecifiedalignmentpolicy

•  Eachcharacterinareadhasanumericqualityvalue,withlowervaluesindicatingahigherlikelihoodofasequencingerror

•  Example:IlluminausesPhredqualityscoringPhredscoreofabaseis:Qphred=-10*log10(e)whereeistheestimatedprobabilityofabasebeingwrong

•  Bowtiealignmentpolicyallowsalimitednumberofmismatchesandprefersalignmentswherethesumofthequalityvaluesatallmismatchedpositionsislow

Page 31: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie-backtrackingsearch

•  ThesearchissimilartoEXACTMATCH•  Itcalculatesmatrixrangesforsuccessivelylongerquerysuffixes

Page 32: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

LFmappingandexactmatching

EXACTMATCHalgorithm(FerraginaandManzini,2000)-calculatestherangeofmatrixrowsbeginningwithsuccessivelylongersuffixesofthequeryReference:acaacg.Query:aac

the matrix is sorted lexicographically

rows beginning with a given sequence appear consecutively

At each step, the size of the range either shrinks or remains the same

Page 33: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie-backtrackingsearch

•  ThesearchissimilartoEXACTMATCH•  Itcalculatesmatrixrangesforsuccessivelylongerquerysuffixes•  Iftherangebecomesempty(asuffixdoesnotoccurinthetext),

thenthealgorithmmayselectanalready-matchedquerypositionandsubstituteadifferentbasethere,introducingamismatchintothealignment

•  TheEXACTMATCHsearchresumesfromjustafterthesubstitutedposition

•  Thealgorithmselectsonlythosesubstitutionsthatareconsistentwiththealignmentpolicy

Page 34: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Exactsearchforggta

ggta ggta ggta

Page 35: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Search is greedy: the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of number of mismatches or in terms of quality

Inexactsearchforggta

Page 36: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie-backtrackingsearch

•  Thisstandardalignercan,insomecases,encountersequencesthatcauseexcessivebacktracking

•  Bowtiemitigatesexcessivebacktrackingwiththenoveltechniqueofdoubleindexing–  Idea:create2indicesofthegenome:onecontainingtheBWTofthegenome,calledtheforwardindex,

andasecondcontainingtheBWTofthegenomewithitssequencereversed(notreversecomplemented)calledthemirrorindex.

•  Let’sconsideramatchingpolicythatallowsonemismatchinthealignment(eitherinthefirsthalforinthesecondhalf)

•  Bowtieproceedsintwophases:1.loadtheforwardindexintomemoryandinvokethealignerwiththeconstraintthatitmaynotsubstituteatpositionsinthequery'srighthalf2.loadthemirrorindexintomemoryandinvokethealigneronthereversedquery,withtheconstraintthatthealignermaynotsubstituteatpositionsinthereversedquery'srighthalf(theoriginalquery'slefthalf).

•  Theconstraintsonbacktrackingintotherighthalfpreventexcessivebacktracking,whereastheuseoftwophasesandtwoindicesmaintainsfullsensitivity

Page 37: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie-backtrackingsearch•  Basequalityvariesacrosstheread•  Bowtieallowstheusertoselect

-thenumberofmismatchespermittedinthehigh-qualityendofaread(default:2mismatchesinthefirst28bases)

-maximumacceptablequalityofmismatchedpositionsoverthealignment(default:70PHREDscore)

•  Thefirst28basesonthehigh-qualityendofthereadaretermedtheseed•  Theseedconsistsoftwohalves:

-the14bponthehigh-qualityend(usuallythe5'end)=thehi-half-the14bponthelow-qualityend=lo-half

•  Assuming2mismatchespermittedintheseed,areportablealignmentwillfallintooneoffourcases:1.nomismatchesinseed;2.nomismatchesinhi-half,oneortwomismatchesinlo-half3.nomismatchesinlo-half,oneortwomismatchesinhi-half4.onemismatchinhi-half,onemismatchinlo-half

Page 38: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie

•  TheBowtiealgorithmconsistsofthreephasesthatalternatebetweenusingtheforwardandmirrorindices

1. no mismatches in seed 2. no mismatches in hi-half, one or two mismatches in lo-half 3. no mismatches in lo-half, one or two mismatches in hi-half 4. one mismatch in hi-half, one mismatch in lo-half

partial alignments with mismatches only in the hi-half

extend the partial alignments into full alignments.

Page 39: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Maq:MappingandAssemblywithQualities SOAP=ShortOligonucleotideAnalysisPackage

Aligning2millionreadstothehumangenome

Page 40: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Lastfirst(LF)mapping

•  TheBWmatrixhasapropertycalledlastfirst(LF)mapping:TheithoccurrenceofcharacterXinthelastcolumncorrespondstothesametextcharacterastheithoccurrenceofXinthefirstcolumn

•  ThispropertyisatthecoreofalgorithmsthatusetheBWTindextosearchthetext

Rank: 2

Rank: 2

LF property implicitly encodes the Suffix Array

Page 41: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Constructingtheindex

•  HowdoweconstructaBWTindex?•  CalculatingtheBWTiscloselyrelatedtobuildingasuffixarray•  EachelementoftheBWTcanbederivedfromthecorrespondingelement

ofthesuffixarray:

•  Onecouldgenerateallsuffixes,sortthentoobtaintheSA,thencalculatetheBWTinasinglepassoverthesuffixarray

•  However,constructingtheentiresuffixarrayinmemoryrequiresatleast~12gigabytesforthehumangenome

•  Instead,Bowtieusesablock-wisestrategy:buildsthesuffixarrayandtheBWTblock-by-block,discardingsuffixarrayblocksoncethecorrespondingBWTblockhasbeenbuilt

•  Bowtiecanbuildthefullindexforthehumangenomeinabout24hoursinlessthan1.5gigabytesofRAM

•  If16gigabytesofRAMormoreisavailable,BowtiecanexploittheadditionalRAMtoproducethesameindexinabout4.5hours

Page 42: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Constructingtheindex

•  HowdoweconstructaBWTindex?•  CalculatingtheBWTiscloselyrelatedtobuildingasuffixarray•  EachelementoftheBWTcanbederivedfromthecorrespondingelement

ofthesuffixarray:

•  Onecouldgenerateallsuffixes,sortthentoobtaintheSA,thencalculatetheBWTinasinglepassoverthesuffixarray

•  However,constructingtheentiresuffixarrayinmemoryrequiresatleast~12gigabytesforthehumangenome

•  Instead,Bowtieusesablock-wisestrategy:buildsthesuffixarrayandtheBWTblock-by-block,discardingsuffixarrayblocksoncethecorrespondingBWTblockhasbeenbuilt

•  Bowtiecanbuildthefullindexforthehumangenomeinabout24hoursinlessthan1.5gigabytesofRAM

•  If16gigabytesofRAMormoreisavailable,BowtiecanexploittheadditionalRAMtoproducethesameindexinabout4.5hours

Page 43: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Storingtheindex

•  ThelargestsinglecomponentoftheBowtieindexistheBWTsequence.BowtiestorestheBWTina2-bit-per-baseformat

•  ABowtieindexfortheassembledhumangenomesequenceisabout1.3gigabytes

•  AfullBowtieindexactuallyconsistsofpairofequal-sizeindexes,theforwardandmirrorindexes,foranygivengenome,butitcanberunsuchthatonlyoneofthetwoindexesiseverresidentinmemoryatonce(usingthe–zoption)

•  Whataboutgaps?

Page 44: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie2

•  Bowtie:veryefficientungappedalignmentofshortreadsbasedonBWTindex

•  Index-basedalignmentalgorithmscanbequiteinefficientwhengapsareallowed

•  Gapscanresultsfrom–  sequencingerrors–  trueinsertionsanddeletions

•  Bowtie2extendstheindex-basedapproachofBowtietopermitgappedalignment

•  Itdividesthealgorithmintotwostages1.aninitial,ungappedseed-findingstagethatbenefitsfromthespeedandmemoryefficiencyofthefull-textindex2.agappedextensionstagethatusesdynamicprogrammingandbenefitsfromtheefficiencyofsingle-instructionmultiple-data(SIMD)parallelprocessingavailableonmodernprocessors

Page 45: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Bowtie2

•  Bowtie:veryefficientungappedalignmentofshortreadsbasedonBWTindex

•  Index-basedalignmentalgorithmscanbequiteinefficientwhengapsareallowed

•  Gapscanresultsfrom–  sequencingerrors–  trueinsertionsanddeletions

•  Bowtie2extendstheindex-basedapproachofBowtietopermitgappedalignment

•  Itdividesthealgorithmintotwostages1.aninitial,ungappedseed-findingstagethatbenefitsfromthespeedandmemoryefficiencyofthefull-textindex2.agappedextensionstagethatusesdynamicprogrammingandbenefitsfromtheefficiencyofsingle-instructionmultiple-data(SIMD)parallelprocessingavailableonmodernprocessors

Page 46: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

BWA

Page 47: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which
Page 48: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Solution: •  sequential maximum mappable seed search in uncompressed suffix arrays •  followed by seed clustering and stitching procedure.

“Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the •  non-contiguous transcript structure, •  relatively short read lengths and •  constantly increasing throughput of the sequencing technologies.”

“Currently available RNA-seq aligners suffer from •  high mapping error rates, •  low mapping speed, •  read length limitation and •  mapping biases.”

Page 49: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-searchforGATTACA

Lo = 9; Hi = 9

Lo Hi

Middle = Suffix[9] = GATTACAG…

Compare GATTACA to GATTACAG… => Match

Return: match at position 2

Mid = (9+9)/2 = 9

Page 50: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

Suffixarrays-searchforGATTACA

Lo = 9; Hi = 11

Lo

Hi

Middle = Suffix[10] = GATTACC

Compare GATTACA to GATTACC => Lower

Hi = Mid - 1

Mid = (9+11)/2 = 10

Page 51: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

MaximumMappablePrefix(MMP)search

•  Using uncompressed suffix arrays leads to increased speed (compared to BWT) •  This speed advantage is traded off against the increased memory usage

Page 52: Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... · - Even if we could build this table, 99.7% will be empty - But we don't know which

MaximumMappablePrefix(MMP)search

•  Using uncompressed suffix arrays leads to increased speed (compared to BWT) •  This speed advantage is traded off against the increased memory usage