Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... ·...

ShortReadAlignmentAlgorithms

RalucaGordân

DepartmentofBiostatisticsandBioinformaticsDepartmentofComputerScience

DepartmentofMolecularGeneticsandMicrobiologyCenterforGenomicandComputationalBiology

DukeUniversity

July23,2018

Sequencingapplications

ChIP-seq: identify and measure significant peaks GAAATTTGC GGAAATTTG CGGAAATTT CGGAAATTT TCGGAAATT CTATCGGAAA CCTATCGGA TTTGCGGT GCCCTATCG AAATTTGC…CC GCCCTATCG AAATTTGC ATAC……CCATAGGCTATATGCGCCCTATCGGAAATTTGCGGTATAC…

Genotyping: identify variations GGTATAC……CCATAG TATGCGCCC CGGAAATTT CGGTATAC…CCAT CTATATGCG TCGGAAATT CGGTATAC…CCAT GGCTATATG CTATCGGAAA GCGGTATA…CCA AGGCTATAT CCTATCGGA TTGCGGTA C……CCA AGGCTATAT GCCCTATCG TTTGCGGT C……CC AGGCTATAT GCCCTATCG AAATTTGC ATAC……CC TAGGCTATA GCGCCCTA AAATTTGC GTATAC……CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…

RNA-seq – this course

Heterochromatin

Transcripts

Transcription

Translation

Alternative splicing

Euchromatin

Three-dimensionalarrangement ofchromatin5C or ChIA-PET

DNA methylation

MeDIP-seqRRBSMethylC-seq

Nascent transcriptbound to RNA PolII

GRO-SeqNET-seq

Quantification ofmature transcriptsand small RNA

RNA-seq

Transcription factor-binding sites

ChIP-seq

Histonemodifications

ChIP-seq

Open or accessibleregions in euchromatin

FAIREDNAse-seq

lncRNA–chromatininteraction

ChIRP-seq

Cancer

Alternate splice variants

RNA-seq

Translational efficiency

Ribo-seq

ONH

H3C

NH2

N

Biomarkers Personal omics profiling

Repressivehistone marks,e.g., H3K27me3

Activatinghistone marks,e.g., H3K4me3

Activatinghistone marks,e.g., H3K27ac

Figure 2 Sequencing technologies and their uses. Various NGS methods can precisely map and quantify chromatin features, DNA modifications and several specific stepsin the cascade of information from transcription to translation. These technologies can be applied in a variety of medically relevant settings, including uncovering regulatorymechanisms and expression profiles that distinguish normal and cancer cells, and identifying disease biomarkers, particularly regulatory variants that fall outside of protein-coding regions. Together, these methods can be used for integrated personal omics profiling to map all regulatory and functional elements in an individual. Using this basalprofile, dynamics of the various components can be studied in the context of disease, infection, treatment options, and so on. Such studies will be the cornerstone ofpersonalized and predictive medicine.

High-throughput sequencingWW Soon et al

4 Molecular Systems Biology 2013 & 2013 EMBO and Macmillan Publishers Limited

Soon, Hariharan, Snyder Molecular Systems Biology 2012

ATAC-seq

Sequencingtechnologies

https://en.wikipedia.org/wiki/DNA_sequencing

TECHNICALBIASES!

Sequencealignment

Heuristiclocalalignment(BLAST)

•  INPUT:–  Database

–  Query:PSKMQRGIKWLLP•  OUTPUT:

–  sequencessimilartoquery

Global/localalignment(Needleman-Wunsch,Smith-Waterman)

X = x1x2…xm Y = y1y2…yn

•  INPUT:–  Twosequences

•  OUTPUT:–  OptimalalignmentbetweenXandY(orsubstringsofXandY)

Shortreadalignment

•  INPUT:–  Afewmillionshortreads,withcertainerrorcharacteristics(specifictothesequencingplatform)

–  Illumina:fewerrors,mostlysubstitutions–  Areferencegenome

•  OUTPUT:–  Alignmentsofthereadstothereferencegenome

•  CanweuseBLAST?

•  Algorithmsforexactstringmatchingaremoreappropriate

•  AssumingBLASTreturnstheresultforareadin1sec•  For10millionreads:10millionseconds=116days

Algorithmsforexactstringmatching

•  SearchforthesubstringANAinthestringBANANA

BruteForce SuffixArray

PacBioAligner(BLASR);Bowtie

BinarySearch

BLAST,BLAT

Seed-and-extend

Hash(Index)Table

Timecomplexityversusspacecomplexity

Naive

Slow&Easy

BruteforcesearchforGATTACA

•  WhereisGATTACAinthehumangenome?

No match at offset 1

Match at offset 2

No match at offset 3...

BruteforcesearchforGATTACA

•  Simple,easytounderstand•  Analysis

–Genomelength=n=3,000,000,000–Querylength=m=7–Comparisons:(n-m+1)*m=21,000,000,000

•  Assumingeachcomparisontakes1/1,000,000ofasecond…•  …thetotalrunningtimeis21,000seconds=0.24days•  …forone7-bpread

SuffixarraysSuffixarray

•  Preprocessthegenome–  Sortallthesuffixesofthegenome

Split into suffixes Sort suffixes alphabetically

•  Usebinarysearch

Suffixarrays-searchforGATTACA

Lo = 1; Hi = 15 Lo

Hi

Middle = Suffix[8] = CC

Compare GATTACA to CC => Higher

Lo = Mid + 1

Mid = (1+15)/2 = 8


Lo = 9; Hi = 15

Lo

Hi

Middle = Suffix[12] = TACC

Compare GATTACA to TACC => Lower

Hi = Mid - 1

Mid = (9+15)/2 = 12


Lo = 9; Hi = 11

Lo

Hi

Middle = Suffix[10] = GATTACC

Compare GATTACA to GATTACC => Lower

Hi = Mid - 1

Mid = (9+11)/2 = 10


Lo = 9; Hi = 9

Lo Hi

Middle = Suffix[9] = GATTACAG…

Compare GATTACA to GATTACAG… => Match

Return: match at position 2

Mid = (9+9)/2 = 9

Suffixarrays-analysis

•  Word(query)ofsizem=7•  Genomeofsizen=3,000,000,000•  Bruceforce:

–  approx.mxn=21,000,000,000comparisons•  Suffixarrays:

–  approx.mxlog2(n)=7x32=224comparisons

•  Assumingeachcomparisontakes1/1,000,000ofasecond…•  …thetotalrunningtimeis0.000224secondsforone7-bpread•  Comparedto0.24daysforone7-bpreadinthecaseofbruteforcesearch

•  For10millionreads,thesuffixarraysearchwouldtake2240seconds=37minutes

Suffixarrays-analysis

•  Word(query)ofsizem=7•  Genomeofsizen=3,000,000,000

•  For10millionreads,thesuffixarraysearchwouldtake2240seconds=37minutes

•  Problem?

•  Forthehumangenome: 4.5billionbillioncharacters!!!

•  Totalcharactersinallsuffixescombined:1+2+3+…+n=


n(n+1)/2


•  SearchforthesubstringANAinthestringBANANA


PacBioAligner(BLASR);Bowtie

BinarySearch

BLAST,BLAT

Seed-and-extend

Hash(Index)Table


Naive

Slow&Easy

Hashing

•  WhereisGATTACAinthehumangenome?-Buildaninvertedindexofeveryk-merinthegenome

•  Howdoweaccessthetable?-Wecanonlyusenumberstoindex-Encodesequencesasnumbers Simple:A=0,C=1,G=2,T=3=>GATTACA=2,033,010Smart:A=002,C=012,G=102,T=112=>GATTACA=100011110001002=915610

•  Lookup:veryfast•  Butconstructinganoptimalhashistricky

Hashing

•  Numberofpossiblesequencesoflengthkis4k

•  K=7=>47=16,384(easytostore)•  K=20=>420=1,099,511,627,776(impossibletostore

directlyinRAM)-Thereareonly3B20-mersinthegenome-Evenifwecouldbuildthistable,99.7%willbeempty-Butwedon'tknowwhichcellsareemptyuntilwetry

Hashing

•  Numberofpossiblesequencesoflengthkis4k

•  K=7=>47=16,384(easytostore)•  K=20=>420=1,099,511,627,776(impossibletostore

directlyinRAM)-Thereareonly3B20-mersinthegenome-Evenifwecouldbuildthistable,99.7%willbeempty-Butwedon'tknowwhichcellsareemptyuntilwetry

•  Useahashfunctiontoshrinkthepossiblerange-Mapsanumbernin[0,R]tohin[0,H]

•  Use128bucketsinsteadof16,384-Division:hash(n)=H*n/R;

•  hash(GATTACA)=128*9156/16384=71-Modulo:hash(n)=n%H

•  hash(GATTACA)=9156%128=68

Hashing

•  Byconstruction,multiplekeyshavethesamehashvalue–  Storeelementswiththesamekeyinabucketchainedtogether

•  Agoodhashevenlydistributesthevalues:R/Hhavethesamehashvalue

–  Lookingupavaluescanstheentirebucket


•  SearchforthesubstringGATTACAinthegenome


Highspacecomplexity

Fast(binarysearch)Trickytodevelophashfunction

Fast

Hash(Index)Table

Easy

Slow

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

•  BowtieindexesthegenomeusingaschemebasedontheBurrows-Wheelertransform(BWT)andtheFerragina-Manzini(FM)index

Burrows-Wheelertransform

•  TheBWTisareversiblepermutationofthecharactersinatext•  BWT-basedindexingallowslargetextstobesearchedefficientlyin

asmallmemoryfootprint

T Burrows-Wheeler transform of T:

BWT(T)

Burrows-Wheeler matrix of T

All cyclic rotations of T$, sorted lexicographically

Same size as T

Lastfirst(LF)mapping

•  TheBWmatrixhasapropertycalledlastfirst(LF)mapping:TheithoccurrenceofcharacterXinthelastcolumncorrespondstothesametextcharacterastheithoccurrenceofXinthefirstcolumn

•  ThispropertyisatthecoreofalgorithmsthatusetheBWTindextosearchthetext

Rank: 2

LF property implicitly encodes the Suffix Array

Rank: 2


WecanrepeatedlyapplyLFmappingtoreconstructTfromBWT(T)UNPERMUTEalgorithm(BurrowsandWheeler,1994)

1 2 3 4 5 6

LFmappingandexactmatching

EXACTMATCHalgorithm(FerraginaandManzini,2000)-calculatestherangeofmatrixrowsbeginningwithsuccessivelylongersuffixesofthequeryReference:acaacg.Query:aac

the matrix is sorted lexicographically

rows beginning with a given sequence appear consecutively

At each step, the size of the range either shrinks or remains the same

1 2 3 4 5 6





Whataboutmismatches?


Mismatches?

•  EXACTMATCHisinsufficientforshortreadalignmentbecausealignmentsmaycontainmismatches

•  Whatarethemaincausesformismatches?-sequencingerrors-differencesbetweenreferenceandqueryorganisms

Bowtie–mismatchesandbacktrackingsearch

•  EXACTMATCHisinsufficientforshortreadalignmentbecausealignmentsmaycontainmismatches

•  Bowtieconductsabacktrackingsearchtoquicklyfindalignmentsthatsatisfyaspecifiedalignmentpolicy

•  Eachcharacterinareadhasanumericqualityvalue,withlowervaluesindicatingahigherlikelihoodofasequencingerror

•  Example:IlluminausesPhredqualityscoringPhredscoreofabaseis:Qphred=-10*log10(e)whereeistheestimatedprobabilityofabasebeingwrong

•  Bowtiealignmentpolicyallowsalimitednumberofmismatchesandprefersalignmentswherethesumofthequalityvaluesatallmismatchedpositionsislow

Bowtie-backtrackingsearch

•  ThesearchissimilartoEXACTMATCH•  Itcalculatesmatrixrangesforsuccessivelylongerquerysuffixes


•  ThesearchissimilartoEXACTMATCH•  Itcalculatesmatrixrangesforsuccessivelylongerquerysuffixes•  Iftherangebecomesempty(asuffixdoesnotoccurinthetext),

thenthealgorithmmayselectanalready-matchedquerypositionandsubstituteadifferentbasethere,introducingamismatchintothealignment

•  TheEXACTMATCHsearchresumesfromjustafterthesubstitutedposition

•  Thealgorithmselectsonlythosesubstitutionsthatareconsistentwiththealignmentpolicy

Exactsearchforggta

ggta ggta ggta

Search is greedy: the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of number of mismatches or in terms of quality

Inexactsearchforggta


•  Thisstandardalignercan,insomecases,encountersequencesthatcauseexcessivebacktracking

•  Bowtiemitigatesexcessivebacktrackingwiththenoveltechniqueofdoubleindexing–  Idea:create2indicesofthegenome:onecontainingtheBWTofthegenome,calledtheforwardindex,

andasecondcontainingtheBWTofthegenomewithitssequencereversed(notreversecomplemented)calledthemirrorindex.

•  Let’sconsideramatchingpolicythatallowsonemismatchinthealignment(eitherinthefirsthalforinthesecondhalf)

•  Bowtieproceedsintwophases:1.loadtheforwardindexintomemoryandinvokethealignerwiththeconstraintthatitmaynotsubstituteatpositionsinthequery'srighthalf2.loadthemirrorindexintomemoryandinvokethealigneronthereversedquery,withtheconstraintthatthealignermaynotsubstituteatpositionsinthereversedquery'srighthalf(theoriginalquery'slefthalf).

•  Theconstraintsonbacktrackingintotherighthalfpreventexcessivebacktracking,whereastheuseoftwophasesandtwoindicesmaintainsfullsensitivity

Bowtie-backtrackingsearch•  Basequalityvariesacrosstheread•  Bowtieallowstheusertoselect

-thenumberofmismatchespermittedinthehigh-qualityendofaread(default:2mismatchesinthefirst28bases)

-maximumacceptablequalityofmismatchedpositionsoverthealignment(default:70PHREDscore)

•  Thefirst28basesonthehigh-qualityendofthereadaretermedtheseed•  Theseedconsistsoftwohalves:

-the14bponthehigh-qualityend(usuallythe5'end)=thehi-half-the14bponthelow-qualityend=lo-half

•  Assuming2mismatchespermittedintheseed,areportablealignmentwillfallintooneoffourcases:1.nomismatchesinseed;2.nomismatchesinhi-half,oneortwomismatchesinlo-half3.nomismatchesinlo-half,oneortwomismatchesinhi-half4.onemismatchinhi-half,onemismatchinlo-half

Bowtie

•  TheBowtiealgorithmconsistsofthreephasesthatalternatebetweenusingtheforwardandmirrorindices

1. no mismatches in seed 2. no mismatches in hi-half, one or two mismatches in lo-half 3. no mismatches in lo-half, one or two mismatches in hi-half 4. one mismatch in hi-half, one mismatch in lo-half

partial alignments with mismatches only in the hi-half

extend the partial alignments into full alignments.

Maq:MappingandAssemblywithQualities SOAP=ShortOligonucleotideAnalysisPackage

Aligning2millionreadstothehumangenome


•  TheBWmatrixhasapropertycalledlastfirst(LF)mapping:TheithoccurrenceofcharacterXinthelastcolumncorrespondstothesametextcharacterastheithoccurrenceofXinthefirstcolumn

•  ThispropertyisatthecoreofalgorithmsthatusetheBWTindextosearchthetext

Rank: 2

Rank: 2

LF property implicitly encodes the Suffix Array

Constructingtheindex

•  HowdoweconstructaBWTindex?•  CalculatingtheBWTiscloselyrelatedtobuildingasuffixarray•  EachelementoftheBWTcanbederivedfromthecorrespondingelement

ofthesuffixarray:

•  Onecouldgenerateallsuffixes,sortthentoobtaintheSA,thencalculatetheBWTinasinglepassoverthesuffixarray

•  However,constructingtheentiresuffixarrayinmemoryrequiresatleast~12gigabytesforthehumangenome

•  Instead,Bowtieusesablock-wisestrategy:buildsthesuffixarrayandtheBWTblock-by-block,discardingsuffixarrayblocksoncethecorrespondingBWTblockhasbeenbuilt

•  Bowtiecanbuildthefullindexforthehumangenomeinabout24hoursinlessthan1.5gigabytesofRAM

•  If16gigabytesofRAMormoreisavailable,BowtiecanexploittheadditionalRAMtoproducethesameindexinabout4.5hours

Storingtheindex

•  ThelargestsinglecomponentoftheBowtieindexistheBWTsequence.BowtiestorestheBWTina2-bit-per-baseformat

•  ABowtieindexfortheassembledhumangenomesequenceisabout1.3gigabytes

•  AfullBowtieindexactuallyconsistsofpairofequal-sizeindexes,theforwardandmirrorindexes,foranygivengenome,butitcanberunsuchthatonlyoneofthetwoindexesiseverresidentinmemoryatonce(usingthe–zoption)

•  Whataboutgaps?

Bowtie2

•  Bowtie:veryefficientungappedalignmentofshortreadsbasedonBWTindex

•  Index-basedalignmentalgorithmscanbequiteinefficientwhengapsareallowed

•  Gapscanresultsfrom–  sequencingerrors–  trueinsertionsanddeletions

•  Bowtie2extendstheindex-basedapproachofBowtietopermitgappedalignment

•  Itdividesthealgorithmintotwostages1.aninitial,ungappedseed-findingstagethatbenefitsfromthespeedandmemoryefficiencyofthefull-textindex2.agappedextensionstagethatusesdynamicprogrammingandbenefitsfromtheefficiencyofsingle-instructionmultiple-data(SIMD)parallelprocessingavailableonmodernprocessors

Solution: •  sequential maximum mappable seed search in uncompressed suffix arrays •  followed by seed clustering and stitching procedure.

“Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the •  non-contiguous transcript structure, •  relatively short read lengths and •  constantly increasing throughput of the sequencing technologies.”

“Currently available RNA-seq aligners suffer from •  high mapping error rates, •  low mapping speed, •  read length limitation and •  mapping biases.”


Lo = 9; Hi = 9

Lo Hi

Middle = Suffix[9] = GATTACAG…

Compare GATTACA to GATTACAG… => Match

Return: match at position 2

Mid = (9+9)/2 = 9


Lo = 9; Hi = 11

Lo

Hi

Middle = Suffix[10] = GATTACC

Compare GATTACA to GATTACC => Lower

Hi = Mid - 1

Mid = (9+11)/2 = 10

MaximumMappablePrefix(MMP)search

•  Using uncompressed suffix arrays leads to increased speed (compared to BWT) •  This speed advantage is traded off against the increased memory usage

Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... ·...

Documents

Transcript of Short Read Alignment Algorithmspeople.duke.edu/~ccc14/duke-hts-2018/_downloads/0723_ShortRead... ·...