Algorithms for high-quality mapping of NGS reads Paolo Ribeca Algorithm Development Centro Nacional de Análisis Genómico, Barcelona Bioinformatics for Omics Sciences, Napoli, 26.09.2012


Page 1: Algorithms for high-quality mapping of NGS reads Paolo Ribecabioinformatica.isa.cnr.it/BBCC/BBCC2012/PDF/PRibeca-B4OS-Naples... · Algorithms for high-quality mapping of NGS reads

Algorithms for high-quality mapping of NGS reads

Paolo Ribeca
Algorithm Development
Centro Nacional de Análisis Genómico, Barcelona

Bioinformatics for Omics Sciences, Napoli, 26.09.2012


Our HTS setup

The Spanish National Sequencing Center

Core funding 2010-2012:

15 M€ from national government

15 M€ from Catalan government

12 Illumina HiSeq 2000, 2 GAIIx, 1 MiSeq

~1 Tb produced per day

Our dedicated cluster has 1200 cores

2.5 PB of dedicated storage

About 50 people (100 in the future), half of them bioinformaticians.

ICGC project, plus many others.


A problem of data & software scaling

Throughput constantly improves much more than computing power

• The first Solexa Genome Analyzer @ CRG, 2008: 3 Gb / run

and at the beginning (2010), the CNAG used to have 12 GA IIx: 30 Gb / run each.

• Current Illumina HiSeq 2000 machines produce > 500 Gb / run

after some relatively minor hardware upgrade, which gave a 3x boost.

The CNAG has 12 of them.

• The CNAG has a peak sequencing capacity of ~ 1 Tb / day... while the one originally planned in 2010 was 50-100 Gb / day.

• ...but still “only” 1000 processors in its cluster

which is the amount originally intended to process 50-100 Gb / day.

• Need for more optimized algorithms!


But what is this HTS anyway?

Actually, a large body of completely different technologies.

Single-molecule sequencing, or cluster-based (amplification-dependent) sequencing.

Sequences can be read optically (fluorescence, laser) or electrically.

Only one end of the molecule is sequenced (single-end, SE) or both (paired-end, PE). Or even strobed sequencing, with 4 or more chunks.

Sequencing errors can be very different (homopolymers or substitutions).

Typical platforms are:

Illumina/Solexa

ABI/SOLiD

Roche/454.


Practical problems

¡Having sequencers ≠ producing good scientific data!

Many small research groups buy sequencers just to find out that they produce a flood of meaningless data.

Keeping the machines idle is very expensive.

At least two places where expertise is essential:

Sample preparation/Protocols

Downstream analysis protocols.

A horde of different biological protocols for data production are commonly used: DNA-seq, RNA-seq, ChIP-seq, methylation/epigenetic studies, etc. For each one, a different analysis protocol is required! Protocols are platform-dependent!

Still a long way to go before good protocols are established.


What can be done with HTS

DNA (whole-genome, targeted, exome) resequencing: YES

SNP calling: YES

Variant calling: SOMETIMES

DNA shotgun sequencing & de-novo assembly: NO

         (with fosmid pooled sequencing: SOMETIMES/YES)

RNA quantification:

Gene-level: SOMETIMES/YES

Isoform-level: NO

… and so on.

What should be the job of an Algorithm Development group?


Practical computational problems

Two main categories of problems:

Technological problems deriving from short reads:

Mapping (that is, HT alignment)

Assembly

Most of the problems are platform-dependent!

Problems deriving from the huge amount of data (storage!).

Some of the technological problems will be the subject of my talks.

Size-related problems are trickier: sequencing vs. storage.

Although highly correlated from cell to cell, the information content of a human body is huge (6 Gbit/cell)

Clonal subpopulations. Microbiome.

Is each biological sample really unique?


HT Alignment


General setup

Resequencing problem. The general assumption is that (a reasonable approximation of) the reference genome (for instance, hg19) is known.

This is often not the case (new species, metagenomics).

When this is the case, many interesting biological situations:

DNA/exome resequencing (SNPs & structural variations): domestication, agronomy, cancer, genetic diseases/disorders, genotyping, personalized medicine

RNA, expression quantification

Epigenomics/chromatin structure & regulation (ChIP-seq, histone modifications, micrococcal assays for nucleosomes)

Some technical issues: mappability.


The problem of short read alignment

Alignment to a reference in an HTS setup:

Many short reads (billions) to be matched against...

[850 GB produced per day, 100-150 nt PE reads: 3-4 Gr/day]

Very long references (many GB), with...

Several mismatches (typically one every 25 nt), and...

Problem-dependent alignment! RNA requires spliced alignment, bisulfite requires a modified reference, etc.

With those specifications, “old” tools like BLAST do not offer adequate performance anymore.

With those specifications, new tools might not offer adequate accuracy anymore (many indels, RNA editing, etc.).

¡Speed is not the only important thing!

Often, tradeoff between speed and quality.


Is that really all there is to it?

Many additional requirements:

Qualities

Paired-end/Mate-pair

Bisulfite

Realignment

Platform-dependent problems!

Homopolymers (Roche/454)

Colorspace (ABI/SOLiD, Helicos)

Quartets (Complete Genomics)

Very long reads with very many indels (Pacific Biosciences)

Storage-, pipeline- and data-manipulation-related problems.


Qualities:
  @EAS131_1_FC30C5KAAXX/1
  GAGTTTCCTCCTGCAGATGTGAACTGTGTAAATAGTCAGAACTGATCGA
  +EAS131_1_FC30C5KAAXX/1
  aabaabbaaaaaaaaaaaaaaaabbaaaaaabaaaZaaaaaaUKZEUaZ

Paired-end/Mate-pair:
  ACGTTTTCAGACAGA--------------------ACGATACTAGATCA

Bisulfite: unmethylated C => U, that is T:
  ACGTTCTTCAGAC => ACGTTTTTCAGAT or ATCTGAAAAACGT

Realignment:
  ACGTTTTCGATGATAGAT        ACGTTTTCGATGATAGAT
    GTTTTCGTGATAGAT    =>     GTTTTCG-TGATAGAT
      TTTCGTGATAGAT             TTTCG-TGATAGAT

Colorspace:
      A C G T
  A   0 1 2 3
  C   1 0 3 2      AGTCTATCTC => A212233222
  G   2 3 0 1
  T   3 2 1 0
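As a rough illustration of the colorspace table, the Python sketch below encodes a base sequence as its first base followed by one colour per transition. It relies on the observation that, with the assumed codes A=0, C=1, G=2, T=3, each table entry is the XOR of the two base codes. The function names are hypothetical and are not taken from any SOLiD tool.

```python
# Assumed 2-bit base codes; the colour of a transition is the XOR of the
# two codes, which reproduces the 4x4 colorspace table.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def to_colorspace(seq):
    """Encode a base sequence as its first base followed by transition colours."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def from_colorspace(cs):
    """Decode a colour-space string (first base + colours) back into bases."""
    bases = [cs[0]]
    for c in cs[1:]:
        bases.append(BASE[CODE[bases[-1]] ^ int(c)])
    return "".join(bases)


print(to_colorspace("AGTCTATCTC"))  # A212233222
```

Decoding depends on every previous colour, so a single colour error changes all the bases decoded after it; this is one reason why colorspace reads are usually mapped in colour space rather than translated first.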


Implementor's and user's standpoints

Several possible indexing techniques, with flames in the literature and claims of being “superior”. Typically used:

Hash tables
PROS: Searches with many errors easier
CONS: Slower, error rate is pre-determined, bulky

Some data structure like the Burrows-Wheeler Transform
PROS: Very fast exact searches, no hardcoded parameters
CONS: Searches with errors more difficult.

What to index?

Pre-indexing the genome and scanning the reads

Pre-indexing the reads and scanning the genome

Pre-indexing both, and doing merge-sort.

From the point of view of the user, all of that is quite irrelevant. What matters are the requirements of the problem (number of mismatches/precision needed) and the quality of the results obtained.


User's Standpoint


The short-read mapping problem

Not as simple as just downloading software and running it on your data

• There is no such simple thing as “finding good matches”, since there are many more or less accepted ways of defining what should be considered “good”.

• Ideally, “good” matches are defined in terms of some string distance, which allows one to compare the query and the match and decide how “good” the match is...

• ...whose definition might be very, very complicated (think about “classical” biological alignment à la BLAST).

While mapping, however, simpler definitions are usually accepted, typically:


• Hamming distance (the number of mismatches)


• Levenshtein distance (the number of edit operations; modified definitions are common, though).


In general, the need for fast mapping usually implies more relaxed accuracy requirements (sometimes justified).
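To make the two distances concrete, here is a minimal Python sketch of both. It assumes the plain textbook definitions (unit cost for every substitution, insertion and deletion); the modified definitions mentioned on the slide are not covered.

```python
def hamming(a, b):
    """Number of mismatching positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


print(hamming("ACGT", "ACCT"))                                  # 1
print(levenshtein("GTTTTCGTGATAGAT", "GTTTTCGATGATAGAT"))       # 1
```

The second call uses a read that differs from the reference by one missing base: a single edit operation, but many mismatches if the read is forced into an ungapped alignment.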


The “iceberg model” for mappings

The matches you find are just the tip of the iceberg

• In general, given a larger and larger distance, one will find more and more matches, eventually being able to align to all the locations in the reference.

This is akin to a submerged iceberg: the tip is the query, and the part above the surface is the set of results of the search.

The more permissive the parameters of the search, the larger the part of the iceberg sticking out of the water.

• A “stratum” is a set of matches having the same distance from the query, that is, a layer of the iceberg made of matches which all have the same distance from its tip.

One can output only 1 stratum (just the best matches), 2 strata (optimal + best suboptimal), and so on.

Most mappers can access only the best stratum. This influences the definition of uniqueness.

• The number of matches in successive strata grows at a rate that depends on the read, for instance on its degree of repetitiveness. This is portrayed by the “slant” of the iceberg tip.
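The notion of strata can be sketched with a brute-force scan. The Python fragment below is only an illustration under simplifying assumptions (Hamming distance, forward strand only, a toy reference held in memory); the function name `strata` is hypothetical and this is not how an indexed mapper works internally.

```python
from collections import defaultdict


def strata(read, reference, max_distance):
    """Group all ungapped placements of `read` by their Hamming distance.

    Returns {distance: [positions]} for distances up to max_distance;
    each key is one stratum of the "iceberg".
    """
    layers = defaultdict(list)
    m = len(read)
    for pos in range(len(reference) - m + 1):
        d = sum(x != y for x, y in zip(read, reference[pos:pos + m]))
        if d <= max_distance:
            layers[d].append(pos)
    return dict(sorted(layers.items()))


print(strata("ACGT", "ACGTACGAACGT", 1))  # {0: [0, 8], 1: [4]}
```

In this toy case the read is not unique even in its best stratum (two exact matches), and relaxing the distance to 1 exposes a further match in the next stratum.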


Implementor's Standpoint


Hash tables I

Basic idea: one can map strings to numbers via a hashing function H. One example is md5sum. My file becomes 28831fa678f075862b9da748802fcba7.

Finding good (clash-free) hashes for big strings is difficult.

Here we have small strings... Simple solution: just encode the string. For instance,
  C G T A C G A T A C G G G G
  => 1 2 3 0 1 2 0 3 0 1 2 2 2 2
  => 0110110001100011000110101010

We can take advantage of bit-parallelism tricks to do things faster (k=32 bases could fit into a single 64-bit word).

We install all the reads in a hash table of size S: read R goes to a bucket corresponding to position H[R] mod S (S being a “small” number like 10,000,000). We scan, say, the genome (sliding window of 32) checking the hash for the k-mer appearing at each position.
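The encoding and the read-indexing scheme described here can be sketched in a few lines of Python. This is a toy version under stated assumptions: all reads have the same length k, the alphabet is strictly A/C/G/T, and a dictionary of buckets stands in for a real fixed-size table. The names `encode`, `index_reads` and `scan_genome` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def index_reads(reads, table_size):
    """Install each read in the bucket encode(read) mod table_size."""
    buckets = {}
    for read in reads:
        buckets.setdefault(encode(read) % table_size, []).append(read)
    return buckets


def scan_genome(genome, buckets, k, table_size):
    """Slide a window of length k over the genome and report exact hits."""
    hits = []
    for pos in range(len(genome) - k + 1):
        window = genome[pos:pos + k]
        for read in buckets.get(encode(window) % table_size, []):
            if read == window:  # different k-mers may share a bucket
                hits.append((read, pos))
    return hits


print(format(encode("CGTACGATACGGGG"), "028b"))  # 0110110001100011000110101010
```

A production implementation would update the window code incrementally (shift in two bits, mask off the oldest base) instead of re-encoding every window from scratch.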


Hash tables II

How to accommodate mismatches? Spaced seeds/filters. For instance:

111111111111100000000000000000000000000011111111111110000000000000000000000000000000001111111111111111111100000011111100000000000000

(1 = match, 0 = don't care) can detect up to two mismatches for reads of length up to 33.

Indels are more complicated.

Major problems:

k difficult to change

the hash tables (mismatches => many of them) are bulky.

The scheme can be extended to index both genome and queries: merge-sort of the encoded k-mers & positions.
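The filtering idea can be illustrated with the simplest possible set of masks, which is an assumption of this sketch and not the mask set shown on the slide: split the read into (mismatches + 1) contiguous blocks, so that at least one block must match exactly (pigeonhole principle), then verify every candidate position. Plain string search stands in for the hash-table lookup; all function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_bounds(length, parts):
    """Split [0, length) into `parts` contiguous blocks of near-equal size."""
    return [(i * length // parts, (i + 1) * length // parts) for i in range(parts)]


def find_all(text, pattern):
    """Yield every start position of an exact occurrence of pattern in text."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def map_with_mismatches(read, genome, max_mismatches=2):
    """Return {position: distance} for all placements within max_mismatches."""
    m = len(read)
    found = {}
    for start, end in block_bounds(m, max_mismatches + 1):
        # Seed: exact match of one block; then verify the whole read.
        for occ in find_all(genome, read[start:end]):
            pos = occ - start
            if 0 <= pos <= len(genome) - m and pos not in found:
                d = hamming(read, genome[pos:pos + m])
                if d <= max_mismatches:
                    found[pos] = d
    return dict(sorted(found.items()))


print(map_with_mismatches("ACGTACGTTT", "GGACGTACGTTTCCACGAACGATTGG"))
# {2: 0, 14: 2}
```

With substitutions only, this filter is lossless: any placement with at most two mismatches leaves one of the three blocks untouched. Indels shift the blocks relative to each other, which is why they are harder to handle.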


String data structures

Data structures for string manipulation (SDS) are becoming more and more essential: Internet! (and biology).

For several reasons, they never really entered the mainstream of computer science before the invention of compressed SDS.

For instance, suffix trees were invented in 1973, but it took them >20 years to become widely known.

Structures relevant to this talk:

Suffix trees (the “mother” of all structures: virtually any kind of fast string operation after a slow preprocessing)

Suffix arrays (simpler and faster than suffix trees, roughly equivalent with the help of some auxiliary data structure)

Compressible structures based on text transforms (Burrows-Wheeler transform and Ferragina-Manzini index).


The Suffix Tree I

The main idea is always that of creating a list of all suffixes/prefixes present in the string.

(1) Sort lexicographically (2) Identify common blocks (3) Build tree
(Nodes must have degree ≥ 2; we are omitting suffix links, needed to stream text).

  BANANA$, 0
  ANANA$, 1
  NANA$, 2
  ANA$, 3
  NA$, 4
  A$, 5
  $, 6


The Suffix Tree II

Weiner (1973), McCreight (1976), Ukkonen (1995), Gusfield.

Suffix trees show nice algorithmic properties (after a long preprocessing phase, OK for genomes):

They can be built in linear time

Text can be reconstructed if we include suffix links

Pattern matching on P with k mismatches can be done in time O(k length(P) + Nocc) (using matching statistics)

(Super)maximal repeats can be found in O(n + Nocc) time

Many other things (like finding maximal unique matches for genome-wide alignments) are possible

(see Gusfield’s book for more details: some 100 pages are dedicated to describing what can be done with STs in the context of biology).


The Suffix Array I

Conceptually a stripped version of the suffix tree. It is defined as the sequence of the starting positions of the suffixes, taken in lexicographic order.

For instance:  T = (B,A,N,A,N,A,$)

   0 BANANA$      6 $BANANA
   1 ANANA$B      5 A$BANAN
   2 NANA$BA      3 ANA$BAN
   3 ANA$BAN  =>  1 ANANA$B
   4 NA$BANA      0 BANANA$
   5 A$BANAN      4 NA$BANA
   6 $BANANA      2 NANA$BA

Hence SA_BANANA$ = (6,5,3,1,0,4,2).

The suffix array is always a permutation of (0, … , n), but not all permutations are suffix arrays: if A is the alphabet size, eventually n! > A^n.
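The BANANA$ example can be reproduced with a deliberately naive construction in Python: sort the starting positions by the suffix that begins there. This sketch assumes that the terminator `$` sorts before every other letter (true for ASCII) and costs far more than the linear-time algorithms cited on the next slide; it is meant for checking small examples only.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin.

    Builds every suffix explicitly, so it is quadratic in memory traffic;
    fine for toy inputs, unsuitable for genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


print(suffix_array("BANANA$"))  # [6, 5, 3, 1, 0, 4, 2]
```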


The Suffix Array II

Manber and Myers (1991), Kärkkäinen and Sanders (2003).

Suffix arrays show nice algorithmic properties too:

They too can be built in linear time and linear space

Exact pattern matching is still fast and much simpler, being now reduced to a range query

Several other operations permitted by suffix trees can still be performed (but some, notably text reconstruction, cannot)

Various string manipulations are still possible if the data structure is augmented with additional information (like the lcp-intervals [“lcp” = longest common prefix], which allow one to simulate a bottom-up walk of the tree, Kasai)

Other operations can be simulated as well (Abouelhoda, Kurtz and Ohlebusch).
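The reduction of exact matching to a range query can be sketched as two binary searches over the suffix array. The Python fragment below is a minimal illustration that assumes the text itself is kept in memory next to the array; `sa_range` is a hypothetical name, and the construction is the naive one.

```python
def suffix_array(text):
    """Naive suffix array (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, pattern):
    """Return (lo, hi): rows lo..hi-1 of sa hold the suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first row whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                       # first row whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


text = "BANANA$"
sa = suffix_array(text)
lo, hi = sa_range(text, sa, "ANA")
print(hi - lo, sorted(sa[lo:hi]))  # 2 [1, 3]
```

The width of the range is the number of occurrences, and the entries of the array inside the range are their positions in the text.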


Succinct SDS

In recent years (Burrows & Wheeler 1994, Ferragina & Manzini 2001, Sadakane 2002, Grossi, Vitter, Navarro, Mäkinen, Kärkkäinen, etc.) people realized that one can store something much smaller:

Burrows & Wheeler: first compressible reversible transformation of the text, the BWT

Ferragina & Manzini: first practical index (the FM-index) based on the BWT

Sadakane: the index can be turned into a self-index, which allows the text to be reconstructed without storing the text itself.

A practical implementation might store samples of SA, SA^-1 and ranks at regular intervals.

The compression rate is usually bounded by the entropy of the text (similarly to the BWT).


The Burrows-Wheeler Transform I

Definition:  BWT[i] = T[SA[i] - 1].

For instance:  T = (B,A,N,A,N,A,$)

  0 BANANA$      6 $BANANA      5      4      1
  1 ANANA$B      5 A$BANAN      4      3      5
  2 NANA$BA      3 ANA$BAN      2      6      6
  3 ANA$BAN  =>  1 ANANA$B  =>  0  =>  2  =>  4
  4 NA$BANA      0 BANANA$      6      5      0
  5 A$BANAN      4 NA$BANA      3      1      2
  6 $BANANA      2 NANA$BA      1      0      3

(columns: cyclic rotations; sorted rotations with SA[i]; SA[i] - 1 modulo the text length; SA^-1[i]; LF[i])

Hence BWT_BANANA$ = (A,N,N,B,$,A,A).

“Cyclicize suffixes, take last column”.

The BWT is not by itself a compressed structure, but it can be efficiently compressed by run-length encoding (RLE): it tends to group together identical “contexts”, and hence identical letters tend to be adjacent.
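The definition BWT[i] = T[SA[i] - 1] translates almost literally into Python. The sketch below assumes a naive suffix array construction and a text ending with a unique terminator `$`; the run-length encoder is only there to show how runs of identical letters collapse, and is not the compression scheme of any particular tool.

```python
from itertools import groupby


def suffix_array(text):
    """Naive suffix array (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """BWT[i] = T[SA[i] - 1]; for SA[i] == 0 Python's text[-1] wraps to '$'."""
    return "".join(text[i - 1] for i in suffix_array(text))


def run_length_encode(s):
    """Collapse runs of identical characters into (character, length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


print(bwt("BANANA$"))                     # ANNB$AA
print(run_length_encode(bwt("BANANA$")))  # [('A', 1), ('N', 2), ('B', 1), ('$', 1), ('A', 2)]
```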


The Burrows-Wheeler Transform II

Of great importance in this context is the LF-function, defined as

    LF[i] = SA^-1[SA[i] - 1]
    LF_BANANA$ = (1,5,6,4,0,2,3)

(another permutation, but a “well-behaved” one with many “runs” of consecutive numbers: easy to compress). Sadakane and other authors use the function Ψ instead, which is in fact the inverse of LF, being defined as Ψ[i] = SA^-1[SA[i] + 1].

If we store the BWT and LF n (which can be sampled at one position out of every samplerate, at the arbitrarily small space cost of 2n/samplerate bits) we can avoid storing SA, SA^-1 and Ψ.

Depending on the implementation, each retrieval of SA, SA^-1 or Ψ will then be possible in O(samplerate) or O(A samplerate).
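One way to read the O(samplerate) retrieval cost is the following Python sketch: keep SA only at rows whose text position is a multiple of the sampling rate, and recover any other entry by walking LF until a sampled row is met. This is an assumed scheme for illustration, not necessarily the one used on the slide; here LF is stored as a plain list, whereas a real index would compute it from the BWT. All names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lf_mapping(sa):
    """LF[i] = SA^-1[SA[i] - 1], with the text position taken modulo n."""
    n = len(sa)
    inverse = [0] * n
    for row, pos in enumerate(sa):
        inverse[pos] = row
    return [inverse[(pos - 1) % n] for pos in sa]


def sample_sa(sa, samplerate):
    """Keep SA only for rows whose text position is a multiple of samplerate."""
    return {row: pos for row, pos in enumerate(sa) if pos % samplerate == 0}


def locate(row, lf, samples):
    """Recover SA[row]: each LF step moves one position back in the text."""
    steps = 0
    while row not in samples:
        row = lf[row]
        steps += 1
    return samples[row] + steps


sa = suffix_array("BANANA$")
lf = lf_mapping(sa)
samples = sample_sa(sa, 3)
print(lf)                                          # [1, 5, 6, 4, 0, 2, 3]
print([locate(r, lf, samples) for r in range(7)])  # [6, 5, 3, 1, 0, 4, 2]
```

Because text position 0 is always sampled, the walk never wraps around the start of the text, and it takes fewer than samplerate steps.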


The Burrows-Wheeler Transform III

The fundamental relation holds

    LF[i] = counts[BWT[i]] + occs[BWT[i], i]

where counts holds the cumulative number of times letters occur in the BWT:

    counts_BANANA$[(Ø,$,A,B,N)] = (0,1,4,5,7)

the entry used for a letter c being the one that counts the letters smaller than c (0 for $, 1 for A, 4 for B, 5 for N), and where occs[c, i] is the number of occurrences of c in the BWT before position i.

We can do very fast pattern matching by backward search (Ferragina and Manzini) at the price of two “LF-like” computations per step:

    lookup[P]:
      lo = 0, hi = n
      for (i = length[P]; i > 0; --i):
        lo = counts[P[i]] + occs[P[i], lo]
        hi = counts[P[i]] + occs[P[i], hi]
      return hi - lo.

This is the basic reason why the BWT is a very efficient method to store and query a text.


The Ferragina-Manzini index

A bit of Italian magic :) : the BWT information can be stored efficiently.

Wavelet trees are a way of encoding the BWT using n log2(A) bits. They are perfect binary trees, with nodes corresponding to progressively halved subsets of A.

   BANANA$
   0010101          0~[AB]  1~[N$]
   1000   001       0~[A] 1~[B]   0~[N] 1~[$]

Queries can be performed in a time proportional to log2(A) (rank queries can be done in constant time by sampling the rank at regular intervals, using small additional space).

The wavelet tree can be Huffman-compressed to take even less space. With m-ary wavelet trees no more dependence on A. This is the original formulation by Ferragina & Manzini.
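The BANANA$ wavelet tree can be reproduced with a small pointer-based Python class. This is an illustrative sketch under simplifying assumptions: the alphabet order is passed explicitly (here "ABN$", to match the split used in the example), bitmaps are plain lists, and ranks over each bitmap are stored in full instead of being sampled. The class name and layout are hypothetical.

```python
class WaveletTree:
    """Wavelet tree over a sequence, supporting rank(ch, i)."""

    def __init__(self, text, alphabet):
        self.alphabet = list(alphabet)
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) > 1:
            half = (len(self.alphabet) + 1) // 2
            low = self.alphabet[:half]
            self.bits = [0 if ch in low else 1 for ch in text]
            # prefix[i] = number of 1-bits in bits[:i]
            self.prefix = [0]
            for b in self.bits:
                self.prefix.append(self.prefix[-1] + b)
            self.left = WaveletTree([ch for ch in text if ch in low], low)
            self.right = WaveletTree([ch for ch in text if ch not in low],
                                     self.alphabet[half:])

    def rank(self, ch, i):
        """Number of occurrences of ch among the first i symbols."""
        if ch not in self.alphabet:
            return 0
        if self.left is None:          # leaf: every symbol here equals ch
            return i
        if ch in self.left.alphabet:   # follow the 0-bits
            return self.left.rank(ch, i - self.prefix[i])
        return self.right.rank(ch, self.prefix[i])  # follow the 1-bits


wt = WaveletTree("BANANA$", "ABN$")
print(wt.bits, wt.left.bits, wt.right.bits)
# [0, 0, 1, 0, 1, 0, 1] [1, 0, 0, 0] [0, 0, 1]
print(wt.rank("A", 7))  # 3
```

A rank query descends one level per halving of the alphabet, which is where the log2(A) factor comes from.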


FM-index in real life I: rank queries

A compressed SSDS/wavelet transform would be too slow. One needs some other structure to perform fast ranking queries.

In general, one can find very fast binary encodings, but they are space-consuming.

The main low-level bottleneck is memory access (cache misses)!

In general, complex speed/space tradeoff. Anyway, the FM-index is the best structure as far as space usage is concerned:

Hash tables: > 4 B/p

Suffix tree: ~15 B/p

Suffix array: 4 B/p @ 32 bits, 8 B/p @ 64 bits

Enhanced suffix array: 6.25 B/p @ 32 bits

FM-index: 0.5-2 B/p.


FM-index in real life II: mismatches

FM-index provides for very fast exact searches. How to accommodate substitutions/indels?

Any query with mismatches can be expressed in terms of a set of exact queries. For instance, having to align ACGT with at most 2 mismatches one might spawn a tree:

  (T,0) => (GT,0)        => (CGT,0)               => ...
           ([A|C|T]T,1)  => (C[AT|CT|TT],1)       => ...
                            ([A|G|T][AT|CT|TT],2) => ...
  ([A|C|G],1) => (G[A|C|G],1)       => ...
                 ([A|C|T][A|C|G],2) => (AC[A|C|T][A|C|G],2)
                                    => (CC[A|C|T][A|C|G],2)

If the number of mismatches is small, many branches will die. In the worst case, exponential, at least up to some depth.

How to prune the search tree is still an open research problem.
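The tree of exact queries can be explored by plain backtracking over the backward search. The Python sketch below handles substitutions only and prunes nothing except branches whose interval becomes empty, so it illustrates the exponential search space rather than any clever pruning strategy. The tables are stored uncompressed and all names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm(text):
    """Return (counts, occs) for a text ending in a unique '$'."""
    bwt = "".join(text[i - 1] for i in suffix_array(text))
    alphabet = sorted(set(text))
    counts, total = {}, 0
    for c in alphabet:                 # counts[c]: characters smaller than c
        counts[c] = total
        total += text.count(c)
    occs = {c: [0] for c in alphabet}  # occs[c][i]: occurrences of c in bwt[:i]
    for ch in bwt:
        for c in alphabet:
            occs[c].append(occs[c][-1] + (ch == c))
    return counts, occs


def count_with_mismatches(pattern, counts, occs, n, max_mismatches):
    """Count text positions matching pattern with at most max_mismatches substitutions."""
    letters = [c for c in counts if c != "$"]  # never match across the terminator

    def extend(i, lo, hi, budget):
        if lo >= hi:
            return 0                   # dead branch: nothing left in the text
        if i < 0:
            return hi - lo             # whole pattern consumed
        total = 0
        for c in letters:              # try the expected letter and the substitutions
            cost = 0 if c == pattern[i] else 1
            if cost <= budget:
                total += extend(i - 1,
                                counts[c] + occs[c][lo],
                                counts[c] + occs[c][hi],
                                budget - cost)
        return total

    return extend(len(pattern) - 1, 0, n, max_mismatches)


text = "ACGTACGAACCT$"
counts, occs = build_fm(text)
print([count_with_mismatches("ACGT", counts, occs, len(text), k) for k in range(3)])
# [1, 3, 3]
```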


Tricks of the trade

Or, how to simulate the speed your algorithm cannot provide

• The basic trick is to avoid exploring the whole, possibly huge, search space. The kosher way: pruning unnecessary search paths; the non-kosher way: pruning more or less arbitrary search paths.

The latter is bad, as for each read some of the possible matches might be lost: it becomes impossible to determine read uniqueness.

• Typical non-kosher tricks:

• providing only one/several more or less arbitrarily defined “best” match/es without accurate counts

(for instance, Bowtie in its infamous “stop-after-the-first-match” mode)

• seeding (makes more sense due to the typical error behavior of high-throughput reads, but will still miss matches).

• Possibly acceptable in our opinion is aligning only a suitably chosen subset of the reads (provided that such reads are clearly tagged in the output).


Each mapper is actually different

sometimes in very, very subtle ways


FM-index in real life III: GEM


The GEM (GEnome Multitool) project

Short-read mapping and more, 100% made-in-Spain (more precisely, in Catalunya)

• Initiated by Roderic Guigó (CRG, 2008) to power the first Spanish Illumina Genome Analyzer installed at the CRG.

• Architecture design and first 0.x implementation by Paolo Ribeca (CRG, 2008-2010; CNAG, 2010-2012) to offer several tools for short-read mapping (mapper, split mapper, mappability, etc.).

Based on a C/OCaml library entirely written from scratch, which implements original algorithms.

It is now the mapping workhorse at the CNAG.

• Optimized 1.x mapping by Santiago Marco (CNAG, 2010-2012)

with impressive improvements as to speed and features.

http://gemlibrary.sourceforge.net


The GEM project

The GEM (:= GEnome Multitool) library aims to be a complete toolbox for short read processing.

32/64 bit archives, DNA/generic text. Reasonably fast generation (~4 hrs for the human genome, 3 GB), no size limitation.

Based on BWT (Burrows-Wheeler Transform)/FM (Ferragina-Manzini) indexing. Independently started in June 2008, custom algorithms, no relation with programs like SOAP2 or bwa.

High-performance core C library (written from scratch) and higher-level OCaml, Python, and PERL programming interfaces. Several top-level programs (gem-mapper, gem-rna-mapper, etc.).

In use since fall 2008 at the CRG seq unit, now integrated in the CNAG pipeline, several published biological papers based on it.


The GEM “philosophy”

Recent proliferation of mapping tools (most “just because it is fashionable” or aimed at “speed at all cost”). Since the beginning, we tried to tackle the subject from a rather different angle:

Focus on quality and control rather than on “ultrafastness”

GEM should be an evolvable framework to study and develop new algorithms (think about 3rd generation machines!), not just a quick-and-dirty buggy piece of software to get a publication

...but in the end, we would also like to be as fast as possible (more and more data to be processed!).


The GEM philosophy

Give the user high-quality tools, and full control over them

• We do our best to provide you with bleeding-edge performance

thanks to our innovative algorithms.

• We do not take arbitrary decisions for you because we do not know the details of what you are doing.

• We do our best to provide you with versatile tools so that you can tune your analysis to produce the best possible biological results.

But ultimately, ensuring that this is the case is your responsibility! You should be a conscious user.


Salient features

provided by the upcoming GEM mapper version 1.x

• Support for many data types:

  • quality-aware (can map reads with low quality)
  • single-end and paired-end mapping
  • colorspace mapping (still some loose ends).

• Several distance definitions & mapping modes:
  • Hamming- and Levenshtein-distance (exhaustive + complete counts) + 1 big indel (not exhaustive)
  • the number of additional strata after the best one can be freely specified (variable-depth mapping)
  • fast stratum-preserving mapping modes (map-only-cheap-reads mode, map-only-unique-reads mode).

• The ideal/unique tool to, at the same time:
  • rigorously assess read uniqueness
  • perform high-precision studies (detect variations)
  • map quickly with a higher number of mismatches (4-6 @ 100 nt, 6-9 @ 150 nt, close species).


Benchmarks overview

for the upcoming GEM mapper 1.x (from S. Marco-Sola et al., Nat. Methods, in press)


Some things you can do with GEM

Mapping. The gem-mapper is fast, accurate (counts, multiple, exhaustive, quality-aware, over-resolution) and versatile (short/long reads, no assumptions).

Splice mapping (RNA). The gem-rna-mapper finds split junctions (even inter-chromosomal) in an unconstrained way.

Mappability. How many times does each k-mer in the genome appear in the genome? (ChIP-seq/RNA-seq normalization, structural studies, UCSC tracks).

Pipelines (gem-map-2-map) and conversions (gem-2-sam).

Development of new algorithms (a library: bindings to C and OCaml provided, Python coming soon). In perspective, the most important use.
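The mappability question asked above (how many times does each k-mer of the genome appear in the genome?) can be answered naively for a toy sequence with a counter. The Python sketch below is a simplification that considers exact matches on the forward strand only and holds every k-mer in memory; it is not the algorithm of the GEM mappability tool, and the function name is hypothetical.

```python
from collections import Counter


def kmer_frequencies(genome, k):
    """For each start position, how many times its k-mer occurs in the genome."""
    n_windows = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n_windows))
    return [counts[genome[i:i + k]] for i in range(n_windows)]


print(kmer_frequencies("ACGTACGA", 3))  # [2, 1, 1, 1, 2, 1]
```

Positions with frequency 1 are those to which a read of length k can be assigned unambiguously; the reciprocal of the frequency is one common way to turn these counts into a score between 0 and 1.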


A fairly complete framework

RGASP GEM results (UCSC custom track in SAM format).


Thank you!

Simon Heath

Micha Sammeth

Leonor Frias

Ivo Gut

Tyler Alioto

...and many others.