
Algorithms for high-quality mapping of NGS reads

Paolo Ribeca

Algorithm Development

Centro Nacional de Análisis Genómico, Barcelona

Bioinformatics for Omics Sciences, Napoli, 26.09.2012

Our HTS setup

The Spanish National Sequencing Center

Core funding 2010-2012:

15 M€ from national government

15 M€ from Catalan government

12 Illumina HiSeq 2000, 2 GAIIx, 1 MiSeq

~1 Tb produced per day

Our dedicated cluster has 1200 cores

2.5 PB of dedicated storage

About 50 people (100 in the future), half of them bioinformaticians.

ICGC project, plus many others.

A problem of data & software scaling

Throughput constantly improves much more than computing power

• The first Solexa Genome Analyzer @ CRG, 2008: 3 Gb / run

and at the beginning (2010), the CNAG used to have 12 GA IIx: 30 Gb / run each.

• Current Illumina HiSeq 2000 machines produce > 500 Gb / run

after some relatively minor hardware upgrade, which gave a 3x boost.

The CNAG has 12 of them.

• The CNAG has a peak sequencing capacity of ~ 1 Tb / day... while the one originally planned in 2010 was 50-100 Gb / day.

• ...but still “only” 1000 processors in its cluster

which is the amount originally intended to process 50-100 Gb / day.

• Need for more optimized algorithms!

But what is this HTS anyway?

Actually, a large body of completely different technologies.

Single-molecule sequencing, or cluster-based (amplification-dependent) sequencing.

Sequences can be read optically (fluorescence, laser) or electrically.

Only one end of the molecule is sequenced (single-end, SE) or both (paired-end, PE). Or even strobed sequencing, with 4 or more chunks.

Sequencing errors can be very different (homopolymers or substitutions).

Typical platforms are:

Illumina/Solexa

ABI/SOLiD

Roche/454.

Practical problems

Having sequencers ≠ producing good scientific data!

Many small research groups buy sequencers just to find out that they produce a flood of meaningless data.

Keeping the machines idle is very expensive.

At least two places where expertise is essential:

Sample preparation/Protocols

Downstream analysis protocols.

A horde of different biological protocols for data production are commonly used: DNA-seq, RNA-seq, ChIP-seq, methylation/epigenetic studies, etc.

For each one, a different analysis protocol is required!

Protocols are platform-dependent!

Still a long way to go before good protocols are established.

What can be done with HTS

DNA (whole-genome, targeted, exome) resequencing: YES

SNP calling: YES

Variant calling: SOMETIMES

DNA shotgun sequencing & de novo assembly: NO

(with fosmid pooled sequencing: SOMETIMES/YES)

RNA quantification:

Gene-level: SOMETIMES/YES

Isoform-level: NO

… and so on.

What should be the job of an Algorithm Development group?

Practical computational problems

Two main categories of problems:

Technological problems deriving from short reads:

Mapping (that is, HT alignment)

Assembly

Most of the problems are platform-dependent!

Problems deriving from the huge amount of data (storage!).

Some of the technological problems will be the subject of my talks.

Size-related problems are trickier: sequencing vs. storage.

Although highly correlated from cell to cell, the information content of a human body is huge (6 Gbit/cell)

Clonal subpopulations. Microbiome.

Is each biological sample really unique?

HT Alignment

General setup

Resequencing problem. The general assumption is that (a reasonable approximation of) the reference genome (for instance, hg19) is known.

This is often not the case (new species, metagenomics).

When this is the case, many interesting biological situations:

DNA/exome resequencing (SNPs & structural variations): domestication, agronomy, cancer, genetic diseases/disorders, genotyping, personalized medicine

RNA, expression quantification

Epigenomics/chromatin structure & regulation (ChIP-seq, histone modifications, micrococcal assays for nucleosomes)

Some technical issues: mappability.

The problem of short read alignment

Alignment to a reference in a HTS setup:

Many short reads (billions) to be matched against...

[850 GB produced per day, 100-150 nt PE reads: 3-4 Gr/day]

Very long references (many GB), with...

Several mismatches (typically one every 25 nt), and...

Problem-dependent alignment! RNA requires spliced alignment, bisulfite requires a modified reference, etc.

With those specifications, “old” tools like BLAST do not offer adequate performance anymore.

With those specifications, new tools might not offer adequate accuracy anymore (many indels, RNA editing, etc.).

Speed is not the only important thing!

Often, there is a trade-off between speed and quality.

Is that really all there is to it?

Many additional requirements:

Qualities

Paired-end/Mate-pair

Bisulfite

Realignment

Platform-dependent problems!

Homopolymers (Roche/454)

Colorspace (ABI/SOLiD, Helicos)

Quartets (Complete Genomics)

Very long reads with very many indels (Pacific Biosciences)

…

Storage/pipeline/data manipulation related problems.

Qualities:
@EAS131_1_FC30C5KAAXX/1
GAGTTTCCTCCTGCAGATGTGAACTGTGTAAATAGTCAGAACTGATCGA
+EAS131_1_FC30C5KAAXX/1
aabaabbaaaaaaaaaaaaaaaabbaaaaaabaaaZaaaaaaUKZEUaZ

Paired-end/Mate-pair: ACGTTTTCAGACAGAACGATACTAGATCA

Bisulfite: unmethylated C => U, that is, T
ACGTTCTTCAGAC => ACGTTTTTCAGAT or ATCTGAAAAACGT

Realignment:
ACGTTTTCGATGATAGAT        ACGTTTTCGATGATAGAT
GTTTTCGTGATAGAT      =>   GTTTTCGTGATAGAT
TTTCGTGATAGAT             TTTCGTGATAGAT

Colorspace:
    A C G T
A   0 1 2 3
C   1 0 3 2
G   2 3 0 1
T   3 2 1 0
AGTCTATCTC => A212233222
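The colorspace table above can be turned into a few lines of code. This is a minimal sketch: the function names `to_colorspace` and `from_colorspace` are made up for illustration and do not come from any particular tool.

```python
# Color assigned to each pair of adjacent bases, copied from the table above.
COLOR = {
    "A": {"A": 0, "C": 1, "G": 2, "T": 3},
    "C": {"A": 1, "C": 0, "G": 3, "T": 2},
    "G": {"A": 2, "C": 3, "G": 0, "T": 1},
    "T": {"A": 3, "C": 2, "G": 1, "T": 0},
}


def to_colorspace(seq):
    """Keep the first base, then emit one color per pair of adjacent bases."""
    colors = [str(COLOR[a][b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)


def from_colorspace(encoded):
    """Decode by walking the table row of the previously decoded base."""
    bases = [encoded[0]]
    for digit in encoded[1:]:
        row = COLOR[bases[-1]]
        bases.append(next(b for b, c in row.items() if c == int(digit)))
    return "".join(bases)


print(to_colorspace("AGTCTATCTC"))
```

Note that decoding depends on every previous color, so a single wrong color corrupts all the bases after it. This is one reason why colorspace reads need dedicated mapping support.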

Implementor's and user's standpoints

Several possible indexing techniques, with flames in the literature and claims of being “superior”. Typically used:

Hash tables
PROS: Searches with many errors are easier
CONS: Slower, error rate is predetermined, bulky

Some data structure like the Burrows-Wheeler Transform
PROS: Very fast exact searches, no hard-coded parameters
CONS: Searches with errors are more difficult.

What to index?

Pre-indexing the genome and scanning the reads

Pre-indexing the reads and scanning the genome

Pre-indexing both, and doing merge-sort.

From the point of view of the user, all of that is quite irrelevant. What matters are the requirements of the problem (number of mismatches/precision needed) and the quality of the results obtained.

User's Standpoint

The short-read mapping problem

Not as simple as just downloading software and running it on your data

• There is no such simple thing as “finding good matches”, since there are many possible, more or less accepted, ways of defining what should be considered “good”.

• Ideally, “good” matches are defined in terms of some string distance, which allows one to compare the query and the match and decide how “good” the match is...

• ...whose definition might be very, very complicated (think about “classical” biological alignment à la BLAST).

While mapping, however, simpler definitions are usually accepted, typically:

• Hamming distance (the number of mismatches)

• Levenshtein distance (the number of edit operations; modified definitions are common, though).

In general, the need for fast mapping usually implies more relaxed accuracy requirements (sometimes justified).
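To make the two distances concrete, here is a minimal sketch of both. It uses the textbook definitions (unit cost for every substitution, insertion and deletion); as noted above, real mappers often use modified definitions.

```python
def hamming(a, b):
    """Number of mismatching positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


print(hamming("ACGT", "ACCT"))
print(levenshtein("GTTTTCGATGATAGAT", "GTTTTCGTGATAGAT"))
```

The second call compares two sequences that differ by a single deleted base: Levenshtein distance sees one edit, while Hamming distance is not even defined because the lengths differ.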

The “iceberg model” for mappings

The matches you find are just the tip of the iceberg

• In general, given a larger and larger distance, one will find more and more matches, eventually being able to align to all the locations in the reference.

This is akin to a submerged iceberg: the tip is the query, and the part above the surface is the set of results of the search.

The more permissive the parameters of the search, the larger the part of the iceberg sticking out of the water.

• A “stratum” is a set of matches having the same distance from the query, that is, a layer of the iceberg made of matches which all have the same distance from its tip.

One can output only 1 stratum (just the best matches), 2 strata (optimal + best suboptimal), and so on.

Most mappers can access only the best stratum. This influences the definition of uniqueness.

• The number of matches in successive strata grows depending on the read, for instance depending on its degree of repetitiveness. This is portrayed by the “slant” of the iceberg tip.
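The notion of strata can be illustrated with a brute-force sketch that scans a toy reference and groups match positions by Hamming distance. The function name `strata` and the toy sequences are assumptions made for the example; a real mapper would use an index instead of scanning.

```python
from collections import defaultdict


def strata(read, reference, max_distance):
    """Map each Hamming distance <= max_distance to the list of match positions."""
    layers = defaultdict(list)
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        distance = sum(x != y for x, y in zip(read, window))
        if distance <= max_distance:
            layers[distance].append(pos)
    return dict(sorted(layers.items()))


reference = "ACGTACGAACGT"
for limit in range(5):
    print(limit, strata("ACGT", reference, limit))
```

Raising `max_distance` exposes more and more of the iceberg: at the largest value every position of the reference becomes a match. A read is "unique" at a given depth only if the strata explored so far contain exactly one match.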

Implementor's Standpoint

Hash tables I

Basic idea: one can map strings to numbers via a hashing function H. One example is md5sum. My file becomes 28831fa678f075862b9da748802fcba7.

Finding good (clash-free) hashes for big strings is difficult.

Here we have small strings... Simple solution: just encode the string.

For instance, C G T A C G A T A C G G G G => 1 2 3 0 1 2 0 3 0 1 2 2 2 2 => 0110110001100011000110101010

We can take advantage of bit-parallelism tricks to do things faster (k=32 bases could fit into a single 64-bit word).
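The encoding above (A=0, C=1, G=2, T=3, two bits per base) can be sketched as follows. The names `encode` and `to_bits` are illustrative only.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a DNA string into an integer, two bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def to_bits(kmer):
    """Binary representation of the encoding, padded to two digits per base."""
    return format(encode(kmer), "0{}b".format(2 * len(kmer)))


print(to_bits("CGTACGATACGGGG"))
```

With this packing, any 32-mer fits below 2**64, which is what makes single-word comparisons and bit-parallel tricks possible in a lower-level language.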

We install all the reads in a hash table of size S: read R goes to a bucket corresponding to position H[R] mod S (S being a “small” number like 10,000,000). We scan, say, the genome (sliding window of 32) checking the hash for the k-mer appearing at each position.
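A toy version of this scheme, assuming exact matches only, reads that all have length k, and the 2-bit encoding as the hashing function H. The table size and sequences below are made up for the example.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """2-bit encoding used here as the hashing function H."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def build_table(reads, size):
    """Install each read in bucket H[read] mod size."""
    table = [[] for _ in range(size)]
    for read in reads:
        table[encode(read) % size].append(read)
    return table


def scan(genome, table, k):
    """Slide a window of length k over the genome and look each k-mer up."""
    hits = []
    for pos in range(len(genome) - k + 1):
        kmer = genome[pos:pos + k]
        for read in table[encode(kmer) % len(table)]:
            if read == kmer:  # different k-mers may share a bucket, so verify
                hits.append((read, pos))
    return hits


table = build_table(["ACGT", "GTTT", "AAAA"], size=7)
print(scan("ACGTACGTTT", table, k=4))
```

The explicit comparison inside the bucket is needed because the modulo makes unrelated k-mers collide.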

Hash tables II

How to accommodate mismatches? Spaced seeds/filters.

For instance:

111111111111100000000000000000000000000011111111111110000000000000000000000000000000001111111111111111111100000011111100000000000000

(1 = match, 0 = don't care) can detect up to two mismatches for reads of length up to 33.

Indels are more complicated.
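The masks on the slide cannot be reproduced here, so the sketch below uses a deliberately simpler family of filters based on the pigeonhole principle: if a read is cut into three blocks, any alignment with at most two mismatches leaves at least one block untouched. The helper names and the 9-nt toy reads are assumptions made for the example.

```python
def block_masks(length, blocks):
    """One mask per block: 1 (must match) over the block, 0 (don't care) elsewhere."""
    bounds = [length * b // blocks for b in range(blocks + 1)]
    return ["0" * lo + "1" * (hi - lo) + "0" * (length - hi)
            for lo, hi in zip(bounds, bounds[1:])]


def passes_filter(read, window, masks):
    """True if the read agrees with the window on every 1-position of some mask."""
    return any(
        all(m == "0" or r == w for m, r, w in zip(mask, read, window))
        for mask in masks
    )


masks = block_masks(9, 3)
print(masks)
print(passes_filter("ACGTACGTA", "ACGTTCGTC", masks))  # two mismatches
print(passes_filter("ACGTACGTA", "TCGTTCGTC", masks))  # three mismatches, one per block
```

Each mask needs its own hash table, which is why the slide calls the approach bulky. The filter only selects candidates: windows that pass still have to be verified against the full read.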

Major problems:

k difficult to change

the hash tables (mismatches => many of them) are bulky.

The scheme can be extended to index both genome and queries: merge-sort of the encoded k-mers & positions.

String data structures

Data structures for string manipulation (SDS) are becoming more and more essential: Internet! (and biology).

For several reasons, they never really entered the mainstream of computer science before the invention of compressed SDS.

For instance, suffix trees were invented in 1973, but it took them >20 years to become widely known.

Structures relevant to this talk:

Suffix trees (the “mother” of all structures: virtually any kind of fast string operation after a slow preprocessing)

Suffix arrays (simpler and faster than suffix trees, roughly equivalent with the help of some auxiliary data structure)

Compressible structures based on text transforms (Burrows-Wheeler transform and Ferragina-Manzini index).

The Suffix Tree I

The main idea is always that of creating a list of all suffixes/prefixes present in the string.

(1) Sort lexicographically (2) Identify common blocks (3) Build tree (Nodes must have degree ≥ 2; we are omitting suffix links, needed to stream text).

BANANA$, 0
ANANA$, 1
NANA$, 2
ANA$, 3
NA$, 4
A$, 5
$, 6

The Suffix Tree II

Weiner (1973), McCreight (1976), Ukkonen (1995), Gusfield.

Suffix trees show nice algorithmic properties (after a long preprocessing phase, OK for genomes):

They can be built in linear time

Text can be reconstructed if we include suffix links

Pattern matching on P with k mismatches can be done in time O(k length(P) + Nocc) (using matching statistics)

(Super)maximal repeats can be found in O(n + Nocc) time

Many other things (like finding maximal unique matches for genome-wide alignments) are possible

(see Gusfield’s book for more details; some 100 pages are dedicated to describing what can be done with STs in the context of biology).

The Suffix Array I

Conceptually a stripped-down version of the suffix tree. It is defined as the sequence of the starting positions of the lexicographically sorted suffixes.

For instance: T = (B,A,N,A,N,A,$)

0 BANANA$        6 $BANANA
1 ANANA$B        5 A$BANAN
2 NANA$BA        3 ANA$BAN
3 ANA$BAN   =>   1 ANANA$B
4 NA$BANA        0 BANANA$
5 A$BANAN        4 NA$BANA
6 $BANANA        2 NANA$BA

Hence SA_BANANA$ = (6,5,3,1,0,4,2).

The suffix array is always a permutation of (0, … , n), but not all permutations are suffix arrays: if A is the alphabet size, eventually n! > A^n.
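The definition can be checked directly with a naive construction that simply sorts the suffixes. This sketch takes far more than linear time and is only meant for toy inputs; it also relies on "$" sorting before the letters, which holds for ASCII.

```python
def suffix_array(text):
    """Starting positions of the suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "BANANA$"
for position in suffix_array(text):
    print(position, text[position:])
```

The printed positions reproduce SA_BANANA$ from the slide. Production tools use linear-time constructions instead of sorting whole suffixes.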

The Suffix Array II

Manber and Myers (1991), Kärkkäinen and Sanders (2003).

Suffix arrays show nice algorithmic properties too:

They too can be built in linear time and linear space

Exact pattern matching is still fast and much simpler, being now reduced to a range query

Several other operations permitted by suffix trees can still be performed (but some, notably text reconstruction, cannot)

Various string manipulations are still possible if the data structure is augmented with additional information (like the lcp-intervals [“lcp” = longest common prefix], which allow one to simulate a bottom-up walk of the tree, Kasai)

Other operations can be simulated as well (Abouelhoda, Kurtz and Ohlebusch).
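Exact matching as a range query can be sketched with two binary searches over the suffix array: all suffixes starting with the pattern are contiguous in sorted order. The function names are illustrative, and the suffix array is again built naively.

```python
def suffix_array(text):
    """Naive construction, adequate only for toy inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, pattern):
    """Half-open interval [start, end) of suffix-array rows prefixed by pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first row whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first row whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def occurrences(text, sa, pattern):
    """Sorted text positions at which pattern occurs."""
    start, end = sa_range(text, sa, pattern)
    return sorted(sa[start:end])


text = "BANANA$"
sa = suffix_array(text)
print(sa_range(text, sa, "ANA"), occurrences(text, sa, "ANA"))
```

The width of the interval is the number of occurrences, and the suffix-array entries inside it are their positions.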

Succinct SDS

In recent years (Burrows & Wheeler 1994, Ferragina & Manzini 2001, Sadakane 2002, Grossi, Vitter, Navarro, Mäkinen, Kärkkäinen, etc.) people realized that one can store something much smaller:

Burrows & Wheeler: first compressible reversible transformation of the text, the BWT

Ferragina & Manzini: first practical index (the FM-index) based on the BWT

Sadakane: the index can be turned into a self-index, which allows one to reconstruct the text without storing the text itself.

A practical implementation might store samples of SA, SA^-1 and ranks at regular intervals.

The compression rate is usually bounded by the entropy of the text (similarly to the BWT).

The Burrows-Wheeler Transform I

Definition: BWT[i] = T[SA[i] - 1].

For instance: T = (B,A,N,A,N,A,$)

i  rotation       SA[i]  sorted rotation      SA[i]-1     SA^-1[i]     LF[i]
0  BANANA$        6      $BANANA              5           4            1
1  ANANA$B        5      A$BANAN              4           3            5
2  NANA$BA        3      ANA$BAN              2           6            6
3  ANA$BAN   =>   1      ANANA$B         =>   0      =>   2       =>   4
4  NA$BANA        0      BANANA$              6           5            0
5  A$BANAN        4      NA$BANA              3           1            2
6  $BANANA        2      NANA$BA              1           0            3

Hence BWT_BANANA$ = (A,N,N,B,$,A,A).

“Cyclicize suffixes, take last column”.

The BWT is not by itself a compressed structure, but it can be efficiently compressed by run-length encoding (RLE): it tends to group together identical “contexts”, and hence identical letters tend to be adjacent.
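A minimal sketch of the definition BWT[i] = T[SA[i] - 1], followed by run-length encoding. The suffix array is built naively, and the index -1 wraps around to the terminator, which is exactly the cyclic behaviour the definition needs.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform of a text ending with the terminator '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # For SA[i] == 0 the index -1 wraps around to the last character, '$'.
    return "".join(text[i - 1] for i in sa)


def run_length_encode(s):
    """Collapse each run of identical letters into a (letter, length) pair."""
    return [(letter, len(list(run))) for letter, run in groupby(s)]


transformed = bwt("BANANA$")
print(transformed)
print(run_length_encode(transformed))
```

On a text this short the gain is negligible; the grouping effect becomes visible on long, repetitive texts such as genomes.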

The Burrows-Wheeler Transform II

Of great importance in this context is the LF-function, defined as LF[i] = SA^-1[SA[i] - 1]

LF_BANANA$ = (1,5,6,4,0,2,3)

(another permutation, but a “well-behaved” one with many “runs” of consecutive numbers: easy to compress).

Sadakane and other authors use the function Ψ instead, which in fact is the inverse of LF, being defined as Ψ[i] = SA^-1[SA[i] + 1].

If we store the BWT and LF n (which can be sampled at one position out of every samplerate, at the arbitrarily small space cost of 2n/samplerate bits) we can avoid storing SA, SA^-1 and Ψ:

Depending on the implementation, each retrieval of SA, SA^-1 or Ψ will then be possible in O(samplerate) or O(A samplerate).

The Burrows-Wheeler Transform III

The fundamental relation holds:

LF[i] = counts[BWT[i]] + occs[BWT[i], i]

where occs[c, i] is the number of occurrences of c in BWT[0..i-1], and counts[c] is the number of letters in the text that are smaller than c. It is obtained by accumulating the number of times each letter occurs in the BWT: the cumulative totals for BANANA$ over (Ø,$,A,B,N) are (0,1,4,5,7), so that counts[$] = 0, counts[A] = 1, counts[B] = 4, counts[N] = 5.

We can do very fast pattern matching by backward search (Ferragina and Manzini) at the price of two “LF-like” computations per step:

lookup[P]:
    lo = 0, hi = n
    for (i = length[P]; i > 0; i--):
        lo = counts[P[i]] + occs[P[i], lo]
        hi = counts[P[i]] + occs[P[i], hi]
    return hi - lo.

This is the basic reason why the BWT is a very efficient method to store and query a text.
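The lookup procedure can be made runnable as follows. This is a sketch under simplifying assumptions: the index is built naively, `occs` is stored as a full table (one row per BWT prefix) rather than sampled or compressed, and `counts[c]` is taken to be the number of letters smaller than c.

```python
def fm_index(text):
    """Return (bwt, counts, occs) for a text ending with the terminator '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    counts, total = {}, 0
    for c in alphabet:
        counts[c] = total  # number of letters in the text smaller than c
        total += text.count(c)
    # occs[c][i] = number of occurrences of c in bwt[:i]
    occs = {c: [0] for c in alphabet}
    for letter in bwt:
        for c in alphabet:
            occs[c].append(occs[c][-1] + (letter == c))
    return bwt, counts, occs


def lookup(pattern, counts, occs, n):
    """Count the occurrences of pattern by backward search."""
    lo, hi = 0, n
    for letter in reversed(pattern):
        if letter not in counts:
            return 0
        lo = counts[letter] + occs[letter][lo]
        hi = counts[letter] + occs[letter][hi]
        if lo >= hi:  # empty interval: the pattern does not occur
            return 0
    return hi - lo


bwt, counts, occs = fm_index("BANANA$")
print([counts[c] + occs[c][i] for i, c in enumerate(bwt)])  # the LF function
print(lookup("ANA", counts, occs, len(bwt)))
```

The first line printed is LF computed through the fundamental relation, which can be compared with LF_BANANA$ above. The search never touches the original text, only `counts` and `occs`.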

The Ferragina-Manzini index

A bit of Italian magic :) : the BWT information can be stored efficiently.

Wavelet trees are a way of encoding the BWT using n log2 A bits. They are perfect binary trees, with nodes corresponding to progressively halved subsets of A.

BANANA$
0010101      0~[AB] 1~[N$]
1000 001     0~[A] 1~[B]    0~[N] 1~[$]

Queries can be performed in a time proportional to log2 A (rank queries can be done in constant time by sampling the rank at regular intervals, using small additional space).

The wavelet tree can be Huffman-compressed to take even less space. With m-ary wavelet trees there is no more dependence on A. This is the original formulation by Ferragina & Manzini.
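A pointer-based toy wavelet tree reproducing the BANANA$ example might look like this. The class name and the linear-time bit counting are simplifications made for illustration: a real implementation stores the bit vectors compactly and samples the ranks, so that each level is answered in constant time.

```python
class WaveletTree:
    """Toy wavelet tree over an explicitly ordered alphabet."""

    def __init__(self, seq, alphabet):
        self.alphabet = list(alphabet)
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) > 1:
            half = len(self.alphabet) // 2
            self.zeros = set(self.alphabet[:half])
            # Bit 0: symbol in the first half of the alphabet; bit 1: second half.
            self.bits = [0 if c in self.zeros else 1 for c in seq]
            self.left = WaveletTree(
                [c for c in seq if c in self.zeros], self.alphabet[:half])
            self.right = WaveletTree(
                [c for c in seq if c not in self.zeros], self.alphabet[half:])

    def rank(self, symbol, i):
        """Number of occurrences of symbol among the first i symbols."""
        if self.left is None:
            return i if self.alphabet == [symbol] else 0
        bit = 0 if symbol in self.zeros else 1
        # Linear scan for clarity; sampled ranks would make this O(1).
        count = sum(1 for b in self.bits[:i] if b == bit)
        child = self.left if bit == 0 else self.right
        return child.rank(symbol, count)


tree = WaveletTree("BANANA$", "ABN$")
print(tree.bits, tree.left.bits, tree.right.bits)
print(tree.rank("A", 6))
```

The alphabet order "ABN$" is chosen to match the split shown on the slide. The `rank` method is what provides the `occs` values needed by backward search, one tree level (that is, one bit of the symbol code) at a time.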

FM-index in real life I: rank queries

A compressed SSDS/wavelet tree would be too slow. One needs some other structure to perform fast ranking queries.

In general, one can find very fast binary encodings, but they are space-consuming.

The main low-level bottleneck is memory access (cache misses)!

In general, complex speed/space trade-off. Anyway, the FM-index is the best structure in terms of space usage:

Hash tables: > 4 B/p

Suffix tree: ~15 B/p

Suffix array: 4 B/p @ 32 bits, 8 B/p @ 64 bits

Enhanced suffix array: 6.25 B/p @ 32 bits

FM-index: 0.5-2 B/p.

FM-index in real life II: mismatches

The FM-index provides very fast exact searches. How to accommodate substitutions/indels?

Any query with mismatches can be expressed in terms of a set of exact queries. For instance, having to align ACGT with at most 2 mismatches one might spawn a tree:

(T,0) => (GT,0) => (CGT,0) => ...
      => ([A|C|T]T,1) => (C[AT|CT|TT],1) => ...
                      => ([A|G|T][AT|CT|TT],2) => ...
([A|C|G],1) => (G[A|C|G],1) => ...
            => ([A|C|T][A|C|G],2) => (AC[A|C|T][A|C|G],2)
                                  => (CC[A|C|T][A|C|G],2)

If the number of mismatches is small, many branches will die. In the worst case, exponential, at least up to some depth.

How to prune the search tree is still an open research problem.
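The search tree above can be sketched as a recursive backward search that tries every letter at each step, charges one mismatch when the letter differs from the pattern, and abandons a branch as soon as the budget is exceeded or the interval becomes empty. This is a plain backtracking illustration with substitutions only, not the algorithm used by any specific mapper. The naive index and the toy reference are assumptions.

```python
def build_index(text):
    """Naive FM-index of a text ending with the terminator '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    counts, total = {}, 0
    for c in alphabet:
        counts[c] = total
        total += text.count(c)
    occs = {c: [0] for c in alphabet}
    for letter in bwt:
        for c in alphabet:
            occs[c].append(occs[c][-1] + (letter == c))
    return sa, counts, occs


def search(pattern, max_mismatches, index):
    """All (position, mismatches) pairs with at most max_mismatches substitutions."""
    sa, counts, occs = index
    letters = [c for c in counts if c != "$"]  # never match across the terminator
    hits = []

    def extend(i, lo, hi, mismatches):
        if i < 0:  # whole pattern consumed: report every row of the interval
            hits.extend((sa[row], mismatches) for row in range(lo, hi))
            return
        for c in letters:
            cost = mismatches + (c != pattern[i])
            if cost > max_mismatches:
                continue  # prune: mismatch budget exceeded
            new_lo = counts[c] + occs[c][lo]
            new_hi = counts[c] + occs[c][hi]
            if new_lo < new_hi:  # the branch dies when the interval is empty
                extend(i - 1, new_lo, new_hi, cost)

    extend(len(pattern) - 1, 0, len(sa), 0)
    return sorted(hits)


index = build_index("ACGTACGAACGT$")
print(search("ACGT", 1, index))
```

Because every surviving branch is explored, the result is exhaustive and can be split into strata by the reported number of mismatches. The cost is the potentially exponential branching mentioned above.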

Tricks of the trade

Or, how to simulate the speed your algorithm cannot provide

• The basic trick is to avoid exploring the whole, possibly huge, search space. The kosher way: pruning unnecessary search paths; the non-kosher way: pruning more or less arbitrary search paths.

The latter is bad, as for each read some of the possible matches might be lost: it becomes impossible to determine read uniqueness.

• Typical non-kosher tricks:

• providing only one/several more or less arbitrarily defined “best” match(es) without accurate counts

(for instance, Bowtie in its infamous “stop-after-the-first-match” mode)

• seeding (makes more sense due to the typical error behavior of high-throughput reads, but will still miss matches).

• Possibly acceptable in our opinion is aligning only a suitably chosen subset of the reads (provided that such reads are clearly tagged in the output).

Each mapper is actually different

sometimes in very, very subtle ways

FM-index in real life III: GEM

The GEM (GEnome Multitool) project

Short-read mapping and more, 100% made-in-Spain (more precisely, in Catalunya)

• Initiated by Roderic Guigó (CRG, 2008) to power the first Spanish Illumina Genome Analyzer installed at the CRG.

• Architecture design and first 0.x implementation by Paolo Ribeca (CRG, 2008-2010; CNAG, 2010-2012) to offer several tools for short-read mapping (mapper, split mapper, mappability, etc.).

Based on a C/OCaml library entirely written from scratch, which implements original algorithms.

It is now the mapping workhorse at the CNAG.

• Optimized 1.x mapping by Santiago Marco (CNAG, 2010-2012)

with impressive improvements as to speed and features.

http://gemlibrary.sourceforge.net

The GEM project

The GEM (:= GEnome Multitool) library aims to be a complete toolbox for short read processing.

32/64 bit archives, DNA/generic text. Reasonably fast generation (~4 hrs for the human genome, 3 GB), no size limitation.

Based on BWT (Burrows-Wheeler Transform)/FM (Ferragina-Manzini) indexing. Independently started in June 2008, custom algorithms, no relation with programs like SOAP2 or bwa.

High-performance core C library (written from scratch) and higher-level OCaml, Python, and Perl programming interfaces. Several top-level programs (gem-mapper, gem-rna-mapper, etc.).

In use since fall 2008 at the CRG seq unit, now integrated in the CNAG pipeline, several published biological papers based on it.

The GEM “philosophy”

Recent proliferation of mapping tools (most “just because it is fashionable” or aimed at “speed at all cost”). Since the beginning, we tried to tackle the subject from a rather different angle:

Focus on quality and control rather than on “ultra-fastness”

GEM should be an evolvable framework to study and develop new algorithms (think about 3rd generation machines!), not just a quick-and-dirty buggy piece of software to get a publication

...but in the end, we would also like to be as fast as possible (more and more data to be processed!).

The GEM philosophy

Give the user high-quality tools, and full control over them

• We do our best to provide you with bleeding-edge performance

thanks to our innovative algorithms.

• We do not take arbitrary decisions for you because we do not know the details of what you are doing.

• We do our best to provide you with versatile tools so that you can tune your analysis to produce the best possible biological results.

But ultimately, ensuring that this is the case is your responsibility! You should be a conscious user.

Salient features

provided by the upcoming GEM mapper version 1.x

• Support for many data types:

• quality-aware (can map reads with low quality)

• single-end and paired-end mapping

• colorspace mapping (still some loose ends).

• Several distance definitions & mapping modes:

• Hamming- and Levenshtein-distance (exhaustive + complete counts) + 1 big indel (not exhaustive)

• the number of additional strata after the best one can be freely specified (variable-depth mapping)

• fast stratum-preserving mapping modes (map-only-cheap-reads mode, map-only-unique-reads mode).

• The ideal/unique tool to, at the same time:

• rigorously assess read uniqueness

• perform high-precision studies (detect variations)

• map quickly with a higher number of mismatches (4-6 @ 100 nt, 6-9 @ 150 nt, close species).

Benchmarks overview

for the upcoming GEM mapper 1.x (from S. Marco-Sola et al., Nat. Methods, in press)

Some things you can do with GEM

Mapping. The gem-mapper is fast, accurate (counts, multiple, exhaustive, quality-aware, over-resolution) and versatile (short/long reads, no assumptions).

Splice mapping (RNA). The gem-rna-mapper finds split junctions (even interchromosomal) in an unconstrained way.

Mappability. How many times does each k-mer in the genome appear in the genome? (ChIP-seq/RNA-seq normalization, structural studies, UCSC tracks).

Pipelines (gem-map-2-map) and conversions (gem-2-sam).

Development of new algorithms (a library: bindings to C and OCaml provided, Python coming soon). In perspective, the most important use.
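The mappability question asked above ("how many times does each k-mer in the genome appear in the genome?") can be illustrated with a brute-force count. This sketch is not the GEM implementation: it assumes exact matches on the forward strand only, and the function name is made up for the example.

```python
from collections import Counter


def kmer_frequencies(genome, k):
    """For each start position, the number of times its k-mer occurs in the genome."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] for kmer in kmers]


print(kmer_frequencies("ACGTACGTTT", 4))
```

Positions with frequency 1 are those where a read of length k can be placed unambiguously; higher values flag repetitive regions.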

RGASP GEM results (UCSC custom track in SAM format).

A fairly complete framework

Thank you!

Simon Heath

Micha Sammeth

Leonor Frias

Ivo Gut

Tyler Alioto

...and many others.
