Next Generation DNA Sequencing IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.

Next Generation DNA Sequencing

IPM-NUS Workshop on Computational Biology

Mehdi Sadeghi

DNA sequencing methodologies: 1977

• Maxam-Gilbert
– base modification by general and specific chemicals.
– depurination or depyrimidination.
– single-strand excision.
– not amenable to automation

• Sanger
– DNA replication.
– substitution of substrate with chain-terminator chemical.
– more efficient
– automation?

DNA sequencing: Chemistry

DNA sequencing: Chemistry

template + polymerase +

dCTP dTTP dGTP dATP

ddATP ddGTP ddTTP ddCTP

extension

electrophoresis

A•T G•C A•T T•A C•G T•A G•C G•C A•T G•C T•A T•A C•G T•A G•C A•T

Capillary electrophoresis

ABI 370s-series

DNA sequencing: Computation

DNA sequencing

Goal:

Find the complete sequence of A, C, G, T’s in DNA

Challenge:

There is no machine that takes long DNA as an input, and gives the complete sequence as output

Can only sequence ~500 letters at a time

Genome Sequencing


ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA…

ACGTGGTAATGGCGTATACACCCTTAGGCCATA

Short fragments of DNA

AC..GCTT..TC

CG..CA

AC..GC

TG..GT TC..CC

GA..GCTG..AC

CT..TGGT..GC AC..GC AC..GC

AT..ATTT..CC

AA..GC

Short DNA sequences

ACGTGACCGGTACTGGTAACGTACACCTACGTGACCGGTACTGGTAACGTACGCCTACGTGACCGGTACTGGTAACGTATACACGTGACCGGTACTGGTAACGTACACCTACGTGACCGGTACTGGTAACGTACGCCTACGTGACCGGTACTGGTAACGTATACCTCT...

Sequenced genome

Genome

Sequencing strategies

Whole genome

DNA sequencing – vectors

+ =

DNA

Shake

DNA fragments

Vector: circular genome (bacterium, plasmid)

Known location (restriction site)

Different types of vectors

VECTOR                                  Size of insert (bp)    Notes
Plasmid                                 2,000-10,000           Can control the size
Cosmid                                  40,000
BAC (Bacterial Artificial Chromosome)   70,000-300,000
YAC (Yeast Artificial Chromosome)       > 300,000              Not used much recently

Sanger sequencing

• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags

Sanger Sequencing

• Advantages
– Long reads (~750 bp)
– Suitable for small projects

• Disadvantages
– Low throughput
– Expensive

Method to sequence longer regions

cut many times at random (Shotgun)

genomic segment

Get one or two reads from each segment

~500 bp ~500 bp

Reconstructing the Sequence (Fragment Assembly)

Cover region with ~7-fold redundancy (7X)

Overlap reads and extend to reconstruct the original genomic region

reads

Definition of Coverage

Length of genomic segment: L

Number of reads: n

Length of each read: l

Definition: Coverage C = n l / L
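The coverage definition can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the uncovered-fraction estimate assumes reads land uniformly at random (the Poisson approximation used in Lander-Waterman style analyses), which real libraries only approximate.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L, as defined on the slide."""
    return n_reads * read_len / genome_len


def reads_needed(target_c: float, read_len: int, genome_len: int) -> int:
    """Number of reads n required to reach a target coverage C."""
    return math.ceil(target_c * genome_len / read_len)


def expected_uncovered_fraction(c: float) -> float:
    """Assumption: uniformly random read starts, so a given base is
    missed by every read with probability of about e^(-C)."""
    return math.exp(-c)


# 7,000 reads of 500 bp over a 500 kb segment give 7X coverage.
c = coverage(7_000, 500, 500_000)
print(c, expected_uncovered_fraction(c))
```

Under the random-placement assumption, 7X coverage leaves roughly one base in a thousand unsequenced, which is one way to read the "~7-fold redundancy" rule of thumb.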

How much coverage is enough?


Assembly: How Much DNA?

Input → Output (Lander and Waterman, 1988)

High coverage: many pieces to assemble → a few contigs, a few gaps

Low coverage: a few pieces to assemble → many contigs, many gaps

Challenges with Fragment Assembly

• Sequencing errors

~1-2% of bases are wrong

• Repeats

false overlap due to repeat

Repeats

Bacterial genomes: 5%

Mammals: 50%

Repeat types:

• Low-Complexity DNA (e.g. ATATATATACATA…)

• Microsatellite repeats (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)

• Transposons
– SINE (Short Interspersed Nuclear Elements), e.g., ALU: ~300-long, 10^6 copies
– LINE (Long Interspersed Nuclear Elements): ~4000-long, 200,000 copies
– LTR retroposons (Long Terminal Repeats (~700 bp) at each end): cousins of HIV

• Gene Families genes duplicate & then diverge (paralogs)

• Recent duplications ~100,000-long, very similar copies

Strategies for whole-genome sequencing

1. Hierarchical – Clone-by-clone
i. Break genome into many long pieces
ii. Map each long piece onto the genome
iii. Sequence each piece with shotgun
Example: Yeast, Worm, Human, Rat

2. Online version of (1) – Walking
i. Break genome into many long pieces
ii. Start sequencing each piece with shotgun
iii. Construct map as you go
Example: Rice genome

3. Whole genome shotgun
One large shotgun pass on the whole genome
Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu

Whole-Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing

cut many times at random

genome

forward-reverse paired reads

plasmids (2 – 10 Kbp)

cosmids (40 Kbp) known dist

~500 bp~500 bp

Assembly


Cut DNA to larger pieces (2Kbp, 15Kbp) and sequence both ends of each piece (Fleischmann et al., 1994)

[Figure: contig 1 and contig 2 linked by 15 Kbp mates and 2 Kbp mates; the two ~500 bp end reads of a piece are separated by ~(length − 1,000)]

resolving repeats

Better assembly of contigs, gap lengths estimation

• Many years of hard work
• More than 20,000 BAC clones
• Each containing about 100 kb fragment
• Together provided a tiling path through each human chromosome
• Amplification in bacterial culture
• Isolation, select pieces about 2-3 kb
• Subcloned into plasmid vectors, amplification, isolation
• Recreate contigs
• Refinement, gap closure, sequence quality improvement
• (less than 1 error per 40,000 bases)
• BAC based approaches toward WGS

Sequencing of Human Genome

Public Consortium

Sanger Sequencing


1980 1990 2000

1982: lambda virus, DNA stretches up to 30-40 Kbp (Sanger et al.)

1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)

2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)

2007: Global Ocean Sampling, ~3,000 organisms, 7 Gbp (Venter et al.)

Sequencing the Human Genome

[Figure: log10(price) versus year, 2000 to 2010]

2001: Human Genome Project, 2.7G$, 11 years
2001: Celera, 100M$, 3 years
2007: 454, 1M$, 3 months
2008: ABI SOLiD, 60K$, 2 weeks
2009: Illumina, Helicos, 40-50K$
2010: 5K$, a few days
2012: 100$, <24 hrs?

2nd Generation: Pyrosequencing

• Sequencing by synthesis

• Advantages:
– Accurate
– Parallel processing
– Easily automated
– Eliminates the need for labeled primers and nucleotides
– No need for gel electrophoresis

Pyrosequencing
• Basic idea:
– Visible light is generated and is proportional to the number of incorporated nucleotides
– 1 pmol DNA = 6×10^11 ATP = 6×10^9 photons at 560 nm

DNA Polymerase I from E. coli

pyrophosphate

Luciferase: from fireflies, oxidizes luciferin and generates light

http://en.wikipedia.org/wiki/File:Firefly_luciferin.svg

http://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Adenosine_phosphosulfate.svg/620px-Adenosine_phosphosulfate.svg.png

• 1st Method
– Solid Phase
• Immobilized DNA
• 3 enzymes
• Wash step to remove nucleotides after each addition

Pyrosequencing

• 2nd Method
– Liquid Phase
• 3 enzymes + apyrase (nucleotide degradation enzyme)
– Eliminates need for washing step

• In the well of a microtiter plate:
• primed DNA template
• 4 enzymes

• Nucleotides are added stepwise

• Nucleotide-degrading enzyme degrades previous nucleotides
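The stepwise addition of nucleotides can be pictured as a flowgram: one light signal per dispensed nucleotide, with intensity proportional to the number of bases incorporated. The sketch below is a toy model, not the instrument's software. The dispensation order "TACG", the function names, and the idealized noise-free signal are all assumptions made for illustration.

```python
def flowgram(synthesized: str, dispensation: str = "TACG", cycles: int = 4):
    """Simulate ideal pyrosequencing signals.

    synthesized  : the strand being built by the polymerase (5'->3')
    dispensation : assumed order in which nucleotides are flowed
    cycles       : number of times the dispensation order is repeated
    Returns a list of (nucleotide, signal) pairs, where the signal is the
    length of the homopolymer run incorporated during that flow.
    """
    signals = []
    pos = 0
    for _ in range(cycles):
        for nt in dispensation:
            run = 0
            # The polymerase incorporates every consecutive matching base
            # before the leftover nucleotide is degraded or washed away.
            while pos < len(synthesized) and synthesized[pos] == nt:
                run += 1
                pos += 1
            signals.append((nt, run))
    return signals


def decode(signals) -> str:
    """Rebuild the sequence from an ideal flowgram."""
    return "".join(nt * n for nt, n in signals)


print(flowgram("TTAGGGC", cycles=2))
```

In this idealized model a run of three G's gives a signal of exactly 3. On a real instrument the response stops being linear for long homopolymers, which is why such runs are a typical source of insertion and deletion errors.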

Pyrosequencing

Pyrosequencing

Pyrosequencing Results:

Disadvantages

• Smaller sequences

• Nonlinear light response after more than 5-6 identical nucleotides

Pyrosequencing


Next Generation Sequencing: Why Now?


High Parallelism is Achieved in Polony Sequencing

[Figure panels: Polony, Sanger]

Next Generation Sequencing

• DNA is fragmented

• Adaptors ligated to fragments

• Several possible protocols yield array of PCR colonies.
– Emulsion PCR
– Bridge PCR

• Enzymatic extension with fluorescently tagged nucleotides.

• Cyclic readout by imaging the array.

Next Generation Sequencing

• 454 Life Sciences/Roche
– Genome Sequencer FLX: currently produces 400-600 million bases per day per machine
– Published 1 million bases of Neanderthal DNA in 2006
– May 2007 published complete genome of James Watson (3.2 billion bases ~20x coverage)

• Solexa/Illumina
– 10 GB per machine/week
– May 2008 published complete genomes for 3 HapMap subjects (14x coverage)

• ABI SOLiD
– 20 GB per machine/week

“Paradigm Shift”

• Standard ABI “Sanger” sequencing
– 96 samples/day
– Read length ~750 bp
– Total = 70,000 bases of sequence data

• 454 was the game changer!
– ~400,000 different templates (reads)/day
– Read length ~250 bp
– Total = 100,000,000 bases of sequence data!!!

Solexa ups the Game

• Solexa (Illumina GA)
– 60,000,000 different sequence templates (yes, that is 60 million reads)
– 36 bp read length
– 4 billion bases of DNA per run (3 days)

• Each system works differently, but they are all based on similar principles:
– Shear target DNA into small pieces
– Bind individual DNA molecules to a solid surface
– Amplify each molecule into a cluster
– Copy one base at a time and detect different signals for A, C, T, & G bases
– Requires very precise high-resolution imaging of tiny features (charge-coupled device (CCD))

454

• First high-throughput DNA sequencer, commercially available in 2004
• Now produces ~500 MB reads of 500 bp
• Run of 8 samples in 10 hours, so can do multiple runs/week
• Uses pyrosequencing, beads, and a microtiter plate
• Low error rate, but insert/delete problems with homopolymers (stretches of a single base)

Illumina Genome Analyzer

• Originally developed by Solexa, now a subsidiary of Illumina.
• Commercially available in 2006
• Now produces 8-12 million reads per sample of 36 bp length = 10 GB/week.
• Run takes 3 days for 7 samples.
• Low error rate, mostly base changes, few indels

Call Sequence

ABI-SOLiD

• First commercially available in late 2007
• Currently capable of producing 20 GB of data per run (week)
• Most users generate 6 GB/run
• Reads ~30 bp long
• Uses unique sequence-by-ligation method
• “color-space” data
• Very low error rate

Comparison of existing methods

454 vs Solexa

454:
• Read length: 400 bp
• Number of reads: 400,000
• Per-base cost greater
• de novo assembly, metagenomics

Solexa:
• Read length: 40 bp
• Number of reads: millions
• Per-base cost cheaper
• Ideal for applications requiring short reads

Applications
• “If you build it, they will come.”
• An explosion of scientific innovation!
• Every new technology enables new applications, which are not directly foreseen by the original developers of the tech.
• Cheap access to high-volume sequencing becomes a data collection method for many different types of experimental applications

• Ancient DNA
• DNA mixtures from diverse ecosystems, metagenomics
• Resequencing previously published reference strains
• Identification of all mutations in an organism
• Expand the number of available genomes
• Comparative studies
• Deciphering cell’s transcripts at sequence level without knowledge of the genome sequence
• Sequencing extremely large genomes, crop plants
• Detection of cancer specific alleles avoiding traditional cloning
• ChIP-seq: protein-DNA interactions
• Epigenomics
• Detecting ncRNA
• Human genetic variation: SNP, CNV (diseases)

Usage of sequencing data

• Transcriptome (RNA) sequencing
• Differential expression
• Alternative splicing

• Complete/targeted genome (DNA) resequencing

• Polymorphism and mutation discovery

De Novo sequencing

• New species/strains
• Challenge of assembly with short reads
– 8x coverage of 3 GB genome = 750 million fragments
– Quadratic blow-up in comparisons for an all-vs-all algorithm
• Again big problem with repeats
• Assemble contigs, fill gaps
• Paired-end reads are essential
• Can sequence the entire genome of a microbe in a single run

Assembly

Resequencing(mutation discovery/genotyping)

• A lot of current sequencing effort is spent on re-sequencing genomes of known species
– Individual humans (1000 Genomes Project)
– Experimental organisms: looking for genetic variation, copy number variation
• Challenge is to (quickly) align millions of sequence reads to a reference genome with some % of mismatches
• Challenge to accurately call SNPs and indels
• Problems with repeated sequences, both tandem and dispersed repeats

Read length and pairing

• Short reads are problematic, because short sequences do not map uniquely to the genome.

• Solution #1: Get longer reads.
• Solution #2: Get paired reads.

ACTTAAGGCTGACTAGC TCGTACCGATATGCTG

RNA Sequencing
• “Digital Gene Expression” or “RNA-Seq”
• Truly accurate gene expression measurements
– Can replace gene expression microarrays
• 25% more sensitive
• Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
• Discover novel genes (and other kinds of RNA molecules)
– one experiment found that 34% of human transcripts were not from known genes
• Sultan et al, Science. 2008 Aug 15;321(5891):956-60.

More information from RNA

• Can capture true alternative splicing information
– Sequence of splice-junctions
• One study found 4,096 previously unknown splice junctions in 3,106 human genes
– Different transcription start and end points for RNA molecules
• Allelic variation (SNPs)
• Small RNAs

Metagenomics
• Survey/discovery of all of the species present in an environmental or medical sample
• “Human Microbiome”
– disease vs. healthy microbe populations in mouth, intestines, skin, reproductive tract, etc.

• Complete multiple genome sequencing

• Complete multi-species transcript profiling (metabolic reconstruction)

• Deep sampling of genetic variation in microbial populations (frequency of drug resistant, toxin producing, etc.)

Informatics is the Bottleneck

• Scientists are currently able to generate sequence data much faster/more easily than they are able to analyze it

• Customized analysis / Bioinformatics consulting is needed for every project

Bioinformatics Challenges

• Need for large amount of CPU power
– Informatics groups must manage compute clusters
– Challenges in parallelizing existing software or redesign of algorithms to work in a parallel environment
– Very large text files (~10 million lines long)
– Impossible memory usage and execution time

Future Directions

• Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.

• Complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.

• Data storage and analysis bottleneck
• Data security/privacy issues


Overview

Whole genome shotgun sequencing

• Genomes
• Transcriptomes
• Metagenomes

• De Novo Assembly
• Template Based Assembly

De Novo sequencing

• New species/strains
• Challenge of assembly with short reads
– 8x coverage of 3 GB genome = 750 million fragments (32 bp)
– Quadratic blow-up in comparisons for an all-vs-all algorithm
• Again big problem with repeats
• Assemble contigs, fill gaps
• Paired-end reads are essential
• Can sequence the entire genome of a microbe in a single run

Genome Sequencing

• Assembly Algorithms
– Shotgun sequencing assembly problem

• Find the shortest common superstring of a set of sequences.

• Given strings {s1, s2, …} find the shortest string T such that every si is a substring of T.

• This is NP-hard.

Greedy Algorithm

• Nodes are fragments

• Edges mean that overlaps exist.

• Weights are the overlap lengths found after calculating pairwise alignments of all fragments.

Greedy Algorithm

• Edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of the head g of e.

• Hamiltonian paths: A path that goes through every vertex

Greedy Algorithm

• Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.

• “Greedy” attempt at computing the heaviest path. The basic idea is to repeatedly add the heaviest available edge.
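The greedy idea can be sketched in a few lines: repeatedly merge the pair of fragments with the largest suffix-prefix overlap until one string remains. This is a minimal illustration under strong assumptions (error-free reads, a single strand, exact overlaps, quadratic all-vs-all comparison); the function names are hypothetical and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str) -> int:
    """Weight of edge (a, b): length of the longest suffix of a
    that equals a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def greedy_superstring(fragments):
    """Greedy approximation of the shortest common superstring."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = sorted(
        f for f in set(fragments)
        if not any(f != g and f in g for g in fragments)
    )
    while len(frags) > 1:
        best_t, best_i, best_j = -1, 0, 1
        for i, f in enumerate(frags):
            for j, g in enumerate(frags):
                if i != j:
                    t = overlap(f, g)
                    if t > best_t:
                        best_t, best_i, best_j = t, i, j
        # Merge along the heaviest available edge.
        merged = frags[best_i] + frags[best_j][best_t:]
        frags = [x for k, x in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


reads = ["ACGTGGTAA", "CGTATACAC", "TAGGCCATA", "GTAATGGCG",
         "CACCCTTAG", "TGGCGTATA", "CATA"]
print(greedy_superstring(reads))
```

The reads are the short fragments from the earlier genome sequencing slide. Greedy merging is not guaranteed to find the shortest superstring in general, and repeats can lead it to a wrong layout.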

• Assembly Algorithms
• Overlap-layout-consensus
– An assembler builds the graph
– Output is a set of nonintersecting simple paths, each path being a contig.

Genome Sequencing

Overlap-layout-consensus

• Overlap-layout-consensus method for assembly.
– Build an overlap graph where each node represents a read. An edge exists between two reads if they overlap.

– Traverse the graph to find unambiguous paths which form contigs.

Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)

Overlap-layout-consensus

Next-generation sequencing

• Lower cost / base pair

• Very short fragment lengths (25-75bps)

• High error rate

• Inherent ability to do paired-end (mate-pair) sequencing.

Next-generation sequencing

• Challenging to assemble data.
• Short fragment length = very small overlap, therefore many false overlaps

• Sequenced up to 100x coverage, increase in data size.

• Large number of reads + short overlap + higher error rate make traditional overlap - layout - consensus approach impractical.

Current approaches

• Euler / De Bruijn approach.

• Introduced as an alternative to the overlap-layout-consensus approach in capillary sequencing.

• More suited for short read assembly.

• Assembly Algorithms
• Eulerian path
– Eulerian path: a path that visits every edge of a graph exactly once
– Breaks reads into overlapping n-mers.
– Each n-mer is an edge: its source is the (n-1)-prefix and its destination is the (n-1)-suffix of the n-mer.
– Basic problem is to find a path that uses all the edges.
– Finding an Eulerian path is more efficient than finding a Hamiltonian path.

Genome Sequencing

Eulerian Circuits and Paths

Eulerian Circuit – visits each edge in a graph exactly once, and ends at the same vertex in which it started.

[Figure: graph on vertices a, b, c, d, e, f]

a-d-b-f-e-d-f-c-b-a is an Eulerian cycle in this particular graph

Eulerian Path – visits each edge in a graph exactly once.

[Figure: graph on vertices a, b, c, d, e, f, g, h, i, j]

a-b-c-d-e-f-g-c-h-f-i-j is an Eulerian trail in this particular graph

De Bruijn Graphs

• Nodes are (k-1)-mers
• Edges are k-mers

• The set of k-mers is called a k-spectrum

• Finding shortest string with given k-spectrum.

{AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC}

[Figure: de Bruijn graph with nodes CA, GC, AG, TC, AT, TT]

• Break each read sequence to overlapping fragments of size k. (k-mers)

• Form De Bruijn graph such that each (k-1)-mer represents a node in the graph.

• An edge exists from node a to node b iff there exists a k-mer whose (k-1)-prefix is a and whose (k-1)-suffix is b.

• Traverse the graph along unambiguous paths to form contigs.
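The construction can be tried on the 3-spectrum {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} from the earlier slide. The sketch below builds the de Bruijn graph and walks an Eulerian path with Hierholzer's algorithm; it assumes an error-free spectrum whose graph is connected and actually has an Eulerian path, and the function names are illustrative rather than taken from any assembler.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in sorted(nodes)
         if len(graph.get(u, [])) - in_deg.get(u, 0) == 1),
        min(graph),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the node labels along a path into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
print(spell(eulerian_path(de_bruijn(spectrum))))
```

This graph admits more than one Eulerian path, so the printed string is one of several 10-letter sequences with the same 3-spectrum; which one appears depends on the traversal order.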

De Bruijn Graphs

• K = 4

De Bruijn Graphs

Eulerian Path Approach to DNA Fragment Assembly

• Ultimately, converts an NP-complete Hamilton Path Problem into a simplified Eulerian Path Problem through construction of a de Bruijn graph

•The number of ways to reconstruct the sequence is equivalent to the number of paths which follow the respective edge directions and travel through all edges

•The resulting problem is that there are a number of different Eulerian Paths through this graph, and we cannot tell which would resemble the original path

Eulerian Superpath Problem

•Eulerian Superpath Problem – Given an Eulerian Graph and a collection of paths on this graph, find an Eulerian path in this graph that contains all these paths as subpaths.

•The original Eulerian Path Problem is a case of the Eulerian Superpath Problem, in which every path is a single edge.

Solving: Take graph G and the system of paths P, and transform these to a new graph G1 and a new system P1. With the goal in mind that there is a one-to-one correspondence (equivalence) between (G,P) and (G1,P1), we go on to make a series of these transformations.

(G,P) → (G1,P1) → (G2,P2) →…→ (Gk,Pk)

All these transformations should lead to a system Pk in which every path is represented by one edge. Since every transformation in the series preserves equivalence, every solution of the Eulerian Path Problem (EPP) in (Gk,Pk) will provide a solution to the Eulerian Superpath Problem (ESPP) in (G,P).

An x,y-detachment (case of no multiple edges): Let x = (vin,vmid) and y = (vmid,vout) be two consecutive edges in G, and let Px,y be the collection of all paths from P that include x,y as a subpath.

P→x is the collection of paths from P that end with x, and Py→ is the collection of paths from P that start with y.

The detachment adds a new edge z = (vin,vout) and deletes the edges x and y.

We substitute z for x,y in all paths from Px,y, z for x in all paths from P→x, and z for y in all paths from Py→. Repeating such detachments reduces an ESPP to an EPP.

• Elegant way of representing the problem.
• Very fast execution.
• Error correction can be handled in the graph.
• De Bruijn graph size can be huge.
– ~200GB for human genomes.
• Does not use pair information in initial phase, resulting in overly complicated graphs.

De Bruijn Graphs

Repeats

• Repeats in the sequence
– Assembly programs should detect repeats in the assembly process and not after.

• Incorrect genome reconstruction
– Assemblers should try to resolve correctly as many repeats as possible.

• Detecting repeats
– Euler assembly program

• Finds repeats by complex parts of the graph constructed during the assembly process.

• Researchers look into these complex areas to try and resolve repeats.

• Assemblers can use clone mate (paired end) information to find incorrect assemblies. This is based on finding clone-mate pairs too close or too far from one another.
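The mate-pair check in the last bullet amounts to comparing the observed distance between two mates with the expected clone length. The sketch below is a hypothetical illustration: the tuple layout, the function name, and the fixed tolerance are assumptions, and real assemblers work with a distribution of insert sizes and with read orientation as well.

```python
def suspicious_mates(placements, insert_size, tolerance):
    """Flag mate pairs placed too close together or too far apart.

    placements  : list of (pair_id, left_position, right_position) giving
                  where the two mates landed in the assembly (assumed layout)
    insert_size : expected distance between the mates, e.g. 2,000 bp
    tolerance   : allowed deviation from the expected distance
    """
    flagged = []
    for pair_id, left, right in placements:
        span = abs(right - left)
        if abs(span - insert_size) > tolerance:
            flagged.append(pair_id)
    return flagged


placements = [("m1", 100, 2100), ("m2", 500, 900), ("m3", 0, 5000)]
print(suspicious_mates(placements, insert_size=2000, tolerance=300))
```

A cluster of flagged pairs in one region is a hint that a repeat was collapsed (mates too close) or that unrelated regions were joined (mates too far apart).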

Repeats

ASSEMBLY OF READS WITH ERRORS

• Errors in read data greatly complicate the task of fragment assembly.

• Error correction is performed prior to assembly by solving the error correction problem.

Resequencing(mutation discovery/genotyping)

• A lot of current sequencing effort is spent on re-sequencing genomes of known species
– Individual humans (1000 Genomes Project)
– Experimental organisms: looking for genetic variation, copy number variation
• Challenge is to (quickly) align millions of sequence reads to a reference genome with some % of mismatches
• Challenge to accurately call SNPs and indels
• Problems with repeated sequences, both tandem and dispersed repeats

Alignment programs are needed to map short sequencing reads from next-generation sequencing technologies to a reference genome.

New Challenge

Given a set of reads R, for each read r ∈ R, find its target regions on the reference genome G, such that for each target region t there are at most k mismatches between r and t.
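The problem statement can be made concrete with a brute-force baseline that slides every read along the reference and counts mismatches. This is only a specification-level sketch with hypothetical function names: it takes time proportional to |R| · |G| · read length, handles substitutions only (no indels), and is exactly what seeded and Burrows-Wheeler based aligners are designed to avoid.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, genome: str, k: int):
    """Start positions of all target regions with at most k mismatches."""
    m = len(read)
    return [
        i for i in range(len(genome) - m + 1)
        if hamming(read, genome[i:i + m]) <= k
    ]


def map_reads(reads, genome: str, k: int):
    """Solve the reads mapping problem naively for a whole read set R."""
    return {r: map_read(r, genome, k) for r in reads}


genome = "ACGTGGTAATGGCGTATACAC"
print(map_reads(["TGGTAA", "CGTATA"], genome, k=0))
```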


The reads mapping problem

Aligner algorithms can be divided into two categories:

Seeded alignments algorithms (BLAST like)

Burrows-Wheeler transform based algorithms


Aligner algorithms

BLAST is the most popular tool.

Requires a query sequence to search for, and a sequence to search against.

Step 1: Make a k-letter word list of the query sequence.

Step 2: List the possible matching words.

Step 3: Extend the match to find the high-similarity pair.

TAGGACCTAACC

GACCACCTTTT

Seed alignment algorithm

Find seeded matches of 11 base pairs

Extend each match to right and left, until the scores drop too much, to form an alignment

Report all local alignments

Example:

AGCGATGTCACGCGCCCGTATTTCCGTA
   ||| |||||||||||  || ||||
TCGGATCTCACGCGCCCGGCTTACCGTG
0001110111111111110011011110
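The 0/1 string in the example can be reproduced directly, and the 11-base seed located in it. The sketch below is a simplification with hypothetical function names: it assumes the two sequences are already placed in register, whereas BLAST itself finds seed positions through a word lookup before extending.

```python
def similarity(a: str, b: str) -> str:
    """1 where the aligned bases match, 0 where they differ."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def consecutive_seed_hits(sim: str, w: int = 11):
    """Start positions where w consecutive matches occur."""
    return [i for i in range(len(sim) - w + 1) if sim[i:i + w] == "1" * w]


a = "AGCGATGTCACGCGCCCGTATTTCCGTA"
b = "TCGGATCTCACGCGCCCGGCTTACCGTG"
sim = similarity(a, b)
print(sim)
print(consecutive_seed_hits(sim))
```

The single seed hit corresponds to the shared stretch TCACGCGCCCG, from which the alignment would then be extended to the left and right.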


Blast algorithm

Spaced Seed: nonconsecutive matches and optimized match positions.

Represent BLAST seed by 11111111111

Spaced seed: 111010010100110111

1 means a required match

0 means “don’t care” position

The length of the seed is the string length, and the weight of the seed is the number of 1s in the string.

This seemingly simple change makes a huge difference: it significantly increases hits to homologous regions while reducing bad hits.

Spaced seed

Multiple simultaneous seeds are defined as a set of seeds:

∏ = {seed1, seed2, …, seedi, …, seedn}

∏ detects a similarity if at least one of the component seeds detects the similarity

Example: simultaneous seeds {1101, 1011} detect similarities 100110100001, 1000010110001, 1101001011001
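The seed definitions can be checked mechanically against the example similarities. The sketch below treats a similarity as a 0/1 string (1 = match) and a seed as a pattern of required (1) and "don't care" (0) positions; the function names are hypothetical.

```python
def seed_hits(seed: str, sim: str):
    """Positions where every required (1) position of the seed is a match."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    return [
        p for p in range(len(sim) - len(seed) + 1)
        if all(sim[p + i] == "1" for i in required)
    ]


def detects(seeds, sim: str) -> bool:
    """A seed set detects a similarity if at least one seed hits it."""
    return any(seed_hits(s, sim) for s in seeds)


def weight(seed: str) -> int:
    """Number of required match positions in the seed."""
    return seed.count("1")


seeds = ["1101", "1011"]
for sim in ["100110100001", "1000010110001", "1101001011001"]:
    print(sim, {s: seed_hits(s, sim) for s in seeds}, detects(seeds, sim))
```

The spaced seed 111010010100110111 from the earlier slide has length 18 and weight 11, the same weight as the contiguous BLAST seed of eleven 1s.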


Multiple simultaneous seeds

The prefix trie for string X is a tree where each edge is labeled with a symbol and the string concatenation of the edge symbols on the path from a leaf to the root gives a unique prefix of X.

On the prefix trie, the string concatenation of the edge symbols from a node to the root gives a unique substring of X .

The prefix trie of X is identical to the suffix trie of reverse of X and therefore suffix trie theories can also be applied to prefix trie


Let ∑ be an alphabet. Symbol $ is not present in ∑ and is lexicographically smaller than all the symbols in ∑.

A string X = a0a1...an−1 always ends with symbol $ (i.e. an−1 = $).

Suffix array S of X is a permutation of the integers 0...n−1 such that S(i) is the start position of the i-th smallest suffix.

To compute S(·), string X is cyclically rotated to generate n strings, which are then lexicographically sorted.

After sorting, the positions of the first symbols form the suffix array.

BWT(X) is the last column of the sorted matrix.
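The construction by sorted rotations can be reproduced for the string “googol”. The sketch below uses plain sorting, which is fine for a toy string but needs far too much time and memory for a genome; it relies on '$' sorting before the letters in ASCII, and the function names are illustrative.

```python
def suffix_array(x: str):
    """S(i) = start position of the i-th smallest suffix of x."""
    assert x.endswith("$"), "X must be terminated by the sentinel $"
    return sorted(range(len(x)), key=lambda i: x[i:])


def bwt_from_rotations(x: str) -> str:
    """Last column of the lexicographically sorted rotation matrix."""
    rotations = sorted(x[i:] + x[:i] for i in range(len(x)))
    return "".join(row[-1] for row in rotations)


def bwt_from_suffix_array(x: str) -> str:
    """Equivalent shortcut: the symbol just before each sorted suffix."""
    return "".join(x[i - 1] for i in suffix_array(x))


x = "googol$"
print(suffix_array(x))
print(bwt_from_rotations(x))
```

Both routes give the same transformed string, which is why the full rotation matrix never needs to be stored once the suffix array is known.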


Most algorithms for constructing suffix array require at least n⌈log2 n⌉ bits of working space, which amounts to 12GB for human genome.

Recently, Hon et al. (2007) gave a new algorithm that uses n bits of working space and only requires <1GB memory at peak time for constructing the BWT of human genome


If string W is a substring of X, the position of each occurrence of W in X will occur in an interval in the suffix array.

Based on this observation, we define:

R(W) = min{k : W is the prefix of X_S(k)}

R’(W) = max{k : W is the prefix of X_S(k)}

(X_i = X[i, n−1] is a suffix of X.) In particular, if W is an empty string, R(W) = 1 and R’(W) = n−1.

The interval [R(W), R’(W)] is called the SA interval of W and the set of positions of all occurrences of W in X is

{S(k) : R(W) ≤ k ≤ R’(W)}

For example, the SA interval of string ‘go’ is [1,2]. The suffix array values in this interval are 3 and 0, which give the positions of all the occurrences of ‘go’ in “googol”.
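The SA interval definition can be evaluated literally on the “googol” example. The sketch below scans the whole suffix array, which is only meant to mirror the definition; BWT-based aligners obtain the same interval without such a scan. The function names are hypothetical, and for an empty W this sketch returns the full array, whereas the convention above skips row 0 (the $ suffix).

```python
def suffix_array(x: str):
    """Start positions of the suffixes of x in lexicographic order."""
    return sorted(range(len(x)), key=lambda i: x[i:])


def sa_interval(x: str, w: str):
    """Return (R(W), R'(W), positions) or None if W does not occur in x.

    R(W) and R'(W) are the smallest and largest k such that W is a
    prefix of the suffix starting at S(k)."""
    sa = suffix_array(x)
    ks = [k for k, start in enumerate(sa) if x.startswith(w, start)]
    if not ks:
        return None
    lo, hi = ks[0], ks[-1]
    return lo, hi, [sa[k] for k in range(lo, hi + 1)]


print(sa_interval("googol$", "go"))
```

Because the suffixes are sorted, all suffixes that start with W are adjacent, so the matching rows always form one contiguous interval.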

Knowing the intervals in suffix array we can get the positions.

Therefore, sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query.

For the exact matching problem, we can find only one such interval


We can compute SA intervals for all nodes in the trie, so mapping each read is equivalent to searching the trie.
