![Page 1: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/1.jpg)
CS262 Lecture 9, Win07, Batzoglou
Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
![Page 2: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/2.jpg)
![Page 3: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/3.jpg)
Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) via L.I.S.
3. Restricted DP
![Page 4: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/4.jpg)
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
![Page 5: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/5.jpg)
The Problem: Find a Chain of Local Alignments
(x, y) → (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
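As a baseline for the chaining problem stated above, here is a minimal quadratic-time sketch. It assumes each local alignment is given as a hypothetical tuple `(x_start, y_start, x_end, y_end, weight)` and that alignment a may precede alignment b only when a ends strictly before b starts in both coordinates. The function name and the tuple layout are illustrative choices, not taken from the lecture.

```python
def best_chain(alignments):
    """Highest-weight chain by plain O(N^2) dynamic programming.

    alignments: list of (x_start, y_start, x_end, y_end, weight) tuples
    (a hypothetical layout chosen for this sketch).
    Returns (best_score, list_of_indices_in_chain_order).
    """
    # Any valid predecessor ends before its successor starts, so it also
    # starts earlier; sorting by x_start therefore visits predecessors first.
    order = sorted(range(len(alignments)), key=lambda k: alignments[k][0])
    score, prev = {}, {}
    for pos, b in enumerate(order):
        xs_b, ys_b, _, _, w_b = alignments[b]
        score[b], prev[b] = w_b, None
        for a in order[:pos]:
            _, _, xe_a, ye_a, _ = alignments[a]
            if xe_a < xs_b and ye_a < ys_b and score[a] + w_b > score[b]:
                score[b], prev[b] = score[a] + w_b, a
    if not score:
        return 0, []
    end = max(score, key=score.get)
    best = score[end]
    chain = []
    while end is not None:      # follow backpointers to the start of the chain
        chain.append(end)
        end = prev[end]
    return best, chain[::-1]
```

This version compares every pair of alignments; the sparse dynamic programming described on the following slides avoids that by keeping only the candidates that can still improve a later chain.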
![Page 6: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/6.jpg)
Sparse Dynamic Programming
[Figure: DP matrix with x = 15 3 24 16 20 4 24 3 11 18 along the columns and y = 4 20 24 3 11 15 11 4 18 20 along the rows]
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
![Page 7: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/7.jpg)
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
x = x1, …, xm
• Find a subsequence
s = s1, …, sk
s1 < s2 < … < sk
![Page 8: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/8.jpg)
Sparse Dynamic Programming – L.I.S.
Let input be w: w1, …, wn

INITIALIZATION:
L: last LIS elt. array; L[0] = -inf, L[1] = w1, L[2…n] = +inf
B: array holding LIS elts; B[0] = 0, B[1] = 1
P: array of backpointers
// L[j]: smallest jth element wi of j-long LIS seen so far

ALGORITHM
for i = 2 to n {
    Find j such that L[j – 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j – 1]
}

That’s it!!!
• Running time?
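The pseudocode above can be sketched in Python as follows. This is one possible rendering, not the lecture's own code: it is 0-indexed, it grows the `tails` list on demand instead of pre-filling L with ±inf, and it uses the standard-library `bisect_left` for the "find j" step. The names `tails`, `tail_idx` and `parent` are illustrative stand-ins for L, B and P.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []                # tails[j]: smallest last value of an increasing run of length j + 1 (array L)
    tail_idx = []             # tail_idx[j]: index in w of that value (array B)
    parent = [-1] * len(w)    # backpointers (array P)
    for i, value in enumerate(w):
        j = bisect_left(tails, value)   # first j with value <= tails[j]
        if j == len(tails):
            tails.append(value)
            tail_idx.append(i)
        else:
            tails[j] = value
            tail_idx[j] = i
        parent[i] = tail_idx[j - 1] if j > 0 else -1
    # Walk the backpointers from the end of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(w[i])
        i = parent[i]
    return result[::-1]
```

Each element costs one binary search over a list of length at most n, which is where an O(n log n) bound comes from.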
![Page 9: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/9.jpg)
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point (i, j), is inserted into w as follows:
• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y’, x’) can be chained iff
a is before b in w, and y < y’
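[Figure: the matrix for x = 15 3 24 16 20 4 24 3 11 18 (columns) and y = 4 20 24 3 11 15 11 4 18 20 (rows), with the matching points numbered 1 to 11 in the order they are inserted into w]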
![Page 10: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/10.jpg)
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
![Page 11: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/11.jpg)
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L = [L1] [L2] [L3] [L4] [L5] …
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
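The reduction from sparse LCS to LIS can be tried out with the sketch below. It builds w from the two example sequences (columns in increasing order, rows in decreasing order within a column) and then runs an LIS that compares points by row only. The function name and the returned pair are illustrative choices. One caveat: computing the matches directly from the sequences also produces the point (6,1), since x1 = y6 = 15, which is not among the 11 points listed on the slide; it does not change the result.

```python
from bisect import bisect_left

def sparse_lcs(x, y):
    """LCS of x and y via LIS over matching points only.

    Returns (common_subsequence, chain_of_points), points given as (row i, column j), 1-based.
    """
    rows_of = {}                              # value -> rows i where y has that value
    for i, value in enumerate(y, start=1):
        rows_of.setdefault(value, []).append(i)

    # Columns j = 1..m in increasing order; within a column, rows in decreasing order,
    # so two points of the same column can never both appear in an increasing run.
    w = [(i, j)
         for j, value in enumerate(x, start=1)
         for i in reversed(rows_of.get(value, []))]

    tails, tail_idx = [], []
    parent = [-1] * len(w)
    for k, (i, _) in enumerate(w):
        pos = bisect_left(tails, i)           # points are compared by row only
        if pos == len(tails):
            tails.append(i)
            tail_idx.append(k)
        else:
            tails[pos] = i
            tail_idx[pos] = k
        parent[k] = tail_idx[pos - 1] if pos > 0 else -1

    chain = []
    k = tail_idx[-1] if tail_idx else -1
    while k != -1:
        chain.append(w[k])
        k = parent[k]
    chain.reverse()
    return [y[i - 1] for i, _ in chain], chain
```

Called with `x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]` and `y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]`, the sketch returns the chain (1,6) (3,7) (4,8) (5,9) (9,10), the same points as in step 11 of the worked example.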
![Page 12: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/12.jpg)
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj: smallest (North) to largest (South) value
L is implemented as a balanced binary tree
[Figure: a rectangle on the y-axis, spanning from h (top) to l (bottom)]
![Page 13: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/13.jpg)
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining
• In L, keep rectangles j sorted by increasing lj-coordinate; they are then also sorted by increasing V(j) score
![Page 14: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/14.jpg)
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
a. k: rectangle in L, with largest lk ≤ li
b. If V(i) > V(k):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
Is k ever removed?
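The sweep can be sketched in Python as below. Two simplifications are assumptions of this sketch rather than part of the lecture: L is kept in plain sorted lists searched with `bisect` (so an insertion costs O(N) instead of the O(log N) of a balanced binary tree), and each rectangle is a hypothetical tuple `(x_left, x_right, h, l, weight)` with h < l, y growing downward.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """Sweep-line rectangle chaining.

    rects: non-empty dict name -> (x_left, x_right, h, l, weight), with h < l.
    Returns (V, best_chain), where V maps each name to its best chain score.
    """
    events = []
    for name, (x_left, x_right, _, _, _) in rects.items():
        events.append((x_left, 0, name))    # 0 = left end; sorts before a right end at the same x
        events.append((x_right, 1, name))   # 1 = right end
    events.sort()

    V, prev = {}, {}
    keys = []       # l-coordinates of the rectangles currently in L, ascending
    entries = []    # parallel list of (V(j), j); the scores are ascending as well

    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            pos = bisect_left(keys, h) - 1          # rectangle j with largest l_j < h_i
            if pos >= 0:
                V[i], prev[i] = w + entries[pos][0], entries[pos][1]
            else:
                V[i], prev[i] = w, None
        else:
            pos = bisect_right(keys, l) - 1         # rectangle k with largest l_k <= l_i
            if pos < 0 or V[i] > entries[pos][0]:
                start = bisect_left(keys, l)
                end = start
                # Entries with l_j >= l_i and V(j) <= V(i) are consecutive: drop them.
                while end < len(keys) and entries[end][0] <= V[i]:
                    end += 1
                keys[start:end] = [l]
                entries[start:end] = [(V[i], i)]

    node = max(V, key=V.get)
    chain = []
    while node is not None:
        chain.append(node)
        node = prev[node]
    return V, chain[::-1]
```

With hypothetical coordinates `a = (1, 3, 2, 5, 5)`, `b = (5, 9, 6, 9, 6)`, `c = (4, 6, 10, 11, 3)`, `d = (7, 11, 12, 15, 4)`, `e = (10, 13, 14, 16, 2)`, the sweep inserts c into L and later removes it when b arrives with a smaller l and a larger score. The weights and l-coordinates are those of the example slide that follows; the h-coordinates and all x-coordinates are assumptions picked for this sketch.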
![Page 15: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/15.jpg)
Example
[Figure: rectangles a to e in the x–y plane, with weights a: 5, b: 6, c: 3, d: 4, e: 2; y-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]
V(i): a = 5, b = 11, c = 8, d = 12, e = 13
L, as triplets (li, V(i), i): (5, 5, a), (11, 8, c), (9, 11, b), (15, 12, d), (16, 13, e)
![Page 16: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/16.jpg)
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
![Page 17: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/17.jpg)
Examples
Human Genome Browser
ABC
![Page 18: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/18.jpg)
Gene Recognition
![Page 19: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/19.jpg)
Gene structure
[Figure: a gene with exon1, intron1, exon2, intron2, exon3; steps labeled transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid
![Page 20: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/20.jpg)
Where are the genes?
![Page 21: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/21.jpg)
![Page 22: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/22.jpg)
Needles in a Haystack
![Page 23: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/23.jpg)
Gene Finding
• Classes of gene predictors
Ab initio
• Only look at the genomic DNA of target genome
De novo
• Target genome + aligned informant genome(s)
EST/cDNA-based & combined approaches
• Use aligned ESTs or cDNAs + any other kind of evidence
[Figure: track of five EXON blocks]
Human     tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque   tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse     ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat       ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit    t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog       t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow       t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant  gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec    tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum   ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken   ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
![Page 24: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/24.jpg)
Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)
![Page 25: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/25.jpg)
Next Exon: Frame 0
Next Exon: Frame 1
![Page 26: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/26.jpg)
Exon and Intron Lengths
![Page 27: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/27.jpg)
Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code
Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
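To see how the codon table maps DNA to amino acids, here is a small translation sketch built from the table above. The three stop codons (TAA, TAG, TGA) are not listed in the table; they are added here from the standard genetic code. The function name `translate` and the choice to stop at the first stop codon are assumptions of this sketch.

```python
# One line per amino acid: single-letter code followed by its DNA codons (from the table above).
CODON_TABLE_TEXT = """
I ATT ATC ATA
L CTT CTC CTA CTG TTA TTG
V GTT GTC GTA GTG
F TTT TTC
M ATG
C TGT TGC
A GCT GCC GCA GCG
G GGT GGC GGA GGG
P CCT CCC CCA CCG
T ACT ACC ACA ACG
S TCT TCC TCA TCG AGT AGC
Y TAT TAC
W TGG
Q CAA CAG
N AAT AAC
H CAT CAC
E GAA GAG
D GAT GAC
K AAA AAG
R CGT CGC CGA CGG AGA AGG
"""

CODON_TO_AA = {}
for line in CODON_TABLE_TEXT.split("\n"):
    fields = line.split()
    if fields:
        for codon in fields[1:]:
            CODON_TO_AA[codon] = fields[0]

# Not in the slide's table: stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def translate(dna):
    """Translate a coding sequence read in frame 0, stopping at the first stop codon."""
    protein = []
    for k in range(0, len(dna) - 2, 3):
        codon = dna[k:k + 3].upper()
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TO_AA[codon])
    return "".join(protein)
```

The table covers 61 codons, so together with the three stop codons all 64 triplets are accounted for. Because several amino acids have many synonymous codons, codon frequencies in coding regions differ from those in non-coding sequence, which is the signal this slide refers to.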
![Page 28: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/28.jpg)
[Figure: gene diagram annotated with the sequence motifs atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc]
![Page 29: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/29.jpg)
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
![Page 30: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/30.jpg)
HMMs for Gene Recognition
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: the sequence segmented into intergene, exon, and intron regions; HMM states shown: Intergene State, First Exon State, Intron State]
![Page 31: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/31.jpg)
![Page 32: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/32.jpg)
Duration HMMs for Gene Recognition
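[Figure: a nucleotide sequence segmented into Exon1, Exon2, Exon3, with one exon of duration d and position j+2 marked]
$\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
$P_{\mathrm{EXON\_DUR}}(d) \prod_i P_{\mathrm{EXON}((i - j + 2)\,\%\,3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
$P_{5'\mathrm{SS}}(x_{i-3} \ldots x_{i+4})$
$P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$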
![Page 33: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/33.jpg)
Genscan
• Burge, 1997
• First competitive HMM-based gene finder, huge accuracy jump
• Only gene finder at the time to predict partial genes and genes on both strands
Features
– Duration HMM
– Four different parameter sets
• Very low, low, med, high GC-content
![Page 34: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/34.jpg)
Using Comparative Information
![Page 35: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/35.jpg)
Using Comparative Information
• Hox cluster is an example where everything is conserved
![Page 36: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/36.jpg)
Patterns of Conservation
            Mutations   Gaps      Frameshifts
Genes       30%         1.3%      0.14%
Intergenic  58%         14%       10.2%
Separation  2-fold      10-fold   75-fold
![Page 37: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/37.jpg)
Comparison-based Gene Finders
• Rosetta, 2000
• CEM, 2000
– First methods to apply comparative genomics (human-mouse) to improve gene prediction
• Twinscan, 2001
– First HMM for comparative gene prediction in two genomes
• SLAM, 2002
– Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
• NSCAN, 2006
– Best method to date, based on a phylo-HMM for multiple-genome gene prediction
![Page 38: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/38.jpg)
Twinscan
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
New “alphabet”: 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions ek(b) where b ∈ { A-, A:, A|, …, U| }
Emission distributions ek(b) estimated from real genes from human/mouse
eI(x|) < eE(x|): matches favored in exons
eI(x-) > eE(x-): gaps (and mismatches) favored in introns
Example
Human:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
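Step 2 can be sketched as a short function that derives the conservation sequence from a pairwise alignment and pairs it with the target bases to form the 12-letter alphabet. The sketch assumes the two strings are already aligned to equal length and that a gap in the informant is written as "-"; both function names are illustrative, not Twinscan's own.

```python
def conservation_sequence(target, informant):
    """Mark each target base as match '|', mismatch ':' or gap '-' against the informant."""
    if len(target) != len(informant):
        raise ValueError("sequences must be aligned to the same length")
    symbols = []
    for t, m in zip(target, informant):
        if m == "-":            # assumed gap character in the informant row
            symbols.append("-")
        elif t == m:
            symbols.append("|")
        else:
            symbols.append(":")
    return "".join(symbols)

def joint_symbols(target, informant):
    """Pair each target base with its conservation symbol, e.g. 'A|' or 'G:'."""
    return [t + c for t, c in zip(target, conservation_sequence(target, informant))]
```

Applied to the human and mouse strings of the example, `conservation_sequence` reproduces the alignment row shown above; `joint_symbols` yields the sequence over the 12-letter alphabet that the Viterbi step would take as input.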
![Page 39: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/39.jpg)
SLAM – Generalized Pair HMM
[Figure: a pair of aligned exons, of lengths d and e]
Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d + e.
![Page 40: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/40.jpg)
NSCAN—Multiple Species Gene Prediction
• GENSCAN
Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
• TWINSCAN
Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation |||:||:||:|||||:||||||||......
sequence
• N-SCAN
Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1   GGTCAGC___CCAAGAACGTGTAG......
Informant2   GATCAGC___CCAAGAACGTGTAG......
Informant3   GGTGAGCTGACCAAGATCGTGTTGACACAA
...
Target sequence: $P(T_i \mid T_{i-1}, \ldots, T_{i-o})$
Informant sequences (vector): $P(\mathbf{I}_i \mid T_i, \ldots, T_{i-o}, \mathbf{I}_{i-1}, \ldots, \mathbf{I}_{i-o})$
Joint prediction (use phylo-HMM): $P(T_i, \mathbf{I}_i \mid T_{i-1}, \ldots, T_{i-o}, \mathbf{I}_{i-1}, \ldots, \mathbf{I}_{i-o})$
![Page 41: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/41.jpg)
NSCAN—Multiple Species Gene Prediction
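[Figure: phylogenetic tree rooted at X, with children C and Y; Y has children Z and H; Z has children M and R]
$P(H, C, M, R, X, Y, Z) = P(X)\,P(C \mid X)\,P(Y \mid X)\,P(H \mid Y)\,P(Z \mid Y)\,P(M \mid Z)\,P(R \mid Z)$
[Figure: the same tree rerooted at H]
$P(H, C, M, R, X, Y, Z) = P(H)\,P(Y \mid H)\,P(X \mid Y)\,P(Z \mid Y)\,P(C \mid X)\,P(M \mid Z)\,P(R \mid Z)$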
![Page 42: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/42.jpg)
Performance Comparison
GENSCAN: Generalized HMM; models human sequence
TWINSCAN: Generalized HMM; models human/mouse alignments
N-SCAN: Phylo-HMM; models multiple sequence evolution
NSCAN human/mouse > Human/multiple informants
![Page 43: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/43.jpg)
CONTRAST
• 2-level architecture
• No Phylo-HMM that models alignments
[Figure: the 12-species alignment (Human through Chicken) together with SVM blocks; other labels: X, Y, a, b]
![Page 44: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/44.jpg)
CONTRAST
![Page 45: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/45.jpg)
CONTRAST - Features
• $\log P(y \mid x) \sim w^T F(x, y)$
• $F(x, y) = \sum_i f(y_{i-1}, y_i, i, x)$
• $f(y_{i-1}, y_i, i, x)$:
1{y_{i-1} = INTRON, y_i = EXON_FRAME_1}
1{y_{i-1} = EXON_FRAME_1, x_{human,i-2}, …, x_{human,i+3} = ACCGGT}
1{y_{i-1} = EXON_FRAME_1, x_{human,i-1}, …, x_{dog,i+1} = ACC, AGC}
(1-c)1{a < SVM_DONOR(i) < b}
(optional) 1{EXON_FRAME_1, EST_EVIDENCE}
![Page 46: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/46.jpg)
CONTRAST – SVM accuracies
• Accuracy increases as we add informants
• Diminishing returns after ~5 informants
[Figure: plots labeled SN and SP]
![Page 47: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/47.jpg)
CONTRAST - Decoding
Viterbi Decoding:
maximize P(y | x)
Maximum Expected Boundary Accuracy Decoding:
maximize $\sum_{i,B} 1\{y_{i-1}, y_i \text{ is exon boundary } B\}\ \mathrm{Accuracy}(y_{i-1}, y_i, B \mid x)$
$\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = P(y_{i-1}, y_i \text{ is } B \mid x) - (1 - P(y_{i-1}, y_i \text{ is } B \mid x))$
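A short worked simplification may help in reading the accuracy term. Writing p for the posterior probability of boundary B at position i (a shorthand introduced here, not on the slide):

```latex
\begin{align*}
p &= P(y_{i-1}, y_i \text{ is } B \mid x) \\
\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) &= p - (1 - p) = 2p - 1
\end{align*}
```

So the term is positive exactly when p > 1/2. Under this reading, predicting a boundary raises the objective only where its posterior probability exceeds one half, and lowers it otherwise.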
![Page 48: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/48.jpg)
CONTRAST - Training
Maximum Conditional Likelihood Training:
maximize L(w) = Pw(y | x)
Maximum Expected Boundary Accuracy Training:
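$\mathrm{ExpectedBoundaryAccuracy}(w) = \sum_i \mathrm{Accuracy}_i$
$\mathrm{Accuracy}_i = \sum_B 1\{y_{i-1}, y_i \text{ is exon boundary } B\} \left[ P_w(y_{i-1}, y_i \text{ is } B \mid x) - \sum_{B' \neq B} P(y_{i-1}, y_i \text{ is exon boundary } B' \mid x) \right]$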
![Page 49: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/49.jpg)
Performance Comparison
[Figure: results in two panels, De Novo and EST-assisted; species listed: Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken]
![Page 50: CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d3f5503460f94a190c6/html5/thumbnails/50.jpg)
Performance Comparison