www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Combinatorial Pattern Matching
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Outline• Week 08:
• Quiz 2• Hash Tables• Repeat Finding• Exact Pattern Matching• Keyword Trees• Suffix Trees
• Week 09:• Heuristic Similarity Search Algorithms• Approximate String Matching• Filtration• Comparing a Sequence Against a Database• Algorithm behind BLAST• Statistics behind BLAST• PatternHunter and BLAT
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Quiz 2
•closed book •marked out of 10•worth 5% of final grade•40 minutes
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
timing
• begin
• 20 minutes to go
• 10 minutes to go
• 5 minutes
• STOP
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Genomic Repeats
• Example of repeats:• ATGGTCTAGGTCCTAGTGGTC
• Motivation to find them:
• Genomic rearrangements are often associated with repeats
• To trace evolutionary secrets
• Many tumors are characterized by an explosion of repeats
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Genomic Repeats
• The problem is often made more difficult by mutation: • ATGGTCTAGGACCTAGTGTTC
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
l-mer Repeats
• Long repeats are difficult to find• Short repeats are easy to find (e.g., with
hashing)• A simple approach to finding long repeats:
• Find exact repeats of short l-mers (l is usually 10–13)
• Use l-mer repeats to potentially extend into longer, maximal repeats
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
l-mer Repeats (cont’d)
• There are typically many locations where an l-mer is repeated:
GCTTACAGATTCAGTCTTACAGATGGT
• The 4-mer TTAC starts at locations 3 and 17
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Extending l-mer Repeats
GCTTACAGATTCAGTCTTACAGATGGT
• Extend these 4-mer matches:
GCTTACAGATTCAGTCTTACAGATGGT
• Maximal repeat: TTACAGAT
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Maximal Repeats
• To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
• Hashing lets us find repeats quickly in this manner
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Hashing: What is it?
• What does hashing do?
• For different data, it generates a unique integer
• We store data in an array at the unique integer index generated from the data
• Hashing is a very efficient way to store and retrieve data (often stated as O(1))
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Hashing: Definitions
• Hash table: array used in hashing
• Records: data stored in a hash table
• Keys: identifies sets of records
• Hash function: uses a key to generate an index to insert at in hash table
• Collision: when more than one record is mapped to the same index in the hash table
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Hashing: Example
• Where do the animals eat?
• Records: each animal
• Keys: where each animal eats
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Hashing DNA sequences
• Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)
• After assigning a unique integer per l-mer it is easy to get all start locations of each l–mer in a genome
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Hashing: Maximal Repeats
• To find repeats in a genome:• For all l-mers in the genome, note the start
position and the sequence• Generate a hash table index for each
unique l-mer sequence• In each index of the hash table, store all
genome start locations of the l-mer which generated that index
• Extend l-mer repeats to maximal repeats
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Hashing: Collisions
• Dealing with collisions:
• “Chain” all start locations of l-mers (linked list)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Hashing: Summary
• When finding genomic repeats from l-mers:
• Generate a hash table index for each l-mer sequence
• In each index, store all genome start locations of the l-mer which generated that index
• Extend l-mer repeats to maximal repeats
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Pattern Matching
• What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
• This leads us to a different problem, the Pattern Matching Problem
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Pattern Matching Problem• Goal: Find all occurrences of a pattern in a text
• Input: Pattern p = p1…pn and text t = t1…tm
• Output: All positions 1< i < (m – n + 1) such that the n-letter substring of t starting at i matches p
• Motivation: Searching database for a known pattern
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Exact Pattern Matching: A Brute-Force Algorithm
PatternMatching(p,t)1 n length of pattern p2 m length of text t3 for i 1 to (m – n + 1)4 if ti…ti+n-1 = p
5 output i
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Exact Pattern Matching: An Example
• PatternMatching algorithm for:
• Pattern GCAT
• Text CGCATC
GCATCGCATC
GCATCGCATC
CGCATCGCAT
CGCATC
CGCATCGCAT
GCAT
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Exact Pattern Matching: Running Time
• PatternMatching runtime: O(nm)
• Probability-wise, it’s more like O(m)
• Rarely will there be close to n comparisons in line 4
• But a better solution is to use suffix trees
• Can solve problem in O(m) time
• Conceptually related to keyword trees (next...)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Example
• Keyword tree:
• Apple
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
• Banana
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
• Banana
• Bandana
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Example (cont’d)
• Keyword tree:
• Apple
• Apropos
• Banana
• Bandana
• Orange
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Properties
• Stores a set of keywords in a rooted labeled tree
• Each edge labeled with a letter from an alphabet
• Any two edges coming out of the same vertex have distinct labels
• Every keyword stored can be spelled on a path from root to some leaf
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “appeal”
• appeal
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “appeal”
• appeal
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “appeal”
• appeal
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “appeal”
• appeal
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “apple”
• apple
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “apple”
• apple
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “apple”
• apple
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “apple”
• apple
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Thread “apple”
• apple
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Multiple Pattern Matching Problem
• Goal: Given a set of patterns and a text, find all occurrences of any of patterns in text
• Input: k patterns p1,…,pk, and text t = t1…tm
• Output: Positions 1 < i < m where substring of t starting at i matches pj for 1 < j < k
• Motivation: Searching database for known multiple patterns
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Multiple Pattern Matching: Straightforward Approach• Can solve as k “Pattern Matching Problems”
• Runtime:
O(kmn)
using the PatternMatching algorithm k times
• m - length of the text
• n - average length of the pattern
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Multiple Pattern Matching: Keyword Tree Approach• Or, we could use keyword trees:
• Build keyword tree in O(N) time; N is total length of all patterns
• With naive threading: O(N + nm)
• Aho-Corasick algorithm: O(N + m)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading
• To match patterns in a text using a keyword tree:
• Build keyword tree of patterns
• “Thread” the text through the keyword tree
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Keyword Trees: Threading (cont’d)
• Threading is “complete” when we reach a leaf in the keyword tree
• When threading is “complete,” we’ve found a pattern in the text
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Suffix Trees=Collapsed Keyword Trees
• Similar to keyword trees, except edges that form paths are collapsed
• Each edge is labeled with a substring of a text
• All internal edges have at least two outgoing edges
• Leaves labeled by the index of the pattern.
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes
ATCATG TCATG CATG ATG TG G
Keyword Tree
Suffix Tree
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes
ATCATG TCATG CATG ATG TG G
Keyword Tree
Suffix Tree
How much time does it take?
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes
ATCATG TCATG CATG ATG TG G
quadratic Keyword Tree
Suffix Tree
Time is linear in the total size of all suffixes, i.e., it is quadratic in the length of the text
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Suffix Trees: Advantages
• Suffix trees of a text is constructed for all its suffixes • Suffix trees build faster than keyword trees
ATCATG TCATG CATG ATG TG G
quadratic Keyword Tree
Suffix Tree
linear (Weiner suffix tree algorithm)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Use of Suffix Trees
• Suffix trees hold all suffixes of a text• i.e., ATCGC: ATCGC, TCGC, CGC, GC, C• Builds in O(m) time for text of length m
• To find any pattern of length n in a text:• Build suffix tree for text• Thread the pattern through the suffix tree• Can find pattern in text in O(n) time!
• O(n + m) time for “Pattern Matching Problem”• Build suffix tree and lookup pattern
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Pattern Matching with Suffix TreesSuffixTreePatternMatching(p,t)1 Build suffix tree for text t2 Thread pattern p through suffix tree3 if threading is complete4 output positions of all p-matching leaves in the
tree5 else6 output “Pattern does not appear in text”
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Suffix Trees: Example
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Suffix arrays
• A related structure is the suffix array.• It can also be constructed in linear time and was
developed to save space.• The space requirement of a suffix array is
significantly less than that for a suffix tree.• The suffix array also forms the basis for the
Burrows-Wheeler Transform (BWT) which is behind the operation of the assembly program Bowtie
• Do look it up: it’s cool, even if it’s BTSOTC.
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Multiple Pattern Matching: Summary
• Keyword and suffix trees are used to find patterns in a text
• Keyword trees:
• Build keyword tree of patterns, and thread text through it
• Suffix trees:
• Build suffix tree of text, and thread patterns through it
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Approximate vs. Exact Pattern Matching• So far all we’ve seen are exact pattern
matching algorithms
• Usually, because of mutations, it makes much more biological sense to find approximate pattern matches
• Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Heuristic Similarity Searches
• Genomes are huge: Smith-Waterman quadratic alignment algorithms are too slow
• Alignment of two sequences usually has short identical or highly similar fragments
• Many heuristic methods (i.e., FASTA) are based on the same idea of filtration:
• Find short exact matches, and use them as seeds for potential match extension
• “Filter” out positions with no extendable matches
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Dot Matrices
• Dot matrices show similarities between two sequences
• FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Dot Matrices (cont’d)
• Identify diagonals above a threshold length
• Diagonals in the dot matrix indicate exact substring matching
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Diagonals in Dot Matrices
• Extend diagonals and try to link them together, allowing for minimal mismatches/indels
• Linking diagonals reveals approximate matches over longer substrings
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Approximate Pattern Matching Problem
• Goal: Find all approximate occurrences of a pattern in a text
• Input: A pattern p = p1…pn, text t = t1…tm, and k, the maximum number of mismatches
• Output: All positions 1 < i < (m – n + 1) such that ti…ti+n-1 and p1…pn have at most k mismatches (i.e., Hamming distance between ti…ti+n-1 and p < k)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Approximate Pattern Matching: A Brute-Force Algorithm
ApproximatePatternMatching(p, t, k)1 n length of pattern p2 m length of text t3 for i 1 to m – n + 14 dist 05 for j 1 to n6 if ti+j-1 != pj
7 dist dist + 18 if dist < k9 output i
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Approximate Pattern Matching: Running Time
• That algorithm runs in O(nm).• Landau-Vishkin algorithm: O(kn)• We can generalize the “Approximate Pattern
Matching Problem” into a “Query Matching Problem”:• We want to match substrings in a query to
substrings in a text with at most k mismatches• Motivation: we want to see similarities to
some gene, but we may not know which parts of the gene to look for
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Query Matching Problem
• Goal: Find all substrings of the query that approximately match the text
• Input: Query q = q1…qw, text t = t1…tm, n (length of matching substrings), k (maximum number of mismatches)• Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i
approximately matches the n-letter substring of t starting at j, with at most k mismatches
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Approximate Pattern Matching vs Query Matching
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Query Matching: Main Idea
• Approximately matching strings share some perfectly matching substrings.
• Instead of searching for approximately matching strings (difficult) search for perfectly matching substrings (easy).
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Filtration in Query Matching
• We want all n-matches between a query and a text with up to k mismatches
• “Filter” out positions we know do not match between text and query
• Potential match detection: find all matches of l-tuples in query and text for some small l
• Potential match verification: Verify each potential match by extending it to the left and right, until (k+1) mismatches are found
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Filtration: Match Detection
• If x1…xn and y1…yn match with at most k mismatches, they must share an l-tuple that is perfectly matched, with l = n/(k+1) *
• Break string of length n into k+1 parts, each of length n/(k + 1)• k mismatches can affect at most k of these
k+1 parts
• At least one of these k+1 parts is perfectly matched (eh?)
* x is defined as the largest integer no bigger than x, so 1.2 = 1
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Filtration: Match Detection (cont’d)
• Suppose k = 3. We would then have l=n/(k+1)=n/4:
• There are at most k mismatches in n, so at the very least there must be one out of the k+1 l-tuples without a mismatch (try putting k mismatches in the array such that no two of them are further than l apart)
1…l l +1…2l 2l +1…3l 3l +1…n
1 2 k k + 1
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Filtration: Match Verification
• For each l -match we find, try to extend the match further to see if it is substantial
query
Extend perfect match of length luntil we find an approximate match of length n with k mismatchestext
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Filtration: Example
k = 0 k = 1 k = 2 k = 3 k = 4 k = 5
l-tuplelength n n/2 n/3 n/4 n/5 n/6
Shorter perfect matches required
Performance decreases
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Local alignment is too slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g., the entire GenBank database)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Local alignment is too slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g., the entire GenBank database)
• But:• Guaranteed to find the optimal local
alignment
• Sets the standard for sensitivity
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Local alignment is too slow…
• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)
• Basic Local Alignment Search Tool• Altschul, S., Gish, W., Miller, W.,
Myers, E. & Lipman, D.J.
Journal of Mol. Biol., 1990• Search sequence databases for
local alignments to a query
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAST
• Great improvement in speed, with a modest decrease in sensitivity
• Minimizes search space instead of exploring entire search space between two sequences
• Finds short exact matches (“seeds”), only explores locally around these “hits”
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
What Similarity Reveals
• BLASTing a new gene
• Evolutionary relationship
• Similarity between protein function
• BLASTing a genome
• Potential genes
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAST algorithm
• Keyword search of all words of length w from the query of length n in database of length m with score above threshold
• w = 11 for DNA queries, w = 3 for proteins
• Local alignment extension for each found keyword
• Extend result until longest match above threshold is achieved
• Running time O(nm)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAST algorithm (cont’d)
Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++KSbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
keyword
GVK 18GAK 16GIK 16GGK 14GLK 13GNK 12GRK 11GEK 11GDK 11
neighbourhoodscore threshold
(T = 13)
Neighbourhoodwords
High-scoring Pair (HSP)
extension
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Original BLAST
• Dictionary
• All words of length w
• Alignment
• Ungapped extensions until score falls below some statistical threshold
• Output
• All local alignments with score > threshold
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Original BLAST: ExampleA C G A A G T A A G G T C C A G T
C
T
G
A
T
C C
T
G
G
A
T
T
G C
G
A• w = 4
• Exact keyword match of GGTC
• Extend diagonals with mismatches until score is under 50%
• Output resultGTAAGGTCCGTTAGGTCC
From lectures by Serafim Batzoglou (Stanford)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Gapped BLAST : Example• Original BLAST
exact keyword search, THEN:
• Extend with gaps around ends of exact match until score < threshold
• Output resultGTAAGGTCCAGTGTTAGGTC-AGT
A C G A A G T A A G G T C C A G T
C
T
G
A
T
C C
T
G
G
A
T
T
G C
G
A
From lectures by Serafim Batzoglou (Stanford)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Incarnations of BLAST
• blastn: Nucleotide-nucleotide
• blastp: Protein-protein
• blastx: Translated query vs. protein database
• tblastn: Protein query vs. translated database
• tblastx: Translated query vs. translated
database (6 frames each)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Incarnations of BLAST (cont’d)
• PSI-BLAST• Find members of a protein family or build a
custom position-specific score matrix• Megablast:
• Search longer sequences with fewer differences
• WU-BLAST: (Wash U BLAST)• Optimized, added features
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Assessing sequence similarity
• We need to know how strong an alignment can be expected from chance alone
• “Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model
• Sequence models may take into account: • G+C content• Poly-A tails• “Junk” DNA • Codon bias• Etc.
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAST: Segment Score
• BLAST uses scoring matrices () to improve on efficiency of match detection• Some proteins may have very different
amino acid sequences, but are still similar
• For any two l-mers x1…xl and y1…yl :• Segment pair: pair of l-mers, one from each
sequence• Segment score: li=1 (xi, yi)
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAST: Locally Maximal Segment Pairs
• A segment pair is maximal if it has the best score over all segment pairs
• A segment pair is locally maximal if its score can’t be improved by extending or shortening
• Statistically significant locally maximal segment pairs are of biological interest
• BLAST finds all locally maximal segment pairs with scores above some threshold• A significantly high threshold will filter out
some statistically insignificant matches
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAST: Statistics
• Threshold: Altschul-Dembo-Karlin statistics• Identifies smallest segment score that is unlikely to
happen by chance• # matches above has mean E() = Kmne-; K is a
constant, m and n are the lengths of the two compared sequences
• Parameter is positive root of
where px and py are frequencies of amino acids x and y, and A is the twenty letter amino acid alphabet
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
P-values• The probability of finding b HSPs with a
score ≥ S is given by
• For b = 0, that chance is just
• Thus the probability of finding at least one HSP with a score ≥S is
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Sample BLAST output Score E
Sequences producing significant alignments: (bits) Value
gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... 171 3e-44gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... 170 7e-44gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... 170 7e-44gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 168 3e-43
ALIGNMENTS>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]Length = 148
Score = 171 bits (434), Expect = 3e-44 Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)
Query: 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60 MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPKSbjct: 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60
Query: 61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120 V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FGSbjct: 61 VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120
Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147 + F VQ A+QK +A V +AL +YHSbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148
• Blast of human beta globin protein against zebra fish
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Sample BLAST output (cont’d)
Score ESequences producing significant alignments: (bits) Value
gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... 289 1e-75gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end 289 1e-75gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... 280 1e-72gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin 260 1e-66gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... 151 7e-34gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 149 3e-33
ALIGNMENTS>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11 Length = 81706 Score = 149 bits (75), Expect = 3e-33 Identities = 183/219 (83%) Strand = Plus / Plus Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326 || ||| | || | || | |||||| ||||| ||||||||||| |||||||| Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468
Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365 ||||||||| |||||||||| ||||| ||||||||||||Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507
• Blast of human beta globin DNA against human DNA
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Timeline
• 1970: Needleman-Wunsch global alignment algorithm
• 1981: Smith-Waterman local alignment algorithm• 1985: FASTA• 1990: BLAST (basic local alignment search tool)• 2000s: BLAST has become too slow in “genome vs.
genome” comparisons - new faster algorithms evolve!• PatternHunter• BLAT
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
PatternHunter: faster and even more sensitive• BLAST: matches short
consecutive sequences (consecutive seed)
• Length = k
• Example (k = 11):
11111111111
Each 1 represents a “match”
• PatternHunter: matches short non-consecutive sequences (spaced seed)
• Increases sensitivity by locating homologies that would otherwise be missed
• Example (a spaced seed of length 18 with 11 “matches”):
111010010100110111
Each 0 represents a “don’t care”, so there can be a match or a mismatch
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Spaced seeds
Example of a hit using a spaced seed:
How does this result in better sensitivity?
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Why is PH better?
• BLAST: redundant hits
PatternHunter
This results in > 1 hit and creates clusters of redundant hits
This results in very few redundant hits
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Why is PH better?
BLAST may also miss a hitGAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
|| ||||||||| |||||| | |||||| ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
In this example, despite a clear homology, there is no sequence of continuous matches longer than length 9. BLAST uses a length 11 and because of this, BLAST does not recognize this as a hit!
Resolving this would require reducing the seed length to 9, which would have a damaging effect on speed
9 9 matchesmatches
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Why is PH better?
• Higher hit probability
• Lower expected number of random hits
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Use of Multiple Seeds
Basic Searching Algorithm
1. Select a group of spaced seed models
2. For each hit of each model, conduct extension to find a homology.
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Another method: BLAT
• BLAT (BLAST-Like Alignment Tool)
• Same idea as BLAST - locate short sequence hits and extend
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAT vs. BLAST: Differences
• BLAT builds an index of the database and scans linearly through the query sequence,
whereas
• BLAST builds an index of the query sequence and then scans linearly through the database
• Index is stored in RAM which is memory intensive, but results in faster searches
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAT: Fast cDNA Alignments
Steps:1. Break cDNA into 500 base chunks.2. Use an index to find regions in genome similar to
each chunk of cDNA.3. Do a detailed alignment between genomic regions
and cDNA chunk.4. Use dynamic programming to stitch together
detailed alignments of chunks into detailed alignment of whole.
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
BLAT: Indexing
• An index is built that contains the positions of each k-mer in the genome
• Each k-mer in the query sequence is compared to each k-mer in the index
• A list of ‘hits’ is generated - positions in cDNA and in genome that match for k bases
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Indexing: An ExampleHere is an example with k = 3:
Genome: cacaattatcacgaccgc3-mers (non-overlapping): cac aat tat cac gac cgcIndex: aat 3 gac 12 cac 0,9 tat 6 cgc 15
cDNA (query sequence): aattctcac3-mers (overlapping): aat att ttc tct ctc tca cac 0 1 2 3 4 5 6
Hits: aat 0,3 cac 6,0 cac 6,9 clump: cacAATtatCACgaccgc
Multiple instances map to single index
Position of 3-mer in query, genome
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
However…
• BLAT was designed to find sequences of 95% and greater similarity of length >40; may miss more divergent or shorter sequence alignments
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
PatternHunter and BLAT vs. BLAST
• PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity
• BLAT is several times faster than BLAST, but best results are limited to closely related sequences
www.bioalgorithms.infoCOMP3456 – adapted from textbook slides
Resources
• tandem.bu.edu/classes/ 2004/papers/pathunter_grp_prsnt.ppt• http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf• http://www.genomeblat.com/genomeblat/blatRapShow.pps
Top Related