C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 02-11-2006 Alignments 3:...
-
date post
18-Dec-2015 -
Category
Documents
-
view
216 -
download
1
Transcript of C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 02-11-2006 Alignments 3:...
CENTR
FORINTEGRATIVE
BIOINFORMATICSVU
E
02-11-2006
Alignments 3:BLAST
Sequence Analysis
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[2] 09-11-2006 Sequence Analysis
Sequence searching - challenges• Exponential growth of databases
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[3] 09-11-2006 Sequence Analysis
Sequence searching – definition
• Task:
• Query: short, new sequence (~1000b)
• Database (searching space): very many sequences
• Goal: find seqs related to query
• We want:
• fast tool
• primarily a filter: most sequences will be unrelated to the query
• fine-tune the alignment later
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[4] 09-11-2006 Sequence Analysis
•dynamic programming has performance O(mn) which is too slow for large databases with high query traffic
– MPsrch [Sturrock & Collins, MPsrch version 1.3 (1993) – Massively parallel DP]
•heuristic methods do fast approximation to dynamic programming
– FASTA [Pearson & Lipman, 1988]
– BLAST [Altschul et al., 1990]
Heuristic Alignment Motivation
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[5] 09-11-2006 Sequence Analysis
Heuristic Alignment Motivation• consider the task of searching SWISS-PROT against a query sequence:• say our query sequence is 362 amino-acids long• SWISS-PROT release 38 contains 29,085,265 amino
acids• finding local alignments via dynamic programming
would entail O(1010) matrix operations• many servers handle thousands of such queries a day
(NCBI > 50,000)• Using the DP algorithm for this is clearly prohibitive • Note: each database search can be sped up by ‘trivial
parallelisation”
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[6] 09-11-2006 Sequence Analysis
Heuristic Alignment
• Today: BLAST is discussed to show you a few of the tricks people have come up with to make alignment and database searching fast, while not losing too much quality.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[7] 09-11-2006 Sequence Analysis
What is BLAST• Basic Local Alignment Search Tool
• Bad news: it is only a heuristic• Heuristics: A rule of thumb that often helps in solving
a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work
• Also see http://en.wikipedia.org/wiki/Heuristic
• Basic idea:• High scoring segments have well conserved (almost
identical) part
• As well conserved parts are identified, extend these to the real alignment
q
e
s
-
euqes-
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[8] 09-11-2006 Sequence Analysis
What means well conserved for BLAST?
• BLAST works with k-words (words of length k)
• k is a parameter
• different for DNA (>10) and proteins (2..4), default k values are 11 and 3, resp.
• word w1 is T-similar to w2 if the sum of pair scores is
at least T (e.g. T=12)Similar 3-words
W1: R K PW2: R R PScore: 9 –1 7 = 15
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[9] 09-11-2006 Sequence Analysis
BLAST algorithm3 basic steps
1)Preprocess the query: extract all the k-words
2)Scan for T-similar matches in database
3)Extend them to alignments
1) Preprocess2) Scan3) Extend
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[10] 09-11-2006 Sequence Analysis
BLAST, Step 1: Preprocess the query
• Take the query (e.g. LVNRKPVVP)
• Chop it into overlapping k-words (k=3 in this case)
• For each word find all similar words (scoring at least T)
• E.g. for RKP the following 3-words are similar:QKP KKP RQP REP RRP RKP
1) Preprocess2) Scan3) Extend
Query: LVNRKPVVPWord1: LVNWord2: VNRWord3: NRK…
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[11] 09-11-2006 Sequence Analysis
Step 2: Scanning the Database with DFA (Deterministic Finite-state Automaton)• search database for all occurrences of query
words• can be a massive task• approach:
• build a DFA (deterministic finite-state automaton) that recognizes all query words
• run DB sequences through DFA• remember hits
1) Preprocess2) Scan3) Extend
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[12] 09-11-2006 Sequence Analysis
DFA Finite state machine
AC*T|GGC
• abstract machine
• constant amount of memory (states)
• used in computation and languages
• recognizes regular expressions
• cp dmt*.pdf /home/john
1) Preprocess2) Scan3) Extend
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[13] 09-11-2006 Sequence Analysis
BLAST, Step 2: Find “exact” matches with scanning • Use all the T-similar k-words to build
the Finite State Machine
• Scan for exact matches
...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
QKPKKPRQPREPRRPRKP...
movement
1) Preprocess2) Scan3) Extend
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[14] 09-11-2006 Sequence Analysis
Scanning the Database - DFA
Example (next 2 slides):• consider a DFA to recognize the query words: QL, QM, ZL
• All that a DFA does is read strings, and output "accept" or "reject."
• use Mealy paradigm (accept on transitions) to save space and time
Moore paradigm: the alphabet is (a, b), the states are q0, q1, and q2, the start state is q0 (denoted by the arrow coming from nowhere), the only accepting state is q2 (denoted by the double ring around the state), and the transitions are the arrows. The machine works as follows. Given an input string, we start at the start state, and read in each character one at a time, jumping from state to state as directed by the transitions. When we run out of input, we check to see if we are in an accept state. If we are, then we accept. If not, we reject.
Moore paradigm: accept/reject states
Mealy paradigm: accept/reject transitions
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[15] 09-11-2006 Sequence Analysis
a DFA to recognize the query words: QL, QM, ZL in a fast wayQ
ZL o
r MQ
not (
L or M
or Q
)
Z
L
not (L or Z)
Mealy paradigm
not (Q or Z)
Accept on red transitions
start
This DFA is downloaded from expert website, but what do you think (see next..)?
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[16] 09-11-2006 Sequence Analysis
a DFA to recognize the query words: QL, QM, ZL in a fast wayQ
ZL o
r MQ
not (
L or M
or Q
or Z
)
Z
L
not (L or Z or Q)
Mealy paradigm
not (Q or Z)
Accept on red transitions
start ZQ
spot and justify the differences with the last slide..
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[17] 09-11-2006 Sequence Analysis
BLAST, Step 3: Extending “exact” matches• Having the list of matches (hits) we extend
alignment in both directions
Query: L V N R K P V V PT-similar: R R PSubject: G V C R R P L K CScore: -3 4 -3 5 2 7 1 -2 -3
1) Preprocess2) Scan3) Extend
• …till the sum of scores drops below some level X from the best known
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[18] 09-11-2006 Sequence Analysis
Step 3: Extending Hits• extend hits in both directions (without allowing
gaps)
• terminate extension in one direction when score falls certain distance below best score for shorter extensions
• return segment pairs scoring at least S
1) Preprocess2) Scan3) Extend
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[19] 09-11-2006 Sequence Analysis
More Recent BLAST Extensions• the two-hit method
• gapped BLAST• hashing the database• PSI-BLAST
all are aimed at increasing sensitivity while keeping run-times minimal
Altschul et al., Nucleic Acids Research 1997
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[20] 09-11-2006 Sequence Analysis
The Two-Hit Method• extension step typically accounts for
90% of BLAST’s execution time• key idea: do extension only when there
are two hits on the same diagonal within distance A of each other
• to maintain sensitivity, lower T parameter• more single hits found• but only small fraction have associated
2nd hit
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[21] 09-11-2006 Sequence Analysis
The Two-Hit Method
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[22] 09-11-2006 Sequence Analysis
Gapped BLAST• trigger gapped alignment if two-hit
extension has a sufficiently high score• find length-11 segment with highest score;
use central pair in this segment as seed• run DP process both forward & backward
from seed• prune cells when local alignment score falls
a certain distance below best score yet
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[23] 09-11-2006 Sequence Analysis
Gapped BLAST
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[24] 09-11-2006 Sequence Analysis
Combining the two-hit method and Gapped BLAST • Before:
• relatively high T threshold for 3-letter word (hashed) lists
• two-way hit extension (see earlier slides)• Current BLAST:
• Lower T: many more hits (more 3-letter words accepted as match)
• Relatively few hits (diagonal elements) will be on same matrix diagonal within a given distance A
• Perform 2-way local Dynamic Programming (gapped BLAST) only on ‘two-hits’ (preceding bullet)
The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[25] 09-11-2006 Sequence Analysis
Hashing – associative arrays
• Indexing with the object, the
• Hash function:
• Objects should be “well spread”
hash:
x
set of possible objects - largesmall
(fits in memory)
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[26] 09-11-2006 Sequence Analysis
Hashing - examples
• T9 Predictive Text in mobile phones
• “hello”:4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6
• “hello” in T9: 4, 3, 5, 5, 6
• Collisions: 4, 6:“in”, “go”
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[27] 09-11-2006 Sequence Analysis
Hashing – examples (cont..)• Other easier hash function: let a=1, b=2, c=3,
etc.
• “hello” now gets hash address 8+5+12+12+15 = 52
• “olleh” will get same address (collision)
• Each word encountered gets a hash address immediately and can be indexed.
• How good is this hash function?
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[28] 09-11-2006 Sequence Analysis
BLAST, Step 2: Find ”exact” matches with hashing
• Preprocess the database
• Hash the database with k-words
• For each k-word store in which sequences it appears
k-word: RKP
Hashed DB:QKP: HUgn0151194, Gene14, IG0, ...KKP: haemoglobin, Gene134, IG_30, ...RQP: HSPHOSR1, GeneA22...RKP: galactosyltransferase, IG_1...REP: haemoglobin, Gene134, IG_30, ...RRP: Z17368, Creatine kinase, ......
1) Preprocess2) Scan3) Extend
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[29] 09-11-2006 Sequence Analysis
BLAST, Step 2: Find “exact” matches with hashing
• The database is preprocessed only once! (independent from the query)
• In a constant time we can get the sequences with a certain k-word
k-word: RKP
Hashed DB:QKP: HUgn0151194, Gene14, IG0, ...KKP: haemoglobin, Gene134, IG_30, ...RQP: HSPHOSR1, GeneA22...RKP: galactosyltransferase, IG_1...REP: haemoglobin, Gene134, IG_30, ...RRP: Z17368, Creatine kinase, ......
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[30] 09-11-2006 Sequence Analysis
BLAST flavours• blastp: protein query, protein db• blastn: DNA query, DNA db• blastx: DNA query, protein db
• in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.
• tblastn: protein query, DNA db
• database dynamically translated in all reading frames.• tblastx: DNA query, DNA db
• all translations of query against all translations of db (compare at protein level)
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[31] 09-11-2006 Sequence Analysis
PSI-BLAST• Position-Specific Iterated BLAST
• A profile (called PSSM by BLAST – Position Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence)
• Database is searched against the profile (instead of a sequence) in subsequent rounds
• Up to 3-10 iterations are recommended
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[32] 09-11-2006 Sequence Analysis
1. Query sequences are first scanned for the presence of so-called low-complexity regions (Wooton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; are excluded from alignment.
2. The program then initially operates on a single query sequence by performing a gapped BLAST search
3. Then, the program takes significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.
4. The database is rescanned in a subsequent round, now using the PSSM, to find more homologous sequences. Iteration continues until user decides to stop or search has converged
PSI-BLAST steps in words
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[33] 09-11-2006 Sequence Analysis
Profile
• a Profile is a generalized form of sequence
• probabilities instead of a letter
ACDWY
0.30.10..0.30.3
0.500..00.5
00.50.2..0.10.2
0.20.00.1..0.40.3
...
...
...
...
...
...
...
...
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[34] 09-11-2006 Sequence Analysis
Constructing a profile
• Take significant BLAST hits
• Make an alignment
• Assign weights to sequences
• Construct profile
ACDWY
0.30.10..0.30.3
0.500..00.5
00.50.2..0.10.2
0.20.00.1..0.40.3
...
...
...
...
...
...
...
...
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[35] 09-11-2006 Sequence Analysis
PSI BLAST:Constructing the Profile Matrix
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[36] 09-11-2006 Sequence Analysis
1 2 3 4 5 Overall
A .17 .33 .17 .17 .17 6/30 = .20
C .17 .17 .17 .50 .50 9/30 = .30
G .50 .17 .17 .17 .17 7/30 = .23
T .17 .33 .50 .17 .17 8/30 = .27
12345S1 GCTCC S2 AATCGS3 TACGCS4 GTGTTS5 GTAAAS6 CGTCC
1 2 3 4 5 Overall
A .85 1.65 .85 .85 .85 6/30 = .20
C .57 .57 .57 1.67 1.67 9/30 = .30
G 2.17 .74 .74 .74 .74 7/30 = .23
T .63 1.22 1.85 .63 .63 8/30 = .27
1 2 3 4 5
A -0.23 0.72 -0.23 -0.23 -0.23
C -0.81 -0.81 -0.81 0.74 0.74
G 1.11 -0.43 -0.43 -0.43 -0.43
T -0.66 0.29 0.89 -0.66 -0.66
Normalise by dividing by overall frequencies
Convert to log to base of 2
1 2 3 4 5
A -0.23 0.72 -0.23 -0.23 -0.23
C -0.81 -0.81 -0.81 0.74 0.74
G 1.11 -0.43 -0.43 -0.43 -0.43
T -0.66 0.29 0.89 -0.66 -0.66
Match GATCA to PSSM
Score = 1.11 + 0.72 + 0.89 + 0.74 - 0.23 = 3.23
Find nucleotides at corresponding positions
Sum corresponding log odds matrix scores
(A)
(B)
Profile calculation example using frequency normalisation and log conversionp
rofi
le
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[37] 09-11-2006 Sequence Analysis
PSI BLAST: Determining profile elements more reliably using pseudo-counts• the value for a given element of the profile
matrix is given by:
• where the probability of seeing amino acid ai in column j is estimated as:
Observed frequency
Pseudocount (e.g. database frequency)
e.g. = number of sequences in profile, =1
Overall alignment frequency (preceding slide)
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[38] 09-11-2006 Sequence Analysis
PSI BLAST: Determining profile elements more reliably using pseudo-countsPseudo-counts:
• mix observed a.a. frequencies with prior (e.g. database) frequencies
•drawback is pulling all frequencies to prior frequencies, which reduces differences
• are useful when multiple alignment contains only few sequences so that there is no statistical sample per column yet
•with greater numbers of sequences in the MSA, the profile becomes less dependent
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[39] 09-11-2006 Sequence Analysis
PSI-BLAST iteration graphic…
Q
ACD..Y
Query sequence
PSSM
Q Query sequence
Gapped BLAST search
Database hits
Gapped BLAST search
ACD..Y
PSSM
Database hits
xxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
iterate
Low-complexity
region
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[40] 09-11-2006 Sequence Analysis
DBT
hits
PSSM
Q
Discarded sequences
Run query sequence against
database
Run PSSM against database
Another PSI-BLAST iteration graphic…
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[41] 09-11-2006 Sequence Analysis
(A) (B)
(C) (D)
Figure 6
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[42] 09-11-2006 Sequence Analysis
PSI-BLAST entry page
Paste your query sequence
Switch this off for default run
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[43] 09-11-2006 Sequence Analysis
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[44] 09-11-2006 Sequence Analysis
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117 meaning that a sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[45] 09-11-2006 Sequence Analysis
‘X’ residues denote low-complexity sequence fragments that are ignored
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[46] 09-11-2006 Sequence Analysis
Alignment Bit Score
•S is the raw alignment score •The bit score (‘bits’) B has a standard set of units•The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment and K and are the statistical parameters of the scoring system (BLOSUM62 in Blast). •See Altschul and Gish, 1996, for a collection of values for and K over a set of widely used scoring matrices. •Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)
B = (S – ln K) / ln 2
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[47] 09-11-2006 Sequence Analysis
What is the statistical significance of an alignment• To get a null model: extract local
alignments from random sequences
• P-value• The probability of obtaining the result by
pure chance• An alignment giving a lower P-value than
a threshold value set by the user is considered a hit.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[48] 09-11-2006 Sequence Analysis
Normalised sequence similarity
The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences.
This probability follows the Poisson distribution (Waterman and Vingron, 1994):
P(x, n) = 1 – e-nP(S x),
where n is the number of sequences in the database
Depending on x and n (fixed)
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[49] 09-11-2006 Sequence Analysis
E-value• The concept of P-value applies to single
comparisons
• What with searching in a large database?
Task.
Having a protein, we want to find similar ones in a large database (1mln sequences). We are interested in
P-value < 0.01Count the number of hits we’ll get by chance alone.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[50] 09-11-2006 Sequence Analysis
Normalised sequence similarityStatistical significance
The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:
E(x, n) = nP(S x)
For example, if E-value = 0.01, then the expected number of random hits with score S x is 0.01, which means that this E-value is expected by chance only once in 100 independent searches over the database.if the E-value of a hit is 5, then five fortuitous hits with S x are expected within a single database search, which renders the hit not significant.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[51] 09-11-2006 Sequence Analysis
A model for database searching score probabilities• Scores resulting from searching with a query
sequence against a database follow the Extreme Value Distribution (EDV) (Gumbel, 1955).
• Using the EDV, the raw alignment scores are converted to a statistical score (E value) that keeps track of the database amino acid composition and the scoring scheme (a.a. exchange matrix)
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[52] 09-11-2006 Sequence Analysis
Extreme Value Distribution
Probability density function for the extreme value distribution resulting from parameter values = 0 and = 1, [y = 1 – exp(-e-x)], where is the characteristic value and is the decay constant.
y = 1 – exp(-e-(x-))
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[53] 09-11-2006 Sequence Analysis
Extreme Value Distribution (EDV)
You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This bodes well with the EDV which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EDV an alignment has to score further away from the expected mean value to become a significant hit.
real data
EDV approximation
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[54] 09-11-2006 Sequence Analysis
Extreme Value DistributionThe probability of a score S to be larger than a
given value x can be calculated following the EDV as:
E-value: P(S x) = 1 – exp(-e -(x-)),
where =(ln Kmn)/, and K a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for and K over a set of widely used scoring matrices).
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[55] 09-11-2006 Sequence Analysis
Extreme Value DistributionUsing the equation for (preceding slide), the probability
for the raw alignment score S becomes
P(S x) = 1 – exp(-Kmne-x).
In practice, the probability P(Sx) is estimated using the approximation 1 – exp(-e-x) e-x, which is valid for large values of x. This leads to a simplification of the equation for P(Sx):
P(S x) e-(x-) = Kmne-x.
The lower the probability (E value) for a given threshold value x, the more significant the score S.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[56] 09-11-2006 Sequence Analysis
Normalised sequence similarityStatistical significance• Database searching is commonly
performed using an E-value in between 0.1 and 0.001.
• Low E-values decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[57] 09-11-2006 Sequence Analysis
Words of Encouragement• “There are three kinds of lies: lies,
damned lies, and statistics” – Benjamin Disraeli
• “Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination”
• “Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[58] 09-11-2006 Sequence Analysis
Database Search Algorithms:Sensitivity, Selectivity• Sensitivity – the ability to detect weak similarities between sequences
(often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences similar to the similar to the query, but rejected. Sensitivity = TP / (TP+FN)
• Selectivity – the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, those sequences recognized as similar when they are not. Selectivity = TP / (TP + FP)
SensitivitySensitivity
SelectivitySelectivity
Courtesy of Gary Benson (ISSCB 2003)
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[59] 09-11-2006 Sequence Analysis
Dot-plotsa simple way to visualise sequence similarity
Can be a bit messy, though...Filter: 6/10 residues have to match...
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[60] 09-11-2006 Sequence Analysis
Dot-plots, what about...• Insertions/deletions -- DNA and proteins
• Duplications (e.g. tandem repeats) – DNA and proteins
• Inversions -- DNA
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[61] 09-11-2006 Sequence Analysis
Dot-plots, self-comparison
Direct repeatDirect repeat
Tandem repeatTandem repeat
Inverted repeatInverted repeat
C E N T R F O R I N T E G R A T I V EB I O I N F O R M A T I C S V U
E
[63] 09-11-2006 Sequence Analysis
• END