Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA
1
Pairwise alignment 2: Scoring matrices and gaps; BLAST, BLAT and FASTA
Introduction to bioinformatics
Stinus [email protected]
Bioinformatics Centre, University of Copenhagen
2
Outline of the lecture
  Scoring an alignment: BLOSUM and PAM
  Scoring matrices for nucleotides
  Treatment of gaps
  The BLAST algorithm
  Position Specific Scoring Matrices
  PSI-BLAST and other variants
  BLAT
  FASTA
3
Definition of scoring system
To build an alignment we need a scoring system
  Aligning two different residues indicates a substitution
  How likely is it to replace A with B?
  A likely alignment should receive a large score
Create a substitution matrix
  A 20×20 symmetrical table (matrix)
  Look up the score for aligning two amino acids
  The diagonal represents conservation
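The table lookup described on this slide can be sketched as follows. This is a minimal illustration, not a real matrix: the scores in `TOY_SCORES` are hypothetical values for four residues, and the function names are made up for the example.

```python
# Toy symmetric substitution scores (hypothetical values, NOT PAM or BLOSUM).
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "I"): 2, ("A", "L"): -1, ("A", "I"): -1,
    ("A", "K"): -1, ("L", "K"): -2, ("I", "K"): -3,
}


def pair_score(a, b, scores=TOY_SCORES):
    """Look up the score for aligning residues a and b (the table is symmetric)."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


def alignment_score(seq1, seq2, scores=TOY_SCORES):
    """Sum the pair scores of an ungapped alignment of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b, scores) for a, b in zip(seq1, seq2))


print(alignment_score("ALIK", "AILK"))
```

Storing only one triangle of the table and checking both key orders is one simple way to exploit the symmetry.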
4
Considerations
Conservative substitutions
  Prefer changes that respect structure and function
  E.g. hydrophobicity, size, chemical properties etc.
Frequencies
  How often do specific residues occur?
  Scores should weigh the rare ones higher
Evolution
  How distant are the sequences?
  Ancient divergence → more changes
5
The creation
Approach 1: Make an evolutionary model
  Use closely related sequences
  Extrapolate to greater distances
  Done in the PAM family
Approach 2: Look at related sequences
  Observe actual mutations in motifs
  Use sets of different overall identity
  Done in the BLOSUM family
6
[1] M.O. Dayhoff: Survey of new data and computer methods of analysis (1978), Atlas of protein sequence and structure, 5:3
The PAM250 matrix[1]
7
The PAM family
Defining characteristics
  PAM: Point (or percent) accepted mutations
  One PAM unit: 1 change per 100 residues (~10 million years)
  Not the probability that a residue A is something else
    What would PAM250 mean in that case?
  But: The number of expected changes overall
    Including changes A → B → A
8
Construction of PAM matrices
The general idea:
  Assume evolution is independent
  Describe "one step of evolution"
  The next step will follow
PAMx: Scoring matrix for x evolutionary steps
  Very similar sequences: Low x
  Very divergent sequences: Large x
What is an evolutionary step?
9
Evolutionary model
Assume independence between positions and over time
P(A→B) = P(B→A)
Consequences:
  How a sequence looks tomorrow depends ONLY on how it looks today
    Not on how it looked yesterday
    Known as a Markov chain
  The context does not matter
  Evolution is "reversible"
10
Building PAM1
1. Find a set of very similar sequences (Dayhoff used sequences at least 85% identical)
2. Make global alignments (71 groups)
3. Count the substitutions (1572 changes)
4. Calculate weights
11
From counts to scores
The simplified version
  Count mutation frequencies
    The respective probabilities
  Normalize by the mutability of the respective amino acid
    Makes comparisons possible
  Multiply the matrix by itself to reach the wanted evolutionary distance x
    Still probabilities
  Calculate log-odds score for (A→B + B→A)/2
    You have the PAMx matrix!
12
Why log-odds?
Summation instead of multiplication
  log(A·B) = log(A) + log(B)
S>0: More often than per chance
S=0: Number expected by chance
S<0: Less often than per chance
$$S_{i,j} = \log \frac{q_{i,j}}{p_i \, p_j}$$
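The sign rule for S can be checked numerically with a small sketch. Here `q_ij` stands for the observed frequency of the aligned pair and `p_i`, `p_j` for the background frequencies of the two residues; the numbers are invented for illustration, and base 2 (bits) is an assumption, since the slide does not fix the base of the logarithm.

```python
import math


def log_odds(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency over chance expectation."""
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies: both residues occur with background 0.1,
# so a pair is expected by chance with frequency 0.1 * 0.1 = 0.01.
print(log_odds(0.02, 0.1, 0.1))   # observed twice as often as chance: S > 0
print(log_odds(0.01, 0.1, 0.1))   # observed as often as chance: S close to 0
print(log_odds(0.005, 0.1, 0.1))  # observed half as often as chance: S < 0
```

Because the scores are logarithms, the score of a whole ungapped alignment is the sum of the per-position scores.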
13
The PAM1 matrix
Highly identical sequences
  1 evolutionary step
  PAM1 describes likelihood of changes
  Diagonal close to 1, off-diagonal close to 0
PAM2 = PAM1×PAM1
PAM3 = PAM1×PAM2
…
  Markov chain property (independence)
  Greater divergence through matrix multiplication
  Large PAM: More even values
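The effect of repeated multiplication can be seen on a toy example. The sketch below uses a hypothetical 3-letter alphabet with a one-step matrix `STEP1` whose diagonal is 0.99; it is not Dayhoff's PAM1, only an illustration of how the values even out as the number of steps grows.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, x):
    """Multiply m by itself x times, starting from the identity matrix."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step transition probabilities for a 3-letter alphabet.
STEP1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

step250 = mat_pow(STEP1, 250)
for row in step250:
    print([round(value, 3) for value in row])
```

Each row still sums to 1 after 250 steps, but the diagonal has dropped far below 0.99 and is only slightly larger than the off-diagonal entries.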
14
PAM problems
Good and bad things
  Based on evolution
  Assumes independence
  Global data → evolution of proteins
  Based on small dataset
  Extrapolates to greater divergence → information loss & error growth
  Evolutionary model partly verified
    But still simplified…
  Mutations happen at the nucleotide level
15
[2] S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from protein blocks (1992), Proc. Natl. Acad. Sci. USA, 89:10915–10919
The BLOSUM matrix[2]
16
The BLOSUM family
Look at conserved motifs in proteins
  Scan databases for motifs
  No gaps in alignment → a BLOCK
One protein can contain many motifs
  Make groups of motifs
~2000 blocks, >500 protein families
  The BLOCKS database
  BLOSUM: BLOCKS substitution matrices
17
The BLOSUM idea
Don't extrapolate data
  Collect divergent dataset
  Observe mutations directly
  Local alignment of sequences
  Evolution treated implicitly
  Still assumes independence between sites
18
Building BLOSUMx
Substitution scores for sequences of x %ID
1. Create sets of blocks of x %ID
2. Very similar sequences are grouped and weighted
   Avoid bias due to numbers
3. Count the substitutions
4. Calculate log-odds scores for all pairs
5. You have the BLOSUMx matrix!
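Steps 3 and 4 can be sketched on a single toy block. This is a simplified illustration under stated assumptions: the block is invented, the grouping and weighting of step 2 is skipped, and scores are reported in half-bit units without rounding (the function name `blosum_like_scores` is hypothetical).

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Count residue pairs column by column and turn them into log-odds scores."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Unordered pairs of different residues can arise in two ways.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)  # half-bit units
    return scores


# Invented gap-free block of four sequences, two columns wide.
toy_block = ["AL", "AL", "AI", "SL"]
for pair, score in sorted(blosum_like_scores(toy_block).items()):
    print(pair, round(score, 2))
```

With so few columns the numbers are not meaningful; the point is only the flow from pair counts to observed and expected frequencies to log-odds.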
19
Some general notes
  BLOSUM62 has proved to work for a range of similarities
  Other values (e.g. 30, 45 and 80) also used
  Today: BLOSUM used more than PAM
20
BLOSUM problems
Good and bad things
  Evolution not considered explicitly
  Known related proteins
  Assumes star-shaped phylogeny
  Based on conserved blocks / Local alignment
21
The matrix matters
  Example: BLAST using BLOSUM45 and BLOSUM80
  Note number of hits
  Note difference in scores
NCBI BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM
22
Nucleotide scoring matrices
Normally just a simple 1/-1 cost scheme
  Problematic simplification
  Especially for structural sequences (e.g. ncRNA)
  Consider evolution (PAM-like matrix)
  Use actual frequencies of nucleotides as background
  Weigh transitions and transversions differently
23
Gap penalties
We can score pairs of nucleotides/amino acids
  What about indels and gaps?
Two main strategies
Linear gap penalty: G(n) = a·n
  Penalize each gap position the same (cf. last lecture)
Affine gap penalty: G(n) = O + E·n
  More evolutionarily sound
  Opening a gap costs more than extending it
  Standard today
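The difference between the two strategies shows up when one long gap is compared with many short ones. The sketch below follows the slide's formulas G(n)=a·n and G(n)=O+E·n; the parameter values (a=2, O=10, E=1) are arbitrary assumptions.

```python
def linear_gap(n, a=2):
    """Linear penalty: every gap position costs the same."""
    return a * n


def affine_gap(n, opening=10, extension=1):
    """Affine penalty G(n) = O + E*n: opening is expensive, extending is cheap."""
    return opening + extension * n if n > 0 else 0


# One gap of length 6 versus six separate gaps of length 1.
print(linear_gap(6), 6 * linear_gap(1))
print(affine_gap(6), 6 * affine_gap(1))
```

Under the linear scheme both layouts cost the same, while the affine scheme strongly prefers a single long gap.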
24
Gaps: Things to consider
How large should O and E be?
  Depends on the scoring matrix used
  Normally good values are given
  Too large: No gaps created
  Too small: Too many gaps created
Should end gaps be penalized?
  Local alignment: No
  Global: Probably
25
Heuristics
So, now we know all about scoring an alignment
  Let us use Smith-Waterman or Needleman-Wunsch
  Against a database or a full-length genome?
Bad idea!
Instead: Heuristic method
  Designed to give a good solution fast
  Not guaranteed to find the optimal alignment
26
The BLAST algorithm
Basic Local Alignment Search Tool
You have all used it, but how does it work?
  Used to find statistically significant local alignments between a query and a database
  Many versions (protein-protein, nucleotide-nucleotide, nucleotide-protein …)
  Fast, widely used, good reason to know the algorithm
27
BLAST overview
The six steps
1. Filtering of low complexity regions (optional)
2. Compile list of relevant words
3. Scan database sequences
4. Extend hits to HSPs
5. Calculate E-value for significant hits
6. Report hits and alignments
28
BLAST step 1
Filtering
  Some sequences contain low complexity regions
  Give rise to many random hits
  Remember dotplots
  Filter out by replacing with Xs
29
BLAST step 2
Compiling word list
  Word length L=3 (protein) or L=11 (nucleotides)
  Choose score threshold T (usually 11)
  Find all words of length L in query sequence
    One for each position
  Compare to all possible length L words (8000/~4.2M)
  For each position in query, remove words below T
  Limited to approx. 50 words per position
30
The word list
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF…
APL PLS LSA SAD ADQ DQA…
Word list, position 1
APL: APC, APS, APT, APE…
Remove words w with score(APL,w)<T
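Compiling the neighbourhood of each query word can be sketched by brute force over all 8000 three-letter words. The scoring function here is a hypothetical stand-in (5 for identity, 1 within a made-up residue group, -2 otherwise), not BLOSUM62, so the resulting words differ from the slide's example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Made-up residue groups used only to build a toy scoring function.
GROUPS = ["AG", "ST", "ILVM", "FWY", "DE", "NQ", "KRH", "C", "P"]
GROUP_OF = {aa: group for group in GROUPS for aa in group}


def toy_score(a, b):
    """Hypothetical pair score: identity 5, same group 1, otherwise -2."""
    if a == b:
        return 5
    return 1 if GROUP_OF[a] == GROUP_OF[b] else -2


def neighbourhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


def word_list(query, length=3, threshold=11):
    """Neighbourhood words for every position of the query."""
    return {pos: neighbourhood(query[pos:pos + length], threshold)
            for pos in range(len(query) - length + 1)}


print(neighbourhood("APL", 11))
```

Raising the threshold T shrinks each neighbourhood, which speeds up the later scan at the cost of sensitivity.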
31
BLAST step 3
Scan database
  Store words for each position in an efficient search tree
  Scan each sequence in database
  Remember exact word hits
  If query length is 100, scan for approx. 50·100 words
32
Scanning
APL APA LPL IIAPLIGNESNAPAVQTLVGQLPLSHKARG…
Perfect match between word and database is a hit
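The scan can be sketched with the slide's three words and database fragment. Two assumptions are made here: a Python dictionary replaces the search tree mentioned in step 3, and all three words are taken to belong to query position 0 purely for illustration.

```python
def scan(database_seq, word_index, length=3):
    """Report (query_pos, db_pos, word) for every exact word match."""
    hits = []
    for db_pos in range(len(database_seq) - length + 1):
        word = database_seq[db_pos:db_pos + length]
        for query_pos in word_index.get(word, []):
            hits.append((query_pos, db_pos, word))
    return hits


# Words from the slide, mapped to a hypothetical query position.
word_index = {"APL": [0], "APA": [0], "LPL": [0]}
# The part of the database sequence shown on the slide (before the ellipsis).
database_seq = "IIAPLIGNESNAPAVQTLVGQLPLSHKARG"

for hit in scan(database_seq, word_index):
    print(hit)
```

Positions are 0-based, so the first hit at database position 2 corresponds to the third residue of the fragment.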
33
BLAST step 4
Extend hits (BLAST2)
  HSP: High-scoring Segment Pair
  Find two hits on same diagonal with distance ≤A
  Connect them (ungapped alignment)
  Extend using gaps, matches and mismatches
  Extension continues while score increases
  Stop when score drops X below highest score
Original BLAST
  Extend all single word hits (higher T needed)
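The stopping rule "score drops X below highest score" can be sketched for the simpler ungapped case. This is an illustration only: it extends a single word hit without gaps, uses a plain match/mismatch score with assumed values, and the function name `extend_hit` is hypothetical.

```python
def extend_hit(query, subject, q_pos, s_pos, word_len,
               x_drop=3, match=2, mismatch=-3):
    """Extend a word hit in both directions without gaps (X-drop rule).

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def score(a, b):
        return match if a == b else mismatch

    def extend(step, q, s):
        best = running = 0
        best_len = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            n += 1
            if running > best:
                best, best_len = running, n
            elif best - running >= x_drop:
                break  # score dropped X below the best seen so far
            q += step
            s += step
        return best, best_len

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(word_len))
    right, right_len = extend(+1, q_pos + word_len, s_pos + word_len)
    left, left_len = extend(-1, q_pos - 1, s_pos - 1)
    return seed + left + right, q_pos - left_len, q_pos + word_len + right_len


query = "CCCACGTACGTGGG"
subject = "TTTACGTACGTAAA"
total, start, end = extend_hit(query, subject, q_pos=5, s_pos=5, word_len=3)
print(total, query[start:end])
```

The extension is trimmed back to the position of the best score, so trailing mismatches seen before the drop-off are not part of the reported segment.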
34
BLAST extension Find diagonal hits Extend alignment
35
BLAST step 5
Calculate E-values
  Compile list of HSPs scoring more than S
  Let x be the score of an HSP
  The probability of seeing this score by chance: P(score ≥ x)
  The expectation of seeing this score in the database:
$$E \approx 1 - e^{-P(\text{score} \ge x)\,D}$$
  Rather complicated calculations
  The Karlin-Altschul equation (written in many ways)
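One commonly quoted way of writing the Karlin-Altschul equation is E = K·m·n·e^(−λS), with the probability of at least one chance hit given by P = 1 − e^(−E). The sketch below uses that form, which is an assumption about which variant the lecture has in mind; the values of `lam` and `k` are illustrative placeholders, since the real parameters depend on the scoring matrix and gap penalties.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of seeing at least one such hit by chance."""
    return 1.0 - math.exp(-e)


# Placeholder statistical parameters (assumed for illustration only).
LAM, K = 0.3, 0.1

for score in (40, 60, 80):
    e = e_value(score, query_len=150, db_len=1_000_000, lam=LAM, k=K)
    print(score, e, p_value(e))
```

For small E the two numbers are nearly identical, which is why E-values and P-values are often used interchangeably for significant hits.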
36
BLAST step 6
Report the results
  List the hits (sorted by E-value)
  Graphical representation
  Show Smith-Waterman alignment of the HSPs
37
Using BLAST
Settings
  Query sequence
    Sub-sequence
  Database important
  Conserved Domain search: Additional info
  What organisms to search
  Scoring matrix
  Normally leave the rest as defaults
    But you can change, well, anything
38
Understanding BLAST
Output
  Database searched and query used
  Number of hits
  Color-coded diagram
    Magnitude of score
    Relation between hits (if any)
  Hits sorted by E-value
  Alignments
  Additional info by clicking a link
39
BLAST summary
  Heuristic local alignment
  Finds word matches
  Extends locally
  Might miss optimal solutions
  Fast
  Lower E-value → better result
  Many hits between query and sequence possible
  Remember: Use a proper scoring matrix!
40
Position Specific Scoring Matrix
Another brief detour: PSSM or Profile
  Useful to represent (part of) a multiple alignment
  Represents the information in a motif
  For each position: What are the frequencies of the characters? (Nucleotides or amino acids)
  Frequencies can give the probabilities
  Find most probable hit to a PSSM in a sequence
  Compare to the background probability (i.e. the overall frequencies)
  Pseudocounts to avoid 0 probabilities
41
PSSM example
123456789
---------
AAGGTAAGT
TGTGTGAGT
CAGGTATAC
ATGGTAACT
TAGGTACTG
TAGGTCATT
ACAGTCAGT
CAGGTTGGA
TCCGTAAGT
GAGGTAAAC

 | 1 2 3 4 5 6 7 8 9
-|--------------------
A| 3 6 1 0 0 6 7 2 1
C| 2 2 1 0 0 2 1 1 2
G| 1 1 7 10 0 1 1 5 1
T| 4 1 1 0 10 1 1 2 6
Just divide by 10 to get probabilities
Assume equal background distribution
P(A)=P(C)=P(G)=P(T)=0.25
42
PSSM score
Comparing sequence CAGGTAGTC to the PSSM
Probability given the model
  0.2·0.6·0.7·1.0·1.0·0.6·0.1·0.2·0.2 = 0.0002016
Probability given the background
  0.25^9 = 0.000003815
log-odds score (bits)
$$\log_2 \frac{P_{\text{model}}}{P_{\text{bckg}}} = \log_2 \frac{0.0002016}{0.000003815} \approx 5.72$$
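The PSSM example can be reproduced in a few lines. The sketch below rebuilds the count table from the ten example sites and scores CAGGTAGTC against a uniform background of 0.25; it uses no pseudocounts, so a sequence that hits a zero count would need extra handling that is left out here.

```python
import math

SITES = [
    "AAGGTAAGT", "TGTGTGAGT", "CAGGTATAC", "ATGGTAACT", "TAGGTACTG",
    "TAGGTCATT", "ACAGTCAGT", "CAGGTTGGA", "TCCGTAAGT", "GAGGTAAAC",
]


def count_matrix(sites):
    """Count each nucleotide at each position of the aligned sites."""
    length = len(sites[0])
    return {base: [sum(1 for site in sites if site[pos] == base)
                   for pos in range(length)]
            for base in "ACGT"}


def log_odds_score(seq, counts, n_sites, background=0.25):
    """Log-odds (bits) of seq under the PSSM versus a uniform background."""
    p_model = 1.0
    for pos, base in enumerate(seq):
        p_model *= counts[base][pos] / n_sites
    p_background = background ** len(seq)
    return math.log2(p_model / p_background)


counts = count_matrix(SITES)
print(counts["A"])
print(round(log_odds_score("CAGGTAGTC", counts, len(SITES)), 2))
```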
43
PSI-BLAST
Position-Specific-Iterated BLAST
  Designed to find distant homologs
  More sensitive than BlastP
  Use PSSM instead of just sequence comparison
44
The PSI-BLAST algorithm
1. Perform standard BlastP search
2. Create PSSM based on query and best hits
3. Search database again for hits to the PSSM
4. Incorporate new hits (i.e. update frequencies)
5. Iterate (repeat) until convergence
End result:
  More distant homologs found
  Slower than standard BLAST (of course)
45
PSI-BLASTing a protein
  Output at first iteration, 2nd iteration, …
  When to stop?
  In this case: 7 iterations
PSI-BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM
46
Other BLAST versions
MegaBLAST
  Assumes high similarity, longer sequences
BLAST2SEQUENCES
  Blast on only two sequences (local alignments)
PHI-BLAST
  Pattern Hit Initiated BLAST: searches for motif
WU-BLAST
  BLAST from Washington University, not NCBI
47
BLAST variants
  Nucleotide vs nucleotide: blastn
  Protein vs protein: blastp
  Translated sequence vs protein database: blastx
  Protein sequence vs translated database: tblastn
  Translated sequence vs translated database: tblastx
  Find divergent protein sequences: PSI-BLAST
48
The BLAT algorithm
BLAST-Like Alignment Tool
  Designed for the genome projects
  Local alignments between long sequences
  Speed important!
  BLAST turned upside-down
49
What BLAT does
How does it work?
1. Index the database (length 11 words)
2. Find hits in the query
   Opposite the BLAST strategy
3. Extend hits to HSPs
Useful when the database does not often change
Too time (and space) consuming for normal BLAST
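Indexing the database once and then looking up query words can be sketched as below. Two simplifications are assumed: the word length is 4 instead of 11 so that the toy strings stay short, and every overlapping database word is indexed, which is not necessarily how BLAT lays out its index.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every length-k word of the database to its start positions."""
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Look up each query word in the prebuilt index; return (q_pos, db_pos)."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        for db_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, db_pos))
    return hits


database = "ACGTACGGTTACG"   # invented toy "genome"
index = build_index(database, k=4)   # built once, reused for every query
print(find_hits("TTACGG", index, k=4))
```

The index only pays off when it is reused across many queries, which matches the remark that the approach suits databases that rarely change.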
50
The FASTA algorithm
Similar to BLAST … but different
  Fast All (since it works with all alphabets)
  Not as widely used as BLAST, but one of the first
  Works in a step-wise fashion
  Locate word hits, extend using Smith-Waterman
51
The steps in FASTA
1. Find word hits
2. Score the hits and trim results
3. Join regions of similarity
4. Find the best alignment
52
FASTA step 1
Word hits
  Choose word size ktup (2 for protein, 4 or 6 for DNA)
  Create two word lists: Query and database
  Find all words that occur in both
  Connect nearby hits directly (i.e. no gaps)
53
Word hits Find all hits Connect hits on same diagonal
54
FASTA step 2
Score and trim
  Only keep the best 10 segments from step 1
  Re-evaluate all hits using PAM250
  For each hit: Note the best score
55
Score and trim Keep 10 best hits Recalculate scores
56
FASTA step 3
Join regions
  All regions scoring over a threshold are kept
  Crudely join regions
  Add linear gap penalty for joining to diagonals
  Keep the best scoring rough alignments
  Removes unlikely similar regions
57
Join and penalize Keep the good regions Connect using gaps. Remove the low scoring
58
FASTA step 4
Optimize alignment
  Consider the good but crude alignments
  Reoptimize using Smith-Waterman
  Only a small part of the matrix is needed
  Banded version (much faster)
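The banded idea can be sketched by filling only the cells close to a chosen diagonal. This is a simplified illustration, not FASTA's implementation: it returns only the best local score, uses a linear gap penalty, and the scoring values are assumptions.

```python
def banded_smith_waterman(a, b, diagonal, band, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(j - i) - diagonal| <= band."""
    best = 0
    prev = {}  # column index -> score in the previous row (band cells only)
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i + diagonal - band)
        hi = min(len(b), i + diagonal + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band count as 0, i.e. a fresh local start.
            score = max(0,
                        prev.get(j - 1, 0) + s,   # match or mismatch
                        prev.get(j, 0) + gap,     # gap in b
                        cur.get(j - 1, 0) + gap)  # gap in a
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


print(banded_smith_waterman("ACGTACGT", "TTACGTACGT", diagonal=2, band=1))
print(banded_smith_waterman("ACGTACGT", "ACGTTACGT", diagonal=0, band=1))
```

The work per row is proportional to the band width rather than to the length of the second sequence, which is where the speed-up comes from; the price is that alignments leaving the band are not found.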
59
Optimize alignment Optimize crude alignment Use banded Smith-Waterman
60
FASTA results
  Histogram of scores: actual scores versus expected scores
  Optimal alignments
  E-value: the expected number of chance hits with at least this score
61
BLAST vs. FASTA
Differences between the two
  BLAST searches the neighbourhood
  FASTA looks for exact matches
  BLAST returns all the best hits in a sequence
  FASTA returns one hit per sequence
  BLAST is faster than FASTA
  FASTA produces better final alignment
62
Summary
  Introduction to Markov chains and HMMs
  Scoring matrices: PAM and BLOSUM
    Larger PAM → Greater distance
    Larger BLOSUM → Greater identity
  Gap penalties
  BLAST, PSI-BLAST etc.
  BLAT
  FASTA
Next time: Multiple alignment