Blosum Substitution Matrix Pab
July 23, 2003 Patricia Babbitt, PhD, Univ. of Calif., San Francisco
Introduction to Bioinformatics: Protein Informatics
7/23/03 NHLBI Symposium: From Genome to Disease
Patricia C. Babbitt
University of California, San Francisco
“ –mastics, –omens & omics” (courtesy of Cambridge Healthtech Institute: 50 & counting...)
• Biome • Celluome • Chronome • Clinome • Complexome • Crystallome • Cytome • Diagnome • Enzymome • Epigenome • Fluxome • Foldome • Functome • Genome • Glycome • Infectuome
• Immunome • Interactome • Localizome • Metabolome • Methylome • Microbiome • Morphome • Operome • ORFeome • Pathogenome • Peptidome • Pharmacogenomics • Phenome • Phylogenome • Physiome
• Promoterome • Proteome • Pseudogenome • Regulome • Resistome • Ribonome • Secretome • Signalome • Somatonome • Toxicome • Transcriptome • Translatome • Unknome • Vaccinomics • Variome
Applications of Protein Informatics
• deduction of function
• tracing ancestral connections
• understanding enzyme mechanisms
• structural analysis of receptors, molecules involved in cell signaling
• identification of molecular surfaces in protein-protein, protein-DNA interactions
• protein engineering
• clustering of families, superfamilies
• metabolic computing/comparative genome analysis
Tools/Approaches for Protein Informatics
• database searching/pairwise alignments
• pattern searching and motif analysis
• multiple alignments
• phylogenetic tree construction
• sequence and structure comparison
• comparative genomics
• “metabolic computing”
• transmembrane/2° structure prediction
• 3D structure prediction/modeling
• visualization
• composition/pI/mass analysis
Protein vs. nucleic acid sequence analysis?
• Protein sequence analysis is more specific and less noisy than nucleic acid analysis due to the inherent differences in the message content of nucleic acid and amino acid codes
• 20-letter code vs 4-letter code, degeneracy of codon messaging
• But searches for many functional genomics experiments must be done at nucleotide level...
Outline: Performing your own Analyses in Protein Informatics
• Ins and Outs of database searching
  – underlying assumptions
  – scoring, optimization, statistical significance, caveats
• Fasta, Blast & PsiBlast
• Pattern searching & motif analysis
• Pre-computed analyses for protein families using sequence and structure information, motif databases
Database searching
• The first and most common operation in protein informatics...and the only way to access the information in large databases
• Primary tool for inference of homologous structure and function
• Improved algorithms to handle large databases quickly
• Provides an estimate of statistical significance
• Generates alignments
• Definitions of similarity can be tuned using different scoring matrices and algorithm-specific parameters
The underlying assumption used in functional inference...
…requires comparison of sequences
Sequence Conservation
Structure Conservation
Function Conservation
Formalizing the Problem
• Given: two sequences that you want to align
• Goal: find the best alignment that can be obtained by sliding one sequence along the other
• Requirements:
  – a scheme for evaluating matches/mis-matches between any two characters
  – a score for insertions/deletions
  – a method for optimization of the total score
  – a method for evaluating the significance of the alignment
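The "sliding" formulation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses identity scoring (match = 1, mismatch = 0, both hypothetical defaults), ignores insertions/deletions entirely, and says nothing about significance. The function name `best_ungapped_offset` is made up for this example.

```python
def best_ungapped_offset(seq1, seq2, match=1, mismatch=0):
    """Slide seq2 along seq1 without gaps and return (best_score, offset).

    offset is the index in seq1 where seq2[0] is placed; a negative
    offset means seq2 overhangs the start of seq1.
    """
    best_score, best_offset = float("-inf"), 0
    for offset in range(-(len(seq2) - 1), len(seq1)):
        score = 0
        for j, residue in enumerate(seq2):
            i = offset + j
            if 0 <= i < len(seq1):  # only score overlapping positions
                score += match if seq1[i] == residue else mismatch
        if score > best_score:  # ties keep the earliest offset
            best_score, best_offset = score, offset
    return best_score, best_offset


score, offset = best_ungapped_offset("SEQUENCEHOMOLOG", "SEQUENCEANALOG")
print(score, offset)
```

Replacing the match/mismatch values with a lookup into a substitution matrix, and adding a gap score, turns this toy into the general problem described above.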
Scoring Systems
• The degree of match between two letters can be represented in a matrix
• Changing the matrix can change the alignment
  – Simplest: Identity (unitary) matrix
  – Better: Definitions of similarity based on inferences about chemical or biological properties
  – Examples: PAM, Blosum, Gonnet matrices
• The score should have the form $p_{ab} / (q_a q_b)$, where $p_{ab}$ is the probability that residue a is substituted by residue b, and $q_a$ and $q_b$ are the background probabilities for residues a and b, respectively
• Handling gaps remains an incompletely solved problem...
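A minimal sketch of how a ratio of the form p_ab / (q_a q_b) becomes a matrix entry. The slide gives only the ratio; taking its logarithm (so that scores add along an alignment), the base-2 log, and the scale factor of 2 (half-bit units) are assumptions made here for illustration, and the probabilities in the usage lines are invented numbers, not values from any published matrix.

```python
import math


def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Return round(scale * log2(p_ab / (q_a * q_b))).

    p_ab : probability of seeing residues a and b aligned (substitution)
    q_a, q_b : background probabilities of a and b
    scale : assumed scaling factor (2.0 gives half-bit units)
    """
    return round(scale * math.log2(p_ab / (q_a * q_b)))


# Hypothetical numbers: a pair seen 8x more often than chance scores +6,
# a pair seen half as often as chance scores -2.
print(log_odds_score(0.02, 0.05, 0.05))
print(log_odds_score(0.00125, 0.05, 0.05))
```

A positive entry means the pair is observed more often than expected from background frequencies alone; a negative entry means less often.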
BLOSUM (BLOcks SUbstitution) Matrices
• Derived from the BLOCKS database, which, in turn, is derived from the PROSITE library (see http://blocks.fhcrc.org/blocks/; http://www.expasy.ch/prosite/)
• BLOCKS generated from multiply aligned sequence segments without gaps, clustered at various similarity thresholds and corrected to avoid sampling bias
• Derived from data representing highly conserved sequence segments from divergent proteins rather than data based on very similar sequences (as with PAM matrices)
Derivation of BLOSUM matrices*
• Many sequences from aligned families are used to generate the matrices
• Sequences identical at >X% are eliminated to avoid bias from proteins over-represented in the database
• Specific matrices refer to these clustering cut-offs, e.g., BLOSUM62 reflects observed substitutions between segments <62% identical
• These matrices have become the default scoring schemes used at most primary internet search sites
• Different matrices can make a difference to your results!
*adapted from Ewens & Grant, Statistical Methods in Bioinformatics
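To make the idea of "observed substitutions in aligned blocks" concrete, here is a toy sketch that counts residue pairs column by column in one tiny ungapped block and converts them to log-odds scores. It is a simplification, not the published procedure: the clustering/elimination step at X% identity is skipped, the block is invented, and the half-bit scaling is an assumption. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Observed frequency of each unordered residue pair over all columns."""
    pairs = Counter()
    for column in zip(*block):  # one column of the ungapped block at a time
        for a, b in combinations(column, 2):
            pairs[tuple(sorted((a, b)))] += 1
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}


def block_log_odds(block):
    """Half-bit log-odds scores for every pair observed in the block."""
    observed = pair_frequencies(block)
    # Background frequency of each residue, derived from the pair counts.
    background = Counter()
    for (a, b), f in observed.items():
        if a == b:
            background[a] += f
        else:
            background[a] += f / 2
            background[b] += f / 2
    scores = {}
    for (a, b), f in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(f / expected))
    return scores


toy_block = ["AS", "AS", "AA"]  # three aligned segments, two columns
print(block_log_odds(toy_block))
```

With a real block collection, pairs never observed would need separate handling (the sketch simply omits them), and sequences would first be grouped at the chosen identity cut-off.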
• scoring matrices are tailored to degree of divergence and may require a specific query length for optimal performance*

Query Length    Substitution Matrix
<35             PAM-30
35-50           PAM-70
50-85           BLOSUM-80
>85             BLOSUM-62

*adapted from information available at the NCBI Blast web site
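The table can be expressed as a small lookup. Note the assumption: the table lists 50 and 85 in two ranges each, so this sketch arbitrarily assigns a length of exactly 50 to PAM-70 and exactly 85 to BLOSUM-80. The function name is hypothetical.

```python
def suggest_matrix(query_length):
    """Map a query length (residues) to the matrix suggested in the table."""
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:  # boundary handling is an assumption
        return "PAM-70"
    if query_length <= 85:  # boundary handling is an assumption
        return "BLOSUM-80"
    return "BLOSUM-62"


for length in (20, 40, 70, 255):
    print(length, suggest_matrix(length))
```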
Scoring and optimization
[Figure: dot matrix plot of SEQUENCEHOMOLOG against SEQUENCEANALOG]
• Dot matrix plots: a simple description of alignment operations illustrating types of relationships between a sequence pair
• The signal-to-noise ratio can be improved using filtering techniques designed to minimize the composition-dependent background
• Example of common filters: overlapping, fixed-length "windows" for sequence comparison
• To be counted, a comparison must achieve a minimum threshold score summed over the window, derived empirically or from a statistical or evolutionary model of sequence similarity
• The window size and minimum threshold score (often termed "stringency") at which the score is counted can be user-defined
Seq1 = SEQUENCEHOMOLOG
Seq2 = SEQUENCEANALOG
Window = 7, Stringency = 42% (3/7 matches)

SEQUENC
SEQUENCEANALOG (7/7 matches)

 SEQUENC
SEQUENCEANALOG (0/7 matches)
...
      CEHOMOL
SEQUENCEANALOG (2/7 matches)
...
       HOMOLOG
SEQUENCEANALOG (3/7 matches)
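The window/stringency filter in this example can be sketched as follows. Identity matching and the function names are assumptions for illustration; indices are 0-based, and a dot is recorded at the window's start position rather than its center.

```python
def window_matches(seq1, seq2, i, j, window):
    """Count identical residues between seq1[i:i+window] and seq2[j:j+window]."""
    return sum(a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window]))


def dot_plot(seq1, seq2, window=7, min_matches=3):
    """Return (i, j) window start positions that meet the stringency."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            if window_matches(seq1, seq2, i, j, window) >= min_matches:
                dots.append((i, j))
    return dots


seq1, seq2 = "SEQUENCEHOMOLOG", "SEQUENCEANALOG"
print(window_matches(seq1, seq2, 0, 0, 7))  # SEQUENC vs SEQUENC
print(window_matches(seq1, seq2, 6, 6, 7))  # CEHOMOL vs CEANALO
print(window_matches(seq1, seq2, 8, 7, 7))  # HOMOLOG vs EANALOG
print(dot_plot(seq1, seq2))
```

Raising `min_matches` (the stringency) removes weak windows such as the 2/7 case; lowering it adds background dots.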
Window = 30; Stringency = 2
Window = 30; Stringency = 11
• To measure the local similarity between 2 sequences, scores can be used in the matrix instead of dots for a sliding window comparison
  – Summing the identities/similarities at each position
  – For a window of 5 residues and storing the score in the position corresponding to the center of the window:

Window at sequence positions 1-5:
P R I M E
S E Q U E N C E A N A L Y S I S P R I M E R
1 - 1 - 2 + 0 + 4 = +2
...
Window ending at sequence position 21:
                                P R I M E
S E Q U E N C E A N A L Y S I S P R I M E R
6 + 6 + 5 + 6 + 4 = +27
Statistical Significance
• A good way to determine if the alignment score has statistical meaning is to compare it with the score generated from the alignment of two random sequences
• A model of ‘random’ sequences is needed. The simplest model chooses the amino acid residues in a sequence independently, with background probabilities
• For an un-gapped alignment, the score of a match to a random sequence is the sum of many similar random variables, so the sum can be approximated by a normal distribution.
– Comparing a query sequence to a set of random sequences of uniform length results in scores that obey an extreme value distribution rather than a normal distribution; assuming a normal distribution can lead to overestimation of an alignment’s significance (see Altschul et al., 1994)
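One way to see what "compare with random sequences" means in practice is a shuffling experiment. The sketch below is an illustration only: it assumes a fixed ungapped alignment with identity scoring, uses shuffling of the target (which preserves its composition) as the random model, and reports an empirical p-value instead of fitting a normal or extreme value distribution. Function names, the number of trials, and the seed are arbitrary choices.

```python
import random


def ungapped_identity_score(a, b):
    """Number of identical positions in a fixed, ungapped alignment."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_scores(query, target, trials=1000, seed=0):
    """Scores of the query against shuffled copies of the target."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    letters = list(target)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(ungapped_identity_score(query, letters))
    return scores


def empirical_p_value(query, target, trials=1000, seed=0):
    """Fraction of shuffled scores at least as high as the observed score."""
    observed = ungapped_identity_score(query, target)
    scores = shuffled_scores(query, target, trials, seed)
    exceed = sum(s >= observed for s in scores)
    return (exceed + 1) / (trials + 1)  # +1 avoids reporting exactly zero


p = empirical_p_value("SEQUENCEANALOG", "SEQUENCEHOMOLOG")
print(p)
```

In a database search the reported score is the best of many comparisons, which is why the maximum, not the sum, governs the statistics and an extreme value distribution is the appropriate model.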
Caveats
• For database searches, the ONLY criterion available to judge the likelihood of a structural or evolutionary relationship between 2 sequences is an estimate of statistical significance
• Statistical significance and biological significance are NOT necessarily the same
Query= /phosphonatase/phosSt.gcg (255 letters) (10/20/99/pcb)
Database: /mol/seq/blast/db/swissprot 78,725 sequences; 28,368,147 total letters!

Sequences producing significant alignments: Score (bits) E Value

sp|O06995|PGMB_BACSU Begin: 93 End: 204 PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM) 38 0.020
sp|P31467|YIEH_ECOLI Begin: 1 End: 180 HYPOTHETICAL 24.7 KD PROTEIN IN TNAB-BGLB I... 36 0.10
sp|O14165|YDX1_SCHPO Begin: 34 End: 201 HYPOTHETICAL 27.1 KD PROTEIN C4C5.01 IN CHR... 31 2.6
sp|P41277|GPP1_YEAST Begin: 133 End: 200 (DL)-GLYCEROL-3-PHOSPHATASE 1 30 4.4
sp|Q39565|DYHB_CHLRE Begin: 3911 End: 4032 DYNEIN BETA CHAIN, FLAGELLAR OUTER ARM 29 7.6
sp|P77625|YFBT_ECOLI Begin: 143 End: 187 HYPOTHETICAL 23.7 KD PROTEIN IN LRHA-ACKA I... 29 10.0
sp|Q40297|FCPA_MACPY Begin: 146 End: 176 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN... 29 13
sp|P40853|GPHP_ALCEU Begin: 94 End: 188 PHOSPHOGLYCOLATE PHOSPHATASE, PLASMID (PGP) 29 13
sp|Q40296|FCPB_MACPY Begin: 146 End: 176 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN... 29 13
sp|P52183|ANNU_SCHAM Begin: 119 End: 168 ANNULIN (PROTEIN-GLUTAMINE GAMMA-GLUTAMYLTR... 29 13
sp|P40106|GPP2_YEAST Begin: 133 End: 200 (DL)-GLYCEROL-3-PHOSPHATASE 2 28 17
sp|P37934|MAY3_SCHCO Begin: 435 End: 552 MATING-TYPE PROTEIN A-ALPHA Y3 27 29
sp|O06219|MURE_MYCTU Begin: 255 End: 371 UDP-N-ACETYLMURAMOYLALANYL-D-GLUTAMATE--2,6... 27 29
sp|P08419|EL2_PIG Begin: 182 End: 245 ELASTASE 2 PRECURSO 27 38
sp|Q11034|Y07S_MYCTU Begin: 163 End: 218 HYPOTHETICAL 69.5 KD PROTEIN CY02B10.28C 27 38
sp|P00577|RPOC_ECOLI Begin: 1290 End: 1401 DNA-DIRECTED RNA POLYMERASE BETA' CHAIN (T 27 38
sp|P32662|GPH_ECOLI Begin: 20 End: 49 PHOSPHOGLYCOLATE PHOSPHATASE (PGP) 27 38
sp|P32662|GPH_ECOLI Begin: 116 End: 224 PHOSPHOGLYCOLATE PHOSPHATASE (PGP) 27 28
sp|P32282|RIR1_BPT4 Begin: 239 End: 266 RIBONUCLEOSIDE-DIPHOSPHATE REDUCTASE ALPHA C... 27 50
sp|P17346|LEC2_MEGRO Begin: 36 End: 121 LECTIN BRA-2 27 50
sp|P54947|YXEH_BACSU Begin: 24 End: 51 HYPOTHETICAL 30.2 KD PROTEIN IN IDH-DEOR IN... 27 50
sp|P77366|PGMB_ECOLI Begin: 95 End: 190 PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM) 27 50
sp|P30139|THIG_ECOLI Begin: 43 End: 79 THIG PROTEIN 27 50
sp|P95649|CBBY_RHOSH Begin: 96 End: 189 CBBY PROTEIN 27 50
sp|Q43154|GSHC_SPIOL Begin: 228 End: 327 GLUTATHIONE REDUCTASE, CHLOROPLAST PRECURSO... 26 66
sp|P34132|NT6A_HUMAN Begin: 191 End: 215 NEUROTROPHIN-6 ALPHA (NT-6 ALPHA) 26 66
sp|P34134|NT6G_HUMAN Begin: 115 End: 144 NEUROTROPHIN-6 GAMMA (NT-6 GAMMA) 26 66
sp|P95650|GPH_RHOSH Begin: 48 End: 114 PHOSPHOGLYCOLATE PHOSPHATASE (PGP) 26 66
[Figure: changes/100 amino acids (0-200) vs. millions of years since divergence (0-1000) for Hemoglobin, Fibrinopeptides, and Cytochrome C]
• Different proteins evolve at different rates
• Different domains within a single protein evolve at different rates
[Figure: Proinsulin (B-chain, C-peptide, A-chain) and Mature insulin (A-chain, B-chain) with the C-peptide; rates shown: r = 0.13 × 10^-9/site/year and r = 0.97 × 10^-9/site/year]
FASTA
• "Fast" search algorithm generates local alignments, allows gaps (see http://www.ebi.ac.uk/fasta33/)
• Extensively updated since first release
  – added statistical analysis
  – multiple variants available
  – FASTA3 is the current implementation
FASTA flavors (see http://fasta.bioch.virginia.edu/)
• FASTA Compares protein vs protein or DNA vs DNA
• FASTX/FASTY Compares DNA query to protein sequence db, DNA translated in 3 forward (or reverse) frames; allows frameshifts
• TFASTX Compares protein query vs DNA sequence or db, translated in all 6 reading frames; no accommodation for introns
• FASTS Compares a set of short peptide fragments derived from mass spectrometric proteomic analysis vs protein or DNA db
BLAST
• Original "fast" search algorithm generates local alignments without gaps (Blast 1.4)
• Newer versions (Blast 2.0x) accommodate gaps
• Access at NCBI and other sites: http://www.ncbi.nlm.nih.gov/BLAST/
• Documentation
  – Manual: http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html
  – FAQs: http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html
  – Tutorial: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST flavors
• blastp compares an amino acid query sequence against a protein sequence database
• blastn compares a nucleotide query sequence against a nucleotide sequence database
• blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
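The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be illustrated with a short sketch. This is not BLAST's implementation: it assumes the standard genetic code, upper/lower-case A/C/G/T input only, and writes any codon it cannot look up as "X". Frame labels such as "+1" and "-1" are a naming choice made here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3 (both strands)."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCC").items():
    print(frame, protein)
```

Each of the six protein strings would then be compared against a protein database (blastx) or against another set of six-frame translations (tblastx).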
Some Generalities about Fasta, Blast
• These methods are so widely used because they really are that good...
• BUT, there are some disadvantages:
  – Loss of sub-optimal alignments
  – Pairwise comparisons limit information content
  – Many biologically significant relationships may be lost in the "noise," i.e., hits that are not statistically significant
• BLAST is not “better” than FASTA
Psi-Blast: Extending our reach...
• Generalizes BLAST algorithm to use a position-specific score matrix in place of a query sequence and associated substitution matrix for searching the databases
• Position-specific score matrix generated from the output of a gapped Blast search, i.e., uses a profile or motif defined in the initial Blast search in place of a single query sequence and matrix for subsequent searches of the database
• Results in a database search “tuned” to the specific sequence characteristics of interest
Steps in a Psi-Blast search*
1. Constructs a multiple alignment from a Gapped Blast search and generates a profile from any significant local alignments found
2. The profile is compared to the protein database and PSI-BLAST estimates the statistical significance of the local alignments found, using "significant" hits to extend the profile for the next round
3. PSI-BLAST iterates step 2 an arbitrary number of times or until convergence
*Adapted from the PSI-BLAST tutorial at NCBI
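The profile-building step can be sketched as a position-specific log-odds matrix computed from an ungapped multiple alignment. This is a toy under stated assumptions, not PSI-BLAST's actual weighting scheme: sequences are unweighted, a simple background-proportional pseudocount is used, the alphabet and background frequencies are invented (five residues, uniform), and the search/iterate loop is left out. Function names are hypothetical.

```python
import math
from collections import Counter


def build_pssm(aligned, background, pseudocount=1.0):
    """Build a list of {residue: log2-odds} rows, one row per alignment column.

    aligned     : equal-length, ungapped sequences
    background  : {residue: background probability}
    pseudocount : total pseudocount per column, split by background
    """
    n = len(aligned)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        row = {}
        for aa, q in background.items():
            freq = (counts[aa] + pseudocount * q) / (n + pseudocount)
            row[aa] = math.log2(freq / q)
        pssm.append(row)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores for an ungapped segment."""
    return sum(row[aa] for row, aa in zip(pssm, segment))


background = {aa: 0.2 for aa in "DEGST"}  # toy uniform background
pssm = build_pssm(["GDS", "GDT", "GES"], background)
print(round(score_segment(pssm, "GDS"), 2))
print(round(score_segment(pssm, "SEG"), 2))
```

In an iterative search, segments scoring above a significance cut-off would be added to the alignment and the matrix rebuilt before the next round.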
PSI-BLAST information on the web
• Access at http://www.ncbi.nlm.nih.gov/BLAST/
• Tutorial at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html
• A short explanation of PSI-BLAST statistics at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-3.html
• See also: Park J et al., “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods,” JMB 284:1201-10, 1998
Other alternatives
• Many, many other DB searching algorithms are available
  – Smith-Waterman
  – Methods based on probabilistic models/profiles, e.g., Hidden Markov models
  – Motif searching
• Or, you can use (or start with) pre-computed analyses of protein families
Why do motif analysis?
• Identification of very distant homologs
• May point to important functional units in a protein
• Can be used to "anchor" a multiple alignment
• Databases of motifs can be used to develop other informatics applications
Example: BLOCKS → Blosum matrices
Motif analysis
• Focuses on conserved patterns among two or more sequences to determine relationships
• Many variants of motif searching available
  – Consensus-based, e.g., Prosite http://expasy.nhri.org.tw/prosite/
  – Manually annotated motifs, distant relationships, e.g., PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
  – Statistical, e.g., MEME (Multiple EM for Motif Elicitation) http://meme.sdsc.edu/meme/website/
  – Database searching, e.g., PHI-BLAST http://www.ncbi.nlm.nih.gov/BLAST/
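As an illustration of consensus-based motif searching, the sketch below converts a Prosite-style pattern into a Python regular expression and scans a sequence with it. It handles only a subset of the pattern syntax (residues, x, [..] choices, {..} exclusions, (n) and (n,m) repeats, and < > anchors); the converter name and the test sequences are invented, and the two patterns are used purely as examples of the syntax.

```python
import re


def prosite_to_regex(pattern):
    """Convert a Prosite-style pattern such as 'N-{P}-[ST]-{P}' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")  # N-terminal anchor
        anchor_end = element.endswith(">")      # C-terminal anchor
        element = element.strip("<>")
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                 # any residue
            regex = "."
        elif core.startswith("{"):      # any residue except those listed
            regex = "[^" + core[1:-1] + "]"
        else:                           # single residue or [ABC] choice
            regex = core
        if low:                         # repeat count (n) or range (n,m)
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print([m.start() for m in re.finditer(regex, "AANGSAANPSA")])
```

Patterns of this kind are exact-match rules, which is what distinguishes them from the statistical (MEME) and profile-based approaches listed above.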
Meme & Mast
• Meme: motif discovery tool http://meme.sdsc.edu/meme/website/intro.html
  – motifs represented as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern
  – output can be converted to BLOCKS which can then be converted to PSSMs (position-specific scoring matrices)
• Mast: database searching tool using one or more motifs as queries
  – provides a match score, represented as a p-value, for each sequence in the database compared with each of the motifs in the group of motifs provided
  – provides probable order and spacing of occurrences of the motifs in the sequence hits
Some pre-calculated motif/family compilations
• Prosite: Protein families/domains showing biologically important patterns (1637 different patterns, rules and profiles/matrices as of 6/03) http://us.expasy.org/prosite/
• Pfam: Multiple sequence alignments and HMMs for many protein domains (5724 families as of 5/03) http://pfam.wustl.edu/
• Prints: Conserved motifs characterizing protein families (1800 entries, encoding 10,931 individual motifs as of 4/03) http://bioinf.man.ac.uk/dbbrowser/PRINTS/
• Compilation of specific protein family websites at the MRC http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-family.html
Laboratory Exercises & Resources from Baygenomics
http://baygenomics.ucsf.edu/PGAConference2003/
• Using the LDL receptor as an example
  – DB searching
  – TMD prediction
  – Prosite, Pfam, Prints, Motif analysis
  – Multiple alignment generation and interpretation
  – Tree building/visualization
  – 2° structure/TMD prediction
  – 3D structure visualization
• Part of a 2-day hands-on workshop (and an online version)
  – extensive help files
  – detailed answer keys