B ioinform atics Alignment of biological sequences - databases and software
description
Transcript of B ioinform atics Alignment of biological sequences - databases and software
BBioinformioinformaticsatics
Alignment of biological sequencesAlignment of biological sequences - - databases and softwaredatabases and software
LU, 204,LU, 204, Juris ViJuris Viksnaksna
Praksē lietotās virkņu salīdzināšanas metodes
Smith-Waterman algoritms (1981)Lokālā salīdzināšanaLineāras gap penaltiesLieto substitūciju matricas
Vai laiks (n2) ir praktiski pieņemams?
Proteīna garums - ap 100 aminoskābju...Divu proteīnu salīdzināšana: 10000 op. (0.01 sek, pie 1 MHz)Proteīnu datubāze - 1000000 ieraksti...Meklēšana datubāzē: 108 op, 100 sekDivu datubāzu salīdzināšana - 1016 op, 25 gadi :(
Heiristiskas metodes - FASTA
• Hash table of short words in the query sequence• Go through DB and look for matches in the query hash
(linear in size of DB)• K-tuple determines word size (k-tup 1 is single aa)• Lipman & Pearson 1985
VLICTAVLMVLICTAAAVLICTMSDFFD
VLICT = _
[Adapted from M.Gerstein]
The FASTA Algorithm
4 steps:
• use lookup table to find all identities at least ktup long, find regions of identities (Fig.1A)
• rescan 10 regions (diagonals) with highest density of identities using PAM250 (Fig.1B)
• join regions if possible without decreasing score below threshold (Fig.1C)
• rescore ala Smith-Waterman 32 residues around initial region (Note: doesn’t save alignment) (Fig.1D)
FASTA Parameters
• ktup = 2 for proteins, 6 for DNA
• init1 Score after rescanning with PAM250 (or other)
• initn Score after joining regions
• opt Score after Smith-Waterman
[Fig.1 of Pearson and Lipman 1988]
FASTA algoritms
[Adapted from D Brutlag]
FASTA algoritms
FASTA histogrammas
• Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410
• Indexes query and DB• Starts with all overlapping words from query• Calculates “neighborhood” of each word using PAM matrix
and probability threshold matrix and probability threshold• Looks up all words and neighbors from query in database
index• Extends High Scoring Pairs (HSPs) left and right to maximal
length• Finds Maximal Segment Pairs (MSPs) between query and
database• Blast 1 does not permit gaps in alignments
[Adapted from M.Gerstein]
Heiristiskas metodes - Blast
BLAST algoritms• Keyword search of all words of length w from
the in the query of length n in database of length m with score above threshold– w = 11 for nucleotide queries, 3 for proteins
• Do local alignment extension for each found keyword– Extend result until longest match above threshold is achieved
• Running time O(nm)
[Adapted from S.Daudenarde]
BLAST algoritms
Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++KSbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
keyword
GVK 18GAK 16GIK 16GGK 14GLK 13GNK 12GRK 11GEK 11GDK 11
neighborhoodscore threshold
(T = 13)
Neighborhoodwords
High-scoring Pair (HSP)
extension
[Adapted from S.Daudenarde]
Original BLAST• Dictionary
– All words of length w
• Alignment– Ungapped extensions until score falls below some
threshold
• Output– All local alignments with score > statistical threshold
[Adapted from S.Daudenarde]
Original BLAST: Example
A C G A A G T A A G G T C C A G T
C
T
G
A
T
C C
T
G
G
A
T
T
G C
G
A
w = 4
Exact keyword match of GGTC
Extend diagonals with mismatches until score is under 50%
Output result:
GTAAGGTCCGTTAGGTCC
From lectures by Serafim Batzoglou (Stanford)
Gapped BLAST: Example
A C G A A G T A A G G T C C A G T
C
T
G
A
T
C C
T
G
G
A
T
T
G C
G
A• Original BLAST exact keyword
search, THEN:
• Extend with gaps in a zone around ends of exact match until score < threshold then merge nearby alignments
• Output result:
GTAAGGTCC-AGTGTTAGGTCCTAGT
From lectures by Serafim Batzoglou (Stanford)
BLAST - Programmasblastpcompares an amino acid query sequence against a protein sequence database
blastncompares a nucleotide query sequence against a nucleotide sequence database
blastxcompares a nucleotide query sequence translated in all reading frames against a protein sequence database
tblastncompares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastxcompares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and cpu-intensive
PSI-BLAST and transitive sequence comparison
Rezultātu vizualizācija (dotplots)
Proteīnu virkņu formāti - FASTA>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human). MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILLLATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLMTLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILILAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFTTILILMPTISLIENKMLKWA
Proteīnu virkņu formāti - SwissProtID CYB_HUMAN STANDARD; PRT; 380 AA.
AC P00156; Q34786; Q8HBR6; Q8HNQ0; Q8HNQ1; Q8HNQ9; Q8HNR4; Q8HNR7; AC Q8W7V8; Q8WCV9; Q8WCY2; Q8WCY7; Q8WCY8; Q9B1A6; Q9B1B6; Q9B1B8; AC Q9B2X7; Q9B2X9; Q9B2Y3; Q9B2Z0; Q9B2Z4; Q9T6H6; Q9T9Y0; Q9TEH4; DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot. DT 21-JUL-1986, sequence version 1. DT 19-SEP-2006, entry version 75. DE Cytochrome b. GN Name=MT-CYB; Synonyms=COB, CYTB, MTCYB; OS Homo sapiens (Human). OG Mitochondrion. OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; OC Catarrhini; Hominidae; Homo. OX NCBI_TaxID=9606; RN [1] RP NUCLEOTIDE SEQUENCE. RC TISSUE=Placenta;RX MEDLINE=81173052; PubMed=7219534 [NCBI, ExPASy, EBI, Israel]; RA Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H.L., RA Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A., RA Sanger F., Schreier P.H., Smith A.J.H., Staden R., Young I.G.; RT "Sequence and organization of the human mitochondrial genome."; RL Nature 290:457-465(1981).
Proteīnu virkņu formāti - SwissProtFT VARIANT 360 360 T -> A.
FT /FTId=VAR_013667. FT VARIANT 368 368 T -> I. FT /FTId=VAR_013668. SQ SEQUENCE 380 AA; 42730 MW; 90097B07FF2C1FD8 CRC64; MTPMRKINPL MKLINHSFID LPTPSNISAW WNFGSLLGAC LILQITTGLF LAMHYSPDAS TAFSSIAHIT RDVNYGWIIR YLHANGASMF FICLFLHIGR GLYYGSFLYS ETWNIGIILL LATMATAFMG YVLPWGQMSF WGATVITNLL SAIPYIGTDL VQWIWGGYSV DSPTLTRFFT FHFILPFIIA ALATLHLLFL HETGSNNPLG ITSHSDKITF HPYYTIKDAL GLLLFLLSLM TLTLFSPDLL GDPDNYTLAN PLNTPPHIKP EWYFLFAYTI LRSVPNKLGG VLALLLSILI LAMIPILHMS KQQSMMFRPL SQSLYWLLAA DLLILTWIGG QPVSYPFTII GQVASVLYFT TILILMPTIS LIENKMLKWA
Proteīnu virkņu formāti - NiceProt
Proteīnu virkņu datubāzes - UniProt
http://www.expasy.uniprot.org/
Proteīnu virkņu datubāzes - PIR
http://pir.georgetown.edu/
Proteīnu virkņu datubāzes - NRL-3D :)
Proteīnu struktūru datubāzes - PDB
http://www.pdb.org
Salīdzināšanas programmas - BLAST http://www.ncbi.nlm.nih.gov/BLAST/
http://www.ebi.ac.uk/blastall/
Salīdzināšanas programmas - FASTA http://fasta.bioch.virginia.edu/
http://www.ebi.ac.uk/fasta33/
Salīdzināšanas programmas - Smith-Waterman
http://www.ebi.ac.uk/MPsrch/
Salīdzinājumu novērtēšana
S = Total Score
S(i,j) = similarity matrix score for aligning i and j
Sum is carried out over all aligned i and j
n = number of gaps (assuming no gap ext. penalty)
G = gap penalty
nGjiSSji
,
),(
Simplest score (for identity matrix) is S = # matches
What does a Score of 10 mean? What is the Right Cutoff?
[Adapted from M.Gerstein]
Assessing sequence similarity
• Need to know how strong an alignment can be expected from chance alone
• “Chance” is the comparison of– Real but non-homologous sequences– Real sequences that are shuffled to preserve compositional
properties– Sequences that are generated randomly based upon a DNA or
protein sequence model (favored)
[Adapted from S.Daudenarde]
Model Random Sequence
• Necessary to evaluate the score of a match
• Take into account background– Adjust for G+C content
– Poly-A tails
– “Junk” sequences
– Codon bias
[Adapted from S.Daudenarde]
E-ValuesE-Value
Expectation value cutoff. An E-Value of 1 means that one would expect to find at most 1 alignment with a score as high as reported in a search of the given database. Setting the E-value higher will report more sequence matches
Resp., jo mazāka, jo labāk...
Rēķina apmēram šādi:
E = mn2-S’, kur m,n - virkņu garumi,S' - rēķina, piem., šādi: S’ = S–ln(K)
ln(2)
P-Values
P-Value
Varbūtība, ka šāds (vai labāks) rezultāts tiks iegūts "nejaušam"proteīnam
Arī, jo mazāka, jo labāk...
P-Values
• According to Poison’s distribution, the probability of finding b HSPs with a score S is given by:– (e-EEb)/b!
• For b = 0, that chance is:– e-E
• Thus the probability of finding at least one such HSP is:– P = 1 – e-E
[Adapted from S.Daudenarde]
P-Values• P(s > S) = .01
– P-value of .01 occurs at score threshold S (392 below) where score s from random comparison is greater than this threshold 1% of the time
• Likewise for P=.001 and so on.
[Adapted from M.Gerstein]
Z-score
General Protein Search Principles
• Choose between local or global search algorithms
• Use most sensitive search algorithm available
• Original BLAST for no gaps• Smith-Waterman for most
sensitivity• FASTA with k-tuple 1 is a good
compromise• Gapped BLAST for well delimited
regions• PSI-BLAST for families• Initially BLOSUM62 and default
gap penalties
• If no significant results, use BLOSUM30 and lower gap penalties
• FASTA cutoff of .01
• Blast cutoff of .0001
• Examine results between exp. 0.05 and 10 for biological significance
• Ensure expected score is negative
• Beware of hits on long sequences or hits with unusual aa composition
• Reevaluate results of borderline significance using limited query region
• Segment long queries 300 amino acids
• Segment around known motifs
(some text adapted from D Brutlag)
Uzskatīsim, ka proteīni ir homologi, ja “līdzība” parsniedz kaut kadu slieksni t
true positives (tp) - s(a,b) t un a, b ir homologifalse positives (fp) - s(a,b) t un a, b nav homologi
true negatives (tn) - s(a,b) < t un a, b nav homologifalse negatives (fn) - s(a,b) < t un a, b ir homologi
Sensitivity = tp/n = tp/(tp+fn)Specificity = tn/n = tn(tn+fp)
ROC diagrammas
ROC diagrammas
ROC diagrammas
Error rate (fraction of the “statements” that are false positives)
Coverage (roughly, fraction of sequences that one confidently “says something” about)
100%
100%Different score thresholds
Two “methods” (red is more effective)
Thresh=30
Thresh=20
Thresh=10
[sensitivity=tp/n=tp/(tp+fn)]
[Specificity = tn/n =tn/(tn+fp)]error rate = 1-specificity = fp/n
[Adapted from M.Gerstein]