School B&I TCD Bioinformatics
-
Upload
uriah-jimenez -
Category
Documents
-
view
28 -
download
0
description
Transcript of School B&I TCD Bioinformatics
![Page 1: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/1.jpg)
School B&I TCD Bioinformatics
Database homology searching
May 2010
![Page 2: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/2.jpg)
Why search a Database
• To find homologous sequences to your unknown to determine function
• To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA)
• To find the mouse or E.coli homolog of your gene of interest
• To find genes in a newly sequenced genome• To predict 3-D structure (blast vs PDB)
![Page 3: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/3.jpg)
BLAST
• For many molecular biologists
• a Blast search – with a DNA sequence – at NCBI – accepting default parameters
• IS bioinformatics– 70% of searches at NCBI at blastN
![Page 4: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/4.jpg)
BLAST
• Basic local alignment search tool• More or less unchanged for 20 years• Extraordinarily quick:
– Submit seq via satellite to Bethesda MD– Crunch thro 400 GB of sequence– Return results via satellite– ALL in 30 seconds
• Available at other sites (EBI, ExPaSy etc.)• with better response times, fewer conditions
![Page 5: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/5.jpg)
NCBI penalties
• 1st request: current time• 2nd request: current time + 60 seconds• 3rd request: current time + 120 seconds• 4th request: current time + 180 seconds• 5th request: current time + 240 second
• DON’T submit multiple simultaneous queries
![Page 6: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/6.jpg)
Reliable servers
• http://www.ch.embnet.org/software/bBLAST.html • http://www.ch.embnet.org/software/aBLAST.html
– Basic and advanced SwissBlast
• http://www.ebi.ac.uk/blast2/index.html – EBI (Cambridge, UK) NCBI Blast
• http://www.ncbi.nlm.nih.gov/blast/– Not the only Blast server!
![Page 7: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/7.jpg)
BLAST varieties• blastn: searches a DNA sequence against a DNA database like
EMBL, Genbank, or dbEST. also megablast for exact matches• blastp: searches a protein sequence against a protein database
such as UniProt, or "nr" a non-redundant database which ideally contains one copy of every available sequence.
Then you have:• blastx: searches a DNA sequence (translated in all six reading
frames) against a protein database.• tblastn: searches a protein sequence against a DNA database
(translated in all six reading frames) – essential for searching EST databases.
and in the interests of completeness there is:• tblastx: searches a DNA sequence (translated in all six reading
frames) against a DNA database (translated in all six reading frames).
finally• Psi-blast an iterative process using position specific subst matrix
PSSM to find distantly related sequences
![Page 8: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/8.jpg)
Algorithm
• Five steps– Break the query seq into short “words” – typically 3
consecutive residues for protein– Search the database for (nearly) exact matches to
these words– Extend match “hits” out on either side until score
stops going up – these are HSPs (high scoring segment pairs)
– Sort the HSPs by some “optimum” criterion– Significant hits are then formally scored, aligned and
displayed
![Page 9: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/9.jpg)
Alternatives to BLAST
• FASTA– A little slower than Blast– More sensitive (so recommended) for DNA to
DNA searches
• Smith-Waterman (Blitz)– Much slower than Blast (20x slower)– Much more sensitive
• But Blast is standard because it gives a “good enough” answer quickly.
![Page 10: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/10.jpg)
Blast Blast at NCBI
Parallel search in Conserved domain DB
Paste your query sequence here
![Page 11: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/11.jpg)
Input sequence parameters
Optional parameters to change include• Parameters that alter the hits found
– Database searched– Word size– Substitution matrix– Gap penalties– Low complexity masking
• Parameters that alter the results delivered– Expectation cut-off– Limit by organism or taxonomic group– Number of hits reported– Number of alignments shown
![Page 12: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/12.jpg)
Database
• If query is a coding gene: translate and search protein database
• Search PDB if you want a 3-D structure• Search NR if you want any hit• Search UniProt to know what the hits are• Search dbEST to know if your sequence is expressed
• UniProt90: no seq is more than 90% ident to any other (for an uncluttered tree) also UniProt50
![Page 13: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/13.jpg)
Word size
• Default at NCBI Blast– 11 for DNA; option of 7 and 5– 3 for protein; option of 2
• Increasing wordsize will speed search
… but will lose sensitivity– It will miss some useful but distant hits
• Decreasing wordsize will be more sensitive … but take longer
![Page 14: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/14.jpg)
Substitution matrix
• By default BLAST uses BLOSUM62
• Can change this– Blosum90 mirrors changes in closely related
sequences– Blosum30 changes in distant sequences
• Should run 3 blast searches with different substitution matrices.
![Page 15: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/15.jpg)
Substitution matricesTop left part of a BLOSUM 90 matrix A R N D C Q E G H I LA 5 -2 -2 -3 -1 -1 -1 0 -2 -2 -2R -2 6 -1 -3 -5 1 -1 -3 0 -4 -3N -2 -1 7 1 -4 0 -1 -1 0 -4 -4D -3 -3 1 7 -5 -1 1 -2 -2 -5 -5C -1 -5 -4 -5 9 -4 -6 -4 -5 -2 -2Q -1 1 0 -1 -4 7 2 -3 1 -4 -3E -1 -1 -1 1 -6 2 6 -3 -1 -4 -4G 0 -3 -1 -2 -4 -3 -3 6 -3 -5 -5H -2 0 0 -2 -5 1 -1 -3 8 -4 -4I -2 -4 -4 -5 -2 -4 -4 -5 -4 5 1L -2 -3 -4 -5 -2 -3 -4 -5 -4 1 5
Changes 30-62-90 not linear
Positive off-diagonals aresimilar
![Page 16: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/16.jpg)
Reality of matrix change
• Query: GHDEICI• GH + C • Sbjct: GHACNCG
– Scores 39 with Blosum30; 5 with Blosum90
• Query: HEQCRLEN• +E LEN• Sbjct: QENAHLEN
– Scores 19 with Blosum30; 24 with Blosum90
• So GHDEICIHEQCRLEN will find different hits
![Page 17: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/17.jpg)
Matrix families
• PAM – point accepted mutation– Margaret Dayhoff 1970s (before Genbank)
• BLOSUM – based on aligned blocks– Henikoff and Henikoff 1992– Blosum 62 default (based on aligned seqs
62% identical – don’t ask why 62: it works)
• Low PAM = High Blosum– PAM 250 == Blosum 30– PAM 30 == Blosum 90
![Page 18: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/18.jpg)
Gap penalty
• Original Blast was gap-free– Because gapped aligns much more CPU
• Now can insert “affine” gaps– Gap open 10; gap extend 1
• Raise gap penalty to discourage gaps– Preferentially gets closely related hits
• Lower gap penalty for sensitive search for distant relatives
![Page 19: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/19.jpg)
Low Complexity Masking• Seg masks low-complexity regions
– “too many” serines, prolines etc.• What a masked sequence looks like:>P04729 Wheat gamma gliadinMKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQQPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQIPIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQSRYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLGQQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTGVGAY*and after low complexity masking:
>P04729 SEG low-complexity maskedMKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTGVGAY*
• Xnu masks repeats (from database of common repeats)
![Page 20: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/20.jpg)
Masking• Check if masking on/off by default (differs at
sites)• Run 2 searches with masking on, masking off
ANNYway• Masking by hand:
– If you know about the DNA binding domain already• Which is really common and will occupy the top 100 hits
against any database– Replace the region with Xs– Re-run blast to find other motifs/domains/information– NCBI blast allows select subsequences
• DUST masks low info DNA sequences– Like polyA tails
![Page 21: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/21.jpg)
Advanced
![Page 22: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/22.jpg)
Expectation cut-off• E value
– Expected number of chance hits• Given the database searched• Query HSP length
• Related to probability• Default is 10: “expect to find 10 hits as
good as this by chance alone”• E values less than 10-4 unreliable • So different from p < 0.05 • For short seqs crank E up to 100 or 1000
![Page 23: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/23.jpg)
Data deluge – hits delivered
• Default suits general purpose
• V = 50 num of “hits” one line descriptions
• B = 50 num of alignments between query and hit which are displayed
![Page 24: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/24.jpg)
Swiss Blast
![Page 25: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/25.jpg)
EBI blast
![Page 26: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/26.jpg)
Limit by taxon/organism
Also search at www.ensembl.org for “genomed” organisms
![Page 27: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/27.jpg)
Database choice
NR for getting aNNy hitswissprot or refseq for getting hits that have annotationMonth for recent hits
![Page 28: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/28.jpg)
The output
• Is in five parts
1. Admin – date, size of database, id of query
2. Graphic display of query and hits
3. List of hits with links to database and to…
4. …alignments (may be > 1 HSP per hit)
5. More admin, including errors/warnings
![Page 29: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/29.jpg)
Notes!
• Record the top admin stuff. Your search will be different – Next week– On a different server– With different DB– With different input parameters
• Mouse-over any graphic display– Shows domain structure– Shows if hit is global or local
• Read the bottom admin stuff
![Page 30: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/30.jpg)
Blast outputsp|P06471|HOR3_HORVU B3-HORDEIN Length = 264 Score = 62.5 bits (149), Expect = 1e-09 Identities = 32/63 (50%), Positives = 38/63 (59%)Query: 131 LNPCARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIY LNPCARSQM R+EA+YSbjct: 111 LNPCARSQMLQQSSCHVLQQQCCQQLPQIPEQLRHEAVY Query: 191 SII 193 SI+Sbjct: 171 SIV 173
Low-complexity masked regionWhat sort of residues masked inThe query sequence??
May be more HSPs for same hitIf big deletion in either seq
![Page 31: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/31.jpg)
General blast protocol1. Find server, choose DB, paste seq, GO
2. Rerun search with/out masking
3. Rerun search with two diff subs matrix
4. 2 x 3 for six searches
5. If top N hits all same family/domain then XXX this region and resubmit
6. LOOK at the results; esp strange ones
7. Limit results by organism
![Page 32: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/32.jpg)
Blast notes
• Twilight zone <25% protein <70% NA• Because two sequences give high blast scores
to a third, doesn’t mean they are related– NNNNDOMAINANNNNDOMAINBNNN– NNNNDAMIANANNNNNNNNNNNNNN– NNNNNNNNNNNNNNNDIMIAMBNNN
• E-value vs % ID. >10-4 unreliable• Bit score <50 unreliable
![Page 33: School B&I TCD Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051416/56812b48550346895d8f660a/html5/thumbnails/33.jpg)
PSI-BLAST
• Uses a PSSM rather than BloSum/PAM
• Iterative …– Can find very distant relatives– …so deep insight
• BUT can iterate off with the fairies