BLAST
avrilcoghlan -
Dr Avril Coghlan
Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in Powerpoint
• Sequence alignment has many uses
  - Sequence assembly: genome sequences are assembled by using sequence alignment methods to find overlaps between many short pieces of DNA
  - Gene finding: alignment of whole genome sequences from two or more species can aid in discovery of previously unknown genes
  - Sequence divergence: the amount of sequence similarity between sequences (which can be calculated from a sequence alignment) tells us how closely they are related
  - Database searching: we use fast sequence alignment methods (eg. BLAST) to determine whether a protein/DNA sequence is similar to any known sequence
  - Prediction of function: if we know the function of a sequence, we can predict the function of similar sequences identified by database searching (eg. for fruitfly eyeless gene)
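As a rough illustration of the "sequence divergence" use listed above, sequence similarity can be summarised as percent identity between two rows of an existing alignment. This is a minimal Python sketch, not a standard tool: the function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption, since other conventions exist.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both rows carry the same letter.

    Both inputs are rows of one alignment, so they must be the same length;
    '-' marks a gap. Gap columns count towards the length but never as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)
```

For example, `percent_identity("ACGT-A", "ACCTGA")` counts 4 identical columns out of 6.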
• The number of DNA and protein sequences in public databases is very large
  - NCBI Protein database has ~38,500,000 protein sequences
• Searching a database involves aligning the query sequence to each sequence in the database, to find significant local alignments
[Figure: BLAST aligns query sequence A (eg. VIVALASVEGAS, a predicted protein from a candidate gene (ORF)) to each of the database sequences B (eg. TARQDEFGGA, etc.)]
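The exhaustive search pictured above (align query A to every database sequence B and keep the best local alignment score) can be sketched in Python as below. This is only an illustration under stated assumptions: it uses a score-only Smith-Waterman with a made-up scoring scheme (match +1, mismatch -1, gap -1) rather than a real substitution matrix, and the database entry `db2_hypothetical` is invented for the example. The nested loops over every residue pair of every database sequence show why this approach is slow on large databases.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every database sequence, best hit first."""
    hits = [(name, smith_waterman_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


query = "VIVALASVEGAS"  # query sequence A from the slide
database = {
    "db1": "TARQDEFGGA",               # database sequence from the slide
    "db2_hypothetical": "MKVIVALASQ",  # invented entry, for illustration only
}
for name, score in search_database(query, database):
    print(name, score)
```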
• Needleman-Wunsch & Smith-Waterman are too slow for searching databases
• Fast ‘heuristic’ methods are used, eg. BLAST
  - N.B. ‘heuristic’ means they’re not guaranteed to find the best solution (best alignment here), but they work okay
• BLAST was developed by Stephen Altschul & colleagues at NCBI in 1990
  - NCBI = National Center for Biotechnology Information (USA)
  - BLAST = ‘Basic Local Alignment Search Tool’
• The most used bioinformatics program
  - Altschul’s 1997 paper on BLAST has been cited >26,000 times!
There are two main steps in BLAST
  - Step 1: it makes a list of words of length k (eg. k = 3 amino acids) in the query sequence
  - It then looks for database sequences that share these words
  - Database sequences that share many words with the query are used for the final alignments (step 2)
[Figure: step 1. The query sequence (ADSKLWLLFKSLMNDKPFKKADFFADSDSKSKL...) is broken into 3-amino-acid words. Database sequence 1 (HIRTHIQLEQEWDSALIAAIQLE) doesn’t share words with the query; database sequence 2 (PDADSTESKLAKAIQLFVCTTILCYTSKLADS) shares words with it.]
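The word-matching step can be sketched in a few lines of Python, using the query and the two database sequences from the figure. The function names `words` and `shared_words` are hypothetical, and the sketch only looks for exactly identical words; protein BLAST also accepts similar, high-scoring words, which is not modelled here.

```python
def words(sequence, k=3):
    """Set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_words(query, subject, k=3):
    """Words of length k that occur in both the query and a database sequence."""
    return words(query, k) & words(subject, k)


query = "ADSKLWLLFKSLMNDKPFKKADFFADSDSKSKL"
database = {
    "Database sequence 1": "HIRTHIQLEQEWDSALIAAIQLE",
    "Database sequence 2": "PDADSTESKLAKAIQLFVCTTILCYTSKLADS",
}
for name, subject in database.items():
    print(name, sorted(shared_words(query, subject)))
```

With k = 3, the first database sequence shares no words with the query, while the second shares ADS and SKL, so only the second would go forward to the alignment step.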
Step 2: for a database sequence that shares many words with the query, it makes an alignment
  - A local alignment of the query & the database sequence
  - The alignment contains the initial region with shared words
  - However, the alignment may extend beyond that initial region
• BLAST finds islands of similarity between sequences
  - Given two sequences A and B, BLAST makes local alignments of pairs of subsequences of A and B
• BLAST reports local alignments between the query sequence A and a database sequence B
[Figure: step 2. Sequences A and B, joined by three local alignments (alignment 1, alignment 2, alignment 3)]
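To illustrate how an alignment can start from a shared word and extend beyond it, here is a simplified Python sketch of ungapped seed extension. It is an assumption-laden toy, not BLAST's actual procedure: the scoring (match +1, mismatch -1), the `x_drop` stopping rule and its default value, the function names and the example sequences are all chosen for illustration, and no gaps are allowed.

```python
def _extend(query, subject, qi, si, step, match, mismatch, x_drop):
    """Walk in one direction from (qi, si); return (best gain, letters kept)."""
    best = run = kept = walked = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        run += match if query[qi] == subject[si] else mismatch
        walked += 1
        if run > best:
            best, kept = run, walked
        elif best - run > x_drop:
            break  # score has fallen too far below the best seen so far
        qi += step
        si += step
    return best, kept


def extend_seed(query, subject, q_pos, s_pos, k, match=1, mismatch=-1, x_drop=2):
    """Extend an exact k-letter word match to the left and right, without gaps."""
    right_gain, right_len = _extend(query, subject, q_pos + k, s_pos + k, 1,
                                    match, mismatch, x_drop)
    left_gain, left_len = _extend(query, subject, q_pos - 1, s_pos - 1, -1,
                                  match, mismatch, x_drop)
    score = k * match + left_gain + right_gain
    length = left_len + k + right_len
    q_begin, s_begin = q_pos - left_len, s_pos - left_len
    return (score,
            query[q_begin:q_begin + length],
            subject[s_begin:s_begin + length])


# Hypothetical sequences sharing the word ADS at position 2 in both.
print(extend_seed("GGADSKLWLL", "TTADSKLWAA", q_pos=2, s_pos=2, k=3))
```

Here the shared word ADS grows rightwards into ADSKLW, and the extension stops where the two sequences stop agreeing.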
• You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites
• Compares a DNA/protein query sequence to a sequence database and calculates the statistical significance (P-value) of matches
• Website for searching GenBank and other NCBI sequence databases: http://www.ncbi.nlm.nih.gov/BLAST
  - Can be used to search the NCBI Nucleotide database (DNA sequences), as well as the NCBI Protein database
• There are 4 main types of BLAST search:
  - BLASTP: searches a protein database with a protein query
  - BLASTN: searches a DNA/RNA database with a DNA/RNA query
  - BLASTX: searches a protein database with a DNA/RNA query
  - TBLASTN: searches a DNA/RNA database with a protein query
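The four query/database combinations above amount to a small lookup table. The Python sketch below is purely illustrative: the names `BLAST_PROGRAMS` and `choose_blast_program` are hypothetical, and "nucleotide" stands for DNA/RNA.

```python
# Keys are (query type, database type), as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST search type for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            "types must be 'protein' or 'nucleotide', got "
            f"{query_type!r} and {database_type!r}"
        ) from None


# eg. a protein query (fruitfly Eyeless) against a protein database
print(choose_blast_program("protein", "protein"))
```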
• Many programs for sequence analysis/alignment (eg. CLUSTAL) expect the input sequences to be in FASTA format
  - Each sequence is preceded by a header line that starts with “>” followed by the sequence identifier
>fruitfly
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV
>human
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
>mouse
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
FASTA format
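Reading FASTA format as described above (a header line starting with ">" followed by the identifier, then the sequence) can be sketched in Python as follows. This is a minimal illustration rather than a full parser: the function name `parse_fasta` is hypothetical, and treating the first word after ">" as the identifier is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of identifier -> sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line has no sequence identifier")
            current = fields[0]
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # A sequence may be wrapped over several lines; collect them all.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Shortened, illustrative input (first residues of the sequences above).
example = ">fruitfly\nMFTLQ\nPTPTA\n>human\nMQNSH\n"
print(parse_fasta(example))
```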
• You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites
  - eg. we can use the fruitfly Eyeless protein sequence as a BLAST query sequence to search the UniProt database:
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV
Fruitfly Eyeless (898 amino acids long)
We go to www.uniprot.org and click on ‘Blast’ at the top:
• You will get a list of BLAST hits (database sequences with good alignments to your query, ie. to fruitfly Eyeless here):
• Each BLAST hit may have several local alignments to the query sequence
  - eg. the fruitfly Eyeless has human Eyeless as a BLAST hit, and several local alignments are reported for this pair:
• BLAST assesses the statistical significance of high-scoring database matches
• For each alignment between the query and a database protein, it calculates an E-value
• E-value: the number of database matches of a certain alignment score expected by chance, in a database of the size searched
• The lower the E-value, the more significant the alignment score for the sequence match
  - E = 1 means that we expect 1 match of that alignment score just by chance, in a database of the size searched
  - E = 10^-5 means that we expect to see 10^-5 matches of that alignment score just by chance, in a database of that size
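Because an E-value is an expected count rather than a probability, it can help to see how the two relate. The Python sketch below assumes that the number of chance matches follows a Poisson distribution, in which case the probability of at least one chance match is P = 1 - e^(-E). The function name is hypothetical, and this is an illustration of that relationship, not a description of how any particular BLAST server reports its numbers.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance match, assuming Poisson-distributed counts."""
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-5, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For very small E-values the two numbers are nearly identical (E = 10^-5 gives P of about 10^-5), while for E = 1 the probability of at least one chance match is about 0.63.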
• Significant BLAST hits are possibly homologues
• We use the E-value to judge if the database sequence is a homologue of the query
  - If E ≤ 10^-5, we are confident that the hit is a homologue
  - If E is between 10^-5 and 10, we are not sure if the hit is a homologue
  - If E is > 10, we are doubtful that the hit is a homologue
eg. searching UniProt using fruitfly Eyeless as our query:
eg. searching the NCBI Protein Database using fruitfly Eyeless as our query:
BLAST matches with high E-values may not be homologues (although it is often hard to tell if they are or not!)
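The rule of thumb given above for judging hits by E-value can be written as a small Python function. The cut-offs (10^-5 and 10) are the ones stated in this talk; the function name and the returned labels are hypothetical, and the cut-offs are guidelines rather than hard limits.

```python
def judge_hit(e_value):
    """Classify a BLAST hit using the talk's E-value rule of thumb."""
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    if e_value <= 1e-5:
        return "confident homologue"
    if e_value <= 10:
        return "not sure"
    return "doubtful"


for e in (1e-30, 0.5, 189):
    print(e, judge_hit(e))
```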
Problem
• Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:
(i) What does an E-value of 189 mean?
(ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?
Answer
• Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:
(i) What does an E-value of 189 mean?
  - An E-value of 189 means that we expect to see 189 BLAST hits with an alignment score as high as the top BLAST hit (ie. 28.9) by chance, when we search a database of the size searched
(ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?
  - An E-value of 189 is high, so we can’t be confident the top BLAST hit is a homologue of our query. We shouldn’t predict the function of our query sequence based on such a weak BLAST hit
Further Reading
• Chapter 3 in Introduction to Computational Genomics (Cristianini & Hahn)
• Chapter 6 in Computational Genome Analysis (Deonier et al)