BLAST


Dr Avril [email protected]

Note: this talk contains animations, which can only be seen by downloading the file and using ‘View Slide Show’ in PowerPoint

• Sequence alignment has many uses
  - Sequence assembly: genome sequences are assembled by using sequence alignment methods to find overlaps between many short pieces of DNA
  - Gene finding: alignment of whole genome sequences from two or more species can aid in the discovery of previously unknown genes
  - Sequence divergence: the amount of sequence similarity between sequences (which can be calculated from a sequence alignment) tells us how closely they are related
  - Database searching: we use fast sequence alignment methods (eg. BLAST) to determine whether a protein/DNA sequence is similar to any known sequence
  - Prediction of function: if we know the function of a sequence, we can predict the function of similar sequences identified by database searching (eg. for the fruitfly eyeless gene)
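As a small illustration of how similarity can be calculated from an alignment, here is a minimal sketch that computes percent identity from two already-aligned sequences. The function name and the choice to ignore gap columns are assumptions made for this example; other definitions of percent identity exist.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of non-gap alignment columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment (hypothetical sequences): one mismatch and one gap column
print(percent_identity("ACGTACGT", "ACGAAC-T"))
```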

• The number of DNA and protein sequences in public databases is very large
  - The NCBI Protein database has ~38,500,000 protein sequences

• Searching a database involves aligning the query sequence to each sequence in the database, to find significant local alignments

[Figure: BLAST aligns the query sequence A (eg. the predicted protein from a candidate gene (ORF), shown as VIVALASVEGAS) to each of the database sequences B in turn (TARQDEFGGA, VIVADAVISIRYDDEQAKMKQIRALQPSTQREGHQIALMPLKMVQRRASTILHGGQWLC, etc.)]
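The idea of aligning the query to every database sequence can be sketched with an exhaustive Smith-Waterman scan. This is only a toy illustration: the scoring scheme (match +2, mismatch -1, gap -2), the function names and the two-sequence database are assumptions for this example, and real protein searches use a substitution matrix rather than simple match/mismatch scores.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Align the query to every database sequence; best-scoring hits first."""
    hits = [(name, smith_waterman_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


query = "VIVALASVEGAS"
database = {"B1": "TARQDEFGGA", "B2": "VIVADAVIS"}  # toy database
for name, score in search_database(query, database):
    print(name, score)
```

Each comparison fills a table of size len(query) × len(sequence), so the total work grows with the size of the whole database.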


• Needleman-Wunsch & Smith-Waterman are too slow for searching databases

• Fast ‘heuristic’ methods are used instead, eg. BLAST
  - N.B. ‘heuristic’ means they’re not guaranteed to find the best solution (here, the best alignment), but they usually work well in practice

• BLAST was developed by Stephen Altschul & colleagues at NCBI in 1990
  - NCBI = National Center for Biotechnology Information (USA)
  - BLAST = ‘Basic Local Alignment Search Tool’

• The most used bioinformatics program
  - Altschul’s 1997 paper on BLAST has been cited >26,000 times!

There are two main steps in BLAST:
1. It makes a list of words of length k (eg. k = 3 amino acids) in the query sequence
  - It then looks for database sequences that share these words
  - Database sequences that share many words with the query are used for the final alignments (step 2)

[Figure, step 1: the query sequence ADSKLWLLFKSLMNDKPFKKADFFADSDSKSKL... is split into overlapping 3-letter words. Database sequence 1 (HIRTHIQLEQEWDSALIAAIQLE) doesn’t share words with the query; database sequence 2 (PDADSTESKLAKAIQLFVCTTILCYTSKLADS) shares words with it.]
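The word-sharing step can be sketched as follows, using the query and the two database sequences from the figure. This is a simplified illustration that only looks for identical words; the function names are made up for this example, and real BLAST also treats sufficiently similar words as matches.

```python
def words(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(query, subject, k=3):
    """Words of length k that occur in both sequences."""
    return words(query, k) & words(subject, k)


query = "ADSKLWLLFKSLMNDKPFKKADFFADSDSKSKL"
database = {
    "Database sequence 1": "HIRTHIQLEQEWDSALIAAIQLE",
    "Database sequence 2": "PDADSTESKLAKAIQLFVCTTILCYTSKLADS",
}
for name, seq in database.items():
    print(name, sorted(shared_words(query, seq)))
```

Only database sequences with shared words would go on to the alignment step.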

2. For a database sequence that shares many words with the query, it makes an alignment
  - This is a local alignment of the query & the database sequence
  - The alignment contains the initial region with shared words
  - However, the alignment may extend beyond that initial region

• BLAST finds islands of similarity between sequences
  - Given two sequences A and B, BLAST makes local alignments of pairs of subsequences of A and B

• BLAST reports local alignments between the query sequence A and a database sequence B

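[Figure, step 2: query sequence A and database sequence B, connected by three separate local alignments (alignment 1, alignment 2, alignment 3)]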
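How several local alignments can arise for one pair of sequences is sketched below: each shared word is used as a starting point and extended in both directions, without gaps, until the score drops too far. This is a simplified illustration, not BLAST's actual implementation; the scores (match +2, mismatch -1), the drop-off limit, the function names and the toy sequences are all assumptions for this example.

```python
def seed_hits(query, subject, k=3):
    """Yield (i, j) such that query[i:i+k] == subject[j:j+k]."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            yield i, j


def extend(query, subject, i, j, k=3, match=2, mismatch=-1, drop=3):
    """Extend a shared word without gaps; return (score, q_start, s_start, length)."""
    def walk(qi, sj, step):
        # Move away from the word, remembering the best-scoring extension length
        gain = best_gain = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            gain += match if query[qi] == subject[sj] else mismatch
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain > drop:
                break  # score has fallen too far below the best seen
            qi += step
            sj += step
        return best_gain, best_len

    right_gain, right_len = walk(i + k, j + k, +1)
    left_gain, left_len = walk(i - 1, j - 1, -1)
    score = k * match + left_gain + right_gain
    return score, i - left_len, j - left_len, left_len + k + right_len


def local_alignments(query, subject, k=3):
    """Distinct ungapped local alignments found from all shared words."""
    found = {extend(query, subject, i, j, k) for i, j in seed_hits(query, subject, k)}
    return sorted(found, reverse=True)


# Toy sequences with two islands of similarity in a different order
A = "HEAGAWGHEE" + "TTTT" + "KLMNPQRS"
B = "KLMNPQRS" + "CCCC" + "HEAGAWGHEE"
for score, q_start, s_start, length in local_alignments(A, B):
    print(score, A[q_start:q_start + length], B[s_start:s_start + length])
```

Here the two islands are reported as two separate local alignments between A and B.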

• You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites

• BLAST compares a DNA/protein query sequence to a sequence database and calculates the statistical significance (P-value) of matches

• Website for searching GenBank and other NCBI sequence databases: http://www.ncbi.nlm.nih.gov/BLAST
  - Can be used to search the NCBI Nucleotide database (DNA sequences), as well as the NCBI Protein database

• Four commonly used types of BLAST search:
  - BLASTP: searches a protein database with a protein query
  - BLASTN: searches a DNA/RNA database with a DNA/RNA query
  - BLASTX: searches a protein database with a DNA/RNA query
  - TBLASTN: searches a DNA/RNA database with a protein query
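The choice between these four programs depends only on the type of the query and the type of the database, which can be written as a small lookup table. The function name and the labels "protein" and "nucleotide" are assumptions for this sketch.

```python
# (query type, database type) -> BLAST program, for the four searches listed
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program for a given query type and database type."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'")


# eg. a protein query against a DNA database
print(choose_blast_program("protein", "nucleotide"))
```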


• Many programs for sequence analysis/alignment (eg. CLUSTAL) expect the input sequences to be in FASTA format
  - Each sequence is preceded by a header line that starts with “>” followed by the sequence identifier

>fruitfly
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV
>human
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
>mouse
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ

FASTA format
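Reading FASTA format is straightforward, because every record starts at a “>” line. Here is a minimal sketch of a parser; the function name is made up for this example, the sequences are shortened to the first few residues of the fruitfly and human proteins, and the identifier is assumed to be the first word after “>”.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of identifier -> sequence."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first word after ">"
            chunks = []
        elif name is not None:
            chunks.append(line)  # a sequence may span several lines
    if name is not None:
        records[name] = "".join(chunks)
    return records


example = """>fruitfly
MFTLQPTPTAIG
TVVPPW
>human
MQNSHSGVNQLG
"""
for identifier, sequence in parse_fasta(example).items():
    print(identifier, len(sequence), sequence)
```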

• You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites
  - eg. we can use the fruitfly Eyeless protein sequence as a BLAST query sequence to search the UniProt database:

MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV

Fruitfly Eyeless (898 amino acids long)

We go to www.uniprot.org and click on ‘Blast’ at the top:

http://www.uniprot.org/

• You will get a list of BLAST hits (database sequences with good alignments to your query, ie. to fruitfly Eyeless here):

• Each BLAST hit may have several local alignments to the query sequence
  - eg. the fruitfly Eyeless has human Eyeless as a BLAST hit, and several local alignments are reported for this pair:

• BLAST assesses the statistical significance of high-scoring database matches

• For each alignment between the query and a database protein, it calculates an E-value

• E-value: the number of database matches of a certain alignment score expected by chance, in a database of the size searched

• The lower the E-value, the more significant the alignment score for the sequence match
  - E = 1 means that we expect 1 match with an alignment score at least that high just by chance, in a database of the size searched
  - E = 10^-5 means that we expect to see 10^-5 matches with an alignment score at least that high just by chance, in a database of that size
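To see how the E-value depends on the score and on the database size, here is a minimal sketch that uses the standard relationship between E-value and normalised bit score S', E = m × n × 2^(-S'), where m is the query length and n is the total database length. The slides do not give this formula, so it is brought in here as an assumption; real BLAST also uses edge-corrected ‘effective’ lengths, which are ignored in this sketch. The conversion to a P-value, P = 1 - e^(-E), is likewise the usual one, and the numbers in the example are hypothetical.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E-value for a bit score, assuming E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Chance of at least one such match, assuming P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: a 100-residue query against 10**9 database residues
for bits in (30, 40, 50, 60):
    e = expected_hits(bits, 100, 10**9)
    print(bits, e, p_value(e))
```

Each extra bit of score halves the E-value, and a database twice as large doubles it. For small E-values, P and E are almost equal.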

• Significant BLAST hits are possibly homologues
• We use the E-value to judge if the database sequence is a homologue of the query
  - If E ≤ 10^-5, we are confident that the hit is a homologue
  - If E is between 10^-5 and 10, we are not sure if the hit is a homologue
  - If E > 10, we are doubtful that the hit is a homologue
  - eg. searching UniProt using fruitfly Eyeless as our query:

eg. searching the NCBI Protein Database using fruitfly Eyeless as our query:

BLAST matches with high E-values may not be homologues (although it is often hard to tell if they are or not!)
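The rule of thumb used in this talk (E ≤ 10^-5, between 10^-5 and 10, > 10) can be written as a small function. The function name and the wording of the returned labels are made up for this sketch; the cut-offs are guidelines rather than hard rules.

```python
def homology_call(e_value):
    """Classify a BLAST hit using the E-value cut-offs from the talk."""
    if e_value <= 1e-5:
        return "confident that the hit is a homologue"
    if e_value <= 10:
        return "not sure if the hit is a homologue"
    return "doubtful that the hit is a homologue"


# Hypothetical E-values for three hits
for e in (1e-50, 0.5, 189):
    print(e, homology_call(e))
```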

• Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:

(i) What does an E-value of 189 mean?
(ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?

Problem

• Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:

(i) What does an E-value of 189 mean?
An E-value of 189 means that we expect to see 189 BLAST hits with an alignment score as high as the top BLAST hit (ie. 28.9) by chance, when we search a database of the size searched.
(ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?
An E-value of 189 is high, so we can’t be confident the top BLAST hit is a homologue of our query. We shouldn’t predict the function of our query sequence based on such a weak BLAST hit.

Answer

Further Reading
• Chapter 3 in Cristianini & Hahn, Introduction to Computational Genomics
• Chapter 6 in Deonier et al, Computational Genome Analysis
