What is BLAST? BLAST ® (Basic Local Alignment Search Tool) is a set of similarity search programs...

What is BLAST?

BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.

“Local” means that it searches for and aligns sequence segments rather than aligning entire sequences. It can therefore detect relationships among sequences that share only isolated regions of similarity.

It is one of the most widely used and accepted sequence analysis tools.

Why BLAST?

• Identify unknown sequences – The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is well characterized, then you may have access to a wealth of biological information.

• Help predict gene/protein function and structure – genes with similar sequences tend to share similar functions or structures.

• Identify protein families – group related (paralogous or orthologous) genes and their proteins into a family.

• Prepare sequences for multiple alignments

• And more …

Different types of homology search

[Figure: example pairwise alignment of two nucleotide sequences; "|" marks identical bases and "-" marks gaps.]

• DNA vs. DNA

• Protein vs. protein

• Translated DNA vs. protein (or the other way around)

• Translated DNA vs. translated DNA

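Basic BLAST programs and databases

Program   Query                                        Database
blastn    Nucleotide sequence                          Nucleotide DB
blastp    Protein sequence                             Protein DB
blastx    Nucleotide sequence translated in 6 frames   Protein DB
tblastn   Protein sequence                             Nucleotide DB translated in 6 frames
tblastx   Nucleotide sequence translated in 6 frames   Nucleotide DB translated in 6 frames

A translated query or translated DB contains amino acid sequences.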
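The pairings of query type and database type can also be written as a small lookup. This is only an illustrative sketch: the function name `choose_program`, its argument labels, and the `translate_both` flag are hypothetical and are not part of BLAST itself.

```python
def choose_program(query_type, db_type, translate_both=False):
    """Return the basic BLAST program for a query/database pairing.

    query_type and db_type are "nucleotide" or "protein" (hypothetical labels).
    translate_both=True requests a protein-level comparison of a nucleotide
    query against a nucleotide database (both translated in six frames).
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    programs = {
        ("nucleotide", "nucleotide"): "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",    # query is translated
        ("protein", "nucleotide"): "tblastn",   # database is translated
    }
    return programs[(query_type, db_type)]


print(choose_program("nucleotide", "protein"))
```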

How Does BLAST Work

Three-step procedure:

1. Compare the query sequence to every database entry. For each entry, if a segment of a certain length (the word size) matches part of the query sequence, that is a hit.

Example (word size = 7):

Query: GTTGACCGTTAGCCGACGTTAAGCT

DB entry: ACATAGCCCGTTAGCCGCTGATACGACCGTAC

2. Extend each hit at both ends; extension stops once the alignment score drops too far below the best value seen so far. Extended segments that score above the threshold become “high-scoring segment pairs” (HSPs).

3. A Smith-Waterman-like algorithm is used to do local alignment around each HSP.
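The word-hit and extension steps can be sketched on the example query and database entry. This is a simplified illustration, not the actual BLAST implementation: the function names are hypothetical, and the scoring values (+1 per match, -3 per mismatch, drop-off of 5) are assumptions chosen only for the example.

```python
QUERY = "GTTGACCGTTAGCCGACGTTAAGCT"
DB_ENTRY = "ACATAGCCCGTTAGCCGCTGATACGACCGTAC"


def find_word_hits(query, subject, word_size=7):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, q_start, s_start, word_size=7,
               match=1, mismatch=-3, x_drop=5):
    """Ungapped extension of one hit; returns (q_start, s_start, length, score)."""
    best = score = word_size * match
    right = left = 0
    # Extend to the right until the score falls more than x_drop below the best.
    q, s, k = q_start + word_size, s_start + word_size, 0
    while q + k < len(query) and s + k < len(subject):
        score += match if query[q + k] == subject[s + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > x_drop:
            break
    # Extend to the left, starting again from the best score reached.
    score, k = best, 0
    while q_start - 1 - k >= 0 and s_start - 1 - k >= 0:
        score += match if query[q_start - 1 - k] == subject[s_start - 1 - k] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > x_drop:
            break
    length = left + word_size + right
    return q_start - left, s_start - left, length, best


hits = find_word_hits(QUERY, DB_ENTRY)
q0, s0, length, score = extend_hit(QUERY, DB_ENTRY, *hits[0])
print(hits)
print(QUERY[q0:q0 + length], score)
```

With word size 7 the two sequences share four overlapping words, and extending the first one recovers the 10-base segment CCGTTAGCCG that both sequences contain.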

Blastn ~ Construct Queries

paste your sequence here

specify search region

choose database

nr ~ non-redundant database

Others are subsets of nr database.

Blastn ~ Options

Limit results to certain organisms only.

Example: protease NOT hiv1[Organism]

The smaller the word size, the higher the sensitivity.

Lower EXPECT thresholds are more stringent.

Blastn ~ Filters

• Low-complexity: Some sequence segments are biologically uninteresting (e.g., hits against common acidic-, basic- or proline-rich regions). Such segments are identified by the SEG (protein) or DUST (nucleotide) program and masked out.

• Human repeats: This option masks human repeats (LINEs and SINEs) and is especially useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases which contain a large number of repeats (e.g. htgs).

• Mask for lookup table only: BLAST searches consist of two phases, finding hits based upon a lookup table and then extending them. This option tells BLAST to apply the other filters only in the first phase.

• Mask Lower Case: Query residues in lower case are masked. This allows users to define custom filtering regions.

Blastn ~ When to Use

Your query sequence is a nucleotide sequence. Blastn can help to

• Find the identity of your query sequence.

• Find sequences similar to your query sequence.

Blastn returns nucleotide sequences stored in NCBI databases.

Variant of blastn ~ MegaBlast:

It is specifically designed to efficiently (up to 10 times faster) find long alignments between very similar sequences.

Interpret BLAST results - Distribution

This image shows the distribution of BLAST hits on the query sequence. Each line represents a hit. The span of a line represents the region where similarity is detected. Different colors represent different ranges of scores.

Query sequence

BLAST hits. Click to access the pairwise alignment.

Interpret BLAST results - Description

ID (GI #, refseq #, DB-specific ID #) Click to access the record in GenBank

Bit score – higher, better. Click to access the pairwise alignment

Expect value (E value) – lower, better. It is the number of hits with at least this score that would be expected by chance.

Gene/sequence Definition

The description (also called definition) lines are listed below under the heading "Sequences producing significant alignments". The term "significant" simply refers to all those hits whose E value was less than the threshold. It does not imply biological significance.

Links

Interpret BLAST results – Pairwise Alignment

Query line: the segment from query sequence.

Subj line: the segment from hit (subject) sequence.

Middle line: marks the positions where the query and subject bases are identical.

Blastp ~ Protein – Protein DB

Blastp is used for both identifying a query amino acid sequence and for finding similar sequences in protein databases. Like other BLAST programs, blastp is designed to find local regions of similarity. However, when sequence similarity spans the whole sequence, blastp will report a global alignment, which is the preferred result for protein identification purposes.

Unlike nucleotide BLAST, there is no comparable MEGABLAST for protein searches.

Blastp ~ Special Parameters

Matrix: a table of scores that are assigned to various amino acid substitutions. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees.

BLOSUM-62 matrix is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. For short queries, PAM matrices may be used instead.

Gap: penalties for opening a new gap, or for extending an existing gap.

Exercise

Find out how the gap cost is calculated. For a gap of length k, is the cost

gap_exist + k * gap_ext   or   gap_exist + (k-1) * gap_ext ?

Blastp ~ Special Parameters

For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

Query Length   Substitution Matrix   Gap Costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)
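The table can be turned into a small lookup. The sketch below copies the values from the table as given; the function name `recommend_protein_parameters` is hypothetical, and because the table's ranges touch at 50 and 85, assigning those exact lengths to the lower row is an assumption.

```python
def recommend_protein_parameters(query_length):
    """Return (matrix, (gap_existence, gap_extension)) for a protein query length.

    Values are copied from the provisional table; lengths of exactly 50 and 85
    are assigned to the lower row (an assumption, the table is ambiguous there).
    """
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (10, 1)


print(recommend_protein_parameters(20))
print(recommend_protein_parameters(300))
```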

BLOSUM62 matrix

  C S T P A G N D E Q H R K M I L V F Y W

C 9 -1 -1 -3 0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2

S -1 4 1 -1 1 0 1 0 0 0 -1 -1 0 -1 -2 -2 -2 -2 -2 -3

T -1 1 4 1 -1 1 0 1 0 0 0 -1 0 -1 -2 -2 -2 -2 -2 -3

P -3 -1 1 7 -1 -2 -1 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4

A 0 1 -1 -1 4 0 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -2 -3

G -3 0 1 -2 0 6 -2 -1 -2 -2 -2 -2 -2 -3 -4 -4 0 -3 -3 -2

N -3 1 0 -2 -2 0 6 1 0 0 -1 0 0 -2 -3 -3 -3 -3 -2 -4

D -3 0 1 -1 -2 -1 1 6 2 0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4

E -4 0 0 -1 -1 -2 0 2 5 2 0 0 1 -2 -3 -3 -3 -3 -2 -3

Q -3 0 0 -1 -1 -2 0 0 2 5 0 1 1 0 -3 -2 -2 -3 -1 -2

H -3 -1 0 -2 -2 -2 1 1 0 0 8 0 -1 -2 -3 -3 -2 -1 2 -2

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 2 -1 -3 -2 -3 -3 -2 -3

K -3 0 0 -1 -1 -2 0 -1 1 1 -1 2 5 -1 -3 -2 -3 -3 -2 -3

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 -2 0 -1 -1

I -1 -2 -2 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 2 1 0 -1 -3

L -1 -2 -2 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 3 0 -1 -2

V -1 -2 -2 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 -1 -3

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 3 1

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7 2

W -2 -3 -3 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

Basic idea

Conserved regions from multiple sources are aligned into blocks

The identity level is high; therefore we know they are homologues without a score matrix.

Frequency of AA pairs

37 columns, each column has 3*(3-1)/2 = 3 pairs. In total 111 pairs.

Pair I-L occurs 3 times and pair L-L occurs 13 times, so P_{IL} = 3/111 and P_{LL} = 13/111.

Total amino acids: 111. P_I = 2/111, P_L = 21/111.

2 * P_I * P_L < P_{IL} and P_L * P_L < P_{LL}: both pairs occur more often than expected by chance.

Blosum

Score(x,y) = log_2 (p_{xy} / e_{xy}), where e_{xy} = 2 p_x p_y for x ≠ y and e_{xx} = p_x p_x
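As a worked example, the log-odds formula can be applied to the I-L and L-L counts given for the example block. This is a minimal sketch that follows the formula exactly as stated here; the function name `log_odds_score` is hypothetical.

```python
import math


def log_odds_score(pair_freq, freq_x, freq_y, same_residue):
    """log2 of observed pair frequency over the frequency expected by chance."""
    expected = freq_x * freq_y if same_residue else 2 * freq_x * freq_y
    return math.log2(pair_freq / expected)


TOTAL_PAIRS = 111      # 37 columns x 3 pairs per column
TOTAL_RESIDUES = 111   # 37 columns x 3 sequences

p_I = 2 / TOTAL_RESIDUES
p_L = 21 / TOTAL_RESIDUES

score_IL = log_odds_score(3 / TOTAL_PAIRS, p_I, p_L, same_residue=False)
score_LL = log_odds_score(13 / TOTAL_PAIRS, p_L, p_L, same_residue=True)

print(round(score_IL, 2), round(score_LL, 2))
```

Both scores come out positive (about 1.99 for I-L and about 1.71 for L-L), which matches the observation that both pairs occur more often than chance would predict.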

BLOSUM 62

Some protein families are better studied, so they are over-represented in the database.

To remove this bias from the statistics, those proteins are clustered together before the BLOSUM calculation.

BLOSUM 62

Weight 0.5

Weight 1

Sequences that are 62% or more identical are grouped together, and each group is given a total weight of 1.

This way, AA pairs are counted only between groups that are less than 62% identical.

The lower this number, the better suited the matrix is to distant homology searches.

Blastx ~ nucleotide – protein DB

Blastx is useful for finding similar proteins to those encoded by a nucleotide query. It compares the translation of the nucleotide query sequence to a protein database. Because blastx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or it contains errors that may lead to frame shifts or other coding errors. Thus blastx search is often the first analysis performed with a read from a newly derived sequence and is used extensively in analyzing EST sequences.
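Six-frame translation can be illustrated with a short sketch. It assumes the standard genetic code (NCBI translation table 1) and an unambiguous A/C/G/T sequence; the function names are hypothetical and this is not how blastx itself is implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(frame, protein)
```

Each of the six translations would then be compared to the protein database, which is why the reading frame of the query does not need to be known in advance.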

Blastx ~ Attention

ATTENTION:

1. You have to make sure that your query sequence is a nucleotide coding region.

2. Blastx is not applicable to non-coding genomic DNA/RNA (introns, intergenic regions, tRNA, rRNA), because these do not encode protein.

Blastx ~ Special Parameters

Different species may use different genetic codes to encode the same amino acid. You have to specify the appropriate genetic code (translation table) for your query sequence based on the organism and source.

Blastx ~ Interpret Results

Middle line:

letters ~ consensus amino acid residues

+ ~ similar amino acid residue

white space ~ unmatched
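The rule for the middle line can be sketched as follows. The example uses two ungapped protein segments and only the handful of BLOSUM62 entries it needs, copied from the matrix shown earlier; the function names are hypothetical, and "similar" is taken to mean a positive substitution score, which is an assumption about how the "+" is assigned.

```python
# A few BLOSUM62 entries for the mismatched pairs in the example below;
# a real implementation would load the full 20 x 20 matrix.
SUBSTITUTION_SCORES = {
    ("N", "K"): 0,
    ("V", "C"): -1,
    ("K", "P"): -1,
    ("V", "M"): 1,
}


def pair_score(a, b):
    """Look up a substitution score regardless of the order of the pair."""
    if (a, b) in SUBSTITUTION_SCORES:
        return SUBSTITUTION_SCORES[(a, b)]
    return SUBSTITUTION_SCORES[(b, a)]


def middle_line(query, subject):
    """Letter for identical residues, '+' for similar ones, space otherwise."""
    if len(query) != len(subject):
        raise ValueError("aligned segments must have the same length")
    chars = []
    for a, b in zip(query, subject):
        if a == b:
            chars.append(a)
        elif pair_score(a, b) > 0:
            chars.append("+")
        else:
            chars.append(" ")
    return "".join(chars)


print("VLNVWGKVEAD")
print(middle_line("VLNVWGKVEAD", "VLKCWGPMEAD"))
print("VLKCWGPMEAD")
```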

Tblastn ~ protein – translated DB

A tblastn search allows you to compare a protein sequence to the six-frame translations of a nucleotide database. It can be a very productive way of finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG), located in BLAST databases est and htgs, respectively.

Tblastx ~ nucleotide – translated DB

tblastx takes a nucleotide query sequence, translates it in all six frames, and compares those translations to the database sequences dynamically translated in all six frames. This effectively performs a more sensitive blastp search without doing the manual translation.

tblastx gets around the potential frame shifts and ambiguities that may prevent certain open reading frames from being detected. This is very useful in identifying potential proteins encoded by single-pass read ESTs. In addition, it is a good tool for identifying novel genes.

Other BLAST programs

PSI-BLAST: Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins. Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to...".

Other BLAST programs

BLAST 2 Sequences: "BLAST 2 Sequences" is designed for direct comparison of two sequences. It takes two input sequences and regards the second sequence as the database. If the second sequence is present in NCBI databases, using its GI/Accession instead of the FASTA sequence allows the program to incorporate the translation and other sequence features found in that record into the final result, making it more informative.

Other BLAST programs

Search for short and near exact matches: Normal parameters for standard BLAST are too stringent for short query sequences. Therefore, appropriate parameters are set for short and near exact matches.

• For nucleotide (<20 bp): A common use is to check the specificity of primers used in the polymerase chain reaction (PCR) or hybridization. The two primers can be concatenated into one query: forward primer – NNNNNNNNNN – reverse primer. Since BLAST looks for local alignments and searches both strands, there is no need to reverse complement one of the primers before doing the concatenation or the search. Use word size 7, E value 1000, no filter.

• For protein (<10-15mer): use matrix PAM30, E value 20000, word size 2, no filter.

Summary - If your sequence is NUCLEOTIDE

Length            DB     Purpose                                                               Program
20 bp or longer   Nucl   Identify the query sequence                                           MegaBlast, blastn
                  Nucl   Find sequences similar to query sequence                              blastn
                  Nucl   Find similar proteins to translated query in a translated database    tblastx
                  Prot   Find similar proteins to translated query in a protein database       blastx
7-20 bp           Nucl   Find primer binding sites or map short contiguous motifs              Search for short, nearly exact matches

Summary - If your sequence is PROTEIN

Length                  DB     Purpose                                                                              Program
15 residues or longer   Prot   Identify the query sequence or find protein sequences similar to query               blastp
                        Prot   Find members of a protein family or build a custom position-specific score matrix    PSI-BLAST
                        Nucl   Find similar proteins in a translated nucleotide database                            tblastn
5-15 residues           Prot   Search for peptide motifs                                                            Search for short, nearly exact matches

Raw Score, Bit Score, P-value and E-value

Score Matrix

BLOSUM62

Raw Score and E-value

VLNVWGKVEAD

VLKCWGPMEAD

raw score = S(V,V)+S(L,L)+S(N,K)+…+S(D,D)

The two sequences are substrings of the query and of the subject (database) sequence, respectively.

Because there is no gap, this is called an HSP (High-scoring Segment Pair).
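The raw score of this HSP can be worked out by summing the substitution score of each aligned pair. The sketch below includes only the BLOSUM62 entries needed for this one example, copied from the matrix shown earlier; the function name `raw_score` is hypothetical.

```python
# BLOSUM62 entries needed for the example HSP; a real implementation
# would load the full 20 x 20 matrix.
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("L", "L"): 4, ("N", "K"): 0, ("V", "C"): -1,
    ("W", "W"): 11, ("G", "G"): 6, ("K", "P"): -1, ("V", "M"): 1,
    ("E", "E"): 5, ("A", "A"): 4, ("D", "D"): 6,
}


def raw_score(query_segment, subject_segment, matrix):
    """Sum the substitution scores of an ungapped alignment, pair by pair."""
    if len(query_segment) != len(subject_segment):
        raise ValueError("ungapped segments must have the same length")
    total = 0
    for a, b in zip(query_segment, subject_segment):
        # The matrix is symmetric, so try both orders of the pair.
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total


print(raw_score("VLNVWGKVEAD", "VLKCWGPMEAD", BLOSUM62_SUBSET))
```

The eleven pair scores are 4, 4, 0, -1, 11, 6, -1, 1, 5, 4 and 6, giving a raw score of 39.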

Is this HSP significant? Can it occur purely by chance? The E-value of this raw score is the number of expected occurrences if both query and database are random sequences.

How to compute E-value from raw score

There is rigorous mathematical analysis behind this, but we only need to know the following: if the query sequence has length m and the database has length n, then the number of non-overlapping HSPs with score at least x expected by chance is

E = K*m*n*exp(-lambda * x)

This makes sense:

Doubling the length of either sequence should double the number of HSPs attaining a given score.

Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score.

Bit Score

Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda.

Bit score is the “normalized” score:

bitscore = (lambda * rawscore - ln K) / ln 2

Therefore, E-value = m*n*2^(-bitscore)
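The relationship between raw score, bit score and E-value can be checked numerically. In this sketch the function names are hypothetical, and the values of K, lambda, m and n are illustrative assumptions, not parameters reported by any particular BLAST run.

```python
import math


def e_value_from_raw(raw_score, m, n, K, lam):
    """Expected number of chance HSPs: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """The same E-value from the bit score alone: E = m * n * 2^(-S')."""
    return m * n * 2 ** (-bits)


# Illustrative values only (assumptions for this example).
K, LAMBDA = 0.134, 0.318
m, n = 250, 1_000_000
S = 39

bits = bit_score(S, K, LAMBDA)
print(bits)
print(e_value_from_raw(S, m, n, K, LAMBDA))
print(e_value_from_bits(bits, m, n))
```

The two E-value computations agree, which shows why the bit score can be interpreted without knowing K and lambda: only the search space size m*n is needed.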

Exercise

BLASTp: Retrieve horse myoglobin and run a BLASTp search. What do you get? What is hemoglobin?

TBLAST: Find the DNA sequence corresponding to horse myoglobin. Can you do the reverse translation without knowing the DNA sequence?
