Best Practices for BLAST® Searching, Analysis, and Post-processing
PIUG Boston Biotech Meeting 2010
Gin-Yun Eggerichs
Chemical Abstracts Service
Agenda
• Understanding BLAST settings
  – Settings for long sequences
  – Settings for short sequences
• Analyzing BLAST results
  – % score vs. % identity
• Post-processing
BLAST background
• Basic Local Alignment Search Tool
  – Developed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman, and Webb Miller
  – Originally developed to help scientists
    • Identify a sequence and its possible functions
    • Increase search speed
    • Avoid loss of search sensitivity
  – Optimized for long sequences (over 50 residues)
    • BLAST parameters need to change for short sequences
• For more information on BLAST, visit:
  – www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
BLAST default settings for long sequences
• BLAST default settings are fine for most searches
  – Sequences greater than 80-85 residues
  – Expectation value (E value): 10
    • Can change to a smaller E value (0.01 or 0.001)
  – Word size: 3
  – Weight Matrix: BLOSUM 62
  – Gap Cost: Open: 11 Extend: 1
  – Low Complexity Filtering: ON
    • Can turn the filtering OFF
Expectation value (E value) is a statistical significance threshold
• Depends on the length of the query sequence and the size of the database
• Represents the number of hits with a score equal to or better than “S” that would be expected by chance
  – An E value of 10 (default) means that 10 hits with scores equal to or better than the score “S” are expected to occur by chance
• The lower the E value (towards 0), the more significant the retrieved sequence hits are
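The dependence on query length and database size can be made concrete with the standard relationship E = m * n * 2^(-S'), where m is the query length, n is the database length in residues, and S' is the bit score. The following is a minimal sketch only: it assumes raw lengths (BLAST implementations apply effective-length corrections), and the function name and example numbers are illustrative, not taken from the slides.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2^(-S') with raw lengths; real BLAST uses
    effective lengths, so actual reported values will differ somewhat.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same query and score against a small and a large (hypothetical) database:
    # the larger database gives a proportionally larger E value.
    for db_len in (1_000_000, 100_000_000):
        print(db_len, expect_value(40.0, 260, db_len))
```

Each additional bit of score halves the E value, and a database 100 times larger raises it 100-fold, which is why the same alignment can look significant in one database and not in another.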
Decreasing the E value most often removes the sequences with the lowest scores
E=10 retrieved sequences from all score ranges.
E=0.01 retrieved sequences with BLAST scores of 40 and above.
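Why a smaller E value cuts off the low-scoring hits can be sketched by inverting the usual relationship E = m * n * 2^(-S') to get the minimum bit score that passes a given threshold. This is an illustration under the same simplifying assumption of raw lengths; the query and database sizes below are hypothetical and are not the ones behind the slide's screenshots.

```python
import math


def min_bit_score(e_threshold: float, query_len: int, db_len: int) -> float:
    """Smallest bit score whose E value is at or below `e_threshold`."""
    return math.log2(query_len * db_len / e_threshold)


if __name__ == "__main__":
    query_len, db_len = 260, 50_000_000  # hypothetical sizes
    for e in (10, 0.01, 0.001):
        print(e, round(min_bit_score(e, query_len, db_len), 1))
```

Tightening the threshold from 10 to 0.01 raises the required bit score by log2(1000), roughly 10 bits, regardless of the query or database size.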
Weight matrix is a set of scores used to determine the similarity of the sequences
• BLOSUM matrix (BLOcks SUbstitution Matrix)
  – Substitution frequencies are based on substitutions observed in aligned blocks of related proteins
  – BLOSUM62 is derived from blocks in which sequences sharing 62% identity or more are clustered together
• PAM matrix (Point Accepted Mutation)
  – Matrix calculated based on comparison of closely related proteins
[Figure: scale running from "More similar" to "Less similar" sequences, placing BLOSUM 62 (default), BLOSUM 80, BLOSUM 45, PAM 30, and PAM 70]
Low complexity filtering masks areas with unusual compositions
• Low-complexity sequences can retrieve false positives
  – Repeating residue segments result in higher scores
  – The rest of the sequence may not match up well
  – The retrieved sequences may be of little relevance
• More relevant sequences that have lower scores may be excluded from the BLAST answer set
• Turn OFF the low complexity filtering if
  – More answers are required
  – The search is not limited to biologically interesting regions
  – Short sequences are desired
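The idea of masking compositionally biased regions can be illustrated with a simplified sliding-window entropy filter. This is not the algorithm BLAST uses (protein BLAST relies on the SEG program); it is a rough sketch, and the window size and threshold are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace residues in low-entropy windows with 'X'.

    Simplified stand-in for a real low-complexity filter: every window
    whose entropy is at or below `threshold` is masked in full.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical query with a poly-Q repeat followed by a varied region.
    print(mask_low_complexity("QQQQQQQQQQQQ" + "ACDEFGHIKLMNPQRSTVWY"))
```

Masked residues cannot seed or extend a match, so hits that depend only on the repeat disappear. With filtering OFF, those hits come back, which is the trade-off described above.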
Optimized BLAST settings for long sequences
• BLAST default settings are fine for most searches
  – Sequences greater than 80-85 residues
  – Expectation value (E value): 0.01 or 0.001
  – Word size: 3
  – Weight Matrix: BLOSUM 62
  – Gap Cost: Open: 11 Extend: 1
  – Low Complexity Filtering: OFF
• For patent searching, turn OFF the Low Complexity Filtering
• For non-patent searching, leave the Low Complexity Filtering ON
BLAST settings for short sequences
• Sequences less than 50 residues
  – Expectation value (E value): 1,000
    • A short sequence may match perfectly but still have low statistical significance
  – Word size: 3 or 2
    • The query length must be at least twice the word size
  – Weight Matrix
    • Sequence length (SQL) 30-50: PAM 70
    • Sequence length (SQL) <30: PAM 30
  – Low Complexity Filtering: OFF
  – Gap Cost: Open: 10 Extend: 1
    • Use the default Gap Cost associated with the Weight Matrix
Recommended NCBI BLAST settings

Setting                   Protein >85    Protein 50-85   Protein 35-50   Protein <35
                          residues       residues        residues        residues
E value                   10             10              1,000           1,000
Word Size                 3              3               3 or 2          2
Weight Matrix             BLOSUM 62      BLOSUM 80       PAM 70          PAM 30
Gap Cost (open, extend)   11, 1          10, 1           10, 1           9, 1
Low Complexity Filter     ON             OFF             OFF             OFF
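The recommended settings can be encoded as a small lookup keyed on query length. The sketch below also shows how such settings might map to an NCBI BLAST+ `blastp` command line. That mapping is an assumption: the slides do not say which BLAST interface was used. The handling of the overlapping boundaries (50 goes to the 50-85 column, 35 to the 35-50 column), the choice of word size 2 where the table says "3 or 2", and the file and database names are all illustrative.

```python
def recommended_settings(query_len: int) -> dict:
    """Settings from the table above for a protein query of `query_len` residues."""
    if query_len > 85:
        row = (10, 3, "BLOSUM62", 11, 1, True)
    elif query_len >= 50:          # assumption: 50 belongs to the 50-85 column
        row = (10, 3, "BLOSUM80", 10, 1, False)
    elif query_len >= 35:          # assumption: word size 2 chosen from "3 or 2"
        row = (1000, 2, "PAM70", 10, 1, False)
    else:
        row = (1000, 2, "PAM30", 9, 1, False)
    keys = ("evalue", "word_size", "matrix", "gapopen", "gapextend",
            "low_complexity_filter")
    return dict(zip(keys, row))


def blastp_args(query_len: int, query_path: str, db_name: str) -> list:
    """Build an argument list for BLAST+ blastp (paths and db are placeholders)."""
    s = recommended_settings(query_len)
    return [
        "blastp",
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(s["evalue"]),
        "-word_size", str(s["word_size"]),
        "-matrix", s["matrix"],
        "-gapopen", str(s["gapopen"]),
        "-gapextend", str(s["gapextend"]),
        "-seg", "yes" if s["low_complexity_filter"] else "no",
    ]


if __name__ == "__main__":
    # Hypothetical 28-residue peptide query.
    print(" ".join(blastp_args(28, "query.fasta", "my_protein_db")))
```

Keeping the table in one function makes it harder to mix settings from different columns by accident, for example PAM 30 with the BLOSUM 62 gap costs.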
BLAST (or bit) scores are normalized raw scores
• Raw score: Σ(identity and mismatch scores) - Σ(gap penalties)
  – Not informative because it is not standardized
  – Difficult to compare alignments, especially if different matrices are used
• In theory, the higher the score, the better the alignment
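The relationship between raw and bit scores can be sketched as follows. The raw score sums per-position scores and subtracts gap penalties, and the bit score rescales it as S' = (lambda * S - ln K) / ln 2. Several things here are assumptions for illustration: the flat match and mismatch scores stand in for a real substitution matrix, a gap of length k is charged open + k * extend, and the lambda and K defaults are values commonly quoted for BLOSUM62 with gap costs 11, 1.

```python
import math


def raw_score(aligned_a: str, aligned_b: str, match: int = 4, mismatch: int = -1,
              gap_open: int = 11, gap_extend: int = 1) -> int:
    """Raw score of a gapped alignment ('-' marks a gap), toy scoring scheme."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for x, y in zip(aligned_a, aligned_b):
        side = "a" if x == "-" else "b" if y == "-" else None
        if side is None:
            score += match if x == y else mismatch
        else:
            if side != gap_side:      # a new gap starts: charge the opening cost
                score -= gap_open
            score -= gap_extend       # every gap position pays the extension cost
        gap_side = side
    return score


def bit_score(raw: float, lam: float = 0.267, k: float = 0.041) -> float:
    """Normalize a raw score; lam and k are assumed statistical parameters."""
    return (lam * raw - math.log(k)) / math.log(2)


if __name__ == "__main__":
    s = raw_score("ACDEFG-HIK", "ACDQFGWHIK")
    print(s, round(bit_score(s), 1))
```

Because lambda and K are specific to the matrix and gap costs, two raw scores produced with different matrices are not comparable until they are converted to bits.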
Identities are a useful way to judge the relevance of a sequence
• Patents often claim sequences using percent identity
• Identity
  – The extent to which two sequences are the same (invariant)
552 is the highest BLAST score with 100% (260/260) identities.
BLAST scores are dependent on the sequence length
• There may be sequences with higher percent identities but lower scores
  – Lower scores can result when the query and answer sequences differ in length, so the alignment is shorter
  – Sequences with lower scores and high percent identity may be of interest
If sequences with better than 90% identity are needed, using the BLAST score alone would miss this sequence with a BLAST score of 452.
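The difference between ranking by score and filtering by percent identity can be shown with a few hit records. Only the 552 score with 260/260 identities and the 452 score come from the slides. The identity counts for the 452 hit, the third hit, and the score cutoff of 500 are hypothetical values made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    bit_score: float
    identities: int   # number of identical aligned residues
    align_len: int    # length of the alignment

    @property
    def percent_identity(self) -> float:
        return 100.0 * self.identities / self.align_len


hits = [
    Hit("hit_A", 552, 260, 260),  # from the slide: 100% (260/260)
    Hit("hit_B", 452, 210, 215),  # hypothetical: shorter alignment, high identity
    Hit("hit_C", 480, 200, 260),  # hypothetical: longer alignment, lower identity
]

by_score = [h.name for h in hits if h.bit_score >= 500]
by_identity = [h.name for h in hits if h.percent_identity > 90]

if __name__ == "__main__":
    print("score >= 500:   ", by_score)
    print("identity > 90%: ", by_identity)
```

A score cutoff alone drops hit_B even though its alignment is more than 90% identical, which is the situation described above for a patent claim written in terms of percent identity.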
Tables can be saved as an STN Express Table file or an Excel® file
• The width of the table columns can be resized by clicking on the border of each column.
• Additional non-search terms can be highlighted.
Increase the alignment length when creating BLAST alignment reports
Change the alignment length to 3,000 to display sequence alignments under 3,000 residues.
Summary
• Use default settings for long sequences
  – Decrease E value (0.01 or 0.001)
  – Turn OFF the low complexity filtering
    • If more answers are desired
• Change settings for short sequences
  – Increase E value to max (1,000)
  – Turn OFF the low complexity filtering
  – Change word size to 2
  – Change BLAST matrix to PAM 70 (30-50 residues) or PAM 30 (less than 30 residues)
Summary (cont.)
• Use percent identity to determine the number of exactly matched residues
  – Higher BLAST scores do not always mean higher percent identity
• Sequence tables can be easily created and saved using STN Express
  – Adjust the column width by double clicking on the column
  – Add additional non-search term highlighting
• Increase the sequence alignment length when creating BLAST alignment reports
Special thanks to:
• CAS staff
  – Lora Burgess
  – Brian Sweet
  – Tina Tomeo
  – Marie Sparks
  – Jason Anderson
  – Crystal Poole Bradley
• FIZ K staff
  – Jim Brown
  – Rob Austin