Best Practices for BLAST® Searching, Analysis, and Post-processing

PIUG Boston Biotech Meeting 2010

Gin-Yun Eggerichs

Chemical Abstracts Service


Agenda

• Understanding BLAST settings
  – Settings for long sequences
  – Settings for short sequences

• Analyzing BLAST results
  – % score vs. % identity

• Post-processing

BLAST background

• Basic Local Alignment Search Tool
  – Developed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman, and Webb Miller
  – Originally developed to assist scientists
    • Identify sequence and its possible functions
    • Increase search speed
    • No loss of search sensitivity
  – Optimized for long sequences (over 50 residues)
    • BLAST parameters need to change for short sequences

• For more information on BLAST, visit:
  – www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

BLAST default settings for long sequences

• BLAST default settings are fine for most searches
  – Sequences greater than 80-85 residues
  – Expectation value (E value): 10
    • Can change to smaller E value (0.01 or 0.001)
  – Word size: 3
  – Weight Matrix: BLOSUM 62
  – Gap Cost: Open: 11 Extend: 1
  – Low Complexity Filtering: ON
    • Can change to turn OFF the filtering

Expectation value (E value) is a statistical significance threshold

• Dependent on the length of the query sequence and the size of the database
• Represents the number of hits with a score equal to or better than “S” that would be “expected” by chance
  – An E value of 10 (default) means that 10 hits with scores equal to or better than the score “S” are expected to occur by chance
• The lower the E value (towards 0), the more significant the retrieved sequence hits are
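The dependence of the E value on query length and database size can be sketched with the standard Karlin-Altschul relation E = m × n × 2^(−S′), where S′ is the bit score, m the query length, and n the total database length. This is a minimal sketch: it ignores the effective-length corrections that BLAST applies, and the function names and the example lengths are hypothetical.

```python
import math


def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Number of hits expected by chance with a bit score >= bit_score.

    Uses E = m * n * 2**(-S'), without BLAST's effective-length corrections.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def bit_score_threshold(e_value: float, query_length: int, database_length: int) -> float:
    """Bit score a hit must reach so that its E value is at most e_value."""
    return math.log2(query_length * database_length / e_value)


# Hypothetical search: a 260-residue query against 1 billion database residues.
m, n = 260, 1_000_000_000
for e in (10, 0.01, 0.001):
    print(f"E = {e}: bit score threshold = {bit_score_threshold(e, m, n):.1f}")
```

Lowering the E value raises the minimum bit score a hit needs, which is why a stricter E value tends to drop the lowest-scoring answers first.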

Decreasing the E value most often removes the sequences with the lowest scores

E=10 retrieved sequences from all score ranges.

E=0.01 retrieved sequences with BLAST scores of 40 and above.

Weight matrix is a set of scores used to determine the similarity of the sequences

• BLOSUM matrix (BLOcks SUbstitution Matrix)
  – Substitution frequencies based on observations
  – BLOSUM62 is derived from alignment blocks in which sequences sharing 62% or more identity were clustered
• PAM matrix (Point Accepted Mutation)
  – Matrix calculated based on comparison of closely related proteins

[Figure: the matrices BLOSUM 62 (default), BLOSUM 80, BLOSUM 45, PAM 30, and PAM 70 placed on a scale running from "More similar" to "Less similar"]

Low complexity filtering masks areas with unusual compositions

• Low complexity sequences can retrieve false positives
  – Repeating residue segments result in higher scores
  – The rest of the sequence may not match up well
  – The retrieved sequences may be of little relevance
• More relevant sequences that have lower scores may be excluded from the BLAST answer set
• Turn OFF the low complexity filtering if
  – More answers are required
  – Not interested in biologically interesting regions
  – Short sequences are desired
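To make the idea of masking concrete, the sketch below flags windows whose residue composition has low Shannon entropy and replaces them with X. It is only an illustration of the principle: it is not the filter that BLAST actually runs, and the function names, window size, and threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    window and threshold are illustrative values, not BLAST settings.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical query with a poly-Q repeat between two diverse segments.
query = "ACDEFGHIKLMN" + "Q" * 20 + "ACDEFGHIKLMN"
print(mask_low_complexity(query))
```

A masked repeat can no longer contribute to the score, so hits that matched only on the repeat drop out. With filtering OFF, those hits come back, which is why the answer set grows.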


Turning OFF the Low Complexity Filtering typically retrieves more results

Optimized BLAST settings for long sequences

• BLAST default settings are fine for most searches
  – Sequences greater than 80-85 residues
  – Expectation value (E value): 0.01 or 0.001
  – Word size: 3
  – Weight Matrix: BLOSUM 62
  – Gap Cost: Open: 11 Extend: 1
  – Low Complexity Filtering: OFF
• For patent searching, turn OFF the Low Complexity Filtering
• For non-patent searching, leave the Low Complexity Filtering ON

BLAST settings for short sequences

• Sequences less than 50 residues
  – Expectation value (E value): 1,000
    • A short sequence may match perfectly but have low statistical significance
  – Word size: 3 or 2
    • The query length must be at least twice the word size
  – Weight Matrix
    • Sequence length (SQL) 30-50: PAM 70
    • Sequence length (SQL) <30: PAM 30
  – Low Complexity Filtering: OFF
  – Gap Cost: Open: 10 Extend: 1
    • Use the default Gap Cost associated with the Weight Matrix

Recommended NCBI BLAST settings

Setting                  Protein         Protein         Protein         Protein
                         >85 residues    50-85 residues  35-50 residues  <35 residues
E value                  10              10              1,000           1,000
Word Size                3               3               3 or 2          2
Weight Matrix            BLOSUM 62       BLOSUM 80       PAM 70          PAM 30
Gap Cost (open, extend)  11, 1           10, 1           10, 1           9, 1
Low Complexity Filter    ON              OFF             OFF             OFF
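The table can be expressed as a small lookup, sketched below. The function names, the file and database names, and the handling of the boundary lengths (50 and 85 fall in the 50-85 column, 35 in the 35-50 column, word size 2 where the table allows "3 or 2") are assumptions. The command-line flags are those of the NCBI BLAST+ blastp program, which is assumed here for illustration and may differ from the interface used on STN.

```python
def recommended_settings(query_length: int) -> dict:
    """Return the table's recommended protein BLAST settings for a query length."""
    if query_length > 85:
        settings = {"evalue": 10, "word_size": 3, "matrix": "BLOSUM62",
                    "gapopen": 11, "gapextend": 1, "low_complexity_filter": True}
    elif query_length >= 50:
        settings = {"evalue": 10, "word_size": 3, "matrix": "BLOSUM80",
                    "gapopen": 10, "gapextend": 1, "low_complexity_filter": False}
    elif query_length >= 35:
        # The table allows word size 3 or 2 here; 2 is an arbitrary choice.
        settings = {"evalue": 1000, "word_size": 2, "matrix": "PAM70",
                    "gapopen": 10, "gapextend": 1, "low_complexity_filter": False}
    else:
        settings = {"evalue": 1000, "word_size": 2, "matrix": "PAM30",
                    "gapopen": 9, "gapextend": 1, "low_complexity_filter": False}
    # Rule from the short-sequence slide: query length >= 2 x word size.
    if query_length < 2 * settings["word_size"]:
        raise ValueError("query length must be at least twice the word size")
    return settings


def blastp_arguments(query_file: str, database: str, settings: dict) -> list:
    """Build an argument list for NCBI BLAST+ blastp (assumed to be installed)."""
    return [
        "blastp", "-query", query_file, "-db", database,
        "-evalue", str(settings["evalue"]),
        "-word_size", str(settings["word_size"]),
        "-matrix", settings["matrix"],
        "-gapopen", str(settings["gapopen"]),
        "-gapextend", str(settings["gapextend"]),
        "-seg", "yes" if settings["low_complexity_filter"] else "no",
    ]


# Hypothetical 20-residue peptide query; file and database names are placeholders.
print(" ".join(blastp_arguments("query.fasta", "my_protein_db", recommended_settings(20))))
```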

Agenda

• Understanding BLAST settings
  – Settings for long sequences
  – Settings for short sequences

• Analyzing BLAST results
  – % score vs. % identity

• Post-processing

BLAST (or bit) scores are normalized raw scores

• Raw score: Σ(identity and mismatch scores) - Σ(gap penalties)
  – Not informative because it is not standardized
  – Difficult to compare alignments, especially if different matrices are used
• In theory, the higher the score, the better the alignment
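The relationship between raw and bit scores can be sketched as follows. The raw score sums the matrix scores of the aligned residue pairs and subtracts a penalty per gap (opening cost plus extension cost per gapped position, the NCBI BLAST convention). The bit score then normalizes it as S′ = (λS − ln K) / ln 2, where λ and K are statistical parameters that depend on the matrix and gap costs. The function names and the example values are hypothetical, and the λ and K shown are values commonly reported for gapped BLOSUM62 with gap costs 11/1, treated here as assumptions.

```python
import math


def raw_score(pair_scores, gap_lengths, gap_open, gap_extend):
    """Sum of substitution scores minus gap penalties.

    pair_scores: matrix score of each aligned residue pair (match or mismatch).
    gap_lengths: length of each gap in the alignment.
    A gap of length k is charged gap_open + k * gap_extend.
    """
    gap_penalty = sum(gap_open + length * gap_extend for length in gap_lengths)
    return sum(pair_scores) - gap_penalty


def bit_score(raw, lam, k):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


# Hypothetical alignment: three aligned pairs and one gap of length 2.
s = raw_score([4, 5, 6], [2], gap_open=11, gap_extend=1)
print("raw score:", s)

# Assumed parameters for gapped BLOSUM62 with gap costs 11/1.
print("bit score for raw score 100:", round(bit_score(100, lam=0.267, k=0.041), 1))
```

Because λ and K change with the matrix and gap costs, the same raw score maps to different bit scores under different settings, which is why only bit scores are comparable across searches.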

Identities are a useful way to judge the relevance of a sequence

• Patents often claim sequences using percent identity
• Identity
  – The extent to which two sequences are the same (invariant)

552 is the highest BLAST score with 100% (260/260) identities.

BLAST scores are dependent on the sequence length

• There may be sequences with higher percent identities but lower scores
  – Lower scores can result when the query sequence is shorter than the answer sequence
  – Sequences with lower scores and high percent identity may be of interest

If sequences with better than 90% identity are needed, using the BLAST score alone would miss this sequence with a 452 BLAST score.
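The difference between ranking by score and filtering by percent identity can be sketched with a few hits. All hit names and counts below are hypothetical; only the scores 552 and 452 and the 260/260 identity echo the slides. Percent identity is taken as identical residues divided by alignment length.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    bit_score: float
    identical: int        # number of identical residues in the alignment
    aligned_length: int   # total alignment length

    @property
    def percent_identity(self) -> float:
        return 100.0 * self.identical / self.aligned_length


# Hypothetical answer set.
hits = [
    Hit("hit_A", 552, 260, 260),   # 100% identity
    Hit("hit_C", 480, 200, 260),   # longer alignment, about 77% identity
    Hit("hit_B", 452, 215, 230),   # shorter alignment, about 93% identity
]

# Keeping only the two best scores drops hit_B ...
top_by_score = sorted(hits, key=lambda h: h.bit_score, reverse=True)[:2]
print([h.name for h in top_by_score])

# ... while filtering on percent identity keeps it.
above_90 = [h for h in hits if h.percent_identity > 90]
print([h.name for h in above_90])
```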

Agenda

• Understanding BLAST settings
  – Settings for long sequences
  – Settings for short sequences

• Analyzing BLAST results
  – % score vs. % identity

• Post-processing


Sequence alignment reports or tables can be easily created using STN Express®

Tables can be saved as an STN Express Table file or an Excel® file

The column widths can be resized by clicking on the border of each column.

Additional non-search terms can be highlighted.


Increase the alignment length when creating BLAST alignment reports

Change the alignment length to 3,000 to display sequence alignments under 3,000 residues.

BLAST alignment reports can be saved in .REP or .RTF format

Summary

• Use default settings for long sequences
  – Decrease E value (0.01 or 0.001)
  – Turn OFF the low complexity filtering
    • If more answers are desired

• Change settings for short sequences
  – Increase E value to max (1,000)
  – Turn OFF the low complexity filtering
  – Change word size to 2
  – Change BLAST matrix to PAM 70 (30-50 residues) or PAM 30 (less than 30 residues)

Summary (cont.)

• Use percent identity to determine the number of exactly matched residues
  – Higher BLAST scores do not always mean higher percent identity
• Sequence tables can be easily created and saved using STN Express
  – Adjust the column width by double clicking on the column
  – Add additional non-search term highlighting
• Increase the sequence alignment length when creating BLAST alignment reports

Special thanks to:

• CAS staff
  – Lora Burgess
  – Brian Sweet
  – Tina Tomeo
  – Marie Sparks
  – Jason Anderson
  – Crystal Poole Bradley

• FIZ K staff
  – Jim Brown
  – Rob Austin

See us at the booth for details!
