©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices...

©CMBI 2005 Database Searching BLAST

• Database Searching
• Sequence Alignment
• Scoring Matrices
• Significance of an alignment
• BLAST, algorithm
• BLAST, parameters
• BLAST, output
• Alignment significance in BLAST

Database Searching

Identify similarities between

• novel query sequences whose structures and functions are unknown and uncharacterized, and

• sequences in (public) databases whose structures and functions have been elucidated.

N.B. The similarity might span the entire query sequence or just part of it!

Database searching (2)

The query sequence is compared/aligned with every sequence in the database.

High-scoring database sequences are assumed to be evolutionarily related to the query sequence.

If sequences are related by divergence from a common ancestor, they are said to be homologous.

J. Leunissen, ©CMBI 2005

Sequence Alignment

The purpose of a sequence alignment is to line up, across any number of sequences, all residues that were derived from the same residue position in the ancestral gene or protein.

gap = insertion or deletion

Scoring Matrix/Substitution Matrix

Used to score the quality of an alignment.

Contains scores for pairs of residues (amino acids or nucleotides) in a sequence alignment.

For protein/protein comparisons: a 20 x 20 matrix of similarity scores in which identical amino acids and those of similar character (e.g. Ile, Leu) score higher than those of different character (e.g. Ile, Asp).

The matrix is symmetric: a pair of residues gets the same score in either order.

Substitution Matrices

Not all amino acids are equal
• Residues mutate more easily to similar ones
• Residues at the surface mutate more easily
• Aromatics mutate preferably into aromatics

Mutations tend to favor some substitutions
• Core tends to be hydrophobic

Selection tends to favor some substitutions
• Cysteines are dangerous at the surface
• Cysteines in bridges seldom mutate

PAM250 Matrix

Scoring example

Score of an alignment is the sum of the scores of all pairs of residues in the alignment

sequence 1:   T   C   C   P   S   I   V   A   R   S   N
sequence 2:   S   C   C   P   S   I   S   A   R   N   T
pair score:   1  12  12   6   2   5  -1   2   6   1   0

=> alignment score = 46
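The sum in this example can be reproduced in a few lines of Python. This is a minimal sketch: the `PAIR_SCORES` table holds only the residue pairs that occur in the example, with the values shown on the slide (presumably taken from the PAM250 matrix on the previous slide). The names `pair_score` and `alignment_score` are chosen for illustration.

```python
# Scores for the residue pairs used in the slide's example
# (values copied from the slide; only these pairs are covered).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(a, b):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the scores of all aligned residue pairs (ungapped, equal length)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("TCCPSIVARSN", "SCCPSISARNT"))  # 46
```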

Dayhoff Matrix (1)

The group of Dayhoff created a scoring matrix from a dataset of closely similar protein sequences that could be aligned unambiguously.

Then they counted all mutations (and non-mutations) and calculated the mutation frequencies

With a bit of math, they converted these frequencies into the famous Dayhoff matrix (also called PAM matrix).

Dayhoff Matrix (2)

Given the frequency of Leu and Val in my sequences, do I see more V → L mutations than I would expect by chance?

Score of mutation A → B = log (observed A → B mutation rate / A → B mutation rate expected by chance)

This is called a log odds score and can be negative, zero, or positive.

When using a log odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
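The log odds idea on this slide can be sketched as follows. Two assumptions are made here that the slide does not state: the rate expected by chance is modelled as the product of the two residue frequencies, and the logarithm is taken to base 10. The frequencies in the example are made up for illustration.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b):
    """log10 of the observed pair frequency over the frequency expected by chance.

    Assumption: "expected by chance" is the product of the two residue frequencies.
    """
    expected = freq_a * freq_b
    return math.log10(observed_pair_freq / expected)


# Made-up frequencies, for illustration only.
freq_val, freq_leu = 0.07, 0.10
print(log_odds(0.014, freq_val, freq_leu))   # seen more often than expected: positive
print(log_odds(0.0035, freq_val, freq_leu))  # seen less often than expected: negative
```

A positive score means the substitution is seen more often than chance predicts, a negative score means it is seen less often, and zero means it is seen exactly as often as expected.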

Dayhoff Matrix (3)

This log odds matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 point mutation per 100 residues.

Matrices for greater evolutionary distances are generated from PAM 1: the underlying mutation probability matrix is multiplied repeatedly by itself, and the result is then converted to log odds scores.

PAM250:
– 2.5 mutations per residue.
– equivalent to 20% matches remaining between two sequences, i.e. 80% of the amino acid positions are observed to have changed (one or more times).
– is the default in many analysis packages.
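Why 250 PAM does not mean that every position has changed can be illustrated with a toy model. The sketch below is not Dayhoff's data: it assumes a uniform 20-letter alphabet in which each residue changes with probability 0.01 per step, equally to any of the other 19 residues. The fraction of unchanged positions it prints therefore differs from the 20% quoted for the real PAM250; the point is only that repeated multiplication lets positions change more than once or change back.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n_steps):
    """Multiply matrix m by itself n_steps times (starting from the identity)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n_steps):
        result = mat_mul(result, m)
    return result


ALPHABET_SIZE = 20
CHANGE = 0.01  # toy assumption: 1 change per 100 residues per step

# Toy "PAM 1" mutation probability matrix: uniform, unlike the real one.
toy_pam1 = [[1.0 - CHANGE if i == j else CHANGE / (ALPHABET_SIZE - 1)
             for j in range(ALPHABET_SIZE)]
            for i in range(ALPHABET_SIZE)]

toy_pam250 = mat_power(toy_pam1, 250)
unchanged = toy_pam250[0][0]
print(f"fraction of positions unchanged after 250 steps (toy model): {unchanged:.3f}")
```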

BLOSUM Matrix

Limitation of the Dayhoff matrix: matrices based on the Dayhoff model of evolutionary rates are derived from alignments of sequences that are at least 85% identical; that might not be optimal…

An alternative approach has been developed by Henikoff and Henikoff, using local multiple alignments of more distantly related sequences.

BLOSUM Matrix (2)

The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database.

The BLOCKS database utilizes the concept of blocks (un-gapped amino acid patterns) that act as signatures of a family of proteins.

Substitution frequencies for all pairs of amino acids were then calculated, and these were used to calculate a log odds BLOSUM matrix.

Different matrices are obtained by varying the identity threshold. For example, BLOSUM80 was derived from blocks in which sequences sharing at least 80% identity were clustered together.

Which Matrix to use?

Close relationships: low PAM, high BLOSUM
Distant relationships: high PAM, low BLOSUM

Often used defaults are: PAM250, BLOSUM62

More conserved                      More variable
BLOSUM 80        BLOSUM 62          BLOSUM 45
PAM 20           PAM 120            PAM 250

Significance of alignment (1)

When is an alignment statistically significant?

In other words:

How different is the observed alignment score from the scores obtained by aligning random sequences to the query sequence?

Or:

What is the probability that an alignment with this score could have arisen by chance?
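One simple way to approach this question is to shuffle one of the sequences many times and count how often a shuffled version scores at least as well as the real one. The sketch below is only an illustration under strong simplifications: it uses an ungapped identity count as the score (not a substitution matrix), and the function names, the number of shuffles and the fixed random seed are all choices made here.

```python
import random


def identity_score(seq1, seq2):
    """Number of identical residues in an ungapped, position-by-position comparison."""
    return sum(1 for a, b in zip(seq1, seq2) if a == b)


def empirical_p_value(query, subject, n_shuffles=1000, seed=0):
    """Estimate how often a shuffled subject scores at least as well as the real one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = identity_score(query, subject)
    letters = list(subject)
    as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if identity_score(query, letters) >= observed:
            as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (as_good + 1) / (n_shuffles + 1)


print(empirical_p_value("TCCPSIVARSN", "SCCPSISARNT"))
```

A small value suggests that the observed score is unlikely to arise from sequences of the same composition arranged at random.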

Significance of alignment (2)

Database size = 20 x 10^6 letters

peptide     #hits
A           1 x 10^6
AP          50000
IAP         2500
LIAP        125
WLIAP       6
KWLIAP      0.3
KWLIAPY     0.015
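The numbers in this table follow from dividing the database size by 20 for every residue in the peptide. A minimal sketch, assuming all 20 amino acids are equally frequent (which is what the slide's numbers imply); the function name `expected_hits` is chosen here:

```python
def expected_hits(database_letters, peptide_length, alphabet_size=20):
    """Expected number of exact chance matches for a peptide of the given length.

    Assumption: every residue occurs with the same frequency (1 / alphabet_size).
    """
    return database_letters / alphabet_size ** peptide_length


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
    print(peptide, expected_hits(20e6, len(peptide)))
```

Each extra residue makes a chance match 20 times less likely, which is why longer exact matches quickly become significant.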

BLAST

Question: What database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence?

• BLAST finds the highest scoring locally optimal alignments between a query sequence and a database.
• Very fast algorithm
• Can be used to search extremely large databases
• Sufficiently sensitive and selective for most purposes
• Robust – the default parameters can usually be used

BLAST – Algorithm

Step 1: Read/understand user query sequence.

Step 2: Use hashing technology to select several thousand likely candidates.

Step 3: Do a real alignment between the query sequence and those likely candidates. ‘Real alignment’ is a main topic of this course.

Step 4: Present output to user.

BLAST Algorithm, Step 2

For a given word length w and a given score matrix: create a list of all words (w-mers) that score at least T when compared to w-mers from the query.

Query sequence: L N K C K T P Q G Q R L V N Q

Word: P Q G

Neighborhood words (score at or above the threshold, T=13):
P Q G 18
P E G 15
P R G 14
P K G 14
P N G 13
P D G 13
P M G 13

Below threshold (T=13):
P Q A 12
P Q N 12
etc.
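The neighborhood word list for the query word PQG can be generated by brute force. The sketch below assumes the scores on the slide come from BLOSUM62 (they are consistent with it) and includes only the three matrix rows needed for P, Q and G; these rows are typed in here and should be checked against a reference copy of the matrix. Real BLAST implementations build this list far more efficiently.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G, in the column order of AMINO_ACIDS.
# Assumption: typed in for this sketch; verify against a reference matrix.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(query_word, candidate):
    """Sum of pair scores between the query word and a candidate word."""
    return sum(ROWS[q][AMINO_ACIDS.index(c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """All words of the same length scoring at least `threshold` against the query word."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


words = neighborhood("PQG", 13)
for word, score in sorted(words.items(), key=lambda item: -item[1]):
    print(word, score)
```

Raising T shortens the list and speeds up the search at the cost of sensitivity; lowering T does the opposite.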

BLAST Algorithm, Step 2

• Each neighbourhood word gives all positions in the database where it is found (hit list).

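[Figure: each neighborhood word (P Q G 18, P E G 15, P R G 14, P K G 14, P N G 13, P D G 13, P M G 13) is looked up in the database; the example shows the word PMG pointing to its positions in the database.]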
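The lookup of neighborhood words in the database can be sketched with a dictionary that maps every w-mer to the positions where it occurs. This is one simple way to implement a "hashing" step; the toy database, its sequence names and the function names below are invented for illustration.

```python
from collections import defaultdict


def build_word_index(database, word_len):
    """Map every word of length word_len to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_len + 1):
            index[seq[pos:pos + word_len]].append((name, pos))
    return index


def find_hits(neighborhood_words, index):
    """Return the hit list: database positions for each neighborhood word that occurs."""
    return {word: index[word] for word in neighborhood_words if word in index}


# Hypothetical two-sequence database, for illustration only.
toy_database = {"seq1": "MKPMGLLPQG", "seq2": "AAPEGTT"}
index = build_word_index(toy_database, 3)
hits = find_hits(["PQG", "PEG", "PMG", "PDG"], index)
print(hits)
```

The index is built once per database, so each query only pays for the dictionary lookups.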

BLAST Algorithm, Step 2

The program extends matching segments (seeds) in both directions by adding residues. Residues will be added until the incremental score drops below a threshold.
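A seed extension of this kind can be sketched as below. The stopping rule used here is an assumption: extension in each direction stops once the running score has fallen more than `x_drop` below the best score seen so far, and the best-scoring segment is returned. The match/mismatch scores in the example are arbitrary; in practice the pair scores would come from a substitution matrix.

```python
def extend_seed(query, subject, q_start, s_start, word_len, score, x_drop):
    """Extend an ungapped seed in both directions.

    Returns (best_score, q_begin, q_end); query[q_begin:q_end] is the extended segment.
    """
    best = sum(score(query[q_start + i], subject[s_start + i]) for i in range(word_len))

    # Extend to the right of the seed.
    current, best_right, i = best, 0, 0
    while q_start + word_len + i < len(query) and s_start + word_len + i < len(subject):
        current += score(query[q_start + word_len + i], subject[s_start + word_len + i])
        i += 1
        if current > best:
            best, best_right = current, i
        elif best - current > x_drop:
            break

    # Extend to the left, starting from the best score found so far.
    current, best_left, i = best, 0, 0
    while q_start - 1 - i >= 0 and s_start - 1 - i >= 0:
        current += score(query[q_start - 1 - i], subject[s_start - 1 - i])
        i += 1
        if current > best:
            best, best_left = current, i
        elif best - current > x_drop:
            break

    return best, q_start - best_left, q_start + word_len + best_right


def simple_score(a, b):
    """Arbitrary illustrative scores: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3


print(extend_seed("AAPQGKK", "CCPQGKK", 2, 2, 3, simple_score, x_drop=2))  # (10, 2, 7)
```

In the example the seed PQG grows to PQGKK on the right, while the mismatch on the left stops the extension immediately.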

Basic BLAST Algorithms

Program     Query             Database
BLASTP      protein           protein
BLASTN      DNA               DNA
BLASTX      translated DNA    protein
TBLASTN     protein           translated DNA
TBLASTX     translated DNA    translated DNA

PSI-BLAST

Position-Specific Iterated BLAST
• Distant relationships are often best detected by motif or profile searches rather than pair-wise comparisons.
• PSI-BLAST first performs a BLAST search.
• PSI-BLAST uses the information from the significant BLAST alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching.
• PSI-BLAST may be iterated until no new significant alignments are found.
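The idea of a position-specific score matrix can be illustrated with a toy version. The sketch below is not PSI-BLAST's actual procedure: it assumes equally weighted, ungapped, equal-length segments, a uniform background frequency of 1/20 and a simple pseudocount, whereas the real program uses sequence weighting and more careful statistics. The example segments are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build one column of log odds scores per alignment position.

    Assumptions: ungapped segments of equal length, uniform background frequencies.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment against the matrix, one column per position."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Invented segments standing in for significant hits from a first search round.
segments = ["PQG", "PEG", "PQG", "PKG"]
pssm = build_pssm(segments)
print(score_with_pssm(pssm, "PQG"), score_with_pssm(pssm, "AAA"))
```

Unlike a general substitution matrix, the score for a residue now depends on the position: a conserved column rewards only its conserved residue, while a variable column is more tolerant.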

BLAST Input

Steps in running BLAST:

• Enter your query sequence (cut-and-paste)
• Select the database(s) you want to search
• Choose output parameters
• Choose alignment parameters (scoring matrix, filters, …)

Example query:

MAFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN

BLAST Output (1)

BLAST Output (2)

A high score or, preferably, a cluster of high scores indicates a likely relationship.

A low probability indicates that a match is unlikely to have arisen by chance.

BLAST Output (3)

Low scores with high probabilities suggest that matches have arisen by chance

Alignment Significance in BLAST

P-value (probability): relates the score for an alignment to the likelihood that it arose by chance. The closer to zero, the greater the confidence that the hit is real.

E-value (expect value): the number of alignments with a score at least this high that would be expected by chance in that database (e.g. if E=10, 10 matches with scores this high are expected to be found by chance). A match will be reported if its E is below the threshold. Lower E thresholds are more stringent, and report fewer matches.
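The two measures are related: if chance hits are assumed to follow a Poisson distribution, the probability of finding at least one chance hit with such a score is P = 1 - exp(-E). The sketch below shows this conversion and the effect of the E threshold on what gets reported. The hit names and E-values are invented, and treating "below the threshold" as "at or below" is a choice made here.

```python
import math


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E), assuming chance hits are Poisson distributed."""
    return -math.expm1(-e_value)


def report_hits(hits, e_threshold):
    """Keep only the hits whose E-value is at or below the threshold."""
    return [(name, e_value) for name, e_value in hits if e_value <= e_threshold]


# Invented hits, for illustration only.
toy_hits = [("hitA", 1e-30), ("hitB", 0.5), ("hitC", 12.0)]

print(p_value_from_e_value(10))       # close to 1: a chance hit is almost certain
print(p_value_from_e_value(0.001))    # close to E itself for small E
print(report_hits(toy_hits, 10))      # lenient threshold keeps hitA and hitB
print(report_hits(toy_hits, 0.001))   # stringent threshold keeps only hitA
```

For small values, E and P are nearly equal, which is why the two are often used interchangeably for strong hits.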

BLAST Output (4)

BLAST Output (5)

BLAST Output (6)

Low complexity filter

Local implementation - Blast in MRS

MRS also contains a BLAST. This BLAST is simpler, has fewer options, knows fewer databases, but is faster.

Blast in MRS

MRS Blast remembers all your queries from one session, and stores them in a table. The one you are running is in that table too. Multiple BLASTs can run at one time.

[Screenshot: query table with the status labels "Still running" and "Ready"]

Blast hitlist in MRS

Blast hitlist expansion in MRS

Low complexity motifs visible

Routing

Routing to Clustal

Routing MRS to Blast