Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA

1


Introduction to bioinformatics

Stinus [email protected]

Bioinformatics Centre, University of Copenhagen

2

Outline of the lecture
  Scoring an alignment: BLOSUM and PAM
  Scoring matrices for nucleotides
  Treatment of gaps
  The BLAST algorithm
  Position Specific Scoring Matrices
  PSI-BLAST and other variants
  BLAT
  FASTA

3

Definition of scoring system
  To build an alignment we need a scoring system
    Aligning two different residues indicates a substitution
    How likely is it to replace A with B?
    A likely alignment should receive a large score
  Create a substitution matrix
    A 20×20 symmetrical table (matrix)
    Look up the score for aligning two amino acids
    The diagonal represents conservation
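The table lookup described on this slide can be sketched in a few lines of Python. The four-letter alphabet and the scores in `TOY_SCORES` are illustrative assumptions, not a published matrix, and the names `pair_score` and `ungapped_alignment_score` are made up for this example.

```python
# Toy substitution scores for a four-letter alphabet (illustrative values only).
# Only one triangle is stored because the matrix is symmetrical.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "I"): -1, ("A", "L"): -1, ("A", "S"): 1,
    ("I", "I"): 4, ("I", "L"): 2, ("I", "S"): -2,
    ("L", "L"): 4, ("L", "S"): -2,
    ("S", "S"): 4,
}


def pair_score(a, b, matrix=TOY_SCORES):
    """Look up the score for aligning residues a and b (order does not matter)."""
    return matrix[tuple(sorted((a, b)))]


def ungapped_alignment_score(seq1, seq2, matrix=TOY_SCORES):
    """Sum the pair scores of two equally long, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


# A/A (4) + L/S (-2) + I/L (2) + S/S (4)
total = ungapped_alignment_score("ALIS", "ASLS")
print(total)
```

Storing one triangle and sorting the pair before the lookup is one simple way to exploit the symmetry; a full 20×20 table would work the same way.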

4

Considerations
  Conservative substitutions
    Prefer changes that respect structure and function
    E.g. hydrophobicity, size, chemical properties etc.
  Frequencies
    How often do specific residues occur?
    Scores should weigh the rare ones higher
  Evolution
    How distant are the sequences?
    Ancient divergence → more changes

5

The creation
  Approach 1: Make an evolutionary model
    Use closely related sequences
    Extrapolate to greater distances
    Done in the PAM family
  Approach 2: Look at related sequences
    Observe actual mutations in motifs
    Use sets of different overall identity
    Done in the BLOSUM family

6

[1] M.O. Dayhoff: Survey of new data and computer methods of analysis (1978), Atlas of protein sequence and structure, 5:3

The PAM250 matrix[1]

7

The PAM family
  Defining characteristics
    PAM: Point (or percent) accepted mutations
    One PAM unit: 1 change per 100 residues (~10 million years)
    Not the probability that a residue A is something else
      What would PAM250 mean in that case?
    But: The number of expected changes overall
      Including changes A → B → A

8

Construction of PAM matrices
  The general idea:
    Assume evolution is independent
    Describe "one step of evolution"
    The next step will follow
  PAMx: Scoring matrix for x evolutionary steps
    Very similar sequences: Low x
    Very divergent sequences: Large x
  What is an evolutionary step?

9

Evolutionary model
  Assume independence
    Between positions
    Over time
    P(A→B)=P(B→A)
  Consequences:
    How a sequence looks tomorrow depends ONLY on how it looks today
      Not on how it looked yesterday
      Known as a Markov chain
    The context does not matter
    Evolution "reversible"

10

Building PAM1

  1. Find set of very similar sequences (>85% ID)
  2. Make global alignment (71 groups)
  3. Count the substitutions (1572 changes)
  4. Calculate weights

11

From counts to scores

  The simplified version
    Count mutation frequencies
      The respective probabilities
    Normalize by mutability of respective amino acid
      Makes comparisons possible
    Multiply to wanted evolutionary distance x
      Still probabilities
    Calculate log-odds score for (A→B + B→A)/2
    You have the PAMx matrix!

12

Why log-odds?

  Summation instead of multiplication
    log(A·B)=log(A)+log(B)
  $S_{i,j} = \log \frac{q_{i,j}}{p_i\,p_j}$
  S>0: More often than expected by chance
  S=0: Number expected by chance
  S<0: Less often than expected by chance
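A small sketch of the log-odds score may help here. It assumes the usual reading of the formula, with q as the observed frequency of a residue pair and p as the background frequencies of the two residues; the frequencies below are hypothetical numbers chosen so the three cases (S>0, S=0, S<0) are easy to see.

```python
import math


def log_odds(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency versus chance expectation."""
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies: both residues have background frequency 0.1,
# so the pair is expected with frequency 0.1 * 0.1 = 0.01 by chance.
more_than_chance = log_odds(0.02, 0.1, 0.1)   # positive
as_by_chance = log_odds(0.01, 0.1, 0.1)       # about zero
less_than_chance = log_odds(0.005, 0.1, 0.1)  # negative

# Summation instead of multiplication: adding per-column log-odds scores
# equals taking the log of the product of the per-column odds ratios.
ratios = [2.0, 0.5, 4.0]
summed = sum(math.log2(r) for r in ratios)
multiplied = math.log2(math.prod(ratios))
print(more_than_chance, as_by_chance, less_than_chance, summed, multiplied)
```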

13

The PAM1 matrix
  Highly identical sequences → 1 evolutionary step
  PAM1 describes likelihood of changes
    Diagonal close to 1, off-diagonal close to 0
  PAM2=PAM1×PAM1
  PAM3=PAM1×PAM2
  …
  Markov chain property (independence)
  Greater divergence through matrix multiplication
  Large PAM: More even values
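The extrapolation by matrix multiplication can be illustrated with a toy example. The 3×3 matrix `TOY_PAM1` below is an invented mutation probability matrix over a three-letter alphabet (each row sums to 1), not Dayhoff's data, and the helper names are assumptions made for this sketch.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(step, x):
    """Apply the one-step matrix x times (x >= 1): PAMx = PAM1 multiplied x times."""
    result = step
    for _ in range(x - 1):
        result = mat_mul(result, step)
    return result


# Invented one-step probabilities: diagonal close to 1, off-diagonal close to 0.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(value, 3) for value in row])
```

After 250 steps the rows are still probability distributions, but the values have moved much closer together, which is the "more even values" effect noted above.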

14

PAM problems

  Good and bad things
    Based on evolution
      Assumes independence
    Global data → evolution of proteins
    Based on small dataset
    Extrapolates to greater divergence → information loss & error growth
    Evolutionary model partly verified
      But still simplified…
    Mutations happen at the nucleotide level

15

[2] S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from protein blocks (1992), Proc. Natl. Acad. Sci., 89:10915–10919

The BLOSUM matrix[2]

16

The BLOSUM family

  Look at conserved motifs in proteins
    Scan databases for motifs
    No gaps in alignment → a BLOCK
  One protein can contain many motifs
    Make groups of motifs
  ~2000 blocks, >500 protein families
    The BLOCKS database
    BLOSUM: BLOCKS substitution matrices

17

The BLOSUM idea

  Don’t extrapolate data
    Collect divergent dataset
    Observe mutations directly
  Local alignment of sequences
  Evolution treated implicitly
  Still assumes independence between sites

18

Building BLOSUMx

  Substitution scores for sequences of x %ID
  1. Create sets of blocks of x %ID
  2. Very similar sequences are grouped and weighted
       Avoid bias due to numbers
  3. Count the substitutions
  4. Calculate log-odds scores for all pairs
  5. You have the BLOSUMx matrix!
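Steps 3 and 4 can be sketched on a toy block. This is a simplified illustration: it skips the grouping and weighting of step 2, uses an invented four-sequence block, and scales the scores to half-bits (2·log2) before rounding, which is an assumption of this sketch. Pairs that never occur in the block get no score here.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Count residue pairs per column of an ungapped block and turn them into log-odds."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented toy block: four aligned sequences, six columns, no gaps.
toy_block = ["ALASLI", "ALASLI", "ALASLI", "ALSSII"]
toy_scores = blosum_like_scores(toy_block)
print(toy_scores)
```

In this toy block the conserved pairs (A/A, S/S) come out positive and the A/S substitution comes out negative, because it is seen less often than the background frequencies of A and S would predict.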

19

Some general notes
  BLOSUM62 has proved to work for a range of similarities
  Other values (e.g. 30, 45 and 80) also used
  Today: BLOSUM used more than PAM

20

BLOSUM problems

  Good and bad things
    Evolution not considered explicitly
    Known related proteins
    Assumes star-shaped phylogeny
    Based on conserved blocks / Local alignment

21

The matrix matters
  Example: BLAST using BLOSUM45 and BLOSUM80
    Note number of hits
    Note difference in scores
NCBI BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

22

Nucleotide scoring matrices
  Normally just simple 1/-1 cost scheme
    Problematic simplification
    Especially for structural sequences (e.g. ncRNA)
  Consider evolution (PAM-like matrix)
  Use actual frequencies of nucleotides as background
  Weigh transitions and transversions differently
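One way to weigh transitions and transversions differently is sketched below. The score values (+1, -1, -2) are illustrative assumptions, not taken from the lecture; the only biology used is that A and G are purines and C and T are pyrimidines, so a purine-purine or pyrimidine-pyrimidine change is a transition.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score a nucleotide pair, penalizing transversions more than transitions."""
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def score_ungapped(seq1, seq2):
    """Sum pair scores over two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(a, b) for a, b in zip(seq1, seq2))


# A/G is a transition (-1), C/C a match (+1), G/T a transversion (-2), T/T a match (+1).
example_score = score_ungapped("ACGT", "GCTT")
print(example_score)
```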

23

Gap penalties

  We can score pairs of nucleotides/amino acids
    What about indels and gaps?
  Two main strategies
  Linear gap penalty: G(n)=a·n
    Penalize each gap the same (cf. last lecture)
  Affine gap penalty: G(n)=O+E·n
    More evolutionarily sound
    Opening a gap costs more than extending it
    Standard today
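The two penalty functions are easy to compare in code. The parameter values (a=2, O=10, E=1) and the aligned string are assumptions chosen for illustration; `gap_lengths` is a hypothetical helper that extracts the lengths of the gap runs from one row of an alignment.

```python
from itertools import groupby


def linear_gap(n, a=2):
    """Linear penalty G(n) = a*n for a gap of length n."""
    return a * n


def affine_gap(n, opening=10, extension=1):
    """Affine penalty G(n) = O + E*n for a gap of length n."""
    return opening + extension * n


def gap_lengths(aligned):
    """Lengths of the consecutive runs of '-' in one row of an alignment."""
    return [len(list(run))
            for is_gap, run in groupby(aligned, key=lambda c: c == "-")
            if is_gap]


lengths = gap_lengths("ACG---TTA-G")                  # one gap of 3, one gap of 1
linear_total = sum(linear_gap(n) for n in lengths)
affine_total = sum(affine_gap(n) for n in lengths)

# One long gap versus the same number of gap positions spread over four gaps.
one_long_linear, four_short_linear = linear_gap(4), 4 * linear_gap(1)
one_long_affine, four_short_affine = affine_gap(4), 4 * affine_gap(1)
print(lengths, linear_total, affine_total)
```

The linear scheme cannot tell one long gap from four short ones, while the affine scheme strongly prefers the single long gap, which is the behaviour the slide describes as more evolutionarily sound.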

24

Gaps: Things to consider

  How large should O and E be?
    Depends on the scoring matrix used
    Normally good values are given
    Too large: No gaps created
    Too small: Too many gaps created
  Should end gaps be penalized?
    Local alignment: No
    Global: Probably

25

Heuristics
  So, now we know all about scoring an alignment
    Let us use Smith-Waterman or Needleman-Wunsch
    Against a database or a full-length genome?
  Bad idea!
  Instead: Heuristic method
    Aims to give a good solution fast
    Not necessarily the optimal alignment

26

The BLAST algorithm

  Basic Local Alignment Search Tool
  You have all used it, but how does it work?
    Used to find statistically significant local alignments between a query and a database
    Many versions (protein-protein, nucleotide-nucleotide, nucleotide-protein …)
    Fast, widely used, good reason to know the algorithm

27

BLAST overview

  The six steps
  1. Filtering of low complexity regions (optional)
  2. Compile list of relevant words
  3. Scan database sequences
  4. Extend hits to HSPs
  5. Calculate E-value for significant hits
  6. Report hits and alignments

28

BLAST step 1

  Filtering
    Some sequences contain low complexity regions
    Give rise to many random hits
    Remember dotplots
    Filter out by replacing with Xs

29

BLAST step 2

  Compiling word list
    Word length L=3 (protein) or L=11 (nucleotides)
    Choose score threshold T (usually 11)
    Find all words of length L in query sequence
      One for each position
    Compare to all possible length L words (8000/~4.2M)
    For each position in query, remove words below T
    Limited to approx. 50 words per position

30

The word list
  APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF…
  APL PLS LSA SAD ADQ DQA…
  Word list, position 1
    APL: APC,APS,APT,APE…
  Remove words w with score(APL,w)<T
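The neighbourhood word list for one query position can be sketched as a brute-force enumeration of all 8000 three-letter words. The scoring function below is a deliberately crude stand-in for a real matrix (4 for identity, 1 within an invented similarity group, -2 otherwise), so the threshold is set to 9 rather than the usual 11; both are assumptions of this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Invented similarity groups, used only to give the toy scores some structure.
TOY_GROUPS = ["AVLIM", "ST", "DE", "KR", "NQ", "FYW"]


def toy_score(a, b):
    """Toy pair score: 4 for identity, 1 within a group, -2 otherwise."""
    if a == b:
        return 4
    if any(a in group and b in group for group in TOY_GROUPS):
        return 1
    return -2


def neighbourhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    kept = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            kept.append("".join(candidate))
    return kept


apl_words = neighbourhood("APL", threshold=9)
print(apl_words)
```

With these toy scores only the word itself and its single conservative substitutions survive; with a real matrix such as BLOSUM62 the list would look different.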

31

BLAST step 3

  Scan database
    Store words for each position in efficient search tree
    Scan each sequence in database
    Remember exact word hits
    If query length is 100, scan for approx. 50·100 hits

32

Scanning

APL APA LPL
IIAPLIGNESNAPAVQTLVGQLPLSHKARG…

Perfect match between word and database is a hit
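The scan can be sketched with the three words and the database sequence from this slide. A plain Python dictionary stands in for the efficient search tree mentioned earlier, and the assumption that all three words belong to query position 0 is made only for this example.

```python
def scan_database_sequence(word_positions, subject):
    """Report every exact occurrence of a listed word in the subject sequence.

    word_positions maps each word to the query positions it was generated for.
    Returns (query_position, subject_position, word) tuples.
    """
    word_len = len(next(iter(word_positions)))
    hits = []
    for s_pos in range(len(subject) - word_len + 1):
        word = subject[s_pos:s_pos + word_len]
        for q_pos in word_positions.get(word, []):
            hits.append((q_pos, s_pos, word))
    return hits


words = {"APL": [0], "APA": [0], "LPL": [0]}
subject = "IIAPLIGNESNAPAVQTLVGQLPLSHKARG"
hits = scan_database_sequence(words, subject)
print(hits)
```

The difference between subject position and query position identifies the diagonal of each hit, which is what the extension step works with.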

33

BLAST step 4
  Extend hits (BLAST2)
    HSP: High-scoring Segment Pair
    Find two hits on same diagonal with distance ≤A
    Connect them (ungapped alignment)
    Extend using gaps, matches and mismatches
    Extension continues while score increases
    Stop when score drops X below highest score
  Original BLAST
    Extend all single word hits (higher T needed)
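The "stop when the score drops X below the highest score" rule can be sketched for the ungapped case, i.e. the single-hit extension of the original BLAST rather than the gapped BLAST2 variant. The sequences, the +2/-3 scoring and X=4 are illustrative assumptions, as are the function names.

```python
def extend_direction(query, subject, q_pos, s_pos, step, score, x_drop):
    """Walk along one diagonal; return (best score gain, residues kept)."""
    running = best = 0
    length = best_length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score(query[q_pos], subject[s_pos])
        length += 1
        if running > best:
            best, best_length = running, length
        elif best - running >= x_drop:
            break  # score dropped X below the best seen so far
        q_pos += step
        s_pos += step
    return best, best_length


def extend_hit(query, subject, q_start, s_start, word_len, score, x_drop):
    """Extend a word hit in both directions; return (score, query start, query end)."""
    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(word_len))
    right, right_len = extend_direction(
        query, subject, q_start + word_len, s_start + word_len, +1, score, x_drop)
    left, left_len = extend_direction(
        query, subject, q_start - 1, s_start - 1, -1, score, x_drop)
    return seed + left + right, q_start - left_len, q_start + word_len + right_len


def match_mismatch(a, b):
    """Toy scoring: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3


query = "AAGACGTTCAGG"
subject = "CCGACGTTCATT"
# The word hit "CGT" starts at position 4 in both sequences.
hsp_score, q_from, q_to = extend_hit(query, subject, 4, 4, 3, match_mismatch, x_drop=4)
print(hsp_score, query[q_from:q_to])
```

The extension keeps the best-scoring prefix in each direction, so the mismatching ends that triggered the stop are not part of the reported segment.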

34

BLAST extension
  Find diagonal hits
  Extend alignment

35

BLAST step 5

  Calculate E-values
    Compile list of HSPs scoring more than S
    Let x be the score of an HSP
    The probability of seeing this score by chance
      P(score≥x)
    The expectation of seeing this score in the database
      $E \approx 1 - e^{-P(\text{score} \ge x) \cdot D}$
    Rather complicated calculations
    The Karlin-Altschul equation (written in many ways)
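One commonly quoted way of writing the Karlin-Altschul statistics is E = K·m·n·e^(-λS), with m and n the query and database lengths, and a bit score S' = (λS - ln K)/ln 2 so that E = m·n·2^(-S'). The sketch below uses that form; the values of λ and K are assumed for illustration only, since the real ones depend on the scoring matrix and gap penalties.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance HSPs with at least this raw score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score that no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e):
    """Probability of at least one chance hit, given the expected number e."""
    return 1.0 - math.exp(-e)


# Assumed parameters for illustration (not values taken from the lecture).
LAMBDA, K = 0.318, 0.134

e_60 = e_value(raw_score=60, query_len=150, db_len=10_000_000, lam=LAMBDA, k=K)
e_80 = e_value(raw_score=80, query_len=150, db_len=10_000_000, lam=LAMBDA, k=K)
bits_60 = bit_score(60, LAMBDA, K)
print(e_60, e_80, bits_60, p_value(e_60))
```

A higher raw score gives an exponentially smaller E-value, and for small E the P-value and the E-value are nearly the same.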

36

BLAST step 6

  Report the results
    List the hits (sorted by E-value)
    Graphical representation
    Show Smith-Waterman alignment of the HSPs

37

Using BLAST

  Settings
    Query sequence
      Sub-sequence
    Database important
    Conserved Domain search: Additional info
    What organisms to search
    Scoring matrix
    Normally leave the rest as defaults
      But you can change, well, anything

38

Understanding BLAST
  Output
    Database searched and query used
    Number of hits
    Color-coded diagram
      Magnitude of score
      Relation between hits (if any)
    Hits sorted by E-value
    Alignments
    Additional info by clicking a link

39

BLAST summary
  Heuristic local alignment
    Finds word matches
    Extends locally
    Might miss optimal solutions
    Fast
  Lower E-value → better result
  Many hits between query and sequence possible
  Remember: Use a proper scoring matrix!

40

Position Specific Scoring Matrix
  Another brief detour: PSSM or Profile
    Useful to represent (part of) a multiple alignment
    Represents the information in a motif
    For each position: What are the frequencies of the characters? (Nucleotides or amino acids)
    Frequencies can give the probabilities
    Find most probable hit to a PSSM in a sequence
    Compare to the background probability (i.e. the overall frequencies)
    Pseudocounts to avoid 0 probabilities

41

PSSM example
  123456789
  ---------
  AAGGTAAGT
  TGTGTGAGT
  CAGGTATAC
  ATGGTAACT
  TAGGTACTG
  TAGGTCATT
  ACAGTCAGT
  CAGGTTGGA
  TCCGTAAGT
  GAGGTAAAC

   |  1  2  3  4  5  6  7  8  9
  -|---------------------------
  A|  3  6  1  0  0  6  7  2  1
  C|  2  2  1  0  0  2  1  1  2
  G|  1  1  7 10  0  1  1  5  1
  T|  4  1  1  0 10  1  1  2  6

  Just divide by 10 to get probabilities
  Assume equal background distribution
    P(A)=P(C)=P(G)=P(T)=0.25

42

PSSM score

  Comparing sequence CAGGTAGTC to the PSSM
  Probability given the model
    0.2·0.6·0.7·1.0·1.0·0.6·0.1·0.2·0.2 = 0.0002016
  Probability given the background
    0.25^9 = 0.000003815
  log-odds score (bits)
    $\log_2 \frac{P_{\text{model}}}{P_{\text{bckg}}} = \log_2 \frac{0.0002016}{0.000003815} = 5.72$
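The PSSM example can be reproduced in a few lines of Python. The ten motif sequences and the candidate CAGGTAGTC are taken from the slides; the function names and the optional `pseudocount` parameter are assumptions added for this sketch.

```python
import math

MOTIF_SEQUENCES = [
    "AAGGTAAGT", "TGTGTGAGT", "CAGGTATAC", "ATGGTAACT", "TAGGTACTG",
    "TAGGTCATT", "ACAGTCAGT", "CAGGTTGGA", "TCCGTAAGT", "GAGGTAAAC",
]


def build_pssm(sequences, alphabet="ACGT", pseudocount=0.0):
    """Per-position character probabilities from equally long aligned sequences."""
    pssm = []
    for pos in range(len(sequences[0])):
        column = [seq[pos] for seq in sequences]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({c: (column.count(c) + pseudocount) / total for c in alphabet})
    return pssm


def log_odds_bits(candidate, pssm, background=0.25):
    """log2 of P(candidate | PSSM) over P(candidate | uniform background)."""
    p_model = math.prod(column[c] for column, c in zip(pssm, candidate))
    p_background = background ** len(candidate)
    return math.log2(p_model / p_background)


pssm = build_pssm(MOTIF_SEQUENCES)
score = log_odds_bits("CAGGTAGTC", pssm)
print(round(score, 2))  # 5.72

# With a pseudocount no probability is exactly zero, so any candidate can be scored.
smoothed = build_pssm(MOTIF_SEQUENCES, pseudocount=1.0)
```

Without pseudocounts, a candidate with a character never seen at some position has model probability 0 and the logarithm is undefined, which is why the smoothed variant is included.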

43

PSI-BLAST

  Position-Specific-Iterated BLAST
    Designed to find distant homologs
    More sensitive than BlastP
    Use PSSM instead of just sequence comparison

44

The PSI-BLAST algorithm

  1. Perform standard BlastP search
  2. Create PSSM based on query and best hits
  3. Search database again for hits to the PSSM
  4. Incorporate new hits (i.e. update frequencies)
  5. Iterate (repeat) until convergence
  End result: More distant homologs found
  Slower than standard BLAST (of course)

45

PSI-BLASTing a protein
  Output at first iteration, 2nd iteration, …
  When to stop?
    In this case: 7 iterations
PSI-BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

46

Other BLAST versions
  MegaBLAST
    Assumes high similarity, longer sequences
  BLAST2SEQUENCES
    Blast on only two sequences (local alignments)
  PHI-BLAST
    Pattern Hit Initiated BLAST – searches for motif
  WU-BLAST
    BLAST from Washington University, not NCBI

47

BLAST variants
  Nucleotide vs nucleotide: blastn
  Protein vs protein: blastp
  Translated sequence vs protein database: blastx
  Protein sequence vs translated database: tblastn
  Translated sequence vs translated database: tblastx
  Find divergent protein sequences: PSI-BLAST

48

The BLAT algorithm

  BLAST-Like Alignment Tool
    Designed for the genome projects
    Local alignments between long sequences
    Speed important!
    BLAST turned upside-down

49

What BLAT does

  How does it work?
  1. Index the database (length 11 words)
  2. Find hits in the query
       Opposite the BLAST strategy
  3. Extend hits to HSPs
  Useful when the database does not often change
    Too time (and space) consuming for normal BLAST
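The "upside-down" strategy, indexing the database once and then looking up query words, can be sketched as follows. The word length of 4 and the toy sequences are assumptions that keep the example small; the sketch indexes every overlapping word, which is a simplification and not a description of the actual BLAT implementation.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every length-k word in the database to its start positions (done once)."""
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Look up each query word in the prebuilt index; return (query_pos, db_pos) pairs."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        for db_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, db_pos))
    return hits


database = "TTGACCGATTACAGG"
index = build_index(database, k=4)          # reusable for every later query
hits = find_hits("CGATTACA", index, k=4)
diagonals = {db_pos - q_pos for q_pos, db_pos in hits}
print(hits, diagonals)
```

Because the index depends only on the database, its cost is paid once, which fits the remark that the approach is useful when the database does not often change.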

50

The FASTA algorithm

  Similar to BLAST … but different
    Fast All (since it works with all alphabets)
    Not as widely used as BLAST – but one of the first
    Works in a step-wise fashion
    Locate word hits, extend using Smith-Waterman

51

The steps in FASTA

  1. Find word hits
  2. Score the hits and trim results
  3. Join regions of similarity
  4. Find the best alignment

52

FASTA step 1

  Word hits
    Choose word size ktup (2 for protein, 4 or 6 for DNA)
    Create two word lists: Query and database
    Find all words that occur in both
    Connect nearby hits directly (i.e. no gaps)
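Finding shared ktup words and grouping them by diagonal can be sketched like this. The two toy protein sequences are invented, and for brevity the sketch builds a word list only for the query and scans the subject against it, which finds the same shared words as comparing two lists.

```python
from collections import Counter, defaultdict


def word_positions(seq, ktup):
    """Map every ktup-long word in seq to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        positions[seq[i:i + ktup]].append(i)
    return positions


def diagonal_hit_counts(query, subject, ktup):
    """Count exact word hits per diagonal (subject position minus query position)."""
    query_words = word_positions(query, ktup)
    counts = Counter()
    for s_pos in range(len(subject) - ktup + 1):
        for q_pos in query_words.get(subject[s_pos:s_pos + ktup], []):
            counts[s_pos - q_pos] += 1
    return counts


# The subject carries an extra residue after MKVL, so the hits split over two diagonals.
counts = diagonal_hit_counts("MKVLAAGIV", "TTMKVLSSAGIVP", ktup=2)
print(counts)
```

Hits on the same diagonal can be connected without gaps; hits on neighbouring diagonals, as in this example, are what the later joining step has to connect with a gap.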

53

Word hits
  Find all hits
  Connect hits on same diagonal

54

FASTA step 2

  Score and trim
    Only keep the best 10 segments from step 1
    Re-evaluate all hits using PAM250
    For each hit: Note the best score

55

Score and trim
  Keep 10 best hits
  Recalculate scores

56

FASTA step 3

  Join regions
    All regions scoring over a threshold are kept
    Crudely join regions
    Add linear gap penalty for joining to diagonals
    Keep the best scoring rough alignments
    Removes unlikely similar regions

57

Join and penalize
  Keep the good regions
  Connect using gaps
  Remove the low scoring

58

FASTA step 4

  Optimize alignment
    Consider the good but crude alignments
    Reoptimize using Smith-Waterman
    Only a small part of the matrix is needed
    Banded version (much faster)
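A banded Smith-Waterman can be sketched by filling only the cells within a fixed distance of a chosen diagonal. This is a minimal score-only version with a linear gap penalty; the scoring values, the band parameters and the toy sequences are assumptions for illustration, and cells outside the band are simply treated as score 0.

```python
def banded_smith_waterman(a, b, diagonal, width, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(j - i) - diagonal| <= width."""
    best = 0
    previous = {}  # column index -> score in the previous row (band cells only)
    for i in range(1, len(a) + 1):
        current = {}
        low = max(1, i + diagonal - width)
        high = min(len(b), i + diagonal + width)
        for j in range(low, high + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,
                previous.get(j - 1, 0) + pair,  # align a[i-1] with b[j-1]
                previous.get(j, 0) + gap,       # gap in b
                current.get(j - 1, 0) + gap,    # gap in a
            )
            current[j] = score
            best = max(best, score)
        previous = current
    return best


query, subject = "MKVLAAGIV", "TTMKVLSSAGIVP"
narrow = banded_smith_waterman(query, subject, diagonal=2, width=0)
wide = banded_smith_waterman(query, subject, diagonal=2, width=1)
print(narrow, wide)
```

With a band of width 0 only the ungapped match on one diagonal is found; widening the band by one lets the alignment cross to the neighbouring diagonal through a single gap, while still filling far fewer cells than the full matrix.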

59

Optimize alignment
  Optimize crude alignment
  Use banded Smith-Waterman

60

FASTA results
  Histogram of scores
    Actual scores versus expected scores
  Optimal alignments
  E-value representing the probability of the hit

61

BLAST vs. FASTA

  Differences between the two
    BLAST searches the neighbourhood
    FASTA looks for exact matches
    BLAST returns all the best hits in a sequence
    FASTA returns one hit per sequence
    BLAST is faster than FASTA
    FASTA produces better final alignment

62

Summary
  Introduction to Markov chains and HMMs
  Scoring matrices: PAM and BLOSUM
    Larger PAM → Greater distance
    Larger BLOSUM → Greater identity
  Gap penalties
  BLAST, PSI-BLAST etc.
  BLAT
  FASTA

Next time: Multiple alignment
