Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA

Introduction to bioinformatics

Stinus Lindgreen [email protected]

Bioinformatics Centre, University of Copenhagen

Outline of the lecture
- Scoring an alignment: BLOSUM and PAM
- Scoring matrices for nucleotides
- Treatment of gaps
- The BLAST algorithm
- Position Specific Scoring Matrices
- PSI-BLAST and other variants
- BLAT
- FASTA

Definition of scoring system
- To build an alignment we need a scoring system
- Aligning two different residues indicates a substitution
  - How likely is it to replace A with B?
  - A likely alignment should receive a large score
- Create a substitution matrix
  - A 20×20 symmetrical table (matrix)
  - Look up the score for aligning two amino acids
  - The diagonal represents conservation

Considerations
- Conservative substitutions
  - Prefer changes that respect structure and function
  - E.g. hydrophobicity, size, chemical properties etc.
- Frequencies
  - How often do specific residues occur?
  - Scores should weigh the rare ones higher
- Evolution
  - How distant are the sequences?
  - Ancient divergence → more changes

The creation
- Approach 1: Make an evolutionary model
  - Use closely related sequences
  - Extrapolate to greater distances
  - Done in the PAM family
- Approach 2: Look at related sequences
  - Observe actual mutations in motifs
  - Use sets of different overall identity
  - Done in the BLOSUM family

The PAM250 matrix [1]

[1] M.O. Dayhoff: Survey of new data and computer methods of analysis (1978), Atlas of Protein Sequence and Structure, 5:3

The PAM family
- Defining characteristics
  - PAM: Point (or percent) accepted mutations
  - One PAM unit: 1 change per 100 residues (~10 million years)
  - Not the probability that a residue A is something else
    - What would PAM250 mean in that case?
  - But: the number of expected changes overall
    - Including changes A → B → A

Construction of PAM matrices
- The general idea:
  - Assume evolution is independent
  - Describe "one step of evolution"
  - The next step will follow
- PAMx: Scoring matrix for x evolutionary steps
  - Very similar sequences: Low x
  - Very divergent sequences: Large x
- What is an evolutionary step?

Evolutionary model
- Assume independence between positions over time
- P(A→B) = P(B→A)
- Consequences:
  - How a sequence looks tomorrow depends ONLY on how it looks today
    - Not on how it looked yesterday
    - Known as a Markov chain
  - The context does not matter
  - Evolution is "reversible"

Building PAM1
1. Find set of very similar sequences (at least 85% ID)
2. Make global alignment (71 groups)
3. Count the substitutions (1572 changes)
4. Calculate weights

From counts to scores
- The simplified version
  - Count mutation frequencies
    - The respective probabilities
  - Normalize by mutability of respective amino acid
    - Makes comparisons possible
  - Multiply to wanted evolutionary distance x
    - Still probabilities
  - Calculate log-odds score for (A→B + B→A)/2
  - You have the PAMx matrix!

Why log-odds?
- Summation instead of multiplication
  - log(A·B) = log(A) + log(B)
- S>0: More often than by chance
- S=0: Number expected by chance
- S<0: Less often than by chance

$$S_{i,j} = \log \frac{q_{i,j}}{p_i \, p_j}$$

Here q_{i,j} is the observed frequency of the pair (i, j) and p_i·p_j is the frequency expected by chance.

The PAM1 matrix
- Highly identical sequences
- 1 evolutionary step
- PAM1 describes likelihood of changes
  - Diagonal close to 1, off-diagonal close to 0
- PAM2 = PAM1×PAM1
- PAM3 = PAM1×PAM2
- …
- Markov chain property (independence)
  - Greater divergence through matrix multiplication
- Large PAM: More even values
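The repeated multiplication can be sketched in a few lines of Python. This is a minimal illustration on a made-up 3-letter alphabet, not the real 20×20 Dayhoff data: the matrix `toy_pam1` and its values are assumptions chosen only so that each row sums to 1 and the diagonal is close to 1.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, x):
    """Return m multiplied by itself so that the result is m^x (x >= 1)."""
    result = m
    for _ in range(x - 1):
        result = matmul(result, m)
    return result


# Hypothetical one-step transition probabilities for a toy 3-letter alphabet.
# Each row sums to 1 and the diagonal dominates, as described for PAM1.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = matrix_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After many steps the rows are still probability distributions, but the values are much more even than in the one-step matrix, which is the "Large PAM: More even values" point above. Turning these probabilities into a scoring matrix would still require the log-odds step.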

PAM problems
- Good and bad things
  - Based on evolution
  - Assumes independence
  - Global data → evolution of proteins
  - Based on small dataset
  - Extrapolates to greater divergence → information loss & error growth
  - Evolutionary model partly verified
    - But still simplified…
  - Mutations happen at the nucleotide level

The BLOSUM matrix [2]

[2] S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from protein blocks (1992), Proc. Natl. Acad. Sci. USA, 89:10915–10919

The BLOSUM family
- Look at conserved motifs in proteins
  - Scan databases for motifs
  - No gaps in alignment → a BLOCK
- One protein can contain many motifs
  - Make groups of motifs
- ~2000 blocks, >500 protein families
  - The BLOCKS database
  - BLOSUM: BLOCKS substitution matrices

The BLOSUM idea
- Don't extrapolate data
- Collect divergent dataset
- Observe mutations directly
- Local alignment of sequences
- Evolution treated implicitly
- Still assumes independence between sites

Building BLOSUMx
Substitution scores for sequences of x %ID
1. Create sets of blocks of x %ID
2. Very similar sequences are grouped and weighted
   - Avoid bias due to numbers
3. Count the substitutions
4. Calculate log-odds scores for all pairs
5. You have the BLOSUMx matrix!
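Steps 3 and 4 can be sketched on a tiny example. The code below is a simplified illustration, assuming a single hypothetical block (`toy_block`) and skipping the grouping and weighting of step 2. It counts residue pairs per column and scores them as 2·log2(observed/expected), rounded to integers, with the expected frequency taken as p_i·p_i for identical pairs and 2·p_i·p_j for different ones.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Return {(a, b): integer score} computed from one ungapped block."""
    # Step 3: count every residue pair within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Step 4: log-odds of observed versus expected pair frequency.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


toy_block = ["ASAC", "ASAC", "ASAC", "ASSC"]  # hypothetical 4-sequence block
scores = blosum_like_scores(toy_block)
print(scores)
```

Pairs that never occur in the block (for example A/C here) get no entry, because their log-odds would be undefined. A real matrix is built from thousands of blocks, where this is far less of an issue.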

Some general notes
- BLOSUM62 has proved to work for a range of similarities
- Other values (e.g. 30, 45 and 80) also used
- Today: BLOSUM used more than PAM

BLOSUM problems
- Good and bad things
  - Evolution not considered explicitly
  - Known related proteins
  - Assumes star-shaped phylogeny
  - Based on conserved blocks / local alignment

The matrix matters
- Example: BLAST using BLOSUM45 and BLOSUM80
  - Note number of hits
  - Note difference in scores

NCBI BLAST

>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

Nucleotide scoring matrices
- Normally just simple 1/-1 cost scheme
- Problematic simplification
  - Especially for structural sequences (e.g. ncRNA)
- Consider evolution (PAM-like matrix)
- Use actual frequencies of nucleotides as background
- Weigh transitions and transversions differently
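As a small illustration of the last point, a nucleotide scoring function can penalize transversions (purine ↔ pyrimidine) harder than transitions (purine ↔ purine or pyrimidine ↔ pyrimidine). The numeric scores below are arbitrary assumptions for the sketch, not values from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair; the default values are illustrative."""
    if a == b:
        return match
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return transition if same_class else transversion


def score_ungapped(seq1, seq2):
    """Sum the pair scores of two equally long, ungapped sequences."""
    return sum(nucleotide_score(a, b) for a, b in zip(seq1, seq2))


print(nucleotide_score("A", "G"))        # transition
print(nucleotide_score("A", "C"))        # transversion
print(score_ungapped("ACGT", "GCGA"))
```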

Gap penalties
- We can score pairs of nucleotides/amino acids
- What about indels and gaps?
- Two main strategies
  - Linear gap penalty: G(n)=a·n
    - Penalize each gap position the same (cf. last lecture)
  - Affine gap penalty: G(n)=O+E·n
    - More evolutionarily sound
    - Opening a gap costs more than extending it
    - Standard today
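The difference between the two schemes shows up when one long gap is compared with several short ones. The sketch below uses the two formulas exactly as written above; the parameter values (a=2, O=10, E=1) are assumptions picked for illustration.

```python
def linear_gap_penalty(n, a=2):
    """G(n) = a*n: every gap position costs the same."""
    return a * n


def affine_gap_penalty(n, opening=10, extension=1):
    """G(n) = O + E*n: one opening cost plus a per-position extension cost."""
    return opening + extension * n if n > 0 else 0


# One gap of length 6 versus three separate gaps of length 2.
print(linear_gap_penalty(6), 3 * linear_gap_penalty(2))
print(affine_gap_penalty(6), 3 * affine_gap_penalty(2))
```

With the linear penalty both layouts cost the same. With the affine penalty the single long gap is cheaper, which matches the idea that one indel event is more plausible than several independent ones.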

Gaps: Things to consider
- How large should O and E be?
  - Depends on the scoring matrix used
  - Normally good values are given
  - Too large: No gaps created
  - Too small: Too many gaps created
- Should end gaps be penalized?
  - Local alignment: No
  - Global: Probably

Heuristics
- So, now we know all about scoring an alignment
- Let us use Smith-Waterman or Needleman-Wunsch
  - Against a database or a full-length genome?
  - Bad idea!
- Instead: Heuristic method
  - Designed to give a good solution fast
  - Not necessarily the optimal alignment

The BLAST algorithm
- Basic Local Alignment Search Tool
- You have all used it, but how does it work?
- Used to find statistically significant local alignments between a query and a database
- Many versions (protein-protein, nucleotide-nucleotide, nucleotide-protein …)
- Fast, widely used, good reason to know the algorithm

BLAST overview
The six steps
1. Filtering of low complexity regions (optional)
2. Compile list of relevant words
3. Scan database sequences
4. Extend hits to HSPs
5. Calculate E-value for significant hits
6. Report hits and alignments

BLAST step 1
- Filtering
  - Some sequences contain low complexity regions
  - Give rise to many random hits
  - Remember dotplots
  - Filter out by replacing with Xs

BLAST step 2
- Compiling word list
  - Word length L=3 (protein) or L=11 (nucleotides)
  - Choose score threshold T (usually 11)
  - Find all words of length L in query sequence
    - One for each position
  - Compare to all possible length L words (8000/~4.2M)
  - For each position in query, remove words below T
  - Limited to approx. 50 words per position

The word list
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF…
APL PLS LSA SAD ADQ DQA…
- Word list, position 1 (APL): APC, APS, APT, APE…
- Remove words w with score(APL,w)<T
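The word-list construction can be sketched as follows. To stay self-contained, the code uses a made-up scoring function (`toy_score`, with a hypothetical `SIMILAR` set of residue pairs) instead of a real matrix such as BLOSUM62, so the resulting neighbourhood differs from the APC, APS, APT, APE… example above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical "similar" residue pairs standing in for a real scoring matrix.
SIMILAR = {frozenset("LI"), frozenset("LV"), frozenset("LM"),
           frozenset("ST"), frozenset("AS")}


def toy_score(a, b):
    """Toy pair score: 5 for identity, 2 for similar residues, -1 otherwise."""
    if a == b:
        return 5
    return 2 if frozenset((a, b)) in SIMILAR else -1


def word_score(w1, w2):
    return sum(toy_score(a, b) for a, b in zip(w1, w2))


def neighbourhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    candidates = ("".join(chars)
                  for chars in product(AMINO_ACIDS, repeat=len(word)))
    return [w for w in candidates if word_score(word, w) >= threshold]


def query_word_list(query, length=3, threshold=11):
    """Map each query position to its list of neighbourhood words."""
    return {i: neighbourhood(query[i:i + length], threshold)
            for i in range(len(query) - length + 1)}


words = query_word_list("APLSADQ")
print(words[0])  # neighbourhood of the first word, APL
```

Enumerating all 8000 candidate words per position is affordable for L=3. The cap of about 50 words per position mentioned earlier is not enforced in this sketch.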

BLAST step 3
- Scan database
  - Store words for each position in efficient search tree
  - Scan each sequence in database
  - Remember exact word hits
  - If query length is 100, scan for approx. 50·100 hits

Scanning
APL APA LPL
IIAPLIGNESNAPAVQTLVGQLPLSHKARG…
- Perfect match between word and database is a hit
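The scan can be illustrated with the three words and the database fragment from this slide. As a simplifying assumption the sketch stores the words in a Python set (a hash lookup) rather than the search tree mentioned for step 3; the function name is hypothetical.

```python
def scan_database_sequence(words, db_seq, length=3):
    """Return (word, position) for every exact occurrence of a listed word."""
    wanted = set(words)
    hits = []
    for pos in range(len(db_seq) - length + 1):
        window = db_seq[pos:pos + length]
        if window in wanted:
            hits.append((window, pos))
    return hits


hits = scan_database_sequence(
    ["APL", "APA", "LPL"],
    "IIAPLIGNESNAPAVQTLVGQLPLSHKARG",
)
print(hits)  # 0-based start positions of the exact word hits
```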

BLAST step 4
- Extend hits (BLAST2)
  - HSP: High-scoring Segment Pair
  - Find two hits on same diagonal with distance ≤A
  - Connect them (ungapped alignment)
  - Extend using gaps, matches and mismatches
  - Extension continues while score increases
  - Stop when score drops X below highest score
- Original BLAST
  - Extend all single word hits (higher T needed)
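The stop rule ("score drops X below highest score") can be sketched for the simpler ungapped case. This is an illustration of the X-drop idea only, extending in one direction without gaps; the function name and the `score_fn` argument are assumptions, and real BLAST extends in both directions and also performs gapped extension.

```python
def extend_right(query, subject, q_start, s_start, score_fn, x_drop):
    """Extend an ungapped alignment to the right with an X-drop stop rule.

    Returns (best_score, best_length) for the extension that starts at
    q_start in the query and s_start in the subject.
    """
    score = best_score = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += score_fn(query[q_start + i], subject[s_start + i])
        i += 1
        if score > best_score:
            best_score, best_len = score, i
        elif best_score - score >= x_drop:
            break  # fallen X below the best score seen so far
    return best_score, best_len


def simple(a, b):
    """Illustrative +1/-1 scoring."""
    return 1 if a == b else -1


print(extend_right("AAAATTTT", "AAAACCCC", 0, 0, simple, x_drop=2))
print(extend_right("AAGAAA", "AATAAA", 0, 0, simple, x_drop=3))
```

In the second call the single mismatch lowers the running score by less than X, so the extension continues through it and ends with a higher score.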

BLAST extension
- Find diagonal hits
- Extend alignment

BLAST step 5
- Calculate E-values
  - Compile list of HSPs scoring more than S
  - Let x be the score of an HSP
  - The probability of seeing this score by chance: P(score≥x)
  - The expectation of seeing this score in the database:

$$E \approx 1 - e^{-P(\text{score} \ge x)\,D}$$

- Rather complicated calculations
  - The Karlin-Altschul equation (written in many ways)
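One common way of writing the Karlin-Altschul equation is E = K·m·n·e^(−λS), where m and n are the query and database lengths and S is the raw score. The sketch below evaluates that form together with the corresponding bit score; the values used for λ and K are placeholders chosen for illustration, since the real ones depend on the scoring matrix and gap penalties.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of HSPs with at least this score: K*m*n*exp(-lambda*S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalised score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e):
    """Probability of at least one such HSP by chance: 1 - exp(-E)."""
    return 1.0 - math.exp(-e)


LAMBDA, K = 0.3, 0.1  # placeholder parameters, not taken from any real matrix
e = e_value(50, query_len=100, db_len=1_000_000, lam=LAMBDA, k=K)
print(e, bit_score(50, LAMBDA, K), p_value(e))
```

With bit scores the same E-value can be written as m·n·2^(−S'), which is why bit scores are comparable across scoring systems.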

BLAST step 6
- Report the results
  - List the hits (sorted by E-value)
  - Graphical representation
  - Show Smith-Waterman alignment of the HSPs

Using BLAST
- Settings
  - Query sequence
    - Sub-sequence
  - Database important
  - Conserved Domain search: Additional info
  - What organisms to search
  - Scoring matrix
  - Normally leave the rest as defaults
    - But you can change, well, anything

Understanding BLAST
- Output
  - Database searched and query used
  - Number of hits
  - Color-coded diagram
    - Magnitude of score
    - Relation between hits (if any)
  - Hits sorted by E-value
  - Alignments
  - Additional info by clicking a link

BLAST summary
- Heuristic local alignment
  - Finds word matches
  - Extends locally
  - Might miss optimal solutions
- Fast
- Lower E-value → better result
- Many hits between query and sequence possible
- Remember: Use a proper scoring matrix!

Position Specific Scoring Matrix
- Another brief detour: PSSM or Profile
- Useful to represent (part of) a multiple alignment
- Represents the information in a motif
- For each position: What are the frequencies of the characters? (Nucleotides or amino acids)
- Frequencies can give the probabilities
- Find most probable hit to a PSSM in a sequence
- Compare to the background probability (i.e. the overall frequencies)
- Pseudocounts to avoid 0 probabilities

PSSM example
123456789
---------
AAGGTAAGT
TGTGTGAGT
CAGGTATAC
ATGGTAACT
TAGGTACTG
TAGGTCATT
ACAGTCAGT
CAGGTTGGA
TCCGTAAGT
GAGGTAAAC

 |  1  2  3  4  5  6  7  8  9
-|---------------------------
A|  3  6  1  0  0  6  7  2  1
C|  2  2  1  0  0  2  1  1  2
G|  1  1  7 10  0  1  1  5  1
T|  4  1  1  0 10  1  1  2  6

- Just divide by 10 to get probabilities
- Assume equal background distribution: P(A)=P(C)=P(G)=P(T)=0.25

PSSM score
- Comparing sequence CAGGTAGTC to the PSSM
- Probability given the model: 0.2·0.6·0.7·1.0·1.0·0.6·0.1·0.2·0.2 = 0.0002016
- Probability given the background: 0.25^9 = 0.000003815
- log-odds score (bits):

$$\log_2 \frac{P_{\text{model}}}{P_{\text{bckg}}} = \log_2 \frac{0.0002016}{0.000003815} \approx 5.72$$
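The PSSM example and its score can be reproduced with a short script. The ten sites are those from the example; the function names are illustrative, and no pseudocounts are added, so a sequence containing a residue with frequency 0 at some position cannot be scored by this sketch.

```python
import math

SITES = [
    "AAGGTAAGT", "TGTGTGAGT", "CAGGTATAC", "ATGGTAACT", "TAGGTACTG",
    "TAGGTCATT", "ACAGTCAGT", "CAGGTTGGA", "TCCGTAAGT", "GAGGTAAAC",
]


def build_pssm(sites):
    """Per-position relative frequencies of A, C, G, T (no pseudocounts)."""
    n = len(sites)
    return [{base: column.count(base) / n for base in "ACGT"}
            for column in zip(*sites)]


def log_odds_bits(seq, pssm, background=0.25):
    """log2 of P(seq | PSSM) over P(seq | uniform background)."""
    p_model = math.prod(pssm[i][base] for i, base in enumerate(seq))
    p_background = background ** len(seq)
    return math.log2(p_model / p_background)


pssm = build_pssm(SITES)
print(round(log_odds_bits("CAGGTAGTC", pssm), 2))  # 5.72
```

Sliding this scoring function along a longer sequence and keeping the highest-scoring window is one way to find the most probable hit to the PSSM.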

PSI-BLAST
- Position-Specific-Iterated BLAST
- Designed to find distant homologs
- More sensitive than BlastP
- Use PSSM instead of just sequence comparison

The PSI-BLAST algorithm
1. Perform standard BlastP search
2. Create PSSM based on query and best hits
3. Search database again for hits to the PSSM
4. Incorporate new hits (i.e. update frequencies)
5. Iterate (repeat) until convergence
- End result: More distant homologs found
- Slower than standard BLAST (of course)

PSI-BLASTing a protein
- Output at first iteration, 2nd iteration, …
- When to stop?
  - In this case: 7 iterations

PSI-BLAST

>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

Other BLAST versions
- MegaBLAST
  - Assumes high similarity, longer sequences
- BLAST2SEQUENCES
  - BLAST on only two sequences (local alignments)
- PHI-BLAST
  - Pattern Hit Initiated BLAST: searches for motif
- WU-BLAST
  - BLAST from Washington University, not NCBI

BLAST variants
- Nucleotide vs nucleotide: blastn
- Protein vs protein: blastp
- Translated sequence vs protein database: blastx
- Protein sequence vs translated database: tblastn
- Translated sequence vs translated database: tblastx
- Find divergent protein sequences: PSI-BLAST

The BLAT algorithm
- BLAST-Like Alignment Tool
- Designed for the genome projects
- Local alignments between long sequences
- Speed important!
- BLAST turned upside-down

What BLAT does
How does it work?
1. Index the database (length 11 words)
2. Find hits in the query
   - Opposite the BLAST strategy
3. Extend hits to HSPs
- Useful when the database does not often change
- Too time (and space) consuming for normal BLAST
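Steps 1 and 2 can be sketched as an index built once over the database and then probed with the query's words. This is a simplified illustration: the function names are hypothetical, every overlapping k-mer is indexed, and the usage example uses k=4 on short made-up sequences so the result is easy to follow (the slide's word length is 11).

```python
from collections import defaultdict


def index_database(db_seq, k=11):
    """Map every k-mer in the database sequence to its start positions."""
    index = defaultdict(list)
    for pos in range(len(db_seq) - k + 1):
        index[db_seq[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k=11):
    """Return (query_pos, db_pos) pairs where a query k-mer is in the index."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        for db_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, db_pos))
    return hits


# Toy data with k=4 so the hits are easy to check by eye.
db_index = index_database("ACGTACGTTTGACCA", k=4)
blat_hits = find_hits("GGACGTTTGG", db_index, k=4)
print(blat_hits)
```

Building the index is the expensive part, which is why the approach pays off when the database rarely changes and many queries are run against it.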

The FASTA algorithm
- Similar to BLAST … but different
- Fast All (since it works with all alphabets)
- Not as widely used as BLAST, but one of the first
- Works in a step-wise fashion
- Locate word hits, extend using Smith-Waterman

The steps in FASTA
1. Find word hits
2. Score the hits and trim results
3. Join regions of similarity
4. Find the best alignment

FASTA step 1
- Word hits
  - Choose word size ktup (2 for protein, 4 or 6 for DNA)
  - Create two word lists: Query and database
  - Find all words that occur in both
  - Connect nearby hits directly (i.e. no gaps)

Word hits
- Find all hits
- Connect hits on same diagonal
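Finding the word hits and grouping them by diagonal can be sketched as below. A hit at query position i and subject position j lies on diagonal j − i, so hits sharing that value can be connected without gaps. The function name and the toy sequences are assumptions made for the illustration.

```python
from collections import defaultdict


def ktup_hits_by_diagonal(query, subject, ktup=2):
    """Group exact k-tuple matches by diagonal (subject_pos - query_pos)."""
    # Word list for the query: word -> start positions.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Look up every subject word and record the diagonal of each hit.
    diagonals = defaultdict(list)
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i].append((i, j))
    return diagonals


diagonals = ktup_hits_by_diagonal("ACDEF", "XXACDEFYY", ktup=2)
print(dict(diagonals))
```

Here all four hits fall on the same diagonal, so they would be connected into one ungapped region.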

FASTA step 2
- Score and trim
  - Only keep the best 10 segments from step 1
  - Re-evaluate all hits using PAM250
  - For each hit: Note the best score

Score and trim
- Keep 10 best hits
- Recalculate scores

FASTA step 3
- Join regions
  - All regions scoring over a threshold are kept
  - Crudely join regions
  - Add linear gap penalty for joining two diagonals
  - Keep the best scoring rough alignments
  - Removes unlikely similar regions

Join and penalize
- Keep the good regions
- Connect using gaps
- Remove the low scoring

FASTA step 4
- Optimize alignment
  - Consider the good but crude alignments
  - Reoptimize using Smith-Waterman
  - Only a small part of the matrix is needed
  - Banded version (much faster)
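A banded Smith-Waterman fills in only the cells close to a chosen diagonal. The sketch below is a simplified illustration, assuming a linear gap penalty and hypothetical parameter names (`diagonal`, `band`); for readability it allocates the full matrix and merely skips the cells outside the band, whereas a real implementation would also store only the band.

```python
def banded_smith_waterman(a, b, score_fn, gap=2, diagonal=0, band=3):
    """Best local alignment score using cells with |(j - i) - diagonal| <= band.

    Cells outside the band keep score 0, so an alignment can never be
    extended through them.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        lo = max(1, i + diagonal - band)
        hi = min(cols - 1, i + diagonal + band)
        for j in range(lo, hi + 1):
            h[i][j] = max(
                0,
                h[i - 1][j - 1] + score_fn(a[i - 1], b[j - 1]),  # (mis)match
                h[i - 1][j] - gap,                               # gap in b
                h[i][j - 1] - gap,                               # gap in a
            )
            best = max(best, h[i][j])
    return best


def match_mismatch(x, y):
    """Illustrative +2/-1 scoring."""
    return 2 if x == y else -1


print(banded_smith_waterman("ACGTACGT", "ACGTACGT", match_mismatch, band=1))
print(banded_smith_waterman("AAAACCCC", "CCCCAAAA", match_mismatch, band=1))
print(banded_smith_waterman("AAAACCCC", "CCCCAAAA", match_mismatch, band=8))
```

The last two calls show the trade-off: a narrow band around the wrong diagonal misses the good alignment that the full matrix finds, which is why FASTA first uses the crude alignment from step 3 to decide where the band should lie.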

Optimize alignment
- Optimize crude alignment
- Use banded Smith-Waterman

FASTA results
- Histogram of scores
  - Actual scores versus expected scores
- Optimal alignments
- E-value: the number of hits with this score expected by chance

BLAST vs. FASTA
- Differences between the two
  - BLAST searches the neighbourhood
  - FASTA looks for exact matches
  - BLAST returns all the best hits in a sequence
  - FASTA returns one hit per sequence
  - BLAST is faster than FASTA
  - FASTA produces better final alignment

Summary
- Introduction to Markov chains and HMMs
- Scoring matrices: PAM and BLOSUM
  - Larger PAM → Greater distance
  - Larger BLOSUM → Greater identity
- Gap penalties
- BLAST, PSI-BLAST etc.
- BLAT
- FASTA

Next time: Multiple alignment