Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x...

Post on 12-Jan-2016

222 views 0 download

Tags:

Transcript of Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x...

Evolution and Scoring Rules

Example Score = 5 x (# matches) + (-4) x (# mismatches) +

+ (-7) x (total length of all gaps)

Example Score = 5 x (# matches) + (-4) x (# mismatches) +

+ (-5) x (# gap openings) + (-2) x (total length of all gaps)

Scoring Matrices

Scoring Rules vs. Scoring Matrices Nucleotide vs. Amino Acid Sequence The choice of a scoring rule can strongly

influence the outcome of sequence analysis Scoring matrices implicitly represent a

particular theory of evolution Elements of the matrices specify the

similarity of one residue to another

DNA: A T G C

1:1

RNA: A U G C

3:1

Protein: 20 amino acids

Transcription

Translation

Replication

Translation - Protein Synthesis: Every 3 nucleotides (codon) are translated into one amino acid

Nucleotide sequence determines the amino acid sequence

Translation - Protein Synthesis

5’ -> 3’ : N-term -> C-term RNA Protein

Log Likelihoods used as Scoring Matrices:

PAM - % Accepted Mutations:1500 changes in 71 groups w/ > 85%

similarity

BLOSUM – Blocks Substitution Matrix:2000 “blocks” from 500 families

Log Likelihoods used as Scoring Matrices:

BLOSUM

ji

ijij pp

pS 2log2

Likelihood Ratio for Aligning a Single Pair of Residues

•Above: the probability that two residues are aligned by evolutionary descent

•Below: the probability that they are aligned by chance

•Pi, Pj are frequencies of residue i and j in all protein sequences (abundance)

ji

ijij pp

p

ji

jiS log

chance)by | withalignedPr(

ancestry)common | withalignedPr(log

Likelihood Ratio of Aligning Two Sequences

tjsiij

ji

ij

ji

ij Spp

p

pp

p

ji

ji

,

loglog

chance)by | withalignedPr(

ancestry)common | withalignedPr(log

)chanceby |alignmentPr(

)ancestrycommon |alignmentPr(log

alignment of ratiolik log

The alignment score of aligning two sequences is the log likelihood ratio of the alignment under two models Common ancestry By chance

PAM and BLOSUM matrices are all log likelihood matrices

More specificly: An alignment that scores 6 means

that the alignment by common ancestry is 2^(6/2)=8 times as likely as expected by chance.

ji

ijij pp

pS 2log2

BLOSUM matrices for Protein

S. Henikoff and J. Henikoff (1992). “Amino acid substitution matrices from protein blocks”. PNAS 89: 10915-10919

Training Data: ~2000 conserved blocks from BLOCKS database. Ungapped, aligned protein segments. Each block represents a conserved region of a protein family

Constructing BLOSUM Matrices of Specific Similarities

Sets of sequences have widely varying similarity. Sequences with above a threshold similarity are clustered.

If clustering threshold is 62%, final matrix is BLOSUM62

A toy example of constructing a BLOSUM matrix from 4

training sequences

Constructing a BLOSUM matr.1. Counting mutations

Constructing a BLOSUM matr.2. Tallying mutation frequencies

Constructing a BLOSUM matr.3. Matrix of mutation probs.

4. Calculate abundance of each residue (Marginal prob)

5. Obtaining a BLOSUM matrix

Constructing the real BLOSUM62 Matrix

1.2.3.Mutation Frequency Table

1000ijP

4. Calculate Amino Acid Abundance

acid aminoeach of likelihood marginal the: ip

5. Obtaining BLOSUM62 Matrix

ji

ijij pp

pS 2log2

PAM Matrices (Point Accepted Mutations)

Mutations accepted by natural selection

PAM Matrices Accepted Point Mutation Atlas of Protein Sequence and Structure,

Suppl 3, 1978, M.O. Dayhoff.

ed. National Biomedical Research Foundation, 1

Based on evolutionary principles

Constructing PAM Matrix: Training Data

PAM: Phylogenetic Tree

PAM: Accepted Point Mutation

Mutability

Total Mutation Rate

is the total mutation rate of all amino acids

Normalize Total Mutation Rate

Mutation Probability Matrix Normalized

Such that the Total Mutation Rate is 1%

Mutation Probability Matrix (transposed) M*10000

-- PAM1 mutation prob. matr. --PAM2 Mutation Probability Matrix?

-- Mutations that happen in twice the evolution period of that for a PAM1

)1(M)2(M

PAM Matrix: Assumptions

In two PAM1 periods: {AR} = {AA and AR} or {AN and NR} or {AD and DR} or … or {AV and VR}

period) 2ndin RPr(Dperiod)1st in DPr(A

period) 2ndin RPr(Nperiod)1st in NPr(A

period) 2ndin RPr(Aperiod)1st in A Pr(A

periods) 2in RAPr(

DRADNRANARAAAR PPPPPPP )2(

Entries in a PAM-2 Mut. Prob. Matr.

PAM-k Mutation Prob. Matrix

KK MM

MMM

}{ )1()(

)1()1()2(

PAM-1 log likelihood matrix

ji

ijij pp

PS

)1(

10log10

PAM-k log likelihood matrix

ji

kij

ijk

pp

PS

)(

10)( log10

PAM-250

PAM60—60%, PAM80—50%, PAM120—40% PAM-250 matrix provides a better

scoring alignment than lower-numbered PAM matrices for proteins of 14-27% similarity

Sources of Error in PAM

Comparing Scoring MatrixPAM

Based on extrapolation of a small evol. Period

Track evolutionary origins Homologous seq.s during

evolution

BLOSUM Based on a range of

evol. Periods Conserved blocks Find conserved

domains

Choice of Scoring Matrix

Global Alignment with Affine Gaps

Complex Dynamic Programming

Problem w/ Independent Gap Penalties The occurrence of x consecutive

deletions/insertions is more likely than the occurrence of x isolated mutations

We should penalize x long gap less than x

times of the penalty for one gap

Affine Gap Penalty

w2 is the penalty for each gap w1 is the _extra_ penalty for the

1st gap

Scoring Rule not Additive! We need to know if the current gap

is a new gap or the continuation of an existing gap

Use three Dynamic Programming matrices to keep track of the previous step

S1 is the vertical sequenceS2 is the horizontal sequence (From Diagonal) a(i,j): current position

is a match (From Left) b(i,j): current position is a

gap in S1 (From Above) c(i,j): current position is a

gap in S2Filling the next element in each matrix

depends on the previous step, which is stored in the three matrices.

Last step a match

a gap in S2

a gap in S1

new gap in S2

a continued gap in S2

a gap in S2 following a gap in S1

Decisions in Seq. Alignment Local or global alignment? Which program to use Type of scoring matrix Value of gap penalty

Aij*10

PAM-k log-likelihood matrix