Weighting training sequences


Page 1: Weighting training sequences

Weighting training sequences

Why do we want to weight training sequences?

Many different proposals:
– Based on trees
– Based on the 3D position of the sequences
– Interested only in classifying family membership
– Maximizing entropy

Page 2: Weighting training sequences

Why do we want to weight training sequences?

Parts of the training set can be closely related to each other and do not deserve the same influence in the estimation process as a sequence that is highly diverged.
– Phylogenetic trees
– Sequences AGAA, CCTC, AGTC

[Tree diagram: AGTC at the internal node, with AGAA and CCTC as the leaves]

Page 3: Weighting training sequences

Weighting schemes based on trees

– Thompson, Higgins & Gibson (1994): represents the weights as electric currents, calculated by Kirchhoff's laws
– Gerstein, Sonnhammer & Chothia (1994)
– Root weights from Gaussian parameters (Altschul-Carroll-Lipman weights for a three-leaf tree, 1989)

Page 4: Weighting training sequences

Thompson, Higgins & Gibson

[Figure: an electric network of voltages, currents and resistances. Currents $I_1$, $I_2$, $I_3$ flow from leaves 1, 2, 3 through resistances $R_1$, $R_2$, $R_3$; voltage $V_4$ sits at the internal node and $V_5$ at the root; the current $I_1 + I_2$ flows through $R_4$ between the internal node and the root.]

Page 5: Weighting training sequences

Thompson, Higgins & Gibson

For the example tree (leaf edges of length 2, 2 and 4; internal edge of length 3), Kirchhoff's laws give:

$$V_4 = 2I_1 = 2I_2$$

$$V_5 = 2I_1 + 3(I_1 + I_2) = 4I_3$$

$$I_1 : I_2 : I_3 = 1 : 1 : 2$$
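As a sanity check, these currents can be computed directly. A minimal Python sketch (mine, not part of the slides), treating the edge lengths as resistances and reducing the network with series/parallel rules:

```python
# THG currents for the example tree: edge lengths act as resistances.

R1, R2, R3, R4 = 2.0, 2.0, 4.0, 3.0   # leaf edges 1, 2, 3 and internal edge 4
V = 1.0                                # voltage applied at the root

R12 = 1.0 / (1.0 / R1 + 1.0 / R2)      # leaves 1 and 2 in parallel below node 4
I12 = V / (R4 + R12)                   # total current flowing into node 4
I1 = I12 * R2 / (R1 + R2)              # current divider between leaves 1 and 2
I2 = I12 * R1 / (R1 + R2)
I3 = V / R3                            # leaf 3 connects directly to the root

total = I1 + I2 + I3
print([I / total for I in (I1, I2, I3)])   # [0.25, 0.25, 0.5], i.e. 1 : 1 : 2
```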

Page 6: Weighting training sequences

Gerstein, Sonnhammer & Chothia

Works up the tree, incrementing the weights.
– Initially the weights are set to the edge lengths (the resistances in the previous example).
– Each internal edge of length $t_n$ above node $n$ is then distributed over the leaves below $n$ in proportion to their current weights:

$$\Delta w_i = t_n \frac{w_i}{\sum_{k\ \text{leaves below}\ n} w_k}$$

Page 7: Weighting training sequences

Gerstein, Sonnhammer & Chothia

[Figure: the example tree. Leaves 1 and 2 join node 4 via edges of length 2; node 4 joins the root via an edge of length 3; leaf 3 joins the root via an edge of length 4.]

$$w_1, w_2, w_3 = 2, 2, 4 \quad \text{(initial weights = edge lengths)}$$

$$w_1 = w_2 = 2 + 1.5 = 3.5$$

$$w_1 : w_2 : w_3 = 7 : 7 : 8$$
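A minimal sketch of the same pass in Python (the tree encoding, a list of internal edges with the leaves below them, is my own and chosen to fit this one example):

```python
# GSC weighting: work up the tree, distributing each internal edge's length
# over the leaves below it in proportion to their current weights.

from fractions import Fraction

weights = {1: Fraction(2), 2: Fraction(2), 3: Fraction(4)}  # initial weights = edge lengths

# Internal edges processed bottom-up: (length of edge above node n, leaves below n).
internal_edges = [(Fraction(3), [1, 2])]                    # only node 4 here

for t_n, leaves in internal_edges:
    total = sum(weights[k] for k in leaves)
    for i in leaves:
        weights[i] += t_n * weights[i] / total              # Delta w_i = t_n * w_i / sum_k w_k

print(weights)   # {1: Fraction(7, 2), 2: Fraction(7, 2), 3: Fraction(4)}: ratio 7 : 7 : 8
```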

Page 8: Weighting training sequences

Gerstein, Sonnhammer & Chothia

Small difference with Thompson, Higgins & Gibson?

[Figure: a two-leaf tree; leaves 1 and 2 joined at the root by edges of length 1 and 2.]

$$\text{THG:}\quad I_1 : I_2 = 2 : 1$$

$$\text{GSC:}\quad w_1 : w_2 = 1 : 2$$

The two schemes can even disagree about which sequence gets the larger weight.

Page 9: Weighting training sequences

Root weights from Gaussian parameters

– Continuous instead of discrete members of an alphabet
– Probability density instead of a substitution matrix

Example, a Gaussian:

$$P(x \mid y, t) \propto \exp\left(-\frac{(x - y)^2}{2t}\right)$$

Page 10: Weighting training sequences

Root weights from Gaussian parameters

Combining the two leaf Gaussians gives the probability density at node 4:

$$P(x \text{ at node } 4 \mid L_1, L_2) = K_1 \exp\left(-\frac{(x - x_1)^2}{2t_1}\right)\exp\left(-\frac{(x - x_2)^2}{2t_2}\right) = K_2 \exp\left(-\frac{(x - v_1 x_1 - v_2 x_2)^2}{2v}\right)$$

with

$$v_1 = \frac{t_2}{t_1 + t_2}, \qquad v_2 = \frac{t_1}{t_1 + t_2}, \qquad v = \frac{t_1 t_2}{t_1 + t_2}$$
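A quick numeric check (mine, not from the slides) that the product of the two leaf Gaussians is again Gaussian with the stated mean and variance; the values chosen for $t_1, t_2, x_1, x_2$ are arbitrary:

```python
# Verify numerically: the product density has mean v1*x1 + v2*x2 and
# variance t1*t2/(t1 + t2).

import numpy as np

t1, t2, x1, x2 = 2.0, 2.0, 0.3, 1.1
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]

product = np.exp(-(x - x1) ** 2 / (2 * t1)) * np.exp(-(x - x2) ** 2 / (2 * t2))
product /= product.sum() * dx                     # normalize to a density

mean = (x * product).sum() * dx
var = ((x - mean) ** 2 * product).sum() * dx

v1, v2 = t2 / (t1 + t2), t1 / (t1 + t2)
print(mean, v1 * x1 + v2 * x2)                    # both ~0.7
print(var, t1 * t2 / (t1 + t2))                   # both ~1.0
```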

Page 11: Weighting training sequences

Root weights from Gaussian parameters

Altschul-Carroll-Lipman weights for a tree with three leaves:

$$P(x \text{ at node } 5 \mid L_1, L_2, L_3) = K \exp\left(-\frac{(x - w_1 x_1 - w_2 x_2 - w_3 x_3)^2}{2t_{123}}\right)$$

where $t_{123}$ is the combined variance.

Page 12: Weighting training sequences

Root weights from Gaussian parameters

$$w_1 = t_2 t_3 / D, \qquad w_2 = t_1 t_3 / D, \qquad w_3 = \{t_4(t_1 + t_2) + t_1 t_2\} / D$$

with

$$D = (t_1 + t_2)(t_3 + t_4) + t_1 t_2$$

For the example tree ($t_1 = t_2 = 2$, $t_3 = 4$, $t_4 = 3$):

$$w_1 : w_2 : w_3 = 1 : 1 : 2$$

[Figure: the example tree, as before.]
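A small sketch of these closed forms (the helper name is mine), checked on the example tree with exact arithmetic:

```python
# Altschul-Carroll-Lipman root weights for a three-leaf tree.

from fractions import Fraction

def acl_three_leaf(t1, t2, t3, t4):
    """t1, t2: leaf edges below node 4; t3: leaf-3 edge; t4: edge node 4 -> root."""
    d = (t1 + t2) * (t3 + t4) + t1 * t2
    w1 = t2 * t3 / d
    w2 = t1 * t3 / d
    w3 = (t4 * (t1 + t2) + t1 * t2) / d
    return w1, w2, w3

print(acl_three_leaf(*map(Fraction, (2, 2, 4, 3))))   # (1/4, 1/4, 1/2): ratio 1 : 1 : 2
```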

Page 13: Weighting training sequences

Weighting schemes based on trees

– Thompson, Higgins & Gibson (electric current): 1 : 1 : 2
– Gerstein, Sonnhammer & Chothia: 7 : 7 : 8
– Altschul-Carroll-Lipman weights for a tree with three leaves: 1 : 1 : 2

Page 14: Weighting training sequences

Weighting scheme using 'sequence space'

Voronoi weights:

$$w_i = \frac{n_i}{\sum_k n_k}$$

where $n_i$ counts the randomly sampled sequences that lie closer to sequence $i$ than to any other training sequence.
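The Voronoi volumes can be estimated by sampling. The following Monte Carlo sketch is my own (the uniform sampling scheme and the even splitting of ties are assumptions); it counts, for each training sequence, the random sequences that land nearest to it:

```python
# Monte Carlo estimate of Voronoi weights under Hamming distance.

import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def voronoi_weights(seqs, alphabet="ACGT", n_samples=50_000, seed=1):
    rng = random.Random(seed)
    counts = [0.0] * len(seqs)
    length = len(seqs[0])
    for _ in range(n_samples):
        sample = "".join(rng.choice(alphabet) for _ in range(length))
        dists = [hamming(sample, s) for s in seqs]
        best = min(dists)
        winners = [i for i, d in enumerate(dists) if d == best]
        for i in winners:
            counts[i] += 1.0 / len(winners)    # split ties evenly
    total = sum(counts)
    return [n_i / total for n_i in counts]     # w_i = n_i / sum_k n_k

print(voronoi_weights(["AGAA", "CCTC", "AGTC"]))
```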

Page 15: Weighting training sequences

More weighting schemes

– Maximum discrimination weights
– Maximum entropy weights
  – Based on averaging
  – Based on maximum 'uniformity' (entropy)

Page 16: Weighting training sequences

Maximum discrimination weights

– Does not try to maximize the likelihood or the posterior probability
– It decides whether a sequence is a member of a family

Page 17: Weighting training sequences

Maximum discrimination weights

The discrimination $D$:

$$D = \prod_k P(M \mid x^k)$$

$$P(M \mid x) = \frac{P(x \mid M)\,P(M)}{P(x \mid M)\,P(M) + P(x \mid R)\,P(R)}$$

with $M$ the family model and $R$ the random (background) model. Maximizing $D$ puts the emphasis on distant or difficult members.

Page 18: Weighting training sequences

Maximum discrimination weights

Differences with the previous schemes:
– It is an iterative method: the initial weights give rise to a model; the newly calculated posterior probabilities $P(M \mid x)$ give rise to new weights and hence to a new model, until convergence is reached (see the sketch below).
– It optimizes performance for exactly what the model is designed for: classifying whether a sequence is a member of a family.
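A schematic of this iteration, not the published algorithm: `fit_model` and `posterior` are hypothetical stand-ins for whatever family model is being trained (e.g. a profile HMM), and reweighting each sequence by $1 - P(M \mid x^k)$ is an assumption on my part, one natural way to put the emphasis on poorly classified members:

```python
# Iterative maximum discrimination weighting (schematic sketch).

def max_discrimination_weights(seqs, fit_model, posterior, n_iter=50, tol=1e-6):
    weights = [1.0 / len(seqs)] * len(seqs)          # start from uniform weights
    for _ in range(n_iter):
        model = fit_model(seqs, weights)             # current weights -> new model
        post = [posterior(model, x) for x in seqs]   # P(M | x) for every sequence
        new = [1.0 - p for p in post]                # difficult members get more weight
        total = sum(new)
        if total == 0.0:                             # every member perfectly classified
            break
        new = [v / total for v in new]
        if max(abs(a - b) for a, b in zip(new, weights)) < tol:
            break                                    # weights stable: converged
        weights = new
    return weights
```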

Page 19: Weighting training sequences

More weighting schemes

– Maximum discrimination weights
– Maximum entropy weights
  – Based on averaging
  – Based on maximum 'uniformity' (entropy)

Page 20: Weighting training sequences

Maximum entropy weights

Entropy: a measure of the average uncertainty of an outcome (maximal when we are maximally uncertain about the outcome).

Averaging:

$$w_k = \sum_i \frac{1}{m_i \, k_{i x_i^k}}$$

where $w_k$ is the weight of sequence $k$, $m_i$ the total number of different residue types in column $i$, and $k_{ia}$ the number of residues of type $a$ in column $i$.

Page 21: Weighting training sequences

Maximum entropy weights

Sequences: AGAA, CCTC, AGTC

Column 1: $m_1 = 2$ (A and C), $k_{1A} = 2$, $k_{1C} = 1$.

$$w_1 = \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{2} + \tfrac{1}{2} = \tfrac{3}{2}$$

$$w_2 = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{3}{2}$$

$$w_3 = \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4} = 1$$

After normalization: $w_1 : w_2 : w_3 = \tfrac{3}{8} : \tfrac{3}{8} : \tfrac{2}{8}$.
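A short sketch (mine) that reproduces this worked example with exact arithmetic:

```python
# Averaging-based maximum entropy weights: each column i contributes
# 1 / (m_i * k_{i,a}) to every sequence showing residue a there.

from collections import Counter
from fractions import Fraction

def averaging_weights(seqs):
    length = len(seqs[0])
    w = [Fraction(0)] * len(seqs)
    for i in range(length):
        column = [s[i] for s in seqs]
        counts = Counter(column)              # k_{ia}: occurrences of residue a in column i
        m = len(counts)                       # m_i: number of distinct residue types
        for j, a in enumerate(column):
            w[j] += Fraction(1, m * counts[a])
    total = sum(w)
    return [x / total for x in w]

print(averaging_weights(["AGAA", "CCTC", "AGTC"]))   # [3/8, 3/8, 1/4]
```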

Page 22: Weighting training sequences

Maximum entropy weights

Shannon:

$$H(X) = -\sum_i P(x_i)\log P(x_i)$$

'Uniformity':

$$H(\vec{w}) = \sum_i H_i(\vec{w}), \qquad H_i(\vec{w}) = -\sum_a p_{ia}\log(p_{ia})$$

where $p_{ia}$ is the total weight of the sequences carrying residue $a$ in column $i$.

Page 23: Weighting training sequences

Maximum entropy weights

Sequences: AGAA, CCTC, AGTC

$$H_1(\vec{w}) = -(w_1 + w_3)\log(w_1 + w_3) - w_2 \log w_2$$

$$H_2(\vec{w}) = -(w_1 + w_3)\log(w_1 + w_3) - w_2 \log w_2$$

$$H_3(\vec{w}) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

$$H_4(\vec{w}) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

Page 24: Weighting training sequences

Maximum entropy weights

Maximizing $H(\vec{w})$ subject to $\sum_k w_k = 1$: equating the partial derivatives gives

$$w_1 = w_2 + w_3, \qquad w_2 = w_1 + w_3$$

and solving these equations leads to

$$w_1 = \tfrac{1}{2}, \qquad w_2 = \tfrac{1}{2}, \qquad w_3 = 0$$
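A brute-force confirmation of this result (mine, not from the slides): scan the weight simplex on a grid and check that the column-entropy sum for AGAA / CCTC / AGTC peaks at $(\tfrac{1}{2}, \tfrac{1}{2}, 0)$:

```python
# Grid search over the weight simplex for the maximum of H(w).

import math

def H(w1, w2, w3):
    total = 0.0
    # Columns 1 and 2 group the sequences as {1,3} vs {2};
    # columns 3 and 4 group them as {1} vs {2,3}.
    for probs in ([w1 + w3, w2], [w1 + w3, w2], [w1, w2 + w3], [w1, w2 + w3]):
        total -= sum(p * math.log(p) for p in probs if p > 0)
    return total

best = max(
    ((a / 100, b / 100, (100 - a - b) / 100)
     for a in range(101) for b in range(101 - a)),
    key=lambda w: H(*w),
)
print(best)   # (0.5, 0.5, 0.0)
```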

Page 25: Weighting training sequences

Summary of the entropy methods

Maximum entropy weights (averaging):

$$w_1 = \tfrac{3}{8}, \qquad w_2 = \tfrac{3}{8}, \qquad w_3 = \tfrac{2}{8}$$

Maximum entropy weights ('uniformity'):

$$w_1 = \tfrac{1}{2}, \qquad w_2 = \tfrac{1}{2}, \qquad w_3 = 0$$

Page 26: Weighting training sequences

Conclusion

– Many different methods
– Which one to use depends on the problem

Questions?