Weighting training sequences


Page 1: Weighting training sequences

Weighting training sequences

Why do we want to weight training sequences?

Many different proposals:
– Based on trees
– Based on the 3D position of the sequences
– Interested only in classifying family membership
– Maximizing entropy

Page 2: Weighting training sequences

Why do we want to weight training sequences?

Parts of the training set can be closely related to each other and do not deserve the same influence in the estimation process as a sequence that is highly diverged.
– Phylogenetic trees
– Sequences AGAA, CCTC, AGTC

[Tree diagram: AGTC at the internal node, with AGAA and CCTC as the leaves]

Page 3: Weighting training sequences

Weighting schemes based on trees

– Thompson, Higgins & Gibson (1994): represents the weights as electric currents, calculated by Kirchhoff's laws
– Gerstein, Sonnhammer & Chothia (1994)
– Root weights from Gaussian parameters (Altschul-Carroll-Lipman weights for a three-leaf tree, 1989)

Page 4: Weighting training sequences

Thompson, Higgins & Gibson

[Figure: an electric network of voltages, currents and resistances. Currents $I_1$, $I_2$, $I_3$ flow from leaves 1, 2, 3 through resistances $R_1$, $R_2$, $R_3$; voltage $V_4$ sits at the internal node and $V_5$ at the root; the current $I_1 + I_2$ flows through $R_4$ between the internal node and the root.]

Page 5: Weighting training sequences

Thompson, Higgins & Gibson

For the example tree (leaf edges of length 2, 2 and 4; internal edge of length 3), Kirchhoff's laws give:

$$V_4 = 2I_1 = 2I_2$$

$$V_5 = 2I_1 + 3(I_1 + I_2) = 4I_3$$

$$I_1 : I_2 : I_3 = 1 : 1 : 2$$
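As a sanity check, these currents can be computed directly. A minimal Python sketch (mine, not part of the slides), treating the edge lengths as resistances and reducing the network with series/parallel rules:

```python
# THG currents for the example tree: edge lengths act as resistances.

R1, R2, R3, R4 = 2.0, 2.0, 4.0, 3.0   # leaf edges 1, 2, 3 and internal edge 4
V = 1.0                                # voltage applied at the root

R12 = 1.0 / (1.0 / R1 + 1.0 / R2)      # leaves 1 and 2 in parallel below node 4
I12 = V / (R4 + R12)                   # total current flowing into node 4
I1 = I12 * R2 / (R1 + R2)              # current divider between leaves 1 and 2
I2 = I12 * R1 / (R1 + R2)
I3 = V / R3                            # leaf 3 connects directly to the root

total = I1 + I2 + I3
print([I / total for I in (I1, I2, I3)])   # [0.25, 0.25, 0.5], i.e. 1 : 1 : 2
```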

Page 6: Weighting training sequences

Gerstein, Sonnhammer & Chothia

Works up the tree, incrementing the weights.
– Initially the weights are set to the edge lengths (the resistances in the previous example).
– Each internal edge of length $t_n$ above node $n$ is then distributed over the leaves below $n$ in proportion to their current weights:

$$\Delta w_i = t_n \frac{w_i}{\sum_{k\ \text{leaves below}\ n} w_k}$$

Page 7: Weighting training sequences

Gerstein, Sonnhammer & Chothia

[Figure: the example tree. Leaves 1 and 2 join node 4 via edges of length 2; node 4 joins the root via an edge of length 3; leaf 3 joins the root via an edge of length 4.]

$$w_1, w_2, w_3 = 2, 2, 4 \quad \text{(initial weights = edge lengths)}$$

$$w_1 = w_2 = 2 + 1.5 = 3.5$$

$$w_1 : w_2 : w_3 = 7 : 7 : 8$$
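A minimal sketch of the same pass in Python (the tree encoding, a list of internal edges with the leaves below them, is my own and chosen to fit this one example):

```python
# GSC weighting: work up the tree, distributing each internal edge's length
# over the leaves below it in proportion to their current weights.

from fractions import Fraction

weights = {1: Fraction(2), 2: Fraction(2), 3: Fraction(4)}  # initial weights = edge lengths

# Internal edges processed bottom-up: (length of edge above node n, leaves below n).
internal_edges = [(Fraction(3), [1, 2])]                    # only node 4 here

for t_n, leaves in internal_edges:
    total = sum(weights[k] for k in leaves)
    for i in leaves:
        weights[i] += t_n * weights[i] / total              # Delta w_i = t_n * w_i / sum_k w_k

print(weights)   # {1: Fraction(7, 2), 2: Fraction(7, 2), 3: Fraction(4)}: ratio 7 : 7 : 8
```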

Page 8: Weighting training sequences

Gerstein, Sonnhammer & Chothia

Small difference with Thompson, Higgins & Gibson?

[Figure: a two-leaf tree; leaves 1 and 2 joined at the root by edges of length 1 and 2.]

$$\text{THG:}\quad I_1 : I_2 = 2 : 1$$

$$\text{GSC:}\quad w_1 : w_2 = 1 : 2$$

The two schemes can even disagree about which sequence gets the larger weight.

Page 9: Weighting training sequences

Root weights from Gaussian parameters

– Continuous instead of discrete members of an alphabet
– Probability density instead of a substitution matrix

Example, a Gaussian:

$$P(x \mid y, t) \propto \exp\left(-\frac{(x - y)^2}{2t}\right)$$

Page 10: Weighting training sequences

Root weights from Gaussian parameters

Combining the two leaf Gaussians gives the probability density at node 4:

$$P(x \text{ at node } 4 \mid L_1, L_2) = K_1 \exp\left(-\frac{(x - x_1)^2}{2t_1}\right)\exp\left(-\frac{(x - x_2)^2}{2t_2}\right) = K_2 \exp\left(-\frac{(x - v_1 x_1 - v_2 x_2)^2}{2v}\right)$$

with

$$v_1 = \frac{t_2}{t_1 + t_2}, \qquad v_2 = \frac{t_1}{t_1 + t_2}, \qquad v = \frac{t_1 t_2}{t_1 + t_2}$$
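A quick numeric check (mine, not from the slides) that the product of the two leaf Gaussians is again Gaussian with the stated mean and variance; the values chosen for $t_1, t_2, x_1, x_2$ are arbitrary:

```python
# Verify numerically: the product density has mean v1*x1 + v2*x2 and
# variance t1*t2/(t1 + t2).

import numpy as np

t1, t2, x1, x2 = 2.0, 2.0, 0.3, 1.1
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]

product = np.exp(-(x - x1) ** 2 / (2 * t1)) * np.exp(-(x - x2) ** 2 / (2 * t2))
product /= product.sum() * dx                     # normalize to a density

mean = (x * product).sum() * dx
var = ((x - mean) ** 2 * product).sum() * dx

v1, v2 = t2 / (t1 + t2), t1 / (t1 + t2)
print(mean, v1 * x1 + v2 * x2)                    # both ~0.7
print(var, t1 * t2 / (t1 + t2))                   # both ~1.0
```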

Page 11: Weighting training sequences

Root weights from Gaussian parameters

Altschul-Carroll-Lipman weights for a tree with three leaves:

$$P(x \text{ at node } 5 \mid L_1, L_2, L_3) = K \exp\left(-\frac{(x - w_1 x_1 - w_2 x_2 - w_3 x_3)^2}{2t_{123}}\right)$$

where $t_{123}$ is the combined variance.

Page 12: Weighting training sequences

Root weights from Gaussian parameters

$$w_1 = t_2 t_3 / D, \qquad w_2 = t_1 t_3 / D, \qquad w_3 = \{t_4(t_1 + t_2) + t_1 t_2\} / D$$

with

$$D = (t_1 + t_2)(t_3 + t_4) + t_1 t_2$$

For the example tree ($t_1 = t_2 = 2$, $t_3 = 4$, $t_4 = 3$):

$$w_1 : w_2 : w_3 = 1 : 1 : 2$$

[Figure: the example tree, as before.]
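A small sketch of these closed forms (the helper name is mine), checked on the example tree with exact arithmetic:

```python
# Altschul-Carroll-Lipman root weights for a three-leaf tree.

from fractions import Fraction

def acl_three_leaf(t1, t2, t3, t4):
    """t1, t2: leaf edges below node 4; t3: leaf-3 edge; t4: edge node 4 -> root."""
    d = (t1 + t2) * (t3 + t4) + t1 * t2
    w1 = t2 * t3 / d
    w2 = t1 * t3 / d
    w3 = (t4 * (t1 + t2) + t1 * t2) / d
    return w1, w2, w3

print(acl_three_leaf(*map(Fraction, (2, 2, 4, 3))))   # (1/4, 1/4, 1/2): ratio 1 : 1 : 2
```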

Page 13: Weighting training sequences

Weighting schemes based on trees

– Thompson, Higgins & Gibson (electric current): 1 : 1 : 2
– Gerstein, Sonnhammer & Chothia: 7 : 7 : 8
– Altschul-Carroll-Lipman weights for a tree with three leaves: 1 : 1 : 2

Page 14: Weighting training sequences

Weighting scheme using 'sequence space'

Voronoi weights:

$$w_i = \frac{n_i}{\sum_k n_k}$$

where $n_i$ counts the randomly sampled sequences that lie closer to sequence $i$ than to any other training sequence.
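The Voronoi volumes can be estimated by sampling. The following Monte Carlo sketch is my own (the uniform sampling scheme and the even splitting of ties are assumptions); it counts, for each training sequence, the random sequences that land nearest to it:

```python
# Monte Carlo estimate of Voronoi weights under Hamming distance.

import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def voronoi_weights(seqs, alphabet="ACGT", n_samples=50_000, seed=1):
    rng = random.Random(seed)
    counts = [0.0] * len(seqs)
    length = len(seqs[0])
    for _ in range(n_samples):
        sample = "".join(rng.choice(alphabet) for _ in range(length))
        dists = [hamming(sample, s) for s in seqs]
        best = min(dists)
        winners = [i for i, d in enumerate(dists) if d == best]
        for i in winners:
            counts[i] += 1.0 / len(winners)    # split ties evenly
    total = sum(counts)
    return [n_i / total for n_i in counts]     # w_i = n_i / sum_k n_k

print(voronoi_weights(["AGAA", "CCTC", "AGTC"]))
```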

Page 15: Weighting training sequences

More weighting schemes

– Maximum discrimination weights
– Maximum entropy weights
  – Based on averaging
  – Based on maximum 'uniformity' (entropy)

Page 16: Weighting training sequences

Maximum discrimination weights

– Does not try to maximize the likelihood or the posterior probability
– It decides whether a sequence is a member of a family

Page 17: Weighting training sequences

Maximum discrimination weights

The discrimination $D$:

$$D = \prod_k P(M \mid x^k)$$

$$P(M \mid x) = \frac{P(x \mid M)\,P(M)}{P(x \mid M)\,P(M) + P(x \mid R)\,P(R)}$$

with $M$ the family model and $R$ the random (background) model. Maximizing $D$ puts the emphasis on distant or difficult members.

Page 18: Weighting training sequences

Maximum discrimination weights

Differences with the previous schemes:
– It is an iterative method: the initial weights give rise to a model; the newly calculated posterior probabilities $P(M \mid x)$ give rise to new weights and hence to a new model, until convergence is reached (see the sketch below).
– It optimizes performance for exactly what the model is designed for: classifying whether a sequence is a member of a family.
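A schematic of this iteration, not the published algorithm: `fit_model` and `posterior` are hypothetical stand-ins for whatever family model is being trained (e.g. a profile HMM), and reweighting each sequence by $1 - P(M \mid x^k)$ is an assumption on my part, one natural way to put the emphasis on poorly classified members:

```python
# Iterative maximum discrimination weighting (schematic sketch).

def max_discrimination_weights(seqs, fit_model, posterior, n_iter=50, tol=1e-6):
    weights = [1.0 / len(seqs)] * len(seqs)          # start from uniform weights
    for _ in range(n_iter):
        model = fit_model(seqs, weights)             # current weights -> new model
        post = [posterior(model, x) for x in seqs]   # P(M | x) for every sequence
        new = [1.0 - p for p in post]                # difficult members get more weight
        total = sum(new)
        if total == 0.0:                             # every member perfectly classified
            break
        new = [v / total for v in new]
        if max(abs(a - b) for a, b in zip(new, weights)) < tol:
            break                                    # weights stable: converged
        weights = new
    return weights
```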

Page 19: Weighting training sequences

More weighting schemes

– Maximum discrimination weights
– Maximum entropy weights
  – Based on averaging
  – Based on maximum 'uniformity' (entropy)

Page 20: Weighting training sequences

Maximum entropy weights

Entropy: a measure of the average uncertainty of an outcome (maximal when we are maximally uncertain about the outcome).

Averaging:

$$w_k = \sum_i \frac{1}{m_i \, k_{i x_i^k}}$$

where $w_k$ is the weight of sequence $k$, $m_i$ the total number of different residue types in column $i$, and $k_{ia}$ the number of residues of type $a$ in column $i$.

Page 21: Weighting training sequences

Maximum entropy weights

Sequences: AGAA, CCTC, AGTC

Column 1: $m_1 = 2$ (A and C), $k_{1A} = 2$, $k_{1C} = 1$.

$$w_1 = \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{2} + \tfrac{1}{2} = \tfrac{3}{2}$$

$$w_2 = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{3}{2}$$

$$w_3 = \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4} = 1$$

After normalization: $w_1 : w_2 : w_3 = \tfrac{3}{8} : \tfrac{3}{8} : \tfrac{2}{8}$.
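A short sketch (mine) that reproduces this worked example with exact arithmetic:

```python
# Averaging-based maximum entropy weights: each column i contributes
# 1 / (m_i * k_{i,a}) to every sequence showing residue a there.

from collections import Counter
from fractions import Fraction

def averaging_weights(seqs):
    length = len(seqs[0])
    w = [Fraction(0)] * len(seqs)
    for i in range(length):
        column = [s[i] for s in seqs]
        counts = Counter(column)              # k_{ia}: occurrences of residue a in column i
        m = len(counts)                       # m_i: number of distinct residue types
        for j, a in enumerate(column):
            w[j] += Fraction(1, m * counts[a])
    total = sum(w)
    return [x / total for x in w]

print(averaging_weights(["AGAA", "CCTC", "AGTC"]))   # [3/8, 3/8, 1/4]
```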

Page 22: Weighting training sequences

Maximum entropy weights

Shannon:

$$H(X) = -\sum_i P(x_i)\log P(x_i)$$

'Uniformity':

$$H(\vec{w}) = \sum_i H_i(\vec{w}), \qquad H_i(\vec{w}) = -\sum_a p_{ia}\log(p_{ia})$$

where $p_{ia}$ is the total weight of the sequences carrying residue $a$ in column $i$.

Page 23: Weighting training sequences

Maximum entropy weights

Sequences: AGAA, CCTC, AGTC

$$H_1(\vec{w}) = -(w_1 + w_3)\log(w_1 + w_3) - w_2 \log w_2$$

$$H_2(\vec{w}) = -(w_1 + w_3)\log(w_1 + w_3) - w_2 \log w_2$$

$$H_3(\vec{w}) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

$$H_4(\vec{w}) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

Page 24: Weighting training sequences

Maximum entropy weights

Maximizing $H(\vec{w})$ subject to $\sum_k w_k = 1$: equating the partial derivatives gives

$$w_1 = w_2 + w_3, \qquad w_2 = w_1 + w_3$$

and solving these equations leads to

$$w_1 = \tfrac{1}{2}, \qquad w_2 = \tfrac{1}{2}, \qquad w_3 = 0$$
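A brute-force confirmation of this result (mine, not from the slides): scan the weight simplex on a grid and check that the column-entropy sum for AGAA / CCTC / AGTC peaks at $(\tfrac{1}{2}, \tfrac{1}{2}, 0)$:

```python
# Grid search over the weight simplex for the maximum of H(w).

import math

def H(w1, w2, w3):
    total = 0.0
    # Columns 1 and 2 group the sequences as {1,3} vs {2};
    # columns 3 and 4 group them as {1} vs {2,3}.
    for probs in ([w1 + w3, w2], [w1 + w3, w2], [w1, w2 + w3], [w1, w2 + w3]):
        total -= sum(p * math.log(p) for p in probs if p > 0)
    return total

best = max(
    ((a / 100, b / 100, (100 - a - b) / 100)
     for a in range(101) for b in range(101 - a)),
    key=lambda w: H(*w),
)
print(best)   # (0.5, 0.5, 0.0)
```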

Page 25: Weighting training sequences

Summary of the entropy methods

Maximum entropy weights (averaging):

$$w_1 = \tfrac{3}{8}, \qquad w_2 = \tfrac{3}{8}, \qquad w_3 = \tfrac{2}{8}$$

Maximum entropy weights ('uniformity'):

$$w_1 = \tfrac{1}{2}, \qquad w_2 = \tfrac{1}{2}, \qquad w_3 = 0$$

Page 26: Weighting training sequences

Conclusion

– Many different methods
– Which one to use depends on the problem

Questions?