Magic Moments: Moment-based Approaches to Structured Output Prediction


Page 1: Magic Moments: Moment-based Approaches to Structured Output Prediction

Magic Moments: Moment-based Approaches to Structured Output Prediction

Elisa Ricci, joint work with Nobuhisa Ueda, Tijl De Bie, Nello Cristianini

Thursday, October 25th

The Analysis of Patterns

Page 2: Magic Moments: Moment-based Approaches to Structured Output Prediction

Outline

Learning in structured output spaces

New algorithms based on Z-score

Experimental results and computational issues

Conclusions


Page 3: Magic Moments: Moment-based Approaches to Structured Output Prediction

Structured data everywhere!!! Many problems involve highly structured data, which can be represented by sequences, trees and graphs.

Temporal, spatial and structural dependencies between objects are modeled.

This phenomenon is observed in several fields such as computational biology, computer vision, natural language processing or web data analysis.


Page 4: Magic Moments: Moment-based Approaches to Structured Output Prediction

Learning with structured data


Machine learning and data mining algorithms must be able to analyze vast amounts of complex, structured data efficiently and automatically.

The goal of structured learning algorithms is to predict complex structures, such as sequences, trees, or graphs.

Using traditional algorithms to cope with problems involving structured data often implies a loss of information about the structure.

Page 5: Magic Moments: Moment-based Approaches to Structured Output Prediction

Supervised learning: data are available in the form of examples and their associated correct answers.


Training set: $T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)\}$, with $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$.

Learning: find $h \in H$ (the hypothesis space), $h: \mathcal{X} \rightarrow \mathcal{Y}$, such that $h(\mathbf{x}_i) = y_i$ for $i = 1, \ldots, \ell$.

Prediction: $y = h(\mathbf{x})$ on a new test sample $\mathbf{x}$.

Page 6: Magic Moments: Moment-based Approaches to Structured Output Prediction

Classification

A typical supervised learning task is classification.


Named entity recognition (NER): locate named entities in text. Entities of interest are person names, location names, organization names, miscellaneous (dates, times...)

Label: entity tag.

Observed variable: word in a sentence.

PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.

O N N N M m m N N N O L

Multiclass classification: label $y$, observation $\mathbf{x}$.

Page 7: Magic Moments: Moment-based Approaches to Structured Output Prediction

Sequence labeling


Sequence labeling: given an input sequence x, reconstruct the associated label sequence y of equal length.

Label sequence: entity tags.

Observed sequence: words in a sentence.

Can we consider the interactions between adjacent words?

Goal: realize a joint labeling for all the words in the sentence.

PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.

O N N N M m m N N N O L

$\mathbf{y} = (y_1, \ldots, y_n)$, $\mathbf{x} = (x_1, \ldots, x_n)$

Page 8: Magic Moments: Moment-based Approaches to Structured Output Prediction

Sequence alignment


Biological sequence alignment is used to determine the similarity between biological sequences.

ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGTG-ATCCA

Given two sequences $S_1, S_2 \in \Sigma^*$ over the alphabet $\Sigma = \{A, T, G, C\}$, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence.

Example: $S_1$ = ATGCTTTC, $S_2$ = CTGTCGCC, aligned with gaps inserted in both sequences.

Page 9: Magic Moments: Moment-based Approaches to Structured Output Prediction

Sequence alignment: given a pair of sequences $\mathbf{x}$, predict the correct sequence $\mathbf{y}$ of alignment operations (e.g. matches, mismatches, gaps).

Alignments can be represented as paths from the upper-left to the lower-right corner in the alignment graph.

Sequence alignment


[Figure: alignment graph for $S_1$ = ATGCTTTC and $S_2$ = CTGTCGCC; the input $\mathbf{x}$ is the sequence pair and the output $\mathbf{y}$ is a path through the graph.]

Page 10: Magic Moments: Moment-based Approaches to Structured Output Prediction

RNA secondary structure prediction


RNA secondary structure prediction: given an RNA sequence, predict the most likely secondary structure.

The study of RNA structure is important in understanding its functions.

AUGAGUAUAAGUUAAUGGUUAAAGUAAAUGUCUUCCACACAUUCCAUCUGAUUUCGAUUCUCACUACUCAU


Page 11: Magic Moments: Moment-based Approaches to Structured Output Prediction

Sequence parsing


Sequence parsing: given an input sequence x, determine the associated parse tree y given an underlying context-free grammar.

Example:

[Figure: parse tree $\mathbf{y}$ for the input sequence $\mathbf{x}$ = GAUCGAUCGAUC.]

Context-free grammar $G = \{V, \Sigma, R, S\}$:

$V = \{S\}$: set of non-terminal symbols

$\Sigma = \{G, A, U, C\}$: set of terminal symbols

$R = \{S \rightarrow SS \mid GSC \mid CSG \mid ASU \mid USA \mid \varepsilon\}$: set of production rules, with start symbol $S$.

Page 12: Magic Moments: Moment-based Approaches to Structured Output Prediction

Traditionally HMMs have been used for sequence labeling.

Two main drawbacks:

The conditional independence assumptions are often too restrictive: HMMs cannot represent multiple interacting features or long-range dependencies between the observations.

They are typically trained by maximum likelihood (ML) estimation.

Label sequence $\mathbf{y} = (y_1, \ldots, y_n)$; observed sequence $\mathbf{x} = (x_1, \ldots, x_n)$.

[Figure: HMM graphical model with hidden states $y_1, y_2, y_3$ emitting observations $x_1, x_2, x_3$.]

Generative models


Sequence labeling:

Page 13: Magic Moments: Moment-based Approaches to Structured Output Prediction

Discriminative models

Specify the probability of a possible output $\mathbf{y}$ given an observation $\mathbf{x}$ (consider the conditional probability $P(\mathbf{y}|\mathbf{x})$ rather than the joint probability $P(\mathbf{y}, \mathbf{x})$).

Do not require the strict independence assumptions of generative models.

Arbitrary features of the observations are considered.

Conditional Random Fields (CRFs) [Lafferty et al., 01]


[Figure: CRF graphical model with label nodes $y_1, y_2, y_3$ conditioned on the observations $x_1, x_2, x_3$.]

Page 14: Magic Moments: Moment-based Approaches to Structured Output Prediction

Learning in structured output spaces


Several discriminative algorithms have emerged recently in order to predict complex structures, such as sequences, trees, or graphs.

New discriminative approaches.

Problems analyzed:

Given a training set of correct pairs of sentences and their associated entity tags, learn to extract entities from a new sentence.

Given a training set of correct biological alignments, learn to align two unknown sequences.

Given a training set of correct RNA secondary structures associated with a set of sequences, learn to determine the secondary structure of a new sequence.

This is not an exhaustive list of possible applications.

Page 15: Magic Moments: Moment-based Approaches to Structured Output Prediction

Learning in structured output spaces: multilabel supervised classification (output: $\mathbf{y} = (y_1, \ldots, y_n)$).


Training set: $T = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_\ell, \mathbf{y}_\ell)\}$, with $\mathbf{x}_i \in \mathcal{X}$ and $\mathbf{y}_i \in \mathcal{Y}$.

Learning: find $h \in H$ (the hypothesis space), $h: \mathcal{X} \rightarrow \mathcal{Y}$, such that $h(\mathbf{x}_i) = \mathbf{y}_i$ for $i = 1, \ldots, \ell$.

Prediction: $\mathbf{y} = h(\mathbf{x})$ on a new test sample $\mathbf{x}$, with

$h(\mathbf{x}) = \arg\max_{\mathbf{y}} s(\mathbf{x}, \mathbf{y}, \mathbf{w}), \qquad s(\mathbf{x}, \mathbf{y}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}), \quad \mathbf{w} \in \mathbb{R}^d.$

Page 16: Magic Moments: Moment-based Approaches to Structured Output Prediction

Three main phases:

Encoding: define a suitable feature map $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$.

Compression: characterize the output space in a synthetic and compact way.

Optimization: define a suitable objective function and use it for learning.

Learning in structured output spaces


Page 17: Magic Moments: Moment-based Approaches to Structured Output Prediction

Encoding: define a suitable feature map $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$.

Compression: characterize the output space in a synthetic and compact way.

Optimization: define a suitable objective function and use it for learning.

Learning in structured output spaces


Page 18: Magic Moments: Moment-based Approaches to Structured Output Prediction

Encoding


Features must be defined in a way such that prediction can be computed efficiently.

The feature vector $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ decomposes as a sum of elementary features over "parts".

Parts are typically edges or nodes in graphs.

$h(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$

$\mathcal{Y}$ is typically huge.

[Figure: alignment graph for $S_1$ = ATGCTTTC and $S_2$ = CTGTCGCC.]

Page 19: Magic Moments: Moment-based Approaches to Structured Output Prediction

Encoding


Transition features: $\phi^t_{pz}(\mathbf{y}) = \sum_k I(y_k = p)\, I(y_{k+1} = z)$.

Emission features: $\phi^e_{pq}(\mathbf{x}, \mathbf{y}) = \sum_k I(x_k = q)\, I(y_k = p)$.

[Figure: chain model with labels $y_1, y_2, y_3$ and observations $x_1, x_2, x_3$.]

In general, features reflect long-range interactions (when labeling $x_i$, past and future observations are taken into account).

Arbitrary features of the observations are considered (e.g. spelling properties in NER).

Sequence labeling:

Example: CRF with HMM features
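As an illustration (our own sketch, not from the slides; the function name and the toy data are made up), a minimal Python version of such HMM-style counting features, where each entry of the feature vector is the number of times a label pair or a label-symbol pair occurs:

from collections import Counter

def hmm_features(x, y, labels, alphabet):
    # Transition features phi^t_{pz}: how often label p is followed by z.
    trans = Counter(zip(y[:-1], y[1:]))
    # Emission features phi^e_{pq}: how often symbol q is observed at label p.
    emit = Counter(zip(y, x))
    phi = [trans[(p, z)] for p in labels for z in labels]
    phi += [emit[(p, q)] for p in labels for q in alphabet]
    return phi

# Toy example with two labels (O = other, N = name).
x = ("PP", "ESTUDIA", "YA")
y = ("O", "N", "N")
print(hmm_features(x, y, labels=("O", "N"), alphabet=("PP", "ESTUDIA", "YA")))

The score of a labeling is then the inner product of this vector with $\mathbf{w}$, so the argmax over label sequences can be computed by Viterbi, as discussed later.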

Page 20: Magic Moments: Moment-based Approaches to Structured Output Prediction

3-parameter model: the feature vector counts matches, mismatches and gaps (see the example below).

In practice more complex models are used:

4-parameter model: affine gap penalties, i.e. different costs if the gap starts in a given position (gap opening penalty) or if it continues (gap extension penalty).

211/212-parameter model: $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ contains the statistics associated with the gap penalties and all possible pairs of amino acids.

Encoding


Example: for the alignment of ATGCTTTC and CTGTCGCC shown earlier, $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) = (\#\text{matches}, \#\text{mismatches}, \#\text{gaps}) = (4, 1, 4)$.

Sequence alignment:
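A minimal sketch of the 3-parameter feature map (our own illustration; we assume '-' denotes a gap in either row of the alignment, and the weights below are illustrative only):

def alignment_features(row1, row2):
    # phi(x, y) = (#matches, #mismatches, #gaps) for a gapped alignment.
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return (matches, mismatches, gaps)

# Alignment from the earlier slide; score is the weighted feature count.
w = (1.0, -1.0, -2.0)
phi = alignment_features("ACTGATTACGTGAACTGGATCCA", "ACTC--TAGGTGAAGTG-ATCCA")
print(phi, sum(wi * fi for wi, fi in zip(w, phi)))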

Page 21: Magic Moments: Moment-based Approaches to Structured Output Prediction

Encoding


$\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ counts the occurrences of each rule (S → SS, S → GSC, S → CSG, S → ASU, S → USA, S → ε); for the example tree the counts are the values 2, 2, 2, 1, 1, 1 shown on the slide.

Sequence parsing:

[Figure: the parse tree $\mathbf{y}$ for $\mathbf{x}$ = GAUCGAUCGAUC, as on the earlier slide.]

The feature vector contains the statistics associated with the occurrences of the rules.
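A minimal sketch of such rule-counting features (our own illustration; we represent a tree as either a terminal string or a pair (nonterminal, children), and write 'eps' for the empty production):

from collections import Counter

def rule_counts(tree):
    counts = Counter()
    def walk(node):
        if isinstance(node, str):      # terminal symbol: no rule applied
            return node
        lhs, children = node
        rhs = "".join(walk(c) for c in children) or "eps"
        counts[lhs + " -> " + rhs] += 1  # one occurrence of this rule
        return lhs
    walk(tree)
    return counts

# S -> GSC applied around S -> ASU and the empty rule.
tree = ("S", ["G", ("S", ["A", ("S", []), "U"]), "C"])
print(rule_counts(tree))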

Page 22: Magic Moments: Moment-based Approaches to Structured Output Prediction

Encoding: having defined these features, predictions can be computed efficiently with dynamic programming (DP):

Sequence labeling → Viterbi algorithm
Sequence alignment → Needleman-Wunsch algorithm
Sequence parsing → Cocke-Younger-Kasami (CYK) algorithm


[Figure: DP table for aligning $S_1$ = ATGCTTTC and $S_2$ = CTGTCGCC.]
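For concreteness, a sketch of the prediction step for the 3-parameter alignment model (our own illustration with made-up weights): Needleman-Wunsch DP computing $\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ over all global alignments.

def needleman_wunsch(s1, s2, w_match, w_mismatch, w_gap):
    n, m = len(s1), len(s2)
    # dp[i][j] = best score of aligning s1[:i] with s2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * w_gap
    for j in range(1, m + 1):
        dp[0][j] = j * w_gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = w_match if s1[i - 1] == s2[j - 1] else w_mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match/mismatch
                           dp[i - 1][j] + w_gap,    # gap in s2
                           dp[i][j - 1] + w_gap)    # gap in s1
    return dp[n][m]

print(needleman_wunsch("ATGCTTTC", "CTGTCGCC", 1.0, -1.0, -2.0))

A traceback through the DP table recovers the alignment $\mathbf{y}$ itself.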

Page 23: Magic Moments: Moment-based Approaches to Structured Output Prediction

Encoding: define a suitable feature map $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$.

Compression: characterize the output space in a synthetic and compact way.

Optimization: define a suitable objective function and use it for learning.

Learning in structured output spaces


Page 24: Magic Moments: Moment-based Approaches to Structured Output Prediction

Computing moments


The number $N$ of possible output vectors $\mathbf{y}_k$ given an observation $\mathbf{x}$ is typically huge.

To characterize the distribution of the scores $s(\mathbf{x}, \mathbf{y}_k, \mathbf{w})$, its mean and its variance are considered.

$\boldsymbol{\mu}$ and $C$ can be computed efficiently with DP techniques.

$\mu_s = \dfrac{1}{N} \sum_{k=1}^{N} s(\mathbf{x}, \mathbf{y}_k, \mathbf{w}) = \mathbf{w}^T \Big( \dfrac{1}{N} \sum_{k=1}^{N} \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) \Big) = \mathbf{w}^T \boldsymbol{\mu}$

$\sigma_s^2 = \dfrac{1}{N} \sum_{k=1}^{N} \big( \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) - \mathbf{w}^T \boldsymbol{\mu} \big)^2 = \mathbf{w}^T \Big( \dfrac{1}{N} \sum_{k=1}^{N} \big( \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) - \boldsymbol{\mu} \big) \big( \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) - \boldsymbol{\mu} \big)^T \Big) \mathbf{w} = \mathbf{w}^T C \mathbf{w}$
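These identities are easy to check numerically; a small sketch (our own, with numpy and a synthetic matrix of feature vectors standing in for the N outputs):

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.integers(0, 3, size=(64, 5)).astype(float)  # rows: phi(x, y_k)
w = rng.normal(size=5)

mu = Phi.mean(axis=0)                     # mean feature vector
C = np.cov(Phi, rowvar=False, bias=True)  # covariance with 1/N normalization

scores = Phi @ w
print(np.isclose(scores.mean(), w @ mu))    # mu_s = w^T mu
print(np.isclose(scores.var(), w @ C @ w))  # sigma_s^2 = w^T C w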

Page 25: Magic Moments: Moment-based Approaches to Structured Output Prediction

Input: $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, state $p$, symbol $q$.

$\mu(i, 1) := 1$ if ($q = x_1$) and ($p = i$), else $0$, for all states $i$
for $j = 2$ to $n$
    for $i = 1$ to $|\Sigma_y|$
        $M := 0$
        if ($q = x_j$) and ($p = i$), $M := 1$
        $\mu(i, j) := \frac{1}{|\Sigma_y|} \sum_{i'} \mu(i', j - 1) + M$
    endfor
endfor

Output: $\mu^e_{pq} = \frac{1}{|\Sigma_y|} \sum_i \mu(i, n)$

Computing moments


The number $N$ of possible label sequences $\mathbf{y}_k$ given an observation sequence $\mathbf{x}$ is exponential in the length of the sequence. An algorithm similar to the forward algorithm is used to compute $\boldsymbol{\mu}$ and $C$.

Sequence labeling: the recursive formula above computes $\mu^e_{pq} = \frac{1}{N} \sum_k \phi^e_{pq}(\mathbf{x}, \mathbf{y}_k)$, the mean value associated with the feature which represents the emission of a symbol $q$ at state $p$.
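As a concrete illustration (our own, under the assumption of a uniform distribution over all $|\Sigma_y|^n$ label sequences), the mean emission count can be accumulated forward and checked by brute force on a tiny instance:

from itertools import product

def mean_emission_count(x, n_states, p, q):
    n = len(x)
    # S[i]: total emission count over sequences ending in state i (length 1).
    S = [1.0 if (q == x[0] and p == i) else 0.0 for i in range(n_states)]
    n_seqs = 1  # number of sequences of the current length per ending state
    for j in range(1, n):
        total = sum(S)       # counts accumulated over positions 1..j
        n_seqs *= n_states   # sequences of length j+1 ending in a given state
        S = [total + (n_seqs if (q == x[j] and p == i) else 0.0)
             for i in range(n_states)]
    return sum(S) / n_states ** n

x, k, p, q = ("a", "b", "a"), 2, 0, "a"
brute = sum(sum(1 for j in range(len(x)) if y[j] == p and x[j] == q)
            for y in product(range(k), repeat=len(x))) / k ** len(x)
print(mean_emission_count(x, k, p, q), brute)  # both 1.0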

Page 26: Magic Moments: Moment-based Approaches to Structured Output Prediction

Computing moments


Basic idea behind the recursive formulas:

Mean values are computed using linearity of expectation:

$E\Big[\sum_i a_i\Big] = \sum_i E[a_i]$

Variances are computed by centering the second-order moments:

$E\big[(a_1 + a_2)^2\big] = E[a_1^2] + 2\,E[a_1 a_2] + E[a_2^2], \qquad \mathrm{Var}\Big[\sum_i a_i\Big] = E\Big[\Big(\sum_i a_i\Big)^2\Big] - \Big(E\Big[\sum_i a_i\Big]\Big)^2$

Page 27: Magic Moments: Moment-based Approaches to Structured Output Prediction

Computing moments


Problem: high computational cost for large feature spaces.

1st solution: exploit the structure and the sparseness of the covariance matrix $C$. In sequence labeling for a CRF with HMM features, the number of different values in $C$ is linear in the size of the observation alphabet.

2nd solution: sampling strategy.

Example: $|\Sigma_y| = 3$, $|\Sigma_x| = 4$.
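A minimal sketch of the sampling strategy (our own illustration; the feature map and alphabet sizes are toy choices): draw label sequences uniformly at random and use the empirical moments.

import numpy as np

def sampled_moments(x, n_states, n_samples, feat, rng):
    Phi = np.array([feat(x, rng.integers(0, n_states, size=len(x)))
                    for _ in range(n_samples)], dtype=float)
    return Phi.mean(axis=0), np.cov(Phi, rowvar=False, bias=True)

# Emission counts of symbol "a" per state, as in the earlier sketch.
def feat(x, y):
    return [sum(1 for j, s in enumerate(x) if s == "a" and y[j] == p)
            for p in range(2)]

mu_hat, C_hat = sampled_moments(("a", "b", "a"), 2, 5000, feat,
                                np.random.default_rng(0))
print(mu_hat)  # approaches the exact means computed by DP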

Page 28: Magic Moments: Moment-based Approaches to Structured Output Prediction

Encoding: define a suitable feature map $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$.

Compression: characterize the output space in a synthetic and compact way.

Optimization: define a suitable objective function and use it for learning.

Learning in structured output spaces


Page 29: Magic Moments: Moment-based Approaches to Structured Output Prediction

Z-score


New optimization criterion particularly suited for non-separable cases.

Minimize the number of output vectors with score higher than the score of the correct pairs.

Maximize the Z-score:

$Z(\mathbf{x}, \mathbf{y}) = \dfrac{s(\mathbf{x}, \mathbf{y}) - \mu_s(\mathbf{x})}{\sigma_s(\mathbf{x})}$

Page 30: Magic Moments: Moment-based Approaches to Structured Output Prediction

Z-score


The Z-score can be expressed as a function of the parameters $\mathbf{w}$:

$Z(\mathbf{x}, \mathbf{y}) = \dfrac{s(\mathbf{x}, \mathbf{y}) - \mu_s(\mathbf{x})}{\sigma_s(\mathbf{x})} = \dfrac{\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) - \mathbf{w}^T \boldsymbol{\mu}}{\sqrt{\mathbf{w}^T C \mathbf{w}}} = \dfrac{\mathbf{w}^T \mathbf{b}}{\sqrt{\mathbf{w}^T C \mathbf{w}}}, \qquad \mathbf{b} = \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) - \boldsymbol{\mu}.$

Two equivalent optimization problems:

$\max_{\mathbf{w}} \dfrac{\mathbf{w}^T \mathbf{b}}{\sqrt{\mathbf{w}^T C \mathbf{w}}}$

$\min_{\mathbf{w}} \dfrac{1}{N} \sum_{k=1}^{N} \big( \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) \big)^2 \quad \text{s.t.} \quad \mathbf{w}^T \mathbf{b} = 1$

Page 31: Magic Moments: Moment-based Approaches to Structured Output Prediction

Z-score


Ranking loss:

$L_{rk}(\mathbf{x}, \mathbf{y}) = \dfrac{1}{N} \sum_{k=1}^{N} I\big( s(\mathbf{x}, \mathbf{y}_k) \geq s(\mathbf{x}, \mathbf{y}) \big)$

An upper bound on the ranking loss is minimized, so that the number of output vectors with score higher than the score of the correct pairs is minimized:

$L_{rk}(\mathbf{x}, \mathbf{y}) \leq L^u_{rk}(\mathbf{w}, \mathbf{x}, \mathbf{y}) = \dfrac{1}{N} \sum_{k=1}^{N} \big( 1 + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) \big)^2$

The bound follows since $I(t \geq 0) \leq (1 + t)^2$ for every real $t$.

Page 32: Magic Moments: Moment-based Approaches to Structured Output Prediction

Previous approaches

Minimize the number of incorrect macrolabels $\mathbf{y}$: $L_{0/1}(\mathbf{x}, \mathbf{y}) = I\big( h(\mathbf{x}) \neq \mathbf{y} \big)$.

CRFs [Lafferty et al., 01], HMSVM [Altun et al., 03], averaged perceptron [Collins, 02].

Minimize the number of incorrect microlabels $y_j$: $L_{m}(\mathbf{x}, \mathbf{y}) = \sum_j I\big( h(\mathbf{x})_j \neq y_j \big)$.

M3Ns [Taskar et al., 03], SVMISO [Tsochantaridis et al., 04].


Page 33: Magic Moments: Moment-based Approaches to Structured Output Prediction

SODA


Given a training set $T$, the empirical risk associated with the upper bound on the ranking loss is minimized. An equivalent formulation in terms of $C^*$ and $\mathbf{b}^*$ is used to solve it.

SODA (Structured Output Discriminant Analysis):

$\min_{\mathbf{w}} \sum_{i=1}^{\ell} \dfrac{1}{N_i} \sum_{k=1}^{N_i} \big( \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_{ik}) \big)^2 \quad \text{s.t.} \quad \sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i = 1$

or, equivalently,

$\max_{\mathbf{w}} \dfrac{\mathbf{w}^T \mathbf{b}^*}{\sqrt{\mathbf{w}^T C^* \mathbf{w}}}, \qquad \mathbf{b}^* = \sum_{i=1}^{\ell} \mathbf{b}_i, \quad C^* = \sum_{i=1}^{\ell} \big( C_i + \mathbf{b}_i \mathbf{b}_i^T \big),$

with $\mathbf{b}_i = \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) - \boldsymbol{\mu}_i$ and $C_i$ the moments of the $i$-th training pair.

Page 34: Magic Moments: Moment-based Approaches to Structured Output Prediction

SODA: convex optimization.

If $C^*$ is not positive definite, regularization can be introduced.

Solution: simple matrix inversion (see below).

Fast conjugate gradient methods are available.


$\max_{\mathbf{w}} \dfrac{\mathbf{w}^T \mathbf{b}^*}{\sqrt{\mathbf{w}^T C^* \mathbf{w}}} \quad \Longleftrightarrow \quad \min_{\mathbf{w}} \mathbf{w}^T C^* \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T \mathbf{b}^* = 1$

Solution: $\mathbf{w} \propto C^{*-1} \mathbf{b}^*$.
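A sketch of this solve step (our own illustration; numpy/scipy assumed, with a ridge term standing in for the regularization mentioned above):

import numpy as np
from scipy.sparse.linalg import cg

def soda_weights(b_star, C_star, reg=1e-6, use_cg=False):
    # w is proportional to the solution of (C* + reg I) w = b*.
    A = C_star + reg * np.eye(len(b_star))
    if use_cg:
        w, info = cg(A, b_star)         # conjugate gradient solve
        assert info == 0
    else:
        w = np.linalg.solve(A, b_star)  # direct solve ("matrix inversion")
    return w

# Toy PSD moments for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
print(soda_weights(rng.normal(size=4), X.T @ X / 10, use_cg=True))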

Page 35: Magic Moments: Moment-based Approaches to Structured Output Prediction

Rademacher bound

The bound shows that learning based on the upper bound on the ranking loss is effectively achieved. The bound holds also in the case where $\mathbf{b}^*$ and $C^*$ are estimated by sampling.

Two directions of sampling:

For each $(\mathbf{x}, \mathbf{y}) \in T$ only a limited number $n$ of incorrect outputs is considered to estimate $\mathbf{b}^*$ and $C^*$.

Only a finite number $\ell$ of input-output pairs is given in the training set.

The empirical expectation of the estimated loss, $\hat{E}[\hat{L}^u_{rk}(\mathbf{x}, \mathbf{y})]$ (with $\mathbf{b}^*$ and $C^*$ computed by random sampling), is a good approximate upper bound for the expected loss $E[L^u_{rk}(\mathbf{x}, \mathbf{y})]$. The latter is an upper bound for the ranking loss $L_{rk}(\mathbf{x}, \mathbf{y})$, so the Rademacher bound is also a bound on the expectation of the ranking loss.

Page 36: Magic Moments: Moment-based Approaches to Structured Output Prediction

Rademacher bound

Theorem (Rademacher bound for SODA). With probability at least $1 - \delta$ over the joint of the random sample $T$ and the random samples from the output space for each $(\mathbf{x}, \mathbf{y}) \in T$ that are taken to approximate $\mathbf{b}^*$ and $C^*$, the following bound holds for any $\mathbf{w}$ with squared norm smaller than $c$:

$E\big[ L^u_{rk}(\mathbf{x}, \mathbf{y}) \big] \;\leq\; \hat{E}\big[ \hat{L}^u_{rk}(\mathbf{x}, \mathbf{y}) \big] + \hat{R}_1 + \hat{R}_2 + 3M \sqrt{\dfrac{\log(2/\delta)}{2\ell}} + 3M \sqrt{\dfrac{\log(2/\delta)}{2n}}$

whereby $M$ is a constant and we assume that the number of random samples for each training pair is equal to $n$. The Rademacher complexity terms $\hat{R}_1$ and $\hat{R}_2$ decrease with $1/\sqrt{\ell}$ and $1/\sqrt{n}$ respectively, such that the bound becomes tight for increasing $n$ and $\ell$, as long as $n$ grows faster than $\log(\ell)$.

Page 37: Magic Moments: Moment-based Approaches to Structured Output Prediction

Z-score approach


How to define the Z-score of a training set? Another possible approach (independence assumption):

A convex optimization problem which can again be solved by simple matrix inversion.

By maximizing the Z-score, most of the linear constraints below are satisfied.

Z-score approach:

$\max_{\mathbf{w}} \dfrac{\sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i}{\sqrt{\sum_{i=1}^{\ell} \mathbf{w}^T C_i \mathbf{w}}}$

Linear constraints:

$\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \geq \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_{ik}), \qquad i = 1, \ldots, \ell, \quad \forall\, \mathbf{y}_{ik} \neq \mathbf{y}_i.$

Page 38: Magic Moments: Moment-based Approaches to Structured Output Prediction

One may want to impose the violated constraints explicitly.

This is again a convex optimization problem that can be solved with an iterative algorithm similar to previous approaches (HMSVM [Altun et al., 03], averaged perceptron [Collins, 02]).

Eventually relax constraints (e.g. add slack variables for non-separable problems).

Iterative approach


QP:

$\min_{\mathbf{w}} \mathbf{w}^T C^* \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T \mathbf{b}^* = 1 \quad \text{and} \quad \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \geq \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_{ik}) \;\; \text{for all imposed constraints}.$

Page 39: Magic Moments: Moment-based Approaches to Structured Output Prediction

Input: training set $T$

1: $\mathcal{C} \leftarrow \emptyset$
2: Compute $\mathbf{b}_i$, $C_i$ for all $i = 1, \ldots, \ell$
3: Compute $\mathbf{b}^* = \sum_i \mathbf{b}_i$, $C^* = \sum_i C_i$
4: Find $\mathbf{w}$ solving QP
5: repeat
6:   for $i = 1, \ldots, \ell$ do
7:     Compute $\mathbf{y}_i' = \arg\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y})$
8:     if $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i') > \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i)$
9:       $\mathcal{C} \leftarrow \mathcal{C} \cup \{ \mathbf{w}^T (\boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) - \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i')) > 0 \}$
10:      Find $\mathbf{w}$ solving QP s.t. $\mathcal{C}$
11:    endif
12:  endfor
13: until $\mathcal{C}$ has not changed during the current iteration

Iterative approach


(The slide annotates the algorithm's phases: moments computation, Z-score maximization, identification of the most violated constraint, and constrained Z-score maximization.)
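A sketch of this loop in Python (our own illustration: `decode` is an assumed argmax routine such as Viterbi, `phi` an assumed feature map returning numpy arrays, and cvxpy is used as a stand-in QP solver):

import numpy as np
import cvxpy as cp

def iterative_zscore(b_star, C_star, train, phi, decode, max_iter=20):
    d = len(b_star)
    A = C_star + 1e-8 * np.eye(d)  # keep the quadratic form PSD
    added = []                     # the constraint set C
    w_var = cp.Variable(d)
    def solve():
        cons = [b_star @ w_var == 1] + [a @ w_var >= 0 for a in added]
        cp.Problem(cp.Minimize(cp.quad_form(w_var, A)), cons).solve()
        return w_var.value
    w = solve()
    for _ in range(max_iter):
        changed = False
        for x, y in train:
            y_hat = decode(w, x)                     # highest-scoring output
            if w @ phi(x, y_hat) > w @ phi(x, y):    # violated constraint
                added.append(phi(x, y) - phi(x, y_hat))
                w = solve()                          # re-solve the QP
                changed = True
        if not changed:
            break
    return w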

Page 40: Magic Moments: Moment-based Approaches to Structured Output Prediction

Experimental results

Chain CRF with HMM features ($|\Sigma_y| = 2$, $|\Sigma_x| = 4$ and $|\Sigma_y| = 3$, $|\Sigma_x| = 5$). Sequence length: 50. Training set size: 20 pairs. Test set size: 100 pairs. Comparison with SVMISO [Tsochantaridis et al., 04], Perceptron [Collins, 02], CRFs [Lafferty et al., 01]. Average number of incorrect labels varying the level of noise $p$.


Sequence labeling: artificial data.

[Figure: test error vs. noise level $p$ for the two settings; methods: SODA, Perceptron, SVMISO, CRFs, Z-score.]

Page 41: Magic Moments: Moment-based Approaches to Structured Output Prediction

Experimental results

HMM features ($|\Sigma_y| = 3$, $|\Sigma_x| = 5$). Noise level $p = 0.2$. Average number of incorrect labels and computational time as a function of the training set size.


Sequence labeling: artificial data.

[Figure: test error vs. training set size (CRFs, SVMISO, Perceptron, SODA) and training time vs. training set size (SODA, SVMISO).]

Page 42: Magic Moments: Moment-based Approaches to Structured Output Prediction

Chain CRF with HMM features ($|\Sigma_y| = 3$, varying $|\Sigma_x|$). Sequence length: 10. Training set size: 50 pairs. Test set size: 100 pairs. Level of noise $p = 0.2$. Comparison with SVMISO [Tsochantaridis et al., 04]. Labeling error on the test set and average training time as a function of the observation alphabet size.

Experimental results

[Figure: training time (sec) and test error vs. observation alphabet size; methods: SODA (50 sampled paths), SODA (200 sampled paths), SODA (exact DP), SVMISO.]


Sequence labeling: artificial data.


Page 43: Magic Moments: Moment-based Approaches to Structured Output Prediction

Experimental results

Chain CRF with HMM features ($|\Sigma_y| = 2$, $|\Sigma_x| = 4$).

Adding constraints is not very useful when data are noisy and non-linearly separable.

[Figure: average number of correct hidden sequences (%) vs. number of constraints; methods: Z-score (constrained), SVMISO, Perceptron.]


Sequence labeling: artificial data.


Page 44: Magic Moments: Moment-based Approaches to Structured Output Prediction

Experimental results


Sequence labeling:

NER

Spanish news wire article - Special Session of CoNLL02

300 sentences with an average length of 30 words. 9 labels: non-name, beginning and continuation of person, organization, location and miscellaneous names.

Two sets of binary features: S1 (HMM features) and S2 (S1 plus HMM features for the previous and the next word).

Labeling error on the test set (5-fold cross-validation):

Method       S1      S2
Z-score      11.07   7.89
SODA         10.13   8.27
SVMISO       10.97   8.11
Perceptron   20.99   13.78
CRFs         12.01   8.29

Page 45: Magic Moments: Moment-based Approaches to Structured Output Prediction

Experimental results


Sequence alignment: artificial sequences.

Test error (number of incorrectly aligned pairs) as a function of the training set size:

Training set size    5      10     20     50     100
SODA                78.6   62.85  44.6   36.7   30.84
Generative          96.4   94.39  87.12  45.31  31.05


[Figure: original and reconstructed substitution matrices.]

Page 46: Magic Moments: Moment-based Approaches to Structured Output Prediction

Experimental results


Sequence parsing:

G6 grammar in [Dowell and Eddy, 2004]. RNA sequences of five families extracted from the Rfam database [Griffiths-Jones et al., 2003].

Prediction on five-fold cross-validation:

           Z-score with constraints        Generative          Perceptron
Family     sens.   spec.   constraints     sens.    spec.      sens.   spec.
RF00032    100     95.98    2              100      95.53      100     95.59
RF00260    98.77   94.80    6              98.97   100         98.57   98.90
RF00436    91.11   90.61   27.6            44.16    53.30      90.27   86.53
RF00164    76.14   73.74   37.8            65.51    62.55      87.06   78.32
RF00480    99.08   89.89   78.2            99.88    86.43      98.83   94.78

Page 47: Magic Moments: Moment-based Approaches to Structured Output Prediction

Conclusions


New methods for learning in structured output spaces. Accuracy comparable with state-of-the-art techniques. Easy to implement (DP for the matrix computations and a simple optimization problem). Fast for large training sets and a reasonable number of features:

• Mean and variance computations are parallelizable for large training sets.
• Conjugate gradient techniques are used in the optimization phase.

Three applications analyzed: sequence labeling, sequence parsing and sequence alignment.

Future work: test the scalability of this approach using approximate techniques; develop a dual version with kernels.

Page 48: Magic Moments: Moment-based Approaches to Structured Output Prediction


Thank you