
Page 1:

Structured Output Prediction with

Structural Support Vector Machines

Thorsten Joachims

Cornell University

Department of Computer Science

Joint work with T. Hofmann, I. Tsochantaridis, Y. Altun (Brown/Google/TTI) T. Finley, R. Elber, Chun-Nam Yu, Yisong Yue, F. Radlinski

P. Zigoris, D. Fleisher (Cornell)

Page 2:

Supervised Learning

• Assume: Data is i.i.d. from a distribution $P(X,Y)$
• Given: Training sample $S = ((x_1, y_1), \ldots, (x_n, y_n))$
• Goal: Find a function $h: X \to Y$ from input space X to output space Y with low risk / prediction error $R(h) = \int \Delta(y, h(x)) \, dP(x,y)$
• Methods: Kernel Methods, SVM, Boosting, etc.

[Slide callout: the outputs y can be complex objects]

Page 3:

Examples of Complex Output Spaces

• Natural Language Parsing

– Given a sequence of words x, predict the parse tree y.

– Dependencies from structural constraints, since y has to be a tree.

[Figure: x = "The dog chased the cat"; y = parse tree with S → NP VP, NP → Det N, VP → V NP, NP → Det N]

Page 4:

Examples of Complex Output Spaces

• Protein Sequence Alignment

– Given two sequences x=(s,t), predict an alignment y.

– Structural dependencies, since prediction has to be a valid global/local alignment.

[Figure: x = (s, t) with s = (ABJLHBNJYAUGAI) and t = (BHJKBNYGU); y = the gapped alignment
AB-JLHBNJYAUGAI
BHJK-BN-YGU]

Page 5:

Examples of Complex Output Spaces

• Information Retrieval

– Given a query x, predict a ranking y.

– Dependencies between results (e.g. avoid redundant hits)

– Loss function over rankings (e.g. AvgPrec)

[Figure: query x = "SVM"; ranking y =
1. Kernel-Machines
2. SVM-Light
3. Learning with Kernels
4. SV Meppen Fan Club
5. Service Master & Co.
6. School of Volunteer Management
7. SV Mattersburg Online
…]

Page 6:

Examples of Complex Output Spaces

• Noun-Phrase Co-reference

– Given a set of noun phrases x, predict a clustering y.

– Structural dependencies, since prediction has to be an equivalence relation.

– Correlation dependencies from interactions.

[Figure: x = the noun phrases in "The policeman fed the cat. He did not know that he was late. The cat is called Peter."; y = a clustering of those noun phrases, e.g. {The policeman, He, he} and {the cat, The cat, Peter}]

Page 7:

Examples of Complex Output Spaces

• and many, many more:
– Sequence labeling (e.g. part-of-speech tagging, named-entity recognition) [Lafferty et al. 01, Altun et al. 03]
– Collective classification (e.g. hyperlinked documents) [Taskar et al. 03]
– Multi-label classification (e.g. text classification) [Finley & Joachims 08]
– Binary classification with non-linear performance measures (e.g. optimizing F1-score, avg. precision) [Joachims 05]
– Inverse reinforcement learning / planning (i.e. learn reward function to predict action sequences) [Abbeel & Ng 04]

Page 8:

Overview

• Task: Discriminative learning with complex outputs

• Related Work

• SVM algorithm for complex outputs

– Predict trees, sequences, equivalence relations, alignments

– General non-linear loss functions

– Generic formulation as convex quadratic program

• Training algorithms

– n-slack vs. 1-slack formulation

– Correctness and sparsity bound

• Applications

– Sequence alignment for protein structure prediction [w/ Chun-Nam Yu]

– Diversification of retrieval results in search engines [w/ Yisong Yue]

– Supervised clustering [w/ Thomas Finley]

• Conclusions

Page 9:

Why Discriminative Learning for Structured Outputs?

• Important applications for which conventional methods don’t fit!

– Diversified retrieval [Carbonell & Goldstein 98] [Chen & Karger 06]

– Directly optimize complex loss functions (e.g. F1, AvgPrec)

• Direct modeling of problem instead of reduction!

– Noun-phrase co-reference: replaces the two-step approach of pairwise classification followed by clustering as post-processing (e.g. [Ng & Cardie, 2002])

• Improve upon prediction accuracy of existing generative methods!

– Natural language parsing: generative models like probabilistic context-free grammars

– SVM outperforms naïve Bayes for text classification [Joachims, 1998] [Dumais et al., 1998]

• More flexible models!

– Avoid generative (independence) assumptions

– Kernels for structured input spaces and non-linear functions

Precision/Recall Break-Even Point:

          Naïve Bayes   Linear SVM
Reuters      72.1          87.5
WebKB        82.0          90.3
Ohsumed      62.4          71.6

Page 10:

Related Work

• Generative training (i.e. model P(Y,X))
– Hidden Markov models
– Probabilistic context-free grammars
– Markov random fields
– etc.
• Discriminative training (i.e. model P(Y|X) or minimize risk)
– Multivariate output regression [Izenman, 1975] [Breiman & Friedman, 1997]
– Kernel Dependency Estimation [Weston et al., 2003]
– Graph transformer networks [LeCun et al., 1998]
– Conditional HMM [Krogh, 1994]
– Conditional random fields [Lafferty et al., 2001]
– Perceptron training of HMM [Collins, 2002]
– Maximum-margin Markov networks [Taskar et al., 2003]
– Structural SVMs [Altun et al. 03] [Joachims 03] [TsoHoJoAl04]

Page 11:

Overview

• Task: Discriminative learning with complex outputs

• Related Work

• SVM algorithm for complex outputs

– Predict trees, sequences, equivalence relations, alignments

– General non-linear loss functions

– Generic formulation as convex quadratic program

• Training algorithms

– n-slack vs. 1-slack formulation

– Correctness and sparsity bound

• Applications

– Sequence alignment for protein structure prediction [w/ Chun-Nam Yu]

– Diversification of retrieval results in search engines [w/ Yisong Yue]

– Supervised clustering [w/ Thomas Finley]

• Conclusions

Page 12:

Classification SVM [Vapnik et al.]

• Training Examples: $(x_1, y_1), \ldots, (x_n, y_n)$ with $y_i \in \{-1, +1\}$
• Hypothesis Space: $h(x) = \mathrm{sign}(w \cdot x + b)$
• Training: Find the hyperplane with minimal $\frac{1}{2}\|w\|^2$ (i.e. maximal margin)

Primal Opt. Problem:
– Hard Margin (separable): $\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$ for all i
– Soft Margin (training error): $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$

Dual Opt. Problem: $\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ s.t. $\sum_i \alpha_i y_i = 0$, $0 \le \alpha_i \le C$
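As a concrete illustration, here is a minimal sketch (not the slide's solver) that minimizes the equivalent unconstrained soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i(w \cdot x_i + b))$ by stochastic subgradient descent; the toy data, learning rate, and epoch count are illustrative assumptions, and real SVM training solves the QPs above.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, lr=0.01, epochs=200):
    """Stochastic subgradient descent on 1/2||w||^2 + C * sum of hinge losses."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) < 1:        # hinge active: margin violated
                w -= lr * (w - C * yi * xi)
                b += lr * C * yi
            else:                            # only the regularizer pulls on w
                w -= lr * w
    return w, b

X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd(X, y)
print(np.sign(X @ w + b))  # [ 1.  1. -1. -1.] on this separable toy set
```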

Page 13:

Challenges in Discriminative Learning with Complex Outputs

• Approach: view as a multi-class classification task
– Every complex output is one class
• Problems:
– Exponentially many classes!
  • How to predict efficiently?
  • How to learn efficiently?
– Potentially huge model!
  • Manageable number of features?

[Figure: x = "The dog chased the cat" with candidate parse trees y₁, y₂, …, y_k, one class per tree]

Page 14:

Multi-Class SVM [Crammer & Singer]

• Training Examples: $(x_1, y_1), \ldots, (x_n, y_n)$
• Hypothesis Space: $h(x) = \arg\max_{y \in Y} \, w_y \cdot \Phi(x)$ with one weight vector $w_y$ per class

[Figure: x = "The dog chased the cat" with exponentially many candidate parse trees y₁, y₂, y₄, y₁₂, y₃₄, y₅₈, …, each treated as its own class]

• Training: Find $w_1, \ldots, w_{|Y|}$ that solve
$\min_w \frac{1}{2} \sum_y \|w_y\|^2$ s.t. $w_{y_i} \cdot \Phi(x_i) - w_y \cdot \Phi(x_i) \ge 1$ for all i and all $y \ne y_i$

Problems:
• How to predict efficiently?
• How to learn efficiently?
• Manageable number of parameters?

Page 15:

Joint Feature Map

[Figure: x = "The dog chased the cat" with candidate parse trees y₁, y₂, y₄, y₁₂, y₃₄, y₅₈, …]

• Feature vector $\Phi(x, y)$ that describes the match between x and y
• Learn a single weight vector $w$ and rank by $w \cdot \Phi(x, y)$

Problems:
• How to predict efficiently?
• How to learn efficiently?
• Manageable number of parameters?

Page 16:

Joint Feature Map for Trees

• Weighted Context-Free Grammar
– Each rule (e.g. $S \to NP\ VP$) has a weight
– Score of a tree is the sum of the weights of its rules
– Find the highest-scoring tree $h(x) = \arg\max_{y \in Y} w \cdot \Phi(x, y)$ with a CKY parser (see the sketch below)

[Figure: x = "The dog chased the cat" and its parse tree y; $\Phi(x, y)$ counts how often each rule occurs in y, e.g. S → NP VP: 1, NP → Det N: 2, VP → V NP: 1, Det → the: 2, N → dog: 1, V → chased: 1, N → cat: 1]

Problems:
• How to predict efficiently?
• How to learn efficiently?
• Manageable number of parameters?
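To make the weighted-CFG scoring concrete, here is a small sketch under the assumption that a parse tree is given as nested tuples and $w$ is a dictionary of rule weights; unknown rules score 0, and the real system would obtain the argmax tree from the CKY parser rather than score a fixed tree.

```python
from collections import Counter

def rules(tree):
    """Yield the rules used in a tree like ("NP", ("Det", "the"), ("N", "dog"))."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, children[0])                       # lexical rule, e.g. N -> dog
    else:
        yield (label,) + tuple(c[0] for c in children)   # e.g. S -> NP VP
        for c in children:
            yield from rules(c)

def phi(tree):
    """Joint feature map Phi(x, y): count of each rule occurrence in the tree."""
    return Counter(rules(tree))

def score(w, tree):
    """Linear score w . Phi(x, y): weighted sum of the rules used."""
    return sum(w.get(r, 0.0) * n for r, n in phi(tree).items())

y = ("S",
     ("NP", ("Det", "the"), ("N", "dog")),
     ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "cat"))))
w = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 2.0,
     ("VP", "V", "NP"): 1.0, ("Det", "the"): 0.5}
print(score(w, y))  # 1.0 + 2*2.0 + 1.0 + 2*0.5 = 7.0
```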

Page 17:

Structural Support Vector Machine

• Joint features $\Phi(x, y)$ describe the match between x and y
• Learn weights $w$ so that $w \cdot \Phi(x, y)$ is maximal for the correct y

Hard-margin optimization problem:
$\min_w \frac{1}{2}\|w\|^2$ s.t. $w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge 1$ for all i and all $y \in Y \setminus \{y_i\}$

Page 18:

Loss Functions: Soft-Margin Struct SVM

• Loss function $\Delta(y_i, y)$ measures the match between target and prediction.

Soft-margin optimization problem:
$\min_{w,\xi} \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i$ s.t. $w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge \Delta(y_i, y) - \xi_i$ for all i and all y

Lemma: The training loss is upper bounded by $\frac{1}{n} \sum_{i=1}^{n} \xi_i$.

Page 19:

Experiment: Natural Language Parsing [TsoJoHoAl04]

• Implementation
– Incorporated a modified version of Mark Johnson's CKY parser
– Learned a weighted CFG (one weight per rule)
• Data
– Penn Treebank sentences of length at most 10 (starting from POS tags)
– Train on Sections 2-22: 4098 sentences
– Test on Section 23: 163 sentences
• More complex features: [TaKlCoKoMa04]

Page 20:

Generic Structural SVM

• Application-specific design of model
– Loss function $\Delta(y_i, y)$
– Representation $\Phi(x, y)$, e.g. Markov random fields [Lafferty et al. 01, Taskar et al. 04]
• Prediction: $h(x) = \arg\max_{y \in Y} w \cdot \Phi(x, y)$
• Training: solve the soft-margin optimization problem above
• Applications: Parsing, Sequence Alignment, Clustering, etc.

Page 21:

Reformulation of the Structural SVM QP

n-Slack Formulation: [TsoJoHoAl04]
$\min_{w,\xi \ge 0} \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i$ s.t. for all i and all $y \in Y$: $w \cdot [\Phi(x_i, y_i) - \Phi(x_i, y)] \ge \Delta(y_i, y) - \xi_i$

Page 22:

Reformulation of the Structural SVM QP

n-Slack Formulation: [TsoJoHoAl04] (one slack variable $\xi_i$ per example, as above)

1-Slack Formulation: [JoFinYu08] (a single shared slack variable $\xi$)
$\min_{w,\xi \ge 0} \frac{1}{2}\|w\|^2 + C\,\xi$ s.t. for all $(\bar{y}_1, \ldots, \bar{y}_n) \in Y^n$: $\frac{1}{n} \sum_{i=1}^{n} w \cdot [\Phi(x_i, y_i) - \Phi(x_i, \bar{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi$

Page 23:

Cutting-Plane Algorithm for Structural SVM (1-Slack Formulation) [Jo06] [JoFinYu08]

• Input: $S = ((x_1, y_1), \ldots, (x_n, y_n))$, $C$, $\epsilon$; working set $W = \emptyset$
• REPEAT
– FOR i = 1, …, n
  – Compute $\hat{y}_i = \arg\max_{y \in Y} [\Delta(y_i, y) + w \cdot \Phi(x_i, y)]$  ← find the most violated constraint
– ENDFOR
– IF the constraint for $(\hat{y}_1, \ldots, \hat{y}_n)$ is violated by more than $\epsilon$
  – Add the constraint to the working set $W$
  – $(w, \xi) \leftarrow$ optimize the StructSVM QP over $W$
– ENDIF
• UNTIL $W$ has not changed during the iteration
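Here is a compact sketch of the loop above for a toy problem in which Y is small enough to brute-force the separation oracle; the function names, the toy data, and the use of SciPy's SLSQP as a stand-in for the specialized dual QP solver inside SVM-struct are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qp(work, d, C):
    """min 1/2||w||^2 + C*xi  s.t.  w.g >= l - xi for (l, g) in work, xi >= 0."""
    obj = lambda z: 0.5 * z[:d] @ z[:d] + C * z[d]
    cons = [{'type': 'ineq', 'fun': lambda z, l=l, g=g: z[:d] @ g - l + z[d]}
            for l, g in work]
    cons.append({'type': 'ineq', 'fun': lambda z: z[d]})   # xi >= 0
    z = minimize(obj, np.zeros(d + 1), method='SLSQP', constraints=cons).x
    return z[:d], z[d]

def train_1slack(X, Y, labels, phi, delta, C=1.0, eps=1e-3):
    d = len(phi(X[0], Y[0]))
    w, xi, work = np.zeros(d), 0.0, []
    while True:
        # Separation oracle: per-example most violated output (brute force here;
        # DP/CKY/greedy algorithms play this role in the real applications).
        ybar = [max(labels, key=lambda y: delta(yi, y) + w @ phi(x, y))
                for x, yi in zip(X, Y)]
        l = np.mean([delta(yi, yb) for yi, yb in zip(Y, ybar)])
        g = np.mean([phi(x, yi) - phi(x, yb)
                     for x, yi, yb in zip(X, Y, ybar)], axis=0)
        if l - w @ g <= xi + eps:        # violated by more than eps?
            return w
        work.append((l, g))              # add constraint to the working set
        w, xi = solve_qp(work, d, C)     # re-optimize over the working set

# Toy multi-class instantiation (hypothetical): phi stacks x into the block of y.
labels = [0, 1, 2]
phi = lambda x, y: np.concatenate([x * (y == k) for k in labels])
delta = lambda y, yb: float(y != yb)
X = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
w = train_1slack(X, [0, 1, 2], labels, phi, delta, C=10.0)
```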

Page 24:

Polynomial Sparsity Bound [Jo03] [Jo06] [TeoLeSmVi07] [JoFinYu08]

• Theorem: The cutting-plane algorithm finds a solution to the Structural SVM soft-margin optimization problem in the 1-slack formulation after adding at most
$\left\lceil \log_2\left(\frac{\bar{\Delta}}{4 R^2 C}\right) \right\rceil + \left\lceil \frac{16\, C\, \bar{\Delta}\, R^2}{\epsilon^2} \right\rceil$
constraints to the working set S, so that the primal constraints are feasible up to a precision $\epsilon$ and the objective on S is optimal. The loss has to be bounded, $\Delta(y_i, y) \le \bar{\Delta}$, and $\|\Phi(x_i, y)\| \le R$.

Page 25:

Empirical Comparison: Different Formulations [JoFinYu08]

Experiment setup:
– Part-of-speech tagging on the Penn Treebank corpus
– ~36,000 examples, ~250,000 features in a linear HMM model

Page 26:

Applying StructSVM to a New Problem

• General
– SVM-struct algorithm and implementation: http://svmlight.joachims.org
– Theory (e.g. training time linear in n)
• Application specific
– Loss function $\Delta(y_i, y)$
– Representation $\Phi(x, y)$
– Algorithms to compute the prediction $\arg\max_y w \cdot \Phi(x, y)$ and the most violated constraint (separation oracle)
• Properties
– General framework for discriminative learning
– Direct modeling, not reduction to classification/regression
– "Plug-and-play"

Page 27:

Overview

• Task: Discriminative learning with complex outputs

• Related Work

• SVM algorithm for complex outputs

– Predict trees, sequences, equivalence relations, alignments

– General non-linear loss functions

– Generic formulation as convex quadratic program

• Training algorithms

– n-slack vs. 1-slack formulation

– Correctness and sparsity bound

• Applications

– Sequence alignment for protein structure prediction [w/ Chun-Nam Yu]

– Diversification of retrieval results in search engines [w/ Yisong Yue]

– Supervised clustering [w/ Thomas Finley]

• Conclusions

Page 28:

Comparative Modeling of Protein Structure [Jo03, JoElGa05, YuJoEl06]

• Goal: Predict structure from sequence: h("APPGEAYLQV") → structure
• Hypothesis:
– Amino acid sequences fold into the structure with the lowest energy
– Problem: huge search space (> $2^{100}$ states)
• Approach: Comparative Modeling
– Similar protein sequences fold into similar shapes → use known shapes as templates
– Task 1: Decide whether a known protein is similar to a new sequence: h("APPGEAYLQV", template) → yes/no
– Task 2: Map the new protein onto the known structure: h("APPGEAYLQV", template) → [A3, P4, P7, …]
– Task 3: Refine the structure

Page 29:

Linear Score Sequence Alignment

Method: Find the alignment y that maximizes a linear score.

Substitution/gap scores:
      A    B    C    D    -
A    10    0   -5  -10   -5
B     0   10    5  -10   -5
C    -5    5   10  -10   -5
D   -10  -10  -10   10   -5
-    -5   -5   -5   -5   -5

Example:
– Sequences: s = (A B C D), t = (B A C C)
– Alignment y1:
  A B C D
  B A C C
  score(x=(s,t), y1) = 0 + 0 + 10 - 10 = 0
– Alignment y2:
  - A B C D
  B A C C -
  score(x=(s,t), y2) = -5 + 10 + 5 + 10 - 5 = 15

Algorithm: Solve the argmax via dynamic programming (see the sketch below).
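The dynamic program behind the argmax can be sketched in a few lines; this is a plain Needleman-Wunsch global alignment using the score table above (returning only the optimal score, with the traceback omitted for brevity).

```python
# Substitution/gap scores from the slide; the '-' row/column gives gap costs.
SCORE = {
    'A': {'A': 10,  'B': 0,   'C': -5,  'D': -10, '-': -5},
    'B': {'A': 0,   'B': 10,  'C': 5,   'D': -10, '-': -5},
    'C': {'A': -5,  'B': 5,   'C': 10,  'D': -10, '-': -5},
    'D': {'A': -10, 'B': -10, 'C': -10, 'D': 10,  '-': -5},
    '-': {'A': -5,  'B': -5,  'C': -5,  'D': -5,  '-': -5},
}

def align_score(s, t):
    """Maximum global alignment score via dynamic programming."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # leading gaps in t
        D[i][0] = D[i-1][0] + SCORE[s[i-1]]['-']
    for j in range(1, m + 1):                      # leading gaps in s
        D[0][j] = D[0][j-1] + SCORE['-'][t[j-1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i-1][j-1] + SCORE[s[i-1]][t[j-1]],  # match s_i:t_j
                          D[i-1][j] + SCORE[s[i-1]]['-'],       # gap in t
                          D[i][j-1] + SCORE['-'][t[j-1]])       # gap in s
    return D[n][m]

print(align_score("ABCD", "BACC"))  # 15, the score of alignment y2 above
```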

Page 30:

Predicting an Alignment

Protein Sequence to Structure Alignment (Threading) [YuJoEl07]
– Given a pair x = (s, t) of a new sequence s and a known structure t, predict the alignment y.
– Elements of s and t are described by features, not just character identity.

[Figure: s = (ABJLHBNJYAUGAI) and t = (BHJKBNYGU), each position annotated with secondary structure (β, λ, α) and surface-exposure values; y is the gapped alignment of the two annotated sequences]

Page 31:

Scoring Function for Vector Sequences

General form of linear scoring function: the score of an alignment is the sum of the match scores of its aligned positions and the gap scores of its gaps, where each match/gap score is itself a linear function $w \cdot \phi(s_i, t_j)$ of the position features (see the sketch below).
→ match/gap scores can be arbitrary linear functions
→ the argmax can still be computed efficiently via dynamic programming

Estimation:
– Generative estimation (e.g. log-odds, hidden Markov model)
– Discriminative estimation via structural SVM [YuJoEl07]
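A one-line illustration of such a linear match score, assuming each position carries a feature tuple (amino acid, secondary structure, exposure) as in the experiments below; the weight dictionary is hypothetical.

```python
def match_score(w, fs, ft):
    """Linear match score: sum of learned weights over aligned feature pairs."""
    return sum(w.get((a, b), 0.0) for a, b in zip(fs, ft))

w = {("A", "A"): 10.0, ("α", "α"): 2.0, (3, 3): 1.0}   # hypothetical weights
print(match_score(w, ("A", "α", 3), ("A", "α", 3)))     # 10 + 2 + 1 = 13.0
```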

Page 32:

Loss Function and Separation Oracle [YuJoEl07]

• Loss function:
– Q loss: fraction of incorrectly aligned positions
  • Correct alignment y:
    - A B C D
    B A C C -
  • Alternate alignment y':
    A - B C D
    B A C C -
  • ΔQ(y, y') = 1/3
– Q4 loss: fraction of incorrectly aligned positions outside a 4-position window
  • For the same y and y': ΔQ4(y, y') = 0/3
• Separation oracle:
– Same dynamic programming algorithm as for alignment
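The Q loss itself is straightforward to compute once alignments are represented as sets of matched index pairs; a small sketch (the index pairs below encode the two alignments above, with gaps omitted):

```python
def q_loss(y, y_pred):
    """Fraction of matched pairs in the correct alignment missing from y_pred."""
    y, y_pred = set(y), set(y_pred)
    return len(y - y_pred) / len(y)

y      = [(0, 1), (1, 2), (2, 3)]   # - A B C D  over  B A C C -
y_pred = [(0, 0), (1, 2), (2, 3)]   # A - B C D  over  B A C C -
print(q_loss(y, y_pred))            # 0.333... = 1/3, as on the slide
```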

Page 33:

Experiment

• Train set [Qiu & Elber]:

– 5119 structural alignments for training, 5169 structural alignments for validation of regularization parameter C

• Test set:

– 29764 structural alignments from new deposits to PDB from June 2005 to June 2006.

– All structural alignments were produced by the program CE by superimposing the 3D coordinates of the protein structures. All alignments have a CE Z-score greater than 4.5.

• Features (known for structure, SABLE predictions for sequence):

– Amino acid identity (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y)

– Secondary structure (α,β,λ)

– Exposed surface area (0,1,2,3,4,5)

[YuJoEl07]

Page 34:

Experiment Results [YuJoEl07]

Models:
• Simple: Φ(s,t,yi) → (A|A; A|C; …; -|Y; α|α; α|β; …; 0|0; 0|1; …)
• Anova2: Φ(s,t,yi) → (Aα|Aα; …; α0|α0; …; A0|A0; …)
• Tensor: Φ(s,t,yi) → (Aα0|Aα0; Aα0|Aα1; …)
• Window: Φ(s,t,yi) → (AAA|AAA; …; ααααα|ααααα; …; 00000|00000; …)

Q-score when optimizing to Q-loss (ability to train complex models):
Model    # Features   Test Q-score
Simple       1020        39.89
Anova2      49634        44.98
Tensor     203280        42.81
Window     447016        46.30

Q4-score when optimizing to Q4-loss (comparison against other methods):
Method                 Test Q4-score
BLAST                      28.44
SVM (Window)               70.71
SSALN [QiuElber]           67.30
TM-align [ZhaSko]         (85.32)

Page 35:

Overview

• Task: Discriminative learning with complex outputs

• Related Work

• SVM algorithm for complex outputs

– Predict trees, sequences, equivalence relations, alignments

– General non-linear loss functions

– Generic formulation as convex quadratic program

• Training algorithms

– n-slack vs. 1-slack formulation

– Correctness and sparsity bound

• Applications

– Sequence alignment for protein structure prediction [w/ Chun-Nam Yu]

– Diversification of retrieval results in search engines [w/ Yisong Yue]

– Supervised clustering [w/ Thomas Finley]

• Conclusions

Page 36:

Diversified Retrieval [YueJo08]

• Ambiguous queries:
– Example query: "SVM"
  • ML method
  • Service Master Company
  • Magazine
  • School of veterinary medicine
  • Sport Verein Meppen e.V.
  • SVM software
  • SVM books
– "Submodular" performance measure → make sure each user gets at least one relevant result
• Learning queries:
– Find all information about a topic
– Eliminate redundant information

[Figure: two rankings for the query "SVM" — a redundant one (1. Kernel Machines, 2. SVM book, 3. SVM-light, 4. libSVM, 5. Intro to SVMs, 6. SVM application list, …) vs. a diversified one (1. Kernel Machines, 2. Service Master Co, 3. SV Meppen, 4. UArizona Vet. Med., 5. SVM-light, 6. Intro to SVM, …)]

Page 37:

Approach [YueJo08]

• Prediction problem:
– Given a set x of documents, predict the size-k subset y that satisfies the most users.
• Approach: Topic redundancy ≈ word redundancy [SwMaKi08]
– Weighted max coverage: choose y to maximize the total benefit of the covered words
– The greedy algorithm is a (1-1/e)-approximation [Khuller et al 97] (see the sketch below)
– Learn the benefit weights

[Figure: documents D1…D7 as sets of words; the chosen subset y = {D1, D2, D3, D4}]
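The greedy (1-1/e)-approximation referenced above fits in a few lines; this sketch assumes documents are plain sets of words and benefit is a per-word weight, both hypothetical stand-ins for the learned feature-based benefits of the next slide.

```python
def greedy_coverage(docs, benefit, k):
    """docs: dict name -> set of words. Greedily pick the size-k subset that
    maximizes the total benefit of covered words."""
    docs = dict(docs)                      # don't mutate the caller's dict
    chosen, covered = [], set()
    for _ in range(k):
        # Pick the document adding the most uncovered benefit
        best = max(docs, key=lambda d: sum(benefit.get(w, 0.0)
                                           for w in docs[d] - covered))
        chosen.append(best)
        covered |= docs.pop(best)          # marginal gains shrink: submodular
    return chosen

docs = {"D1": {"svm", "kernel"}, "D2": {"svm", "soccer"},
        "D3": {"svm", "kernel", "margin"}, "D4": {"vet", "school"}}
benefit = {"svm": 2.0, "kernel": 1.0, "margin": 1.0, "soccer": 1.5, "vet": 0.5}
print(greedy_coverage(docs, benefit, k=2))  # ['D3', 'D2']
```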

Page 38:

Features Describing Word Importance [YueJo08]

• How important is it to cover word w? Each criterion below defines a feature (see the sketch after this slide):
– w occurs in at least X% of the documents in x
– w occurs in at least X% of the titles of the documents in x
– w is among the top 3 TFIDF words of X% of the documents in x
– w is a verb
• How well does a document d cover word w? Each criterion below defines a separate vocabulary and scoring function:
– w occurs in d
– w occurs at least k times in d
– w occurs in the title of d
– w is among the top k TFIDF words in d

[Figure: coverage of documents D1…D7 under each word-coverage definition, combined as a weighted sum]
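As a sketch of how the importance criteria become features, here is one of them (occurrence in at least X% of the documents) evaluated at a few thresholds; the threshold values are hypothetical.

```python
def importance_features(word, docs, thresholds=(0.05, 0.10, 0.25, 0.50)):
    """Indicator features: does `word` occur in at least X% of the documents?"""
    frac = sum(word in d for d in docs) / len(docs)
    return [1.0 if frac >= t else 0.0 for t in thresholds]

docs = [{"svm", "kernel"}, {"svm"}, {"vet", "school"}, {"svm", "margin"}]
print(importance_features("svm", docs))  # [1.0, 1.0, 1.0, 1.0]  (75% of docs)
```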

Page 39:

Loss Function and Separation Oracle [YueJo08]

• Loss function:
– Popularity-weighted percentage of subtopics not covered in y
→ more costly to miss popular subtopics

[Figure: example with documents D1…D12 covering subtopics of different popularity]

• Separation oracle:
– Again a weighted max-coverage problem
→ add an artificial word for each subtopic, weighted by its popularity percentage
– The greedy algorithm is a (1-1/e)-approximation [Khuller et al 97]

Page 40:

Experiments

• Data:
– TREC 6-8 Interactive Track

– Relevant documents manually labeled by subtopic

– 17 queries (~700 documents), 12/4/1 training/validation/test

– Subset size k=5, two feature sets (div, div2)

• Results:

Page 41:

Overview

• Task: Discriminative learning with complex outputs

• Related Work

• SVM algorithm for complex outputs

– Predict trees, sequences, equivalence relations, alignments

– General non-linear loss functions

– Generic formulation as convex quadratic program

• Training algorithms

– n-slack vs. 1-slack formulation

– Correctness and sparsity bound

• Applications

– Sequence alignment for protein structure prediction [w/ Chun-Nam Yu]

– Diversification of retrieval results in search engines [w/ Yisong Yue]

– Supervised clustering [w/ Thomas Finley]

• Conclusions

Page 42:

Learning to Cluster

• Noun-Phrase Co-reference

– Given a set of noun phrases x, predict a clustering y.

– Structural dependencies, since prediction has to be an equivalence relation.

– Correlation dependencies from interactions.

[Figure: x = the noun phrases in "The policeman fed the cat. He did not know that he was late. The cat is called Peter."; y = a clustering of those noun phrases, e.g. {The policeman, He, he} and {the cat, The cat, Peter}]

Page 43:

Struct SVM for Supervised Clustering [FiJo05]

• Representation
– Encode the clustering y as a pairwise 0/1 matrix that is reflexive (yii = 1), symmetric (yij = yji), and transitive (if yij = 1 and yjk = 1, then yik = 1)
– Joint feature map $\Phi(x, y) = \sum_{i,j} y_{ij} \, \phi(x_i, x_j)$
• Loss function
– Number of pairwise disagreements between y and the prediction y'
• Prediction
– $\arg\max_y w \cdot \Phi(x, y)$ is NP hard; use a linear relaxation instead [Demaine & Immorlica, 2003]
• Find most violated constraint
– also NP hard; use a linear relaxation instead [Demaine & Immorlica, 2003]

Example (sketched in code below): y encodes the clusters {1,2,3,4}, {5,6,7}; y' encodes {1,2,3}, {4}, {5,6,7}.
y =
1 1 1 1 0 0 0
1 1 1 1 0 0 0
1 1 1 1 0 0 0
1 1 1 1 0 0 0
0 0 0 0 1 1 1
0 0 0 0 1 1 1
0 0 0 0 1 1 1
y' =
1 1 1 0 0 0 0
1 1 1 0 0 0 0
1 1 1 0 0 0 0
0 0 0 1 0 0 0
0 0 0 0 1 1 1
0 0 0 0 1 1 1
0 0 0 0 1 1 1
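A small sketch of the matrix encoding and a pairwise-disagreement loss (one plausible reading of the loss left blank on the slide), reproducing the y and y' matrices above:

```python
import numpy as np

def to_matrix(clusters, n):
    """Encode a clustering as the 0/1 matrix of its equivalence relation."""
    y = np.eye(n, dtype=int)          # reflexive: y_ii = 1
    for c in clusters:
        for i in c:
            for j in c:
                y[i, j] = 1           # symmetric and transitive by construction
    return y

y  = to_matrix([{0, 1, 2, 3}, {4, 5, 6}], 7)     # clusters {1..4}, {5..7}
yp = to_matrix([{0, 1, 2}, {3}, {4, 5, 6}], 7)   # clusters {1..3}, {4}, {5..7}
print(int(np.sum(y != yp)))           # 6 disagreeing (i, j) entries
```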

Page 44:

Summary and Conclusions

• Learning to predict complex outputs
– Directly model the machine learning application end-to-end
• An SVM method for learning with complex outputs
– General method, algorithm, and theory
– Plug in representation, loss function, and separation oracle
– More details and further work:
  • Diversified retrieval [Yisong Yue, ICML08]
  • Sequence alignment [Chun-Nam Yu, RECOMB07, JCB08]
  • Supervised k-means clustering [Thomas Finley, forthcoming]
  • Approximate inference and separation oracle [Thomas Finley, ICML08]
  • Efficient kernelized structural SVMs [Chun-Nam Yu, KDD08]
• Software: SVMstruct
– General API
– Instances for sequence labeling, binary classification with non-linear loss, context-free grammars, diversified retrieval, sequence alignment, ranking
– http://svmlight.joachims.org/