Learning structured outputs: a reinforcement learning approach
P. Gallinari, F. Maes, L. Denoyer
[email protected]
www-connex.lip6.fr
University Pierre et Marie Curie – Paris – France
Computer Science Lab.
2008-07-02 ANNPR 2008 - P. Gallinari 2
Outline
Motivation and examples
Approaches for structured learning: generative models, discriminant models
Reinforcement learning for structured prediction
Experiments
Machine learning and structured data
Different types of problems: model, classify, or cluster structured data; predict structured outputs; learn to associate structured representations
Structured data and applications in many domains: chemistry, biology, natural language, the web, social networks, databases, etc.
Sequence labeling: POS
This/DT Workshop/NN brings/VBZ together/RB scientists/NNS and/CC engineers/NNS interested/VBN in/IN recent/JJ developments/NNS in/IN exploiting/VBG Massive/JJ data/NP sets/NP
(DT = determiner, NN = noun, VBZ = verb 3rd person singular, RB = adverb, NNS = plural noun, CC = coordinating conjunction, VBN = past participle, IN = preposition, JJ = adjective, VBG = verb gerund)
Segmentation + labeling: syntactic chunking (Washington Univ. tagger)
[NP This Workshop] [VP brings] [ADVP together] [NP scientists and engineers] [VP interested] [IN in] [NP recent developments] [PNP in] [NP exploiting Massive data sets]
(NP = noun phrase, VP = verb phrase, ADVP = adverbial phrase, PNP = prepositional phrase)
Syntactic parsing (Stanford Parser)
Tree mapping: problem
Query heterogeneous XML collections; Web information extraction
Need to know the correspondence between the structured representations, which is usually established by hand
Learn the correspondence between the different sources: the labeled tree mapping problem
<Restaurant>
  <Name>La cantine</Name>
  <Address>65 rue des pyrénées, Paris, 19ème, FRANCE</Address>
  <Specialities>Canard à l’orange, Lapin au miel</Specialities>
</Restaurant>
<Restaurant>
  <Name>La cantine</Name>
  <Address>
    <City>Paris</City>
    <Street>pyrénées</Street>
    <Num>65</Num>
  </Address>
  <Dishes>Canard à l’orange</Dishes>
  <Dishes>Lapin au miel</Dishes>
</Restaurant>
Others: taxonomies, social networks, adversarial computing (Webspam, Blogspam, …), translation, biology, …
Is structure really useful? Can we make use of structure?
Yes: evidence from many domains and applications; mandatory for many problems (e.g. a 10,000-class classification problem)
Yes, but: complex or long-term dependencies often correspond to rare events; practical evidence on large-size problems shows that simple models sometimes offer competitive results (information retrieval, speech recognition, etc.)
Why is structured prediction difficult?
Size of the output space: the number of potential labelings for a sequence, of potential joint labelings and segmentations, of potential labeled trees for an input sequence, … The output space is exponential.
Inference often amounts to a combinatorial search problem
Scaling issues in most models
Structured learning
X, Y: input and output spaces
Structured output: y ∈ Y decomposes into parts of variable size, y = (y1, y2, …, yT)
Dependencies: relations between the parts of y; local, long-term, or global
Cost function Δ(y*, ŷ), e.g.:
0/1 loss: Δ(y*, ŷ) = 1[y* ≠ ŷ]
Hamming loss: Δ(y*, ŷ) = Σ_{i=1..T} 1[y*_i ≠ ŷ_i]
F-score, BLEU, etc.
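As a minimal Python sketch (toy labels invented for illustration), the 0/1 and Hamming losses above can be written as:

```python
def zero_one_loss(y_true, y_pred):
    """0/1 loss: 1 if the predicted sequence differs anywhere, else 0."""
    return int(list(y_true) != list(y_pred))

def hamming_loss(y_true, y_pred):
    """Hamming loss: number of positions with a wrong label."""
    return sum(int(t != p) for t, p in zip(y_true, y_pred))

print(zero_one_loss(["R", "B", "R"], ["R", "B", "R"]))  # 0
print(hamming_loss(["R", "B", "R"], ["R", "R", "B"]))   # 2
```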
General approach
Predictive approach: y* = f(x) = argmax_{y ∈ Y} F(x, y, θ), where F : X × Y → R is a score function used to rank potential outputs
F is trained to optimize some loss function
Inference problem: |Y| is sometimes exponential, so the argmax is often intractable
Hypothesis: decomposability of the score function over the parts of y, y* = f(x) = argmax_{y ∈ Y} Σ_i F(x, y_i, θ)
Restricted set of outputs
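To make the inference problem concrete, here is a hedged Python sketch (the local score is invented for illustration) of the brute-force argmax with a decomposable score function; the loop over Y = labels^T is exactly what becomes intractable as T grows:

```python
from itertools import product

def local_score(x_i, y_i):
    """Hypothetical local score: reward a label equal to the input hint."""
    return 1.0 if y_i == x_i else 0.0

def exhaustive_argmax(x, labels):
    """Brute-force argmax over Y = labels^len(x): exponential in len(x)."""
    best, best_score = None, float("-inf")
    for y in product(labels, repeat=len(x)):
        s = sum(local_score(xi, yi) for xi, yi in zip(x, y))  # decomposable F
        if s > best_score:
            best, best_score = list(y), s
    return best

print(exhaustive_argmax(["R", "B", "R"], ["R", "B"]))  # ['R', 'B', 'R']
```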
Structured algorithms differ by: feature encoding, hypothesis on the output structure, hypothesis on the cost function, type of structure prediction problem
Three main families of SP models: generative models, discriminant models, search models
Generative models
30 years of generative models: Hidden Markov Models and their many extensions (hierarchical HMMs, factorial HMMs, etc.), probabilistic context-free grammars, tree models, graph models
Usual hypothesis
Features: “natural” encoding of the input
Local dependency hypothesis: on the outputs (Markov), and on the inputs
Cost function: usually the joint likelihood; decomposes, e.g. as a sum of local costs on each subpart
Inference and learning usually use dynamic programming
HMMs: decoding complexity O(n|Q|^2) for a sequence of length n and |Q| states
PCFGs: decoding complexity O(m^3 n^3), where n = length of the sentence and m = # of non-terminals in the grammar; prohibitive for many problems
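For concreteness, a minimal Viterbi decoding sketch in Python (toy HMM with invented parameters); the two nested loops over the states give the O(n|Q|^2) complexity quoted above:

```python
import math

def viterbi(obs, states, start, trans, emit):
    # work in log space to avoid underflow
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:  # |Q| current states ...
            prev = max(states,  # ... times |Q| predecessors
                       key=lambda p: V[t - 1][p] + math.log(trans[p][s]))
            V[t][s] = V[t - 1][prev] + math.log(trans[prev][s]) \
                      + math.log(emit[s][obs[t]])
            back[t - 1][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):  # follow back-pointers
        path.append(bp[path[-1]])
    return path[::-1]

states = ["R", "B"]
start = {"R": 0.5, "B": 0.5}
trans = {"R": {"R": 0.5, "B": 0.5}, "B": {"R": 0.5, "B": 0.5}}
emit = {"R": {0: 0.9, 1: 0.1}, "B": {0: 0.1, 1: 0.9}}
print(viterbi([0, 1, 0], states, start, trans, emit))  # ['R', 'B', 'R']
```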
Summary: generative SP models
Well-known approaches; a large number of existing models
Local hypothesis; often implies using dynamic programming for inference and learning
Discriminant models
Structured Perceptron (Collins 2002): a breakthrough
Large-margin methods (Tsochantaridis et al. 2004, Taskar 2004)
Many others now
Considered as an extension of multi-class classification
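As a hedged illustration of the structured perceptron idea (not Collins' exact setup: toy data, a brute-force argmax instead of Viterbi, and invented feature templates):

```python
from itertools import product

def features(x, y):
    """Joint feature counts Phi(x, y): emission and transition indicators."""
    f = {}
    for i, (xi, yi) in enumerate(zip(x, y)):
        f[("emit", xi, yi)] = f.get(("emit", xi, yi), 0) + 1
        if i > 0:
            f[("trans", y[i - 1], yi)] = f.get(("trans", y[i - 1], yi), 0) + 1
    return f

def score(w, x, y):
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def predict(w, x, labels):
    # brute-force argmax; real implementations use Viterbi here
    return max(product(labels, repeat=len(x)), key=lambda y: score(w, x, y))

def train(data, labels, epochs=5):
    w = {}
    for _ in range(epochs):
        for x, y in data:
            y_hat = list(predict(w, x, labels))
            if y_hat != list(y):  # additive update on mistakes only
                for k, v in features(x, y).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v
    return w

data = [(["a", "b"], ["R", "B"]), (["b", "a"], ["B", "R"])]
w = train(data, ["R", "B"])
print(list(predict(w, ["a", "b"], ["R", "B"])))  # ['R', 'B']
```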
Usual hypothesis
Joint representation of input and output: Φ(x, y)
Encodes potential dependencies among and between inputs and outputs, e.g. histogram of state transitions observed in the training set, frequency of (x_i, y_j), POS tags, etc.
Large feature sets (10^2 to 10^4)
Linear score function: F(x, y, θ) = ⟨θ, Φ(x, y)⟩
Decomposability of the feature set (over the outputs) and of the loss function
Extension of large-margin methods
Two problems:
The classical 0/1 loss is meaningless with structured outputs, and structured prediction is not a simple extension of multiclass classification; the max-margin principle must be generalized to loss functions other than the 0/1 loss
The number of constraints of the QP problem is proportional to |Y|, i.e. potentially exponential
Different solutions: consider only a polynomial number of “active” constraints, with bounds on the optimality of the solution
Learning requires solving an argmax problem at each iteration, i.e. dynamic programming
Summary: discriminant approaches
Hypothesis: local dependencies for the output, decomposability of the loss function; long-term dependencies in the input are allowed
Nice convergence properties + bounds
Complexity: learning often does not scale
Reinforcement learning for structured outputs
Precursors
Incremental parsing (Collins 2004)
SEARN (Daumé et al. 2006)
Reinforcement learning (Maes et al. 2007)
General idea
The structured output will be built incrementally: ŷ = (ŷ1, ŷ2, …, ŷT)
Different from directly solving the prediction problem y* = f(x) = argmax_{y ∈ Y} F(x, y, θ)
It will be built via machine learning
Loss: C = E_X[Δ(y, ŷ)]
Goal
Construct ŷ incrementally so as to minimize the loss
Training: learn how to search the solution space
Inference: build ŷ as a sequence of successive decisions
Example: sequence labelling
2 labels, R and B
Search space: (input sequence, {sequences of labels})
For a size-3 sequence x = x1 x2 x3:
A node represents a state in the search space
Example: actions and decisions
Actions extend the current partial labeling; Π(s, a) is the decision function that scores taking action a in state s.
(Figure: search tree for a sequence of size 3 with its target labeling; each branch is scored by Π(s, a).)
Example: expected loss
(Figure: the size-3 search tree with its target labeling; each transition carries a local cost C ∈ {0, 1} and each leaf a total cost CT ∈ {0, …, 3}.)
The loss does not always separate!
Inference
Suppose we have a policy function F which decides at each step which action to take.
Inference is performed by computing ŷ1 = F(x), ŷ2 = F(x, ŷ1), …, ŷT = F(x, ŷ1, …, ŷT−1), giving ŷ = (ŷ1, …, ŷT).
No dynamic programming is needed.
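The inference loop above can be sketched in a few lines of Python; the policy here is a hypothetical stub standing in for a learned F:

```python
def greedy_inference(x, policy):
    """Build yhat one decision at a time with a policy F(x, partial)."""
    yhat = []
    for _ in range(len(x)):           # one decision per element
        yhat.append(policy(x, yhat))  # F looks at input + partial output
    return yhat

# Hypothetical toy policy: label "R" for token "a", else "B".
policy = lambda x, partial: "R" if x[len(partial)] == "a" else "B"
print(greedy_inference(["a", "b", "a"], policy))  # ['R', 'B', 'R']
```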
Example: state space exploration guided by local costs
(Figure: the same size-3 search tree with its target labeling; exploration follows the local costs C toward leaves with low total cost CT.)
Goal: generalize to unseen situations
Training
Learn a policy F from a set of (input, output) pairs which takes the optimal action at each state and which is able to generalize to unknown situations.
Reinforcement learning for SP
Formalizes the search-based ideas, using Markov Decision Processes as the model and reinforcement learning for training
Provides a general framework for this approach; many RL algorithms could be used for training
Differences with the classical use of RL:
Specific hypothesis: deterministic environment
Problem size: states with up to 10^6 variables
Generalization properties
Markov Decision Process
An MDP is a tuple (S, A, P, R):
S is the state space
A is the action space
P is a transition function describing the dynamics of the environment: P(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
R is a reward function: R(s, a, s') = E[r_t | s_{t+1} = s', s_t = s, a_t = a]
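Under the deterministic-environment hypothesis used here, the MDP for incremental sequence labeling can be sketched as follows (all names and the toy reward are invented for illustration):

```python
from typing import NamedTuple, Tuple

class State(NamedTuple):
    x: Tuple[str, ...]     # input sequence
    yhat: Tuple[str, ...]  # partial output built so far

def transition(s, a):
    """Deterministic P: appending label a yields exactly one next state."""
    return State(s.x, s.yhat + (a,))

def reward(s, a, y_true):
    """Toy reward: 1 if the appended label matches the target position."""
    return 1.0 if y_true[len(s.yhat)] == a else 0.0

s0 = State(("a", "b"), ())          # initial state: input + empty output
s1 = transition(s0, "R")
print(s1.yhat, reward(s0, "R", ("R", "B")))  # ('R',) 1.0
```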
Example: Prediction problem
Example: state space
Deterministic MDP
States: input + partial output
Actions: elementary modifications of the partial output
Rewards: quality of the prediction
Example: partially observable reward
The reward function is only partially observable: we only have a limited set of training (input, output) pairs.
Learning the MDP
The reward function cannot be computed on the whole MDP, hence approximated RL:
Learn the policy on a subspace; generalize to the whole space
The policy is learned as a linear function: π(s) = argmax_{a ∈ Actions(s)} ⟨θ, Φ(s, a)⟩
Φ(s, a) is a joint description of (s, a); θ is a parameter vector
Learning algorithms: Sarsa, Q-learning, etc.
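A hedged sketch of Sarsa with a linear action-value function Q(s, a) = ⟨θ, Φ(s, a)⟩ on a toy deterministic labeling MDP (the features, rewards, and hyperparameters are all invented for illustration; this is not the paper's exact setup):

```python
import random

def phi(state, action):
    """Joint description Phi(s, a): one indicator on (current token, action)."""
    x, yhat = state
    return {("emit", x[len(yhat)], action): 1.0}

def q(theta, state, action):
    return sum(theta.get(k, 0.0) * v for k, v in phi(state, action).items())

def greedy(theta, state, actions):
    return max(actions, key=lambda a: q(theta, state, a))

def sarsa(data, actions, episodes=200, alpha=0.1, eps=0.1, seed=0):
    theta, rng = {}, random.Random(seed)
    pick = lambda s: (rng.choice(actions) if rng.random() < eps
                      else greedy(theta, s, actions))  # eps-greedy exploration
    for _ in range(episodes):
        x, y = data[rng.randrange(len(data))]
        state, action = (x, ()), None
        action = pick(state)
        for t in range(len(x)):
            r = 1.0 if action == y[t] else 0.0       # toy reward
            nstate = (x, state[1] + (action,))
            if t + 1 < len(x):
                naction = pick(nstate)
                target = r + q(theta, nstate, naction)  # Sarsa target (gamma = 1)
            else:
                naction, target = None, r
            td = target - q(theta, state, action)
            for k, v in phi(state, action).items():  # gradient step on theta
                theta[k] = theta.get(k, 0.0) + alpha * td * v
            state, action = nstate, naction
    return theta

data = [(("a", "b"), ("R", "B")), (("b", "a"), ("B", "R"))]
theta = sarsa(data, ["R", "B"])
state, out = (("a", "b"), ()), []
for _ in range(2):                                   # greedy inference
    a = greedy(theta, state, ["R", "B"])
    out.append(a)
    state = (state[0], state[1] + (a,))
print(out)
```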
Inference
The state at step t contains the initial input and the partial output.
Once the MDP is learned: from the state at step t, compute the state at step t+1, until a solution is reached.
Greedy approach: only the best action is chosen at each step.
Structured outputs and MDP
State s_t: input x + partial output ŷ_t; initial state: (x, ∅)
Actions are task dependent: POS — a new tag for the current word; tree mapping — insert a new path in the partial tree
Inference: greedy approach, only the best action is chosen at each step
Reward — final: R = Δ(y, ŷ); heuristic: r_t = Δ(y, ŷ_t)
Example: sequence labeling
Left-right model — actions: label
Order-free model — actions: label + position
Loss: Hamming cost or F-score
Tasks:
Named entity recognition (shared task at CoNLL 2002; 8,000 train, 1,500 test)
Chunking, noun phrases (CoNLL 2002)
Handwritten word recognition (5,000 train, 1,000 test)
Complexity of inference: O(sequence size × number of labels)
Tree mapping: document mapping
Problem: the labeled tree mapping problem
Different instances: flat text to XML, HTML to XML, XML to XML, …
<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>
<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l’orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>
Document mapping problem
Central issue: complexity
Large collections; large feature spaces (10^3 to 10^6); large (exponential) search space
Other SP methods fail
XML structuring
Action: attach the path ending with the current leaf to a position in the current partial tree
Φ(s, a) encodes a series of potential (state, action) pairs
Loss: F-score for trees
Example
INPUT DOCUMENT (HTML):
HTML
  HEAD
    TITLE: Example
  BODY
    IT: Francis MAES
    H1: Title of the section
    P: Welcome to INEX
    FONT: This is a footnote
TARGET DOCUMENT (XML):
DOCUMENT
  TITLE: Example
  AUTHOR: Francis MAES
  SECTION
    TITLE: Title of the section
    TEXT: Welcome to INEX
Results
Conclusion
Learn to explore the state space of the problem
An alternative to dynamic programming or classical search algorithms
Can be used with any decomposable cost function
Fewer assumptions than other methods (e.g. no Markov hypothesis)
Leads to solutions that may scale to complex and large SP problems
Paper at the European Workshop on Reinforcement Learning 2008
Preliminary paper at ECML 2007
Library available at http://nieme.lip6.fr