Learning structured outputs: a reinforcement learning approach
P. Gallinari, F. Maes, L. Denoyer
[email protected]
www-connex.lip6.fr
University Pierre et Marie Curie – Paris – France
Computer Science Lab.
2008-07-02 ANNPR 2008 - P. Gallinari 2
Outline
Motivation and examples
Approaches for structured learning: generative models, discriminant models
Reinforcement learning for structured prediction
Experiments
Machine learning and structured data
Different types of problems: model, classify, or cluster structured data; predict structured outputs; learn to associate structured representations
Structured data and applications in many domains: chemistry, biology, natural language, the web, social networks, databases, etc.
Sequence labeling: POS
This/DT Workshop/NN brings/VBZ together/RB scientists/NNS and/CC engineers/NNS interested/VBN in/IN recent/JJ developments/NNS in/IN exploiting/VBG Massive/JJ data/NP sets/NP
(DT = determiner, NN = noun, VBZ = verb 3rd person singular, RB = adverb, NNS = plural noun, CC = coordinating conjunction, VBN = past participle, IN = preposition, JJ = adjective, VBG = verb gerund)
Segmentation + labeling: syntactic chunking (Washington Univ. tagger)
[NP This Workshop] [VP brings] [ADVP together] [NP scientists and engineers] [VP interested] [IN in] [NP recent developments] [PNP in] [NP exploiting Massive data sets]
(NP = noun phrase, VP = verb phrase, ADVP = adverbial phrase, PNP = prepositional phrase)
Syntactic parsing (Stanford Parser)
Tree mapping: problem
Query heterogeneous XML collections; Web information extraction
Need to know the correspondence between the structured representations, which is usually established by hand
Learn the correspondence between the different sources: the labeled tree mapping problem
<Restaurant>
  <Name>La cantine</Name>
  <Address>65 rue des pyrénées, Paris, 19ème, FRANCE</Address>
  <Specialities>Canard à l’orange, Lapin au miel</Specialities>
</Restaurant>
<Restaurant>
  <Name>La cantine</Name>
  <Address>
    <City>Paris</City>
    <Street>pyrénées</Street>
    <Num>65</Num>
  </Address>
  <Dishes>Canard à l’orange</Dishes>
  <Dishes>Lapin au miel</Dishes>
</Restaurant>
Others: taxonomies, social networks, adversarial computing (Webspam, Blogspam, …), translation, biology, …
Is structure really useful? Can we make use of structure?
Yes: evidence from many domains and applications; mandatory for many problems (e.g. a 10,000-class classification problem)
Yes, but: complex or long-term dependencies often correspond to rare events; practical evidence on large-size problems shows that simple models sometimes offer competitive results (information retrieval, speech recognition, etc.)
Why is structured prediction difficult?
Size of the output space: the number of potential labelings for a sequence, of potential joint labelings and segmentations, of potential labeled trees for an input sequence, … The output space is exponential.
Inference often amounts to a combinatorial search problem
Scaling issues in most models
Structured learning
X, Y: input and output spaces
Structured output: y ∈ Y decomposes into parts of variable size, y = (y1, y2, …, yT)
Dependencies: relations between the parts of y; local, long-term, or global
Cost function Δ(y*, ŷ), e.g.:
0/1 loss: Δ(y*, ŷ) = 1[y* ≠ ŷ]
Hamming loss: Δ(y*, ŷ) = Σ_{i=1..T} 1[y*_i ≠ ŷ_i]
F-score, BLEU, etc.
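As a minimal Python sketch (toy labels invented for illustration), the 0/1 and Hamming losses above can be written as:

```python
def zero_one_loss(y_true, y_pred):
    """0/1 loss: 1 if the predicted sequence differs anywhere, else 0."""
    return int(list(y_true) != list(y_pred))

def hamming_loss(y_true, y_pred):
    """Hamming loss: number of positions with a wrong label."""
    return sum(int(t != p) for t, p in zip(y_true, y_pred))

print(zero_one_loss(["R", "B", "R"], ["R", "B", "R"]))  # 0
print(hamming_loss(["R", "B", "R"], ["R", "R", "B"]))   # 2
```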
General approach
Predictive approach: y* = f(x) = argmax_{y ∈ Y} F(x, y, θ), where F : X × Y → R is a score function used to rank potential outputs
F is trained to optimize some loss function
Inference problem: |Y| is sometimes exponential, so the argmax is often intractable
Hypothesis: decomposability of the score function over the parts of y, y* = f(x) = argmax_{y ∈ Y} Σ_i F(x, y_i, θ)
Restricted set of outputs
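To make the inference problem concrete, here is a hedged Python sketch (the local score is invented for illustration) of the brute-force argmax with a decomposable score function; the loop over Y = labels^T is exactly what becomes intractable as T grows:

```python
from itertools import product

def local_score(x_i, y_i):
    """Hypothetical local score: reward a label equal to the input hint."""
    return 1.0 if y_i == x_i else 0.0

def exhaustive_argmax(x, labels):
    """Brute-force argmax over Y = labels^len(x): exponential in len(x)."""
    best, best_score = None, float("-inf")
    for y in product(labels, repeat=len(x)):
        s = sum(local_score(xi, yi) for xi, yi in zip(x, y))  # decomposable F
        if s > best_score:
            best, best_score = list(y), s
    return best

print(exhaustive_argmax(["R", "B", "R"], ["R", "B"]))  # ['R', 'B', 'R']
```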
Structured algorithms differ by: feature encoding, hypothesis on the output structure, hypothesis on the cost function, type of structure prediction problem
Three main families of SP models: generative models, discriminant models, search models
Generative models
30 years of generative models: Hidden Markov Models and their many extensions (hierarchical HMMs, factorial HMMs, etc.), probabilistic context-free grammars, tree models, graph models
Usual hypothesis
Features: “natural” encoding of the input
Local dependency hypothesis: on the outputs (Markov), and on the inputs
Cost function: usually the joint likelihood; decomposes, e.g. as a sum of local costs on each subpart
Inference and learning usually use dynamic programming
HMMs: decoding complexity O(n|Q|^2) for a sequence of length n and |Q| states
PCFGs: decoding complexity O(m^3 n^3), where n = length of the sentence and m = # of non-terminals in the grammar; prohibitive for many problems
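For concreteness, a minimal Viterbi decoding sketch in Python (toy HMM with invented parameters); the two nested loops over the states give the O(n|Q|^2) complexity quoted above:

```python
import math

def viterbi(obs, states, start, trans, emit):
    # work in log space to avoid underflow
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:  # |Q| current states ...
            prev = max(states,  # ... times |Q| predecessors
                       key=lambda p: V[t - 1][p] + math.log(trans[p][s]))
            V[t][s] = V[t - 1][prev] + math.log(trans[prev][s]) \
                      + math.log(emit[s][obs[t]])
            back[t - 1][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):  # follow back-pointers
        path.append(bp[path[-1]])
    return path[::-1]

states = ["R", "B"]
start = {"R": 0.5, "B": 0.5}
trans = {"R": {"R": 0.5, "B": 0.5}, "B": {"R": 0.5, "B": 0.5}}
emit = {"R": {0: 0.9, 1: 0.1}, "B": {0: 0.1, 1: 0.9}}
print(viterbi([0, 1, 0], states, start, trans, emit))  # ['R', 'B', 'R']
```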
Summary: generative SP models
Well-known approaches; a large number of existing models
Local hypothesis; often implies using dynamic programming for inference and learning
Discriminant models
Structured Perceptron (Collins 2002): a breakthrough
Large-margin methods (Tsochantaridis et al. 2004, Taskar 2004)
Many others now
Considered as an extension of multi-class classification
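As a hedged illustration of the structured perceptron idea (not Collins' exact setup: toy data, a brute-force argmax instead of Viterbi, and invented feature templates):

```python
from itertools import product

def features(x, y):
    """Joint feature counts Phi(x, y): emission and transition indicators."""
    f = {}
    for i, (xi, yi) in enumerate(zip(x, y)):
        f[("emit", xi, yi)] = f.get(("emit", xi, yi), 0) + 1
        if i > 0:
            f[("trans", y[i - 1], yi)] = f.get(("trans", y[i - 1], yi), 0) + 1
    return f

def score(w, x, y):
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def predict(w, x, labels):
    # brute-force argmax; real implementations use Viterbi here
    return max(product(labels, repeat=len(x)), key=lambda y: score(w, x, y))

def train(data, labels, epochs=5):
    w = {}
    for _ in range(epochs):
        for x, y in data:
            y_hat = list(predict(w, x, labels))
            if y_hat != list(y):  # additive update on mistakes only
                for k, v in features(x, y).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v
    return w

data = [(["a", "b"], ["R", "B"]), (["b", "a"], ["B", "R"])]
w = train(data, ["R", "B"])
print(list(predict(w, ["a", "b"], ["R", "B"])))  # ['R', 'B']
```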
Usual hypothesis
Joint representation of input and output: Φ(x, y)
Encodes potential dependencies among and between inputs and outputs, e.g. histogram of state transitions observed in the training set, frequency of (x_i, y_j), POS tags, etc.
Large feature sets (10^2 to 10^4)
Linear score function: F(x, y, θ) = ⟨θ, Φ(x, y)⟩
Decomposability of the feature set (over the outputs) and of the loss function
Extension of large-margin methods
Two problems:
The classical 0/1 loss is meaningless with structured outputs, and structured prediction is not a simple extension of multiclass classification; the max-margin principle must be generalized to loss functions other than the 0/1 loss
The number of constraints of the QP problem is proportional to |Y|, i.e. potentially exponential
Different solutions: consider only a polynomial number of “active” constraints, with bounds on the optimality of the solution
Learning requires solving an argmax problem at each iteration, i.e. dynamic programming
Summary: discriminant approaches
Hypothesis: local dependencies for the output, decomposability of the loss function; long-term dependencies in the input are allowed
Nice convergence properties + bounds
Complexity: learning often does not scale
Reinforcement learning for structured outputs
Precursors
Incremental parsing (Collins 2004)
SEARN (Daumé et al. 2006)
Reinforcement learning (Maes et al. 2007)
General idea
The structured output will be built incrementally: ŷ = (ŷ1, ŷ2, …, ŷT)
Different from directly solving the prediction problem y* = f(x) = argmax_{y ∈ Y} F(x, y, θ)
It will be built via machine learning
Loss: C = E_X[Δ(y, ŷ)]
Goal
Construct ŷ incrementally so as to minimize the loss
Training: learn how to search the solution space
Inference: build ŷ as a sequence of successive decisions
Example: sequence labelling
2 labels, R and B
Search space: (input sequence, {sequences of labels})
For a size-3 sequence x = x1 x2 x3:
A node represents a state in the search space
Example: actions and decisions
Actions extend the current partial labeling; Π(s, a) is the decision function that scores taking action a in state s.
(Figure: search tree for a sequence of size 3 with its target labeling; each branch is scored by Π(s, a).)
Example: expected loss
(Figure: the size-3 search tree with its target labeling; each transition carries a local cost C ∈ {0, 1} and each leaf a total cost CT ∈ {0, …, 3}.)
The loss does not always separate!
Inference
Suppose we have a policy function F which decides at each step which action to take.
Inference is performed by computing ŷ1 = F(x), ŷ2 = F(x, ŷ1), …, ŷT = F(x, ŷ1, …, ŷT−1), giving ŷ = (ŷ1, …, ŷT).
No dynamic programming is needed.
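The inference loop above can be sketched in a few lines of Python; the policy here is a hypothetical stub standing in for a learned F:

```python
def greedy_inference(x, policy):
    """Build yhat one decision at a time with a policy F(x, partial)."""
    yhat = []
    for _ in range(len(x)):           # one decision per element
        yhat.append(policy(x, yhat))  # F looks at input + partial output
    return yhat

# Hypothetical toy policy: label "R" for token "a", else "B".
policy = lambda x, partial: "R" if x[len(partial)] == "a" else "B"
print(greedy_inference(["a", "b", "a"], policy))  # ['R', 'B', 'R']
```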
Example: state space exploration guided by local costs
(Figure: the same size-3 search tree with its target labeling; exploration follows the local costs C toward leaves with low total cost CT.)
Goal: generalize to unseen situations
Training
Learn a policy F from a set of (input, output) pairs which takes the optimal action at each state and which is able to generalize to unknown situations.
Reinforcement learning for SP
Formalizes the search-based ideas, using Markov Decision Processes as the model and reinforcement learning for training
Provides a general framework for this approach; many RL algorithms could be used for training
Differences with the classical use of RL:
Specific hypothesis: deterministic environment
Problem size: states with up to 10^6 variables
Generalization properties
Markov Decision Process
An MDP is a tuple (S, A, P, R):
S is the state space
A is the action space
P is a transition function describing the dynamics of the environment: P(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
R is a reward function: R(s, a, s') = E[r_t | s_{t+1} = s', s_t = s, a_t = a]
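Under the deterministic-environment hypothesis used here, the MDP for incremental sequence labeling can be sketched as follows (all names and the toy reward are invented for illustration):

```python
from typing import NamedTuple, Tuple

class State(NamedTuple):
    x: Tuple[str, ...]     # input sequence
    yhat: Tuple[str, ...]  # partial output built so far

def transition(s, a):
    """Deterministic P: appending label a yields exactly one next state."""
    return State(s.x, s.yhat + (a,))

def reward(s, a, y_true):
    """Toy reward: 1 if the appended label matches the target position."""
    return 1.0 if y_true[len(s.yhat)] == a else 0.0

s0 = State(("a", "b"), ())          # initial state: input + empty output
s1 = transition(s0, "R")
print(s1.yhat, reward(s0, "R", ("R", "B")))  # ('R',) 1.0
```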
Example: Prediction problem
Example: state space
Deterministic MDP
States: input + partial output
Actions: elementary modifications of the partial output
Rewards: quality of the prediction
Example: partially observable reward
The reward function is only partially observable: we only have a limited set of training (input, output) pairs.
Learning the MDP
The reward function cannot be computed on the whole MDP, hence approximated RL:
Learn the policy on a subspace; generalize to the whole space
The policy is learned as a linear function: π(s) = argmax_{a ∈ Actions(s)} ⟨θ, Φ(s, a)⟩
Φ(s, a) is a joint description of (s, a); θ is a parameter vector
Learning algorithms: Sarsa, Q-learning, etc.
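A hedged sketch of Sarsa with a linear action-value function Q(s, a) = ⟨θ, Φ(s, a)⟩ on a toy deterministic labeling MDP (the features, rewards, and hyperparameters are all invented for illustration; this is not the paper's exact setup):

```python
import random

def phi(state, action):
    """Joint description Phi(s, a): one indicator on (current token, action)."""
    x, yhat = state
    return {("emit", x[len(yhat)], action): 1.0}

def q(theta, state, action):
    return sum(theta.get(k, 0.0) * v for k, v in phi(state, action).items())

def greedy(theta, state, actions):
    return max(actions, key=lambda a: q(theta, state, a))

def sarsa(data, actions, episodes=200, alpha=0.1, eps=0.1, seed=0):
    theta, rng = {}, random.Random(seed)
    pick = lambda s: (rng.choice(actions) if rng.random() < eps
                      else greedy(theta, s, actions))  # eps-greedy exploration
    for _ in range(episodes):
        x, y = data[rng.randrange(len(data))]
        state, action = (x, ()), None
        action = pick(state)
        for t in range(len(x)):
            r = 1.0 if action == y[t] else 0.0       # toy reward
            nstate = (x, state[1] + (action,))
            if t + 1 < len(x):
                naction = pick(nstate)
                target = r + q(theta, nstate, naction)  # Sarsa target (gamma = 1)
            else:
                naction, target = None, r
            td = target - q(theta, state, action)
            for k, v in phi(state, action).items():  # gradient step on theta
                theta[k] = theta.get(k, 0.0) + alpha * td * v
            state, action = nstate, naction
    return theta

data = [(("a", "b"), ("R", "B")), (("b", "a"), ("B", "R"))]
theta = sarsa(data, ["R", "B"])
state, out = (("a", "b"), ()), []
for _ in range(2):                                   # greedy inference
    a = greedy(theta, state, ["R", "B"])
    out.append(a)
    state = (state[0], state[1] + (a,))
print(out)
```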
Inference
The state at step t contains the initial input and the partial output.
Once the MDP is learned: from the state at step t, compute the state at step t+1, until a solution is reached.
Greedy approach: only the best action is chosen at each step.
Structured outputs and MDP
State s_t: input x + partial output ŷ_t; initial state: (x, ∅)
Actions are task dependent: POS — a new tag for the current word; tree mapping — insert a new path in the partial tree
Inference: greedy approach, only the best action is chosen at each step
Reward — final: R = Δ(y, ŷ); heuristic: r_t = Δ(y, ŷ_t)
Example: sequence labeling
Left-right model — actions: label
Order-free model — actions: label + position
Loss: Hamming cost or F-score
Tasks:
Named entity recognition (shared task at CoNLL 2002; 8,000 train, 1,500 test)
Chunking, noun phrases (CoNLL 2002)
Handwritten word recognition (5,000 train, 1,000 test)
Complexity of inference: O(sequence size × number of labels)
Tree mapping: document mapping
Problem: the labeled tree mapping problem
Different instances: flat text to XML, HTML to XML, XML to XML, …
<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>
<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l’orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>
Document mapping problem
Central issue: complexity
Large collections; large feature spaces (10^3 to 10^6); large (exponential) search space
Other SP methods fail
XML structuring
Action: attach the path ending with the current leaf to a position in the current partial tree
Φ(s, a) encodes a series of potential (state, action) pairs
Loss: F-score for trees
Example
INPUT DOCUMENT (HTML):
HTML
  HEAD
    TITLE: Example
  BODY
    IT: Francis MAES
    H1: Title of the section
    P: Welcome to INEX
    FONT: This is a footnote
TARGET DOCUMENT (XML):
DOCUMENT
  TITLE: Example
  AUTHOR: Francis MAES
  SECTION
    TITLE: Title of the section
    TEXT: Welcome to INEX
Results
Conclusion
Learn to explore the state space of the problem
An alternative to dynamic programming or classical search algorithms
Can be used with any decomposable cost function
Fewer assumptions than other methods (e.g. no Markov hypothesis)
Leads to solutions that may scale to complex and large SP problems
Paper at the European Workshop on Reinforcement Learning 2008
Preliminary paper at ECML 2007
Library available at http://nieme.lip6.fr