
Transcript of "Dual Decomposition Applications"

Dual Decomposition Applications

Presenters: Ben & Red

Sections 1, 2 and 3: Introduction to Dual Decomposition

Abstract

Dual decomposition, and more generally Lagrangian relaxation, is a classical optimization technique that has recently been applied to many decoding (inference) problems in natural language processing (NLP).

A central theme of this tutorial is that Lagrangian relaxation is naturally applied in conjunction with a broad class of combinatorial algorithms.

Introduction

In many problems in statistical natural language processing, the task is to map some input x (for example, a string) to some structured output y (for example, a parse tree). This mapping is often defined as

y* = argmax_{y ∈ γ} h(y)

where γ is a finite set of possible structures for the input x, and h: γ -> R is a function that assigns a score h(y) to each y in γ.

Introduction

The problem of finding y* is referred to as the decoding problem. The size of γ typically grows exponentially with respect to the size of the input x, making exhaustive search for y* intractable.

Introduction

Dual decomposition leverages the observation that many decoding problems can be decomposed into two or more sub-problems, together with linear constraints that enforce some notion of agreement between solutions to the different problems.

The sub-problems are chosen such that they can be solved efficiently using exact combinatorial algorithms.

Properties

Dual decomposition algorithms are typically simple and efficient.

They have well-understood formal properties, in particular through connections to linear programming relaxations.

In cases where the underlying LP relaxation is tight, they produce an exact solution to the original decoding problem, with a certificate of optimality.


Dual Decomposition

We have some finite set γ ⊆ R^d and some finite set Z ⊆ R^d'. Each vector y ∈ γ and z ∈ Z has an associated score

f(y) = y·θ(1),  g(z) = z·θ(2)

where θ(1) is a vector in R^d and θ(2) is a vector in R^d'.

The decoding problem is then to find

(y*, z*) = argmax_{y ∈ γ, z ∈ Z} y·θ(1) + z·θ(2)

such that Ay + Cz = b

where A ∈ R^{p×d}, C ∈ R^{p×d'}, b ∈ R^p.


Section 4: Optimization of the Finite State Tagger

x: an input sentence

y: a parse tree

γ: the set of all parse trees for x

h(y): the score for any parse tree y ∈ γ

We want to find: y* = argmax_{y ∈ γ} h(y)

First, find the score for y under a weighted context-free grammar.

Second, find the score for the part-of-speech (POS) sequence in y under a finite-state POS tagging model.

The combined score is

h(y) = f(y) + g(l(y))

f(y) is the score for y under a weighted context-free grammar (WCFG). A WCFG consists of a context-free grammar with a set of rules G, together with a scoring function θ: G -> R that assigns a real-valued score to each rule in G.

f(y)= θ(S->NP VP) + θ(NP->N) + θ(N->United)+ θ(VP->V NP)+…

How θ is defined is not very important here. For example, θ(a->b) = log p(a->b) in a probabilistic context-free grammar, or θ(a->b) = w·Φ(a->b) in a conditional random field.
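To make the WCFG scoring concrete, here is a minimal Python sketch (not the presenters' code); the tree encoding and the example rule scores are illustrative assumptions.

```python
# Minimal sketch of WCFG scoring: f(y) is the sum of theta over the rules used in y.
# The nested-tuple tree encoding and the example scores are illustrative assumptions.

theta = {
    ("S", ("NP", "VP")): 1.2,
    ("NP", ("N",)): 0.4,
    ("N", ("United",)): 2.1,
    ("VP", ("V", "NP")): 0.7,
    ("V", ("flies",)): 1.5,
    ("N", ("jets",)): 0.9,
}

def rules(tree):
    """Yield (parent, children) rules from a tree encoded as (label, children-or-word)."""
    label, children = tree
    if isinstance(children, str):                       # pre-terminal -> word
        yield (label, (children,))
    else:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from rules(child)

def f(tree, theta):
    """Score of a parse tree under the WCFG: sum of rule scores."""
    return sum(theta.get(rule, 0.0) for rule in rules(tree))

# Example: (S (NP (N United)) (VP (V flies) (NP (N jets))))
tree = ("S", [("NP", [("N", "United")]),
              ("VP", [("V", "flies"), ("NP", [("N", "jets")])])])
print(f(tree, theta))
```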

h(y)=f(y)+g(l(y))

l(y) is a function that maps a parse tree y to the sequence of part-of-speech tags in y.

h(y)=f(y)+g(l(y))

g(z) is the score for the part-of-speech tag sequence z under a k'th-order finite-state tagging model.

Under this model,

g(z) = Σ_{i=1}^{n} θ(i, z_{i-k}, …, z_{i-1}, z_i)

where θ(i, z_{i-k}, …, z_i) is the score for the sub-sequence of tags ending at position i in the sentence.
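As a small illustration (not from the slides), a first-order (bigram) instance of this tagging score can be computed as follows; the tag set, transition scores and start-symbol handling are assumptions made for the example.

```python
# Minimal sketch of the finite-state tagging score g(z) for a first-order (bigram)
# model, i.e. k = 1: each position i is scored on the pair (z[i-1], z[i]).
# The transition scores and the "*" start symbol are illustrative assumptions.

theta_tag = {
    ("*", "N"): 0.5, ("N", "V"): 1.1, ("V", "N"): 0.8,
    ("N", "N"): 0.2, ("V", "V"): -0.3,
}

def g(z, theta_tag, start="*"):
    """Score of a tag sequence z under a bigram tagging model."""
    prev = start
    total = 0.0
    for tag in z:
        total += theta_tag.get((prev, tag), 0.0)
        prev = tag
    return total

print(g(["N", "V", "N"], theta_tag))   # 0.5 + 1.1 + 0.8
```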

Motivation

The scoring function h(y) = f(y) + g(l(y)) combines information from both the parsing model and the tagging model.

The POS tagger captures information about adjacent POS tags that will be missing under f(y). This information may improve both parsing and tagging performance, in comparison to using f(y) alone.

The dual decomposition algorithm

Define T to be the set of all POS tags. Assume the input sentence has n words.

For each parse tree y, define

y(i, t) = 1 if parse tree y has tag t at position i, 0 otherwise.

Similarly, for each tag sequence z, define

z(i, t) = 1 if the tag sequence z has tag t at position i, 0 otherwise.

Find

argmax_{y ∈ γ, z ∈ Z} f(y) + g(z)

such that for all i ∈ {1…n}, for all t ∈ T, y(i, t) = z(i, t).

Assumption 1

Assume that we introduce variables u(i, t) ∈ R for i ∈ {1…n} and t ∈ T. We assume that for any value of these variables, we can find

argmax_{y ∈ γ} ( f(y) + Σ_{i,t} u(i, t)·y(i, t) )

efficiently.

Assumption 2

Similarly, we assume that for any value of these variables, we can find

argmax_{z ∈ Z} ( g(z) - Σ_{i,t} u(i, t)·z(i, t) )

efficiently.

The dual decomposition algorithm for integrated parsing and tagging

Initialization: Set u(0)(i, t) = 0 for all i ∈ {1…n}, t ∈ T.

For k = 1 to K:

y(k) = argmax_{y ∈ γ} ( f(y) + Σ_{i,t} u(k-1)(i, t)·y(i, t) )   [Parsing]

z(k) = argmax_{z ∈ Z} ( g(z) - Σ_{i,t} u(k-1)(i, t)·z(i, t) )   [Tagging]

If y(k)(i, t) = z(k)(i, t) for all i, t: Return (y(k), z(k))

Else: u(k)(i, t) = u(k-1)(i, t) - δk·( y(k)(i, t) - z(k)(i, t) )

Here δk for k = 1…K is the step size at the k'th iteration.
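A minimal Python sketch of this loop, assuming black-box solvers parse_argmax and tag_argmax for Assumptions 1 and 2 (both names, and the indicator-lookup interface, are hypothetical placeholders rather than anything from the tutorial):

```python
# Sketch of the dual decomposition loop for integrated parsing and tagging.
# parse_argmax(u) and tag_argmax(u) are assumed black-box solvers (Assumptions 1 and 2):
# each returns a structure that can be queried as y(i, t) / z(i, t) over positions
# 1..n and tags in `tags`, with the u-modified scores already applied.

def dual_decompose(parse_argmax, tag_argmax, n, tags, K=50, step=lambda k: 1.0 / k):
    u = {(i, t): 0.0 for i in range(1, n + 1) for t in tags}   # u(0)(i, t) = 0
    for k in range(1, K + 1):
        y = parse_argmax(u)    # argmax_y f(y) + sum u(i,t) y(i,t)   [Parsing]
        z = tag_argmax(u)      # argmax_z g(z) - sum u(i,t) z(i,t)   [Tagging]
        if all(y(i, t) == z(i, t) for (i, t) in u):
            return y, z        # agreement: exact solution with a certificate
        delta = step(k)
        for (i, t) in u:       # subgradient step on the disagreements
            u[(i, t)] -= delta * (y(i, t) - z(i, t))
    return y, z                # no agreement after K iterations: approximate solution
```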

Relationship to the decoding problem

The set γ is the set of all parses for the input sentence. The set Z is the set of all POS sequences for the input sentence. Each parse tree y ∈ R^d is represented as a vector such that f(y) = y·θ(1) for some θ(1) ∈ R^d.

Each tag sequence z ∈ R^d' is represented as a vector such that g(z) = z·θ(2) for some θ(2) ∈ R^d'.

The constraints

y(i, t) = z(i, t) for all (i, t) can be encoded through linear constraints

Ay + Cz = b

for suitable choices of A, C and b, assuming that the vectors y and z include components y(i, t) and z(i, t) respectively. (For example, with one row per pair (i, t), A can pick out the component y(i, t), C can pick out -z(i, t), and b can be the zero vector, so each row enforces y(i, t) - z(i, t) = 0.)

Section 5: Formal Properties of the Finite State Tagger

Recall

The equation we are optimizing (from Section 4): argmax_{y ∈ γ, z ∈ Z} f(y) + g(z), such that y(i, t) = z(i, t) for all (i, t).

Lagrangian Relaxation and Optimization

We first define a Lagrangian for the optimization problem.

Next, from this we derive a dual objective.

Finally, we must minimize the dual objective.

(We do this with sub-gradient optimization)

Creation of the Dual Objective

Find the Lagrangian: introduce a Lagrange multiplier u(i, t) for each equality constraint y(i, t) = z(i, t).

From the equality constraints, we add the multiplier-weighted penalty terms to the objective to get the Lagrangian.

Next, we regroup based on y and z to get the dual objective (see the reconstruction below).
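The slide equations themselves are not in the transcript; a reconstruction of the standard Lagrangian and dual for this problem, consistent with Assumptions 1 and 2 and the algorithm in Section 4, is:

```latex
% Reconstruction of the untranscribed slide equations (standard dual decomposition
% formulation, consistent with Assumptions 1-2 and the algorithm in Section 4).
\begin{align*}
L(u, y, z) &= f(y) + g(z) + \sum_{i,t} u(i,t)\bigl(y(i,t) - z(i,t)\bigr) \\
L(u) &= \max_{y \in \gamma,\, z \in Z} L(u, y, z) \\
     &= \max_{y \in \gamma}\Bigl(f(y) + \sum_{i,t} u(i,t)\,y(i,t)\Bigr)
      + \max_{z \in Z}\Bigl(g(z) - \sum_{i,t} u(i,t)\,z(i,t)\Bigr)
\end{align*}
```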

Minimization of the Lagrangian (1)

We need to minimize the dual objective L(u) (the maximized Lagrangian). Why? Because for any value of u:

L(u) >= f(y*) + g(z*)

The proof is elementary given the equality constraint: if y(i, t) = z(i, t) for all (i, t), the penalty term vanishes and we have L(u, y, z) = f(y) + g(z). Since (y*, z*) satisfies the constraint, and L(u) maximizes L(u, y, z) over all (y, z), we get L(u) >= L(u, y*, z*) = f(y*) + g(z*).

Therefore, our goal is to minimize L(u): each L(u) is an upper bound on the optimal score f(y*) + g(z*), and minimizing over u gives the tightest such bound.

Minimizing the Lagrangian (2)

Recall the Algorithm introduced in Section 4:

(The algorithm for dual decomposition.) From this we see the algorithm converges to the tightest possible upper bound given by the dual. From this we derive the theorem:

Given step sizes δk > 0 with δk -> 0 and Σk δk = ∞,

we end up with (as we reach infinite iterations)

lim_{k -> ∞} L(u(k)) = min_u L(u).

Given infinite iterations, the dual is thus guaranteed to d-converge.

This is actually a sub-gradient minimization method, which we will return to shortly.

Minimizing the Lagrangian (3)

From the previous slide, we can derive these two functions:

y_u = argmax_{y ∈ γ} ( f(y) + Σ_{i,t} u(i, t)·y(i, t) )

z_u = argmax_{z ∈ Z} ( g(z) - Σ_{i,t} u(i, t)·z(i, t) )

So, if there exists some u such that

y_u(i, t) = z_u(i, t) for all (i, t),

then

f(y_u) + g(z_u) = f(y*) + g(z*)

(or, equivalently, (y_u, z_u) is an exact solution to the original problem).

The proof of this is also relatively simple.

From the definitions of y_u and z_u,

L(u) = f(y_u) + g(z_u) + Σ_{i,t} u(i, t)·( y_u(i, t) - z_u(i, t) ).

Because, once again, y_u(i, t) = z_u(i, t) for all (i, t), the penalty term vanishes and L(u) = f(y_u) + g(z_u).

We also know that L(u) >= f(y*) + g(z*) for all u, which implies f(y_u) + g(z_u) >= f(y*) + g(z*).

However, because y* and z* are optimal (and (y_u, z_u) satisfies the constraints), we also see f(y*) + g(z*) >= f(y_u) + g(z_u).

Therefore we have proven f(y_u) + g(z_u) = f(y*) + g(z*), as summarized below.
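A compact way to write the argument (again a reconstruction, since the slide equations are not in the transcript):

```latex
% Optimality certificate: the chain of inequalities collapses to an equality.
\begin{align*}
f(y^*) + g(z^*) \;\le\; L(u) \;=\; f(y_u) + g(z_u) \;\le\; f(y^*) + g(z^*)
\;\;\Longrightarrow\;\; f(y_u) + g(z_u) = f(y^*) + g(z^*)
\end{align*}
```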

The algorithm is said to e-converge under these circumstances, because we have found an iterate that exactly solves the original optimization problem, bounding it from above and below.

D-convergence vs. E-convergence

D-convergence occurs in the first case. D-convergence is guaranteed by dual decomposition, provided infinite iterations (and suitable step sizes).

Short for 'dual convergence'.

It refers to the convergence of the dual values to the minimum dual value:

( lim_{k -> ∞} L(u(k)) = min_u L(u) ).

E-convergence occurs in the latter case. E-convergence is not guaranteed by dual decomposition.

Short for 'exact convergence'.

It refers to the case where ALL agreement constraints are satisfied.

(Which implies that a provably optimal solution has been found.)

Sub-gradient Optimization

What are sub-gradient algorithms? Algorithms used to optimize convex functions.

They generalize gradient descent: they take iterative steps in the direction of the negative (sub)gradient at the current point in order to find a minimum of the function.

[Figure: gradient descent on a two-dimensional function; the 'zig-zag' nature of the iterates is clearly visible. Credit: http://en.wikipedia.org/wiki/Gradient_descent ]

Sub-gradient Applications

Recall the dual objective L(u) (and the goal to minimize it).

We notice two things about L(u): It is a convex function.

Formally defined: for any u, u' and λ ∈ [0, 1], L(λu + (1 - λ)u') <= λL(u) + (1 - λ)L(u').

It is non-differentiable (It is a piecewise linear function).

Because it can't be differentiated, we can't use gradient descent.

However, because it is a convex function, we can use a sub-gradient algorithm.

This is the property that allows us to minimize the dual efficiently.
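As a toy illustration (not from the slides), here is sub-gradient descent on a small piecewise-linear convex function f(u) = max_i (a_i·u + b_i); the coefficients and the 1/k step-size rule are made up for the example.

```python
# Toy subgradient descent on a piecewise-linear convex function
# f(u) = max_i (a_i * u + b_i). The coefficients and step sizes are illustrative.
lines = [(-2.0, 0.0), (0.5, 1.0), (3.0, -4.0)]   # (slope a_i, intercept b_i)

def f_and_subgrad(u):
    values = [a * u + b for (a, b) in lines]
    i = max(range(len(lines)), key=lambda j: values[j])
    return values[i], lines[i][0]     # value, and the slope of an active line is a subgradient

u = 5.0
for k in range(1, 51):
    value, sub = f_and_subgrad(u)
    u -= (1.0 / k) * sub              # step in the negative subgradient direction
print(u, f_and_subgrad(u)[0])         # u approaches the minimizer of the max
```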

Section 6: Additional Applications of Dual Decomposition

Applications of Dual Decomposition

Where are dual decomposition and Lagrangian relaxation useful?

Various NP-hard problems, such as (as the paper suggests): language translation; routing problems such as the Traveling Salesman Problem; syntactic parsing based on complex CFGs; the MAP problem.

Each of these techniques will be covered in the following segment.

Other examples of NP-hard problems approachable with dual decomposition:

Minimum degree / minimum leaf spanning tree, the Steiner tree problem, the Chinese postman problem, the vehicle routing problem, bipartite matching, and many others; a longer list can be found on Wikipedia*: http://en.wikipedia.org/wiki/List_of_NP-complete_problems

*Some of the problems on that list are not applicable (just be wary when researching, use your brain!).

Example 1: The MAP Problem

As this was covered by the previous group we will not discuss it in detail.

The goal was to find the optimum of the MAP objective (the maximum-scoring joint assignment).

We break the problem up into easier sub-problems (into sub-trees of the main graph).

We then solve these 'easier' problems using an iterative sub-gradient algorithm.

This is doable because the dual is convex. The sub-gradient update at any given iteration k is analogous to the one in Section 4.

Example 2: Traveling Salesman Problems

The Traveling Salesman Problem (TSP) is one of the most famous NP-Hard problems of all time:

The goal is to find the maximum-scoring route that visits every node exactly once.

More formally stated, it is the optimization problem

y* = argmax_{y ∈ Y} Σ_{e ∈ E} θe·ye

given that:

Y is the set of all possible tours.

E is the set of all edges.

θe is the weight of edge e.

ye is 1 iff e is in the tour, 0 otherwise.

Held & Karp made this an easy problem to solve (well, “easy”).

(TSP) Held & Karp's idea

The idea of a 1-tree: a 1-tree is a spanning tree over nodes {2, …, n} plus two edges incident to node 1, so it contains exactly one cycle.

Most importantly, notice how any possible route of the TSP is a 1-tree.

How we can use this information: alter the problem to maximize Σ_{e ∈ E} θe·ye over y ∈ Y', where Y' is the set of all possible 1-trees.

(Hence Y is a subset of Y', because all tours are 1-trees.)

The first step then becomes finding the maximum-scoring spanning tree (over nodes 2, …, n).

The second is to add the two highest-scoring edges that include vertex 1.

Lastly, we add constraints of the following form on the vertices:

Σ_{e incident to v} ye = 2 for every vertex v.

This constrains the solutions to 1-trees that have exactly 2 edges at each vertex.

(Such a 1-tree is equivalent to a route and, when the score is optimized, the best route.)

(TSP) Getting to the Solution

Next, we introduce the following Lagrangian:

L(u, y) = Σ_{e ∈ E} θe·ye + Σ_v u_v·( Σ_{e incident to v} ye - 2 )

(where the u_v are the Lagrange multipliers).

Note that the penalty term vanishes exactly when every vertex has 2 edges, so the score is unchanged for feasible tours.

We then define the dual objective as

L(u) = max_{y ∈ Y'} L(u, y)

and the sub-gradient with respect to u_v is ( Σ_{e incident to v} ye - 2 ), i.e., the degree of v in the current 1-tree minus 2.

So, if the degree constraints hold, the algorithm terminates and we have e-convergence (an exact tour); otherwise we re-weight the edges of the graph to include the Lagrange multipliers and iterate, which still guarantees d-convergence in the limit (details not shown). A sketch of one such iteration follows.
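A minimal Python sketch of one Held & Karp style iteration; best_one_tree is an assumed helper (maximum spanning tree over nodes 2…n plus the two best edges at node 1), and the data layout is illustrative rather than taken from the slides.

```python
# Sketch of one Held-Karp style iteration: score edges with Lagrange multipliers,
# decode the best 1-tree, then take a subgradient step on the degree constraints.
# best_one_tree(weights) is an assumed helper returning the edge set of the best 1-tree.

def held_karp_step(theta, u, nodes, best_one_tree, delta):
    # Re-weight each edge (v, w): add the multipliers of its two endpoints.
    reweighted = {(v, w): score + u[v] + u[w] for (v, w), score in theta.items()}
    one_tree = best_one_tree(reweighted)          # set of edges of the best 1-tree
    degree = {v: 0 for v in nodes}
    for (v, w) in one_tree:
        degree[v] += 1
        degree[w] += 1
    if all(degree[v] == 2 for v in nodes):        # constraints hold: a tour, exact solution
        return one_tree, u, True
    # Subgradient step on the violated degree-2 constraints.
    new_u = {v: u[v] - delta * (degree[v] - 2) for v in nodes}
    return one_tree, new_u, False
```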

Example 3: Phrase Based Translation

Phrase based translation is a scheme to translate sentences from one language to another based on the scoring of certain phrases within a sentence.

It takes in an input sentence in the source language, and pairs sequences of words in the source language with corresponding sequences of words in the target language.

For example, the tuple (1, 3, "we must also") means that words 1, 2 and 3 in the input could possibly be translated as "we must also".

Then, once we have a potential translation (all words mapped), we score it.

The scoring function for a sequence of such phrase translations, call it f(y), combines phrase-level translation scores with g(y), the log-probability of the output string under an n-gram language model.

(PBT) Constraining the Equation

First, we must find a constraint for the function: any correct translation must translate each word of the input exactly once in the output.

Defining this formally, we introduce y(i) to represent the number of times word i is translated in the output.

So, intuitively, the constraint becomes y(i) = 1 for all i ∈ {1, …, n}.

The goal then becomes to maximize the score f(y) over phrase sequences y ∈ P*, where P* is the set of all finite phrase sequences, subject to y(i) = 1 for all i.

(PBT) Relaxation and Solution

The relaxation for this algorithm works as follows: instead of solving the NP-hard problem over Y, we find a set Y' such that Y is a subset of Y'.

The relaxation is based on the weaker constraint Σ_i y(i) = n, i.e., Y' = { y ∈ P* : Σ_i y(i) = n }.

(In other words, the relaxation only checks that there are n translated words in total in the generated output.)

So, for example, a derivation that translates one word twice and skips another word entirely is a member of Y' but not of Y.

So, from this, our goal is to ensure that the constraint y(i) = 1 is met for all i.

Thus, we introduce a Lagrange multiplier u(i) for each constraint, giving us the Lagrangian

L(u, y) = f(y) + Σ_i u(i)·( y(i) - 1 ).

From this, we define u(i) = 0 initially and iterate: at each iteration k we decode y(k) = argmax_{y ∈ Y'} L(u(k-1), y), and then take the sub-gradient step u(k)(i) = u(k-1)(i) - δk·( y(k)(i) - 1 ).

We can see that when y(k)(i) = 1 for all i ∈ {1, …, n}, the constraints are satisfied and the sub-gradient iteration terminates with an optimal solution.

Then, as in the TSP problem, we must re-weight the scores with the Lagrange multipliers at each iteration (and this is what we need if we want e-convergence). A minimal sketch of the loop follows.
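A minimal Python sketch of this loop; decode_relaxed is an assumed helper that returns the best derivation in Y' under the u-reweighted scores, and the phrase representation is an illustrative assumption.

```python
# Sketch of the Lagrangian-relaxation loop for phrase-based translation.
# decode_relaxed(u) is an assumed helper that returns the best derivation in Y'
# under the u-reweighted scores, as a list of phrases (start, end, target_words).

def pbt_lagrangian(decode_relaxed, n, K=50, step=lambda k: 1.0 / k):
    u = [0.0] * (n + 1)                      # u(i) = 0 initially, words indexed 1..n
    for k in range(1, K + 1):
        derivation = decode_relaxed(u)       # argmax over the relaxed set Y'
        counts = [0] * (n + 1)               # y(i): how often each word is translated
        for (start, end, _target) in derivation:
            for i in range(start, end + 1):
                counts[i] += 1
        if all(counts[i] == 1 for i in range(1, n + 1)):
            return derivation                # every word translated exactly once: optimal
        delta = step(k)
        for i in range(1, n + 1):            # subgradient step on y(i) - 1
            u[i] -= delta * (counts[i] - 1)
    return derivation                        # fall back to the last (approximate) derivation
```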

Section 7: Practical Issues of Dual Decomposition

Practical Issues

Each iteration of the algorithm produces a number of useful terms:

The structures y(k) and z(k).

The current dual value L(u(k)).

When we have a function l: γ -> Z that maps each structure y ∈ γ to a structure l(y) ∈ Z, we also have:

A primal solution ( y(k), l(y(k)) ).

A primal value f(y(k)) + g(l(y(k))).

An Example run of the algorithm

Because L(u) provides an upper bound on f(y*)+g(z*) for any value of u, we have L(u(k))>=f(y(k))+g(l(y(k))) at every iteration

If we have e-convergence to an exact solution, then L(u(k)) = f(y(k)) + g(z(k)), with (y(k), z(k)) guaranteed to be optimal.

The dual values L(u(k)) are not monotonically decreasing

The primal value f(y(k)) + g(l(y(k))) fluctuates.

Useful quantities

L(u(k))-L(u(k-1)) is the change in the dual value from one iteration to the next.

L*_k = min_{k' <= k} L(u(k')) is the best dual value found so far.

p*_k = max_{k' <= k} f(y(k')) + g(l(y(k'))) is the best primal value found so far.

L*_k - p*_k is the gap between the best dual value and the best primal value found so far by the algorithm.

Choice of the step sizes δk

[Figure: the dual value L(u) versus the number of iterations k, for different fixed step sizes.]

With a small step size, convergence is smooth; with a large step size, the dual fluctuates erratically.

A common choice is

δk = c / (t + 1)

where c > 0 is a constant, and t is the number of iterations prior to k where the dual value increases.

Recovering Approximate Solutions

Early Stopping

Section 8: Linear Programming Relaxations (A Basic Introduction & Summary)

Recall

The equations from Sections 3, 4 and 5: the decoding problem, its Lagrangian, and the dual objective L(u).

Defining the LP Relaxation

We first define two sets, one for y and the other for z: Δ_γ and Δ_Z.

Both are simplexes: Δ_γ corresponds to the set of probability distributions over γ, and Δ_Z to the set of probability distributions over Z.

From this, we can define a new optimization over these distributions (in place of the discrete structures).

This optimization is now a Linear Program.

Next, we would define a relaxed set y' (and z' in a similar manner), such that y is a subset of y'. (Sounds familiar, right?)

So, finally, we get the optimization problem over the relaxed sets; a reconstruction of its standard form is sketched below.
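The slide formulas are not in the transcript; the following is a reconstruction of the standard LP relaxation for the parsing-plus-tagging problem of Section 4, where α, β range over the simplexes Δ_γ, Δ_Z introduced above.

```latex
% Reconstruction of the standard LP relaxation (not transcribed from the slides):
% optimize over distributions rather than discrete structures, coupling them only
% through their expected tag indicators.
\begin{align*}
\max_{\alpha \in \Delta_\gamma,\; \beta \in \Delta_Z}\;\;
  & \sum_{y \in \gamma} \alpha_y\, f(y) \;+\; \sum_{z \in Z} \beta_z\, g(z) \\
\text{such that}\;\;
  & \sum_{y \in \gamma} \alpha_y\, y(i,t) \;=\; \sum_{z \in Z} \beta_z\, z(i,t)
    \quad \text{for all } (i,t)
\end{align*}
```

The integral solutions, which put all their mass on a single y and a single z, recover the original problem, which is why the relaxed optimum upper-bounds the original one.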

Defining the LP Relaxation (2)

Finally, we end up with the Lagrangian (note that it is identical to the one in Section 5).

The dual objective is also essentially the same.

We notice that all of these linear programming principles are VERY close to the Lagrangian relaxation that we see in dual decomposition.

However, LP relaxations are often tightened to improve e-convergence of dual decomposition algorithms.

Because e-convergence is not guaranteed, tightening the linear program can often 'fix' the cases where dual decomposition only d-converges.

We can do this by constraining the Linear Program further.

Summary of LP Relaxation

The dual L(u) from Sections 4 and 5 is the same as the dual for the LP relaxation, M(u, a, b).

LP Relaxation is used as a tightening method. These methods improve convergence.

It does this by selectively adding more constraints.

Off-the-shelf LP solvers exist, and therefore LP problems can be solved automatically by software. This can be useful for debugging hard dual decomposition problems.

Section 9: Conclusion

Conclusion

Dual Decomposition... Is useful in solving various machine learning problems.

Is useful in solving various inference / statistical NLP problems.

Uses combinatorial algorithms combined with constraints.

Lagrangian Relaxation... Moves the problem's constraints into the objective using Lagrange multipliers.

Iterative (sub-gradient) methods are then used to solve the resulting relaxation. The optimal value is always bounded by the dual objective.

Is similar to linear programming relaxation.

(Most importantly) has the potential to produce the exact solution.

Therefore, dual decomposition and Lagrangian relaxation can produce optimal solutions more efficiently than many exact methods or other approximation methodologies.

Questions? Perhaps… answers?