Type-Driven Syntax and Semantics for
Composing Meaning Vectors
Stephen Clark
University of Cambridge Computer Laboratory
William Gates Building, 15 JJ Thomson Avenue
Cambridge, CB3 0FD, UK
A draft chapter for the OUP book on Compositional methods in Physics and Linguistics. This draft formatted on 27th February 2012.
Page: 1 job: sclark macro: handbook.cls date/time: 27-Feb-2012/12:43
1 Introduction
This chapter describes a recent development in Computational Linguistics,
in which distributional, vector-based models of meaning — which have been
successfully applied to word meanings — have been given a compositional
treatment allowing the creation of vectors for sentence meanings. More specif-
ically, this chapter presents the theoretical framework of Coecke et al. (2010),
which has been implemented in Grefenstette et al. (2011) and Grefenstette
& Sadrzadeh (2011), in a form designed to be accessible to computational
linguists not familiar with the mathematics of Category Theory on which the
framework is based.
The previous chapter has described how distributional approaches to lex-
ical semantics can be used to build vectors which represent the meanings of
words, and how those vectors can be used to calculate semantic similarity
between word meanings. It also describes the problem of creating a composi-
tional model within the vector-based framework, i.e. developing a procedure
for taking the vectors for two words (or phrases) and combining them to form
a vector for the larger phrase made up of those words. This chapter offers an
accessible presentation of a recent solution to the compositionality problem.
Another way to consider the problem is that we would like a procedure
which, given a sentence, and a vector for each word in the sentence, produces
a vector for the whole sentence. Why might such a procedure be desirable?
The first reason is that considering the problem of compositionality in nat-
ural language from a geometric viewpoint may provide an interesting new
perspective on the problem. Traditionally, compositional methods in natural
language semantics, building on the foundational work of Montague (Dowty
et al., 1981), have assumed the meanings of words to be given, and effectively
atomic, without any internal structure. Once we assume that the meanings
of words are vectors, with significant internal structure, then the problem of
how to compose them takes on a new light.
A second, more practical reason is that applications in Natural Language
Processing (NLP) would benefit from a framework in which the meanings of
whole sentences can be easily compared. For example, suppose that a sophisti-
cated search engine is issued the following query: Find all car showrooms with
sales on for Ford cars. Suppose further that a web page has the heading Cheap
Fords available at the car salesroom. Knowing that the above two sentences
are similar in meaning would be of huge benefit to the search engine. Further,
if the two sentence meanings could be represented in the same vector space,
then comparing meanings for similarity is easy: simply use the cosine measure
between the sentence vectors, as is standard practice for word vectors.
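To make this comparison concrete, here is a minimal Python sketch of the cosine measure; the two three-dimensional "sentence vectors" are invented for illustration and do not come from any real model:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two same-length coefficient lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Two hypothetical 3-dimensional sentence vectors that should come out similar:
s1 = [0.9, 0.1, 0.2]
s2 = [0.8, 0.2, 0.1]
print(cosine(s1, s2))
```

A value close to 1 indicates near-identical direction, i.e. high similarity, exactly as for word vectors.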
One counter-argument to the above example might be that composition-
ality is not required in this case, in order to determine sentence similarity,
only similarity at the word level. For this example that may be true, but it is
uncontroversial that sentence meaning is mediated by syntactic structure. To
take another search engine example, the query A man killed his dog, entered
into Google on January 5, 2012, from the University of Cambridge Computer
Laboratory, returned a top-ranked page with the snippet Dog shoots man (as
opposed to Man shoots dog), and the third-ranked page had the snippet The
Man who Killed His Friend for Eating his Dog After it was Killed . . ..1 Of
course the order of words matters when it comes to sentence meaning.
Figure 1 shows the intuition behind comparing sentence vectors. The pre-
vious chapter explained how vectors for the words cat and dog could be created
1 Thanks to Ed Grefenstette for this example.
that are relatively close in “noun space” (the vector space at the top in the
figure).2 The framework described in this chapter will provide a mechanism
for creating vectors for sentences, based on the vectors for the words, so that
man killed dog and man murdered cat will be relatively close in the “sentence
space” (at the bottom of Figure 1), but crucially man killed by dog will be
located in another part of the space (since in the latter case it is the animal
killing the man, rather than vice versa). Note that, in the sentence space in
the figure, no commitment has been made regarding the basis vectors of the
sentence space (s1, s2 and s3 are not sentences, but unspecified basis vectors).
In fact, the question of what the basis vectors of the sentence space should
be is not answered by the compositional framework, but is left to the model
developer to answer. The mathematical framework simply provides a compo-
sitional device for combining vectors, assuming the sentence space is given.
Sections 4 and 5 give some examples of possible sentence spaces.
A key idea underlying the vector-based compositional framework is that
syntax drives the compositional process, in much the same way that it does in
Montague semantics (see Dowty et al. (1981) and previous chapter). Another
key idea borrowed from formal semantics is that the syntactic and semantic
descriptions will be type-driven, reflecting the fact that many word types in
natural language, such as verbs and adjectives, have a relation, or functional,
role. In fact, the syntactic formalism assumed here will be a variant of Cate-
gorial Grammar, which is the grammatical framework also used by Montague.
The next section describes pregroup grammars, which provide the syntac-
tic formalism used in Coecke et al. (2010). However, it should be noted that the
2 In practice the noun space would be many orders of magnitude larger than the
3-dimensional vector space in the figure, which has only 3 context words.
[Figure 1 depicts two vector spaces: a noun space with basis vectors furry, stroke and pet, in which the vectors for cat and dog lie close together; and a sentence space with unspecified basis vectors s1, s2 and s3, in which man killed dog and man murdered cat lie close together while man killed by dog points in a different direction.]

Figure 1. Example vector spaces for noun and sentence meanings
use of pregroup grammars is essentially a mathematical expedient (in a way
briefly explained in the next section), and it is likely that other type-driven
formalisms, for example Combinatory Categorial Grammar (Steedman, 2000),
can be accommodated in the compositional framework.
Section 3 shows how the use of syntactic functional types leads naturally
to the use of tensor products for the meanings of words such as verbs and
adjectives. Section 4 then provides an example sentence space, and provides
some intuition for how to compose a tensor product with one of its “argu-
ments” (effectively providing the analogue of function application in the se-
mantic vector space). Finally, Section 5 describes a sentence space that has
been implemented in practice by Grefenstette & Sadrzadeh (2011).
2 Syntactic Types and Pregroup Grammars
The key idea in any form of Categorial Grammar is that all grammatical
constituents correspond to a syntactic type, which identifies a constituent as
either a function, from one type to another, or as an argument (Steedman
& Baldridge, 2011). Combinatory Categorial Grammar (CCG) (Steedman,
2000), following the original work of Lambek (1958), uses slash operators to
indicate the directionality of arguments. For example, the syntactic type (or
category) for a transitive verb such as likes is as follows:
likes := (S\NP)/NP
The way to read this category is that likes is the sort of verb which first
requires an NP argument to its right (note the outermost slash operator point-
ing to the right), resulting in a category which requires an NP argument to its
left (note the innermost slash operator pointing to the left), finally resulting in
a sentence (S ). Categories with slashes are known as complex categories; those
without slashes, such as S and NP , are known as basic, or atomic, categories.
A categorial grammar lexicon is a mapping from words onto sets of possible
syntactic categories for each word. In addition to the lexicon, there is a small
number of rules which combine the categories together. In classical categorial
grammar, there are only two rules, forward (>) and backward (<) application:
X/Y  Y  ⇒  X    (>)    (1)
Y  X\Y  ⇒  X    (<)    (2)
Investors   are                       appealing       to      the    Exchange   Commission
NP          (S[dcl]\NP)/(S[ng]\NP)    (S[ng]\NP)/PP   PP/NP   NP/N   N/N        N

N/N  N  ⇒  N                                      (>)
NP/N  N  ⇒  NP                                    (>)
PP/NP  NP  ⇒  PP                                  (>)
(S[ng]\NP)/PP  PP  ⇒  S[ng]\NP                    (>)
(S[dcl]\NP)/(S[ng]\NP)  S[ng]\NP  ⇒  S[dcl]\NP    (>)
NP  S[dcl]\NP  ⇒  S[dcl]                          (<)

Figure 2. Example derivation using forward and backward application; grammatical features on the S indicate the type of the sentence, such as declarative [dcl].
These rules are technically rule schemata, in which the X and Y variables can
be replaced with any category. Figure 2, taken from Clark & Curran (2007),
gives a derivation using these rules for an example newspaper sentence.
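The two rule schemata can be given a small computational sketch. The encoding below (nested tuples for complex categories, strings for atomic ones) is purely illustrative and not part of any standard CCG implementation:

```python
# Categories: ("/", X, Y) encodes X/Y, ("\\", X, Y) encodes X\Y,
# and a plain string such as "NP" encodes an atomic category.

def apply_rules(left, right):
    """Try forward (>) then backward (<) application; None if neither fires."""
    if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
        return left[1]                 # X/Y  Y  =>  X   (>)
    if isinstance(right, tuple) and right[0] == "\\" and right[2] == left:
        return right[1]                # Y  X\Y  =>  X   (<)
    return None

tv = ("/", ("\\", "S", "NP"), "NP")    # transitive verb (S\NP)/NP
vp = apply_rules(tv, "NP")             # forward application gives S\NP
print(apply_rules("NP", vp))           # backward application gives S
```

Applying the verb to its object and then the subject mirrors the derivation in Figure 2 in miniature.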
Classical categorial grammar is context-free in terms of its generative
power. Combinatory Categorial Grammar adds a number of additional rules,
such as function composition and type-raising, which increase the power
of the grammar to so-called mild context-sensitivity (Weir, 1988). This allows
the grammar to deal with examples of crossing dependencies attested in Dutch
and Swiss German (Shieber, 1985; Steedman, 2000), whilst still retaining
some computationally attractive properties such as polynomial-time parsing
(Vijay-Shanker & Weir, 1993).
The mathematical move in pregroup grammars, a recent incarnation of
categorial grammar due to Lambek (2008), is to replace the slash operators
with different kinds of categories (adjoint categories), and to adopt a more
algebraic, rather than logical, perspective compared with the original work
(Lambek, 1958). The category for a transitive verb now looks as follows:
likes := NPr · S · NPl
The first difference to notice is notational, in that the order of the cate-
gories is different: in CCG the arguments are ordered from the right in the
order in which they are cancelled; pregroups use the type-logical ordering
(Moortgat, 1997) in which left arguments appear to the left of the result, and
right arguments appear to the right.
The key difference is that a left argument is now represented as a right
adjoint: NPr, and a right argument is represented as a left adjoint: NP l; so the
adjoint categories have effectively replaced the slash operators in traditional
categorial grammar. One potential source of confusion is that arguments to
the left are right adjoints, and arguments to the right are left adjoints. The
reason is that the “cancellation rules” of pregroups state that:
X · Xr → 1    (3)
Xl · X → 1    (4)
That is, any category X cancels with its right adjoint to the right, and cancels
with its left adjoint to the left. Figure 3 gives the pregroup derivation for
the earlier example newspaper sentence, using the CCG lexical categories
translated into pregroup types.3
Mathematically we can be more precise about the pregroup cancellation
rules:
3 Pregroup derivations are usually represented as “reduction diagrams” in which
cancelling types are connected with a link, similar to dependency representations
in computational linguistics. Here we show a categorial grammar-style derivation.
Investors   are                     appealing        to       the     Exchange   Commission
NP          NPr·S[dcl]·S[ng]l·NP    NPr·S[ng]·PPl    PP·NPl   NP·Nl   N·Nl       N

The types cancel step by step: N·Nl · N → N; NP·Nl · N → NP; PP·NPl · NP → PP; NPr·S[ng]·PPl · PP → NPr·S[ng]; NPr·S[dcl]·S[ng]l·NP · NPr·S[ng] → NPr·S[dcl]; and finally NP · NPr·S[dcl] → S[dcl].

Figure 3. Example pregroup derivation
A pregroup is a partially ordered monoid4 in which each object of the
monoid has a left and right adjoint subject to the cancellation rules
above, where → is the partial order, · is the monoid operation, and 1
is the unit object of the monoid.
In the linguistic setting, the objects of the monoid are the syntactic types;
the associative monoid operation (·) is string concatenation; the identity el-
ement is the empty string; and the partial order (→) encodes the derivation
relation. Lambek (2008) has many examples of linguistic derivations, including
a demonstration of how iterated adjoints, e.g. NPll, can deal with interesting
syntactic phenomena such as object extraction.
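The cancellation mechanism can be sketched computationally. The encoding below is our own illustration, not part of the formalism as usually presented: a simple type is a pair of a basic type and an adjoint degree (0 for X, +1 for Xr, −1 for Xl; iterated adjoints would use larger magnitudes), and both cancellation rules reduce to one adjacency test:

```python
def reduce_types(types):
    """Greedily apply the pregroup cancellation rules X·Xr -> 1 and
    Xl·X -> 1: an adjacent pair (x, d)(x, d + 1) deletes, covering both."""
    types = list(types)
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            (x, d1), (y, d2) = types[i], types[i + 1]
            if x == y and d2 == d1 + 1:
                del types[i:i + 2]
                changed = True
                break
    return types

# "man likes dog":  NP   (NPr · S · NPl)   NP
seq = [("NP", 0), ("NP", 1), ("S", 0), ("NP", -1), ("NP", 0)]
print(reduce_types(seq))  # [('S', 0)] -> the sequence reduces to a sentence
```

Greedy adjacent cancellation suffices for this simple example; a full parser would need to search over reduction orders.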
It is an open question whether pregroups can provide an adequate descrip-
tion of natural languages. Pregroup grammars are context-free (Buszkowski
& Moroz, 2006), which is generally thought to be too weak to provide a full
description of the structural properties of natural language. However, many of
the combinatory rules in CCG are sound in pregroups, including forward and
backward application, forward and backward composition, and forward and
backward type-raising. In fact, the author has recently translated CCGbank
4 A monoid is a set, together with an associative binary operation, where one of
the members of the set is an identity element with respect to the operation.
(Hockenmaier & Steedman, 2007), a large corpus of English newspaper sen-
tences each with a CCG derivation, into pregroup derivations, with the one
caveat that the backward-crossed composition rule, which is frequently used
in CCGbank, is unsound in pregroups, meaning that a workaround is required
for that rule.5
There are two key ideas from this section which will carry over to the
distributional, vector-based semantics. First, linguistic constituents are rep-
resented using syntactic types, many of which are functional, or relational, in
nature. Second, there is a mechanism in the pregroup grammar for combin-
ing a functional type with an argument, using the adjoint operators and the
partial order (effectively encoding cancellation rules). Hence, there are two
key questions for the semantic analysis: one, for each syntactic type, what is
the corresponding semantic type in the world of vector spaces? And two, once
we have the semantic types represented as vectors, how can the vectors be
combined to encode a “cancellation rule” in the semantics?
Before moving to the vector-based semantics, a comment on Category The-
ory is in order. Category Theory (Lawvere & Schanuel, 1997) is an abstract
branch of mathematics which is heavily used in Coecke et al. (2010). Briefly,
a pregroup grammar can be seen as an instance of a so-called compact closed
category, in which the objects of the category are the syntactic types, the
arrows of the category are provided by the partial order, and the tensor of the
compact closed category is string concatenation (the monoidal operator). Why
is this useful? It is because vector spaces can also be seen as an instance of a
compact closed category, with an analogous structure to pregroup grammars:
5 English is a relatively simple case because of the lack of crossing dependencies;
other languages may not be so amenable to a wide-coverage pregroup treatment.
the objects of the category are the vector spaces, the arrows of the category
are provided by linear maps, and the tensor of the compact closed category is
the tensor product. Crucially the compact closed structure provides a mech-
anism for combining objects together in a compositional fashion. We have
already seen an instance of this mechanism in the pregroup cancellation rules;
the same mechanism (at an abstract level) will provide the recipe for com-
bining meaning vectors. Section 4 will motivate the recipe from an intuitive
perspective; readers are referred to Coecke et al. (2010) for the mathematical
details.
3 Semantic Types and Tensor Products
The question we will answer in this section is: what are the semantic types
corresponding to the syntactic types of the pregroup grammar? We will use
the type of a transitive verb in English as an example, although in principle
the same mechanism can be applied to all syntactic types.
The syntactic type for a transitive verb in English, e.g. likes, is as follows:
likes := NPr · S · NPl
Let us assume that noun phrases live in the vector space N and that sentences
live in the vector space S. We have methods available for building the noun
space, N, detailed in the previous chapter; how to represent S is a key question
for this whole chapter — for now we simply assume that there is such a space.
Following the form of the syntactic type above, the semantic type of a
transitive verb — i.e. the vector space containing the vectors for transitive
verbs such as likes — is as follows:
−−→likes ∈ N · S ·N
Now the question becomes what should the monoidal operator (·) be in
the vector-space case? As briefly described at the end of the previous section,
Coecke et al. (2010) use category theory to motivate the use of the tensor
product as the monoidal operator which binds the individual vector spaces
together:
−−→likes ∈ N⊗ S⊗N
[Figure 4 shows a two-dimensional space with basis vectors a and b (containing a vector u) combined with a two-dimensional space with basis vectors c and d (containing a vector v); the resulting tensor product space has basis vectors (a,c), (a,d), (b,c) and (b,d), with the coefficient on the (a,d) basis vector given by (u ⊗ v)_(a,d) = u_a · v_d.]

Figure 4. The tensor product of two vector spaces; (u ⊗ v)_(a,d) is the coefficient of (u ⊗ v) on the (a, d) basis vector
Figure 4 shows two vector spaces being combined together using the tensor
product. The vector space on the far left has two basis vectors, a and b, and
is combined with a vector space also with two basis vectors, c and d. The
resulting tensor product space has 2 × 2 = 4 basis vectors, given by the
cartesian product of the basis vectors of the individual spaces {a, b} × {c, d}.
Hence basis vectors in the tensor product space are pairs of basis vectors from
the individual spaces; if three vector spaces are combined, the resulting tensor
product space has basis vectors consisting of triples; and so on.
One feature of the tensor product space is that the number of dimensions
grows exponentially with the number of spaces being combined. For example,
suppose that we have a noun space with 10,000 dimensions and a sentence
space also with 10,000 dimensions, then the tensor product space N⊗S⊗N will
have 10, 0003 dimensions. We do not expect this to be a problem in practice,
since the size of the largest tensor product space will be determined by the
highest arity of a verb, which is unlikely to be higher than, say, 4 in English.
Figure 4 also shows two individual vectors, u and v, being combined (u⊗
v).6 The expression at the bottom of the figure gives the coefficient for one of
the basis vectors in the tensor product space, (a, d), for the vector u⊗ v. The
recipe is simple: take the coefficient of u on the basis vector corresponding to
the first element of the pair (coefficient denoted ua), and multiply it by the
coefficient of v on the basis vector corresponding to the second element of the
pair (coefficient denoted vd). Combining vectors in this way results in what is
called a simple or pure vector in the tensor product space.
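The recipe can be written out directly. This short sketch (our own illustration, with invented coefficients) returns the Kronecker product's coefficients in the basis order (a,c), (a,d), (b,c), (b,d):

```python
def kron(u, v):
    """Kronecker product of two coefficient lists: the coefficient of
    u ⊗ v on the pair basis vector (x, y) is u_x * v_y."""
    return [ua * vd for ua in u for vd in v]

u = [2.0, 3.0]          # coefficients on basis vectors a, b
v = [5.0, 7.0]          # coefficients on basis vectors c, d
print(kron(u, v))       # [10.0, 14.0, 15.0, 21.0] on (a,c), (a,d), (b,c), (b,d)
```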
One of the interesting properties of the tensor product space is that it is
much larger than the set of pure vectors; i.e. there are vectors in the tensor
product space in Figure 4 which cannot be obtained by combining vectors
u and v in the manner described above. It is this property of tensor prod-
ucts which allows the representation of entanglement in quantum mechanics
(Nielsen & Chuang, 2000), and leads to entangled vectors in the tensor product
space.
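For the 2 × 2 case of Figure 4, whether a given vector in the product space is pure can be checked with a standard rank-1 (determinant) criterion; this test is an illustrative aside and is not used later in the chapter:

```python
def is_pure(w):
    """A vector (w_ac, w_ad, w_bc, w_bd) in the 2x2 tensor space is a pure
    (Kronecker) product u ⊗ v iff its 2x2 coefficient matrix has rank 1,
    i.e. zero determinant: w_ac * w_bd - w_ad * w_bc == 0."""
    w_ac, w_ad, w_bc, w_bd = w
    return abs(w_ac * w_bd - w_ad * w_bc) < 1e-12

print(is_pure([10.0, 14.0, 15.0, 21.0]))  # True: this is [2,3] ⊗ [5,7]
print(is_pure([1.0, 0.0, 0.0, 1.0]))      # False: an "entangled" vector
```

The second vector is the tensor-space analogue of the entangled states familiar from quantum mechanics: no choice of u and v yields it as u ⊗ v.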
An individual vector for a transitive verb can now be written as follows
(Coecke et al., 2010):
Ψ = ∑_ijk Cijk (−→ni ⊗ −→sj ⊗ −→nk) ∈ N ⊗ S ⊗ N    (5)
Here ni and nk are basis vectors in the noun space, N; sj is a basis vector
in the sentence space, S; −→ni ⊗ −→sj ⊗ −→nk is alternative notation for the basis
vector in the tensor product space (i.e. the triple 〈−→ni ,−→sj ,−→nk〉), and Cijk is the
coefficient for that basis vector.
6 The combination of two individual vectors in this way is called the Kronecker
product.
The intuition we would like to convey is that the vector for the verb is
relational, or functional, in nature, as it is in the syntactic case.7 Informally,
the expression in (5) for the verb vector can be read as follows:
The vector for a transitive verb can be thought of as a function, which,
given a particular basis vector from the subject, ni, and a particular
basis vector from the object, nk, returns a value Cijk for each basis
vector sj in the sentence space.
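This functional reading can be sketched as follows; the basis labels (s1, s2) and all coefficients are invented for illustration, and a sparse dictionary stands in for the full coefficient array:

```python
# A hypothetical fragment of a verb "cube" C, keyed by (subject-property,
# object-property) pairs, each mapping sentence basis vectors to coefficients.

C = {
    ("fluffy", "fluffy"): {"s1": 0.8, "s2": 0.2},
    ("fluffy", "fast"):   {"s1": 0.75, "s2": 0.25},
}

def verb_value(C, n_i, n_k, s_j):
    """Return C_ijk: the verb's coefficient on sentence basis s_j, given
    subject basis n_i and object basis n_k (0.0 where nothing is stored)."""
    return C.get((n_i, n_k), {}).get(s_j, 0.0)

print(verb_value(C, "fluffy", "fast", "s1"))   # 0.75
```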
The key idea behind the use of the tensor product is that it captures
the interaction of the subject and object in the case of a transitive verb.
The fact that the tensor product effectively retains all the information from
the combining spaces — which is why the size of the tensor space grows so
quickly — is what allows this interaction to be captured. The final part of
Section 5 presses this point further by considering a simpler, but conceptually
less effective, alternative to the tensor product: the direct sum of vector spaces.
We have now answered one of the key questions for this chapter: what
is the semantic type corresponding to a particular syntactic type? The next
section answers the remaining question: how can the transitive verb in (5)
be combined with instances of a subject and object to give a vector in the
sentence space?
7 A similar intuition lies behind the use of matrices to represent the meanings of
adjectives in Baroni & Zamparelli (2010).
[Figure 5 depicts a two-dimensional space with basis vectors True and False; the vector for dog chases cat lies close to the True basis vector, while the vector for apple chases orange lies close to the False basis vector.]

Figure 5. An example “plausibility space” for sentences
4 Composition for an Example Sentence Space
In this section we will use an example sentence space which can be thought
of as a “plausibility space”. Note that this is a fictitious example, in that no
such space has been built and no suggestions will be made for how it might
be built; however, it is a useful example because the sentence space is small,
with only two dimensions, and conceptually simple and easy to understand.
The next section will describe a more complex sentence space which has been
implemented. Figure 5 gives two example vectors in the plausibility space,
which has basis vectors corresponding to True and False (which can also be
thought of as “highly plausible” and “not at all plausible”). The sentence dog
chases cat is considered highly plausible, since it is close to the True basis
vector, whereas apple chases orange is considered highly implausible, since it
is close to the False basis vector.
For the rest of this section we will use the example sentence dog chases cat,
and show how a vector can be built for this sentence in the plausibility space,
assuming vectors for both nouns and the transitive verb already exist. The
           fluffy   run   fast   aggressive   tasty   buy   juice   fruit
−→dog        0.8    0.8    0.7        0.6       0.1   0.5     0.0     0.0
−→cat        0.9    0.8    0.6        0.3       0.0   0.5     0.0     0.0
−−−→apple      0.0    0.0    0.0        0.0       0.9   0.9     0.8     1.0
−−−−→orange     0.0    0.0    0.0        0.0       1.0   0.9     1.0     1.0

Figure 6. Example noun vectors in N
vectors assumed for the nouns are given in Figure 6, together with example
vectors for the nouns apple and orange. Note that, again, these are fictitious
counts in the table, assumed to have been obtained from analysing a corpus
and using some appropriate weighting procedure (Curran, 2004).
The compositional framework is agnostic towards the particular noun vec-
tors, in that it does not matter how those vectors are built (e.g. using a simple
window-based method, or a dependency-based method, or a dimensionality
reduction technique such as LSA (Deerwester et al., 1990)). However, for ex-
planatory purposes it will be useful to think of the basis vectors of the noun
space as corresponding to properties of the noun, obtained using the output
of a dependency parser (Curran, 2004). For example, the count for dog corre-
sponding to the basis vector fluffy is assumed to be some weighted, normalised
count of the number of times the adjective fluffy has modified the noun dog
in some corpus; intuitively this basis vector has received a high count for dog
because dogs are generally fluffy. Similarly, the basis vector buy corresponds
to the object position of the verb buy, and apple has a high count for this
basis vector because apples are the sorts of things that are bought.
Figure 7 gives example vectors for the verbs chases and eats. Note that,
since transitive verbs live in N ⊗ S ⊗N, the basis vectors across the top of
the table are triples of the form (ni, sj , nk), where ni and nk are basis vectors
          〈fluffy,T,fluffy〉  〈fluffy,F,fluffy〉  〈fluffy,T,fast〉  〈fluffy,F,fast〉  〈fluffy,T,juice〉  〈fluffy,F,juice〉  〈tasty,T,juice〉  . . .
−−−−→chases        0.8               0.2               0.75             0.25              0.2               0.8              0.1
−−→eats          0.7               0.3               0.6              0.4               0.9               0.1              0.1

Figure 7. Example transitive verb vectors in N ⊗ S ⊗ N
          〈fluffy,T,fluffy〉  〈fluffy,F,fluffy〉  〈fluffy,T,fast〉  〈fluffy,F,fast〉  〈fluffy,T,juice〉  〈fluffy,F,juice〉  〈tasty,T,juice〉  . . .
−−−−→chases        0.8               0.2               0.75             0.25              0.2               0.8              0.1
dog,cat        0.8,0.9           0.8,0.9           0.8,0.6          0.8,0.6           0.8,0.0           0.8,0.0          0.1,0.0

Figure 8. The vector for chases together with subject dog and object cat
from N (properties of the noun) and sj is a basis vector from S (True or False
in the plausibility space).
The way to read the fictitious numbers is as follows: chases has a high
(normalised) count for the 〈fluffy,T,fluffy〉 basis vector, because it is highly
plausible that fluffy things chase fluffy things. Conversely, chases has a low
count for the 〈fluffy,F,fluffy〉 basis vector, because it is not highly implausible
that fluffy things chase fluffy things.8 Similarly, eats has a low count for the
〈tasty,T,juice〉 basis vector, because it is not highly plausible that tasty things
eat things which can be juice.
Figure 8 gives the vector for chases, together with the corresponding counts
for the subject and object in dog chases cat. For example, the (dog, cat) count
pair for the basis vectors 〈fluffy,T,fluffy〉 and 〈fluffy,F,fluffy〉 is (0.8, 0.9), mean-
ing that dogs are fluffy to an extent 0.8, and cats are fluffy to an extent 0.9.
8 The counts in the table for (ni, T, nk) and (ni, F, nk), for some ni and nk, always
sum to 1, but this is not required and no probabilistic interpretation of the counts
is being assumed.
The property counts for the subject and object are taken from the noun vec-
tors in Figure 6.
The reason for presenting the vectors in the form in Figure 8 is that a
procedure for combining the vector for chases with the vectors for dog and cat
now suggests itself. How plausible is it that a dog chases a cat? Well, we know
the extent to which fluffy things chase fluffy things, and the extent to which
a dog is fluffy and a cat is fluffy; we know the extent to which things that are
bought chase things that can be juice, and we know the extent to which dogs
can be bought and cats can be juice; more generally, for each property pair,
we know the extent to which the subjects and objects of chases, in general,
embody those properties, and we know the extent to which the particular
subject and object in the sentence embody those properties. So multiplying
the corresponding numbers together brings information from both the verb
and the particular subject and object.
The calculation for the True basis vector of the sentence space for dog
chases cat is as follows:
(−−−−−−−−−−→dog chases cat)True = 0.8 · 0.8 · 0.9 + 0.75 · 0.8 · 0.6 + 0.2 · 0.8 · 0.0 + 0.1 · 0.1 · 0.0 + . . .
The calculation for the False basis vector is similar:
(−−−−−−−−−−→dog chases cat)False = 0.2 · 0.8 · 0.9 + 0.25 · 0.8 · 0.6 + 0.8 · 0.8 · 0.0 + . . .
We would expect the coefficient for the True basis vector to be much higher
than that for the False basis vector, since dog and cat embody exactly those
elements of the property pairs which score highly on the True basis vector for
man     bites        dog
NP      NPr·S·NPl    NP
N       N⊗S⊗N        N

Cancelling the verb's NPl with the object gives NPr·S (semantic type N⊗S); cancelling the subject's NP with the remaining NPr gives S (semantic type S).

Figure 9. Example pregroup derivation with semantic types
chases, and do not embody those elements of the property pairs which score
highly on the False basis vector.
Through a particular example of the sentence space, we have now derived
the expression in Coecke et al. (2010) for combining a transitive verb, −→Ψ , with
its subject, −→π , and object, −→o :
f(−→π ⊗ −→Ψ ⊗ −→o ) = ∑_ijk Cijk 〈−→π |−→πi〉 −→sj 〈−→o |−→ok〉            (6)

               = ∑_j ( ∑_ik Cijk 〈−→π |−→πi〉 〈−→o |−→ok〉 ) −→sj      (7)
The expression 〈−→π |−→πi〉 is the Dirac notation for the inner product between −→π
and −→πi , and in this case the inner product is between a vector and one of its
basis vectors, so it simply returns the coefficient of −→π for the −→πi basis vector.
From the linguistic perspective, these inner products are simply picking out
particular properties of the subject, −→πi , and object, −→ok, and combining the
corresponding property coefficients with the corresponding coefficient for the
verb, Cijk, for a particular basis vector in the sentence space, sj .
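Equation (7) can be implemented directly. The following Python sketch uses the chapter's fictitious counts from Figures 6–8; the sums are truncated to the basis triples actually shown in Figure 7, so the output reproduces the (partial) calculations above rather than a full model:

```python
# Noun vectors (Figure 6), restricted to the properties used here:
dog = {"fluffy": 0.8, "fast": 0.7, "tasty": 0.1}   # subject vector
cat = {"fluffy": 0.9, "fast": 0.6, "juice": 0.0}   # object vector

# Verb coefficients C_ijk (Figure 7), keyed by (subject-property,
# object-property), each giving the (True, False) sentence coefficients:
chases = {
    ("fluffy", "fluffy"): (0.8, 0.2),
    ("fluffy", "fast"):   (0.75, 0.25),
    ("fluffy", "juice"):  (0.2, 0.8),
    ("tasty",  "juice"):  (0.1, 0.9),
}

def compose(subj, verb, obj):
    """Equation (7): s_j = sum over i,k of C_ijk * subj_i * obj_k."""
    true_coeff = false_coeff = 0.0
    for (i, k), (c_true, c_false) in verb.items():
        w = subj.get(i, 0.0) * obj.get(k, 0.0)
        true_coeff += c_true * w
        false_coeff += c_false * w
    return true_coeff, false_coeff

t, f = compose(dog, chases, cat)
print(round(t, 3), round(f, 3))  # 0.936 0.264
```

As expected, the True coefficient comes out much higher than the False coefficient, so dog chases cat lands near the True basis vector of the plausibility space.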
Figure 9 shows a pregroup derivation for a simple transitive verb sentence,
together with the corresponding semantic types. The point of the example is
to demonstrate how the semantic types “become smaller” as the derivation
progresses, in much the same way that the syntactic types do. The reduction,
or cancellation, rule for the semantic component is given in (7), which can be
thought of as the semantic vector-based analogue of the syntactic reduction
rules in (3) and (4). Similar to a model-theoretic semantics, the semantic
vector for the verb can be thought of as encoding all the ways in which the verb
could interact with a subject and object, in order to produce a sentence, and
the introduction of a particular subject and object reduces those possibilities
to a single vector in the sentence space.
In summary, the meaning of a sentence w1 · · ·wn with the grammatical
(pregroup) structure p1 · · · pn →α S, where pi is the grammatical type of wi,
and→α is the pregroup reduction to a sentence, can be represented as follows:
−−−−−−→w1 · · · wn = F(α)(−→w1 ⊗ · · · ⊗ −→wn)
Here we have generalised the previous discussion of transitive verbs and
extended the idea to syntactic reductions or derivations containing any types.
The point is that the semantic reduction mechanism described in this section
can be generalised to any syntactic reduction, \alpha, and there is a function, F,
which, given \alpha, produces a linear map taking the word vectors \overrightarrow{w_1} \otimes \cdots \otimes \overrightarrow{w_n}
to a sentence vector \overrightarrow{w_1 \cdots w_n}. F can be thought of as Montague's "homomorphic
passage" from syntax to semantics.[9] Coecke et al. (2010) contains a
detailed description of this more general case.
[9] Note that the input to the "meaning map" is a vector in the tensor product space
obtained by combining the semantic types for all the words in the sentence; however,
this vector will never be built in practice, only the vectors for the individual
words.
5 A Real Sentence Space
The sentence space in this section is taken from Grefenstette et al. (2011)
and Grefenstette & Sadrzadeh (2011). The sentence space is designed to work
with transitive verbs, and exploits a similar intuition to that used to define
the verb vector in the previous section: a vector for a transitive verb sentence
consists of pairs of properties reflecting both the properties of the subject and
object of the verb, and the properties of the subjects and objects that the
verb has in general. Pairs of properties are encoded in the N \otimes N space:

\overrightarrow{dog\ chases\ cat} \in N \otimes N

Given that S = N \otimes N, the semantic type of a transitive verb is as follows:

\overrightarrow{chases} \in N \otimes N \otimes N \otimes N
However, in this section, following Grefenstette et al. (2011) and Grefenstette
& Sadrzadeh (2011), the semantic type of the verb will be the same as the
sentence space:
\overrightarrow{chases} \in N \otimes N
The reason for restricting the sentence space in this way is that the
interpretation of a particular basis vector (n_i, n_k) from N \otimes N, when defining
a transitive verb, is clear: the coefficient for (n_i, n_k) should reflect the
extent to which subjects of the verb embody the n_i noun property, and the
extent to which objects of the verb embody the n_k property. One way to
consider the verb space is that we are effectively ignoring those basis vectors in
          ⟨fluffy,fluffy⟩  ⟨fluffy,fast⟩  ⟨fluffy,juice⟩  ⟨tasty,juice⟩  ⟨tasty,buy⟩  ⟨buy,fruit⟩  ⟨fruit,fruit⟩  ...
chases        0.8             0.75           0.2             0.1            0.2          0.2           0.0
eats          0.7             0.6            0.9             0.1            0.1          0.7           0.1

Figure 10. Example transitive verb vectors in N \otimes N
N \otimes N \otimes N \otimes N in which the properties of the subject and object (corresponding
to the outermost N vectors in the tensor product space) do not match
those of the property pairs from the sentence (the innermost N vectors).
Figure 10 shows the verb vectors from the plausibility space example,
but this time with the basis vectors as pairs of noun properties, rather than
triples consisting of pairs of noun properties together with a True or False
basis vector from the plausibility space. The intuition for the fictitious counts
is the same as before: \overrightarrow{chases} has a high coefficient for the ⟨fluffy,fluffy⟩ basis
vector because, in general, fluffy things chase fluffy things.
Another reason for defining the verb space as N ⊗ N is that there is a
clear experimental procedure for obtaining verb vectors from corpora: simply
count the number of times that particular property pairs from the subject and
object appear with a particular verb in the corpus. Suppose that we have a
corpus consisting of only two sentences: dog chases cat and cat chases mouse.
To obtain the vector for chases, we first increment counts for all property pairs
corresponding to (dog, cat), and then increment the counts for all property
pairs corresponding to (cat, mouse) (where the counts come from the noun
vectors, which we assume have already been built). Hence in this example
we would obtain evidence for fluffy things chasing fluffy things from the first
sentence (since dogs are fluffy and cats are fluffy); for aggressive things chasing
fluffy things (since dogs can be aggressive and cats are fluffy); some evidence
for things that can be bought chasing things that are brown from the second
sentence (since cats can be bought and some mice are brown); and so on for
all property pairs and all examples of transitive verbs in the corpus.[10]

                ⟨fluffy,fluffy⟩  ⟨fluffy,fast⟩  ⟨fluffy,juice⟩  ⟨tasty,juice⟩  ⟨tasty,buy⟩  ⟨buy,fruit⟩  ⟨fruit,fruit⟩  ...
chases              0.8             0.75           0.2             0.1           0.2          0.2           0.0
dog, cat            0.8,0.9         0.8,0.6        0.8,0.0         0.1,0.0       0.1,0.5      0.5,0.0       0.0,0.0
dog chases cat      0.58            0.36           0.0             0.0           0.01         0.0           0.0

Figure 11. Combining a transitive verb in N \otimes N with a subject and object to give a sentence in N \otimes N
Given this particular verb space (N⊗N), and continuing with the small
corpus example, there is a neat expression for the meaning of a transitive verb
(Grefenstette & Sadrzadeh, 2011):
\overrightarrow{chases} = \overrightarrow{dog} \otimes \overrightarrow{cat} + \overrightarrow{cat} \otimes \overrightarrow{mouse}

where \overrightarrow{dog} \otimes \overrightarrow{cat} is the Kronecker product of \overrightarrow{dog} and \overrightarrow{cat} defined in Section 3.
This expression generalises in the obvious way to a corpus containing many
instances of chases.
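Under this construction the verb vector is literally a sum of Kronecker products of the observed subject and object vectors. A minimal numpy sketch with invented 2-dimensional noun vectors (the property values are illustrative, not from the text):

```python
import numpy as np

# Toy noun vectors over two properties, e.g. (fluffy, buy); values invented.
dog = np.array([0.8, 0.5])
cat = np.array([0.9, 0.5])
mouse = np.array([0.4, 0.6])

# chases = dog (x) cat + cat (x) mouse: one Kronecker product per
# (subject, object) pair seen with the verb in the two-sentence corpus.
chases = np.kron(dog, cat) + np.kron(cat, mouse)
```

Each entry of chases accumulates evidence for one pair of noun properties, so a larger corpus simply adds further Kronecker products into the same vector.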
Figure 11 shows how a transitive verb in N⊗N combines with a subject
and object, using the same compositional procedure described in the previous
section. Again the intuition is similar, but now we end up with a sentence
vector in a much larger space, N \otimes N, compared with the 2-dimensional
plausibility space used earlier. The resulting sentence vector can be thought of
informally as answering the following set of questions: to what extent, in the
sentence, are fluffy things interacting with fluffy things? (in this case to a large
extent, since both subject and object are fluffy, and the verb in the sentence,
chase, is such that fluffy things do chase fluffy things); to what extent, in
the sentence, are things that can be bought interacting with things that can
be juice? (in this case to a small extent, since although the subject can be
bought, the object is not something that can be juice, and the verb chase does
not generally relate things that can be bought with things that can be juice);
and so on for all the noun property pairs in N \otimes N.

[10] For examples where the subject and object are themselves compositional, we take
the counts from the head noun of the noun phrase.
Again, given this particular sentence space N⊗N, we end up with a neat
expression for how to combine a transitive verb with its subject and object
(Grefenstette & Sadrzadeh, 2011):
\overrightarrow{dog\ chases\ cat} = \overrightarrow{chases} \odot (\overrightarrow{dog} \otimes \overrightarrow{cat})    (8)

where \odot is pointwise multiplication. Expressing the combination of verb and
arguments in this way makes it clear how both information from the arguments
and the verb itself is used to produce the sentence vector. For each property
pair in the sentence space, \overrightarrow{dog} \otimes \overrightarrow{cat} captures the extent to which the arguments
of the verb satisfy those properties, and the coefficient for each property pair
is multiplied by the corresponding coefficient for the verb, which captures the
extent to which arguments of the verb in general satisfy those properties. But
note that this neat expression only arises because of the choice of N ⊗N as
the verb and sentence space; the expression above is not true in the general
case.
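With the N \otimes N choice, then, composition is just an elementwise product in code. A small sketch continuing the toy numbers above (all values invented):

```python
import numpy as np

dog = np.array([0.8, 0.5])
cat = np.array([0.9, 0.5])

# Invented verb vector in N (x) N, e.g. built by the counting procedure.
chases = np.array([0.8, 0.2, 0.2, 0.1])

# Equation (8): dog chases cat = chases ⊙ (dog ⊗ cat)
dog_chases_cat = chases * np.kron(dog, cat)
```

The Kronecker product supplies the argument property pairs, and the elementwise product weights each pair by how strongly the verb associates with it.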
The question of how to evaluate the models described in this chapter, and
compositional distributional models more generally, is an important one, but
one that will be discussed only briefly here. The previous chapter discussed
the issue of evaluation, and described a recent method of using compositional
distributional models to disambiguate verbs in context (Mitchell & Lapata,
2008). The task is to use a compositional distributional model to assign
similarity scores to pairs such as (the face glowed, the face beamed) and (the face
glowed, the face burned), and compare these scores to human judgements. In
this case the first pair would be expected to obtain the higher score, but if
the subject were fire rather than face, then burned rather than beamed would
score higher. Hence in this case the subject of the intransitive verb is
effectively being used to disambiguate the verb.
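The model's scores in such evaluations are usually cosine similarities between the composed sentence vectors. A sketch of the comparison, with hypothetical composed vectors standing in for real model output:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of the two normalised vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical composed sentence vectors (not real model output).
face_glowed = np.array([0.9, 0.1, 0.2])
face_beamed = np.array([0.8, 0.2, 0.3])
face_burned = np.array([0.1, 0.9, 0.2])

# A good compositional model should rank the paraphrase pair higher:
# cosine(face_glowed, face_beamed) > cosine(face_glowed, face_burned)
```

These per-pair scores are then correlated with the human similarity judgements to evaluate the model.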
Grefenstette & Sadrzadeh (2011) extend this evaluation to transitive verbs,
so that there is now an object as well as subject, with the idea that having an
additional argument makes this a stronger test of compositionality. Here the
pairs are cases such as (the people tried the door, the people tested the door) and
(the people tried the door, the people judged the door). In this case the first pair
would be expected to get the higher similarity score, whereas if the subject
and object were tribunal and crime, then judged would be expected to score
higher than tested. Grefenstette & Sadrzadeh (2011) find that the model
described in this section performs at least as well as the best-performing method
from Mitchell & Lapata (2008), which involves building vectors for verbs and
arguments using the context method (not treating verbs as relational) and
then using pointwise multiplication to combine the verb with its arguments.
The previous chapter discusses this method of evaluation and suggests
that it is essentially a method of disambiguation, and so it is perhaps not
surprising that multiplicative methods perform so well, since the contextual
elements which the verb and arguments have in common will be emphasised
in the multiplicative combination. Note that, given the choice of N⊗N as the
verb and sentence space, the compositional procedure for the model in this
section reduces to a multiplicative combination (equation 8).
In the final part of this section we return to the question of why the
tensor product is used to bind the vector spaces together in relational types,
by comparing it with an alternative, the direct sum. The direct sum is also
an operator on vector spaces, but rather than create basis vectors by taking
the cartesian product of the basis vectors of the spaces being combined, it
effectively retains the basis vectors of the combining spaces as independent
vectors. So the number of dimensions for the direct sum of V and W, V \oplus W,
is |V| + |W|, where |V| is the number of dimensions in V, rather than |V| \cdot |W|
as in the tensor product case.
If the direct sum were being used for the semantic type of a transitive
verb, then the vector space in which transitive verbs live would be N⊕S⊕N
and the general expression for a transitive verb vector Ψ would be as follows:
\Psi = \sum_{ijk} C_i \overrightarrow{n_i} + C_j \overrightarrow{s_j} + C_k \overrightarrow{n_k} \in N \oplus S \oplus N    (9)
The obvious way to adapt the method for building verb vectors detailed in
the previous section to a direct sum representation is as follows. Rather than
have the verb and sentence live in the N⊗N space, as before, suppose now that
verbs and sentences live in N \oplus N. Again suppose that we have a corpus
consisting of only two sentences: dog chases cat and cat chases mouse. To obtain
the vector for chases, we follow a similar procedure to before: first increment
counts for all property pairs corresponding to (dog, cat), and then increment
the counts for all property pairs corresponding to (cat, mouse) (where again
the counts come from the noun vectors, which we assume have already been
built). But now there is a crucial difference: when we increment the counts for
the (fluffy, buy) pair, for example, there is no interaction between the subject
and object. Whereas in the tensor product case we were able to represent
the fact that fluffy things chase things that can be bought, in the direct sum
case we can only represent the fact that fluffy things chase, and that things
that can be bought get chased, but not the combination of the two. So more
generally the direct sum can represent the properties of subjects of a verb,
and the properties of objects, but not the pairs of properties which are seen
together with the verb.

Figure 12. The verb needs to interact with its subject and object, which it does with the tensor product on the left, but not with the direct sum on the right
Figure 12 is taken from Coecke et al. (2010) and expresses the interactive
nature of the tensor product on the left, as opposed to the direct sum on the
right. The figure is intended to represent the fact that, with the direct sum,
the subject, object and resulting sentence are entirely independent of each
other.
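The contrast can also be seen directly in code: the Kronecker product produces one coefficient per subject–object pair of properties, while the direct sum merely concatenates the two vectors. A toy sketch (invented vectors):

```python
import numpy as np

subj = np.array([0.8, 0.5])       # |V| = 2 dimensions
obj = np.array([0.9, 0.5, 0.1])   # |W| = 3 dimensions

tensor = np.kron(subj, obj)           # |V| * |W| = 6: one entry per pair
direct = np.concatenate([subj, obj])  # |V| + |W| = 5: no pairing at all

# Each entry of tensor multiplies a subject property by an object property,
# so the two arguments interact; every entry of direct comes from one
# argument only.
```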
6 Conclusion and Further Work
There are some obvious ways in which the existing work could be extended.
First, the present chapter has only presented a procedure for building
vectors for sentences with a simple transitive verb structure. The mathematical
framework in Coecke et al. (2010) is general and in principle applies to any
syntactic reduction from any sequence of syntactic types, but how to build
relational vectors for all complex types is an open question. Second, it needs
to be demonstrated that the compositional distributional representations
presented in this chapter can be useful for language processing tasks and
applications. Third, there is a large part of natural language semantics, much of
which is the focus of traditional formal semantics, such as logical operators,
quantification, inference, and so on, which has been ignored in this chapter.
We see distributional models of semantics as essentially providing a
semantics of similarity, but whether the more traditional questions of semantics
can be accommodated in distributional semantics is an interesting and open
question.[11] Fourth, there is the question of whether the vectors for relational
types can be driven more from a machine learning perspective, rather than
have their form determined by linguistic intuition (as was the case for the
verb vectors living in N⊗N). Addressing this question would bring together
the work in distributional semantics with the recent work in so-called deep
learning in the neural networks research community.[12] And finally, there is the
question of how sentences should be represented as vectors. Two possibilities

[11] There is some preliminary work in this direction, e.g. (Preller & Sadrzadeh, 2009;
Clarke, 2008; Widdows, 2004).
[12] See the proceedings for the NIPS workshops in 2010 and 2011 on Deep Learning
and Unsupervised Feature Learning.
have been suggested in this chapter, but there are presumably many more;
it may be that the form of the sentence space should be determined by the
particular language processing task in hand, e.g. sentiment analysis (Socher
et al., 2011).
In summary, this chapter is a presentation of Coecke et al. (2010) de-
signed to be accessible to computational linguists. The key innovations in this
work are the use of complex vector spaces for relational types such as verbs,
through the use of the tensor product, and a general method for composing
a relational type (represented as a vector in a tensor product space) with its
vector arguments.
7 Acknowledgements
Almost all of the ideas in this chapter have arisen from discussions with
Mehrnoosh Sadrzadeh, Ed Grefenstette, Bob Coecke and Stephen Pulman.
References
Baroni, M. & R. Zamparelli (2010), Nouns are vectors, adjectives are matrices:
Representing adjective-noun constructions in semantic space, in Conference on
Empirical Methods in Natural Language Processing (EMNLP-10), Cambridge,
MA.
Buszkowski, Wojciech & Katarzyna Moroz (2006), Pregroup grammars and context-
free grammars, in C. Casadio & J. Lambek (eds.), Computational algebraic ap-
proaches to natural language, Polimetrica, (1–21).
Clark, Stephen & James R. Curran (2007), Wide-coverage efficient statistical parsing
with CCG and log-linear models, Computational Linguistics 33(4):493–552.
Clarke, Daoud (2008), Context-theoretic Semantics for Natural Language: An Alge-
braic Framework, Ph.D. thesis, University of Sussex.
Coecke, B., M. Sadrzadeh, & S. Clark (2010), Mathematical foundations for a
compositional distributional model of meaning, in J. van Benthem, M. Moortgat,
& W. Buszkowski (eds.), Linguistic Analysis (Lambek Festschrift), volume 36,
(345–384).
Curran, James R. (2004), From Distributional to Semantic Similarity, Ph.D. thesis,
University of Edinburgh.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, &
Richard Harshman (1990), Indexing by latent semantic analysis, Journal of the
American Society for Information Science 41(6):391–407.
Dowty, D.R., R.E. Wall, & S. Peters (1981), Introduction to Montague Semantics,
Dordrecht.
Grefenstette, Edward & Mehrnoosh Sadrzadeh (2011), Experimental support for a
categorical compositional distributional model of meaning, in Proceedings of the
2011 Conference on Empirical Methods in Natural Language Processing, Asso-
ciation for Computational Linguistics, Edinburgh, Scotland, UK., (1394–1404).
Grefenstette, Edward, Mehrnoosh Sadrzadeh, Stephen Clark, Bob Coecke, &
Stephen Pulman (2011), Concrete sentence spaces for compositional distribu-
tional models of meaning, in Proceedings of the 9th International Conference on
Computational Semantics (IWCS-11), Oxford, UK, (125–134).
Hockenmaier, Julia & Mark Steedman (2007), CCGbank: a corpus of CCG deriva-
tions and dependency structures extracted from the Penn Treebank, Computa-
tional Linguistics 33(3):355–396.
Lambek, Joachim (1958), The mathematics of sentence structure, American Math-
ematical Monthly (65):154–170.
Lambek, Joachim (2008), From Word to Sentence. A Computational Algebraic Ap-
proach to Grammar, Polimetrica.
Lawvere, F. William & Stephen Hoel Schanuel (1997), Conceptual Mathematics: A
First Introduction to Categories, Cambridge University Press.
Mitchell, Jeff & Mirella Lapata (2008), Vector-based models of semantic composi-
tion, in Proceedings of ACL-08, Columbus, OH, (236–244).
Moortgat, Michael (1997), Categorial type logics, in Johan van Benthem & Alice
ter Meulen (eds.), Handbook of Logic and Language, Elsevier, Amsterdam and
MIT Press, Cambridge MA, chapter 2, (93–177).
Nielsen, Michael A. & Isaac L. Chuang (2000), Quantum Computation and Quantum
Information, Cambridge University Press.
Preller, Anne & Mehrnoosh Sadrzadeh (2009), Bell states as negation in natural
languages, ENTCS, QPL.
Shieber, Stuart M. (1985), Evidence against the context-freeness of natural language,
Linguistics and Philosophy 8:333–343.
Socher, Richard, Jeffrey Pennington, Eric Huang, Andrew Y. Ng, & Christopher D.
Manning (2011), Semi-supervised recursive autoencoders for predicting senti-
ment distributions, in Proceedings of the Conference on Empirical Methods in
Natural Language Processing, Edinburgh, UK.
Steedman, Mark (2000), The Syntactic Process, The MIT Press, Cambridge, MA.
Steedman, Mark & Jason Baldridge (2011), Combinatory categorial grammar, in
Robert Borsley & Kersti Borjars (eds.), Non-Transformational Syntax: Formal
and Explicit Models of Grammar, Wiley-Blackwell.
Vijay-Shanker, K. & David Weir (1993), Parsing some constrained grammar for-
malisms, Computational Linguistics 19:591–636.
Weir, David (1988), Characterizing Mildly Context-Sensitive Grammar Formalisms,
Ph.D. thesis, University of Pennsylvania.
Widdows, Dominic (2004), Geometry and Meaning, CSLI Publications, Stanford
University.