Type-Driven Syntax and Semantics for
Composing Meaning Vectors
Stephen Clark
University of Cambridge Computer Laboratory
William Gates Building, 15 JJ Thomson Avenue
Cambridge, CB3 0FD, UK
A draft chapter for the OUP book on Compositional methods in Physics and Linguistics. This draft formatted on 27th February 2012.
Page: 1 job: sclark macro: handbook.cls date/time: 27-Feb-2012/12:43
1 Introduction
This chapter describes a recent development in Computational Linguistics,
in which distributional, vector-based models of meaning — which have been
successfully applied to word meanings — have been given a compositional
treatment allowing the creation of vectors for sentence meanings. More specif-
ically, this chapter presents the theoretical framework of Coecke et al. (2010),
which has been implemented in Grefenstette et al. (2011) and Grefenstette
& Sadrzadeh (2011), in a form designed to be accessible to computational
linguists not familiar with the mathematics of Category Theory on which the
framework is based.
The previous chapter has described how distributional approaches to lex-
ical semantics can be used to build vectors which represent the meanings of
words, and how those vectors can be used to calculate semantic similarity
between word meanings. It also describes the problem of creating a composi-
tional model within the vector-based framework, i.e. developing a procedure
for taking the vectors for two words (or phrases) and combining them to form
a vector for the larger phrase made up of those words. This chapter offers an
accessible presentation of a recent solution to the compositionality problem.
Another way to consider the problem is that we would like a procedure
which, given a sentence, and a vector for each word in the sentence, produces
a vector for the whole sentence. Why might such a procedure be desirable?
The first reason is that considering the problem of compositionality in nat-
ural language from a geometric viewpoint may provide an interesting new
perspective on the problem. Traditionally, compositional methods in natural
language semantics, building on the foundational work of Montague (Dowty
et al., 1981), have assumed the meanings of words to be given, and effectively
atomic, without any internal structure. Once we assume that the meanings
of words are vectors, with significant internal structure, then the problem of
how to compose them takes on a new light.
A second, more practical reason is that applications in Natural Language
Processing (NLP) would benefit from a framework in which the meanings of
whole sentences can be easily compared. For example, suppose that a sophisti-
cated search engine is issued the following query: Find all car showrooms with
sales on for Ford cars. Suppose further that a web page has the heading Cheap
Fords available at the car salesroom. Knowing that the above two sentences
are similar in meaning would be of huge benefit to the search engine. Further,
if the two sentence meanings could be represented in the same vector space,
then comparing meanings for similarity is easy: simply use the cosine measure
between the sentence vectors, as is standard practice for word vectors.
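To make this comparison concrete, here is a minimal Python sketch of the cosine measure; the two three-dimensional "sentence vectors" are invented for illustration and do not come from any real model:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two same-length coefficient lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Two hypothetical 3-dimensional sentence vectors that should come out similar:
s1 = [0.9, 0.1, 0.2]
s2 = [0.8, 0.2, 0.1]
print(cosine(s1, s2))
```

A value close to 1 indicates near-identical direction, i.e. high similarity, exactly as for word vectors.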
One counter-argument to the above example might be that composition-
ality is not required in this case, in order to determine sentence similarity,
only similarity at the word level. For this example that may be true, but it is
uncontroversial that sentence meaning is mediated by syntactic structure. To
take another search engine example, the query A man killed his dog, entered
into Google on January 5, 2012, from the University of Cambridge Computer
Laboratory, returned a top-ranked page with the snippet Dog shoots man (as
opposed to Man shoots dog), and the third-ranked page had the snippet The
Man who Killed His Friend for Eating his Dog After it was Killed . . ..1 Of
course the order of words matters when it comes to sentence meaning.
Figure 1 shows the intuition behind comparing sentence vectors. The pre-
vious chapter explained how vectors for the words cat and dog could be created
1 Thanks to Ed Grefenstette for this example.
that are relatively close in “noun space” (the vector space at the top in the
figure).2 The framework described in this chapter will provide a mechanism
for creating vectors for sentences, based on the vectors for the words, so that
man killed dog and man murdered cat will be relatively close in the “sentence
space” (at the bottom of Figure 1), but crucially man killed by dog will be
located in another part of the space (since in the latter case it is the animal
killing the man, rather than vice versa). Note that, in the sentence space in
the figure, no commitment has been made regarding the basis vectors of the
sentence space (s1, s2 and s3 are not sentences, but unspecified basis vectors).
In fact, the question of what the basis vectors of the sentence space should
be is not answered by the compositional framework, but is left to the model
developer to answer. The mathematical framework simply provides a compo-
sitional device for combining vectors, assuming the sentence space is given.
Sections 4 and 5 give some examples of possible sentence spaces.
A key idea underlying the vector-based compositional framework is that
syntax drives the compositional process, in much the same way that it does in
Montague semantics (see Dowty et al. (1981) and previous chapter). Another
key idea borrowed from formal semantics is that the syntactic and semantic
descriptions will be type-driven, reflecting the fact that many word types in
natural language, such as verbs and adjectives, have a relation, or functional,
role. In fact, the syntactic formalism assumed here will be a variant of Cate-
gorial Grammar, which is the grammatical framework also used by Montague.
The next section describes pregroup grammars, which provide the syntac-
tic formalism used in Coecke et al. (2010). However, it should be noted that the
2 In practice the noun space would be many orders of magnitude larger than the
3-dimensional vector space in the figure, which has only 3 context words.
[Figure 1 depicts two vector spaces: a noun space with basis vectors furry, stroke and pet, in which the vectors for cat and dog lie close together; and a sentence space with unspecified basis vectors s1, s2 and s3, in which man killed dog and man murdered cat lie close together while man killed by dog points in a different direction.]

Figure 1. Example vector spaces for noun and sentence meanings
use of pregroup grammars is essentially a mathematical expedient (in a way
briefly explained in the next section), and it is likely that other type-driven
formalisms, for example Combinatory Categorial Grammar (Steedman, 2000),
can be accommodated in the compositional framework.
Section 3 shows how the use of syntactic functional types leads naturally
to the use of tensor products for the meanings of words such as verbs and
adjectives. Section 4 then provides an example sentence space, and provides
some intuition for how to compose a tensor product with one of its “argu-
ments” (effectively providing the analogue of function application in the se-
mantic vector space). Finally, Section 5 describes a sentence space that has
been implemented in practice by Grefenstette & Sadrzadeh (2011).
2 Syntactic Types and Pregroup Grammars
The key idea in any form of Categorial Grammar is that all grammatical
constituents correspond to a syntactic type, which identifies a constituent as
either a function, from one type to another, or as an argument (Steedman
& Baldridge, 2011). Combinatory Categorial Grammar (CCG) (Steedman,
2000), following the original work of Lambek (1958), uses slash operators to
indicate the directionality of arguments. For example, the syntactic type (or
category) for a transitive verb such as likes is as follows:
likes := (S\NP)/NP
The way to read this category is that likes is the sort of verb which first
requires an NP argument to its right (note the outermost slash operator point-
ing to the right), resulting in a category which requires an NP argument to its
left (note the innermost slash operator pointing to the left), finally resulting in
a sentence (S ). Categories with slashes are known as complex categories; those
without slashes, such as S and NP , are known as basic, or atomic, categories.
A categorial grammar lexicon is a mapping from words onto sets of possible
syntactic categories for each word. In addition to the lexicon, there is a small
number of rules which combine the categories together. In classical categorial
grammar, there are only two rules, forward (>) and backward (<) application:
X/Y  Y  ⇒  X    (>)    (1)
Y  X\Y  ⇒  X    (<)    (2)
Investors   are                       appealing       to      the    Exchange   Commission
NP          (S[dcl]\NP)/(S[ng]\NP)    (S[ng]\NP)/PP   PP/NP   NP/N   N/N        N

N/N  N  ⇒  N                                      (>)
NP/N  N  ⇒  NP                                    (>)
PP/NP  NP  ⇒  PP                                  (>)
(S[ng]\NP)/PP  PP  ⇒  S[ng]\NP                    (>)
(S[dcl]\NP)/(S[ng]\NP)  S[ng]\NP  ⇒  S[dcl]\NP    (>)
NP  S[dcl]\NP  ⇒  S[dcl]                          (<)

Figure 2. Example derivation using forward and backward application; grammatical features on the S indicate the type of the sentence, such as declarative [dcl].
These rules are technically rule schemata, in which the X and Y variables can
be replaced with any category. Figure 2, taken from Clark & Curran (2007),
gives a derivation using these rules for an example newspaper sentence.
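The two rule schemata can be given a small computational sketch. The encoding below (nested tuples for complex categories, strings for atomic ones) is purely illustrative and not part of any standard CCG implementation:

```python
# Categories: ("/", X, Y) encodes X/Y, ("\\", X, Y) encodes X\Y,
# and a plain string such as "NP" encodes an atomic category.

def apply_rules(left, right):
    """Try forward (>) then backward (<) application; None if neither fires."""
    if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
        return left[1]                 # X/Y  Y  =>  X   (>)
    if isinstance(right, tuple) and right[0] == "\\" and right[2] == left:
        return right[1]                # Y  X\Y  =>  X   (<)
    return None

tv = ("/", ("\\", "S", "NP"), "NP")    # transitive verb (S\NP)/NP
vp = apply_rules(tv, "NP")             # forward application gives S\NP
print(apply_rules("NP", vp))           # backward application gives S
```

Applying the verb to its object and then the subject mirrors the derivation in Figure 2 in miniature.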
Classical categorial grammar is context-free in terms of its generative
power. Combinatory Categorial Grammar adds a number of additional rules,
such as function composition and type-raising, which increase the power
of the grammar to so-called mild context-sensitivity (Weir, 1988). This allows
the grammar to deal with examples of crossing dependencies attested in Dutch
and Swiss German (Shieber, 1985; Steedman, 2000), whilst still retaining
some computationally attractive properties such as polynomial-time parsing
(Vijay-Shanker & Weir, 1993).
The mathematical move in pregroup grammars, a recent incarnation of
categorial grammar due to Lambek (2008), is to replace the slash operators
with different kinds of categories (adjoint categories), and to adopt a more
algebraic, rather than logical, perspective compared with the original work
(Lambek, 1958). The category for a transitive verb now looks as follows:
likes := NPr · S · NPl
The first difference to notice is notational, in that the order of the cate-
gories is different: in CCG the arguments are ordered from the right in the
order in which they are cancelled; pregroups use the type-logical ordering
(Moortgat, 1997) in which left arguments appear to the left of the result, and
right arguments appear to the right.
The key difference is that a left argument is now represented as a right
adjoint: NPr, and a right argument is represented as a left adjoint: NP l; so the
adjoint categories have effectively replaced the slash operators in traditional
categorial grammar. One potential source of confusion is that arguments to
the left are right adjoints, and arguments to the right are left adjoints. The
reason is that the “cancellation rules” of pregroups state that:
X · Xr → 1    (3)
Xl · X → 1    (4)
That is, any category X cancels with its right adjoint to the right, and cancels
with its left adjoint to the left. Figure 3 gives the pregroup derivation for
the earlier example newspaper sentence, using the CCG lexical categories
translated into pregroup types.3
Mathematically we can be more precise about the pregroup cancellation
rules:
3 Pregroup derivations are usually represented as “reduction diagrams” in which
cancelling types are connected with a link, similar to dependency representations
in computational linguistics. Here we show a categorial grammar-style derivation.
Investors   are                     appealing        to       the     Exchange   Commission
NP          NPr·S[dcl]·S[ng]l·NP    NPr·S[ng]·PPl    PP·NPl   NP·Nl   N·Nl       N

The types cancel step by step: N·Nl · N → N; NP·Nl · N → NP; PP·NPl · NP → PP; NPr·S[ng]·PPl · PP → NPr·S[ng]; NPr·S[dcl]·S[ng]l·NP · NPr·S[ng] → NPr·S[dcl]; and finally NP · NPr·S[dcl] → S[dcl].

Figure 3. Example pregroup derivation
A pregroup is a partially ordered monoid4 in which each object of the
monoid has a left and right adjoint subject to the cancellation rules
above, where → is the partial order, · is the monoid operation, and 1
is the unit object of the monoid.
In the linguistic setting, the objects of the monoid are the syntactic types;
the associative monoid operation (·) is string concatenation; the identity el-
ement is the empty string; and the partial order (→) encodes the derivation
relation. Lambek (2008) has many examples of linguistic derivations, including
a demonstration of how iterated adjoints, e.g. NPll, can deal with interesting
syntactic phenomena such as object extraction.
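The cancellation mechanism can be sketched computationally. The encoding below is our own illustration, not part of the formalism as usually presented: a simple type is a pair of a basic type and an adjoint degree (0 for X, +1 for Xr, −1 for Xl; iterated adjoints would use larger magnitudes), and both cancellation rules reduce to one adjacency test:

```python
def reduce_types(types):
    """Greedily apply the pregroup cancellation rules X·Xr -> 1 and
    Xl·X -> 1: an adjacent pair (x, d)(x, d + 1) deletes, covering both."""
    types = list(types)
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            (x, d1), (y, d2) = types[i], types[i + 1]
            if x == y and d2 == d1 + 1:
                del types[i:i + 2]
                changed = True
                break
    return types

# "man likes dog":  NP   (NPr · S · NPl)   NP
seq = [("NP", 0), ("NP", 1), ("S", 0), ("NP", -1), ("NP", 0)]
print(reduce_types(seq))  # [('S', 0)] -> the sequence reduces to a sentence
```

Greedy adjacent cancellation suffices for this simple example; a full parser would need to search over reduction orders.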
It is an open question whether pregroups can provide an adequate descrip-
tion of natural languages. Pregroup grammars are context-free (Buszkowski
& Moroz, 2006), which is generally thought to be too weak to provide a full
description of the structural properties of natural language. However, many of
the combinatory rules in CCG are sound in pregroups, including forward and
backward application, forward and backward composition, and forward and
backward type-raising. In fact, the author has recently translated CCGbank
4 A monoid is a set, together with an associative binary operation, where one of
the members of the set is an identity element with respect to the operation.
(Hockenmaier & Steedman, 2007), a large corpus of English newspaper sen-
tences each with a CCG derivation, into pregroup derivations, with the one
caveat that the backward-crossed composition rule, which is frequently used
in CCGbank, is unsound in pregroups, meaning that a workaround is required
for that rule.5
There are two key ideas from this section which will carry over to the
distributional, vector-based semantics. First, linguistic constituents are rep-
resented using syntactic types, many of which are functional, or relational, in
nature. Second, there is a mechanism in the pregroup grammar for combin-
ing a functional type with an argument, using the adjoint operators and the
partial order (effectively encoding cancellation rules). Hence, there are two
key questions for the semantic analysis: one, for each syntactic type, what is
the corresponding semantic type in the world of vector spaces? And two, once
we have the semantic types represented as vectors, how can the vectors be
combined to encode a “cancellation rule” in the semantics?
Before moving to the vector-based semantics, a comment on Category The-
ory is in order. Category Theory (Lawvere & Schanuel, 1997) is an abstract
branch of mathematics which is heavily used in Coecke et al. (2010). Briefly,
a pregroup grammar can be seen as an instance of a so-called compact closed
category, in which the objects of the category are the syntactic types, the
arrows of the category are provided by the partial order, and the tensor of the
compact closed category is string concatenation (the monoidal operator). Why
is this useful? It is because vector spaces can also be seen as an instance of a
compact closed category, with an analogous structure to pregroup grammars:
5 English is a relatively simple case because of the lack of crossing dependencies;
other languages may not be so amenable to a wide-coverage pregroup treatment.
the objects of the category are the vector spaces, the arrows of the category
are provided by linear maps, and the tensor of the compact closed category is
the tensor product. Crucially the compact closed structure provides a mech-
anism for combining objects together in a compositional fashion. We have
already seen an instance of this mechanism in the pregroup cancellation rules;
the same mechanism (at an abstract level) will provide the recipe for com-
bining meaning vectors. Section 4 will motivate the recipe from an intuitive
perspective; readers are referred to Coecke et al. (2010) for the mathematical
details.
3 Semantic Types and Tensor Products
The question we will answer in this section is: what are the semantic types
corresponding to the syntactic types of the pregroup grammar? We will use
the type of a transitive verb in English as an example, although in principle
the same mechanism can be applied to all syntactic types.
The syntactic type for a transitive verb in English, e.g. likes, is as follows:
likes := NPr · S · NPl
Let us assume that noun phrases live in the vector space N and that sentences
live in the vector space S. We have methods available for building the noun
space, N, detailed in the previous chapter; how to represent S is a key question
for this whole chapter — for now we simply assume that there is such a space.
Following the form of the syntactic type above, the semantic type of a
transitive verb — i.e. the vector space containing the vectors for transitive
verbs such as likes — is as follows:
−−→likes ∈ N · S ·N
Now the question becomes what should the monoidal operator (·) be in
the vector-space case? As briefly described at the end of the previous section,
Coecke et al. (2010) use category theory to motivate the use of the tensor
product as the monoidal operator which binds the individual vector spaces
together:
−−→likes ∈ N⊗ S⊗N
[Figure 4 shows a two-dimensional space with basis vectors a and b (containing a vector u) combined with a two-dimensional space with basis vectors c and d (containing a vector v); the resulting tensor product space has basis vectors (a,c), (a,d), (b,c) and (b,d), with the coefficient on the (a,d) basis vector given by (u ⊗ v)_(a,d) = u_a · v_d.]

Figure 4. The tensor product of two vector spaces; (u ⊗ v)_(a,d) is the coefficient of (u ⊗ v) on the (a, d) basis vector
Figure 4 shows two vector spaces being combined together using the tensor
product. The vector space on the far left has two basis vectors, a and b, and
is combined with a vector space also with two basis vectors, c and d. The
resulting tensor product space has 2 × 2 = 4 basis vectors, given by the
cartesian product of the basis vectors of the individual spaces {a, b} × {c, d}.
Hence basis vectors in the tensor product space are pairs of basis vectors from
the individual spaces; if three vector spaces are combined, the resulting tensor
product space has basis vectors consisting of triples; and so on.
One feature of the tensor product space is that the number of dimensions
grows exponentially with the number of spaces being combined. For example,
suppose that we have a noun space with 10,000 dimensions and a sentence
space also with 10,000 dimensions, then the tensor product space N⊗S⊗N will
have 10, 0003 dimensions. We do not expect this to be a problem in practice,
since the size of the largest tensor product space will be determined by the
highest arity of a verb, which is unlikely to be higher than, say, 4 in English.
Figure 4 also shows two individual vectors, u and v, being combined (u⊗
v).6 The expression at the bottom of the figure gives the coefficient for one of
the basis vectors in the tensor product space, (a, d), for the vector u⊗ v. The
recipe is simple: take the coefficient of u on the basis vector corresponding to
the first element of the pair (coefficient denoted ua), and multiply it by the
coefficient of v on the basis vector corresponding to the second element of the
pair (coefficient denoted vd). Combining vectors in this way results in what is
called a simple or pure vector in the tensor product space.
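The recipe can be written out directly. This short sketch (our own illustration, with invented coefficients) returns the Kronecker product's coefficients in the basis order (a,c), (a,d), (b,c), (b,d):

```python
def kron(u, v):
    """Kronecker product of two coefficient lists: the coefficient of
    u ⊗ v on the pair basis vector (x, y) is u_x * v_y."""
    return [ua * vd for ua in u for vd in v]

u = [2.0, 3.0]          # coefficients on basis vectors a, b
v = [5.0, 7.0]          # coefficients on basis vectors c, d
print(kron(u, v))       # [10.0, 14.0, 15.0, 21.0] on (a,c), (a,d), (b,c), (b,d)
```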
One of the interesting properties of the tensor product space is that it is
much larger than the set of pure vectors; i.e. there are vectors in the tensor
product space in Figure 4 which cannot be obtained by combining vectors
u and v in the manner described above. It is this property of tensor prod-
ucts which allows the representation of entanglement in quantum mechanics
(Nielsen & Chuang, 2000), and leads to entangled vectors in the tensor product
space.
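For the 2 × 2 case of Figure 4, whether a given vector in the product space is pure can be checked with a standard rank-1 (determinant) criterion; this test is an illustrative aside and is not used later in the chapter:

```python
def is_pure(w):
    """A vector (w_ac, w_ad, w_bc, w_bd) in the 2x2 tensor space is a pure
    (Kronecker) product u ⊗ v iff its 2x2 coefficient matrix has rank 1,
    i.e. zero determinant: w_ac * w_bd - w_ad * w_bc == 0."""
    w_ac, w_ad, w_bc, w_bd = w
    return abs(w_ac * w_bd - w_ad * w_bc) < 1e-12

print(is_pure([10.0, 14.0, 15.0, 21.0]))  # True: this is [2,3] ⊗ [5,7]
print(is_pure([1.0, 0.0, 0.0, 1.0]))      # False: an "entangled" vector
```

The second vector is the tensor-space analogue of the entangled states familiar from quantum mechanics: no choice of u and v yields it as u ⊗ v.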
An individual vector for a transitive verb can now be written as follows
(Coecke et al., 2010):
Ψ = ∑_ijk Cijk (−→ni ⊗ −→sj ⊗ −→nk) ∈ N ⊗ S ⊗ N    (5)
Here ni and nk are basis vectors in the noun space, N; sj is a basis vector
in the sentence space, S; −→ni ⊗ −→sj ⊗ −→nk is alternative notation for the basis
vector in the tensor product space (i.e. the triple 〈−→ni ,−→sj ,−→nk〉), and Cijk is the
coefficient for that basis vector.
6 The combination of two individual vectors in this way is called the Kronecker
product.
The intuition we would like to convey is that the vector for the verb is
relational, or functional, in nature, as it is in the syntactic case.7 Informally,
the expression in (5) for the verb vector can be read as follows:
The vector for a transitive verb can be thought of as a function, which,
given a particular basis vector from the subject, ni, and a particular
basis vector from the object, nk, returns a value Cijk for each basis
vector sj in the sentence space.
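This functional reading can be sketched as follows; the basis labels (s1, s2) and all coefficients are invented for illustration, and a sparse dictionary stands in for the full coefficient array:

```python
# A hypothetical fragment of a verb "cube" C, keyed by (subject-property,
# object-property) pairs, each mapping sentence basis vectors to coefficients.

C = {
    ("fluffy", "fluffy"): {"s1": 0.8, "s2": 0.2},
    ("fluffy", "fast"):   {"s1": 0.75, "s2": 0.25},
}

def verb_value(C, n_i, n_k, s_j):
    """Return C_ijk: the verb's coefficient on sentence basis s_j, given
    subject basis n_i and object basis n_k (0.0 where nothing is stored)."""
    return C.get((n_i, n_k), {}).get(s_j, 0.0)

print(verb_value(C, "fluffy", "fast", "s1"))   # 0.75
```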
The key idea behind the use of the tensor product is that it captures
the interaction of the subject and object in the case of a transitive verb.
The fact that the tensor product effectively retains all the information from
the combining spaces — which is why the size of the tensor space grows so
quickly — is what allows this interaction to be captured. The final part of
Section 5 presses this point further by considering a simpler, but conceptually
less effective, alternative to the tensor product: the direct sum of vector spaces.
We have now answered one of the key questions for this chapter: what
is the semantic type corresponding to a particular syntactic type? The next
section answers the remaining question: how can the transitive verb in (5)
be combined with instances of a subject and object to give a vector in the
sentence space?
7 A similar intuition lies behind the use of matrices to represent the meanings of
adjectives in Baroni & Zamparelli (2010).
[Figure 5 depicts a two-dimensional space with basis vectors True and False; the vector for dog chases cat lies close to the True basis vector, while the vector for apple chases orange lies close to the False basis vector.]

Figure 5. An example “plausibility space” for sentences
4 Composition for an Example Sentence Space
In this section we will use an example sentence space which can be thought
of as a “plausibility space”. Note that this is a fictitious example, in that no
such space has been built and no suggestions will be made for how it might
be built; however, it is a useful example because the sentence space is small,
with only two dimensions, and conceptually simple and easy to understand.
The next section will describe a more complex sentence space which has been
implemented. Figure 5 gives two example vectors in the plausibility space,
which has basis vectors corresponding to True and False (which can also be
thought of as “highly plausible” and “not at all plausible”). The sentence dog
chases cat is considered highly plausible, since it is close to the True basis
vector, whereas apple chases orange is considered highly implausible, since it
is close to the False basis vector.
For the rest of this section we will use the example sentence dog chases cat,
and show how a vector can be built for this sentence in the plausibility space,
assuming vectors for both nouns and the transitive verb already exist. The
           fluffy   run   fast   aggressive   tasty   buy   juice   fruit
−→dog        0.8    0.8    0.7        0.6       0.1   0.5     0.0     0.0
−→cat        0.9    0.8    0.6        0.3       0.0   0.5     0.0     0.0
−−−→apple      0.0    0.0    0.0        0.0       0.9   0.9     0.8     1.0
−−−−→orange     0.0    0.0    0.0        0.0       1.0   0.9     1.0     1.0

Figure 6. Example noun vectors in N
vectors assumed for the nouns are given in Figure 6, together with example
vectors for the nouns apple and orange. Note that, again, these are fictitious
counts in the table, assumed to have been obtained from analysing a corpus
and using some appropriate weighting procedure (Curran, 2004).
The compositional framework is agnostic towards the particular noun vec-
tors, in that it does not matter how those vectors are built (e.g. using a simple
window-based method, or a dependency-based method, or a dimensionality
reduction technique such as LSA (Deerwester et al., 1990)). However, for ex-
planatory purposes it will be useful to think of the basis vectors of the noun
space as corresponding to properties of the noun, obtained using the output
of a dependency parser (Curran, 2004). For example, the count for dog corre-
sponding to the basis vector fluffy is assumed to be some weighted, normalised
count of the number of times the adjective fluffy has modified the noun dog
in some corpus; intuitively this basis vector has received a high count for dog
because dogs are generally fluffy. Similarly, the basis vector buy corresponds
to the object position of the verb buy, and apple has a high count for this
basis vector because apples are the sorts of things that are bought.
Figure 7 gives example vectors for the verbs chases and eats. Note that,
since transitive verbs live in N ⊗ S ⊗N, the basis vectors across the top of
the table are triples of the form (ni, sj , nk), where ni and nk are basis vectors
          〈fluffy,T,fluffy〉  〈fluffy,F,fluffy〉  〈fluffy,T,fast〉  〈fluffy,F,fast〉  〈fluffy,T,juice〉  〈fluffy,F,juice〉  〈tasty,T,juice〉  . . .
−−−−→chases        0.8               0.2               0.75             0.25              0.2               0.8              0.1
−−→eats          0.7               0.3               0.6              0.4               0.9               0.1              0.1

Figure 7. Example transitive verb vectors in N ⊗ S ⊗ N
          〈fluffy,T,fluffy〉  〈fluffy,F,fluffy〉  〈fluffy,T,fast〉  〈fluffy,F,fast〉  〈fluffy,T,juice〉  〈fluffy,F,juice〉  〈tasty,T,juice〉  . . .
−−−−→chases        0.8               0.2               0.75             0.25              0.2               0.8              0.1
dog,cat        0.8,0.9           0.8,0.9           0.8,0.6          0.8,0.6           0.8,0.0           0.8,0.0          0.1,0.0

Figure 8. The vector for chases together with subject dog and object cat
from N (properties of the noun) and sj is a basis vector from S (True or False
in the plausibility space).
The way to read the fictitious numbers is as follows: chases has a high
(normalised) count for the 〈fluffy,T,fluffy〉 basis vector, because it is highly
plausible that fluffy things chase fluffy things. Conversely, chases has a low
count for the 〈fluffy,F,fluffy〉 basis vector, because it is not highly implausible
that fluffy things chase fluffy things.8 Similarly, eats has a low count for the
〈tasty,T,juice〉 basis vector, because it is not highly plausible that tasty things
eat things which can be juice.
Figure 8 gives the vector for chases, together with the corresponding counts
for the subject and object in dog chases cat. For example, the (dog, cat) count
pair for the basis vectors 〈fluffy,T,fluffy〉 and 〈fluffy,F,fluffy〉 is (0.8, 0.9), mean-
ing that dogs are fluffy to an extent 0.8, and cats are fluffy to an extent 0.9.
8 The counts in the table for (ni, T, nk) and (ni, F, nk), for some ni and nk, always
sum to 1, but this is not required and no probabilistic interpretation of the counts
is being assumed.
The property counts for the subject and object are taken from the noun vec-
tors in Figure 6.
The reason for presenting the vectors in the form in Figure 8 is that a
procedure for combining the vector for chases with the vectors for dog and cat
now suggests itself. How plausible is it that a dog chases a cat? Well, we know
the extent to which fluffy things chase fluffy things, and the extent to which
a dog is fluffy and a cat is fluffy; we know the extent to which things that are
bought chase things that can be juice, and we know the extent to which dogs
can be bought and cats can be juice; more generally, for each property pair,
we know the extent to which the subjects and objects of chases, in general,
embody those properties, and we know the extent to which the particular
subject and object in the sentence embody those properties. So multiplying
the corresponding numbers together brings information from both the verb
and the particular subject and object.
The calculation for the True basis vector of the sentence space for dog
chases cat is as follows:
(−−−−−−−−−−→dog chases cat)True = 0.8 · 0.8 · 0.9 + 0.75 · 0.8 · 0.6 + 0.2 · 0.8 · 0.0 + 0.1 · 0.1 · 0.0 + . . .
The calculation for the False basis vector is similar:
(−−−−−−−−−−→dog chases cat)False = 0.2 · 0.8 · 0.9 + 0.25 · 0.8 · 0.6 + 0.8 · 0.8 · 0.0 + . . .
We would expect the coefficient for the True basis vector to be much higher
than that for the False basis vector, since dog and cat embody exactly those
elements of the property pairs which score highly on the True basis vector for
man     bites        dog
NP      NPr·S·NPl    NP
N       N⊗S⊗N        N

Cancelling the verb's NPl with the object gives NPr·S (semantic type N⊗S); cancelling the subject's NP with the remaining NPr gives S (semantic type S).

Figure 9. Example pregroup derivation with semantic types
chases, and do not embody those elements of the property pairs which score
highly on the False basis vector.
Through a particular example of the sentence space, we have now derived
the expression in Coecke et al. (2010) for combining a transitive verb, −→Ψ , with
its subject, −→π , and object, −→o :
f(−→π ⊗ −→Ψ ⊗ −→o ) = ∑_ijk Cijk 〈−→π |−→πi〉 −→sj 〈−→o |−→ok〉            (6)

               = ∑_j ( ∑_ik Cijk 〈−→π |−→πi〉 〈−→o |−→ok〉 ) −→sj      (7)
The expression 〈−→π |−→πi〉 is the Dirac notation for the inner product between −→π
and −→πi , and in this case the inner product is between a vector and one of its
basis vectors, so it simply returns the coefficient of −→π for the −→πi basis vector.
From the linguistic perspective, these inner products are simply picking out
particular properties of the subject, −→πi , and object, −→ok, and combining the
corresponding property coefficients with the corresponding coefficient for the
verb, Cijk, for a particular basis vector in the sentence space, sj .
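Equation (7) can be implemented directly. The following Python sketch uses the chapter's fictitious counts from Figures 6–8; the sums are truncated to the basis triples actually shown in Figure 7, so the output reproduces the (partial) calculations above rather than a full model:

```python
# Noun vectors (Figure 6), restricted to the properties used here:
dog = {"fluffy": 0.8, "fast": 0.7, "tasty": 0.1}   # subject vector
cat = {"fluffy": 0.9, "fast": 0.6, "juice": 0.0}   # object vector

# Verb coefficients C_ijk (Figure 7), keyed by (subject-property,
# object-property), each giving the (True, False) sentence coefficients:
chases = {
    ("fluffy", "fluffy"): (0.8, 0.2),
    ("fluffy", "fast"):   (0.75, 0.25),
    ("fluffy", "juice"):  (0.2, 0.8),
    ("tasty",  "juice"):  (0.1, 0.9),
}

def compose(subj, verb, obj):
    """Equation (7): s_j = sum over i,k of C_ijk * subj_i * obj_k."""
    true_coeff = false_coeff = 0.0
    for (i, k), (c_true, c_false) in verb.items():
        w = subj.get(i, 0.0) * obj.get(k, 0.0)
        true_coeff += c_true * w
        false_coeff += c_false * w
    return true_coeff, false_coeff

t, f = compose(dog, chases, cat)
print(round(t, 3), round(f, 3))  # 0.936 0.264
```

As expected, the True coefficient comes out much higher than the False coefficient, so dog chases cat lands near the True basis vector of the plausibility space.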
Figure 9 shows a pregroup derivation for a simple transitive verb sentence,
together with the corresponding semantic types. The point of the example is
to demonstrate how the semantic types “become smaller” as the derivation
progresses, in much the same way that the syntactic types do. The reduction,
or cancellation, rule for the semantic component is given in (7), which can be
thought of as the semantic vector-based analogue of the syntactic reduction
rules in (3) and (4). Similar to a model-theoretic semantics, the semantic
vector for the verb can be thought of as encoding all the ways in which the verb
could interact with a subject and object, in order to produce a sentence, and
the introduction of a particular subject and object reduces those possibilities
to a single vector in the sentence space.
In summary, the meaning of a sentence w1 · · ·wn with the grammatical
(pregroup) structure p1 · · · pn →α S, where pi is the grammatical type of wi,
and→α is the pregroup reduction to a sentence, can be represented as follows:
−−−−−−→w1 · · · wn = F(α)(−→w1 ⊗ · · · ⊗ −→wn)
Here we have generalised the previous discussion of transitive verbs and
extended the idea to syntactic reductions or derivations containing any types.
The point is that the semantic reduction mechanism described in this section
can be generalised to any syntactic reduction, \alpha, and there is a function, F,
which, given \alpha, produces a linear map taking the word vectors \overrightarrow{w_1} \otimes \cdots \otimes \overrightarrow{w_n}
to a sentence vector \overrightarrow{w_1 \cdots w_n}. F can be thought of as Montague's "homomorphic
passage" from syntax to semantics.[9] Coecke et al. (2010) contains a
detailed description of this more general case.
[9] Note that the input to the "meaning map" is a vector in the tensor product space
obtained by combining the semantic types for all the words in the sentence; however,
this vector will never be built in practice, only the vectors for the individual
words.
5 A Real Sentence Space
The sentence space in this section is taken from Grefenstette et al. (2011)
and Grefenstette & Sadrzadeh (2011). The sentence space is designed to work
with transitive verbs, and exploits a similar intuition to that used to define
the verb vector in the previous section: a vector for a transitive verb sentence
consists of pairs of properties reflecting both the properties of the subject and
object of the verb, and the properties of the subjects and objects that the
verb has in general. Pairs of properties are encoded in the N \otimes N space:

\overrightarrow{dog\ chases\ cat} \in N \otimes N

Given that S = N \otimes N, the semantic type of a transitive verb is as follows:

\overrightarrow{chases} \in N \otimes N \otimes N \otimes N
However, in this section, following Grefenstette et al. (2011) and Grefenstette
& Sadrzadeh (2011), the semantic type of the verb will be the same as the
sentence space:
\overrightarrow{chases} \in N \otimes N
The reason for restricting the sentence space in this way is that the
interpretation of a particular basis vector (n_i, n_k) from N \otimes N, when defining
a transitive verb, is clear: the coefficient for (n_i, n_k) should reflect the
extent to which subjects of the verb embody the n_i noun property, and the
extent to which objects of the verb embody the n_k property. One way to
consider the verb space is that we are effectively ignoring those basis vectors in
          ⟨fluffy,fluffy⟩  ⟨fluffy,fast⟩  ⟨fluffy,juice⟩  ⟨tasty,juice⟩  ⟨tasty,buy⟩  ⟨buy,fruit⟩  ⟨fruit,fruit⟩  ...
chases        0.8             0.75           0.2             0.1            0.2          0.2           0.0
eats          0.7             0.6            0.9             0.1            0.1          0.7           0.1

Figure 10. Example transitive verb vectors in N \otimes N
N \otimes N \otimes N \otimes N in which the properties of the subject and object (corresponding
to the outermost N vectors in the tensor product space) do not match
those of the property pairs from the sentence (the innermost N vectors).
Figure 10 shows the verb vectors from the plausibility space example,
but this time with the basis vectors as pairs of noun properties, rather than
triples consisting of pairs of noun properties together with a True or False
basis vector from the plausibility space. The intuition for the fictitious counts
is the same as before: \overrightarrow{chases} has a high coefficient for the ⟨fluffy,fluffy⟩ basis
vector because, in general, fluffy things chase fluffy things.
Another reason for defining the verb space as N ⊗ N is that there is a
clear experimental procedure for obtaining verb vectors from corpora: simply
count the number of times that particular property pairs from the subject and
object appear with a particular verb in the corpus. Suppose that we have a
corpus consisting of only two sentences: dog chases cat and cat chases mouse.
To obtain the vector for chases, we first increment counts for all property pairs
corresponding to (dog, cat), and then increment the counts for all property
pairs corresponding to (cat, mouse) (where the counts come from the noun
vectors, which we assume have already been built). Hence in this example
we would obtain evidence for fluffy things chasing fluffy things from the first
sentence (since dogs are fluffy and cats are fluffy); for aggressive things chasing
fluffy things (since dogs can be aggressive and cats are fluffy); some evidence
for things that can be bought chasing things that are brown from the second
sentence (since cats can be bought and some mice are brown); and so on for
all property pairs and all examples of transitive verbs in the corpus.[10]

                ⟨fluffy,fluffy⟩  ⟨fluffy,fast⟩  ⟨fluffy,juice⟩  ⟨tasty,juice⟩  ⟨tasty,buy⟩  ⟨buy,fruit⟩  ⟨fruit,fruit⟩  ...
chases              0.8             0.75           0.2             0.1           0.2          0.2           0.0
dog, cat            0.8,0.9         0.8,0.6        0.8,0.0         0.1,0.0       0.1,0.5      0.5,0.0       0.0,0.0
dog chases cat      0.58            0.36           0.0             0.0           0.01         0.0           0.0

Figure 11. Combining a transitive verb in N \otimes N with a subject and object to give a sentence in N \otimes N
Given this particular verb space (N⊗N), and continuing with the small
corpus example, there is a neat expression for the meaning of a transitive verb
(Grefenstette & Sadrzadeh, 2011):
\overrightarrow{chases} = \overrightarrow{dog} \otimes \overrightarrow{cat} + \overrightarrow{cat} \otimes \overrightarrow{mouse}

where \overrightarrow{dog} \otimes \overrightarrow{cat} is the Kronecker product of \overrightarrow{dog} and \overrightarrow{cat} defined in Section 3.
This expression generalises in the obvious way to a corpus containing many
instances of chases.
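Under this construction the verb vector is literally a sum of Kronecker products of the observed subject and object vectors. A minimal numpy sketch with invented 2-dimensional noun vectors (the property values are illustrative, not from the text):

```python
import numpy as np

# Toy noun vectors over two properties, e.g. (fluffy, buy); values invented.
dog = np.array([0.8, 0.5])
cat = np.array([0.9, 0.5])
mouse = np.array([0.4, 0.6])

# chases = dog (x) cat + cat (x) mouse: one Kronecker product per
# (subject, object) pair seen with the verb in the two-sentence corpus.
chases = np.kron(dog, cat) + np.kron(cat, mouse)
```

Each entry of chases accumulates evidence for one pair of noun properties, so a larger corpus simply adds further Kronecker products into the same vector.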
Figure 11 shows how a transitive verb in N⊗N combines with a subject
and object, using the same compositional procedure described in the previous
section. Again the intuition is similar, but now we end up with a sentence
vector in a much larger space, N \otimes N, compared with the 2-dimensional
plausibility space used earlier. The resulting sentence vector can be thought of
informally as answering the following set of questions: to what extent, in the
sentence, are fluffy things interacting with fluffy things? (in this case to a large
extent, since both subject and object are fluffy, and the verb in the sentence,
chase, is such that fluffy things do chase fluffy things); to what extent, in
the sentence, are things that can be bought interacting with things that can
be juice? (in this case to a small extent, since although the subject can be
bought, the object is not something that can be juice, and the verb chase does
not generally relate things that can be bought with things that can be juice);
and so on for all the noun property pairs in N \otimes N.

[10] For examples where the subject and object are themselves compositional, we take
the counts from the head noun of the noun phrase.
Again, given this particular sentence space N⊗N, we end up with a neat
expression for how to combine a transitive verb with its subject and object
(Grefenstette & Sadrzadeh, 2011):
\overrightarrow{dog\ chases\ cat} = \overrightarrow{chases} \odot (\overrightarrow{dog} \otimes \overrightarrow{cat})    (8)

where \odot is pointwise multiplication. Expressing the combination of verb and
arguments in this way makes it clear how both information from the arguments
and the verb itself is used to produce the sentence vector. For each property
pair in the sentence space, \overrightarrow{dog} \otimes \overrightarrow{cat} captures the extent to which the arguments
of the verb satisfy those properties, and the coefficient for each property pair
is multiplied by the corresponding coefficient for the verb, which captures the
extent to which arguments of the verb in general satisfy those properties. But
note that this neat expression only arises because of the choice of N ⊗N as
the verb and sentence space; the expression above is not true in the general
case.
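With the N \otimes N choice, then, composition is just an elementwise product in code. A small sketch continuing the toy numbers above (all values invented):

```python
import numpy as np

dog = np.array([0.8, 0.5])
cat = np.array([0.9, 0.5])

# Invented verb vector in N (x) N, e.g. built by the counting procedure.
chases = np.array([0.8, 0.2, 0.2, 0.1])

# Equation (8): dog chases cat = chases ⊙ (dog ⊗ cat)
dog_chases_cat = chases * np.kron(dog, cat)
```

The Kronecker product supplies the argument property pairs, and the elementwise product weights each pair by how strongly the verb associates with it.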
The question of how to evaluate the models described in this chapter, and
compositional distributional models more generally, is an important one, but
one that will be discussed only briefly here. The previous chapter discussed
the issue of evaluation, and described a recent method of using compositional
distributional models to disambiguate verbs in context (Mitchell & Lapata,
2008). The task is to use a compositional distributional model to assign
similarity scores to pairs such as (the face glowed, the face beamed) and (the face
glowed, the face burned), and compare these scores to human judgements. In
this case the first pair would be expected to obtain the higher score, but if
the subject were fire rather than face, then burned rather than beamed would
score higher. Hence in this case the subject of the intransitive verb is
effectively being used to disambiguate the verb.
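The model's scores in such evaluations are usually cosine similarities between the composed sentence vectors. A sketch of the comparison, with hypothetical composed vectors standing in for real model output:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of the two normalised vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical composed sentence vectors (not real model output).
face_glowed = np.array([0.9, 0.1, 0.2])
face_beamed = np.array([0.8, 0.2, 0.3])
face_burned = np.array([0.1, 0.9, 0.2])

# A good compositional model should rank the paraphrase pair higher:
# cosine(face_glowed, face_beamed) > cosine(face_glowed, face_burned)
```

These per-pair scores are then correlated with the human similarity judgements to evaluate the model.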
Grefenstette & Sadrzadeh (2011) extend this evaluation to transitive verbs,
so that there is now an object as well as subject, with the idea that having an
additional argument makes this a stronger test of compositionality. Here the
pairs are cases such as (the people tried the door, the people tested the door) and
(the people tried the door, the people judged the door). In this case the first pair
would be expected to get the higher similarity score, whereas if the subject
and object were tribunal and crime, then judged would be expected to score
higher than tested. Grefenstette & Sadrzadeh (2011) find that the model
described in this section performs at least as well as the best-performing method
from Mitchell & Lapata (2008), which involves building vectors for verbs and
arguments using the context method (not treating verbs as relational) and
then using pointwise multiplication to combine the verb with its arguments.
The previous chapter discusses this method of evaluation and suggests
that it is essentially a method of disambiguation, and so it is perhaps not
surprising that multiplicative methods perform so well, since the contextual
elements which the verb and arguments have in common will be emphasised
in the multiplicative combination. Note that, given the choice of N⊗N as the
verb and sentence space, the compositional procedure for the model in this
section reduces to a multiplicative combination (equation 8).
In the final part of this section we return to the question of why the
tensor product is used to bind the vector spaces together in relational types,
by comparing it with an alternative, the direct sum. The direct sum is also
an operator on vector spaces, but rather than create basis vectors by taking
the cartesian product of the basis vectors of the spaces being combined, it
effectively retains the basis vectors of the combining spaces as independent
vectors. So the number of dimensions for the direct sum of V and W, V \oplus W,
is |V| + |W|, where |V| is the number of dimensions in V, rather than |V| \cdot |W|
as in the tensor product case.
If the direct sum were being used for the semantic type of a transitive
verb, then the vector space in which transitive verbs live would be N⊕S⊕N
and the general expression for a transitive verb vector Ψ would be as follows:
\Psi = \sum_{ijk} C_i \overrightarrow{n_i} + C_j \overrightarrow{s_j} + C_k \overrightarrow{n_k} \in N \oplus S \oplus N    (9)
The obvious way to adapt the method for building verb vectors detailed in
the previous section to a direct sum representation is as follows. Rather than
have the verb and sentence live in the N⊗N space, as before, suppose now that
verbs and sentences live in N \oplus N. Again suppose that we have a corpus
consisting of only two sentences: dog chases cat and cat chases mouse. To obtain
the vector for chases, we follow a similar procedure to before: first increment
counts for all property pairs corresponding to (dog, cat), and then increment
the counts for all property pairs corresponding to (cat, mouse) (where again
the counts come from the noun vectors, which we assume have already been
built). But now there is a crucial difference: when we increment the counts for
the (fluffy, buy) pair, for example, there is no interaction between the subject
and object. Whereas in the tensor product case we were able to represent
the fact that fluffy things chase things that can be bought, in the direct sum
case we can only represent the fact that fluffy things chase, and that things
that can be bought get chased, but not the combination of the two. So more
generally the direct sum can represent the properties of subjects of a verb,
and the properties of objects, but not the pairs of properties which are seen
together with the verb.

Figure 12. The verb needs to interact with its subject and object, which it does with the tensor product on the left, but not with the direct sum on the right
Figure 12 is taken from Coecke et al. (2010) and expresses the interactive
nature of the tensor product on the left, as opposed to the direct sum on the
right. The figure is intended to represent the fact that, with the direct sum,
the subject, object and resulting sentence are entirely independent of each
other.
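The contrast can also be seen directly in code: the Kronecker product produces one coefficient per subject–object pair of properties, while the direct sum merely concatenates the two vectors. A toy sketch (invented vectors):

```python
import numpy as np

subj = np.array([0.8, 0.5])       # |V| = 2 dimensions
obj = np.array([0.9, 0.5, 0.1])   # |W| = 3 dimensions

tensor = np.kron(subj, obj)           # |V| * |W| = 6: one entry per pair
direct = np.concatenate([subj, obj])  # |V| + |W| = 5: no pairing at all

# Each entry of tensor multiplies a subject property by an object property,
# so the two arguments interact; every entry of direct comes from one
# argument only.
```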
6 Conclusion and Further Work
There are some obvious ways in which the existing work could be extended.
First, the present chapter has only presented a procedure for building
vectors for sentences with a simple transitive verb structure. The mathematical
framework in Coecke et al. (2010) is general and in principle applies to any
syntactic reduction from any sequence of syntactic types, but how to build
relational vectors for all complex types is an open question. Second, it needs
to be demonstrated that the compositional distributional representations
presented in this chapter can be useful for language processing tasks and
applications. Third, there is a large part of natural language semantics, much of
which is the focus of traditional formal semantics, such as logical operators,
quantification, inference, and so on, which has been ignored in this chapter.
We see distributional models of semantics as essentially providing a
semantics of similarity, but whether the more traditional questions of semantics
can be accommodated in distributional semantics is an interesting and open
question.[11] Fourth, there is the question of whether the vectors for relational
types can be driven more from a machine learning perspective, rather than
have their form determined by linguistic intuition (as was the case for the
verb vectors living in N⊗N). Addressing this question would bring together
the work in distributional semantics with the recent work in so-called deep
learning in the neural networks research community.[12] And finally, there is the
question of how sentences should be represented as vectors. Two possibilities

[11] There is some preliminary work in this direction, e.g. (Preller & Sadrzadeh, 2009;
Clarke, 2008; Widdows, 2004).
[12] See the proceedings for the NIPS workshops in 2010 and 2011 on Deep Learning
and Unsupervised Feature Learning.
have been suggested in this chapter, but there are presumably many more;
it may be that the form of the sentence space should be determined by the
particular language processing task in hand, e.g. sentiment analysis (Socher
et al., 2011).
In summary, this chapter is a presentation of Coecke et al. (2010) de-
signed to be accessible to computational linguists. The key innovations in this
work are the use of complex vector spaces for relational types such as verbs,
through the use of the tensor product, and a general method for composing
a relational type (represented as a vector in a tensor product space) with its
vector arguments.
7 Acknowledgements
Almost all of the ideas in this chapter have arisen from discussions with
Mehrnoosh Sadrzadeh, Ed Grefenstette, Bob Coecke and Stephen Pulman.
References
Baroni, M. & R. Zamparelli (2010), Nouns are vectors, adjectives are matrices:
Representing adjective-noun constructions in semantic space, in Conference on
Empirical Methods in Natural Language Processing (EMNLP-10), Cambridge,
MA.
Buszkowski, Wojciech & Katarzyna Moroz (2006), Pregroup grammars and context-
free grammars, in C. Casadio & J. Lambek (eds.), Computational algebraic ap-
proaches to natural language, Polimetrica, (1–21).
Clark, Stephen & James R. Curran (2007), Wide-coverage efficient statistical parsing
with CCG and log-linear models, Computational Linguistics 33(4):493–552.
Clarke, Daoud (2008), Context-theoretic Semantics for Natural Language: An Alge-
braic Framework, Ph.D. thesis, University of Sussex.
Coecke, B., M. Sadrzadeh, & S. Clark (2010), Mathematical foundations for a
compositional distributional model of meaning, in J. van Benthem, M. Moortgat,
& W. Buszkowski (eds.), Linguistic Analysis (Lambek Festschrift), volume 36,
(345–384).
Curran, James R. (2004), From Distributional to Semantic Similarity, Ph.D. thesis,
University of Edinburgh.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, &
Richard Harshman (1990), Indexing by latent semantic analysis, Journal of the
American Society for Information Science 41(6):391–407.
Dowty, D.R., R.E. Wall, & S. Peters (1981), Introduction to Montague Semantics,
Dordrecht.
Grefenstette, Edward & Mehrnoosh Sadrzadeh (2011), Experimental support for a
categorical compositional distributional model of meaning, in Proceedings of the
2011 Conference on Empirical Methods in Natural Language Processing, Asso-
ciation for Computational Linguistics, Edinburgh, Scotland, UK., (1394–1404).
Grefenstette, Edward, Mehrnoosh Sadrzadeh, Stephen Clark, Bob Coecke, &
Stephen Pulman (2011), Concrete sentence spaces for compositional distribu-
tional models of meaning, in Proceedings of the 9th International Conference on
Computational Semantics (IWCS-11), Oxford, UK, (125–134).
Hockenmaier, Julia & Mark Steedman (2007), CCGbank: a corpus of CCG deriva-
tions and dependency structures extracted from the Penn Treebank, Computa-
tional Linguistics 33(3):355–396.
Lambek, Joachim (1958), The mathematics of sentence structure, American Math-
ematical Monthly (65):154–170.
Lambek, Joachim (2008), From Word to Sentence. A Computational Algebraic Ap-
proach to Grammar, Polimetrica.
Lawvere, F. William & Stephen Hoel Schanuel (1997), Conceptual Mathematics: A
First Introduction to Categories, Cambridge University Press.
Mitchell, Jeff & Mirella Lapata (2008), Vector-based models of semantic composi-
tion, in Proceedings of ACL-08, Columbus, OH, (236–244).
Moortgat, Michael (1997), Categorial type logics, in Johan van Benthem & Alice
ter Meulen (eds.), Handbook of Logic and Language, Elsevier, Amsterdam and
MIT Press, Cambridge MA, chapter 2, (93–177).
Nielsen, Michael A. & Isaac L. Chuang (2000), Quantum Computation and Quantum
Information, Cambridge University Press.
Preller, Anne & Mehrnoosh Sadrzadeh (2009), Bell states as negation in natural
languages, ENTCS, QPL.
Shieber, Stuart M. (1985), Evidence against the context-freeness of natural language,
Linguistics and Philosophy 8:333–343.
Socher, Richard, Jeffrey Pennington, Eric Huang, Andrew Y. Ng, & Christopher D.
Manning (2011), Semi-supervised recursive autoencoders for predicting senti-
ment distributions, in Proceedings of the Conference on Empirical Methods in
Natural Language Processing, Edinburgh, UK.
Steedman, Mark (2000), The Syntactic Process, The MIT Press, Cambridge, MA.
Steedman, Mark & Jason Baldridge (2011), Combinatory categorial grammar, in
Robert Borsley & Kersti Borjars (eds.), Non-Transformational Syntax: Formal
and Explicit Models of Grammar, Wiley-Blackwell.
Vijay-Shanker, K. & David Weir (1993), Parsing some constrained grammar for-
malisms, Computational Linguistics 19:591–636.
Weir, David (1988), Characterizing Mildly Context-Sensitive Grammar Formalisms,
Ph.D. thesis, University of Pennsylvania.
Widdows, Dominic (2004), Geometry and Meaning, CSLI Publications, Stanford
University.