
A vector space model of semantics using Canonical Correlation Analysis

Paramveer Dhillon
Department of Computer Science
University of Pennsylvania
[email protected]

Dean P. Foster
Statistics
University of Pennsylvania
[email protected]

Lyle H. Ungar
Department of Computer Science
University of Pennsylvania
[email protected]

Abstract

We present an efficient method that uses canonical correlation analysis (CCA) between words and their contexts (i.e., the neighboring words) to estimate a real-valued vector for each word that characterizes its “hidden state” or “meaning”. The use of CCA allows us to prove theorems characterizing how accurately we can estimate this hidden state. Recently developed algorithms for computing the required singular vectors make it easy to compute models for billions of words of text with vocabularies in the hundreds of thousands. Experiments on the Google n-gram collection show that CCA between words and their contexts provides a mapping from each word to a low dimensional feature vector that captures information about the meanings of the word. Unlike latent semantic analysis, which uses PCA, our method takes advantage of the information implicit in the word sequences.

1 The problem of state estimation

Many people have clustered words based on their distributional similarity (see, among many articles, (Pereira et al., 1993)), but such clustering ignores the many different dimensions on which similarity could be computed. We instead characterize words using a vector space model (Turney and Pantel, 2010). We present a method for learning language models that estimates a “state” or latent variable representation for words based on their context. The vector learned for each word captures a wide variety of information about it, allowing us to predict the word’s part of speech, linguistic features such as animacy, membership in a wide variety of semantic categories (foods, drinks, colors, males, females, ...), and direction of sentiment (happy vs. sad).

More precisely, we estimate a hidden state associated with words by computing the dominant canonical correlations between target words and the words in their immediate context. The main computation, finding the singular value decomposition of a scaled version of the co-occurrence matrix of counts of words with their contexts, can be done highly efficiently. Use of CCA also allows us to prove theorems about the optimality of our reconstruction of the state.

Our CCA-based multi-view learning method can be thought of as a generalization of widely used latent semantic analysis (LSA) methods. LSA (sometimes called LSI, when used for indexing) uses a principal component analysis (PCA) of the co-occurrence matrix between words and the documents they occur in to learn a latent vector representation of each word type (Landauer et al., 2008). We extend this method in two ways: (1) By looking at the correlation between words and the words in their context, we find a latent vector for each word that depends upon the sequence of nearby words, unlike the bag-of-words model standardly used in LSA. (2) We rescale the covariance matrix between words and context to give a correlation matrix (i.e., use CCA), allowing us to prove theorems about how accurately we can estimate the state. Unlike PCA, as we show below, in CCA the dominant singular vectors are guaranteed to capture the key state information. Importantly, our method scales well, handling gigaword corpora with vocabularies of tens of thousands of words on a small computer.

Using CCA to estimate a latent vector for each word based on the contexts it appears in gives a significantly different (and, as we show below, often more useful) representation than, for example, taking the PCA of the same data used to generate the CCA.

We show below how to efficiently compute a vector that characterizes each word type by using the right singular vectors of the above CCA to map from the word space (size v) to the state space (size k). We call this mapping the attribute dictionary for words, as it associates with every word a vector that captures that word’s attributes. As will be made clear below, the attribute dictionary is arbitrary up to a rotation, but captures the information needed for any linear model to predict properties of the words, such as part of speech or word sense.

Characterizing words by a low dimensional (e.g., 50-dimensional) vector as we do is valuable in that words can be partitioned in many ways. They can be grouped based on part of speech, or by many aspects of their “meaning”, including features such as animacy, gender, whether the word describes something that is good to eat or drink, whether it refers to a year or to a small number, etc. Such features can be used, for example, as features for word sense disambiguation or named entity disambiguation, or to help improve parsing algorithms.

This paper:

• presents a simple CCA-based spectral method for estimating state in language models,
• gives an efficient implementation of our CCA method,
• proves that we provide an optimal reconstruction of the state associated with a word (given, of course, some assumptions), and
• applies this method to the Google n-gram collection and gives some examples showing what features are captured by the state.

2 CCA-based language modeling

The basic data we use is a matrix, A, of co-occurrence counts between word and context, where each column corresponds to a word in the vocabulary (size v) and each row corresponds to a word in the vocabulary at each location relative to the target word. For example, for trigrams, A_{2v×v} would have rows for each possible word immediately before the target word and for each word immediately after the target word. Five-grams give a similar matrix, A_{4v×v}.
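As a concrete illustration, the sketch below shows one way such a 2v × v trigram count matrix could be assembled. It is not code from the paper: the `trigram_counts` dictionary (mapping (left, target, right) triples to counts) and the vocabulary list are assumed inputs for the example.

```python
from scipy.sparse import lil_matrix, csr_matrix

def build_trigram_count_matrix(trigram_counts, vocab):
    """Assemble the (2v x v) context-by-word count matrix A from trigrams.

    Rows 0..v-1 index the word immediately before the target;
    rows v..2v-1 index the word immediately after the target.
    """
    v = len(vocab)
    index = {w: i for i, w in enumerate(vocab)}
    A = lil_matrix((2 * v, v))
    for (left, target, right), count in trigram_counts.items():
        if left in index and target in index and right in index:
            A[index[left], index[target]] += count       # left-context row
            A[v + index[right], index[target]] += count  # right-context row
    return csr_matrix(A)
```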

The key idea is to compute the singular value decomposition (SVD) of A = Ψ Λ Φ^T or, as described below, of a rescaled version of A which contains the correlations between words and their contexts, rather than the covariances (the counts). We then use the left and right singular vectors Ψ and Φ to estimate the hidden state associated with each word.

The k “largest” right singular vectors φ_i of A, i.e., the solutions of A^T A φ_i = λ_i φ_i with the largest eigenvalues λ_i, form the matrix Φ (our “attribute dictionary”), where each of the v rows is a k-dimensional estimate of a latent vector associated with one of the words in the vocabulary.

The k “largest” left singular vectors ψ_i of A, i.e., the solutions of A A^T ψ_i = λ_i ψ_i, form the matrix Ψ, which gives a mapping from a context to the state associated with it. (That is, an estimate of the state of the target word inside the context.)

When A is taken to be the correlation matrix between word and context, both of these estimates are optimal in a sense which is made precise below.
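A minimal sketch of this step, assuming a count matrix A such as the one built above; here `scipy.sparse.linalg.svds` stands in for the randomized SVD of Section 3, and the rescaling that turns counts into correlations (Section 2.1) is omitted.

```python
import numpy as np
from scipy.sparse.linalg import svds

def attribute_dictionary(A, k):
    """Return (Psi, Phi) from a truncated SVD of the (2v x v) matrix A.

    Each of the v rows of Phi is a k-dimensional latent vector for one
    word type; Psi maps contexts to the state space.
    """
    U, s, Vt = svds(A.asfptype(), k=k)
    order = np.argsort(-s)          # svds returns ascending singular values
    Psi = U[:, order]
    Phi = Vt[order, :].T            # v x k "attribute dictionary"
    return Psi, Phi
```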

2.1 Theoretical properties

We now discuss how well the hidden state can be estimated from the target word. (A similar, but slightly more technical, result can be derived for estimating hidden state from the context.) The state estimated is arbitrary up to any linear transformation, so all our comments address our ability to use the state to estimate some label which depends linearly on the state.

We start with a document consisting of n tokens w_1, w_2, ..., w_n drawn from a vocabulary of v words (actually, it is the concatenation of a vast number of documents). Let W and C be matrices in which the i-th row describes either the i-th token, w_i (for W), or its context – the words to its right and left – (for C). In W, we represent the presence of the j-th word type in the i-th position in a document by setting matrix element w_ij = 1. C is similar, but has columns for each word in each position in the context. For example, for a trigram representation with vocabulary size v words, C has 2v columns – one for each possible word to the left of the target word and one for each possible word to the right of the target word.
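The following sketch builds W and C for the trigram case from a plain token list; it is illustrative only, and assumes every token is in the vocabulary.

```python
from scipy.sparse import lil_matrix, csr_matrix

def build_indicator_matrices(tokens, vocab):
    """Build the token-by-word matrix W (n x v) and the token-by-context
    matrix C (n x 2v): one column per possible left-context word and one
    per possible right-context word."""
    v = len(vocab)
    index = {w: i for i, w in enumerate(vocab)}
    n = len(tokens)
    W = lil_matrix((n, v))
    C = lil_matrix((n, 2 * v))
    for i in range(1, n - 1):                # skip tokens lacking a full context
        W[i, index[tokens[i]]] = 1
        C[i, index[tokens[i - 1]]] = 1       # word to the left
        C[i, v + index[tokens[i + 1]]] = 1   # word to the right
    return csr_matrix(W), csr_matrix(C)
```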

Then A_cw = C^T W contains the counts of how often each word w occurs in each context c, A_cc = C^T C gives the covariance of the contexts, and A_ww = W^T W, the word covariance matrix, is a diagonal matrix with the counts of each word. Canonical Correlation Analysis (CCA) (Hotelling, 1935; Hardoon and Shawe-Taylor, 2008) can be viewed as finding the singular value decomposition of the correlation matrix[1]

A_cc^{-1/2} A_cw A_ww^{-1/2} = Ψ D Φ^T.    (1)

[1] The formulation with square roots of the covariances, while theoretically nice, can be computationally expensive, so many presentations of CCA use an alternate formulation in which the right canonical correlates are given by the eigenvectors φ of A_ww^{-1} A_wc A_cc^{-1} A_cw and, similarly, the left canonical correlates are given by the eigenvectors ψ of A_cc^{-1} A_cw A_ww^{-1} A_wc. Like PCA, CCA can be cast as an eigenvalue problem on a covariance matrix, but it can also be interpreted as deriving from a generative mixture model (Bach and Jordan, 2005). See (Hardoon et al., 2004) for a review of CCA with applications to machine learning.

Keeping the dominant singular vectors in Φ and Ψ provides two different bases for estimated state. Each is optimal in its own way, as explained below.
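For a small vocabulary, Eq. 1 can be formed directly. The dense sketch below illustrates the computation (it is not the scalable procedure of Section 3); the small ridge term `eps` is an added assumption to keep the inverse square roots well defined.

```python
import numpy as np

def cca_via_svd(W, C, k, eps=1e-8):
    """Dense illustration of Eq. 1: SVD of Acc^{-1/2} Acw Aww^{-1/2}."""
    W = np.asarray(W.todense()) if hasattr(W, "todense") else np.asarray(W)
    C = np.asarray(C.todense()) if hasattr(C, "todense") else np.asarray(C)
    Acw = C.T @ W                                   # context-word counts
    Acc = C.T @ C + eps * np.eye(C.shape[1])        # context covariance
    Aww = W.T @ W + eps * np.eye(W.shape[1])        # diagonal word counts

    def inv_sqrt(M):
        vals, vecs = np.linalg.eigh(M)
        return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

    corr = inv_sqrt(Acc) @ Acw @ inv_sqrt(Aww)
    Psi, D, PhiT = np.linalg.svd(corr, full_matrices=False)
    return Psi[:, :k], D[:k], PhiT[:k, :].T         # rows of Phi index words
```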

The following Theorem 1 shows that the right canonical correlates give an optimal approximation to the state of each word (in the sense of being able to estimate an emission or label Y for each state), subject to only using the word and not its context.

Theorem 1 Let (W_t, C_t, Y_t) for t = 1 . . . n be n tuples of random variables drawn IID from some distribution (pdf or pmf) D(w_t, c_t, y_t). We call the pair (Y_1 . . . Y_n, β) a linear context problem if

1. Y_t is a linear function of the context (i.e., Y_t = α^T C_t),

2. β^T W_t is the best linear estimator of Y_t given W_t, namely β minimizes E Σ_t (Y_t − β^T W_t)², and

3. Var(Y_t) ≤ 1.

Define φ_i to be the i-th right singular vector for the SVD of Eq. 1 with A_cc = E(CC^T), A_cw = E(CW^T), and A_ww = E(WW^T), where (W, C) are drawn from the marginal distribution D(w, c). Then, for all ε > 0 there exists a k such that for any linear context problem (Y_1 . . . Y_n, β) there exists a γ ∈ R^k such that Ŷ_t = Σ_{i=1}^{k} γ_i φ_{it} is a good approximation to Y_t, in the sense that E(Ŷ_t − β^T W_t)² ≤ ε.

Proof of Theorem 1: Without loss of generality, we can assume that W and C have been transformed to their canonical correlation coordinate space. So Var(W) is the identity, Var(C) is the identity, and Cov(W, C) is a diagonal matrix with non-increasing values ρ_i on the diagonal (namely the correlations / singular values). We can write α and β in this coordinate system. By orthogonality we now have β_i = ρ_i α_i. So E(Y − β^T W)² is simply Σ_i (α_i² + β_i² − 2ρ_i α_i β_i), which is Σ_i α_i² (1 − ρ_i²). Our estimator will then have γ_i = β_i for i ≤ k and γ_i = 0 otherwise. Hence E(Ŷ − β^T W)² = Σ_{i=k+1}^{∞} β_i². So if we pick k to include all terms which have ρ_i ≥ √ε, our error will be less than ε Σ_{i=k+1}^{∞} α_i² ≤ ε. q.e.d.

To understand the above theorem, note that we would have liked to have a linear regression predicting some label y from the original data w. However, the original data is very high dimensional. Instead, we can first use CCA to map high dimensional vectors w to lower dimensional vectors φ, from which y can be predicted. For example, with a few labeled examples of the form (w, y), we can recover the γ_i parameters using linear regression. The φ subspace is guaranteed to hold a good approximation. A special case of interest occurs when estimating a label Z which is α^T C_t plus zero mean noise. In this case, one can pick Y = E(Z) and proceed as above. This effectively extends the theorem to the case where the mapping from C to Y is random, not deterministic.
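A hedged sketch of this use of the theorem: given the attribute dictionary Phi and a handful of labeled words, the γ coefficients can be fit by ordinary least squares. The word and label inputs here are placeholders, not data from the paper.

```python
import numpy as np

def fit_gamma(Phi, word_index, labeled_words, labels):
    """Fit gamma so that gamma . phi_w approximates the label y for word w."""
    X = np.vstack([Phi[word_index[w]] for w in labeled_words])  # rows: phi vectors
    y = np.asarray(labels, dtype=float)
    gamma, *_ = np.linalg.lstsq(X, y, rcond=None)
    return gamma

# A new word's label is then predicted as Phi[word_index[word]] @ gamma.
```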

Note that if we had used covariance rather than correlation (i.e., not normalized by A_cc^{-1/2} and A_ww^{-1/2}, and hence computed the canonical covariance rather than the canonical correlation), as done by LSA/PCA, then in the worst case the key singular vectors for predicting state could be those with arbitrarily small singular values. This corresponds to the fact that for principal component regression (PCR), there is no guarantee that the largest principal components will prove predictive of an associated label. In practice, one can often still get good estimates for our method without the normalization.

One can think of Theorem 1 as implicitly estimating a k-dimensional hidden state from the observed W. This hidden state can be used to estimate Y. Note that for Theorem 1, the state estimate is “trivial” in the sense that, because it comes from the words, not the context, every occurrence of each word must give the same state estimate. This is attractive in that it associates a latent vector with every word type, but limiting in that it does not allow for any word ambiguity. The left canonical vectors allow one to estimate state from the context of a word, giving different state estimates for the same word in different contexts, as is needed for word sense disambiguation. To keep this paper simple, we will not discuss this approach, but instead focus on the simpler use of right canonical covariates to map each word type to a k-dimensional vector. Below, we show examples illustrating the information captured by this state estimate.

3 Efficient estimation algorithm

Recent advances in algorithms for computing SVD decompositions of matrices make it easy to compute the singular values of large matrices (Halko et al., 2011), thus making spectral methods easy to use on data with very large vocabularies, such as those in the Google n-gram collection. The key idea is to find a lower dimensional basis for A, and to then compute the singular vectors in that lower dimensional basis. The initial basis is generated randomly, and taken to be slightly larger than the eventual basis. If A is vd × v, and we seek a state of dimension k, we start with a (k+l) × vd matrix Ω of random numbers, where l is the number of “extra” basis vectors, between 0 and k. We then project A onto this matrix and take the SVD decomposition of the resulting matrix, ΩA = U_1 D_1 V_1^T. Since ΩA is only (k+l) × v, this is much cheaper than working on the original matrix A. We keep the largest k components of U_1 and of V_1, which form a left and a right basis for A, respectively. The rank-k matrices U U^T A and A V V^T form good approximations to A. The matrix V approximates the dominant right singular vectors of A, Φ, and U approximates the left singular vectors, Ψ. Better approximations can be found by iterating. We take two more iterations, as summarized in Algorithm 1. First find the SVD of A V_1 = U_2 D_2 V_2^T; U_2 and V_2 form improved estimates for the left and right singular vectors of A. Then find the SVD of U_2^T A = U_3 D_3 V_3^T to find yet better estimates. This iteration rapidly converges.

Algorithm 1 Randomized singular value decomposition

1: Input: matrix A of size c × w, the desired hidden state dimension k, and the number of “extra” singular vectors, l
2: Generate a (k + l) × c random matrix Ω
3: Find the SVD U_1 D_1 V_1^T of ΩA, and keep the k + l components of V_1 with the largest singular values
4: Find the SVD U_2 D_2 V_2^T of A V_1, and keep the “largest” k + l components of U_2
5: Find the SVD U_3 D_3 V_final^T of U_2^T A, and keep the “largest” k components of V_final
6: Find the SVD U_final D_4 V_4^T of A V_final, and keep the “largest” k components of U_final
7: Output: the canonical covariates Ψ = U_final and Φ = V_final, up to an arbitrary sign
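The sketch below is one way to write Algorithm 1 in code; it follows the reconstruction above and the Halko et al. (2011) scheme, treats A as a dense NumPy array for simplicity, and is an illustration rather than the authors' implementation.

```python
import numpy as np

def randomized_cca_svd(A, k, l=10, seed=0):
    """Algorithm 1: randomized SVD of the (c x w) matrix A.

    Returns Psi (c x k) and Phi (w x k), each up to an arbitrary sign.
    """
    rng = np.random.default_rng(seed)
    c, w = A.shape
    Omega = rng.standard_normal((k + l, c))               # step 2
    _, _, V1t = np.linalg.svd(Omega @ A, full_matrices=False)
    V1 = V1t[:k + l].T                                    # step 3: right basis
    U2, _, _ = np.linalg.svd(A @ V1, full_matrices=False)
    U2 = U2[:, :k + l]                                    # step 4: refined left basis
    _, _, Vft = np.linalg.svd(U2.T @ A, full_matrices=False)
    V_final = Vft[:k].T                                   # step 5: keep largest k
    U_final, _, _ = np.linalg.svd(A @ V_final, full_matrices=False)
    return U_final[:, :k], V_final                        # steps 6-7: Psi, Phi
```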

(Halko et al., 2011) prove a number of nice properties of the above algorithm. In particular, they guarantee that the algorithm, even without the extra iterations in steps 4 and 5, produces an approximation whose error is bounded by a small polynomial factor times the size of the largest singular value whose singular vectors are not part of the approximation, σ_{k+1}. They also show that using a small number of “extra” singular vectors (l) results in a substantial tightening of the bound, and that the extra iterations, which correspond to power iteration, drive the error bound exponentially quickly to one times the largest non-included singular value, σ_{k+1}.

A practical question to be answered is how to normalize the data to generate the A matrix. Normalizing A_cw by postmultiplying by A_ww^{-1/2} is trivial, since the latter matrix is diagonal. Normalizing A_cw by A_cc^{-1/2} is, however, not practical for large data sets, as it requires inverting a prohibitively large matrix. In practice, one can either replace this matrix by the identity, as is often done in CCA (see, e.g., (Witten and Tibshirani, 2009)), or one can use similar dimensionality reduction techniques to those described above and take advantage of the fact that the inverse of A_cc can be well approximated by finding its (approximate) SVD decomposition U D V' and then inverting the diagonal matrix D.
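The paper describes approximating the inverse of A_cc from an approximate SVD; the sketch below does the analogous thing for the inverse square root that appears in Eq. 1, with the rank r left as a tuning assumption.

```python
import numpy as np
from scipy.sparse.linalg import svds

def approx_inv_sqrt(Acc, r, eps=1e-8):
    """Rank-r approximation of Acc^{-1/2} for a (sparse, symmetric) Acc."""
    U, d, _ = svds(Acc.asfptype(), k=r)          # Acc is symmetric, so U ~ V
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, eps))
    return U @ np.diag(d_inv_sqrt) @ U.T
```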

4 Experimental Results

The state estimates for words capture a wide range of information about them that can be used to predict part of speech, linguistic features, and meaning. Before presenting a more quantitative evaluation of predictive accuracy, we present some qualitative results showing how word states, when projected in appropriate directions, usefully characterize the words.

In the results presented below, we started with the Google n-gram collection, which has counts of the most frequent 1- to 5-grams on the web.[2] The collection is extensive, having a vocabulary of thirteen million words and over a billion 5-grams derived from a trillion words of text from the web.

4.1 Qualitative Evaluation

To illustrate the sorts of information captured in our state vectors, we present a set of figures constructed by projecting selected small sets of words onto the space spanned by the second and third largest principal components of their “attribute dictionary” values, which are simply the right canonical correlates calculated from Equation 1. (The first principal component generally just separates the selected words from other words, and so is less interesting here.)

In all of the following examples, a 30-dimensional state vector was estimated using as context the one word preceding and the one word after each word in the 3-gram collection of the Google n-grams. The middle (second) word of each trigram is then the word being correlated with the context. Note that this procedure is fully unsupervised, although we could easily train a classifier based on the attribute dictionary to separate out different classes of words using only a few hand-labeled examples.
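Plots of this kind could be produced from the attribute dictionary roughly as sketched below; the word lists are illustrative, and `Phi` and `word_index` are assumed to come from the earlier sketches rather than from the paper's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_words(Phi, word_index, words):
    """Project selected word vectors onto their 2nd and 3rd principal
    components and label the points, as in the figures below."""
    X = np.vstack([Phi[word_index[w]] for w in words])
    pcs = PCA(n_components=3).fit_transform(X)
    plt.scatter(pcs[:, 1], pcs[:, 2])
    for (x, y), w in zip(pcs[:, 1:3], words):
        plt.annotate(w, (x, y))
    plt.xlabel("PC 2")
    plt.ylabel("PC 3")
    plt.show()
```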

[2] http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

To make clear the contrast with PCA, we take exactly the same trigrams and do PCA on them. (To make this look like LSA, think of each trigram as a document containing three words. Since a given trigram will occur many times, there are then many copies of each “document”.)
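Under that view, the PCA baseline could be computed roughly as follows; this is a sketch of the comparison as we understand it, reusing the hypothetical `trigram_counts` and vocabulary from the first sketch, with word positions ignored.

```python
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

def pca_word_vectors(trigram_counts, vocab, k):
    """LSA/PCA-style baseline: a 'document'-by-word matrix whose documents
    are trigram types (rows weighted by trigram count), followed by a
    truncated SVD; rows of the returned matrix are the PCA word vectors."""
    v = len(vocab)
    index = {w: i for i, w in enumerate(vocab)}
    D = lil_matrix((len(trigram_counts), v))
    for row, (trigram, count) in enumerate(trigram_counts.items()):
        for w in trigram:                 # bag of words: positions ignored
            D[row, index[w]] += count
    _, _, Vt = svds(D.tocsr().asfptype(), k=k)
    return Vt.T                           # v x k PCA word vectors
```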

Figure 1 shows plots for three different sets of words. The left column uses the attribute dictionary from CCA, while the right column uses the corresponding latent vectors derived using PCA on the same data. In all cases, the 30-dimensional vectors have been projected onto two dimensions (using PCA) so that they can be visualized.

The top row shows a small set of randomly selected nouns and verbs. Note that for CCA, nouns are on the left, while verbs are on the right. Good separation is achieved with no supervision. Words that are of similar or opposite meaning (e.g., “agree” and “disagree”) are distributionally similar, and hence close. The corresponding plot for PCA shows some structure, but does not give such a clean separation. This is not surprising; predicting the part of speech of words depends on the exact order of the words in their context (as we capture in CCA); a PCA-style bag-of-words can’t capture part of speech well.

The second row in Figure 1 shows the CCA and PCA attribute dictionaries for a few pronouns and possessives. In the CCA plot on the left, we can see the nominative pronouns (he, she, they) in the lower right corner and the possessives (his, her) on the lower left of the figure. Third person singular pairs (he/she and his/her) are particularly close together. Note that two dimensions are not sufficient to fully separate the different parts of speech here; different projections of the data would separate out those parts of speech. Again, PCA fails to give clear separation.

When one picks sets of words that are highly similar, such as names of people, the projections reveal more subtle features of those words. The third row shows a fairly good separation of male and female names along the diagonal from lower left to upper right. Closer inspection of the plot reveals that the other dimension separates names based on formality, with more complete names such as Joseph and Thomas on the lower right and shorter versions like Joe and Tom on the upper left.

The latent state as estimated by CCA thus captures a wide variety of attributes of the words: singular or plural, formal or informal, and, as shown below, good to eat or not, happy or sad, successful or failed – all from a single run of doing CCA.

Figure 1: Projections onto two dimensions of selected words in different categories using both CCA (left) and PCA (right), all using the same trigrams from Google n-grams. (Each panel plots the selected words against principal components 2 and 3 of their vectors.)


Figure 2 shows attribute dictionary projections for a few foods and drinks. (Due to space limitations, only the CCA is shown.) One can see drinks in the upper left, carrots and lettuce on the lower right, and a variety of dishes that one might eat for a meal on the lower left. (I know, that implies that one wouldn’t eat carrots or lettuce; let’s just note that different people or animals eat them.)

Figure 2: Projections of the attribute dictionary from CCA on the Google tri-grams of words representing items to eat or drink (plotted against principal components 2 and 3).

We have also plotted (not shown) names of numbers or the numerals representing numbers. Numbers that are close to each other in value tend to be close in the plot, thus suggesting that state captures not just classifications, but also more continuous hidden features.

The theme running through these figures is that the attribute dictionary, computed using the right canonical correlates (or singular vectors), captures a rich set of features characterizing the words. These vectors could be used in many applications to generalize beyond the particular words they represent.

4.2 Quantitative Evaluation

The attribute dictionary which we learn forms a low rank representation of each of the words in the vocabulary. These vectors can then be used as features in supervised models to predict various labels on the words, such as part of speech, word sense, or entity type.

In addition to the unsupervised results shown in the above pictures, we also evaluated our vector models of words on a number of supervised binary classification problems. The word lists for all these problems were obtained from Wikipedia or from Martin Seligman’s PERMA lexicon (Seligman, 2011), and the state vector for each word was learned using Google n-gram data as described above. (“PERMA” is a positive psychology construct that measures how much people have positive emotion, engagement, relationships, meaning, and a sense of achievement in their lives. These attributes constitute measures of subjective well-being.)

Word sets                  Class I   Class II
Positive emotion or not        81       162
Meaningful life or not        246        46
Achievement or not            159        70
Engagement or not             208        93
Relationship or not           236       204
Male vs. Female name          447       330
Number vs. Color               33        34
Animal vs. Bird               129        20

Table 1: Description of the datasets used; the counts are the number of observations in each class. All the data was collected either from Wikipedia or the PERMA lexicon. We plan to make the data available publicly after the conference.

In all the experiments we used 75% of the data (see Table 1), chosen randomly, for training and the remaining data for testing, and the numbers reported are averaged over 10 such runs. We used an SVM (Chang and Lin, 2001) with an RBF kernel for the binary classification task. The test-set accuracies for CCA and PCA are shown in Table 2; as can be seen, CCA is significantly better on all but one of the binary classification tasks. The only exception is the male vs. female name classification task, where using the non-sequential PCA model works as well as CCA.

Word sets                 Majority Baseline     PCA          CCA
Positive emotion or not         66.1          95.3 ± 0.2   97.6 ± 0.1
Engagement or not               68.4          94.9 ± 0.2   97.4 ± 0.2
Relationship or not             53.7          96.1 ± 0.4   98.0 ± 0.3
Meaningful life or not          83.8          95.9 ± 0.1   97.7 ± 0.3
Achievement or not              69.0          97.3 ± 0.3   99.1 ± 0.4
Number vs. Color                50.0          95.8 ± 0.2   99.7 ± 0.4
Male vs. Female name            56.9          84.2 ± 0.2   84.3 ± 0.3
Animal vs. Bird                 86.9          97.1 ± 0.1   98.2 ± 0.2

Table 2: Test set accuracies for each binary classification problem. “Majority baseline” always uses the more common label. PCA and CCA are computed as described above, using the Google n-grams to estimate the word vectors and then using these attribute vectors as features in an SVM. The first five categories are from the “PERMA” lexicon (see text). Note: the numbers in bold are statistically significant at the 5% level over the 10 runs.
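A sketch of this evaluation protocol using scikit-learn; X would hold the CCA (or PCA) attribute vectors for the words in one word-set task and y their binary labels. The function name and defaults are ours, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def average_accuracy(X, y, n_runs=10, test_size=0.25, seed=0):
    """Average test accuracy of an RBF-kernel SVM over random 75/25 splits."""
    accuracies = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + run)
        clf = SVC(kernel="rbf").fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))
    return float(np.mean(accuracies))
```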

5 Related work

There is, of course, a very long tradition of using vector space models of language, almost universally based on PCA. See, for example, the excellent recent review in (Turney and Pantel, 2010). However, none of the methods cited in that paper use CCA or its non-whitened cousin, canonical covariance estimation. PCA, which is by far the most widely used vector space model, tries to capture all of the variance in the matrix, while CCA captures the most important components of the covariance between two matrices. Thus CCA, as we use it, distinguishes one word, the target, from the context. In contrast, PCA treats all words in the n-gram equivalently, ignoring, among other things, the sequence of the words in the n-gram.

LSA (PCA between words and documents) fails to capture the local structure of language. HMMs, CRFs, and FSAs (finite state automata) are widely used for language models that estimate state for observed words based on a more immediate context. These sequence-analysis methods work well for some tasks such as part of speech (POS) tagging and some word sense disambiguation problems, but they are limited in that they generally fail to capture longer range influences (e.g., of the “topic” of the document) such as are captured by LSA, and they tend to scale poorly with the dimension of the hidden state, making it hard to cover all meanings in an entire language. HMMs, if unsupervised, require the use of EM algorithms, which are relatively slow. CRFs require labeled data (unlike CCA, which learns an attribute dictionary independently of what it will eventually be used for) and are also relatively expensive to train.[3]

In contrast, our CCA-based method scales extremely well, is guaranteed to find its single global optimal solution, and has provable computational complexity and approximation accuracy.

[3] There has been some recent work on semi-supervised sequence labeling (Jiao et al., 2006; Mann and McCallum, 2010), but these models generally use domain-specific constraints, and are much more cumbersome and slow to train than CCA-based methods.

Closer in style to our CCA-based approach, there is a recent resurgence of interest in using CCA-style spectral methods for estimating HMMs (Hsu et al., 2009) or similar linear dynamical systems (Siddiqi et al., 2010; Song et al., 2010). These methods, while attractive in making use of sequence information as we do, tend to be slightly more complex and difficult to analyze than the algorithm presented in this paper. The more complex spectral sequence methods have also not been demonstrated to work on real language problems.

The major application of CCA to language has been in the field of machine translation, where it has been recognized that CCA in which the two views are composed of corresponding text in different languages (instead of our context and word pairs) can be used to extract latent vectors capturing the shared meaning between the languages (Hardoon and Shawe-Taylor, 2008; Haghighi et al., 2008).

6 Discussion

We have argued that for many problems, CCA can give better feature vectors for words than PCA. CCA, in our application, finds the components of maximum correlation between the context words (taking into account their location in the context) and the word of interest, unlike PCA on n-grams, which treats all words in the n-gram equivalently.

PCA and CCA share deep similarities, not just in both being spectral methods. If the word co-occurrence matrix for PCA is normalized by scaling each word by dividing by the square root of its overall frequency, then in the special case of bigrams, PCA and CCA become identical. In this special case, the context and the target word covariances C'C and W'W become (after normalization) the identity matrix. Since CCA scales by the inverse of these covariance matrices, if they are identity matrices, PCA and CCA will have identical singular vectors.

In the more common case of a larger context, PCA will devote more degrees of freedom to finding the structure within the context, while CCA will focus on finding the correlation between context and target word. If we could afford to compute and use larger state spaces, this would not be too serious, but because we are working with large corpora and large vocabularies, even a five-fold reduction in the number of components that is kept matters.

More broadly, we have argued that a single vector for each word can capture a wide range of attributes of that word, including part of speech, animacy, sex, edibility, etc. One could instead have clustered words based on distributional similarity using, e.g., (Pereira et al., 1993) or (Brown et al., 1992), but one would need some complex multi-faceted hierarchical clustering scheme to come close to capturing the different dimensions represented in the attribute vectors. For example, should “he” and “she” be in the same or different clusters? The words are very similar on many dimensions, but opposed on at least one. Using vector models also has advantages over categories in allowing word meaning to sit on a continuum, rather than being binned into discrete categories. There is substantial evidence from human studies that word meanings are often interpreted on a graded scale (Erk and McCarthy, 2009), rather than categorically.

This paper has focused on the right canonical covariates (CCs), which give vectors characterizing each word type. We showed that these vectors characterize both the part of speech and the “meaning” of words. These right CCs, which we called the “attribute dictionary”, have the disadvantage that for words with multiple parts of speech or meanings, they will be a weighted average of these meanings. In such cases, one could use the attribute dictionary to map the context words down to the low dimensional state space and then predict properties of a particular token based on the reduced dimensional representation of its context. Alternatively, one could use the left CCs, which map from the context of each token to its associated state. Either method gives different state estimates for the same word in different contexts and can be used for problems such as POS tagging, word sense disambiguation, and named entity disambiguation. Under the assumption that the data are generated from a Hidden Markov Model (HMM), one can give proofs of the power of the left CCs that are similar in flavor to Theorem 1 or to the HMM estimation schemes in (Hsu et al., 2009).

We are currently extending this work in two directions. We are exploring the use of larger contexts than the n-gram models we used above, and we are starting to use similar CCA-based state estimation methods to learn probabilistic context free grammars. We believe that characterizing words using a vector-valued attribute dictionary will offer advantages over the (hard) word clusters used in some recent lexicalized parsers such as (Koo et al., 2008).

References

F. R. Bach and M. I. Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, University of California, Berkeley.

P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Katrin Erk and Diana McCarthy. 2009. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 440–449.

A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proc. of ACL-08: HLT, pages 771–779.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review.

David Hardoon and John Shawe-Taylor. 2008. Sparse CCA for bilingual word generation. In EURO Mini Conference, Continuous Optimization and Knowledge-Based Technologies.

D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664.

H. Hotelling. 1935. Canonical correlation analysis (CCA). Journal of Educational Psychology.

D. Hsu, S. M. Kakade, and T. Zhang. 2009. A spectral algorithm for learning hidden Markov models. In COLT.

Feng Jiao et al. 2006. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In ACL, pages 209–216.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proc. ACL.

T. K. Landauer, P. W. Foltz, and D. Laham. 2008. An introduction to latent semantic analysis. In Discourse Processes.

G. S. Mann and A. McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, 11:955–984, March.

F. Pereira, N. Tishby, and L. Lee. 1993. Distributional clustering of English words. In ACL, pages 183–190.

Martin Seligman. 2011. Flourish: A Visionary New Understanding of Happiness and Well-being. Free Press.

S. Siddiqi, B. Boots, and G. J. Gordon. 2010. Reduced-rank hidden Markov models. In AISTATS 2010.

L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. 2010. Hilbert space embeddings of hidden Markov models. In ICML.

P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

M. Witten and Robert J. Tibshirani. 2009. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, volume 8.