Page 1

Machine Learning for Textual Information Access: Results from the SMART project

Nicola Cancedda, Xerox Research Centre Europe

First Forum for Information Retrieval Evaluation

Kolkata, India, December 12th-14th, 2008

Page 2

The SMART Project

• Statistical Multilingual Analysis for Retrieval and Translation (SMART)
• Information Society Technologies Programme
• Sixth Framework Programme, “Specific Target Research Project” (STReP)
• Start date: October 1, 2006
• Duration: 3 years
• Objective: bring Machine Learning researchers to work on Machine Translation and CLIR

Page 3

The SMART Consortium

Page 4

The SMART Consortium

Page 5

Premise and Outline

• Two classes of methods for CLIR were investigated in SMART:
  – Methods based on dictionary adaptation for the cross-language extension of the LM approach in IR
  – Latent semantic methods based on Canonical Correlation Analysis
• Initial plan (reflected in the abstract): present both
  – ...but it would take too long, so:
• Outline:
  – (Longish) introduction to the state of the art in Canonical Correlation Analysis
  – A number of advances obtained by the SMART project

For lexicon adaptation methods, check out deliverable D 5.1 from the project website!

Page 6

Background: Canonical Correlation Analysis

Page 7

Canonical Correlation Analysis

Abstract view:
• Word-vector representations of documents (or queries, or any other text spans) are only superficial manifestations of a deeper vector representation based on concepts.
  – Since these concepts cannot be observed directly, they are latent.
• If two spans are translations of one another, their deep representations in terms of concepts are the same.
• Can we recover (at least approximately) the latent concept space? Can we learn to map text spans from their superficial word appearance into their deep representation?
  – CCA:
    • Assume the mapping from the deep to the superficial representation is linear
    • Estimate the mapping from empirical data

Page 8

Five documents in the world of concepts

[Figure: five documents plotted as points 1–5 in a two-dimensional concept space with axes c1 and c2; $Z = [z_1; z_2; z_3; z_4; z_5]$.]

Page 9

The same five documents in two languages

[Figure: the same five documents plotted in two language spaces, with axes e1, e2 (first language) and f1, f2 (second language); the concept axes c1, c2 appear rotated and stretched differently in each space.]

$$X = [x_1; x_2; x_3; x_4; x_5], \; x_i \in \mathbb{R}^{n_x} \qquad Y = [y_1; y_2; y_3; y_4; y_5], \; y_i \in \mathbb{R}^{n_y}$$

Page 10

Finding the first Canonical Variates

[Figure: the five documents in the two language spaces (axes e1, e2 and f1, f2), each projected onto one direction per space; the projections 1′–5′ and 1′′–5′′ line up, i.e. they are maximally correlated.]

Page 11

Finding the first Canonical Variates

Find the two directions, one for each language, such that the projections of the documents onto them are maximally correlated. In the population form:

$$(w_x^1, w_y^1) = \arg\max_{w_x, w_y} \frac{E[w_x' x \, y' w_y]}{\sqrt{E[w_x' x \, x' w_x] \, E[w_y' y \, y' w_y]}}$$

Assuming the data matrices X and Y are (row-wise) centered, the empirical version is:

$$(w_x^1, w_y^1) = \arg\max_{w_x, w_y} \frac{w_x' X Y' w_y}{\sqrt{(w_x' X X' w_x)(w_y' Y Y' w_y)}}$$

The numerator maximizes the covariance, working back the rotation: it is the first concept direction $c_1$ expressed in the bases of X and Y respectively. The normalization by the variances adjusts for “stretched” dimensions.

Page 12

Finding the first Canonical Variates

Find the two directions, one for each language, such that the projections of the documents are maximally correlated:

$$(w_x^1, w_y^1) = \arg\max_{w_x, w_y} \frac{w_x' X Y' w_y}{\sqrt{(w_x' X X' w_x)(w_y' Y Y' w_y)}}$$

or, equivalently, as a constrained problem:

$$(w_x^1, w_y^1) = \arg\max_{w_x, w_y} w_x' X Y' w_y \quad \text{s.t.}\quad w_x' X X' w_x = 1, \; w_y' Y Y' w_y = 1$$

This turns out to be equivalent to finding the largest eigen-pair of a Generalized Eigenvalue Problem (GEP):

$$\begin{bmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix} = \lambda \begin{bmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix} \quad (1)$$

where $C_{xx} = XX'$, $C_{yy} = YY'$, $C_{xy} = XY'$, $C_{yx} = YX'$. Complexity: $O((n_x + n_y)^3)$.
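As a concrete illustration, here is a minimal numpy/scipy sketch of this construction (illustrative only, not the SMART implementation; the function name and the small ridge term are my own additions). It assembles the GEP (1) from centered data matrices and extracts the leading canonical pair; the remaining eigen-pairs give the further canonical variates discussed on the next slide.

```python
import numpy as np
from scipy.linalg import eigh

def first_canonical_variates(X, Y):
    """X: (nx, m) and Y: (ny, m), columns are m paired (translated) documents,
    assumed row-wise centered. Returns (wx, wy), the first canonical pair."""
    nx, ny = X.shape[0], Y.shape[0]
    Cxx, Cyy, Cxy = X @ X.T, Y @ Y.T, X @ Y.T
    # Left- and right-hand block matrices of the GEP  A w = lambda B w
    A = np.block([[np.zeros((nx, nx)), Cxy],
                  [Cxy.T,              np.zeros((ny, ny))]])
    B = np.block([[Cxx,                np.zeros((nx, ny))],
                  [np.zeros((ny, nx)), Cyy]])
    B += 1e-8 * np.eye(nx + ny)   # small ridge: keeps B positive definite
    vals, vecs = eigh(A, B)       # generalized symmetric eigensolver
    w = vecs[:, -1]               # eigenvector of the largest eigenvalue
    return w[:nx], w[nx:]
```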

Page 13

Finding further Canonical Variates

Assume we have already found i-1 pairs of Canonical Variates. The i-th pair solves:

$$(w_x^i, w_y^i) = \arg\max_{w_x, w_y} w_x' X Y' w_y$$
$$\text{s.t.}\quad w_x^{i\,\prime} X X' w_x^i = 1, \quad w_y^{i\,\prime} Y Y' w_y^i = 1,$$
$$\phantom{\text{s.t.}\quad} w_x^{i\,\prime} X X' w_x^j = 0, \quad w_y^{i\,\prime} Y Y' w_y^j = 0, \quad \forall j < i$$

This turns out to be equivalent to finding the other eigen-pairs of the same GEP.

Page 14

Examples from the Hansard Corpus

Page 15

Kernel CCA

• The cubic complexity in the number of dimensions soon becomes intractable, especially with text.
• Also, it can be better to use similarity measures other than the inner product of (possibly weighted) document vectors.

Kernel CCA: move from the primal to the dual formulation, since it can be proved that $w_x^i$ (resp. $w_y^i$) lies in the span of the columns of X (resp. Y).

Page 16

Kernel CCA

The computation is again done by solving a GEP:

$$\begin{bmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{bmatrix} \begin{bmatrix} \beta_x \\ \beta_y \end{bmatrix} = \lambda \begin{bmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{bmatrix} \begin{bmatrix} \beta_x \\ \beta_y \end{bmatrix}$$

which solves, for successive pairs:

$$(\beta_x^i, \beta_y^i) = \arg\max_{\beta_x, \beta_y} \beta_x' K_x K_y \beta_y$$
$$\text{s.t.}\quad \beta_x' K_x^2 \beta_x = 1, \quad \beta_y' K_y^2 \beta_y = 1,$$
$$\phantom{\text{s.t.}\quad} \beta_x^{i\,\prime} K_x^2 \beta_x^j = 0, \quad \beta_y^{i\,\prime} K_y^2 \beta_y^j = 0, \quad \forall j < i$$

Complexity: $O(m^3)$, where m is the number of documents.

Page 17

Overfitting

Problem: if $m \le n_x$ and $m \le n_y$, then there are (infinitely many) trivial solutions with perfect correlation: OVERFITTING. E.g. with two (centered) points in $\mathbb{R}^2$ [figure], given an arbitrary direction in the first space, we can find one with perfect correlation in the second:

$$\forall \beta_x \text{ s.t. } \beta_x' K_x^2 \beta_x = 1, \quad \text{set } \beta_y = K_y^{-1} K_x \beta_x;$$
$$\text{then } \beta_y' K_y^2 \beta_y = \beta_x' K_x K_y^{-1} K_y^2 K_y^{-1} K_x \beta_x = \beta_x' K_x^2 \beta_x = 1 \quad \text{(unit variances)}$$
$$\text{and } \beta_y' K_y K_x \beta_x = \beta_x' K_x K_y^{-1} K_y K_x \beta_x = \beta_x' K_x^2 \beta_x = 1 \quad \text{(unit covariance)}$$

Unit variances and unit covariance: perfect correlation, no matter what direction!
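A minimal numeric check of this argument (illustrative code with random data, not from the slides): with $m \le n_x, n_y$ and linear kernels, any unit-variance direction in the first space can be matched perfectly in the second.

```python
import numpy as np

rng = np.random.default_rng(0)
m, nx, ny = 10, 50, 60                    # m <= nx and m <= ny
X = rng.standard_normal((nx, m))          # m documents in each language
Y = rng.standard_normal((ny, m))
Kx, Ky = X.T @ X, Y.T @ Y                 # linear kernel matrices, full rank a.s.

beta_x = rng.standard_normal(m)
beta_x /= np.sqrt(beta_x @ Kx @ Kx @ beta_x)   # enforce beta_x' Kx^2 beta_x = 1
beta_y = np.linalg.solve(Ky, Kx @ beta_x)      # beta_y = Ky^{-1} Kx beta_x

u, v = Kx @ beta_x, Ky @ beta_y           # projections of the m documents
print(np.corrcoef(u, v)[0, 1])            # 1.0, whatever beta_x was
```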

Page 18

Regularized Kernel CCA

We can regularize the objective function by trading correlation against a good account of the variance in the two spaces:

$$\tilde{K} = (1 - \kappa) K + \kappa I, \quad \kappa \in [0, 1]$$

$$(\beta_x^i, \beta_y^i) = \arg\max_{\beta_x, \beta_y} \beta_x' K_x K_y \beta_y$$
$$\text{s.t.}\quad \beta_x' \tilde{K}_x K_x \beta_x = 1, \quad \beta_y' \tilde{K}_y K_y \beta_y = 1,$$
$$\phantom{\text{s.t.}\quad} \beta_x^{i\,\prime} \tilde{K}_x K_x \beta_x^j = 0, \quad \beta_y^{i\,\prime} \tilde{K}_y K_y \beta_y^j = 0, \quad \forall j < i$$

The corresponding GEP is:

$$\begin{bmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{bmatrix} \begin{bmatrix} \beta_x \\ \beta_y \end{bmatrix} = \lambda \begin{bmatrix} \tilde{K}_x K_x & 0 \\ 0 & \tilde{K}_y K_y \end{bmatrix} \begin{bmatrix} \beta_x \\ \beta_y \end{bmatrix}$$
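A sketch of this regularized dual problem in the same style as before (again illustrative; the function name and numerical guards are my own). Setting kappa = 0 recovers the unregularized KCCA above, with its overfitting problem.

```python
import numpy as np
from scipy.linalg import eigh

def regularized_kcca(Kx, Ky, kappa=0.1):
    """Kx, Ky: (m, m) centered kernel matrices; kappa in [0, 1].
    Returns the dual coefficients (beta_x, beta_y) of the first pair."""
    m = Kx.shape[0]
    I = np.eye(m)
    Ktx = (1 - kappa) * Kx + kappa * I    # regularized kernels K~
    Kty = (1 - kappa) * Ky + kappa * I
    A = np.block([[np.zeros((m, m)), Kx @ Ky],
                  [Ky @ Kx,          np.zeros((m, m))]])
    B = np.block([[Ktx @ Kx,         np.zeros((m, m))],
                  [np.zeros((m, m)), Kty @ Ky]])
    B = (B + B.T) / 2 + 1e-10 * np.eye(2 * m)   # guard against round-off
    vals, vecs = eigh(A, B)
    w = vecs[:, -1]
    return w[:m], w[m:]
```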

Page 19

Multiview CCA

(K)CCA can take advantage of the “mutual information” between two languages...

[Figure: the five documents in the concept space and in two language spaces, as before.]

...but what if we have more than two? Can we benefit from multiple views? This is also known as Generalised CCA.

[Figure: the same five documents in two additional language spaces.]

Page 20

Multiview CCA

There are many possible ways to combine pairwise correlations between views (e.g. sum, product, min, ...).

Chosen approach: SUMCOR [Horst-61]. With a slightly different regularization than above, this is:

$$(\beta_1^i, \dots, \beta_k^i) = \arg\max_{\beta_1, \dots, \beta_k} \sum_{p<q} \beta_p' K_p K_q \beta_q$$
$$\text{s.t.}\quad \beta_p' \tilde{K}_p^2 \beta_p = 1 \;\; \forall p, \qquad \beta_p' \tilde{K}_p^2 \beta_p^j = 0 \;\; \forall j < i$$

This leads to a Multivariate Eigenvalue Problem (MEP):

$$\begin{bmatrix} A_{1,1} & \dots & A_{1,k} \\ \vdots & \ddots & \vdots \\ A_{k,1} & \dots & A_{k,k} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} = \begin{bmatrix} \lambda_1 \beta_1 \\ \vdots \\ \lambda_k \beta_k \end{bmatrix}$$

Page 21

Multiview CCA

• Multivariate Eigenvalue Problems (MEPs) are much harder to solve than GEPs:
  – [Horst-61] introduced an extension to MEPs of the standard power method for EPs, for finding the first set of canonical variates only
  – Naïve implementations would be quadratic in the number of documents, and would scale up to no more than a few thousand documents

Page 22

Innovations from SMART

Page 23

Innovations from SMART

• Extensions of the Horst algorithm [Rupnik and Shawe-Taylor]
  – Efficient implementation, linear in the number of documents
  – Version for finding many sets of canonical variates
• New regression-CCA framework for CLIR [Rupnik and Shawe-Taylor]
• Sparse KCCA [Hussain and Shawe-Taylor]

Page 24

Efficient Implementation of the Horst Algorithm

The Horst algorithm starts with a random set of vectors $(\beta_{1,0}, \dots, \beta_{k,0})$, then iteratively multiplies by the MEP matrix and renormalizes until convergence:

$$\begin{bmatrix} \beta_{1,t+1} \\ \vdots \\ \beta_{k,t+1} \end{bmatrix} = \begin{bmatrix} A_{1,1} & \dots & A_{1,k} \\ \vdots & \ddots & \vdots \\ A_{k,1} & \dots & A_{k,k} \end{bmatrix} \begin{bmatrix} \beta_{1,t} \\ \vdots \\ \beta_{k,t} \end{bmatrix}$$

Inner loop: $k^2$ matrix-vector multiplications, each $O(m^2)$.

Extension (1): exploiting the structure of the MEP matrix, one can refactor the computation and save an $O(k)$ factor in the inner loop.

Extension (2): exploiting the sparseness of the document vectors, one can replace each (vector) multiplication by a kernel matrix ($O(m^2)$) with two multiplications by the document matrix ($O(ms)$ each, where s is the maximum number of non-zero components in a document vector). Leveraging this same sparsity, kernel inversions can be replaced by cheaper numerical linear system resolutions.

The inner loop can thus be made $O(kms)$ instead of $O(k^2m^2)$.
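A compact sketch of the Horst-style iteration with renormalization against the regularized variance. This is illustrative only: the slides leave the blocks $A_{p,q}$ unspecified, so this assumes $A_{p,q} = K_p K_q$ for $p \ne q$ (matching the SUMCOR objective above), uses a fixed iteration count instead of a convergence test, and keeps the naive $k^2$ inner loop; replacing the dense kernel products with two sparse document-matrix products would give the $O(kms)$ loop of Extension (2).

```python
import numpy as np

def horst_first_variates(Ks, kappa=0.1, iters=200, seed=0):
    """First set of multiview canonical variates by Horst-style power
    iteration. Ks: list of k (m, m) kernel matrices, one per language."""
    k, m = len(Ks), Ks[0].shape[0]
    Kts = [(1 - kappa) * K + kappa * np.eye(m) for K in Ks]   # K~p
    rng = np.random.default_rng(seed)
    betas = [rng.standard_normal(m) for _ in range(k)]
    for _ in range(iters):
        new = []
        for p in range(k):
            # Multiply by the p-th block row: sum over q of A_{p,q} beta_q
            v = sum(Ks[p] @ (Ks[q] @ betas[q]) for q in range(k) if q != p)
            # Renormalize to unit regularized variance: beta' K~p^2 beta = 1
            new.append(v / np.linalg.norm(Kts[p] @ v))
        betas = new
    return betas
```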

Page 25

Extended Horst algorithm for finding many sets of canonical variates

The Horst algorithm only finds the first set of k canonical variates.

Extension (3): maintain projection matrices $P_i^t$ that, at each iteration, project the iterates $\beta_{i,t}$ onto the subspace orthogonal to all previous canonical variates for space i.

Finding d sets of canonical variates can then be done in $O(d^2 m k s)$. This scales up!

Page 26

MCCA: Experiments

• Experiments: mate retrieval with Europarl
• 10 languages
• 100,000 10-way aligned sentences for training
• 7,873 10-way aligned sentences for testing
• Document vectors: uni-, bi- and tri-grams (~200k features for each language); TF*IDF weighting and length normalization
• MCCA used to extract d = 100-dimensional subspaces
• Baseline alternatives for selecting the new basis:
  – k-means clustering: centroids on concatenated multilingual document vectors
  – CL-LSI, i.e. LSI on concatenated vectors

Page 27

Some example latent vectors

Page 28

MCCA experiment results

Measure: recall in the top 10, averaged over 9 languages.

“Query” Language   K-means   CL-LSI   MCCA
EN                 0.7486    0.9129   0.9883
SP                 0.7450    0.9131   0.9855
GE                 0.5927    0.8545   0.9778
IT                 0.7448    0.9022   0.9836
DU                 0.7136    0.9021   0.9835
DA                 0.5357    0.8540   0.9874
SW                 0.5312    0.8623   0.9880
PT                 0.7511    0.9000   0.9874
FR                 0.7334    0.9116   0.9888
FI                 0.4402    0.7737   0.9830

Page 29

MCCA experiment results

More realistic experiment: pseudo-queries are now formed from the top-5 TF*IDF-scoring components of each sentence.

“Query” Language   K-means   CL-LSI   MCCA
EN                 0.1319    0.2348   0.4413
SP                 0.1258    0.2226   0.4109
GE                 0.1333    0.2492   0.4158
IT                 0.1330    0.2343   0.4373
DU                 0.1339    0.2408   0.4369
DA                 0.1376    0.2517   0.4232
SW                 0.1376    0.2499   0.4038
PT                 0.1274    0.2187   0.4075
FR                 0.1300    0.2262   0.3931
FI                 0.1340    0.2490   0.4179

Page 30

Extension (4): Regression-CCA

Given a query q in one language, find the target-language vector w that is maximally correlated with it:

$$w^* = \arg\max_w q' X Y' w \quad \text{s.t.}\quad \tfrac{1}{2} w'((1-\kappa) Y Y' + \kappa I) w = 1$$

Solution: $w^* = ((1-\kappa) Y Y' + \kappa I)^{-1} (Y X' q)$

Given this “query translation”, we can then find the closest target documents using the standard cosine measure.
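In code, the “query translation” is a single linear solve (a sketch under the same notation; the function name and the final normalization are my own, the latter harmless since cosine retrieval only needs the direction):

```python
import numpy as np

def translate_query(q, X, Y, kappa=0.5):
    """q: (nx,) source-language query; X: (nx, m), Y: (ny, m) paired,
    centered training documents. Returns the regression-CCA translation
    w* = ((1-kappa) Y Y' + kappa I)^{-1} Y X' q."""
    ny = Y.shape[0]
    A = (1 - kappa) * (Y @ Y.T) + kappa * np.eye(ny)
    w = np.linalg.solve(A, Y @ (X.T @ q))   # solve, rather than invert
    return w / np.linalg.norm(w)            # cosine only needs the direction

# Retrieval: rank target documents d by cosine similarity between w and d.
```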

Promising initial results on the CLEF/GIRT dataset: better than standard CCA, but the method cannot take the thesaurus into account, so MAP is still not competitive with the best systems.

Page 31

Extension (5): Sparse - KCCA

• Seek sparsity in the dual solution: the first canonical variates are expressed as linear combinations of only relatively few documents
  – Improved efficiency
  – Alternative regularization
• The same set of indices i is used for both views

Page 32

Sparse - KCCA

For a fixed set of indices i:

$$(\beta_x, \beta_y) = \arg\max_{\beta_x, \beta_y} \beta_x' K_x[i,:] \, K_y[:,i] \, \beta_y$$
$$\text{s.t.}\quad \beta_x' K_x^2[i,i] \, \beta_x = 1, \quad \beta_y' K_y^2[i,i] \, \beta_y = 1$$

which again reduces to a GEP:

$$\begin{bmatrix} 0 & K_{xy}[i,i] \\ K_{yx}[i,i] & 0 \end{bmatrix} \begin{bmatrix} \beta_x \\ \beta_y \end{bmatrix} = \lambda \begin{bmatrix} K_x^2[i,i] & 0 \\ 0 & K_y^2[i,i] \end{bmatrix} \begin{bmatrix} \beta_x \\ \beta_y \end{bmatrix}$$
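For a given index set, the reduced GEP is cheap to set up and solve. A sketch in the same style (illustrative; `np.ix_` just selects the [i, i] submatrices):

```python
import numpy as np
from scipy.linalg import eigh

def sparse_kcca_fixed_indices(Kx, Ky, idx):
    """Kx, Ky: (m, m) kernel matrices; idx: d selected document indices.
    Solves the Sparse-KCCA GEP restricted to the index set."""
    d = len(idx)
    sub = np.ix_(idx, idx)
    Kxy = (Kx @ Ky)[sub]                      # K_xy[i, i]
    Kx2, Ky2 = (Kx @ Kx)[sub], (Ky @ Ky)[sub] # K_x^2[i, i], K_y^2[i, i]
    A = np.block([[np.zeros((d, d)), Kxy],
                  [Kxy.T,            np.zeros((d, d))]])
    B = np.block([[Kx2,              np.zeros((d, d))],
                  [np.zeros((d, d)), Ky2]])
    B = (B + B.T) / 2 + 1e-8 * np.eye(2 * d)  # guard positive definiteness
    vals, vecs = eigh(A, B)
    w = vecs[:, -1]
    return w[:d], w[d:]
```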

But how do we select i ?

Page 33

Sparse – KCCA: Algorithms

Algorithm 1
1. Initialize
2. For i = 1 to d:
     deflate the kernel matrices
3. End for
4. Solve the GEP for index set i

Algorithm 2
• Set i to the index of the top d values of …
• Solve the GEP for index set i

Deflation consists in transforming the matrices to reflect a projection onto the space orthogonal to the current basis in feature space.

Page 34

Sparse – KCCA: Mate retrieval experiments

Europarl, English-Spanish mate retrieval:

Method      Train (sec.)   Test (sec.)
KCCA        24693          27733
SKCCA (1)    5242            698
SKCCA (2)    1873            695

Page 35

SMART - Website

Project presentation and deliverables:
• http://www.smart-project.eu
• D 5.1 on lexicon-based methods and D 5.2 on CCA

Page 36

SMART - Dissemination and Exploitation

Platforms for showcasing developed tools:

Page 37

Thank you!

Page 38

Shameless plug

Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds: Learning Machine Translation, MIT Press, to appear in 2009.

Page 39

References

[Hardoon and Shawe-Taylor]

David Hardoon and John Shawe-Taylor, Sparse CCA for Bilingual Word Generation, in 20th Mini-EURO Conference of the Continuous Optimization and Knowledge Based Technologies, Neringa, Lithuania, 2008.

[Hussain and Shawe-Taylor]

Zakria Hussain and John Shawe-Taylor, Theory of Matching Pursuit, in Neural Information Processing Systems (NIPS), Vancouver, BC, 2008.

[Rupnik and Shawe-Taylor]

Jan Rupnik and John Shawe-Taylor, contribution to SMART deliverable D 5.2 “Multilingual Latent Language-Independent Analysis Applied to CLTIA Tasks” (http://www.smart-project.eu/files/D52.pdf)

Page 40

Self-introduction

[Timeline figure: research areas over time, including Natural Language Generation, Grammar Learning, Text Categorization, Machine Learning (kernels for text), and (Statistical) Machine Translation since ca. 2004.]