CLEF2003 Forum/ August 2003 / Trondheim / page 1 Report on CLEF-2003 ML4 experiments Extracting...

Report on CLEF-2003 ML4 experiments

Extracting multilingual resources from corpora

N. Cancedda, H. Dejean, E. Gaussier, J-M Renders

Xerox Research Center Europe

Alexei Vinokourov

Royal Holloway University of London


Agenda

• Objective and means

• Linguistic Preprocessing

• Methods

– Canonical Correlation Analysis (CCA) for CLIR

– Combining lexicons automatically extracted from parallel and comparable corpora

• Results


Objectives and Means

• How to improve the adequacy of existing resources (dictionaries) to translate queries:

– Coverage?

– Precision (translation adapted to the corpus)?

• First way: exploit parallel corpora

– Extract semantic, language-independent representation

– Extract bilingual lexicons

• Second way: exploit comparable corpora

– Extract (probabilistic) translation relationships

– Must be combined with other translation resources (parallel)


The Task (first participation)

• Multi-lingual 4:

– English, German, Spanish, French

• Fully automatic approach (no manual processing of the queries)

• Query language:

– English

• Performance measure:

– Non-interpolated average precision (not limited to 1000 documents)

– Macro-average on all queries:

• Before submission (training): from 2000 to 2002 (140 queries)

• After submission (evaluation): from 2001 to 2003
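The performance measure named above can be sketched as follows. This is a minimal illustration of non-interpolated average precision and its macro-average over queries; the function names and the data layout (ranked lists of document ids, sets of relevant ids) are assumptions made for this sketch, not the evaluation tool used in the campaign.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for a single query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def macro_average_precision(rankings: Dict[str, List[str]],
                            qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precisions (each query weighs the same)."""
    scores = [average_precision(rankings.get(q, []), rel)
              for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Passing the full ranking gives the "not limited to 1000 documents" variant; truncating each ranking to its first 1000 entries before the call gives the limited one.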


Resources we used

• General Dictionary: ELRA (40k entries)

• Parallel corpora:

– Hansard corpus (for CCA) – only French-English

– JOC corpus (for lexicon extraction) – 300,000 sentences

• Comparable corpora:

– The CLEF2003 corpora


Summary of approaches

• Semantic Projection

– A semantic, language-independent space is extracted from a parallel training corpus

– Language-dependent projection matrices are built

– Both documents and queries are projected

– Standard cosine measure is then used in the new space to perform IR

• Query translation

– A probabilistic translation matrix is extracted from a parallel training corpus and the comparable CLEF corpus

– Queries are translated by these translation matrices

– Standard cosine measure is then used between the original documents and the translated query


Linguistic Preprocessing

• Lemmatized and (POS-)tagged corpora

• Partial segmentation of German compounds (lexicon-based) + some simple heuristics

• Normalization of spelling and accentuation (e.g. umlaut and eszett)

• POS-based word filtering (N, V, AD)

• Single word entries only (for the dictionaries, queries and documents) – Note that the adopted approaches for translation are context-dependent to some extent.
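As an illustration of the spelling and accentuation step, here is a small sketch using only the Python standard library. The slides do not say which target form was used, so the choices here (lower-casing, "ß" to "ss", diacritics simply dropped) and the function name `normalize_word` are assumptions.

```python
import unicodedata


def normalize_word(word: str) -> str:
    """Fold case, expand eszett and strip diacritics (assumed conventions)."""
    word = word.lower().replace("ß", "ss")
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, `normalize_word("Größe")` yields `"grosse"`. A system that maps "ö" to "oe" instead would need an explicit replacement table before the decomposition.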


CCA for CLIR

• Given a set of paired observations (paired sentences or paragraphs), Canonical Correlation Analysis finds maximally correlated projections

[Figure: paired observations s1, s2, s3 and t1, t2, t3]


CCA for CLIR (II)

• CCA looks for particular combinations of terms that appear to have the same co-occurrence patterns in both languages

• Hypothesis: the only thing both languages have in common is their meaning (conditional independence)

• Then, these (linear) combinations of terms are able to locate the underlying semantics

• Results in language-independent concepts and the corresponding (language-dependent) projection operators

• Both queries and documents are projected – Traditional similarity measures (cosine) are then used for retrieval
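The projection idea can be sketched with plain linear CCA in NumPy. This is an illustration under stated assumptions, not the implementation behind the reported runs: the ridge term `reg`, the function names, and the dense term-vector inputs are all choices made for this sketch, and a kernel variant of CCA would differ in the details.

```python
import numpy as np


def _inv_sqrt(mat: np.ndarray) -> np.ndarray:
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def fit_cca(X: np.ndarray, Y: np.ndarray, k: int, reg: float = 1e-3):
    """X: (n, p) source-language term vectors, Y: (n, q) target-language
    term vectors; row i of X and row i of Y form one aligned pair."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularised covariances (reg is an assumed ridge term for stability).
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Kx, Ky = _inv_sqrt(Cxx), _inv_sqrt(Cyy)
    # Singular vectors of the whitened cross-covariance give the directions.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :k]      # projection operator for the source language
    Wy = Ky @ Vt.T[:, :k]   # projection operator for the target language
    return Wx, Wy, s[:k], mx, my


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

At retrieval time a query vector `q` in the source language becomes `(q - mx) @ Wx`, a document vector `d` in the target language becomes `(d - my) @ Wy`, and `cosine` compares the two in the shared space.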


Extraction of bilingual resources

• Upper bound of the coverage for the CLEF200x English query terms

• Automatically extracted lexicons provide better coverage, but translation accuracy can be degraded

• Use of some form of trade-off between the resources (manual/automatic)

[Figure: bar chart of coverage (axis from 0 to 1) per resource. Coverage values: 0.78, 0.78, 0.8, 0.9, 0.98. Resource labels, in extraction order: Elra, Oxford-HT, Hansard, JOC, ML4]


Extracting lexicons from parallel corpora

• Statistical alignment methods:

– starting from alignment at the sentence level

– Iterative Proportional Fitting Procedure (normalizing and restoring consistency in the raw co-occurrence matrix of source/target terms in aligned sentences)

– Probabilistic translation matrix: $P_1(t|s)$
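The steps above can be sketched as follows. The slides do not specify which marginals the fitting procedure is run against, so this sketch assumes the targets are the number of aligned sentence pairs containing each source term and each target term; the function names are likewise hypothetical.

```python
import numpy as np


def cooccurrence(pairs, src_vocab, tgt_vocab):
    """Raw co-occurrence counts of source/target terms in aligned sentences."""
    s_idx = {w: i for i, w in enumerate(src_vocab)}
    t_idx = {w: j for j, w in enumerate(tgt_vocab)}
    C = np.zeros((len(src_vocab), len(tgt_vocab)))
    for src_sent, tgt_sent in pairs:
        for s in set(src_sent):
            for t in set(tgt_sent):
                C[s_idx[s], t_idx[t]] += 1.0
    return C


def ipf(C, row_targets, col_targets, n_iter=50):
    """Iterative proportional fitting: rescale rows, then columns, so that
    the marginals of the matrix approach the requested targets."""
    M = C.astype(float).copy()
    for _ in range(n_iter):
        row_sums = M.sum(axis=1, keepdims=True)
        M *= np.divide(row_targets[:, None], row_sums,
                       out=np.zeros_like(row_sums), where=row_sums > 0)
        col_sums = M.sum(axis=0, keepdims=True)
        M *= np.divide(col_targets[None, :], col_sums,
                       out=np.zeros_like(col_sums), where=col_sums > 0)
    return M


def translation_matrix(M):
    """Row-normalise so that row s holds the distribution P1(t|s)."""
    row_sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
```

On a toy corpus of three aligned pairs, ("house red", "maison rouge"), ("house", "maison") and ("red car", "voiture rouge"), the fitted row for "house" puts most of its mass on "maison", whereas the raw counts alone would spread weight over every word that shares a sentence with it.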


Extracting lexicons from comparable corpora

• Assumption: if 2 words are mutual translations, their more frequent collocates are likely to be mutual translations as well

• Corresponding method:

– Build context vectors for source words s: CV(s)

– Build context vectors for target words t: CV(t)

– Translate the context vectors using a standard dictionary (as a bootstrap): TR(CV(t))

– Compute the similarity between s and t by cos(CV(s), TR(CV(t)))

– Normalize the similarities to yield a probabilistic translation lexicon $P_2(t|s)$

– NB: CV are based on windows centered on s or t, and weighted by some association measure (such as Mutual Information); the word itself is included in the CV, which introduces a bias toward dictionary entries
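The method above can be sketched in a few functions. Several details are assumptions of this sketch rather than facts from the slides: the window is expressed as `half_window` words on each side, the association measure is unclipped pointwise mutual information computed from window counts, a context word with several dictionary translations passes its full weight to each of them, and only positive similarities are kept before normalisation.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, half_window=2):
    """CV(w) for every word w, weighted by pointwise mutual information.
    The window is centred on w and includes w itself."""
    pair = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - half_window), min(len(tokens), i + half_window + 1)
        for c in tokens[lo:hi]:
            pair[w][c] += 1
    total = sum(sum(ctr.values()) for ctr in pair.values())
    w_tot = {w: sum(ctr.values()) for w, ctr in pair.items()}
    c_tot = Counter()
    for ctr in pair.values():
        c_tot.update(ctr)
    return {w: {c: math.log(n * total / (w_tot[w] * c_tot[c]))
                for c, n in ctr.items()}
            for w, ctr in pair.items()}


def translate_vector(cv, dictionary):
    """TR(CV(t)): map each target context word to its source translations."""
    out = defaultdict(float)
    for word, weight in cv.items():
        for translation in dictionary.get(word, []):
            out[translation] += weight
    return dict(out)


def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def translation_lexicon(src_cvs, tgt_cvs, dictionary):
    """Normalise cos(CV(s), TR(CV(t))) over t to obtain P2(t|s)."""
    translated = {t: translate_vector(cv, dictionary) for t, cv in tgt_cvs.items()}
    lexicon = {}
    for s, cv_s in src_cvs.items():
        sims = {t: cosine(cv_s, tr) for t, tr in translated.items()}
        sims = {t: x for t, x in sims.items() if x > 0}
        z = sum(sims.values())
        lexicon[s] = {t: x / z for t, x in sims.items()} if z > 0 else {}
    return lexicon
```

Because the word itself sits in its own context vector, a source word that already has a dictionary translation gets a direct match with that translation's vector, which is the bias noted in the NB above.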


Hybrid Method: model combination

• In some cases, the information provided by the comparable corpus is more reliable; in other cases, the information extracted from the parallel one is best.

• We adopted a simple linear combination scheme, but more elaborate approaches exist

$q_t = (\alpha P_1(t|s) + (1-\alpha) P_2(t|s))\, q_s$

• We optimized $\alpha$ on the queries 2000-2002 (performance measure: average precision)
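A minimal sketch of the linear combination and of tuning its mixing weight. The name `alpha`, the matrix layout (one row per source term, one column per target term) and the grid search are assumptions of this sketch; `score_fn` stands for any hypothetical callback that runs retrieval on the training queries with a given weight and returns average precision.

```python
import numpy as np


def translate_query(q_s, P1, P2, alpha):
    """Translate a source query vector with the interpolated lexicon.
    P1[s, t] and P2[s, t] hold the two estimates of P(t|s)."""
    P = alpha * P1 + (1.0 - alpha) * P2
    return P.T @ q_s


def tune_alpha(score_fn, grid=None):
    """Pick the weight that maximises score_fn over a grid (assumed search)."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    return max(grid, key=score_fn)
```

With `alpha = 1` the query is translated by the parallel-corpus lexicon alone, and with `alpha = 0` by the comparable-corpus lexicon alone.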


Multilingual merging

• As we used consistent translation matrices and weighting scheme for all languages, only length normalization was performed before merging the scores

• We also extracted a $P_2(t|s)$ translation matrix for English; this realizes some kind of query expansion based on contextual similarity.
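One way to read the merging step is sketched below: each language produces cosine (length-normalised) scores, and the per-language result lists are merged by raw score with no further rescaling. The data layout, the function names and the cut-off `k` are assumptions of this sketch.

```python
import heapq
import math


def cosine(q, d):
    """Cosine between a translated query and a document (sparse dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq > 0 and nd > 0 else 0.0


def merge_runs(runs, k=1000):
    """runs maps a language code to a list of (doc_id, score) pairs.
    Scores are assumed comparable across languages, so they are pooled
    and sorted directly."""
    pooled = [(score, lang, doc_id)
              for lang, hits in runs.items()
              for doc_id, score in hits]
    top = heapq.nlargest(k, pooled, key=lambda item: item[0])
    return [(doc_id, lang, score) for score, lang, doc_id in top]
```

Pooling raw scores only makes sense when every language uses the same weighting scheme and the same kind of translation matrix, which is the condition stated in the first bullet.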



Weighting schemes

• For submission:

– Documents: ltc

– Query

• before translation: ntn

• after translation: nnc

• After submission:

– Documents: Lnu

– Query: ntn (before), nic (after)

• Measure of association in the context vector:

– Mutual information

– Window size: 5
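The three-letter codes follow the SMART convention (term-frequency component, document-frequency component, normalisation). The sketch below covers only the codes needed for ltc, ntn and nnc; the Lnu scheme, with its averaged term frequency and pivoted normalisation, is left out. The natural logarithm and the function name are assumptions of this sketch.

```python
import math


def smart_weights(tf, df, n_docs, scheme):
    """tf: term -> frequency (must be > 0), df: term -> document frequency.
    scheme is a three-letter SMART code such as "ltc", "ntn" or "nnc"."""
    tf_code, idf_code, norm_code = scheme
    weights = {}
    for term, f in tf.items():
        if tf_code == "n":            # natural (raw) term frequency
            x = float(f)
        elif tf_code == "l":          # logarithmic term frequency
            x = 1.0 + math.log(f)
        else:
            raise ValueError("unsupported tf code: " + tf_code)
        if idf_code == "t":           # inverse document frequency
            x *= math.log(n_docs / df[term])
        elif idf_code != "n":
            raise ValueError("unsupported idf code: " + idf_code)
        weights[term] = x
    if norm_code == "c":              # cosine (unit length) normalisation
        norm = math.sqrt(sum(x * x for x in weights.values()))
        if norm > 0:
            weights = {t: x / norm for t, x in weights.items()}
    elif norm_code != "n":
        raise ValueError("unsupported normalisation code: " + norm_code)
    return weights
```

In the submitted configuration the original query would be weighted with "ntn", passed through the translation matrix, and the translated vector then normalised as in "nnc"; documents would be weighted with "ltc".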


Results (1)

• CCA: failed

– Only bilingual

– Based on a small set of Hansard (disjoint from CLEF2003)

– The training corpus was reduced to 1000 paragraphs to be practically feasible and to provide results in time

– To be extended in the future


Results (II) – 2000, 2001 and 2002 queries

[Figure: bar chart of average precision (axis from 0 to 0.45); series: Bilingual (before merging), Multilingual (after merging)]


Results (Details) – 2000, 2001, 2002 queries

Average Precision             ELRA   Parallel  Comparable  Hybrid  Monolingual
Bilingual (before merging)    0.29   0.365     0.228       0.388   0.444
Multilingual (after merging)  0.192  0.289     0.165       0.302   0.361
ENG                           0.35   0.35      0.364       0.378   0.363
FRE                           0.271  0.362     0.188       0.389   0.449
GER                           0.276  0.361     0.203       0.38    0.475
SPA                           0.304  0.411     0.221       0.431   0.439


Results of hybridization parallel/comparable

[Figure: "Combination Ppar-Pcomp CLEF2003"; x-axis: % of Parallel; y-axis: Average Precision (0.15 to 0.4); curves: bilingual, multilingual]


Results (details) … after submission

• Mainly focused on changing the weighting scheme (Lnu)

• Average precision (retrieval limited to 1000 documents):

Setting                             Average Precision
ltc/ntn/nnc (submitted)             0.1860
Lnu/ntn/nnc (same tuning as subm.)  0.2118
Lnu/ntn/ntc (re-optimised tuning)   0.2341


Conclusions

• Clearly, exploiting parallel and comparable corpora to enhance query translation improves CLIR performance

• When considering the monolingual reference line, there is still room for improvement

• Also, different merge strategies must be investigated
