CLEF2003 Forum / August 2003 / Trondheim
Report on CLEF-2003 ML4 experiments: Extracting multilingual resources from corpora

N. Cancedda, H. Dejean, E. Gaussier, J.-M. Renders
Xerox Research Center Europe

Alexei Vinokourov
Royal Holloway, University of London
Agenda

• Objective and means
• Linguistic preprocessing
• Methods
  – Canonical Correlation Analysis (CCA) for CLIR
  – Combining lexicons automatically extracted from parallel and comparable corpora
• Results
Objectives and Means

• How to improve the adequacy of existing resources (dictionaries) for translating queries:
  – Coverage?
  – Precision (translations adapted to the corpus)?
• First way: exploit parallel corpora
  – Extract a semantic, language-independent representation
  – Extract bilingual lexicons
• Second way: exploit comparable corpora
  – Extract (probabilistic) translation relationships
  – Must be combined with other (parallel) translation resources
The Task (first participation)

• Multilingual-4 track: English, German, Spanish, French
• Fully automatic approach (no manual processing of the queries)
• Query language: English
• Performance measure: non-interpolated average precision (not limited to 1000 documents)
  – Macro-averaged over all queries:
    • Before submission (training): queries from 2000 to 2002 (140 queries)
    • After submission (evaluation): queries from 2001 to 2003
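The evaluation measure above can be made concrete with a minimal sketch (illustrative helper names, not the official trec_eval implementation): non-interpolated average precision averages precision at each rank where a relevant document appears, and the macro-average is the unweighted mean over queries.

```python
def average_precision(ranked, relevant):
    """Non-interpolated AP: mean of precision@k at each rank k
    where a relevant document is retrieved."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def macro_average(ap_per_query):
    """Macro-average: unweighted mean of per-query AP values."""
    return sum(ap_per_query) / len(ap_per_query)
```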
Resources we used

• General dictionary: ELRA (40k entries)
• Parallel corpora:
  – Hansard corpus (for CCA): French–English only
  – JOC corpus (for lexicon extraction): 300,000 sentences
• Comparable corpora:
  – the CLEF2003 corpora
Summary of approaches

• Semantic projection
  – A semantic, language-independent space is extracted from a parallel training corpus
  – Language-dependent projection matrices are built
  – Both documents and queries are projected
  – The standard cosine measure is then used in the new space to perform IR
• Query translation
  – A probabilistic translation matrix is extracted from a parallel training corpus and from the comparable CLEF corpus
  – Queries are translated with these translation matrices
  – The standard cosine measure is then used between the original documents and the translated query
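The query-translation pipeline above can be sketched in a few lines (a toy illustration with made-up shapes, not the authors' system): multiply the source query vector by a target-by-source translation matrix, then rank documents by cosine similarity in the target space.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def translate_and_rank(q_src, P, docs):
    """Translate a source-language query vector with a
    (target-terms x source-terms) translation matrix P, then
    rank target-language document vectors by cosine similarity."""
    q_tgt = P @ q_src
    scores = [cosine(q_tgt, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```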
Linguistic Preprocessing

• Lemmatized and POS-tagged corpora
• Partial segmentation of German compounds (lexicon-based), plus some simple heuristics
• Normalization of spelling and accentuation (e.g. umlaut and eszett)
• POS-based word filtering (N, V, AD)
• Single-word entries only (for the dictionaries, queries and documents); note that the adopted translation approaches are context-dependent to some extent
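The lexicon-based compound segmentation step might look like the following toy splitter (an assumption-laden sketch, not the actual preprocessing code): greedily cut a word into known lexicon entries, with one simple heuristic allowing a linking "s" (Fugen-s) between parts.

```python
def split_compound(word, lexicon, min_len=3):
    """Greedy lexicon-based splitter for German compounds.
    Returns a list of parts, or [] if no full segmentation
    into lexicon entries is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            # heuristic: allow a linking 's' between compound parts
            if tail.startswith("s") and tail[1:] in lexicon:
                return [head, tail[1:]]
            rest = split_compound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return []
```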
CCA for CLIR

• Given a set of paired observations (paired sentences or paragraphs), Canonical Correlation Analysis finds maximally correlated projections

[Diagram: paired source sentences s1, s2, s3 and target sentences t1, t2, t3 projected onto correlated directions]
CCA for CLIR (II)

• CCA looks for particular combinations of terms that appear to have the same co-occurrence patterns in both languages
• Hypothesis: the only thing both languages have in common is their meaning (conditional independence)
• These (linear) combinations of terms are then able to locate the underlying semantics
• The result is language-independent concepts and the corresponding (language-dependent) projection operators
• Both queries and documents are projected; traditional similarity measures (cosine) are then used for retrieval
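A minimal regularised CCA can be written directly with NumPy (a sketch under simplifying assumptions — dense matrices, a fixed ridge term — not the kernel CCA machinery typically used at this scale): whiten each side's covariance, take the SVD of the whitened cross-covariance, and read off the paired projection matrices.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-3):
    """Regularised linear CCA sketch. X (n x p) and Y (n x q) hold
    paired observation vectors, one language per side. Returns
    projection matrices Wx, Wy onto k maximally correlated
    directions of a shared 'semantic' space."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # inverse square root via eigendecomposition (C is SPD)
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Sx, Sy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Sx @ Cxy @ Sy)
    return Sx @ U[:, :k], Sy @ Vt.T[:, :k]
```

Queries and documents are then mapped with `X @ Wx` or `Y @ Wy` and compared with cosine similarity in the shared space.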
Extraction of bilingual resources

• Upper bound of the coverage for the CLEF200x English query terms
• Automatically extracted lexicons provide better coverage, but translation accuracy can be degraded
• Some form of trade-off between the resources (manual/automatic) is needed

[Bar chart: coverage of English query terms — ELRA: 0.78, Oxford-HT: 0.78, Hansard: 0.8, JOC: 0.9, ML4: 0.98]
Extracting lexicons from parallel corpora

• Statistical alignment methods:
  – starting from alignment at the sentence level
  – Iterative Proportional Fitting Procedure (normalizing and restoring consistency in the raw co-occurrence matrix of source/target terms in aligned sentences)
  – Probabilistic translation matrix: P1(t|s)
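The Iterative Proportional Fitting step can be sketched as follows (a generic IPF on a positive co-occurrence matrix, with illustrative marginals; the authors' exact fitting targets are not specified here): alternately rescale rows and columns until both sums match the prescribed marginals, then row-normalise to obtain P1(t|s).

```python
import numpy as np

def ipfp(counts, row_marg, col_marg, iters=50):
    """Iterative Proportional Fitting: rescale a raw source/target
    co-occurrence matrix (positive entries) so that its row and
    column sums match the given marginals."""
    M = counts.astype(float).copy()
    for _ in range(iters):
        M *= (row_marg / M.sum(axis=1))[:, None]   # fit row sums
        M *= (col_marg / M.sum(axis=0))[None, :]   # fit column sums
    return M

def translation_probs(M):
    """Row-normalise the fitted matrix into P1(t|s): one source
    term per row, probabilities over target terms."""
    return M / M.sum(axis=1, keepdims=True)
```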
Extracting lexicons from comparable corpora

• Assumption: if two words are mutual translations, their more frequent collocates are likely to be mutual translations as well
• Corresponding method:
  – Build context vectors for source words s: CV(s)
  – Build context vectors for target words t: CV(t)
  – Translate the context vectors using a standard dictionary (as a bootstrap): TR(CV(t))
  – Compute the similarity between s and t as cos(CV(s), TR(CV(t)))
  – Normalize the similarities to yield a probabilistic translation lexicon P2(t|s)
  – NB: context vectors are based on windows centered on s or t, weighted by some association measure (such as mutual information); the word itself is included in the CV, giving a bias toward dictionary entries
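The steps above can be sketched with sparse context vectors stored as dicts (toy data structures and helper names, not the actual extraction code): translate each target word's context vector through the seed dictionary, score it against the source word's context vector with cosine, and normalise into P2(t|s).

```python
import math

def cosine(u, v):
    """Cosine between sparse vectors stored as term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def translate_cv(cv_t, dictionary):
    """TR(CV(t)): map a target-language context vector into the
    source language with a seed dictionary (word -> translations)."""
    out = {}
    for w, weight in cv_t.items():
        for s in dictionary.get(w, []):
            out[s] = out.get(s, 0.0) + weight
    return out

def translation_lexicon(cv_s, cv_targets, dictionary):
    """Score each target word t against source word s by
    cos(CV(s), TR(CV(t))), then normalise into P2(t|s)."""
    sims = {t: cosine(cv_s, translate_cv(cv, dictionary))
            for t, cv in cv_targets.items()}
    z = sum(sims.values())
    return {t: v / z for t, v in sims.items()} if z else sims
```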
Hybrid method: model combination

• In some cases the information provided by the comparable corpus is more reliable; in other cases the information extracted from the parallel one is best.
• We adopted a simple linear combination scheme, but more elaborate approaches exist:

q_t = (λ P1(t|s) + (1 − λ) P2(t|s)) q_s

• We optimized λ on the 2000–2002 queries (performance measure: average precision)
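The linear combination and the tuning of the mixing weight can be sketched as follows (the grid search and the `score_fn` callback are illustrative assumptions; the slides only say λ was optimised for average precision on the training queries):

```python
import numpy as np

def hybrid_translate(q_src, P1, P2, lam):
    """q_t = (lam * P1 + (1 - lam) * P2) @ q_s, mixing the
    parallel-corpus matrix P1 with the comparable-corpus matrix P2."""
    return (lam * P1 + (1.0 - lam) * P2) @ q_src

def tune_lambda(queries, P1, P2, score_fn, grid=np.linspace(0, 1, 11)):
    """Pick the lam on a grid that maximises a supplied retrieval
    score (e.g. average precision) over training queries."""
    return max(grid, key=lambda lam: sum(
        score_fn(hybrid_translate(q, P1, P2, lam)) for q in queries))
```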
Multilingual merging

• As we used consistent translation matrices and weighting schemes for all languages, only length normalization was performed before merging the scores
• We also extracted a P2(t|s) translation matrix for English; this realizes a form of query expansion based on contextual similarity
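One plausible reading of "length normalization before merging" is the following sketch (an assumption on my part; the slides do not specify the exact normalisation): divide each per-language run's scores by that run's score norm, then pool and re-sort.

```python
import math

def merge_runs(runs):
    """Merge per-language ranked lists into one multilingual run.
    Each run is a list of (doc_id, score) pairs; scores are divided
    by the run's Euclidean norm before pooling, so no language
    dominates purely through its score scale."""
    pooled = []
    for run in runs:
        norm = math.sqrt(sum(s * s for _, s in run)) or 1.0
        pooled.extend((d, s / norm) for d, s in run)
    return sorted(pooled, key=lambda x: -x[1])
```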
Weighting schemes

• For submission:
  – Documents: ltc
  – Query:
    • before translation: ntn
    • after translation: nnc
• After submission:
  – Documents: Lnu
  – Query: ntn (before), ntc (after)
• Measure of association in the context vectors: mutual information
  – Window size: 5
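These three-letter codes are SMART weighting triples (tf component / idf component / normalisation). As an illustration, the document scheme "ltc" — logarithmic tf, idf, cosine normalisation — can be sketched like this (a generic SMART-style formula, not the exact variant the authors' engine used):

```python
import math

def ltc_weights(tf, df, n_docs):
    """SMART 'ltc' weighting for one document: (1 + ln tf) * ln(N/df),
    followed by cosine (unit-length) normalisation.
    tf: term -> raw count in the document; df: term -> document
    frequency in the collection; n_docs: collection size."""
    w = {t: (1.0 + math.log(c)) * math.log(n_docs / df[t])
         for t, c in tf.items() if c > 0}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}
```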
Results (I)

• CCA: failed
  – Only bilingual
  – Based on a small subset of Hansard (disjoint from CLEF2003)
  – The training corpus was reduced to 1000 paragraphs to be practically feasible and to provide results on time
  – To be extended in the future
Results (II) – 2000, 2001 and 2002 queries

[Bar chart: average precision (scale 0–0.45) for bilingual runs (before merging) and multilingual runs (after merging)]
Results (details) – 2000, 2001 and 2002 queries

Average precision             | ELRA  | Parallel | Comparable | Hybrid | Monolingual
Bilingual (before merging)    | 0.29  | 0.365    | 0.228      | 0.388  | 0.444
Multilingual (after merging)  | 0.192 | 0.289    | 0.165      | 0.302  | 0.361
ENG                           | 0.35  | 0.35     | 0.364      | 0.378  | 0.363
FRE                           | 0.271 | 0.362    | 0.188      | 0.389  | 0.449
GER                           | 0.276 | 0.361    | 0.203      | 0.38   | 0.475
SPA                           | 0.304 | 0.411    | 0.221      | 0.431  | 0.439
Results of hybridization parallel/comparable

[Line plot: average precision (0.15–0.4) vs. percentage of the parallel matrix in the P_par–P_comp combination, CLEF2003; two curves: bilingual and multilingual]
Results (details) … after submission

• Mainly focused on changing the weighting scheme (Lnu)
• Average precision (retrieval limited to 1000 documents):

Setting                               | Average precision
ltc/ntn/nnc (submitted)               | 0.1860
Lnu/ntn/nnc (same tuning as submitted)| 0.2118
Lnu/ntn/ntc (re-optimised tuning)     | 0.2341
Conclusions

• Exploiting parallel and comparable corpora to enhance query translation clearly improves CLIR performance
• Compared with the monolingual reference line, there is still room for improvement
• Different merging strategies must also be investigated