Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.
![Page 1: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/1.jpg)
EXTRACTION OF TRANSLATION CORRESPONDENCES FROM A PARALLEL CORPUS
USING METHODS OF DISTRIBUTIONAL SEMANTICS
Yuliya Morozova
Institute for Informatics Problems of the Russian Academy of Sciences, Moscow
![Page 2: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/2.jpg)
Distributional semantics
New area of linguistic research
Inferring semantic properties of linguistic units from corpora
Theoretical foundations: distributional methodology by Z. Harris, F. de Saussure, L. Wittgenstein.
Distributional hypothesis: semantically similar words occur in similar contexts.
J. R. Firth: “You shall know a word by the company it keeps.”
![Page 3: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/3.jpg)
Vector space
drink coffee – occurred 1 time
drink tea – occurred 2 times
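
As a rough illustration of how such counts become a vector, the sketch below tallies co-occurrences from a toy list of (word, context word) pairs. The pair list and the function name `cooccurrence_vector` are assumptions made for this example, not part of the presented system.

```python
from collections import Counter

# Toy observations (assumed for illustration): each pair is (target word, context word).
pairs = [("drink", "coffee"), ("drink", "tea"), ("drink", "tea")]

def cooccurrence_vector(target, pairs, dimensions):
    """Return the co-occurrence counts of `target` with each context word in `dimensions`."""
    counts = Counter(context for word, context in pairs if word == target)
    # Counter returns 0 for context words that never co-occurred with the target.
    return [counts[d] for d in dimensions]

dimensions = ["coffee", "tea"]
drink_vector = cooccurrence_vector("drink", pairs, dimensions)
print(drink_vector)  # [1, 2]
```

Each context word becomes one dimension of the space, so "drink" is represented by the point (1, 2) on the axes "coffee" and "tea".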
![Page 4: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/4.jpg)
Cosine measure of vector similarity
$$
\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}
$$
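
A minimal sketch of the cosine measure in plain Python, assuming two numeric vectors of equal length. The function name `cosine` and the convention of returning 0.0 for an all-zero vector are choices made for this example.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: a word that never occurs is similar to nothing
    return dot / (norm_x * norm_y)

print(cosine([1, 1, 0], [1, 1, 0]))  # identical vectors: value is 1 up to rounding
print(cosine([1, 0, 0], [0, 1, 0]))  # no shared contexts: 0.0
```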
![Page 5: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/5.jpg)
Main application areas
lexical ambiguity resolution
information retrieval
dictionaries of semantic relations
multilingual dictionaries
semantic maps of different domains
modelling of synonymy
document topic detection
sentiment analysis
![Page 6: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/6.jpg)
The present research
Goal: to apply distributional semantics models to the extraction of translation correspondences from a parallel corpus.
Vector space model + test corpus
![Page 7: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/7.jpg)
Test corpus
Patent texts in French translated into Russian
Texts split into sentences
Alignment at the sentence level, manually verified (in the visual editor MakeBilingua)
Uploaded to the Sketch Engine corpus manager
![Page 8: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/8.jpg)
Preprocessing
Lemmatization
Frequent words removed (prepositions, conjunctions, etc.)
Punctuation marks removed
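
The three preprocessing steps can be sketched as follows. This is only an illustration: the lemma dictionary `LEMMAS`, the stop list `STOP_WORDS`, and the input sentence are hypothetical toy data, and the talk does not say which lemmatizer or stop list was actually used. Punctuation is stripped first here so that the dictionary lookup works; that ordering is also an assumption.

```python
import string

# Hypothetical toy resources; a real pipeline would use a full lemmatizer and stop list.
LEMMAS = {
    "présente": "présent",
    "concerne": "concerner",
    "liants": "liant",
    "minéraux": "minéral",
    "hydrauliques": "hydraulique",
}
STOP_WORDS = {"la", "le", "les", "des", "de", "et", "en"}

def preprocess(sentence):
    """Lowercase, strip punctuation, lemmatize, and drop frequent function words."""
    tokens = [t.strip(string.punctuation) for t in sentence.lower().split()]
    tokens = [t for t in tokens if t]             # drop tokens that were pure punctuation
    lemmas = [LEMMAS.get(t, t) for t in tokens]   # unknown words are kept unchanged
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]

# Assumed surface sentence, invented for this example.
print(preprocess("La présente invention concerne des liants minéraux, notamment hydrauliques."))
```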
![Page 9: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/9.jpg)
Vector space model
Type of linguistic units: single words
Type of context: aligned regions
Frequency measure: Boolean frequency (equal to either 1 or 0)
Method used to compute the distance between vectors: cosine measure
![Page 10: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/10.jpg)
Example (aligned region as a context)
Aligned region #1
présent invention concerner liant minéral notamment hydraulique
настоящий изобретение касаться неорганический связующий частность гидравлический связующий
![Page 11: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/11.jpg)
Example (vector space)
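| | Aligned region #1 | Aligned region #2 | Aligned region #3 |
|---|---|---|---|
| présent | 1 | … | … |
| invention | 1 | … | … |
| concerner | 1 | … | … |
| настоящий | 1 | … | … |
| изобретение | 1 | … | … |
| касаться | 1 | … | … |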
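
A sketch of how such Boolean vectors might be built, with one dimension per aligned region. Only the first region below uses lemmas from the slide; the second and third regions are invented for illustration and are not taken from the corpus. The function name `boolean_vectors` is likewise an assumption.

```python
def boolean_vectors(regions):
    """Map each lemma to a 0/1 vector: 1 if the lemma occurs in that aligned region."""
    vocabulary = sorted({lemma for region in regions for lemma in region})
    return {
        lemma: [1 if lemma in region else 0 for region in regions]
        for lemma in vocabulary
    }

# Each aligned region pools the French and Russian lemmas of one sentence pair.
regions = [
    {"présent", "invention", "concerner", "настоящий", "изобретение", "касаться"},
    {"invention", "изобретение"},   # invented region, not from the corpus
    {"présent", "настоящий"},       # invented region, not from the corpus
]

vectors = boolean_vectors(regions)
print(vectors["invention"])    # [1, 1, 0]
print(vectors["изобретение"])  # [1, 1, 0]
```

Because a word and its translation tend to appear in the same aligned regions, their Boolean vectors tend to coincide, which is what the cosine measure then picks up.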
![Page 12: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/12.jpg)
Results
A list of translation correspondences.
Linguistic filter: the same part of speech.
Precision: 78%.
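
One way such a list could be produced and scored is sketched below. The talk does not spell out the selection rule, so picking the highest-cosine candidate with the same part of speech is an assumption, as are the function names, the toy vectors, the POS tags, and the gold standard. The resulting precision on this toy data says nothing about the 78% reported for the real corpus.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

def best_correspondences(fr_vectors, ru_vectors, pos):
    """For each French lemma, pick the same-POS Russian lemma with the highest cosine."""
    found = {}
    for fr, fr_vec in fr_vectors.items():
        candidates = [
            (cosine(fr_vec, ru_vec), ru)
            for ru, ru_vec in ru_vectors.items()
            if pos[ru] == pos[fr]            # linguistic filter: same part of speech
        ]
        if candidates:
            found[fr] = max(candidates)[1]
    return found

def precision(found, gold):
    """Share of proposed correspondences that agree with the gold standard."""
    correct = sum(1 for fr, ru in found.items() if gold.get(fr) == ru)
    return correct / len(found) if found else 0.0

# Toy Boolean vectors over three aligned regions (invented for illustration).
fr_vectors = {"invention": [1, 1, 0], "liant": [1, 0, 0], "présent": [1, 0, 1]}
ru_vectors = {"изобретение": [1, 1, 0], "связующий": [1, 0, 0], "настоящий": [1, 0, 1]}
pos = {"invention": "NOUN", "liant": "NOUN", "présent": "ADJ",
       "изобретение": "NOUN", "связующий": "NOUN", "настоящий": "ADJ"}
gold = {"invention": "изобретение", "liant": "связующий", "présent": "настоящий"}

found = best_correspondences(fr_vectors, ru_vectors, pos)
print(found)
print(precision(found, gold))
```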
![Page 13: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/13.jpg)
Correspondences with different POS
Syntactic transformations
verbal infinitive (French) → noun (Russian)
traiter (“to process”) → обработка (“processing”)
noun (French) → adjective (Russian)
crochet (“hook”) → крюкообразный (“hook-shaped”)
verbal infinitive (French) → adjective (Russian)
connaître (“to know”) → известный (“well-known”)
![Page 14: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/14.jpg)
Correspondences with different POS
Parts of multi-word expressions
au moins (“at least”) → по меньшей мере (“at least”)
The output of the program: moins → мера
![Page 15: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/15.jpg)
Evaluation
Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009.
Vector space model + similarity measures PMI, T-score, Log-likelihood ratio and Dice coefficient.
Precision – 53%.
![Page 16: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/16.jpg)
Conclusion
Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision (78% on the test corpus).
It can be used to study productive syntactic transformations occurring in translation.
The present vector space model needs to be enhanced to take into account multi-word expressions.
![Page 17: Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.](https://reader036.fdocuments.in/reader036/viewer/2022081700/56649db55503460f94aa61c6/html5/thumbnails/17.jpg)
Thank you!