Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation G....
Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation
G. Craig Murray et al.
COLING 2006
Reporter: Yong-Xiang Chen

Background and Problem
- Thesauri and ontologies provide important value in facilitating access to digital archives by
  – representing underlying principles of organization
- Translation of such resources into multiple languages is an important component
  – Specificity of vocabulary terms in most ontologies precludes fully-automated machine translation using general-domain lexical resources

Research Approach
- Present an efficient process for leveraging human translations when constructing domain-specific lexical resources
- Evaluate the effectiveness of this process by producing a probabilistic phrase dictionary and translating a thesaurus of 56,000 concepts used to catalogue a large archive of oral histories

Purpose
- If we need humans to assist in the translation process, how can we maximize access while minimizing cost?
- Reuse!! Useful First!!

Specific Problem
- Most digital collections of any significant size use a system of organization that facilitates easy access to collection contents
  – The organizing principles are captured in the form of a controlled vocabulary of keyword phrases
  – Usually arranged in a hierarchic thesaurus or ontology

Idea
- Collected 3,000 manual translations of keyword phrases
- Reused the translated terms to generate a lexicon for automated translation of the rest of the thesaurus
- Priority is in terms of
  – value in accessing the collection
  – the reusability of their component terms

Checked and Aligned the Translations
- Translations collected from one human informant
- Checked and aligned to the original English terms by a second informant
- Induce a probabilistic English-Czech phrase dictionary

Maximizing Value and Reusability
- To quantify the utility of 3,000 manual translations of keyword phrases
  – Define two values for each keyword phrase in the thesaurus
- Thesaurus value
  – representing the importance of the keyword phrase for providing access to the collection
- Translation value
  – representing the usefulness of having the keyword phrase translated
- The second is related to the first

Keyword hierarchy

Meaning of nodes
- Internal (non-leaf) nodes of the hierarchy are used to organize concepts and support concept browsing
- Leaf nodes are very specific and are only used to index video content

Example of Thesaurus value
- The keyword phrase “Auschwitz II-Birkenau (Poland: Death Camp)”
  – Describes a Nazi death camp
  – Assigned to 17,555 video segments in the collection
  – Has broader (parent) terms and narrower (child) terms
- “German death camps” is not assigned to any video segments
- However, “German death camps” has very important narrower terms, including “Auschwitz II-Birkenau” and others

Thesaurus value
- Represents the importance of each keyword phrase to the thesaurus
- An internal node is valuable in providing access to its children
- But if a node were valued by the summed value of all its children, grandchildren, etc., the resulting calculation would be biased toward the top of the hierarchy

Thesaurus value
- Leaf node: the number of video segments to which the concept has been assigned
- Parent node: the number of video segments (if any) assigned to the node itself, plus the average of the thesaurus values of its child nodes
- The final values quantify how valuable the translation of any given keyword phrase would be in providing access to video segments
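
The leaf and parent rules above can be sketched as a short recursion. This is only an illustration: the `Node` class, its field names, and the sibling count of 445 are hypothetical, and "Auschwitz II-Birkenau" is simplified to a leaf here even though it has narrower terms in the real thesaurus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical thesaurus node; the field names are illustrative."""
    phrase: str
    segments: int = 0  # number of video segments indexed with this phrase
    children: List["Node"] = field(default_factory=list)


def thesaurus_value(node: Node) -> float:
    """Own segment count plus the average thesaurus value of the children."""
    if not node.children:
        return float(node.segments)
    child_values = [thesaurus_value(child) for child in node.children]
    return node.segments + sum(child_values) / len(child_values)


# Toy hierarchy: 17,555 is the count quoted in the slides; 445 is made up.
birkenau = Node("Auschwitz II-Birkenau (Poland: Death Camp)", segments=17555)
other_camp = Node("hypothetical sibling camp", segments=445)
german_camps = Node("German death camps", children=[birkenau, other_camp])

print(thesaurus_value(german_camps))  # 9000.0
```

Averaging rather than summing the children keeps an unassigned parent such as "German death camps" on the same scale as its children instead of inflating values toward the root.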

Translation value
- Compute the translation value for each word in the vocabulary as the sum of the thesaurus value for every keyword phrase that contains that word
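
A minimal sketch of this per-word sum, assuming lowercase whitespace tokenization (the slides do not say how phrases were tokenized). The function name and the three phrase values are invented for illustration.

```python
from collections import defaultdict
from typing import Dict


def translation_values(phrase_values: Dict[str, float]) -> Dict[str, float]:
    """Sum the thesaurus value of every keyword phrase containing each word."""
    totals: Dict[str, float] = defaultdict(float)
    for phrase, value in phrase_values.items():
        # set() so a word repeated inside one phrase is counted only once
        for word in set(phrase.lower().split()):
            totals[word] += value
    return dict(totals)


# Made-up thesaurus values for three made-up phrases.
phrase_values = {
    "german death camps": 9000.0,
    "death marches": 500.0,
    "labor camps": 250.0,
}
word_values = translation_values(phrase_values)
print(word_values["death"])  # 9500.0
print(word_values["camps"])  # 9250.0
```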

Use these values
- The end result is a list of vocabulary words and the impact that correct translation of each word would have on the overall value of the translated thesaurus
- We elicited human translations of entire keyword phrases rather than individual vocabulary terms

Prioritize
- The value gained by translating any given phrase is more accurately estimated by the total value of any untranslated words it contains
- Prioritized the order of keyword phrase translations based on the translation value of the untranslated words in each keyword phrase
- Prioritizing their translation based on the assumption that any words contained in a keyword phrase of higher priority would already have been translated
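
One way to read this ordering is as a greedy loop: pick the phrase whose still-untranslated words carry the most translation value, mark its words as translated, and repeat. The sketch below follows that reading; the function name, the `budget` parameter, and the word values are assumptions, not details given in the slides.

```python
from typing import Dict, List, Set


def prioritize(phrases: List[str], word_values: Dict[str, float],
               budget: int) -> List[str]:
    """Greedily pick phrases whose untranslated words carry the most value."""
    translated: Set[str] = set()
    remaining = list(phrases)
    order: List[str] = []

    def gain(phrase: str) -> float:
        # Only words not covered by a higher-priority phrase still count.
        new_words = set(phrase.split()) - translated
        return sum(word_values.get(word, 0.0) for word in new_words)

    while remaining and len(order) < budget:
        best = max(remaining, key=gain)
        if gain(best) == 0.0:
            break  # every remaining word is already covered
        order.append(best)
        remaining.remove(best)
        translated.update(best.split())
    return order


# Made-up translation values per word.
word_values = {"german": 9000.0, "death": 9500.0, "camps": 9250.0,
               "marches": 500.0, "labor": 250.0}
phrases = ["labor camps", "death marches", "german death camps"]
print(prioritize(phrases, word_values, budget=3))
# ['german death camps', 'death marches', 'labor camps']
```

After the first pick, "death" and "camps" no longer contribute, so the remaining phrases are ranked only by "marches" and "labor".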

Alignment
- Obtained professional translations for the top 3,000 English keyword phrases
- The translations were tokenized and presented to a second informant, another bilingual Czech speaker, for verification and alignment
- The alignment process was then used to build a probabilistic dictionary of words and phrases
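
The slides do not state how the probabilities were estimated. A common choice, assumed here, is relative frequency over the aligned pairs. In this sketch the function name is hypothetical and `cz_form_1` / `cz_form_2` are placeholders standing in for real Czech strings.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_dictionary(
    aligned_pairs: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Estimate P(czech | english) as a relative frequency over aligned pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for english, czech in aligned_pairs:
        counts[english][czech] += 1
    dictionary: Dict[str, Dict[str, float]] = {}
    for english, czech_counts in counts.items():
        total = sum(czech_counts.values())
        dictionary[english] = {cz: n / total for cz, n in czech_counts.items()}
    return dictionary


# Placeholder alignments; only the first pair comes from the slides.
pairs = [
    ("monasteries and convents", "kláštery"),
    ("camps", "cz_form_1"),
    ("camps", "cz_form_1"),
    ("camps", "cz_form_1"),
    ("camps", "cz_form_2"),
]
dictionary = build_dictionary(pairs)
print(dictionary["camps"])  # {'cz_form_1': 0.75, 'cz_form_2': 0.25}
```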

Example of alignment

Machine Translation
- It first scans the English input to find the longest matching substring in our dictionary, and replaces it with the most likely Czech translation
- Looks up “monasteries and convents stills” in the dictionary
  – finds no translation, backs off to “monasteries and convents”
  – translated to “kláštery”
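
The back-off behaviour in this example can be sketched as a greedy longest-match lookup. Two details are assumptions: the scan runs left to right over whitespace tokens, and tokens with no dictionary entry are passed through unchanged. The slides specify neither.

```python
from typing import Dict, List


def translate(english: str, dictionary: Dict[str, Dict[str, float]]) -> str:
    """Greedy longest-match lookup with back-off to shorter spans."""
    tokens = english.split()
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i first, then back off.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in dictionary:
                candidates = dictionary[span]
                output.append(max(candidates, key=candidates.get))
                i = j
                break
        else:
            output.append(tokens[i])  # no entry: keep the English token
            i += 1
    return " ".join(output)


dictionary = {"monasteries and convents": {"kláštery": 1.0}}
print(translate("monasteries and convents stills", dictionary))
# kláštery stills
```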
Gain rate of access value

Experiment
- MALACH project
  – an NSF-funded effort to improve multilingual information access to large archives of spoken language
- Leverages a small set of manually acquired English-Czech translations to translate a large ontology of keyword phrases

Evaluation
- Compared our system output to human reference translations using Bleu (Papineni et al., 2002)
- Showed corrected and uncorrected machine translations to Czech speakers and collected subjective judgments of fluency and accuracy
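
For readers unfamiliar with Bleu, the sketch below shows its core idea: clipped n-gram precision combined by a geometric mean and scaled by a brevity penalty. It is a simplified single-reference, sentence-level version with `max_n=2`, chosen here because keyword phrases are short. It is not the configuration used in the evaluation, which the slides do not describe, and the tokens are placeholders.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 2) -> float:
    """Simplified Bleu: clipped n-gram precision with a brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Placeholder tokens, not real system output.
print(bleu(["a", "b", "c"], ["a", "b", "c"]))  # 1.0
print(bleu(["a", "b", "d"], ["a", "b", "c"]))  # about 0.577
```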

Bleu Scores

Subjective Judgment Score
- Selected 418 keyword phrases to be used as target translations

Conclusion
- Demonstrate that prioritization based on hierarchical position and frequency of use facilitates extremely efficient reuse of human input
- Evaluations show that our technique boosts the performance of a simple translation system by 65%.