The Web as a Parallel Corpus
description
Transcript of The Web as a Parallel Corpus
The Web as a Parallel Corpus
Philip Resnik, Noah A. Smith, Computational Linguistics, 29, 3, pp. 349 – 380, MIT Press,2004.
University of Maryland, Johns Hopkins University
Abstract
• STRAND system for mining parallel text on the World Wide Web– reviewing the original algorithm and results– presenting a set of significant enhancements
• use of supervised learning based on structural features of documents to improve classification performance
• new content-based measure of translational equivalence• adaptation of the system to take advantage of the Internet
Archive for mining parallel text from the Web on a large scale
• construction of a significant parallel corpus for a low-density language pair
Introduction
• Parallel corpora, bitexts;– for automatic lexical acquisition (Gale and Ch
urch 1991; Melamed 1997)– provide indispensable training data for statisti
cal translation models (Brown et al. 1990; Melamed 2000; Och and Ney 2002)
– provide the connection between vocabularies in cross-language information retrieval (Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997)
Recent works at UM&JHU
• exploit parallel corpora in order to develop monolingual resources and tools, using a process of annotation, projection, and training
• given a parallel corpus in English and a less resource-rich language
• project English annotations across the parallel corpus to the second language
• using word-level alignments as the bridge, and then use robust statistical techniques in learning from the resulting noisy annotations
(Cabezas, Dorr, and Resnik 2001; Diab and Resnik 2002;Hwa et al. 2002; Lopez et al. 2002; Yarowsky, Ngai, and Wicentowski 2001; Yarowsky and Ngai 2001; Riloff, Schafer, and Yarowsky 2002).
parallel corpora as a critical resource
• not readily available in the necessary quantities• heavily on French-English
– because the Canadian parliamentary proceedings (Hansards) in English and French were the only large bitext available
– United Nations proceedings (LDC)– religious texts (Resnik, Olsen, and Diab 1999)– software manuals (Resnik and Melamed 1997; Menez
es and Richardson 2001)
• tend to be unbalanced, representing primarily governmental or newswire-style texts
World Wide Web
• People tend to see the Web as a reflection of their own way of viewing the world– a huge semantic network– an enormous historical archive– a grand social experiment
• a great big body of text waiting to be mined• a huge fabric of linguistic data often interwoven
with parallel threads
STRAND
• (Resnik 1998, 1999)– structural translation recognition acquiring nat
ural data– identify pairs of Web pages that are mutual tra
nslations
• Incorporating new work on content-based detection of translations (Smith 2001, 2002)
• efficient exploitation of the Internet Archive
Finding parallel text on the Web
• Location of pages that might have parallel translations
• Generation of candidate pairs that might be translations
• Structural filtering out of nontranslation candidate pairs
Locating Pages
• AltaVista search engine’s advanced search to search for two types of Web pages: parents and siblings.– Ask AV with regular expressions
• (anchor:"english" OR anchor:"anglais")• (anchor:"french" OR anchor:"fran¸cais").
• spider component– The results reported here do not make use of
the spider.
Generating Candidate Pairs
• with URL– http://mysite.com/english/home en.html, on w
hich one combination of substitutions might produce the URLhttp://mysite.com/big5/home ch.html.
• Another possible criterion for matching is the use of document lengths.– text E in language 1 and text F in language 2,
length(E) = C * length(F), where C is a constant tuned for the language pair
Structural Filtering
• linearize the HTML structure and ignore the actual linguistic content of the documents
• align the linearized sequences using a standard dynamic programming technique (Hunt and McIlroy 1975)
STRAND Results (1)
• Recall in this setting is measured relative to the set of candidate pairs that was generated
• Precision– “Was this pair of pages intended to provide th
e same content in the two different languages?”
– Asking the question in this way leads to high rates of interjudge agreement, as measured using Cohen’s µ measure
STRAND Results (2)
• Using the manually set thresholds for dp and n,we have obtained 100% precision and 68.6% recall in an experiment using STRAND to find English-French Web pages (Resnik 1999)
• to obtain English-Chinese pairs and in a similar formal evaluation, we found that the resulting set had 98% precision and 61% recall
Assessing the STRAND Data
• a translation lexicon automatically extracted from the French-English STRAND data could be combined productively with a bilingual French-English dictionary in order to improve retrieval results using a standard cross-language IR test collection (English queries against the CLEF-2000 French collection)– backing off from the dictionary to the STRAND
translation lexicon accounted for over 8% of the lexicon matches (by token)
– reducing the number of untranslatable terms by a third and producing a statistically significant 12% relative improvement in mean average precision as compared to using the dictionary alone
• 30 human-translated sentence pairs from the FBIS (Release 1) English-Chinese parallel corpus, sampled at random.
• 30 Chinese sentences from the FBIS corpus, sampled at random, paired with their English machine translation output from AltaVista’s Babelfishhttp://babel.sh.altavista.com
• 30 paired items from Chinese-English Web data, sampled at random from “sentence-like” aligned chunks as identified using the HTML-based chunk alignment process of STRAND
Optimizing Parameters Using Machine Learning
• Using the English-French data, we constructed a ninefold cross-validation experiment using decision tree induction to predict the class assigned by the human judges– The decision tree software was the widely used C5.0
Content-Based Matching
– Ma and Liberman (1999) point out, not all translators create translated pages that look like the original page
– structure-based matching is applicable only in corpora that include markup
– other applications for translation detection exist
• a generic score of translational similarity that is based upon any word-to-word translation lexicon– Define a CL sim score, tsim
Quantifying Translational Similarity
• (Melamed’s [2000] Method A)– link be a pair (x, y)– maximum-weighted bipartitie matching proble
m O(max(|X|,|Y|)3)
原來的方法只用 Doc 的 structure
現在要用字字間的 sim– 所以要有 lexicon
Exploiting the Internet Archive
• The Internet Archive 120TB (+8TB per month) http://www.archive.org/web/researcher/ is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public.
• Wayback Machine– Temporal database, but not stored in temporal order– Need decompression– Almost data are stored in compressed plain-text files
STRAND on the Archive
• Extracting URLs from index files using simple pattern matching
• Combining the results from step 1 into a single huge list
• Grouping URLs into buckets by handles
• Generating candidate pairs from buckets
URLs similarity
• We arrived at an algorithmically simple solution that avoids this problem but is still based on the idea of language-specific substrings (LSSs).– The idea is to identify a set of language-specif
ic URL substrings that pertain to the two languages of interest (e.g., based on language names, countries, character codeset labels, abbreviations, etc.)
Building an English-Arabic Corpus
• Finding English-Arabic Candidate Pairs on the Internet Archive– 24 top-level national domains
• Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw) …
– 21 .com• emiratesbank.com, checkpoint.com …
Future Work
• Weight in the dictionary can be exploited– IR IDF scores for discerning
• Bootstrapping– Seed to form high-precision initial classifiers
Search Engine Estimate• According to (Dec. 31, 2002) www.searchengineshowdown.com
– Google 3,033 millions– Altavista 1,689 millions 1:1.795
estimates are based on exact counts obtained from AlltheWeb multiplied by the percentage of Relative Size Showdown as compared to the number found by AlltheWeb
• Our estimate for Chinese page count– ASBC 去掉 標點符號 ABC ㄅㄆㄇ 未辨識 等符號
共使用 5915 個 term[ 單字詞 ] ( 共出現 7,844,439 次 )用這 5919 個 term 的中間 2000 名跟 google 和 altavista的 relative count 5021 倍 2181 倍google 是 altavista 的 2.3 倍