![Page 1: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/1.jpg)
Finding parallel texts on the web using cross-language information retrieval
Achim Ruopp
Joint work with Fei Xia
University of Washington
![Page 2: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/2.jpg)
An early parallel text
![Page 3: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/3.jpg)
Uses for Parallel Corpora
• Parallel corpora are valuable resources for natural language processing (NLP)
– Machine translation
– Cross-lingual information retrieval (IR)
  • E.g. PanImages from the University of Washington
  • Cross-lingual image search system
  • http://www.panimages.org/
– Computer Aided Human Translation
– Monolingual NLP via information projection
– …
![Page 4: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/4.jpg)
What is a Parallel Text or Parallel Corpus?
• Translated text/documents in two languages
• Ideally sentence-aligned (e.g. using method from Gale & Church 1993)
![Page 5: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/5.jpg)
Examples for Parallel Corpora
• EUROPARL - European parliament proceedings
– 10 language pairs
– About 44 million words/language
• Canadian parliament proceedings (Hansard)
– English – French
• Software documentation in multiple languages
• …
![Page 6: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/6.jpg)
Motivation
• Problem: Parallel corpora exist only for a limited set of language pairs
• Problem: Available parallel corpora are often very domain-specific
• Problem: Available parallel corpora are often small
• Task: Finding parallel texts on the Web
![Page 7: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/7.jpg)
Example & Walk-Through
Previous work:
• Ma and Liberman (1999)
• Chen and Nie (2000)
• Resnik and Smith (2003)
![Page 8: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/8.jpg)
Main Steps in Identifying Parallel Text on the Web (Resnik and Smith, 2003)
1. Locating pages that might have parallel translations
2. Generating candidate page pairs that might be translations
3. Filtering out non-translation candidate pairs
![Page 9: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/9.jpg)
Our approach
1. Locating pages that might have parallel translations: sampling by sending queries
2. Generating candidate page pairs that might be translations: comparing URLs with different matching methods
3. Filtering out non-translation candidate pairs: combining structural and content-based filtering
![Page 10: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/10.jpg)
System Overview
[Diagram: system components: Web, Sampling (language L1), Checking (language L2), Match, Filter]
![Page 11: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/11.jpg)
Outline
• System description
(1a) Sampling the source language L1
(1b) Checking pages in the target language L2
(2) Matching pages in L1 and L2
(3) Filtering page pairs
• Experiments
• Conclusion and future work
![Page 12: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/12.jpg)
(1a) One-term Sample
• Sample
– Search engine query of one term
– Limited to pages in source language
– Optional parameter: inurl:<2-letter language ID>
– Submitted to search engine API
– Search engine does automatic stemming
– 100 pages in result set
![Page 13: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/13.jpg)
(1a) Choosing Terms
• Dictionary
– Built using Giza++ word alignment tool
– Trained on years 2001-2003 of the Europarl corpus
– Contains IBM Model 1 translation probabilities
• Sampling term
– Selected from source language vocabulary
– Mid-frequency term
  • Selected at random using a normal distribution
  • Goal: Avoid domain-specificity
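A minimal sketch of how a mid-frequency sampling term might be drawn. The slides do not give the distribution's parameters, so this assumes the vocabulary is ranked by frequency and that the normal distribution is centred on the middle rank; `pick_sampling_term`, `spread`, and the toy counts are hypothetical.

```python
import random
from collections import Counter

def pick_sampling_term(term_counts, rng, spread=0.1):
    """Draw one term whose frequency rank lies near the middle of the vocabulary."""
    # Rank terms from most to least frequent (ties broken alphabetically).
    ranked = sorted(term_counts, key=lambda t: (-term_counts[t], t))
    mean = (len(ranked) - 1) / 2            # assumption: centre of the rank list
    sigma = max(1.0, spread * len(ranked))  # assumption: spread as a share of vocabulary size
    while True:
        idx = round(rng.gauss(mean, sigma))
        if 0 <= idx < len(ranked):          # redraw if the sample falls off either end
            return ranked[idx]

# Toy vocabulary with invented counts, for illustration only.
counts = Counter({"the": 900, "of": 700, "security": 40, "travelers": 35,
                  "inconveniences": 30, "shannon": 3, "zygote": 1})
term = pick_sampling_term(counts, random.Random(0))
```

Very frequent function words and very rare words both sit far from the centre of the rank list, so they are drawn only occasionally.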
![Page 14: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/14.jpg)
(1a) Source Language Expansion
• From one-term to n-term queries
• Common IR query expansion technique
• Based on page summaries returned by the one-term sampling query
• Summary terms ranked by frequency
• Leads to semantically related terms because of relevancy ranking of search engine results
“shannon” → “information claude”
“inconveniences” → “security travelers”
• Original term expanded with one or more expansion terms and re-submitted to search engine
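The expansion step can be sketched as follows, assuming the page summaries are available as plain strings. The stop list, the tokenisation, and the example summaries are invented for illustration; only the idea of ranking summary terms by frequency comes from the slide.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real system would need a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}

def expansion_terms(seed, summaries, k=2):
    """Return the k most frequent summary terms other than the seed term."""
    counts = Counter()
    for summary in summaries:
        for token in re.findall(r"[^\W\d_]+", summary.lower()):
            if token != seed and token not in STOPWORDS:
                counts[token] += 1
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:k]

def expand_query(seed, summaries, n):
    """Build an n-term query: the seed plus n-1 expansion terms."""
    return " ".join([seed] + expansion_terms(seed, summaries, n - 1))

# Invented summaries standing in for search engine result snippets.
summaries = [
    "Claude Shannon founded information theory",
    "Shannon and the theory of information",
    "Claude Shannon: information entropy",
]
query = expand_query("shannon", summaries, 3)
```

With these toy summaries the three-term query is "shannon information claude", mirroring the first example on the slide.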
![Page 15: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/15.jpg)
(1b) Checking Query
• Sampling query terms translated using the Giza++ dictionary
“inconvenience security travelers” → “unannehmlichkeit sicherheit”
• m-best translations of n sampling terms lead to m^n checking queries
• Optional parameter: inurl:<2-letter language ID>
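To make the query count concrete, here is a hedged sketch that combines the m best translations of each of n sampling terms into m^n checking queries. The dictionary layout (term to translation-probability mapping) and the probabilities are assumptions made for the example, as is the choice to skip terms with no dictionary entry.

```python
from itertools import product

def checking_queries(sampling_terms, dictionary, m=2):
    """Combine the m best translations of each term into target-language queries."""
    options = []
    for term in sampling_terms:
        # dictionary[term] maps translation -> probability (e.g. IBM Model 1 scores).
        candidates = dictionary.get(term, {})
        best = sorted(candidates, key=lambda t: (-candidates[t], t))[:m]
        if best:  # assumption: terms without a dictionary entry are skipped
            options.append(best)
    if not options:
        return []
    # One query per combination: m choices for each of n terms gives m**n queries.
    return [" ".join(combo) for combo in product(*options)]

# Toy dictionary with invented probabilities.
toy_dictionary = {
    "security": {"sicherheit": 0.8, "sicherung": 0.1},
    "travelers": {"reisende": 0.6, "reisenden": 0.3},
}
queries = checking_queries(["security", "travelers"], toy_dictionary, m=2)
```

Here n = 2 and m = 2, so four checking queries are produced; the count grows quickly with n, which is what motivates expanding on the target side instead.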
![Page 16: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/16.jpg)
(1b) Target language expansion
• Alternative to translating a complete n-term sampling query
• Only translate original one-term sample
• Expand on target language side in the same way as on source language side
• m checking queries instead of m^n
• Efficiency vs. source language expansion evaluated in experiments
![Page 17: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/17.jpg)
(1b) “site:” Parameter
• Optional: site parameter
– Allows sites retrieved in checking query to be restricted to sites returned in sampling query
– Search engine limits to sites of first 30 sampling query page results
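A sketch of how a checking query with the optional parameters might be assembled. The slides name the `inurl:` and `site:` parameters but not the exact query syntax, so joining several sites with `OR` inside parentheses is an assumption; `sites_from_results` and `build_query` are hypothetical helpers.

```python
from urllib.parse import urlparse

def sites_from_results(result_urls, limit=30):
    """Collect distinct host names from the first `limit` sampling results."""
    sites = []
    for url in result_urls[:limit]:
        host = urlparse(url).netloc.lower()
        if host and host not in sites:
            sites.append(host)
    return sites

def build_query(terms, inurl=None, sites=()):
    """Join query terms with the optional inurl: and site: restrictions."""
    parts = list(terms)
    if inurl:
        parts.append(f"inurl:{inurl}")
    if sites:
        # Assumed syntax for restricting to several sites at once.
        parts.append("(" + " OR ".join(f"site:{s}" for s in sites) + ")")
    return " ".join(parts)

sampling_results = [
    "http://ec.europa.eu/a_en.html",
    "http://ec.europa.eu/b_en.html",
    "http://www.example.org/en/",
]
sites = sites_from_results(sampling_results)
query = build_query(["sicherheit"], inurl="de", sites=sites)
```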
![Page 18: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/18.jpg)
(2) Matching URLs with Fixed Language List
• URLs from corresponding sampling and checking result sets
• Considered a match if they differ only in a fixed list of language IDs

| L1 (English) ID | L2 (German) ID |
|---|---|
| en | de |
| en-us | de-de |
| en | ge |
| enu | deu |
| enu | ger |
| english | german |
| englisch | deutsch |
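One way the fixed-list comparison could work, as a sketch: split both URLs on common delimiters and accept the pair when every differing segment is a known language ID pair. The splitting rule (which deliberately leaves hyphens alone so that `en-us` stays whole) is an assumption, not something the slides specify.

```python
import re

# Language ID pairs taken from the slide's list.
LANGUAGE_ID_PAIRS = {
    ("en", "de"), ("en-us", "de-de"), ("en", "ge"), ("enu", "deu"),
    ("enu", "ger"), ("english", "german"), ("englisch", "deutsch"),
}

def split_url(url):
    """Split a URL into segments, keeping the delimiters for comparison."""
    return re.split(r"([/._?=&])", url.lower())

def match_fixed_list(url_l1, url_l2):
    """True if the URLs differ only in segments that are language ID pairs."""
    a, b = split_url(url_l1), split_url(url_l2)
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    # At least one difference is required; identical URLs are not a pair.
    return bool(diffs) and all(d in LANGUAGE_ID_PAIRS for d in diffs)

en = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html"
de = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_de.html"
matched = match_fixed_list(en, de)
```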
![Page 19: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/19.jpg)
(2) Matching URLs with Levenshtein Distance
• Levenshtein distance
– Also known as “edit distance”
– URLs from corresponding sampling and checking result sets
– Considered a match when URLs have a Levenshtein distance less than or equal to 4, but greater than 0
http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html
http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_de.html
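The matching rule can be sketched with the standard dynamic-programming edit distance. The threshold of 4 and the requirement that the distance be greater than 0 come from the slide; the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def match_levenshtein(url_l1, url_l2, max_distance=4):
    """True when the URLs are different but at most max_distance edits apart."""
    return 0 < levenshtein(url_l1, url_l2) <= max_distance

en = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html"
de = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_de.html"
distance = levenshtein(en, de)
```

For the two example URLs the distance is 2 (`en` becomes `de`), so they count as a match.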
![Page 20: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/20.jpg)
(2) URL part substitution
• Sampling L1 source URLs
• Replacing L1 names/ids in each source URL with L2 names/ids to form target URLs
• Checking whether the target URLs exist
• Does not require checking queries!
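A sketch of the substitution step under stated assumptions: the L1-to-L2 mapping reuses the language IDs from the fixed list, URLs are split on common delimiters, and the existence check is passed in as a function so the example stays offline. In practice `url_exists` would issue an HTTP request; here it is a stand-in.

```python
import re

# Assumed mapping, built from the fixed language ID list.
L1_TO_L2 = {
    "en": ["de", "ge"], "en-us": ["de-de"], "enu": ["deu", "ger"],
    "english": ["german"], "englisch": ["deutsch"],
}

def candidate_target_urls(source_url):
    """Generate target URLs by swapping one L1 language ID for an L2 ID."""
    parts = re.split(r"([/._?=&])", source_url)
    candidates = []
    for i, part in enumerate(parts):
        for replacement in L1_TO_L2.get(part.lower(), []):
            candidates.append("".join(parts[:i] + [replacement] + parts[i + 1:]))
    return candidates

def find_translations(source_url, url_exists):
    """Keep only the candidate target URLs that actually exist."""
    return [u for u in candidate_target_urls(source_url) if url_exists(u)]

source = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html"
# Stand-in for a real existence check (e.g. an HTTP request).
known = {"http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_de.html"}
found = find_translations(source, lambda u: u in known)
```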
![Page 21: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/21.jpg)
(3) Filtering page pairs
• Structural filtering (Resnik and Smith, 2003)
• Content translation metric (Ma and Liberman, 1999)
• Linear combination
![Page 22: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/22.jpg)
(3) Linearization
HTML file:
<HTML> <HEAD> <META> … </META> <TITLE> ….. </TITLE> </HTML>

Linearized file:
[START:HTML]
[START:HEAD]
[START:META]
[Chunk: 12]
[END:META]
[START:TITLE]
[Chunk:25]
[END:TITLE]
[END:HTML]
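A minimal sketch of linearization using Python's standard `html.parser`. It assumes the number in a `[Chunk:…]` token is the character length of the text between tags; the `Linearizer` class and `linearize` function are illustrative names, not the authors' implementation.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Turn an HTML document into a flat sequence of markup and chunk tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # whitespace between tags is not content
            # Assumption: a chunk is represented by its length in characters.
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

tokens = linearize("<html><head><title>Hello</title></head></html>")
```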
![Page 23: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/23.jpg)
(3) Alignment
Linearized Source File:
[START:HTML]
[START:HEAD]
[START:META]
[END:META]
[START:TITLE]
[Chunk:58]
[END:TITLE]
[START:META]
…

Linearized Target File:
[START:HTML]
[START:HEAD]
[START:META]
[END:META]
[START:LINK]
[END:LINK]
[START:TITLE]
[Chunk:68]
[END:TITLE]
[START:META]
…
![Page 24: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/24.jpg)
(3) Structural Metrics
• Difference percentage (dp)
– Measures how different markup in linearized files is
– Based on longest common subsequence algorithm (Hunt & McIlroy 1976)
– Implemented using diff tool
• Length correlation of aligned non-markup chunks (r)
– Pearson correlation coefficient over all aligned chunks in a file pair
– Length of content in characters
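Both structural metrics can be sketched on top of a longest-common-subsequence alignment of two linearized files. The slides say dp was computed with the diff tool; this sketch substitutes a small in-memory LCS, and the exact dp formula (share of tokens outside the common subsequence) is an assumption. All chunk tokens are treated as equal for alignment so that chunks of different lengths can be paired.

```python
import math
import re

CHUNK = re.compile(r"\[Chunk:\s*(\d+)\]")

def key(token):
    """Markup tokens compare by name; all text chunks compare as equal."""
    return "[Chunk]" if CHUNK.fullmatch(token) else token

def lcs_pairs(a, b):
    """Index pairs (i, j) of one longest common subsequence of a and b."""
    ka, kb = [key(t) for t in a], [key(t) for t in b]
    n, m = len(ka), len(kb)
    # table[i][j] = LCS length of ka[i:] and kb[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if ka[i] == kb[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if ka[i] == kb[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def difference_percentage(a, b):
    """Assumed definition: share of tokens not in the common subsequence."""
    total = len(a) + len(b)
    return 0.0 if total == 0 else 1 - 2 * len(lcs_pairs(a, b)) / total

def length_correlation(a, b):
    """Pearson r over the character lengths of aligned text chunks."""
    xs, ys = [], []
    for i, j in lcs_pairs(a, b):
        ma, mb = CHUNK.fullmatch(a[i]), CHUNK.fullmatch(b[j])
        if ma and mb:
            xs.append(int(ma.group(1)))
            ys.append(int(mb.group(1)))
    if len(xs) < 2:
        return None  # correlation is undefined for fewer than two points
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)

source = ["[START:HTML]", "[START:TITLE]", "[Chunk:10]", "[END:TITLE]",
          "[START:P]", "[Chunk:100]", "[END:P]",
          "[START:P]", "[Chunk:40]", "[END:P]", "[END:HTML]"]
target = ["[START:HTML]", "[START:LINK]", "[END:LINK]",
          "[START:TITLE]", "[Chunk:12]", "[END:TITLE]",
          "[START:P]", "[Chunk:120]", "[END:P]",
          "[START:P]", "[Chunk:48]", "[END:P]", "[END:HTML]"]
dp = difference_percentage(source, target)
r = length_correlation(source, target)
```

In this toy pair the target has one extra `LINK` element, so dp is small, and the chunk lengths are proportional, so r is 1.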
![Page 25: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/25.jpg)
(3) Content Translation Metric
• Calculated on first 500 content words on page
• Using the Giza++ translation dictionary

$$c(p_1, p_2) = \frac{\text{Num Of Translation Token Pairs}}{\text{Num Of Tokens in } p_1}$$
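A sketch of the content translation metric c(p1, p2): the number of translation token pairs divided by the number of tokens in p1, computed over the first 500 tokens. How token pairs are counted is not spelled out on the slide; this version assumes each target token can be paired at most once, and the dictionary layout is hypothetical.

```python
from collections import Counter

def content_translation_score(tokens_p1, tokens_p2, dictionary, limit=500):
    """Share of p1 tokens that have a dictionary translation present in p2."""
    source = tokens_p1[:limit]
    remaining = Counter(tokens_p2[:limit])  # target tokens still available for pairing
    pairs = 0
    for token in source:
        for translation in dictionary.get(token, ()):
            if remaining[translation] > 0:
                remaining[translation] -= 1  # assumption: use each target token once
                pairs += 1
                break
    return pairs / len(source) if source else 0.0

# Toy dictionary and pages, invented for illustration.
toy_dictionary = {"security": ["sicherheit"], "travelers": ["reisende", "reisenden"]}
p1 = ["security", "for", "travelers", "security"]
p2 = ["sicherheit", "für", "reisende"]
c = content_translation_score(p1, p2, toy_dictionary)
```

Two of the four source tokens find an unused translation in the target page, giving c = 0.5.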
![Page 26: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/26.jpg)
(3) Combining Two Kinds of Metrics
• Structural metrics: dp and r
• Content-based metric: c
• Linear combination:

$$t(p_1, p_2) = a_{dp} \cdot (1 - dp(p_1, p_2)) + a_r \cdot r(p_1, p_2) + a_c \cdot c(p_1, p_2)$$
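The linear combination t = a_dp·(1 − dp) + a_r·r + a_c·c can be sketched directly. The weight values and the decision threshold below are placeholders chosen for the example and should not be read as the values used in the experiments.

```python
def combined_score(dp, r, c, a_dp=1 / 3, a_r=1 / 3, a_c=1 / 3):
    """Linear combination of the structural metrics (dp, r) and content metric (c)."""
    # dp measures difference, so (1 - dp) turns it into a similarity.
    return a_dp * (1 - dp) + a_r * r + a_c * c

def is_translation_pair(dp, r, c, threshold=0.5):
    """Accept a candidate pair when its combined score reaches the threshold."""
    # The threshold is a hypothetical value for illustration.
    return combined_score(dp, r, c) >= threshold

score = combined_score(dp=0.1, r=0.9, c=0.6)
```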
![Page 27: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/27.jpg)
Different settings for experiments
(1a) Sampling the source language L1
– Source expansion
– The “inurl:” parameter
(1b) Checking pages in the target language L2
– Target expansion
– The “inurl:” and “site:” parameters
(2) Matching pages in L1 and L2
– Using a fixed list
– Edit distance
– URL part substitution
![Page 28: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/28.jpg)
Experiment Results – Matches
![Page 29: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/29.jpg)
Observations – Sampling and Checking
• Query expansion increases the number of page pairs
– Source and target query expansion lead to similar results
– Difference between n=2 and n=3 is not significant
  • Possible explanation: Larger semantic divergence of queries on the source and target language sides
• Using site: and inurl: search parameters increases the number of discovered page pairs
– But: structural parameters might miss candidate pairs that don’t follow pattern
![Page 30: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/30.jpg)
Observations – Matching
• Number of page pair candidates
– URL part substitution >> Levenshtein distance
– Levenshtein distance > Fixed language list
• Matching methods that use checking queries are heavily impacted by relevancy rankings
• Levenshtein distance matching method
– Allows learning of URL patterns used for parallel pages
![Page 31: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/31.jpg)
Experiment Results – Filtering
![Page 32: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/32.jpg)
Observations – Filtering
• Combined filter
– Evaluated in comparison to human judge
– Precision “How many did we get right?”
  • 88.9%
  • Encouraging on noisy test set
– Recall “How many did we miss?”
  • 36.4%
  • Low recall can be compensated for by submitting more queries

$$\text{precision} = \frac{\text{true\_positives}}{\text{true\_positives} + \text{false\_positives}}$$

$$\text{recall} = \frac{\text{true\_positives}}{\text{true\_positives} + \text{false\_negatives}}$$
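As a small worked example of the two evaluation measures, precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts used here are hypothetical and are not the counts behind the percentages reported on this slide.

```python
def precision(true_positives, false_positives):
    """Share of accepted page pairs that really are translations."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Share of real translation pairs that the filter accepted."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts: 30 correct pairs kept, 10 wrong pairs kept, 20 correct pairs rejected.
p = precision(true_positives=30, false_positives=10)
r = recall(true_positives=30, false_negatives=20)
```

With these counts precision is 0.75 and recall is 0.6: a filter can be right about most of what it keeps while still rejecting many true pairs.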
![Page 33: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/33.jpg)
Conclusions
• It is possible to reliably gather parallel pages using commercial search engines
– Even though there are no standard features identifying these pages
– Despite the relevance ranking of commercial search engine results
![Page 34: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/34.jpg)
Future work
• To improve the precision and recall of the filtering step.
• To address the relevancy ranking and the page limit problem
• To study whether some queries are more productive than others
• To test the usefulness of the collected page pairs on applications such as MT.
![Page 35: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/35.jpg)
Additional Slides
![Page 36: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/36.jpg)
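Experiments & Results

Number of page pairs before and after filtering, by matching method (List, Levenshtein, Substitution). The before-filtering List and Levenshtein counts are run together in the source and are reproduced here as a single value.

| Experiment ID | Expansion type | Query length (n) | inurl: param | site: param | Before filtering: List and Levenshtein | Before filtering: Substitution | After filtering: List | After filtering: Levenshtein | After filtering: Substitution |
|---|---|---|---|---|---|---|---|---|---|
| 1 | none | 1 | | | 513 | 1108 | 1 | 1 | 97 |
| 2 | Source | 2 | | | 828 | 1889 | 3 | 4 | 157 |
| 3 | Source | 3 | | | 1042 | 1975 | 1 | 10 | 124 |
| 4 | none | 1 | en/de | | 5884 | 5083 | 17 | 22 | 285 |
| 5 | Source | 2 | en/de | | 72132 | 9279 | 27 | 31 | 433 |
| 6 | Source | 3 | en/de | | 100160 | 9200 | 25 | 31 | 347 |
| 7 | none | 1 | | | 618 | 1099 | 1 | 3 | 92 |
| 8 | Target | 2 | | | 424 | 1771 | 2 | 3 | 143 |
| 9 | Target | 3 | | | 412 | 1761 | 0 | 0 | 149 |
| 10 | none | 1 | en/de | | 5693 | 5041 | 24 | 34 | 281 |
| 11 | Target | 2 | en/de | | 107161 | 9131 | 27 | 33 | 426 |
| 12 | Target | 3 | en/de | | 4572 | 8395 | 12 | 15 | 335 |
| 13 | none | 1 | | 30 | 10258 | n/a | 6 | 9 | n/a |
| 14 | Source | 2 | | 30 | 22743 | n/a | 9 | 32 | n/a |
| 15 | Source | 3 | | 30 | 461074 | n/a | 12 | 41 | n/a |
| 16 | none | 1 | en/de | 30 | 59164 | n/a | 13 | 15 | n/a |
| 17 | Source | 2 | en/de | 30 | 118442 | n/a | 28 | 50 | n/a |
| 18 | Source | 3 | en/de | 30 | 171693 | n/a | 46 | 49 | n/a |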
![Page 37: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/37.jpg)
Languages on the Web
Source: http://www.glreach.com/globstats/index.php3
![Page 38: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/38.jpg)
Questions Asked
1. How do we find parallel pages in the sea of mostly monolingual pages?
2. What is the share of parallel pages for a given language pair?
![Page 39: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/39.jpg)
Estimating the Percentage of Parallel Pages for a Language Pair
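[Venn diagram: sets E and F, overlapping in E ∩ F]

$$P(FE \mid E) = \frac{\mathrm{Size}_E(F)}{\mathrm{Size}(E)}$$

$$P(EF \mid F) = \frac{\mathrm{Size}_F(E)}{\mathrm{Size}(F)}$$

P(DE|E) = 0.03%
P(ED|D) = 0.27%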
![Page 40: Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.](https://reader038.fdocuments.in/reader038/viewer/2022110401/56649e235503460f94b119fa/html5/thumbnails/40.jpg)
References
• IJCNLP 2008 paper and presentation
– http://search.iiit.ac.in/CLIA2008/accepted_papers.php
• Email
– mailto:[email protected]
• MSR internship information
– http://research.microsoft.com/aboutmsr/jobs/internships/default.aspx