
Improving Cross-lingual Search Quality

E. Sizikova, I. Ibarra, C. Quaranta, E. Schwartz, R. Bandari

Research in Industrial Projects for Students (RIPS). University of California, Los Angeles (UCLA)

Abstract

Advancements in Information Retrieval (IR), a field that has become drastically more important in the Information Age, focus primarily on increasing the speed and accuracy of search over large collections of data. We present several methods for cross-lingual search in the Shoah Foundation Institute Visual History Archive. The nature of our project led us to value recall over precision when judging a given method. Therefore, the majority of the developed techniques focus on obtaining relevant terms from a variety of sources on the web and expanding our database of results.

Sponsor: The USC Shoah Foundation Institute

- Established after Schindler’s List.

- Steven Spielberg was getting lots of phone calls.

- Over 52,000 interviews of survivors and witnesses of the Holocaust.

- Indexed by approximately 62,000 English keywords.

- Worldwide access for education.

Baseline Method

Figure: Our initial idea was to translate the labels and the search keyword, then search with a language-specific implementation of Lucene.

Baseline Method Performance

Figure: The baseline method showed a clear division between languages that returned successful results the majority of the time and languages that did not.

Known problems:

1. Some languages had better stemming and analyzing support than others

2. "Double translation" through Google leads to more problem queries

3. Language-specific implementation problems

4. Difficult to set up for a website that requires quick results

SQTM Method

Figure: The next method was to directly translate the search query and use English Lucene to search in the database.

Benefits of using SQTM:

1. Only one machine translation is performed, which leaves less room for error.

2. SQTM lends itself more easily to various typical techniques that can be used to improve our results.

3. Every language is passed through the same Lucene version (only English), which makes the system easier to implement.

4. The number of languages is limited only by what Google Translate or similar software supports; we no longer need to worry about how many languages the Lucene stemmers and analyzers can handle.
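The SQTM flow described above (translate the query once, then search a single English index) can be sketched roughly as follows. This is only an illustration under stated assumptions: the translation table, the `normalize` helper, the toy inverted index, and the interview identifiers are all hypothetical stand-ins, not the project's actual translator or its Lucene configuration.

```python
from collections import defaultdict

# Hypothetical stand-in for a machine translation service. A real system
# would call an external translator here instead of a lookup table.
TOY_TRANSLATIONS = {
    ("es", "gemelos"): "cufflinks",
    ("de", "Manschettenknöpfe"): "cufflinks",
}


def translate_to_english(query: str, source_lang: str) -> str:
    """Return the English translation, or the query unchanged if unknown."""
    return TOY_TRANSLATIONS.get((source_lang, query), query)


def normalize(term: str) -> str:
    """Crude English normalization (assumption): lowercase, drop a final 's'."""
    term = term.lower()
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def build_index(keywords_by_interview: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized English keyword token to the interviews it labels."""
    index: dict[str, set[str]] = defaultdict(set)
    for interview_id, keywords in keywords_by_interview.items():
        for keyword in keywords:
            for token in keyword.split():
                index[normalize(token)].add(interview_id)
    return index


def sqtm_search(query: str, source_lang: str, index: dict[str, set[str]]) -> set[str]:
    """Translate the query once, then search the single English index."""
    english = translate_to_english(query, source_lang)
    hits = [index.get(normalize(token), set()) for token in english.split()]
    # Require every query token to match (an assumption about query semantics).
    return set.intersection(*hits) if hits else set()
```

Because only the query is translated, adding a language in this sketch means adding translator coverage; the index and its normalization stay English-only.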

Possible Improvements

Diagram: queries in six languages and the terms shown alongside them.

Spanish: gemelos

Arabic: أزرار أكمام

Tamil: மணிக்கட்டு உறைகள்

German: Manschettenknöpfe

Japanese: カフスボタン

Italian: Gemelli

Terms shown alongside: Cufflinks, Cuff Links, Gemini, gemelos

Figure: Improving SQTM results is difficult since neither precision nor recall is perfect. We need a method that does not increase recall at a large expense of precision.

Improving Precision: Filter

Diagram: the query "gemelos", the terms Gemini and Cufflinks, and the following term and numbered-label pairs, each routed through Lucene to the output.

Twin: 1. Twin

Gemini the Twins: 3. Gemini Twins

Twins: 2. Twins

Cufflinks: 1. Cufflinks

cufflink: 2. cufflinks

Cuff Links: 1. Cuff Links

Figure: Irrelevant words are filtered out before the search.
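The poster does not say how the filter decides which words are irrelevant. The sketch below assumes one plausible rule, namely keeping only candidate terms whose tokens all occur in the archive's keyword vocabulary; both that rule and the small vocabulary used in the usage note are hypothetical.

```python
def filter_candidates(candidates: list[str], vocabulary: set[str]) -> list[str]:
    """Keep candidates whose every token is in the vocabulary.

    Duplicates are dropped case-insensitively and input order is preserved.
    The membership rule is an assumption, not the poster's documented filter.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for candidate in candidates:
        key = candidate.lower()
        tokens = key.split()
        if not tokens or key in seen:
            continue
        if all(token in vocabulary for token in tokens):
            seen.add(key)
            kept.append(candidate)
    return kept
```

For instance, with a hypothetical vocabulary `{"cufflinks", "cuff", "links", "twins"}`, the candidates "Gemini" and "Gemini Twins" would be dropped before the search while "Twins", "Cufflinks", and "Cuff Links" would be passed on.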

Improving Recall: WordNet

Diagram: the query "twins" is expanded to twins, Gemini, match, mate, couple, ..., pair, and each term goes through Lucene to the output.

Figure: Each query is first sent to WordNet, and then each returned synonym is fed into Lucene to be searched.
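A minimal sketch of this expansion step follows, assuming a small in-memory synonym table in place of WordNet and a plain dictionary in place of the Lucene index. The table holds only the synonyms named in the figure, and the index contents in the usage note are hypothetical.

```python
# Toy stand-in for a WordNet lookup, limited to the synonyms named in the figure.
TOY_SYNONYMS = {"twins": ["Gemini", "match", "mate", "couple", "pair"]}


def expand_query(term: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the term followed by its synonyms, without duplicates."""
    expanded = [term]
    seen = {term.lower()}
    for synonym in synonyms.get(term.lower(), []):
        if synonym.lower() not in seen:
            seen.add(synonym.lower())
            expanded.append(synonym)
    return expanded


def search_expanded(term: str, synonyms: dict[str, list[str]],
                    index: dict[str, set[str]]) -> set[str]:
    """Search every variant and return the union of the hits."""
    results: set[str] = set()
    for variant in expand_query(term, synonyms):
        results |= index.get(variant.lower(), set())
    return results
```

Taking the union of hits is what raises recall here: with a hypothetical index `{"twins": {"a"}, "pair": {"b"}}`, a search for "twins" returns both `"a"` and `"b"`. The same union is also where precision can drop, since a synonym such as "match" may pull in unrelated records.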

Comparison of Methods: Improved Recall and Declining Precision

Figure: two bar charts, titled "Recall" and "Recall and F-measure" (legend: Recall Score, F2 Score). Y-axis in both: Percentage of Queries over .7 (0 to 100). X-axis in both: method (BL, SQTM, PD, W, WPD).

Method   Recall panel   Recall and F-measure panel
BL       20.22          24.71
SQTM     89.88          87.64
PD       91.16          82.32
W        91.71          75.69
WPD      92.26          68.50
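For readers unfamiliar with the measures on these charts, the sketch below shows the standard definitions of precision, recall, and the F-beta score (F2 when beta is 2, which weights recall more heavily than precision), plus a helper for the share of queries scoring over a threshold. The function names are hypothetical, and treating "over .7" as strictly greater than 0.7 is an assumption.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta=2 favors recall."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


def percent_over(scores: list[float], threshold: float = 0.7) -> float:
    """Percentage of per-query scores strictly above the threshold."""
    if not scores:
        return 0.0
    return 100.0 * sum(score > threshold for score in scores) / len(scores)
```

As a worked case, a query that retrieves four records of which two are the only relevant ones has precision 0.5 and recall 1.0, giving F2 = 5 * 0.5 * 1.0 / (4 * 0.5 + 1.0) = 5/6.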

SQTM: All Languages

94% 85% 80% 89% 86% 77% 86% 80% 80% 91% 87% 87% 87% 87%

Conclusions and Further Work

Our results clearly indicate that SQTM is the best method for solving the specific problem posed by the Shoah Foundation. To see the full strength of pivoting, filtering, and using other corpora, we need to obtain a more complete thesaurus of search terms. For example, one possible improvement to the thesaurus, and hence to the search quality, is to use a wider range of transliterations of words.

Acknowledgements

This project was jointly supported by the USC Shoah Foundation and NSF Grant #0931852. We would like to thank Prof. Mike Raugh and Prof. Russel Caflisch for their support at IPAM, as well as Sam Gustman, Leo Hsu, and other staff for their support at the Shoah Foundation. Finally, thanks to Women in Machine Learning (WiML) and their sponsors for sponsoring this poster.

Joint work with C. Quaranta, I. Arrieta Ibarra, E. Schwartz, E. Sizikova. Typeset using LaTeX and beamerposter.