Search in Transliterated Space
Shared Task Proposal, FIRE 2012
Monojit Choudhury, Microsoft Research Lab India
A Transliterated World Wide Web
Song Lyrics
Reviews and Forums
Facebook and Twitter
And a lot more
Beyond Indic languages
Many languages that use a non-Roman script:
Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
Persian
Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
Thai, Vietnamese
Cyrillic (Russian, Ukrainian)
Chinese, Japanese, Korean (rare)
Aspects of Transliterated Text
Code Mixing
Transliteration
Errors, Contraction
IR Scenario - I
Mono-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni suhanee
Results: Only Roman transliterated documents
Challenge: Spelling variations
tandee hawa ye chandny soohaany
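One way to picture the spelling-variation challenge is a normalization key that maps different Roman spellings of the same word to one string. The sketch below is only an illustration, not the method proposed in the talk: the function names `roman_key` and `same_words` and every rewrite rule are assumptions, hand-picked to cover the example on this slide. Real data would need far richer rules or a learned model.

```python
import re


def roman_key(token: str) -> str:
    """Map a Roman-transliterated token to a coarse, spelling-insensitive key.

    The rules are illustrative assumptions, not a complete scheme.
    """
    s = token.lower()
    s = s.replace("w", "v")                       # hawa / hava
    s = s.replace("ee", "i").replace("oo", "u")   # long vowels: suhanee, soohaany
    s = s.replace("y", "i")                       # chandny / chandni
    s = re.sub(r"(.)\1+", r"\1", s)               # collapse doubled letters: aa -> a
    s = re.sub(r"([bdgkpt])h", r"\1", s)          # drop aspiration marker: thandee / tandee
    s = re.sub(r"(?<=.)h$", "", s)                # drop a trailing h: yeh / ye
    return s


def same_words(query: str, text: str) -> bool:
    """True when both strings have the same sequence of normalized keys."""
    return [roman_key(t) for t in query.split()] == [roman_key(t) for t in text.split()]


query = "thandee hava yeh chandni suhanee"
variant = "tandee hawa ye chandny soohaany"
print(same_words(query, variant))
```

In a retrieval setting, such keys could be indexed alongside the original tokens, so that a query in one spelling still reaches documents written in another.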
IR Scenario - II
Cross-script and Multi-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
Results: Documents in Roman transliteration or in native script
Challenge: Transliteration
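A multi-script system first has to decide which script a query is written in before it can route or transliterate it. The following is a minimal sketch of that step, assuming only Devanagari and Latin need to be told apart. The function name `script_of` and the returned labels are hypothetical; the check relies on Unicode character names from Python's standard `unicodedata` module.

```python
import unicodedata


def script_of(text: str) -> str:
    """Return 'Devanagari', 'Latin', 'Mixed', or 'Unknown' for a query string."""
    scripts = set()
    for ch in text:
        # Count letters and combining marks (vowel signs, anusvara, ...);
        # skip spaces, digits, and punctuation.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name.startswith("DEVANAGARI"):
                scripts.add("Devanagari")
            elif name.startswith("LATIN"):
                scripts.add("Latin")
            else:
                scripts.add("Other")
    if not scripts:
        return "Unknown"
    if len(scripts) == 1 and "Other" not in scripts:
        return next(iter(scripts))
    return "Mixed"


print(script_of("thandee hava yeh chandni"))
print(script_of("ठंडी हवा ये चाँदनी"))
```

A query labeled as Latin could then be back-transliterated to reach native-script documents, and a Devanagari query forward-transliterated to reach Roman ones.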
IR Scenario - III
Cross-script and Cross-lingual IR
Query: death of mareech and subahoo
Documents: Hindi (Transliterated and Devanagari) and English documents
Shared Task on Retrieval
Mono-script Monolingual IR: Transliterated query in Roman; Transliterated documents in Roman
Cross-script Monolingual IR: Transliterated query in Roman; Transliterated documents in native script
Multi-script Monolingual IR: Query in Roman or native script; Documents in Roman and native scripts
Shared Sub-Tasks
Language identification of transliterated queries, documents, code-mixed text
kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML
Transliteration
Forward: കഴിക്കാന് → kazhikkan
Backward: kazhikkan → കഴിക്കാന്
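To make the language identification sub-task concrete, the sketch below tags each token of the code-mixed example as EN or ML and scores the result against the tags shown on the slide. It is a toy baseline under loud assumptions: `ENGLISH_WORDS` is a hypothetical hand-made lexicon, and treating every out-of-lexicon token as Malayalam (ML) is a simplification rather than anything the proposal specifies.

```python
# Hypothetical toy lexicon; a real system would use a large English word list
# or a trained classifier.
ENGLISH_WORDS = {"split", "pea", "soup", "the", "and", "of"}


def tag_tokens(sentence: str, default: str = "ML") -> list[tuple[str, str]]:
    """Tag each whitespace-separated token as 'EN' if it is in the lexicon, else `default`."""
    return [
        (tok, "EN" if tok.lower() in ENGLISH_WORDS else default)
        for tok in sentence.split()
    ]


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of token positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


sentence = "kooda kazhikkan oru urgan split pea soup undaki"
gold = "ML ML ML ML EN EN EN ML".split()

tagged = tag_tokens(sentence)
predicted = [tag for _, tag in tagged]
print(tagged)
print(accuracy(predicted, gold))
```

Token-level accuracy is only one possible measure for such a sub-task; the proposal itself does not fix an evaluation metric.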
Available Data
20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics
More data is under preparation from Facebook, covering a mixture of various languages.
Looking for partners to extend!
Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics
Looking for partners to extend it to other (Indian) languages
Other domains?
Thank you!
Other resources
Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers
Anything else?
Concluding Remarks
We have built a multi-script Bollywood song search and are working on transliteration and code-mixing
These are just some initial ideas that came up from our experiences
If you are interested, please let me know