Kalyani Lokhande,
Research Scholar,
Department of Computer Engineering,
SSBT’s COET, Jalgaon.
Dhanashree Tayade,
Asst. Professor,
Department of Computer Engineering,
SSBT’s COET, Jalgaon.
Abstract— Today, different types of contents in different
languages are available on the World Wide Web, and their usage
is increasing rapidly. Cross Language Information Retrieval
(CLIR) deals with the retrieval of documents written in a
language other than that of the query. Various researchers
have worked on CLIR systems for Indian languages. CLIR
allows users to write queries in their native language.
However, users sometimes find it difficult to write a request
in a language that they can easily read and understand.
In the proposed system, English to Marathi Cross Language
Information Retrieval is designed using the query translation
approach. Queries will be written in English, and documents
will be retrieved from a Marathi document collection. The
English query will be translated using a dictionary-based
approach. To improve the precision and recall of the proposed
system, query expansion using WordNet and pre-processing
techniques will be employed.
I. INTRODUCTION
Since their development, Information Retrieval (IR) systems
have opened doors of knowledge across the world. Initially, IR
systems were developed predominantly for very few languages,
often just one, so language was a barrier for users. The
introduction of Cross Language Information Retrieval systems
has opened a new paradigm for efficient and easy retrieval of
information in different languages.
The evaluation of Cross Language Information Retrieval
for Indian languages started recently. Following the highly
successful CLEF and NTCIR campaigns, the Forum for
Information Retrieval Evaluation (FIRE), modeled on them,
has focused specifically on Indian languages and English
since 2008. Document collections have been developed for
Hindi, Bangla, Marathi and English.
The number of Internet users who access all kinds of
information at any time is increasing day by day.
Information Retrieval (IR) mainly refers to the process of
finding required information. With 100 million Internet users,
India ranks third globally in Internet usage. Though the
Internet has shrunk geographical boundaries, language
diversity remains a big barrier to getting the full benefit of
the Internet. Hence there is a need for techniques such as
Cross Language Information Retrieval, which retrieves
documents in a language other than the one the user used to
specify the query. The Internet is no longer monolingual, and
non-English content is increasingly accessed. In this context,
the information retrieved is mainly considered in text
form [1].
CLIR is an area of IR in which much remains to be explored.
The goal of CLIR is to allow users to make queries in one
language and retrieve documents in one or more other
languages. The resulting documents can then be translated into
the language used for the query or any other language of the
user’s interest [2].
In India, people speak many different local languages. Only
a small percentage of the population knows English well
enough to express queries in it correctly. Even when a query
is expressed correctly, the results may be poorer than in the
monolingual case. For example, if a user wants information
about Marathi Abhangas and issues the query in English (such
as "Abhang"), the retrieved documents are poorer than when
the query is issued in Marathi itself. Cross Lingual
Information Retrieval addresses the language barrier by
allowing the user to ask the query in the local language and
get documents in another language (English), and vice
versa [3].
English-Marathi Cross Language Information Retrieval System
Based On Query Translation Approach
Kalyani Lokhande et al, International Journal of Computer Science & Communication Networks,Vol 6(6),250-254
IJCSCN | December-January 2016-2017 Available [email protected]
250
ISSN:2249-5789
An approach that seems promising for foreign languages
does not necessarily work for Indian languages. The most
commonly used vocabulary in Indian language documents
found on the web contains a number of words of Sanskrit,
Persian or English origin [4].
In the proposed English-Marathi Cross Language Information
Retrieval system, the query translation approach is used from
among the major translation approaches. The query will be
translated using a bilingual dictionary or corpus. Information
will be retrieved using an IR algorithm. Performance will be
improved by query expansion using WordNet. The experiment
will be performed on a standard Marathi dataset.
II. RELATED WORK
D. Thenmozhi and C. Aravindan [5] present a Tamil-English
Cross Lingual Information Retrieval System for Agriculture
Society. The CLIR system is being developed in the
agriculture domain for the farmers of Tamil Nadu, helping
them specify their queries in Tamil and retrieve documents in
English. This machine translation approach retrieves pages
with a Mean Average Precision of 95%. The recall value is
also considerably improved.
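Mean Average Precision (MAP), the measure reported above, averages the per-query average precision over a set of queries. The following is a minimal sketch of that computation; the function names and document identifiers are hypothetical, and it does not reproduce the evaluation setup of [5].

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranked results and relevance judgements for two queries.
runs = [
    (["d1", "d2", "d3"], ["d1", "d3"]),
    (["d5", "d4"], ["d4"]),
]
print(mean_average_precision(runs))
```

A relevant document that is never retrieved contributes zero to the sum, so MAP rewards systems that both find relevant documents and rank them early.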
Debasis Mandal et al. [6] proposed a Bengali and Hindi
to English CLIR system. Given a query in Hindi or Bengali,
two of the most widely spoken Indian languages, the system
retrieves English documents. An automatic query generation
and machine translation approach is used. The results show
the need for good language-specific resources with a rich
bilingual lexicon.
Saurabh Varshney and Jyoti Bajpai [7] proposed an
algorithm for improving the performance of an English-Hindi
CLIR system. All possible combinations of the Hindi
translations and the transliterations of the English query
terms were generated, and the best query among them was
chosen for document retrieval. The experimental results show
that the proposed approach helps resolve ambiguity in
English-Hindi CLIR and gives more relevant information
compared with English monolingual retrieval.
Manoj Kumar et al. [8] present a Hindi and Marathi to
English CLIR system. They used a query translation
approach based on dictionaries. Query words not found in
the dictionary are translated using a simple rule-based
translation technique. The resulting translation is then
compared with the unique words of the corpus, and the
"k" words most similar to it are returned.
Mallamma V Reddy et al. [9] proposed Kannada and
Telugu to English cross language information retrieval as
part of the Ad-Hoc Bilingual task. Queries were translated
using bilingual dictionaries. The resulting multiple
translation choices for each query word were disambiguated
using an iterative page-rank style algorithm to produce the
final translated query.
Pattabhi R. et al. [10] proposed Cross Lingual
Information Retrieval between Tamil and English.
A Tamil-English bilingual dictionary was used for query
translation, and a statistical n-gram based method was used
for transliteration. WordNet was used for query expansion.
Results were improved by using Okapi BM25 ranking.
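To make the ranking step concrete, here is a minimal sketch of Okapi BM25 scoring over tokenized documents. The parameter values k1 = 1.2 and b = 0.75 are conventional defaults rather than values taken from [10], the IDF form is one common smoothed variant, and the romanized toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization: longer documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical romanized toy collection.
docs = [
    ["shetkari", "pik", "paus"],
    ["paus", "paus", "havaman"],
    ["krida", "samna"],
]
print(bm25_scores(["paus"], docs))
```

Term frequency saturates under BM25, so the second document scores higher than the first for the repeated term but by less than a factor of two.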
Jagadeesh Jagarlamudi and A. Kumaran [11] proposed
a Cross-Lingual Information Retrieval System for Indian
Languages at CLEF 2007. The queries were given in
non-English languages, including Hindi, Telugu, Bengali,
Marathi and Tamil, and the documents were in English. The
system had 1000 relevant documents. The document collection
was drawn from the “Los Angeles Times” and includes
1,35,153 English news articles from 2002. The results show
that performance was about 73% of that of the monolingual
system.
Karunesh Arora et al. [12] proposed a Cross Lingual
Information Retrieval system that showed efficiency
improvement through rule-based transliteration. They took
tourism as the domain and used a bilingual dictionary with
Punjabi and Hindi documents.
Saraswathi et al. [13] proposed a Bilingual
Information Retrieval System for English and Tamil for the
festival domain. An ontological tree is used for analysis
and keyword retrieval. It requires only a single mapping from
any language to any other language. The proposed system can
also be used for other tasks such as keyword language
identification and sub-keyword extraction. A total of 200
documents were collected for both languages. A generic
platform is built for bilingual IR which can be extended to
any foreign or Indian language while working with the same
efficiency.
Anurag et al. [14] proposed an English-Hindi Cross
Language Information Retrieval (CLIR) system using the
Managing Gigabytes (MG) retrieval system as the base IR
engine. A Hindi test collection was created for this research,
along with relevance judgements. The queries were translated
using different strategies. The use of NLP techniques can
improve the performance.
A study of previous work shows that CLIR systems have been
developed for many languages. English-Marathi CLIR will be
a new addition to the CLIR field for Indian languages. Among
translation approaches, query translation has been adopted
by most authors. For query translation, bilingual dictionaries
and machine translation systems are widely used, being the
easier approaches. However, newer approaches such as
corpus-based and ontology-based translation prove promising
when used for a specific domain. Experiments are mostly set
up on standard datasets. Performance should also be measured
on a self-created dataset for a particular domain. Techniques
such as query pre-processing and query expansion help
improve the performance of the overall system.
III. PROBLEM DEFINITION
Since their development, Information Retrieval systems have
opened doors of knowledge across the world. Initially, IR
systems were developed predominantly for very few languages,
often just one. Language was a barrier for users because
content was restricted to a few languages. Over the last
decade, content on the web has come to be written in the
different native languages of users. This led to the
introduction of Cross Language Information Retrieval (CLIR)
systems. Users are often unable to write a request in a native
language that they can easily read and understand. CLIR
permits the user to retrieve documents in a language other
than the query language.
India being a multilingual country, there is wide scope for
CLIR for Indian languages. Some research work has been
done in CLIR for Indian languages. However, there is scope
for improving existing systems and building new
systems for the remaining languages. Marathi is the language
spoken primarily by the native people of Maharashtra, a state
of India. The proposed English-Marathi Cross Language
Information Retrieval system will allow users to write a query
in English and retrieve documents in Marathi.
A. Objectives of Proposed System
1. To develop an English-Marathi Cross Language Information
Retrieval system using the query translation approach.
2. To improve the precision and recall of the system.
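Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch of both measures for a single query follows; the function name and document identifiers are hypothetical and only illustrate the definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document-id lists."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical result list and relevance judgements for one query.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(precision, recall)
```

Query expansion typically raises recall by matching more documents, which is why precision has to be tracked alongside it.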
IV. PROPOSED WORK
The framework of the proposed approach is shown in
Figure 1. It describes the working of the English-Marathi
CLIR system, in which the user gives a query in English and
the relevant documents are retrieved in Marathi. The
documents will be taken from the FIRE 2010 dataset, a
Marathi news corpus. The query will be translated using
Sata-Anuvadak resources [15].
The steps involved in the flow of the proposed system are
given below:
1. The user enters the query in English.
2. Pre-query expansion expands the English query using tools
such as English WordNet.
3. Query translation converts the refined English query into a
Marathi query using Sata-Anuvadak resources.
4. Post-query expansion expands the Marathi query using
Marathi WordNet.
5. The expanded Marathi query is used to retrieve relevant
Marathi documents based on the similarity between the query
and the documents.
Figure 1. Proposed System Architecture
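The five steps of the proposed flow can be sketched as follows. This is only an illustration under stated assumptions: the small dictionaries are toy stand-ins for English WordNet, the Sata-Anuvadak resources and Marathi WordNet, cosine similarity over raw term counts is one possible similarity measure (the paper does not name one), and all function names are hypothetical.

```python
import math
from collections import Counter

# Toy stand-ins (assumptions): a real system would query English WordNet,
# Sata-Anuvadak resources and Marathi WordNet instead of these dicts.
EN_SYNONYMS = {"rain": ["rainfall"], "farmer": ["cultivator"]}
EN_MR_DICT = {
    "rain": ["पाऊस"],
    "rainfall": ["पाऊस"],
    "farmer": ["शेतकरी"],
    "crop": ["पीक"],
}
MR_SYNONYMS = {"पाऊस": ["पर्जन्य"]}


def expand(terms, synonyms):
    """Append known synonyms to the term list, keeping order and uniqueness."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def translate(terms, dictionary):
    """Dictionary lookup; terms without an entry are dropped in this sketch."""
    translated = []
    for term in terms:
        for target in dictionary.get(term, []):
            if target not in translated:
                translated.append(target)
    return translated


def cosine(query_terms, doc_terms):
    """Cosine similarity between raw term-frequency vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def retrieve(english_query, documents):
    """Run steps 1-5 and return the Marathi query and the ranked doc ids."""
    terms = english_query.lower().split()       # step 1: English query
    terms = expand(terms, EN_SYNONYMS)           # step 2: pre-query expansion
    marathi = translate(terms, EN_MR_DICT)       # step 3: query translation
    marathi = expand(marathi, MR_SYNONYMS)       # step 4: post-query expansion
    ranked = sorted(                             # step 5: similarity ranking
        documents,
        key=lambda doc_id: cosine(marathi, documents[doc_id]),
        reverse=True,
    )
    return marathi, ranked


# Hypothetical tokenized Marathi documents.
docs = {
    "d1": ["शेतकरी", "पीक", "पर्जन्य"],
    "d2": ["क्रिकेट", "सामना"],
}
print(retrieve("farmer rain", docs))
```

In this toy run, document d1 matches only because post-query expansion adds a Marathi synonym for the translated term, which is the effect the two expansion steps are meant to achieve.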
V. CONCLUSION
Cross-lingual IR provides new paradigms for searching
documents across the variety of languages of the world.
CLIR for Indian languages has gained importance in the last
decade, and there is much scope for further exploration in
this field. Observation shows that there is scope for
improvement in the performance of CLIR. In this work, an
improved English-Marathi CLIR system is proposed. It will
allow users to write a query in English and retrieve Marathi
documents. Even non-Marathi users can then translate the
documents into their own native language.
Table 1. Comparison of Different CLIR Systems and Their Approaches
VI. REFERENCES
[1] P. Iswarya and V. Radha, “Cross Language Text Retrieval: A
Review”, International Journal of Engineering Research and
Applications (IJERA), ISSN: 2248-9622, Vol. 2, Issue 5,
September-October 2012, pp. 1036-1043.
[2] Pothula Sujatha and P. Dhavachelvan, “A Review on the Cross and
Multilingual Information Retrieval”.
[3] International Journal of Web & Semantic Technology (IJWesT), Vol. 2,
No. 4, October 2011, DOI: 10.5121/ijwest.2011.2409 115
[4] A. Nagarathinam and S. Saraswathi, “State of Art: Cross Lingual
Information Retrieval System for Indian Languages”, International
Journal of Computer Applications (0975 – 8887), Volume 35, No. 13,
December 2011.
[5] D. Thenmozhi and C. Aravindan, “Tamil-English Cross Lingual
Information Retrieval System for Agriculture Society”.
[6] D. Mandal, S. Dandapat, M. Gupta, P. Banerjee and S. Sarkar, “Bengali
and Hindi to English Cross-language Text Retrieval under Limited
Resources”, 8th Workshop of the Cross-Language Evaluation
Forum, Budapest, Hungary, 19-21 September 2007.
[7] S. Varshney and J. Bajpai, “Improving performance of English-Hindi
Cross Language Information Retrieval using Transliteration of query
terms”, 2013 IEEE International Conference in MOOC, Innovation and
Technology in Education (MITE), 978-1-4799-1626-9/13/2013 IEEE.
[8] Manoj Kumar Chinnakotla, Sagar Ranadive, Om P. Damani and
Pushpak Bhattacharyya, “Hindi to English and Marathi to English
Cross Language Information Retrieval Evaluation”.
[9] Mallamma V Reddy and M. Hanumanthappa, “Kannada and Telugu
Native Languages to English Cross Language Information Retrieval”,
International Journal of Computer Science and Information
Technologies, Vol. 2.
[10] Pattabhi R. K. Rao and L. Sobha, “Cross Lingual Information Retrieval
Track: Tamil – English”, Working notes from FIRE 2010, Feb 2010.
[11] Jagadeesh Jagarlamudi and A. Kumaran, “Cross-Lingual Information
Retrieval System for Indian Languages”, Proceedings of CLEF 2007,
2007.
[12] Karunesh Arora, Ankur Garg, Gour Mohan, Somiram Singla and Chander
Mohan, “Cross Lingual Information Retrieval Efficiency Improvement
through Transliteration”, Proceedings of ASCNT 2009, pp. 65-71, 2009.
[13] S. Saraswathi, M. Asma Siddhiqaa, K. Kalaimagal and
M. Kalaiyarasi, “BiLingual Information Retrieval System for English and
Tamil”, Journal of Computing, 2, 4, pp. 85-89, April 2010.
[14] Anurag Seetha, Sujoy Das and M. Kumar, “Evaluation of the English-Hindi
Cross Language Information Retrieval System Based on Dictionary
Based Query Translation Method”, 10th International Conference on
Information Technology, 0-7695-3068-0/07, 2007, IEEE, DOI:
10.1109/ICIT.2007.53
[15] Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah and
Pushpak Bhattacharyya, “Sata-Anuvadak: Tackling Multiway
Translation of Indian Languages”, LREC 2014.