English-Marathi Cross Language Information Retrieval System Based On Query Translation Approach

Kalyani Lokhande et al, International Journal of Computer Science & Communication Networks, Vol 6(6), 250-254

IJCSCN | December-January 2016-2017 Available [email protected]

ISSN: 2249-5789


Kalyani Lokhande, Research Scholar, Department of Computer Engineering, SSBT’s COET, Jalgaon.

Dhanashree Tayade, Asst. Professor, Department of Computer Engineering, SSBT’s COET, Jalgaon.

Abstract: Today, content of many types and in many languages is available on the World Wide Web, and its usage is increasing rapidly. Cross Language Information Retrieval (CLIR) deals with retrieving documents written in a language other than the language of the query. Various researchers have worked on CLIR systems for Indian languages. CLIR allows users to write queries in their native language. However, a user sometimes finds it difficult to write a request in a language that he or she can easily read and understand. The proposed system is an English to Marathi CLIR system designed using the query translation approach. Queries will be written in English, and documents will be retrieved from a Marathi document collection. The English query will be translated using a dictionary-based approach. To achieve good precision and recall, the proposed system will also employ query expansion using WordNet and pre-processing techniques.

I. INTRODUCTION

Since their development, Information Retrieval (IR) systems have opened doors of knowledge across the world. Initially, IR systems were developed predominantly for very few languages, often just one, so language was a barrier for users. The introduction of Cross Language Information Retrieval (CLIR) systems has opened a new paradigm for efficient and easy retrieval of information in different languages.

The evaluation of Cross Language Information Retrieval for Indian languages started recently. Following the highly successful CLEF and NTCIR campaigns, the Forum for Information Retrieval Evaluation (FIRE), modeled on those campaigns, has since 2008 focused specifically on Indian languages and English. Document collections have been developed for some Indian languages, namely Hindi, Bangla and Marathi, and for English.

The number of Internet users is increasing day by day, and they access any kind of required information at any time. Information Retrieval (IR) refers to the process of finding required information. With 100 million Internet users, India ranks third globally in Internet usage. Although the Internet has shrunk geographical boundaries, language diversity remains a big barrier to getting its full benefit. Hence there is a need for a technique like Cross Language Information Retrieval, which retrieves documents in a language other than the one the user used to specify the query. The Internet is no longer monolingual, and non-English content is being accessed at a rapidly growing rate. In this context, the information retrieved is mainly considered to be in text form [1].

CLIR is an area of IR in which much remains to be explored. The goal of CLIR is to allow users to make queries in one language and retrieve documents in one or more other languages. The resulting documents can then be translated into the language used for the query or any other language of the user’s interest [2].

People in India speak many different local languages. Only a small percentage of the population knows English and can express queries in English correctly. Even when a query is expressed correctly, the results may be poorer than in the monolingual case. For example, if a user who wants information about Marathi Abhangas issues the query in English (such as Abhang), the retrieved documents are poorer than when the query is issued in Marathi itself. Cross Lingual Information Retrieval provides a solution to the language barrier by allowing the user to pose the query in the local language and get documents in another language (English), and vice versa [3].


An approach that seems promising for foreign languages does not necessarily work for Indian languages. The most commonly used vocabulary in Indian language documents found on the web contains a number of words of Sanskrit, Persian or English origin [4].

The proposed English-Marathi Cross Language Information Retrieval system uses the query translation approach, one of the major translation approaches in CLIR. The query will be translated using a bilingual dictionary or corpus. Information will be retrieved using an IR algorithm. Performance will be improved by query expansion using WordNet. The experiment will be performed on a standard Marathi dataset.

II. RELATED WORK

D. Thenmozhi and C. Aravindan [5] present a Tamil-English Cross Lingual Information Retrieval system for the agriculture society. The CLIR system is being developed in the agriculture domain for the farmers of Tamil Nadu; it lets them specify queries in Tamil and retrieve documents in English. This Machine Translation approach retrieves pages with a Mean Average Precision of 95%. The recall value is also considerably improved.

Debasis Mandal et al. [6] proposed a Bengali and Hindi to English CLIR system. Given a query in Hindi or Bengali, two of the most widely spoken Indian languages, the system retrieves English documents. An automatic query generation and Machine Translation approach is used. The results show the need for good language-specific resources with a rich bilingual lexicon.

Saurabh Varshney and Jyoti Bajpai [7] proposed an algorithm for improving the performance of an English-Hindi CLIR system. All possible combinations of the Hindi-translated query terms and the transliterations of the English query terms were generated, and the best query among them was chosen for document retrieval. The experimental results show that the proposed approach helps to resolve ambiguity in the English-Hindi CLIR system and gives more relevant information compared with English monolingual retrieval.
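The candidate-generation idea described for [7] can be illustrated with a short sketch. This is only an assumption about how the combinations might be enumerated: the function name `candidate_queries` and the sample Hindi options are hypothetical, and the criterion the authors used to pick the best query is not reproduced here.

```python
from itertools import product


def candidate_queries(options_per_term):
    """Return every query formed by picking one option for each query term."""
    return [" ".join(choice) for choice in product(*options_per_term)]


# Illustrative options for the English query "bank loan" (assumed data):
# each inner list mixes dictionary translations and a transliteration.
options = [
    ["बैंक", "किनारा"],      # "bank": transliteration, river-bank sense
    ["ऋण", "कर्ज", "लोन"],   # "loan": two translations, one transliteration
]

for query in candidate_queries(options):
    print(query)
```

With two options for the first term and three for the second, six candidate queries are produced; a real system would then score them and keep the best one.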

Manoj Kumar et al. [8] present a Hindi and Marathi to English CLIR system. They used a query translation approach based on dictionaries. Query words not found in the dictionary are translated using a simple rule-based translation technique. The resulting translation is then compared with the unique words of the corpus, and the "k" words most similar to it are returned.

Mallamma V Reddy et al. [9] proposed Kannada and Telugu to English cross language information retrieval as part of the Ad-Hoc Bilingual task. Queries were translated using bilingual dictionaries. The resulting multiple translation choices for each query word are disambiguated using an iterative PageRank-style algorithm, which produces the final translated query.

Pattabhi R. et al. [10] proposed Cross Lingual Information Retrieval between Tamil and English. A Tamil-English bilingual dictionary was used to translate the query, and a statistical n-gram based method was used for transliteration. WordNet was used for query expansion. Results were improved by using Okapi BM25 ranking.
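For readers unfamiliar with Okapi BM25, the ranking function mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in [10]: the parameter values k1 = 1.5 and b = 0.75 are common defaults chosen here as an assumption, the IDF uses the non-negative "+1" variant, tokenization is plain whitespace splitting, and the sample documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Assumes a non-empty list of whitespace-tokenizable documents.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["rice farming in tamil nadu", "cricket match report", "rice prices rise"]
print(bm25_scores(["rice", "farming"], docs))
```

The document matching both query terms receives the highest score, the document matching only one term a lower score, and the unrelated document a score of zero.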

Jagadeesh Jagarlamudi and A. Kumaran [11] proposed a Cross-Lingual Information Retrieval system for Indian languages at CLEF 2007. The queries were given in non-English languages, including Hindi, Telugu, Bengali, Marathi and Tamil, and the documents were in English. The system had 1000 relevant documents. The document collection was drawn from the “Los Angeles Times” and includes 135,153 English news articles from 2002. The results show that performance was about 73% of that of the monolingual system.

Karunesh Arora et al. [12] proposed a Cross Lingual Information Retrieval system that showed efficiency improvement through rule-based transliteration. They took tourism as the domain and used a bilingual dictionary with Punjabi and Hindi documents.

Saraswathi et al. [13] proposed a Bilingual Information Retrieval System for English and Tamil for the festival domain. An ontological tree is used for analysis and keyword retrieval. It requires only a single mapping from any language to any other language. The proposed system can also be used for other tasks such as keyword language identification and sub-keyword extraction. A total of 200 documents were collected for both languages. A generic platform is built for bilingual IR that can be extended to any foreign or Indian language with the same efficiency.

Anurag et al. [14] proposed an English-Hindi Cross Language Information Retrieval (CLIR) system using the Managing Gigabytes (MG) retrieval system as the base IR engine. A Hindi test collection was created for this research, along with relevance judgements. The queries were translated using different strategies. Use of NLP techniques can improve performance.

The study of previous work shows that CLIR systems have been developed for many languages. English-Marathi CLIR will be a new contribution to the CLIR field for Indian languages. Among translation approaches, query translation has been adopted by most authors. For query translation, bilingual dictionaries and machine translation systems are widely used because they are the easier approach. However, newer approaches such as corpus-based and ontology-based translation prove promising when used for a specific domain. Experiments are mostly set up on standard datasets. Performance should be measured on a self-created dataset for a particular domain. Techniques such as query pre-processing and query expansion help to improve the performance of the overall system.

III. PROBLEM DEFINITION

Since their development, Information Retrieval systems have opened doors of knowledge across the world. Initially, IR systems were developed predominantly for very few languages, often just one. Language was a barrier for users because content was restricted to a few languages. Over the last decade, web content has increasingly come from the different native languages of users. This led to the introduction of Cross Language Information Retrieval (CLIR) systems. Users are sometimes unable to write a request in a native language even though they can easily read and understand it. CLIR permits the user to retrieve documents in a language other than the query language.

India being a multilingual country, there is wide scope for CLIR for Indian languages. Some research work has been done in this area. However, there is scope for improving existing systems and for building new systems for the remaining languages. Marathi is spoken primarily by the native people of Maharashtra, a state of India. The proposed English-Marathi Cross Language Information Retrieval system will allow users to write a query in English and retrieve documents in Marathi.

A. Objectives of Proposed System

1. To develop an English-Marathi Cross Language Information Retrieval system using the query translation approach.

2. To improve the Precision and Recall of the system.
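Since the second objective is stated in terms of precision and recall, a small sketch of how the two measures are computed for a single query may help. The function name `precision_recall` and the document identifiers are hypothetical; precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 3 judged relevant, 2 in common.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"]))
```

In this run two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).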

IV. PROPOSED WORK

The framework of the proposed approach is shown in Figure 1. It shows the working of the English-Marathi CLIR system, in which the user gives a query in English and relevant documents are retrieved in Marathi. The documents will be taken from the Marathi news corpus of the FIRE 2010 dataset. The query will be translated using Sata-Anuvadak resources [15].

The steps involved in the flow of the proposed system are given below:

1. The user enters the query in English.

2. Pre-query expansion expands the English query using tools such as English WordNet.

3. Query translation translates the refined English query into a Marathi query using Sata-Anuvadak resources.

4. Post-query expansion expands the Marathi query using Marathi WordNet.

5. The expanded Marathi query is fired to retrieve relevant Marathi documents based on the similarity between the query and the documents.

Figure 1. Proposed System Architecture
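The five steps of the proposed flow can be sketched end to end as below. This is a toy illustration under several assumptions, not the proposed implementation: the small dictionaries stand in for English WordNet, the Sata-Anuvadak resources and Marathi WordNet, the three documents stand in for the FIRE 2010 Marathi news corpus, and term overlap is used as a placeholder for the unspecified similarity measure. All names and data are hypothetical.

```python
# Toy stand-ins (assumptions) for English WordNet, the Sata-Anuvadak
# translation resources, and Marathi WordNet.
ENGLISH_SYNONYMS = {"farmer": ["cultivator"], "rain": ["rainfall"]}
EN_TO_MR = {"farmer": "शेतकरी", "cultivator": "शेतकरी",
            "rain": "पाऊस", "rainfall": "पाऊस"}
MARATHI_SYNONYMS = {"शेतकरी": ["कृषक"], "पाऊस": ["पर्जन्य"]}

# Toy stand-in for the Marathi document collection.
DOCUMENTS = {
    "d1": "पाऊस नसल्याने शेतकरी चिंतेत",
    "d2": "पर्जन्य मापन केंद्र सुरू",
    "d3": "क्रिकेट सामना रद्द",
}


def expand(terms, synonyms):
    """Add known synonyms after each term, keeping order and removing duplicates."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return list(dict.fromkeys(expanded))


def translate(terms, dictionary):
    """Translate terms found in the dictionary; terms without an entry are dropped."""
    return list(dict.fromkeys(dictionary[t] for t in terms if t in dictionary))


def retrieve(query_terms, documents):
    """Rank documents by the number of query terms they contain."""
    scores = {}
    for doc_id, text in documents.items():
        overlap = len(set(query_terms) & set(text.split()))
        if overlap:
            scores[doc_id] = overlap
    return sorted(scores, key=scores.get, reverse=True)


def search(english_query):
    terms = english_query.lower().split()      # step 1: English query
    terms = expand(terms, ENGLISH_SYNONYMS)    # step 2: pre-query expansion
    terms = translate(terms, EN_TO_MR)         # step 3: query translation
    terms = expand(terms, MARATHI_SYNONYMS)    # step 4: post-query expansion
    return retrieve(terms, DOCUMENTS)          # step 5: retrieval by similarity


print(search("farmer rain"))
```

For the query "farmer rain", post-query expansion adds the synonym पर्जन्य, which is what allows the second document to be retrieved even though it does not contain the directly translated term पाऊस.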

V. CONCLUSION

Cross-lingual IR provides new paradigms for searching documents across the variety of languages used around the world. CLIR for Indian languages has gained importance in the last decade, and there is much left to explore in this field. Observation shows that there is scope for improvement in the performance of CLIR. In this work, an improved English-Marathi CLIR system is proposed. It will allow users to write a query in English and retrieve Marathi documents. Even non-Marathi users can then translate the retrieved documents into their own native language.

Table 1. Comparison of Different CLIR Systems and Their Approaches


VI. REFERENCES

[1] P. Iswarya, Dr. V. Radha, “Cross Language Text Retrieval: A Review”, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 2, Issue 5, September-October 2012, pp. 1036-1043.

[2] Pothula Sujatha and P. Dhavachelvan, “A Review on the Cross and Multilingual Information Retrieval”.

[3] International Journal of Web & Semantic Technology (IJWesT), Vol. 2, No. 4, October 2011, DOI: 10.5121/ijwest.2011.2409 115.

[4] A. Nagarathinam, Dr. S. Saraswathi, “State of Art: Cross Lingual Information Retrieval System for Indian Languages”, International Journal of Computer Applications (0975 – 8887), Volume 35, No. 13, December 2011.

[5] D. Thenmozhi, C. Aravindan, “Tamil-English Cross Lingual Information Retrieval System for Agriculture Society”.

[6] D. Mandal, S. Dandapat, M. Gupta, P. Banerjee, S. Sarkar, “Bengali and Hindi to English Cross-language Text Retrieval under Limited Resources”, 8th Workshop of the Cross-Language Evaluation Forum, Budapest, Hungary, 19-21 September 2007.

[7] S. Varshney, J. Bajpai, “Improving performance of English-Hindi Cross Language Information Retrieval using Transliteration of query terms”, 2013 IEEE International Conference in MOOC, Innovation and Technology in Education (MITE), 978-1-4799-1626-9/13/2013 IEEE.

[8] Manoj Kumar Chinnakotla, Sagar Ranadive, Om P. Damani and Pushpak Bhattacharyya, “Hindi to English and Marathi to English Cross Language Information Retrieval Evaluation”.

[9] Mallamma V Reddy, Dr. M. Hanumanthappa, “Kannada and Telugu Native Languages to English Cross Language Information Retrieval”, International Journal of Computer Science and Information Technologies, Vol. 2.

[10] Pattabhi R. K. Rao and Sobha, L., “Cross Lingual Information Retrieval Track: Tamil – English”, Working notes from FIRE 2010, Feb 2010.

[11] Jagadeesh Jagarlamudi and Kumaran, A., “Cross-Lingual Information Retrieval System for Indian Languages”, Proceedings of CLEF 2007, 2007.

[12] Karunesh Arora, Ankur Garg, Gour Mohan, Somiram Singla, Chander Mohan, “Cross Lingual Information Retrieval Efficiency Improvement through Transliteration”, Proceedings of ASCNT 2009, 65-71, 2009.

[13] Dr. Saraswathi, S., Asma Siddhiqaa, M., Kalaimagal, K., and Kalaiyarasi M., “BiLingual Information Retrieval System for English and Tamil”, Journal of Computing, 2, 4, 85-89, April 2010.

[14] Anurag Seetha, Sujoy Das, M. Kumar, “Evaluation of the English-Hindi Cross Language Information Retrieval System Based on Dictionary Based Query Translation Method”, 10th International Conference on Information Technology, 0-7695-3068-0/07, 2007, IEEE, DOI: 10.1109/ICIT.2007.53.

[15] Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya, “Sata-Anuvadak: Tackling Multiway Translation of Indian Languages”, LREC 2014.
