Chapter 4 Issues in Information Retrieval for Hindi Language



A Study of Web Mining Tools for Query Optimization Page 86

Chapter 4

Issues in Information Retrieval for Hindi Language

4.1 Background

Hindi is the third most widely spoken language in the world (after English and Mandarin): an estimated 500-600 million people speak the language. A direct descendant of Sanskrit through Prakrit and Apabhramsha, Hindi belongs to the Indo-Aryan group of languages, a subset of the Indo-European family. It has been influenced and enriched by Persian (Farsi), Turkish, Arabic, Portuguese, and English. Hindi is broadly identical with Urdu, the official language of Pakistan, and is closely related to Bengali, Punjabi, and Gujarati. A good knowledge of Hindi is therefore likely to be useful to anyone with an interest in the countries of South Asia or in the numerous South Asian communities of the world.

There are no particular difficulties in the study of the language. Hindi inherited its writing system from Sanskrit. The script, Devanagari, is extremely logical and therefore straightforward and easy to learn. Pronunciation is easy because, unlike English, letters are almost always pronounced the same way. The language serves both exact, rational reasoning and the expressive forms suited to poetry and songs.

The general appearance of the Devanagari script is that of letters "hanging from a line". This "line", also found in many other South Asian scripts, is actually a part of most of the letters and is drawn as the writing proceeds. The script has no capital letters.

Hindi is the official language of the Republic of India, and the common second language of Mauritius, Fiji, Trinidad, Guyana, and Surinam.

The Hindi alphabet consists of 11 vowels and 33 consonants.

The Devanagari script used for Hindi is derived from the ancient Brahmi and is closely related to other Indian scripts such as Gujarati and Bengali.

Hindi was originally a variety of Hindustani spoken in the area of New Delhi.

There are hundreds of Hindi dialects.


The Hindi language has been enriched by Persian (Farsi), Turkish, Arabic, Portuguese, and English.

Today Hindi is widely spoken in South Asia (India, Pakistan, Nepal, and Bhutan), South Africa, Mauritius, the USA, Trinidad, Fiji, Surinam, Guyana, Yemen, Uganda, New Zealand, Malaysia, and Singapore [58].

4.2 Characteristics of Hindi Language and Devanagari Script

Hindi is written using the Devanagari script. Devanagari is also used to write other languages, such as Nepali and Marathi, and is the most common script used to write Sanskrit. Several other languages, such as Bengali, Punjabi, and Gujarati, have scripts which are related to Devanagari.

The Devanagari script represents the sounds of the Hindi language with remarkable consistency. Whereas many letters of the English alphabet can be pronounced in many different ways, the letters of the Devanagari script are pronounced consistently (with a few minor exceptions). Thus, Devanagari is relatively easy to learn.

Devanagari consists of 11 vowels and 33 consonants and is written from left to right.

4.2.1 Basic Genius

Devanagari is not actually an alphabet but a so-called alphasyllabary. An alphasyllabary is a writing system which is primarily based on consonants and in which vowel symbols are requisite yet secondary. As such, the fundamental genius of Devanagari is that every consonant letter represents a consonant followed by an inherent schwa vowel, अ. For example, the letter स is read "sa". To suppress the inherent vowel, one of two methods is required: a diacritical mark called a halant, or a ligature called a conjunct. To indicate a vowel other than the inherent vowel, diacritical marks called maatraas are used. For vowels independent of consonants, there exist full letters to transcribe vowels.


4.2.2 Vowels

Hindi has 11 vowels. Ten of them are transcribed in two distinct forms: the independent form and the dependent (maatraa) form. The independent form is used when the vowel letter appears alone, at the beginning of a word, or immediately following another vowel letter. The dependent form is used when the vowel follows a consonant.

Vowels in Independent Form

अ आ इ ई उ ऊ ऋ ए ऐ ओ औ

The following table lists each vowel in its independent form with a description. The best way to learn the pronunciation is to learn from a native speaker.

Vowel    Description
अ        as in "but", "again"
आ        as in "father", "far"
इ        as in "fit", "hit"
ई        as in "feet", "heat"
उ        as in "put", "pull"
ऊ        as in "pool", "shoot"
ऋ        as in "rip", "rib"
ए        as in "ate", "day"
ऐ        as in "man", "bat"
ओ        as in "go", "boat"
औ        as in "saw", "taught"

Table 4.1: Vowels in their independent form, with descriptions


4.2.3 Vowels in Dependent (maatraa) Form

When a vowel follows a consonant, it is written in its respective maatraa form, which is appended to the consonant. Maatraa forms never appear at the beginning of a word or after another vowel. The first vowel, अ, has no particular maatraa form. Instead, it is the default vowel: it is assumed to be present unless the maatraa form of another vowel is explicitly appended to a consonant. In Sanskrit, the vowel अ is pronounced at the end of a word. In Hindi, however, it is not pronounced except at the end of single-letter words. The following table lists each vowel in its independent form, its corresponding dependent form, and how it would appear with the consonant क ("k").

Independent    Dependent    With क
अ              (none)       क
आ              ा            का
इ              ि            कि
ई              ी            की
उ              ु            कु
ऊ              ू            कू
ऋ              ृ            कृ
ए              े            के
ऐ              ै            कै
ओ              ो            को
औ              ौ            कौ

Table 4.2: Maatraa Forms of Vowels
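In Unicode text, each dependent form in the table above is a separate combining character stored after its consonant. The following is a minimal Python sketch of that composition, assuming the standard Devanagari code points; the names MAATRAAS, KA and attach are illustrative and do not come from the text.

```python
import unicodedata

# Independent vowel -> dependent vowel sign (maatraa).
# The vowel अ (U+0905) is absent on purpose: it is inherent and has no sign.
MAATRAAS = {
    "\u0906": "\u093e",  # आ -> ा
    "\u0907": "\u093f",  # इ -> ि
    "\u0908": "\u0940",  # ई -> ी
    "\u0909": "\u0941",  # उ -> ु
    "\u090a": "\u0942",  # ऊ -> ू
    "\u090b": "\u0943",  # ऋ -> ृ
    "\u090f": "\u0947",  # ए -> े
    "\u0910": "\u0948",  # ऐ -> ै
    "\u0913": "\u094b",  # ओ -> ो
    "\u0914": "\u094c",  # औ -> ौ
}

KA = "\u0915"  # क


def attach(consonant: str, vowel: str) -> str:
    """Return the consonant carrying the given vowel; nothing is added for अ."""
    return consonant + MAATRAAS.get(vowel, "")


# Print the "With क" column together with the Unicode name of each sign.
for vowel, sign in MAATRAAS.items():
    print(vowel, attach(KA, vowel), unicodedata.name(sign))
```

Note that the sign for इ is stored after क even though it is drawn to its left; the rendering engine, not the stored text, handles the visual placement.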


4.2.4 Allophones

As mentioned earlier, the distinction between the vowels इ and ई is the duration of the pronunciation of the vowel: the former is shorter and the latter longer. However, in practice, the vowel इ is pronounced more like the English "i" as in the word "it", as described in the corresponding text. The same is so for the vowels उ and ऊ.

4.2.5 Final Schwa

The schwa अ is normally not pronounced at the end of a word. Thus, कान is pronounced "kaan", not "kaana". An exception occurs when a word ends in a conjunct. In this case, the word may be pronounced with a slight final schwa, as in मित्र, literally "mitr" but often pronounced like "mitr(a)", with a soft final schwa.

4.2.6 Monophthongs versus Diphthongs

Native English speakers should be careful not to pronounce the Hindi vowels that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a glide like the English "o" as in the word "low". Many vowel letters in English can represent diphthongs. Thus, whereas English may represent a diphthong with the letter "i" as in the word "site", in Devanagari this diphthong would be more precisely transcribed as two monophthongs, आ and ई: साईट.

4.2.7 Schwa Syncope

Sometimes the inherent vowel is not pronounced despite its implicit presence and the lack of any modifying diacritic. This phenomenon is called schwa syncope or, alternatively, schwa deletion. For instance, consider the word नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if the word were written नम्कीन ("namkeen"). There is no rule which can predict this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants (that is, consonants that each carry a pronounced vowel). Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".

Occasionally the schwa will not be totally deleted but will be very slightly pronounced.
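The heuristic above, combined with the final-schwa rule of the preceding section, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the romanization table is a tiny hypothetical subset, the function names are invented here, and the rule is the rough heuristic from the text rather than a complete schwa-deletion algorithm.

```python
VIRAMA = "\u094d"

# Small illustrative tables (assumptions, not a full transliteration scheme).
CONSONANTS = {
    "\u0915": "k",  # क
    "\u0917": "g",  # ग
    "\u0926": "d",  # द
    "\u0928": "n",  # न
    "\u092e": "m",  # म
    "\u0930": "r",  # र
    "\u0935": "v",  # व
}
MATRAS = {
    "\u093e": "aa", "\u093f": "i", "\u0940": "ee", "\u0941": "u",
    "\u0942": "oo", "\u0947": "e", "\u0948": "ai", "\u094b": "o", "\u094c": "au",
}


def to_units(word):
    """Split a word into [consonant, vowel] pairs; 'a' marks the inherent schwa."""
    units = []
    for ch in word:
        if ch in CONSONANTS:
            units.append([CONSONANTS[ch], "a"])
        elif ch in MATRAS and units:
            units[-1][1] = MATRAS[ch]      # a maatraa replaces the inherent vowel
        elif ch == VIRAMA and units:
            units[-1][1] = ""              # a halant suppresses it
        else:
            raise ValueError(f"character not covered by this sketch: {ch!r}")
    return units


def delete_schwas(units):
    """Apply final-schwa deletion, then the 'between two vocalic consonants' rule."""
    units = [u[:] for u in units]
    # Final schwa: silent, except in single-letter words.
    if len(units) > 1 and units[-1][1] == "a":
        units[-1][1] = ""
    # Medial schwa: scan right to left; drop it when both neighbours carry a vowel.
    for i in range(len(units) - 2, 0, -1):
        if units[i][1] == "a" and units[i - 1][1] and units[i + 1][1]:
            units[i][1] = ""
    return units


def romanize(word):
    return "".join(c + v for c, v in delete_schwas(to_units(word)))


print(romanize("\u0928\u092e\u0915\u0940\u0928"))  # नमकीन
print(romanize("\u0915\u093e\u0928"))              # कान
```

Applied literally, the same rule also drops the schwa after ग in देवनागरी, which goes beyond the single deletion described in the text; this is one reason the rule can only be treated as a heuristic.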

4.2.8 Schwa Pronunciation in Context

The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel which is similar to the English "e" as in the word "bed", but only in certain contexts, namely when अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case, both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".

4.2.9 Nasalization of Vowels

All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated by either the symbol ं or the symbol ँ. The former symbol is called bindu ("dot") and the latter symbol is called chandrabindu ("moon and dot"). The bindu is used when part or all of the vowel symbol extends above the horizontal line. The chandrabindu is used when no part of the vowel symbol extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.

The following examples summarize the use of the bindu and chandrabindu:

अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं

कँ काँ किं कीं कुँ कूँ कें कैं कों कौं

A special diacritic (ॉ) is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.

4.2.10 Consonants: Velar Consonants

Letter    Description
क         unaspirated "k"
ख         aspirated "k"
ग         unaspirated "g"
घ         aspirated "g"
ङ         "n" as in "sing"

Table 4.3: Consonants: Velar Consonants

Note that the velar nasal consonant does not appear as the first letter of any word.

4.2.11 Palatal Consonants

Letter    Description
च         unaspirated "ch" as in "cheese"
छ         aspirated "ch"
ज         unaspirated "j"
झ         aspirated "j"
ञ         "n" as in "punch"

Table 4.4: Palatal Consonants

4.2.12 Retroflex Consonants

Letter    Description
ट         like "t" but retroflex and unaspirated
ठ         like "t" but retroflex and aspirated
ड         like "d" but retroflex and unaspirated
ढ         like "d" but retroflex and aspirated
ण         like "n" but retroflex

Table 4.5: Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as the "t" in the word "better" or the "d" in "bedding" in American English. The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

Letter    Description
त         like "t" but dental and unaspirated
थ         like "t" but dental and aspirated
द         like "d" but dental and unaspirated
ध         like "d" but dental and aspirated
न         like "n" in "name" but dental

Table 4.6: Dental Consonants

4.2.14 Labial Consonants

Letter    Description
प         like "p" but unaspirated
फ         like "p" but aspirated
ब         like "b" but unaspirated
भ         like "b" but aspirated
म         "m"

Table 4.7: Labial Consonants

4.2.15 Semivowels

Letter    Description
य         "y" as in "young"
र         like "r" but often rolled
ल         "l" as in "lip"
व         either "w" or "v"

Table 4.8: Semivowels


The Hindi "r" sound is typically a flap. However, some speakers may trill the "r" sound occasionally, or may even occasionally pronounce it closer to an unflapped approximant sound, as in the English "r" in "red".

4.2.16 Sibilants

Letter    Description
श         "sh" as in "shave"
ष         like "sh" but retroflex
स         "s" as in "save"

Table 4.9: Sibilants

4.2.17 Glottal

Letter    Description
ह         like "h" but voiced

Table 4.10: Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of atomic, discrete sounds. Sounds that belong to different phonemes can produce a difference in meaning when spoken, whereas sounds that belong to the same phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English, the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus, they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce the distinction fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus, व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones: "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge, behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.


Capital Letters

Devanagari has no capital letters.

Special Maatraa Forms of उ and ऊ with र

र + उ = रु

र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic, and English.

4.2.20.1 Foreign Sounds

Letter    Approximation
क़         like "k" but pronounced in the back of the mouth
ख़         velar fricative, like "Bach" in German
ग़         velar sound similar to ख़ but voiced
ज़         just as English "z" as in "zoo"
झ़         similar to the "s" in English "vision"
फ़         just as English "f"

Table 4.11: Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
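For retrieval, nuqta letters raise a matching problem: Unicode allows a letter such as ज़ to be stored either as a single precomposed code point or as the base letter followed by a combining nuqta, and writers often omit the nuqta altogether. The sketch below, which uses Python's standard unicodedata module, shows one possible way to make these spellings comparable; the function names canonical and strip_nuqta are illustrative assumptions.

```python
import unicodedata

NUQTA = "\u093c"                 # DEVANAGARI SIGN NUKTA

precomposed = "\u095b"           # ज़ as one code point (DEVANAGARI LETTER ZA)
decomposed = "\u091c" + NUQTA    # ज followed by the combining nuqta

# The two spellings look identical on screen but differ as raw strings.
print(precomposed == decomposed)


def canonical(text: str) -> str:
    """Bring both spellings of a nuqta letter to one canonical form."""
    return unicodedata.normalize("NFC", text)


def strip_nuqta(text: str) -> str:
    """Fold nuqta letters onto their base letters, e.g. for lenient matching."""
    return unicodedata.normalize("NFD", text).replace(NUQTA, "")


print(canonical(precomposed) == canonical(decomposed))
print(strip_nuqta(precomposed) == "\u091c")
```

Whether to fold nuqta forms onto their base letters is a design choice: it improves recall for queries typed without the nuqta, at the cost of merging the few word pairs that the nuqta distinguishes.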


4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
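In Unicode text, both methods are normally stored the same way: the first consonant, then the halant character (called virama in the Unicode Standard), then the second consonant. Whether a ligature or a visible halant appears is decided by the font and the rendering engine. A minimal Python sketch, with the helper name conjunct chosen here for illustration:

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, i.e. the halant


def conjunct(*consonants: str) -> str:
    """Join consonants with the halant so only the last keeps its inherent vowel."""
    return VIRAMA.join(consonants)


ksha = conjunct("\u0915", "\u0937")  # क + ष
print(ksha, len(ksha))
for ch in ksha:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Although the conjunct is displayed as a single shape, it is three code points long, which matters when counting characters, truncating strings, or tokenizing text for an index.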

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)

च + छ = च्छ (अच्छा)

स + त = स्त (नमस्ते)

ल + ल = ल्ल (बिल्ली)

म + ब = म्ब (लम्बा)

फ + त = फ्त (मुफ्त)

क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical line, they still can be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.


4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)

ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant: in other words, it is articulated in the same location in the mouth as the following consonant. Thus, the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
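Because the same word can be spelled with an anusvar or with an explicit nasal conjunct, a search index may want to map both spellings to one form. The sketch below applies the homorganic rule described above: it replaces an anusvar with the nasal of the same place of articulation as the following stop. The names HOMORGANIC and expand_anusvar are assumptions made for this illustration. One caveat: Unicode encodes the anusvar and the bindu as the same character (U+0902), so a word in which a bindu happens to precede a stop would be expanded incorrectly by this simple rule.

```python
ANUSVAR = "\u0902"  # DEVANAGARI SIGN ANUSVARA (also used for the bindu)
VIRAMA = "\u094d"

# Each group of five consecutive code points is one place of articulation
# (velar, palatal, retroflex, dental, labial); the fifth letter is its nasal.
HOMORGANIC = {}
for start in (0x0915, 0x091A, 0x091F, 0x0924, 0x092A):
    row = [chr(start + i) for i in range(5)]
    for stop in row[:-1]:
        HOMORGANIC[stop] = row[-1]


def expand_anusvar(word: str) -> str:
    """Rewrite an anusvar as the explicit nasal conjunct when a stop follows it."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVAR and following in HOMORGANIC:
            out.append(HOMORGANIC[following] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)


hindi = "\u0939\u093f\u0902\u0926\u0940"  # हिंदी
print(hindi, "->", expand_anusvar(hindi))
```

A word-final nasalization mark, as in हैं, has no following stop and is left unchanged, which matches the distinction drawn in the text.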

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)

र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)

म + र = म्र (उम्र)

Four consonants, ट, ठ, ड, and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit:

क + ष = क्ष

त + त = त्त

त + र = त्र

ज + ञ = ज्ञ

द + द = द्द

द + ध = द्ध

द + य = द्य

द + व = द्व

श + र = श्र

ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit: a maatraa written to the left of its consonant, such as ि, is placed before the entire conjunct.

There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign, the viraam (।), which is a vertical line which terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam [59][60].

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters in the English language (it can store at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a standard accepted universally. That standard is Unicode. Unicode assigns each character a code point in the range U+0000 to U+10FFFF (a 21-bit code space, often stored in 32-bit units), which can potentially give a unique number to each character in all languages known to man.

Actually, there is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.


4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms that specify how a particular character will be represented in bits. The three encoding forms allow the same data to be transmitted in a byte, word, or double-word oriented format (i.e., in 8-bit, 16-bit, or 32-bit code units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms, as defined by the Unicode Consortium, are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.


By the way, UTF stands for UCS Transformation Format.
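The difference between the three encoding forms is easy to observe for a single Devanagari letter. The following Python sketch uses only the built-in str.encode method; the helper name encoded_sizes is an assumption made for this illustration, and the big-endian codec names are chosen so that no byte order mark is added to the output.

```python
FORMS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in each of the three encoding forms."""
    return {form: len(ch.encode(form)) for form in FORMS}


ka = "\u0915"  # क, DEVANAGARI LETTER KA

print(encoded_sizes("A"))   # an ASCII letter
print(encoded_sizes(ka))    # a Devanagari letter
for form in FORMS:
    print(form, ka.encode(form).hex(" "))

# The same character survives a trip through any of the forms unchanged.
for form in FORMS:
    assert ka.encode(form).decode(form) == ka
```

For Hindi text, UTF-8 therefore needs three bytes per Devanagari character while UTF-16 needs two, which is worth keeping in mind when estimating index or storage sizes.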

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers, and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size (the original design allowed up to 6 bytes, but the current definition stops at 4); the most commonly used characters are at most three bytes long. They are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows the application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
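The byte-level rules above can be checked directly. The sketch below is an illustration rather than a full UTF-8 validator: it reads the high bits of a lead byte to find the length of its sequence and tests whether a byte is a continuation byte. The function names are assumptions chosen for this example.

```python
def sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this byte."""
    if lead < 0x80:             # 0xxxxxxx: ASCII, a single byte
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the form 10xxxxxx."""
    return byte >> 6 == 0b10


data = "A\u0915".encode("utf-8")  # the letter A followed by क
for b in data:
    print(f"{b:08b}", "continuation" if is_continuation(b) else sequence_length(b))
```

Every byte of the Devanagari letter has its most significant bit set, so none of them can be confused with the ASCII byte that precedes them.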

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages, another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch of the Brahmi family, to which Devanagari belongs, includes such modern scripts as Bengali, Gurmukhi, and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.


4.3.4 Devanagari: U+0900-U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0-F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar, and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.

In November 1991 at the time The Unicode Standard Version 10 was

published the Bureau of Indian Standards published a new version of ISCII in

Indian Standard (IS) 131941991 This new version partially modified the layout

and repertoire of the ISCII- 1988 standard Because of these events the Unicode

Standard does not precisely follow the layout of the current version of ISCII

Nevertheless the Unicode Standard remains a superset of the ISCII-1991

repertoire except for a number of new Vedic extension characters defined in IS

131941991 Annex GmdashExtended Character Set for Vedic Modern non-Vedic

texts encoded with ISCII-1991 may be automatically converted to Unicode code

points and back to their original encoding without loss of information
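The parallel code layout described in this section can be illustrated with a short Python sketch. It assumes the standard Unicode block starts for the nine scripts (U+0900 for Devanagari, then one block every 0x80 positions); the function name `shift_script` is hypothetical. Analogous positions are not always assigned in every script, so the sketch leaves a character unchanged when the target position is unassigned. It is an illustration of the layout, not a transliteration tool.

```python
import unicodedata

# Start of each Unicode block that follows the ISCII-derived parallel layout.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source, target):
    """Move each character of the source block to the same offset in the target block."""
    src = SCRIPT_BASE[source]
    dst = SCRIPT_BASE[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(dst + offset)
            # Keep the original character if the target position is unassigned.
            if unicodedata.category(candidate) != "Cn":
                ch = candidate
        out.append(ch)
    return "".join(out)


KA = "\u0915"  # DEVANAGARI LETTER KA
for script in ("bengali", "gujarati", "tamil"):
    shifted = shift_script(KA, "devanagari", script)
    print(script, unicodedata.name(shifted))
```

Because Tamil has no letter at the position of Devanagari KHA, `shift_script("\u0916", "devanagari", "tamil")` returns the input unchanged, which shows why the mapping is only approximate.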

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
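Logical order can be observed directly by listing the code points of a word. The following minimal Python sketch uses only the standard `unicodedata` module; the helper name `code_points` is an assumption made for illustration. In the Devanagari spelling of the word "Hindi", the vowel sign I is drawn to the left of the letter HA but is stored after it, because it is pronounced after it.

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode character name) pairs in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The word "Hindi" in Devanagari, written with escapes so the order is explicit.
HINDI = "\u0939\u093F\u0928\u094D\u0926\u0940"

for cp, name in code_points(HINDI):
    print(cp, name)
```

The output lists a consonant letter (HA), a dependent vowel sign (I), a consonant cluster built with the virama (NA, VIRAMA, DA) and a final dependent vowel sign (II), matching the character types named above.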

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, it ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office, as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar.

Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two" [63].

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides the corpus architecture and tools for Asian languages. It has a monolingual corpus that contains approximately 96,157,000 words and a parallel corpus that consists of 200,000 words of text in English, which supports translation involving Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository for a Hindi word database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system and has approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application that is available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti during 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004, with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on the grammar rules of Panini's Ashtadhyayi. It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially, it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present, it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (the Rajya Sabha is the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based approach with statistical methods. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text for text manipulation. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

During 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT: Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, and backed by Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian language content and tries to return the most appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasha is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasha is English, and the target language can be any Indian language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation and collection and content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of the need for Indian language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS) was developed by Avinash Chopde; ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by JuxtConsult, 2008-2009 Report

4.6.1.1 Most Used Websites

Websites     2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. But just saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs and by how much.
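One way to quantify the problem is the standard recall measure: the fraction of relevant documents that a query actually retrieves. The sketch below is illustrative only and is not the experimental setup used in this study. The three-document collection, the relevance judgments and the function names are hypothetical, and exact token matching stands in for a real search engine. The word "center" appears in the documents in two spellings (with anusvara and with half NA), and a third spelling is used only in a query.

```python
# Three spellings of the loanword "center" in Devanagari (escapes keep them distinct).
CENTER_ANUSVARA = "\u0938\u0947\u0902\u091F\u0930"
CENTER_HALF_NA = "\u0938\u0947\u0928\u094D\u091F\u0930"
CENTER_AI = "\u0938\u0948\u0902\u091F\u0930"
ATTACK = "\u0939\u092E\u0932\u093E"  # "hamlaa" (attack)

# Hypothetical collection: d1 and d2 are both about the same event.
DOCS = {
    "d1": f"{CENTER_ANUSVARA} {ATTACK}",
    "d2": f"{CENTER_HALF_NA} {ATTACK}",
    "d3": "cricket match result",
}
RELEVANT = {"d1", "d2"}


def exact_match_search(query, docs):
    """Return ids of documents that contain every query token verbatim."""
    terms = query.split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.split() for term in terms)]


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


for spelling in (CENTER_ANUSVARA, CENTER_HALF_NA, CENTER_AI):
    hits = exact_match_search(f"{spelling} {ATTACK}", DOCS)
    print(len(hits), recall(hits, RELEVANT))
```

Each of the two spellings present in the collection finds only one of the two relevant documents, and the third spelling finds none, which mirrors the zero-result queries described above.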

Hindi Query    Google    Yahoo    Guruji
वलडम टर ड स नटय आत कव दी हरभ    0    0    0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम    0    0    0

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query    Google    Yahoo    Guruji
वलडम टर ड सटय आतॊकी हभर    8820    92    12
वलडम टर ड सटय आतॊकीअटक    331    10    1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम    708    50    1
बायतीम सॊटथान टवाटम शिा औय िोध    37100    7400    93

Table 4.16 Improved recall

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are often not followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being an official language of India, is influenced by the regional languages, which results in dialectal changes not only in speaking but also in writing. Every language uses markers that attach to a root word to construct new words: English uses s, es and ing, and Hindi uses maatraas (vowel signs) and endings such as ओं and एं. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
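Word stemming as described here can be sketched as simple suffix stripping. The Python example below is a minimal illustration, not the stemmer evaluated in this study: the suffix list covers only a few plural and oblique endings, `MIN_STEM_LENGTH` is an arbitrary safeguard, and all names are hypothetical. Devanagari strings are written with escapes so that the exact code points are unambiguous.

```python
# A few plural/oblique endings (illustrative and far from complete).
SUFFIXES = [
    "\u0913\u0902",  # independent O + anusvara, as in Yojnaaon
    "\u090F\u0902",  # independent E + anusvara, as in Yojnaayein
    "\u090F\u0901",  # independent E + chandrabindu (spelling variant)
    "\u094B\u0902",  # vowel sign O + anusvara
    "\u0947\u0902",  # vowel sign E + anusvara
]
MIN_STEM_LENGTH = 2  # never strip a word down to fewer code points than this


def stem(word):
    """Strip the longest matching suffix, or return the word unchanged."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


YOJNAA = "\u092F\u094B\u091C\u0928\u093E"   # root word Yojnaa
YOJNAAON = YOJNAA + "\u0913\u0902"          # Yojnaaon
YOJNAAYEIN = YOJNAA + "\u090F\u0902"        # Yojnaayein

stems = {stem(w) for w in (YOJNAA, YOJNAAON, YOJNAAYEIN)}
print(stems == {YOJNAA})
```

Indexing documents and queries by `stem(word)` instead of the surface form lets all three variants match each other. A rule list this small will over-strip or under-strip many words, so a practical stemmer needs a much larger rule set or a lexicon.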

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. The variety in the alphabet, different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.

There are numerous words that are phonetically equivalent but vary in writing. The word school in Hindi can be written in several different ways. When information is searched for the standard keyword for school (स्कूल) and for a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by a large part of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should be able to support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e., a different set of documents is retrieved with little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
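One common way to make phonetically equivalent spellings match is to map each word to a normalized key before indexing and searching. The sketch below is a minimal example under stated assumptions: the three rules (dropping the nukta, treating chandrabindu as anusvara, and rewriting a half nasal consonant before another consonant as anusvara) are illustrative choices, and the function name `spelling_key` is hypothetical. It does not handle variants such as the added initial vowel in ISKOOL.

```python
import unicodedata

NUKTA = "\u093C"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094D"
NASAL_CONSONANTS = "\u0919\u091E\u0923\u0928\u092E"  # NGA, NYA, NNA, NA, MA


def is_consonant(ch):
    """True for the basic Devanagari consonant letters KA..HA."""
    return "\u0915" <= ch <= "\u0939"


def spelling_key(word):
    """Collapse a few common spelling variants into one comparable key."""
    # NFD splits precomposed nukta letters (U+0958..U+095F) into letter + nukta.
    word = unicodedata.normalize("NFD", word).replace(NUKTA, "")
    word = word.replace(CHANDRABINDU, ANUSVARA)
    out = []
    i = 0
    while i < len(word):
        # Half nasal (nasal + virama) before a consonant is written as anusvara.
        if (word[i] in NASAL_CONSONANTS and i + 2 < len(word)
                and word[i + 1] == VIRAMA and is_consonant(word[i + 2])):
            out.append(ANUSVARA)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


ANGREJI = "\u0905\u0902\u0917\u094D\u0930\u0947\u091C\u0940"         # without nukta
ANGREZI = "\u0905\u0902\u0917\u094D\u0930\u0947\u091C\u093C\u0940"   # with nukta
ANGREJI_CB = "\u0905\u0901\u0917\u094D\u0930\u0947\u091C\u0940"      # with chandrabindu

print(len({spelling_key(w) for w in (ANGREJI, ANGREZI, ANGREJI_CB)}))
```

All three spellings of angrējī share one key, so a query in any of them can retrieve documents written in the others. The rules trade some precision for recall, since distinct words that differ only in these marks also collapse together.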

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. And a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) न यी क स न ऩस द ह

In this query the word स न (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the above query the word स न means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आभ र गो की ऩस द

Here the word आभ is ambiguous. In the above query it means "common"; in Hindi, however, it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कय means "tax" in one context (synonyms: बम ज श लक स द भहस ऱ ट कस), "hand" or "arm" in another context (synonyms: हसत फ ह आच शफय), and "to do" (कयन) in yet another.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director, and department are used even by uneducated speakers, who may not be aware which language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 gives a set of randomly selected queries from this sample, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


Table 4.18 Effect of morphological factors on Hindi queries

S. No. | Root word(s) | Listing of keywords (Google, Bing, Guruji) | Variant No. | Morphological variant | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | 1.1 | ब यत वष मवन | 50500 | 4680 | 485
 | | | 1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | 2.1 | द घमटन | 133000 | 2410 | 284
 | | | 2.2 | द घमटन ओ | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | 3.1 | ब ष | 161000 | 8330 | 961
 | | | 3.2 | ब ष ए | 6200 | 935 | 188
 | | | 3.3 | ब ष ओ | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | 4.1 | झ र | 4740 | 278 | 25
 | | | 4.2 | झ र | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | 5.1 | आऩद | 102000 | 4030 | 410
 | | | 5.2 | आऩद ए | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | 6.1 | ऩ परज तत | 48200 | 1670 | 98
 | | | 6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
 | | | 6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | 7.1 | सभसम | 584000 | 30200 | 1889
 | | | 7.2 | सभसम ओ | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | 8.1 | कीटन शक | 36300 | 1360 | 333
 | | | 8.2 | कीटन शको | 35800 | 800 | 270
9 | य ग | य ग य गोय ग य ग य ग | 9.1 | य ग | 205000 | 21600 | 1423
 | | | 9.2 | य चगमो | 128000 | 3280 | 239
 | | | 9.3 | य ग | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | 10.1 | म जन | 673000 | 18500 | 3343
 | | | 10.2 | म जन ओ | 669000 | 6020 | 990
 | | | 10.3 | म जन ए | 673000 | 2860 | 416
11 | क दर | क दर क दरीमक दर क दर क दर | 11.1 | क दर | 261000 | 11300 | 655
 | | | 11.2 | क नदरो | 29500 | 1850 | 105


Table 4.19 Precision values of the three search engines

S. No. | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (y-axis: precision, 0 to 1; x-axis: query numbers 1.1 to 11.2; series: Google, Bing, Guruji)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents by root word, because in general we get better results with keywords in their root form.
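As a rough illustration of reducing a keyword to its root form before it is submitted, the sketch below strips a few plural and oblique endings of the kind seen in the sample queries. The suffix list and the names `SUFFIXES`, `light_stem`, and `stem_query` are assumptions made for this example; they are not the method used in this study, and a practical stemmer would need a much fuller suffix inventory.

```python
# Illustrative suffix list (an assumption, deliberately incomplete).
SUFFIXES = (
    "\u0913\u0902",  # independent vowel O + anusvara, as in bhashaon
    "\u090f\u0901",  # independent vowel E + chandrabindu, as in bhashaen
    "\u090f\u0902",  # independent vowel E + anusvara (alternative spelling)
    "\u094b\u0902",  # vowel sign O + anusvara, as in keetnashakon
)


def light_stem(word: str) -> str:
    """Strip at most one known suffix; return the word unchanged otherwise."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Keep at least two code points of stem so very short words survive.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def stem_query(query: str) -> str:
    """Apply light_stem to every whitespace-separated token of a query."""
    return " ".join(light_stem(token) for token in query.split())


# Example call on a plural form built from code points (bhasha + O + anusvara).
root_form = stem_query("\u092d\u093e\u0937\u093e\u0913\u0902")
```

Both the original query and its stemmed form could then be sent to a search engine and the result counts compared, in the manner of Table 4.18.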

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all of the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form.

For search engines, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results returned by each search engine. On closely observing the results, we can say that Google's precision is high in almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
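The precision figures above are computed over the top 10 results. A minimal sketch of that calculation follows, assuming each result has already been judged relevant or not by a human assessor; the function name `precision_at_k` and the sample judgments are illustrative only.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a ranked sequence of truthy/falsy judgments. If fewer
    than k results were returned, the missing ones count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged in top if judged) / k


# Hypothetical judgments for one query: 5 of the first 10 results relevant.
judgments = [True, True, False, True, False, True, False, False, True, False]
score = precision_at_k(judgments)
```

Dividing by k rather than by the number of results returned penalizes an engine that returns fewer than ten documents, which is one reasonable reading of the very low scores for queries with few hits.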

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should be able to support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of the Hindi language

Hindi query with standard keywords | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5


Figure 4.2 Precision charts for the phonetic nature of the Hindi language (one pie chart each for Query No. 1 to Query No. 5)

In the above table and figure it can be clearly seen that the search engine returns documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keyword used to fetch Hindi web data. For every phonetically equivalent keyword in the query there is variation in the results, i.e. a different set of documents is retrieved with the least repetition. From the precision chart it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
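One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a coarse normalization key before matching. The sketch below is only an illustration under stated assumptions: it folds three common sources of variation (nukta, chandrabindu versus anusvara, and a nasal consonant plus halant versus anusvara), and the name `phonetic_key` and the rule set are hypothetical rather than taken from this study. The key is deliberately lossy and may merge a few genuinely distinct words.

```python
import unicodedata

ANUSVARA = "\u0902"
CHANDRABINDU = "\u0901"
NUKTA = "\u093c"
HALANT = "\u094d"
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # nga, nya, nna, na, ma


def _is_consonant(ch: str) -> bool:
    """True for the basic Devanagari consonant range KA..HA."""
    return "\u0915" <= ch <= "\u0939"


def phonetic_key(word: str) -> str:
    """Return a coarse key shared by common spelling variants of a word."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "").replace(CHANDRABINDU, ANUSVARA)
    out = []
    i = 0
    while i < len(text):
        # Nasal consonant + halant before another consonant -> anusvara.
        if (text[i] in NASALS and i + 2 < len(text)
                and text[i + 1] == HALANT and _is_consonant(text[i + 2])):
            out.append(ANUSVARA)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Three spellings of "andolan" (half-na, anusvara, chandrabindu).
variants = [
    "\u0906\u0928\u094d\u0926\u094b\u0932\u0928",
    "\u0906\u0902\u0926\u094b\u0932\u0928",
    "\u0906\u0901\u0926\u094b\u0932\u0928",
]
keys = {phonetic_key(v) for v in variants}
```

Grouping indexed terms or query keywords by such a key would let one spelling retrieve documents written with another, at the cost of occasional false merges.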

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly spoken synonyms: गहन ज वयअर क य.

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.


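Table 4.21 Effect of word synonyms on Hindi IR

S. No. | Query | Standard Hindi words / synonyms | Google: documents returned | Google: Precision@10 | Bing: documents returned | Bing: Precision@10 | Guruji: documents returned | Guruji: Precision@10
1 | स न क आब षण | आब षण गहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1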


Figure 4.3 Comparison of precision values against three search engines (precision from 0 to 1 for queries 1.1 to 6.3; series: Google, Bing, Guruji)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Synonyms of Hindi keywords affect not only the quantity of documents returned but also their quality.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in Hindi information retrieval.
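The idea of substituting a keyword with its synonyms can be sketched as a simple query-expansion step. The function name `expand_query`, the dictionary `SYNONYMS`, and the Hindi spellings in it are assumptions for illustration (the spellings are one plausible reading of the ornament example above); this is not the tool described in the next chapter.

```python
# Hypothetical synonym dictionary: keyword -> list of synonyms.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms):
    """Return the original query followed by one variant per synonym.

    Each variant replaces exactly one keyword, so the rest of the query,
    and therefore its meaning, stays unchanged.
    """
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


# Each variant could be submitted separately and the result lists merged.
expanded = expand_query("सोने के आभूषण", SYNONYMS)
```

Replacing one keyword at a time keeps the number of variants linear in the number of synonyms, which matters if every variant is sent to a search engine.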

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected ones below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context.



Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing, and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the tables, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, unwanted context. The results make clear that all the search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.

For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the "art" context, but Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address ambiguity in keywords in order to obtain better results.
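To make the notion of context concrete, the following sketch picks a sense for an ambiguous keyword by counting how many of the other query words appear in a small cue list for each sense (a simplified overlap approach in the spirit of the Lesk algorithm). The cue lists, the names `SENSE_CUES` and `disambiguate`, and the example words are all hypothetical; none of the search engines tested is known to work this way.

```python
# Hypothetical cue words for two senses of one ambiguous keyword.
KEYWORD = "सोना"
SENSE_CUES = {
    KEYWORD: {
        "gold": {"चांदी", "भाव", "आभूषण", "गहने"},
        "to sleep": {"रात", "नींद", "जल्दी", "बिस्तर"},
    }
}


def disambiguate(query, keyword, sense_cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the query.

    Returns None when no cue word is present or when two senses tie,
    i.e. when the query itself does not resolve the ambiguity.
    """
    context = set(query.split()) - {keyword}
    scores = {
        sense: len(context & cues)
        for sense, cues in sense_cues[keyword].items()
    }
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


# A query with no cue words stays unresolved, like "women like gold" above.
unresolved = disambiguate(KEYWORD, KEYWORD)
```

A query such as the "women like gold" example contains no cue for either sense, which is exactly the situation in which the engines above mix documents from both contexts.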


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the pattern of phonetic variation shown in this example, a variety of popular keywords and queries from various domains were tested experimentally.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | standard keyword | equivalent 1 | equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

From the above table it can be observed that the search engine returns documents for a single-keyword query and also returns documents, in large numbers, for all phonetic variants of the keyword. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the "Google transliteration" column shows the keywords obtained by using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were nevertheless retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged.

Example: the English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Thus the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
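The two-level transformation can be sketched as a cumulative replacement of keywords, producing one variant per level. The function name `transform_levels` and the replacement dictionary are assumptions for illustration, and romanized tokens stand in for the Devanagari keywords; the actual software is described in the next chapter and may work differently.

```python
def transform_levels(query, replacements):
    """Return query variants with one more keyword replaced at each level.

    Level 1 replaces the first keyword that has an English equivalent,
    level 2 additionally replaces the next one, and so on.
    """
    tokens = query.split()
    levels = []
    for i, token in enumerate(tokens):
        if token in replacements:
            tokens[i] = replacements[token]
            levels.append(" ".join(tokens))
    return levels


# Romanized stand-ins for the Hindi keywords and their English equivalents.
english_equivalents = {"videshi": "foreign", "nivesh": "investment"}
levels = transform_levels("videshi nivesh bharat mein", english_equivalents)
```

With the dictionary above, the first variant replaces only "videshi" and the second replaces both keywords, mirroring Level 1 and Level 2 in the example.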


Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses (one pie chart each for Hindi Query 1 to Hindi Query 6; slices: Hindi query, Level 1, Level 2)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, are used for retrieving web results.

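Table 4.28 Analysis of precision values (tabular representation)

Query | Variant | Query text | Precision@10: Google | Bing | AltaVista
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1
4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1
6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1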


Figure 4.5 Analysis of inter-precision values

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. Transforming the queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns: for every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant; the same pattern repeats for the rest of the Hindi queries shown in the table above.

Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variants that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

(Inter-precision chart: precision for HQ1 to HQ6 at the original query and at Levels 1 and 2, where HQ = Hindi query and L = level; series: Google, Bing, AltaVista.)


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in Hindi [66][67].


The Hindi language has been enriched by Persian (Farsi), Turkish, Arabic, Portuguese, and English.

Today Hindi is widely spoken in South Asia (India, Pakistan, Nepal, and Bhutan), South Africa, Mauritius, the USA, Trinidad, Fiji, Surinam, Guyana, Yemen, Uganda, New Zealand, Malaysia, and Singapore [58].

4.2 Characteristics of Hindi Language and Devanagari Script

Hindi is written using the Devanagari script. Devanagari is also used to write other languages, such as Nepali and Marathi, and is the most common script used to write Sanskrit. Several other languages, such as Bengali, Punjabi, and Gujarati, have scripts that are related to Devanagari.

The Devanagari script represents the sounds of the Hindi language with remarkable consistency. Whereas many letters of the English alphabet can be pronounced in many different ways, the letters of the Devanagari script are pronounced consistently (with a few minor exceptions). Thus Devanagari is relatively easy to learn.

Devanagari consists of 11 vowels and 33 consonants and is written from left to right.

4.2.1 Basic Genius

Devanagari is not actually an alphabet but a so-called alphasyllabary. An alphasyllabary is a writing system that is primarily based on consonants and in which vowel symbols are requisite yet secondary. As such, the fundamental genius of Devanagari is that every letter represents a consonant followed by an inherent schwa vowel, अ. For example, the letter स is read "sa". To suppress the inherent vowel, one of two methods is required: a diacritical mark called a halant, or a ligature called a conjunct. To indicate a vowel other than the inherent vowel, diacritical marks called maatraas are used. For vowels independent of consonants, there exist full letters.
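The way consonants, the halant, and maatraas combine can be inspected at the level of Unicode code points, which is also how a retrieval system sees Hindi text. The sketch below uses Python's standard `unicodedata` module; the constant names are chosen for this example, and note that Unicode calls the halant "VIRAMA".

```python
import unicodedata

SA = "\u0938"        # consonant letter, read "sa" with the inherent schwa
TA = "\u0924"        # consonant letter "ta"
HALANT = "\u094d"    # suppresses the inherent vowel
MATRA_AA = "\u093e"  # dependent (maatraa) form of the vowel AA

# In encoded text a conjunct is stored as consonant + halant + consonant;
# the font renders the three code points as one ligature.
CONJUNCT_STA = SA + HALANT + TA


def describe(text):
    """Return the Unicode character name of every code point in `text`."""
    return [unicodedata.name(ch) for ch in text]


bare_consonant = describe(SA + HALANT)    # "s" without the inherent vowel
with_maatraa = describe(SA + MATRA_AA)    # "saa": the maatraa replaces the schwa
```

Because a single visible syllable may span two or three code points, tokenization and spelling comparison for Hindi have to work on these sequences rather than on isolated characters.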


4.2.2 Vowels

Hindi has 11 vowels. Ten of them are transcribed in two distinct forms: the independent form and the dependent (maatraa) form. The independent form is used when the vowel letter appears alone, at the beginning of a word, or immediately following another vowel letter. The dependent form is used when the vowel follows a consonant.

Vowels in Independent Form

अ आ इ ई उ ऊ ऋ ए ऐ ओ औ

The following table lists each vowel in its independent form along with a description. The best way to learn the pronunciation is to learn from a native speaker.

Vowel | Description
अ | as in "but", "again"
आ | as in "father", "far"
इ | as in "fit", "hit"
ई | as in "feet", "heat"
उ | as in "put", "pull"
ऊ | as in "pool", "shoot"
ऋ | as in "rip", "rib"
ए | as in "ate", "day"
ऐ | as in "man", "bat"
ओ | as in "go", "boat"
औ | as in "saw", "taught"

Table 4.1 Vowels in their independent form and their descriptions


4.2.3 Vowels in Dependent (Maatraa) Form

When a vowel follows a consonant, it is written in its maatraa form, which is appended to the consonant. Maatraa forms never appear at the beginning of a word or after another vowel. The first vowel, अ, has no maatraa form; instead it is the default vowel, assumed to be present unless the maatraa form of another vowel is explicitly appended to the consonant. In Sanskrit the vowel अ is pronounced at the end of a word. In Hindi, however, it is not pronounced except at the end of single-letter words. The following table lists each vowel in its independent form, its corresponding dependent form, and how it appears with the consonant क (k).

Independent Dependent With क

अ (none) क

आ ा क

इ िा कक

ई ा की

उ ा क

ऊ ा क

ऋ ा क

ए ा क

ऐ ा क

ओ ा क

औ ा क

Table 42 Maatraa Forms of Vowels
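In a Unicode text, appending a maatraa amounts to placing the code point of the dependent vowel sign directly after the code point of the consonant. The following is a minimal sketch of that idea; the names MAATRAA and attach_vowel are illustrative and not part of any library, and the code points are those of the Unicode Devanagari block.

```python
# Independent vowel -> dependent (maatraa) form.
# The inherent vowel अ has no maatraa, so it maps to the empty string.
MAATRAA = {
    "\u0905": "",        # अ -> (none)
    "\u0906": "\u093e",  # आ -> ा
    "\u0907": "\u093f",  # इ -> ि
    "\u0908": "\u0940",  # ई -> ी
    "\u0909": "\u0941",  # उ -> ु
    "\u090a": "\u0942",  # ऊ -> ू
    "\u090b": "\u0943",  # ऋ -> ृ
    "\u090f": "\u0947",  # ए -> े
    "\u0910": "\u0948",  # ऐ -> ै
    "\u0913": "\u094b",  # ओ -> ो
    "\u0914": "\u094c",  # औ -> ौ
}


def attach_vowel(consonant: str, vowel: str) -> str:
    """Return the consonant followed by the maatraa of an independent vowel."""
    return consonant + MAATRAA[vowel]


# Print the "With क" column of the table for every vowel.
for vowel in MAATRAA:
    print(vowel, attach_vowel("\u0915", vowel))
```

Note that the maatraa of इ is stored after the consonant even though it is drawn to its left; the rendering engine, not the stored text, handles the visual reordering.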

4.2.4 Allophones

As mentioned earlier, the distinction between the vowels इ and ई is the duration of the pronunciation of the vowel: the former is shorter and the latter longer. In practice, however, the vowel इ is pronounced more like the English "i" in the word "it", as described in the corresponding text. The same holds for the vowels उ and ऊ.

4.2.5 Final Schwa

The schwa अ is normally not pronounced at the end of a word. Thus कान is pronounced "kaan", not "kaana". An exception occurs when a word ends in a conjunct. In this case, the word may be pronounced with a slight final schwa, as in मित्र, literally "mitr" but often pronounced like "mitr(a)", with a soft final schwa.

4.2.6 Monophthongs versus Diphthongs

Native English speakers should be careful not to pronounce the Hindi vowels that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a glide like the English "o" in the word "low". Many vowel letters in English can represent diphthongs. Thus, whereas English may represent a diphthong with the letter "i", as in the word "site", in Devanagari this diphthong would be more precisely transcribed as two monophthongs, आ and ई: साईट.

4.2.7 Schwa Syncope

Sometimes the inherent vowel is not pronounced, despite its implicit presence and the lack of any modifying diacritic. This phenomenon is called schwa syncope or, alternatively, schwa deletion. For instance, consider the word नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if the word were written नम्कीन ("namkeen"). There is no rule which can predict this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants. Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".

Occasionally, the schwa will not be totally deleted but will be very slightly pronounced.

4.2.8 Schwa Pronunciation in Context

The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the English "e" in the word "bed", but only in certain contexts, namely when अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case, both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".

4.2.9 Nasalization of Vowels

All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated either by the symbol ◌ं or by the symbol ◌ँ. The former is called the bindu ("dot") and the latter the chandrabindu ("moon and dot"). The bindu is used when part or all of the vowel symbol extends above the horizontal line. The chandrabindu is used when no part of the vowel symbol extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.

The following examples summarize the use of the bindu and chandrabindu:

अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं

कँ काँ किं कीं कुँ कूँ कें कैं कों कौं

A special diacritic is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.

4.2.10 Consonants: Velar Consonants

Letter    Description
क    unaspirated "k"
ख    aspirated "k"
ग    unaspirated "g"
घ    aspirated "g"
ङ    "n" as in "sing"

Table 4.3: Velar Consonants

Note that the velar nasal consonant does not appear as the first letter of any word.

4.2.11 Palatal Consonants

Letter    Description
च    unaspirated "ch", as in "cheese"
छ    aspirated "ch"
ज    unaspirated "j"
झ    aspirated "j"
ञ    "n" as in "punch"

Table 4.4: Palatal Consonants

4.2.12 Retroflex Consonants

Letter    Description
ट    like "t", but retroflex and unaspirated
ठ    like "t", but retroflex and aspirated
ड    like "d", but retroflex and unaspirated
ढ    like "d", but retroflex and aspirated
ण    like "n", but retroflex

Table 4.5: Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as in the "t" of "better" or the "d" of "bedding" in American English. The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

Letter    Description
त    like "t", but dental and unaspirated
थ    like "t", but dental and aspirated
द    like "d", but dental and unaspirated
ध    like "d", but dental and aspirated
न    like "n" in "name", but dental

Table 4.6: Dental Consonants

4.2.14 Labial Consonants

Letter    Description
प    like "p", but unaspirated
फ    like "p", but aspirated
ब    like "b", but unaspirated
भ    like "b", but aspirated
म    "m"

Table 4.7: Labial Consonants

4.2.15 Semivowels

Letter    Description
य    "y" as in "young"
र    like "r", but often rolled
ल    "l" as in "lip"
व    either "w" or "v"

Table 4.8: Semivowels

The Hindi "r" sound is typically a flap. However, some speakers may trill the "r" sound occasionally, or may even occasionally pronounce it closer to an unflapped approximant, like the English "r" in "red".

4.2.16 Sibilants

Letter    Description
श    "sh" as in "shave"
ष    like "sh", but retroflex
स    "s" as in "save"

Table 4.9: Sibilants

4.2.17 Glottal

Letter    Description
ह    like "h", but voiced

Table 4.10: Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of atomic, discrete sounds: sounds from different phonemes can produce a difference in meaning when spoken, whereas sounds within the same phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English, the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced "jo". There is some variation, however, in words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation lies between the Devanagari retroflex and dental positions, although to Hindi speakers the English pronunciation sounds much closer to the retroflex one. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.

Capital Letters

Devanagari has no capital letters.

Special Maatraa Forms of उ and ऊ with र

र + उ = रु

र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages, such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter    Approximation
क़    like "k", but pronounced in the back of the mouth
ख़    velar fricative, like "Bach" in German
ग़    velar sound similar to ख़, but voiced
ज़    just as English "z", as in "zoo"
झ़    similar to the "s" in English "vision"
फ़    just as English "f"

Table 4.11: Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
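For retrieval purposes, nuqta letters raise a matching problem: the same letter can be stored either as a single precomposed code point or as a base consonant followed by the nuqta sign, and writers often omit the nuqta altogether. The sketch below uses Python's standard unicodedata module to bring both spellings to one form and, optionally, to drop the nuqta. The function name fold_nuqta is illustrative, and whether folding is desirable for a given search task is an assumption made here, not something the text prescribes.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def fold_nuqta(text: str) -> str:
    """Decompose nuqta letters and drop the nuqta, e.g. ज़ -> ज."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


precomposed = "\u095b"     # ज़ stored as one code point
sequence = "\u091c\u093c"  # ज followed by the nuqta sign

# Both spellings normalize to the same sequence.
print(unicodedata.normalize("NFD", precomposed) == sequence)

# ज़्यादा and ज्यादा become identical after folding.
print(fold_nuqta("\u095b\u094d\u092f\u093e\u0926\u093e"))
```

Folding trades precision for recall: it lets "zyada" and "jyada" spellings match each other, at the cost of merging the few word pairs that the nuqta distinguishes.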

4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (◌्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically used only when typographical difficulties make it difficult to use conjuncts.
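In Unicode text the two methods are not stored differently: a conjunct is encoded as the first consonant, the halant (named VIRAMA in Unicode), and the second consonant, and the font decides whether to draw a ligature or a visible halant. The following sketch illustrates this; make_conjunct and describe are illustrative helper names, not library functions.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, i.e. the halant


def make_conjunct(first: str, second: str) -> str:
    """Encode a two-consonant conjunct as consonant + virama + consonant."""
    return first + VIRAMA + second


def describe(text: str) -> list[str]:
    """List the code point and Unicode name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# क + ष is stored as three code points, although it is drawn as one ligature.
for line in describe(make_conjunct("\u0915", "\u0937")):
    print(line)
```

A search system that compares code point sequences therefore sees the same string regardless of how the conjunct happens to be rendered.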

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द, हिन्दी

च + छ = च्छ, अच्छा

स + त = स्त, नमस्ते

ल + ल = ल्ल, बिल्ली

म + ब = म्ब, लम्बा

फ + त = फ्त, मुफ्त

क + य = क्य, क्यों

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.

4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट, छुट्टी

ट + ठ = ट्ठ, चिट्ठी

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or using an anusvar, which is a dot written above the horizontal line to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
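Because the two spellings of a nasal conjunct are equivalent, a retrieval system may want to map them to one form so that a query in one spelling finds documents in the other. The sketch below expands an anusvar into the homorganic nasal plus halant, following the rule described above. The names HOMORGANIC_NASAL and expand_anusvara are illustrative; the consonant classes are the five stop series listed in the tables of this section.

```python
ANUSVARA = "\u0902"  # the dot above the line (also used for the bindu)
VIRAMA = "\u094d"    # the halant

# Map each stop consonant to the nasal of its own place of articulation.
HOMORGANIC_NASAL = {}
for consonants, nasal in [
    ("\u0915\u0916\u0917\u0918", "\u0919"),  # velar: क ख ग घ -> ङ
    ("\u091a\u091b\u091c\u091d", "\u091e"),  # palatal: च छ ज झ -> ञ
    ("\u091f\u0920\u0921\u0922", "\u0923"),  # retroflex: ट ठ ड ढ -> ण
    ("\u0924\u0925\u0926\u0927", "\u0928"),  # dental: त थ द ध -> न
    ("\u092a\u092b\u092c\u092d", "\u092e"),  # labial: प फ ब भ -> म
]:
    for consonant in consonants:
        HOMORGANIC_NASAL[consonant] = nasal


def expand_anusvara(word: str) -> str:
    """Rewrite anusvar + stop consonant as nasal + halant + stop consonant."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and following in HOMORGANIC_NASAL:
            out.append(HOMORGANIC_NASAL[following] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)


print(expand_anusvara("\u0939\u093f\u0902\u0926\u0940"))  # हिंदी
print(expand_anusvara("\u0905\u0902\u0921\u093e"))        # अंडा
```

Unicode encodes the anusvar and the bindu with the same code point, so the sketch tells them apart only by context: a dot that is not followed by a stop consonant, as in हैं, is left unchanged.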

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म, शर्मा

र + ट + ई = र्टी, पार्टी

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र, शुक्रिया

म + र = म्र, उम्र

Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:

ट + र = ट्र, राष्ट्र

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष

त + त = त्त

त + र = त्र

ज + ञ = ज्ञ

द + द = द्द

द + ध = द्ध

द + य = द्य

द + व = द्व

श + र = श्र

ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct.

There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam (।), a vertical line which terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam. [59][60]

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters of the English language (it can store at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a unique number, called a code point, in the range U+0000 to U+10FFFF; this range is large enough to give a unique number to each character in all known languages.

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms that specify how a particular character is represented in bits. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format.
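The difference between the three encoding forms can be seen by encoding the same characters with each of them and counting the resulting bytes. The sketch below uses Python's built-in codecs; the function name encoded_sizes is illustrative, and the little-endian codec variants are chosen only so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Return the number of bytes a character occupies in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, the Devanagari letter क, and a character outside
# the range that fits into a single 16-bit code unit.
for ch in ("A", "\u0915", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For Devanagari text, UTF-8 uses three bytes per letter and UTF-16 uses two, which is one reason storage size and index size differ between systems that process the same Hindi corpus.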

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme of 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current standard limits them to 4); the most commonly used characters, including those of Devanagari, are at most three bytes long. Non-ASCII characters are encoded as follows:

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are continuation bytes, which carry the remaining bits of the character's code point. [61]
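The byte layout described above can be made concrete with a small hand-written encoder. This is a sketch for illustration only: it assumes a valid code point as input and does not reject surrogates or out-of-range values as a production encoder must, and the function names are hypothetical. In practice Python's built-in str.encode should be used.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total sequence length announced by a lead byte."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: a plain ASCII byte
    if 0xC0 <= lead < 0xE0:
        return 2  # 110xxxxx
    if 0xE0 <= lead < 0xF0:
        return 3  # 1110xxxx
    if 0xF0 <= lead < 0xF8:
        return 4  # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")


def utf8_encode(code_point: int) -> bytes:
    """Encode one code point; continuation bytes have the form 10xxxxxx."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x0915)  # the Devanagari letter क
print(encoded.hex(" "), utf8_sequence_length(encoded[0]))
```

Every byte produced for a non-ASCII character has its most significant bit set, which is the property that keeps such bytes from being mistaken for ASCII.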

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred: in Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
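The parallel code layout means that a rough script-to-script mapping can be obtained by adding a fixed offset to each code point. The sketch below illustrates this under stated assumptions: the block start values are those of the Unicode code charts, the function name shift_script is hypothetical, and the result is only a mechanical mapping, since some positions are unassigned in some scripts (Tamil, for example, has no aspirated consonants) and a real transliterator needs script-specific exceptions.

```python
# First code point of each 128-position Indic block that shares the layout.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# The danda and double danda are shared by all these scripts; leave them alone.
SHARED = {0x0964, 0x0965}


def shift_script(text: str, source: str, target: str) -> str:
    """Map each character of the source block to the same position in the target block."""
    start = BLOCK_START[source]
    offset = BLOCK_START[target] - start
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + BLOCK_SIZE and cp not in SHARED:
            out.append(chr(cp + offset))
        else:
            out.append(ch)
    return "".join(out)


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी
print(shift_script(hindi, "devanagari", "gujarati"))
print(shift_script(hindi, "devanagari", "bengali"))
```

For cross-lingual retrieval this kind of mapping can serve as a cheap first approximation when matching names across Indian-language collections, though its output should be validated against the assigned characters of the target script.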

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.

The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order. [62]
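Because characters are stored in logical order, orthographic syllables can be recovered from a Unicode string with a simple pattern. The sketch below is a simplified approximation of the (((C)C)C)V structure, not the full syllable grammar of the Unicode Standard: it ignores joiners, Vedic signs and digits, and the name orthographic_syllables is illustrative.

```python
import re

CONSONANT = "[\u0915-\u0939\u0958-\u095f]"  # क to ह, plus the nuqta letters
NUQTA = "\u093c"
VIRAMA = "\u094d"
VOWEL_SIGN = "[\u093e-\u094c]"              # dependent vowel signs (maatraas)
MODIFIER = "[\u0901-\u0903]"                # chandrabindu, anusvar, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

SYLLABLE = re.compile(
    # zero or more "consonant + virama" prefixes, i.e. the conjunct part
    f"(?:{CONSONANT}{NUQTA}?{VIRAMA})*"
    # the final consonant with an optional vowel sign or trailing virama
    f"{CONSONANT}{NUQTA}?(?:{VOWEL_SIGN}|{VIRAMA})?{MODIFIER}?"
    # or an independent vowel standing alone
    f"|{INDEPENDENT_VOWEL}{MODIFIER}?"
)


def orthographic_syllables(text: str) -> list[str]:
    """Split Devanagari text into approximate orthographic syllables."""
    return SYLLABLE.findall(text)


# कि is stored as क then ि, although the sign is drawn to the left of क.
print([f"U+{ord(ch):04X}" for ch in "\u0915\u093f"])

# हिन्दी splits into हि and न्दी.
print(orthographic_syllables("\u0939\u093f\u0928\u094d\u0926\u0940"))
```

Syllable units of this kind are a common basis for tokenization and n-gram indexing of Hindi text, since splitting at arbitrary code points can separate a vowel sign or halant from its consonant.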

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgradations of Microsoft's operating systems. Also, providing Input Method Editor facilities give users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system, by contrast, can represent about 65,000 characters in a single 16-bit code unit, and more than a million code points in total. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. The Hindi version of Microsoft Office 2007 is now also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support for 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office, as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language:

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by JuxtConsult India, a research company.

"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web but, as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of JuxtConsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, an effort that is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha', a repository of Hindi words with dialect mapping of Hindi. The Department of Information Technology of the Government of India has started the Indian Language Corpora Initiative (ILCI) for developing Indian language corpora. ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from the 1980s and involved prominent institutions such as IIT Kanpur, University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology for translation from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent both English and Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English-to-Hindi translation, but currently it is also available for translation from English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (the Rajya Sabha being the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the system-generated translation, the user can supply new words, phrases and sentences to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti is a rule-based system that also uses a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre-processing, Translation Engine and Post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian-language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, and assisted by Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian-language content and tries to return the appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasha is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasha is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development for corpus creation and collection and for content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian-language software. It provides Indian-language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of the need for Indian language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing it. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian-language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also making efforts for the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As a result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.
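
The idea of representing Indian-language letters in ASCII, as ITRANS does, can be sketched in a few lines of Python. This is an illustrative sketch only: it covers a small lowercase subset written in an ITRANS-like style (for example `aa` for आ and `ii` for ई), it is not the ITRANS software, and the names `VOWELS`, `CONSONANTS` and `transliterate` are hypothetical.

```python
# Roman token -> (independent vowel, dependent vowel sign). Illustrative subset only.
VOWELS = {
    "a": ("अ", ""), "aa": ("आ", "ा"), "i": ("इ", "ि"), "ii": ("ई", "ी"),
    "u": ("उ", "ु"), "uu": ("ऊ", "ू"), "e": ("ए", "े"), "ai": ("ऐ", "ै"),
    "o": ("ओ", "ो"), "au": ("औ", "ौ"),
}
# Roman token -> consonant letter. Illustrative subset only.
CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "gh": "घ", "ch": "च", "j": "ज",
    "t": "त", "th": "थ", "d": "द", "dh": "ध", "n": "न", "p": "प",
    "b": "ब", "bh": "भ", "m": "म", "y": "य", "r": "र", "l": "ल",
    "v": "व", "s": "स", "sh": "श", "h": "ह",
}
VIRAMA = "\u094d"  # suppresses the inherent 'a' of a consonant


def transliterate(word: str) -> str:
    """Convert a lowercase ASCII word to Devanagari, longest token first."""
    out = []
    i = 0
    pending = False  # True while the last consonant still awaits its vowel
    while i < len(word):
        for size in (2, 1):
            tok = word[i:i + size]
            if len(tok) != size:
                continue
            if tok in VOWELS:
                independent, sign = VOWELS[tok]
                out.append(sign if pending else independent)
                pending = False
                break
            if tok in CONSONANTS:
                if pending:  # consonant cluster: close the previous consonant
                    out.append(VIRAMA)
                out.append(CONSONANTS[tok])
                pending = True
                break
        else:
            raise ValueError(f"unsupported input at position {i}: {word[i:]!r}")
        i += size
    if pending:  # a final consonant with no vowel keeps an explicit virama
        out.append(VIRAMA)
    return "".join(out)


print(transliterate("yojanaa"))  # योजना
print(transliterate("hindii"))   # हिन्दी
```

A scheme of this kind is deterministic, which is why it suits typing and data interchange. Input tools that suggest words from loosely spelled Roman text need additional ranking logic that is outside this sketch.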

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for it. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating very fast. In the last year alone, more than 10-15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario. Those that fall in the localized Indian search engine category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by JuxtConsult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, using popular present-day techniques, shows a severe problem of low recall when information is accessed with Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return '0' search results for Indian-language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages, owing to multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is shown to return '0' documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. But just saying that we have a recall problem may not be sufficient. The next obvious question is 'how much of a problem is it?' In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                      Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                     0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम          0        0       0

Table 4.16 Improved recall

Hindi Query                                      Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                          8820     92      12
वलडम टर ड सटय आतॊकीअटक                           331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                 708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                   37100    7400    93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar. However, grammatical and structural standards are rarely followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the national language of India, is influenced by the regional languages, which results in dialectal changes not only in speaking but in writing also. Every language uses markers that are attached to a root word to construct new words (English uses -s, -es and -ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, 'planning' in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
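
Word stemming as described above can be illustrated with a minimal suffix-stripping sketch in Python. The suffix rules below are a small hypothetical list chosen to cover examples such as Yojnaaon and Yojnaayein; they are not the rules of any particular search engine, and a practical Hindi stemmer would need a much larger rule set and exception handling.

```python
# Illustrative (suffix, replacement) rules; hypothetical and far from complete.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ियां", "ी"),
    ("ओं", ""),     # योजनाओं -> योजना
    ("एं", ""),     # योजनाएं -> योजना
    ("एँ", ""),
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),
]
MIN_STEM_LENGTH = 2  # avoid stripping very short words down to nothing


def stem(word: str) -> str:
    """Return a canonical root form by applying the longest matching suffix rule."""
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


def conflate(words):
    """Group morphological variants under their shared root form."""
    groups = {}
    for word in words:
        groups.setdefault(stem(word), []).append(word)
    return groups


print(conflate(["योजना", "योजनाओं", "योजनाएं"]))
# {'योजना': ['योजना', 'योजनाओं', 'योजनाएं']}
```

If both the index and the query are passed through the same `stem` function, a query in any of the three forms matches documents containing any of the others.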

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabet. Together, these factors produce spelling variations of the same word. For example, the Hindi word अ गर ज (angrējī, meaning 'English') has several possible spelling variations. There are numerous words that are phonetically equivalent but vary in writing. The word 'school' can be written in Hindi in different ways (सक र सक र सक र). When the standard keyword for school (सक र) and a non-standard phonetic equivalent (सक र) are searched, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety in words. For example, the English word 'school' (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may search with any keyword, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved, with little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
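
One way to let a search engine treat phonetically equivalent spellings alike is to map each word to a normalized matching key before indexing and querying. The Python sketch below is a minimal illustration under stated assumptions: the three equivalences it applies (dropping the nukta, treating chandrabindu as anusvara, and writing a nasal consonant plus virama before another consonant as anusvara) are common sources of variation chosen here for illustration, not rules taken from this study, and the key is meant only for matching, not for display. It does not handle variants such as an added initial vowel (ISKOOL for school).

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
# A nasal consonant + virama directly before another consonant, e.g. न् in हिन्दी.
NASAL_CLUSTER = re.compile("[ङञणनम]" + VIRAMA + "(?=[\u0915-\u0939])")


def spelling_key(word: str) -> str:
    """Return a crude matching key that merges some common spelling variants."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base letter + nukta.
    decomposed = unicodedata.normalize("NFD", word)
    without_nukta = decomposed.replace(NUKTA, "")
    unified_nasal = without_nukta.replace(CHANDRABINDU, ANUSVARA)
    return NASAL_CLUSTER.sub(ANUSVARA, unified_nasal)


def group_variants(words):
    """Group words that share the same matching key."""
    groups = {}
    for word in words:
        groups.setdefault(spelling_key(word), []).append(word)
    return groups


print(spelling_key("हिन्दी") == spelling_key("हिंदी"))        # True
print(spelling_key("अँग्रेज़ी") == spelling_key("अंग्रेजी"))  # True
```

The rule set is deliberately small. Each added equivalence raises recall but can also merge words that are genuinely different, so any real list would need to be checked against a dictionary.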

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word also often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ('ornament' in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, 'to sleep'. In the context of the above query the word सोना means gold, but the query can also be interpreted as 'women like to sleep'.

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query आम means 'common'; however, in Hindi it also means 'mango', so the query can also be interpreted as 'mango is people's choice'.

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task, because one word has more than one meaning and its meaning depends on the context of the sentence. For example, कर means 'tax' in one context, with synonyms बम ज श लक स द भहस ऱ ट कस; in another context कर means 'hand' or 'arm', with synonyms हसत फ ह आच शफय; and in yet another context कर means 'to do' (करना).
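
The dependence of word sense on context can be illustrated with a simplified overlap-based disambiguation sketch in Python, in the spirit of the Lesk approach. The sense inventory and cue words in `SENSE_CUES` are hypothetical, hand-picked examples rather than data from this study; a real system would draw them from a lexical resource such as a wordnet.

```python
# Hypothetical sense inventory: ambiguous word -> sense label -> context cue words.
SENSE_CUES = {
    "कर": {
        "tax": {"आय", "सरकार", "शुल्क", "विभाग"},
        "hand": {"हाथ", "बाहु"},
        "to do": {"काम", "कार्य"},
    },
    "सोना": {
        "gold": {"चांदी", "आभूषण", "गहना"},
        "to sleep": {"रात", "नींद", "बिस्तर"},
    },
}


def disambiguate(word, query_tokens):
    """Pick the sense whose cue words overlap most with the rest of the query.

    Returns None when no sense, or more than one sense, has the best overlap,
    i.e. when the query itself does not resolve the ambiguity.
    """
    senses = SENSE_CUES.get(word, {})
    context = set(query_tokens) - {word}
    scores = {sense: len(cues & context) for sense, cues in senses.items()}
    best = max(scores.values(), default=0)
    if best == 0:
        return None
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


print(disambiguate("कर", "आय कर विभाग".split()))            # tax
print(disambiguate("सोना", "नारी को सोना पसंद है".split()))  # None
```

The second call returns `None` because nothing else in the query points to either sense, which is exactly the situation described for the query 'Women like gold'.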

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways. It has affected the pronunciation of Hindi words, and many English words have been localized in India. Some of these words appear as if they were native Hindi words, and Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without any awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speaking but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists queries randomly selected from that set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No.   Query in Hindi                    Meaning in English
1        ब यतवष मवन                        Indian rain forest
1.1      ब यत मवष मवनो                     Indian rain forests
2        हव ईद घमटन क क यण                 Reason for air crash
2.1      हव ईद घमटन ओ क क यण               Reason for air crashes
3        ब यतभफ रीज न व रीब ष              Language spoken in India
3.1      ब यतभफ रीज न व रीब ष ए            Languages spoken in India
3.2      ब यतभफ रीज न व रीब ष ओ            Languages spoken in India
4        षवर पतह न ऩयझ र                   Lake on the verge of extinction
4.1      षवर पतह न ऩयझ र                   Lakes on the verge of extinction
5        पर क ततकआऩद                       Natural calamity
5.1      पर क ततकआऩद ए                     Natural calamities
6        ऩ कीपरज तत                        Bird species
6.1      ऩकषमोकीपरज तत                     Bird's species
6.2      ऩकषमोकीपरज ततम                    Bird's species
7        क षषसभसम                          Agriculture problem
7.1      क षषसभसम ओ                        Agriculture problems
8        कीटन शकक इसत भ र                  Use of pesticide
8.1      कीटन शकोक इसत भ र                 Use of pesticides
9        भ नमसकय ग                         Mental illness
9.1      भ नमसकय चगमो                      Mental patients
9.2      भ नमसकय ग                         Mental patient
10       गर भ णषवक सम जन                   Policy for village
10.1     गर भ णषवक सम जन ओ                 Policies for village
10.2     गर भ णषवक सम जन ए                 Policies for village
11       परभ खक षषक दर                     Major agricultural office
11.1     परभ खक षषक नदरो                   Major agricultural offices


Table 4.18 Effect of morphological factors on Hindi queries

For each root word, the first two lines give the root word and the listing of keywords shown by Google, Bing and Guruji (in source order); the rows below give the documents returned for each morphological variant.

No.    Morphological variant     Google    Bing     Guruji

1      Root words: ब यत वष मवन
       Listing of keywords: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1    ब यत वष मवन               50500     4680     485
1.2    ब यत मवष मवनो             40400     680      61

2      Root word: द घमटन
       Listing of keywords: द घमटन द घमटन ओ द घमटन द घमटन
2.1    द घमटन                    133000    2410     284
2.2    द घमटन ओ                  117000    420      23

3      Root word: ब ष
       Listing of keywords: ब ष ब ष ओ ब ष ब ष
3.1    ब ष                       161000    8330     961
3.2    ब ष ए                     6200      935      188
3.3    ब ष ओ                     6090      441      356

4      Root word: झ र
       Listing of keywords: झ रझ रो झ र झ र
4.1    झ र                       4740      278      25
4.2    झ र                       1270      28       1

5      Root word: आऩद
       Listing of keywords: आऩद आऩद ओ आऩद आऩद
5.1    आऩद                       102000    4030     410
5.2    आऩद ए                     1160      64       20

6      Root words: ऩ परज तत
       Listing of keywords: ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1    ऩ परज तत                  48200     1670     98
6.2    ऩकषमोपरज तत               47600     1150     84
6.3    ऩकषमोपरज ततम              33800     747      25

7      Root word: सभसम
       Listing of keywords: सभसम ए सभसम ओ सभसम सभसम सभसम
7.1    सभसम                      584000    30200    1889
7.2    सभसम ओ                    584000    7150     1356

8      Root word: कीटन शक
       Listing of keywords: कीटन शक कीटन शको कीटन शक कीटन शक
8.1    कीटन शक                   36300     1360     333
8.2    कीटन शको                  35800     800      270

9      Root word: य ग
       Listing of keywords: य गोय ग य ग य ग
9.1    य ग                       205000    21600    1423
9.2    य चगमो                    128000    3280     239
9.3    य ग                       112000    6280     647

10     Root word: म जन
       Listing of keywords: म जन ओ म जन म जन म जन
10.1   म जन                      673000    18500    3343
10.2   म जन ओ                    669000    6020     990
10.3   म जन ए                    673000    2860     416

11     Root word: क दर
       Listing of keywords: क दरीमक दर क दर क दर
11.1   क दर                      261000    11300    655
11.2   क नदरो                    29500     1850     105
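
The size of the drop between a root-form query and its morphological variants can be made explicit by dividing each variant's document count by the count for the root form. The Python sketch below does this for two groups of rows copied from Table 4.18 (rows 3.1 to 3.3 and rows 5.1 to 5.2). The English labels follow the glosses in Table 4.17, and the names `ROWS` and `variant_share` are hypothetical.

```python
# Document counts (Google, Bing, Guruji) copied from Table 4.18.
ROWS = {
    "language": {                          # rows 3.1 to 3.3
        "root": (161000, 8330, 961),
        "variants": [(6200, 935, 188), (6090, 441, 356)],
    },
    "natural calamity": {                  # rows 5.1 and 5.2
        "root": (102000, 4030, 410),
        "variants": [(1160, 64, 20)],
    },
}
ENGINES = ("Google", "Bing", "Guruji")


def variant_share(root_counts, variant_counts):
    """Fraction of the root-form result count that a variant query returns."""
    return tuple(v / r for v, r in zip(variant_counts, root_counts))


for label, group in ROWS.items():
    for counts in group["variants"]:
        shares = variant_share(group["root"], counts)
        report = ", ".join(f"{e}: {s:.1%}" for e, s in zip(ENGINES, shares))
        print(f"{label}: {report}")
```

A share well below 100% means that a user who types the inflected form sees only a small part of what the root form would return on that engine.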


Table 4.19 Precision values of the three search engines

S. No.   Query                             Precision@10
                                           Google   Bing   Guruji
1.1      ब यतवष मवन                        0.5      0.3    0.1
1.2      ब यत मवष मवनो                     0.3      0.1    0.1
2.1      हव ईद घमटन क क यण                 0.5      0.3    0.1
2.2      हव ईद घमटन ओ क क यण               0.3      0.3    0.1
3.1      ब यतभफ रीज न व रीब ष              0.9      0.5    0.5
3.2      ब यतभफ रीज न व रीब ष ए            0.5      0.4    0.3
3.3      ब यतभफ रीज न व रीब ष ओ            0.5      0.2    0.2
4.1      षवर पतह न ऩयझ र                   0.5      0.3    0.2
4.2      षवर पतह न ऩयझ र                   0.5      0.3    0
5.1      पर क ततकआऩद                       1        0.4    0.3
5.2      पर क ततकआऩद ए                     0.6      0.6    0.1
6.1      ऩ कीपरज तत                        1        0.7    0.2
6.2      ऩकषमोकीपरज तत                     0.9      0.6    0.1
6.3      ऩकषमोकीपरज ततम                    0.9      0.6    0.1
7.1      क षषसभसम                          0.7      0.3    0.2
7.2      क षषसभसम ओ                        0.7      0.4    0.2
8.1      कीटन शकक इसत भ र                  1        0.8    0.2
8.2      कीटन शकोक इसत भ र                 0.9      0.6    0.2
9.1      भ नमसकय ग                         0.7      0.6    0.4
9.2      भ नमसकय चगमो                      0.9      0.6    0.3
9.3      भ नमसकय ग                         0.9      0.5    0.4
10.1     गर भ णषवक सम जन                   0.9      0.7    0.3
10.2     गर भ णषवक सम जन ओ                 1        0.6    0
10.3     गर भ णषवक सम जन ए                 0.8      0.6    0.2
11.1     परभ खक षषक दर                     0.5      0.4    0
11.2     परभ खक षषक नदरो                   0.4      0.2    0.2

Figure 4.1 Precision values of the three search engines
[Chart: precision (0 to 1) on the vertical axis against query numbers 1.1 to 11.1 on the horizontal axis; series: P Google, P Bing, P Guruji.]


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents by root word because, in general, we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the tables above.

From the above results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 show the comparison of the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevancy of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
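The precision value used in these comparisons can be stated concretely. The following is a minimal sketch, assuming each of the top 10 results has been judged relevant or not relevant by hand; the function name `precision_at_k` and the list-of-booleans input are illustrative assumptions and not part of the tooling used in the study.

```python
def precision_at_k(relevance_flags, k=10):
    """Return the fraction of the top-k results that were judged relevant.

    relevance_flags: booleans in ranked order (a hypothetical input format;
    the text only states that the top 10 results were examined).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevance_flags)[:k]
    # If fewer than k results were returned, the missing ones count as non-relevant.
    return sum(1 for flag in top_k if flag) / k


# Example: 9 of the first 10 results were judged relevant.
judgements = [True] * 9 + [False]
print(precision_at_k(judgements))
```

Dividing by k rather than by the number of results actually returned keeps the values comparable between engines that return fewer than 10 documents.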

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may write a keyword in any of its spellings, and search engines should support all phonetically equivalent forms.


The following queries were randomly selected from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision offered by Google.

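Table 4.20 Results of the search engine on phonetic nature of Hindi language

For each Hindi query (standard keywords in bold in the original), the phonetic variations of the keywords are listed, followed by the query variants submitted to Google with the number of results and the precision over the top 10 results.

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो; जहयीर जहरयर
Query variant | No. of results | Precision@10
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
Query variant | No. of results | Precision@10
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
Query variant | No. of results | Precision@10
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
Query variant | No. of results | Precision@10
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
Query variant | No. of results | Precision@10
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5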


Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

In the above table and figure it can be clearly seen that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved with little repetition. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
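One way an interface could cover such spelling variation is to generate variants of a keyword before querying. The following is a minimal sketch covering only two kinds of variation, bindu versus chandrabindu and the presence or absence of the nuqta; the function name `spelling_variants` and the choice of these two rules are assumptions made for illustration, and real phonetic variation is considerably wider.

```python
import unicodedata

ANUSVARA = "\u0902"      # bindu
CHANDRABINDU = "\u0901"  # chandrabindu
NUKTA = "\u093c"         # subscript dot


def spelling_variants(word):
    """Return a sorted list of simple spelling variants of a Devanagari word."""
    # NFD splits precomposed nuqta letters (e.g. U+095B) into base letter + nuqta.
    base = unicodedata.normalize("NFD", word)
    variants = {base}
    # Swap the two nasalization marks in both directions.
    variants.add(base.replace(CHANDRABINDU, ANUSVARA))
    variants.add(base.replace(ANUSVARA, CHANDRABINDU))
    # Add nuqta-free forms of everything collected so far.
    variants.update(v.replace(NUKTA, "") for v in list(variants))
    return sorted(variants)


# Example: a word written with chandrabindu also yields its bindu spelling.
for variant in spelling_variants("\u0906\u0901\u0916"):
    print(variant)
```

Each variant could then be submitted as a separate query and the result lists merged.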

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, show the comparison of precision values across the three search engines.


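Table 4.21 Effect of word synonyms on Hindi IR

Each group gives the query with its standard Hindi word, followed by the query rewritten with each synonym. "Docs" is the number of documents returned and "P@10" the precision over the top 10 results.

S.No. | Query | Google docs | Google P@10 | Bing docs | Bing P@10 | Guruji docs | Guruji P@10

1. Query: सोने के आभूषण (standard Hindi word: आभूषण)
1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2. Query: काले बादल (standard Hindi word: बादल)
2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3. Query: स्त्री सशक्तिकरण (standard Hindi word: स्त्री)
3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4. Query: सिकंदर का अहंकार (standard Hindi word: अहंकार)
4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5. Query: वृक्ष लगाओ (standard Hindi word: वृक्ष)
5.1 | वृक्ष लगाओ | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5
5.2 | पेड़ लगाओ | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9
5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6. Query: आँख दान (standard Hindi word: आँख)
6.1 | आँख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | नेत्र दान | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9
6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1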


Figure 4.3 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better-quality data than Guruji. In the average case Google still stands first, which means that its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Therefore, synonyms of keywords play an important role in Hindi information retrieval.
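The synonym substitution described above can be sketched as a small query-expansion step. This is a minimal illustration only: the dictionary `SYNONYMS` holds the ornament example with assumed spellings, and the function name `expand_with_synonyms` is hypothetical rather than taken from the software described in the next chapter.

```python
# Hypothetical synonym dictionary; a real system would load a Hindi thesaurus.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    expanded = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            # Replace only the word at position i, keeping the rest of the query.
            expanded.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return expanded


for variant in expand_with_synonyms("सोने के आभूषण"):
    print(variant)
```

Replacing one keyword at a time keeps the number of generated queries linear in the number of synonyms, which matters when every variant is sent to a search engine.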

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.


Table 4.22 List of randomly selected ambiguous queries

S.No. | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in the table above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables the last column, labeled "Other context", holds the number of results which are not relevant to the query supplied, or which contain the keyword in some other, non-required context. From the results it is clear that all the search engines return documents in different contexts. Therefore it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers; in the other context, snake charmer's snake head), the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.


4.13 Discussion: Influence of English on Hindi Information Retrieval

The English language influences Hindi not only in speech but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: the English word "exercise" is written in Hindi as एकसयस इज. The word has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following the kind of phonetic variation shown in this example, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Hindi forms as extracted (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8


Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents not only for the single-keyword query but also for all phonetic variants of the keyword, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be यूनिवर्सिटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

Table 4.26 (continued): Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000


The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged.

Example: the English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फारेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

Level 1: फारेन निवेश भारत में (foreign nivesh Bharat mein)
Level 2: फारेन इन्वेस्टमेंट भारत में (foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
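The two-level transformation can be sketched as follows. This is a minimal illustration under stated assumptions: the mapping `LOANWORDS` and its spellings are hypothetical, the function name `transform_levels` is invented for this sketch, and the actual software described in the next chapter may work differently.

```python
# Hypothetical mapping from Hindi keywords to English words written in Devanagari.
LOANWORDS = {
    "विदेशी": "फारेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query, loanwords=LOANWORDS):
    """Return (level1, level2) variants of a Hindi query.

    Level 1 replaces only the first replaceable keyword; Level 2 replaces
    every replaceable keyword. A level that cannot be formed is None.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in loanwords]
    level1 = level2 = None
    if replaceable:
        first = replaceable[0]
        level1 = " ".join(
            loanwords[word] if i == first else word
            for i, word in enumerate(words)
        )
    if len(replaceable) > 1:
        level2 = " ".join(loanwords.get(word, word) for word in words)
    return level1, level2


level1, level2 = transform_levels("विदेशी निवेश भारत में")
print(level1)
print(level2)
```

Words without an entry in the mapping are left untouched, so the meaning of the query is preserved as long as the mapping itself is correct.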


Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520, 8360, 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800, 37200, 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000, 99100, 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000, 51600, 93
Foreign investment in India | विदेशी निवेश भारत में | फारेन निवेश भारत में | फारेन इन्वेस्टमेंट भारत में | 658000, 527, 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000, 19700, 47000

Figure 4.4 Transformed queries into two equivalent senses (charts of Google result counts for Hindi Query 1 to Hindi Query 6 and their Level 1 and Level 2 variants)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

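Table 4.28 Analysis of precision values (tabular representation)

Precision over the top 10 results is listed for the Hindi query (HQ) and for its Level 1 (L1) and Level 2 (L2) influenced variants.

Hindi query | Level 1 | Level 2 | Google (HQ, L1, L2) | Bing (HQ, L1, L2) | AltaVista (HQ, L1, L2)
स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9, 0.8, 1.0 | 0.9, 0.7, 1.0 | 0.9, 0.7, 1.0
सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1.0, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
विदेशी निवेश भारत में | फारेन निवेश भारत में | फारेन इन्वेस्टमेंट भारत में | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1.0, 1.0, 1.0 | 1.0, 1.0, 1.0 | 1.0, 1.0, 1.0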


Figure 4.5 Analysis of inter-precision values (Inter Precision Chart; HQ = Hindi Query, L = Level; series: Google, Bing, AltaVista)

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant. The same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, because the Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations arising from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

The relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


4.2.2 Vowels

Hindi has 11 vowels. Ten of them are written in two distinct forms: the independent form and the dependent (maatraa) form. The independent form is used when the vowel letter appears alone, at the beginning of a word, or immediately following another vowel letter. The dependent form is used when the vowel follows a consonant.

Vowels in Independent Form

अ आ इ ई उ ऊ ऋ ए ऐ ओ औ

The following table lists each vowel in its independent form with a description of its sound. The best way to learn the pronunciation is to learn from a native speaker.

Vowel    Description
अ    as in but, again
आ    as in father, far
इ    as in fit, hit
ई    as in feet, heat
उ    as in put, pull
ऊ    as in pool, shoot
ऋ    as in rip, rib
ए    as in ate, day
ऐ    as in man, bat
ओ    as in go, boat
औ    as in saw, taught

Table 4.1 Vowels in independent form and their descriptions


4.2.3 Vowels in Dependent (maatraa) Form

When a vowel follows a consonant, it is written in its respective maatraa form, which is appended to the consonant. Maatraa forms never appear at the beginning of a word or after another vowel. The first vowel, अ, has no particular maatraa form. Instead, it is the default vowel: it is assumed to be present unless the maatraa form of another vowel is explicitly appended to a consonant. In Sanskrit the vowel अ is pronounced at the end of a word. In Hindi, however, it is not pronounced except at the end of single-letter words. The following table lists each vowel in its independent form, its corresponding dependent form, and how it appears with the consonant क (k).

Independent    Dependent    With क
अ    (none)    क
आ    ा    का
इ    ि    कि
ई    ी    की
उ    ु    कु
ऊ    ू    कू
ऋ    ृ    कृ
ए    े    के
ऐ    ै    कै
ओ    ो    को
औ    ौ    कौ

Table 4.2 Maatraa Forms of Vowels
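In Unicode text the dependent form is a separate combining character stored after the consonant, which is relevant when indexing or comparing Hindi keywords. The following is a minimal sketch of that composition; the table `MATRAS` and the function name `attach_vowel` are illustrative names only, and the code points are the standard Devanagari vowel signs.

```python
# Independent vowel -> dependent (maatraa) sign; अ has no sign of its own.
MATRAS = {
    "\u0905": "",        # अ (inherent vowel)
    "\u0906": "\u093e",  # आ
    "\u0907": "\u093f",  # इ
    "\u0908": "\u0940",  # ई
    "\u0909": "\u0941",  # उ
    "\u090a": "\u0942",  # ऊ
    "\u090b": "\u0943",  # ऋ
    "\u090f": "\u0947",  # ए
    "\u0910": "\u0948",  # ऐ
    "\u0913": "\u094b",  # ओ
    "\u0914": "\u094c",  # औ
}


def attach_vowel(consonant, vowel):
    """Return the consonant followed by the maatraa form of the given vowel."""
    return consonant + MATRAS[vowel]


# Print क combined with every vowel, in the order of the table above.
print(" ".join(attach_vowel("\u0915", vowel) for vowel in MATRAS))
```

Note that the sign for इ is stored after the consonant even though it is drawn to its left.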


4.2.4 Allophones

As mentioned earlier, the distinction between the vowels इ and ई is the duration of the pronunciation of the vowel: the former is shorter and the latter longer. However, in practice the vowel इ is pronounced more like the English "i" as in the word "it", as described in the corresponding text. The same is so for the vowels उ and ऊ.

4.2.5 Final Schwa

The schwa अ is normally not pronounced at the end of a word. Thus कान is pronounced "kaan", not "kaana". An exception occurs when a word ends in a conjunct. In this case the word may be pronounced with a slight final schwa, as in मित्र, literally "mitr" but often pronounced like "mitr(a)" with a soft final schwa.

4.2.6 Monophthongs versus Diphthongs

Native English speakers should be careful not to pronounce the Hindi vowels that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a glide like the English "o" as in the word "low". Many vowel letters in English can represent diphthongs. Thus, whereas English may represent a diphthong with the letter "i" as in the word "site", in Devanagari this diphthong would be more precisely transcribed as two monophthongs, आ and ई: साईट.

4.2.7 Schwa Syncope

Sometimes the inherent vowel is not pronounced despite its implicit presence and the lack of any modifying diacritic. This phenomenon is called schwa syncope or, alternatively, schwa deletion. For instance, consider the word नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if the word were written नम्कीन ("namkeen"). There is no rule which can predict this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants. Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".

Occasionally the schwa is not totally deleted but is very slightly pronounced.

4.2.8 Schwa Pronunciation in Context

The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the English "e" as in the word "bed", but only in certain contexts, namely when अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".

4.2.9 Nasalization of Vowels

All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated either by the symbol ं or by the symbol ँ. The former symbol is called bindu ("dot") and the latter is called chandrabindu ("moon and dot"). The bindu is used when part or all of the vowel symbol extends above the horizontal line. The chandrabindu is used when no part of the vowel symbol extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.

The following examples summarize the use of the bindu and chandrabindu:

अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं

कँ काँ किं कीं कुँ कूँ कें कैं कों कौं

A special diacritic is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.
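The choice between bindu and chandrabindu follows a simple rule that can be expressed in code. The following is a minimal sketch of the traditional rule stated above; the function name `nasalize` is illustrative, and the set of symbols treated as extending above the horizontal line is an assumption based on the usual printed shapes of the letters.

```python
BINDU = "\u0902"
CHANDRABINDU = "\u0901"

# Symbols assumed to extend above the horizontal line:
# independent ई ऐ ओ औ and the maatraa signs for इ ई ए ऐ ओ औ.
ABOVE_HEADLINE = set(
    "\u0908\u0910\u0913\u0914"
    "\u093f\u0940\u0947\u0948\u094b\u094c"
)


def nasalize(syllable):
    """Append bindu or chandrabindu according to the last symbol of the syllable."""
    if not syllable:
        raise ValueError("syllable must not be empty")
    # A bare consonant carries the inherent vowel, which stays below the line.
    sign = BINDU if syllable[-1] in ABOVE_HEADLINE else CHANDRABINDU
    return syllable + sign


for syllable in ["\u0915", "\u0915\u093e", "\u0915\u0940", "\u0915\u094b"]:
    print(nasalize(syllable))
```

Because modern usage often writes the bindu everywhere, a search interface would normally treat both marks as equivalent rather than enforce this rule.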

4.2.10 Consonants: Velar Consonants

Letter    Description
क    unaspirated k
ख    aspirated k
ग    unaspirated g
घ    aspirated g
ङ    n as in sing

Table 4.3 Consonants: Velar Consonants

Note that the velar nasal consonant does not appear as the first letter of any word.

4.2.11 Palatal Consonants

Letter    Description
च    unaspirated ch, as in cheese
छ    aspirated ch
ज    unaspirated j
झ    aspirated j
ञ    n as in punch

Table 4.4 Palatal Consonants

4.2.12 Retroflex Consonants

Letter    Description
ट    like t, but retroflex and unaspirated
ठ    like t, but retroflex and aspirated
ड    like d, but retroflex and unaspirated
ढ    like d, but retroflex and aspirated
ण    like n, but retroflex

Table 4.5 Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as the "t" in the word "better" or the "d" in "bedding" in American English. The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

Letter    Description
त    like t, but dental and unaspirated
थ    like t, but dental and aspirated
द    like d, but dental and unaspirated
ध    like d, but dental and aspirated
न    like n in name, but dental

Table 4.6 Dental Consonants

4.2.14 Labial Consonants

Letter    Description
प    like p, but unaspirated
फ    like p, but aspirated
ब    like b, but unaspirated
भ    like b, but aspirated
म    m

Table 4.7 Labial Consonants

4.2.15 Semivowels

Letter    Description
य    y as in young
र    like r, but often rolled
ल    l as in lip
व    either w or v

Table 4.8 Semivowels

The Hindi "r" sound is typically a flap. However, some speakers may trill the "r" sound occasionally, or may even occasionally pronounce it closer to an unflapped approximant sound, as in the English "r" in "red".

4.2.16 Sibilants

Letter    Description
श    sh as in shave
ष    like sh, but retroflex
स    s as in save

Table 4.9 Sibilants

4.2.17 Glottal

Letter    Description
ह    like h, but voiced

Table 4.10 Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of atomic, discrete sounds that cannot produce a difference in meaning when substituted for one another, whereas substituting a sound from a different phoneme can. A phone is simply a distinct sound. For instance, in English the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from Greek, "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation sounds much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.

Capital Letters

Devanagari has no capital letters.

Special Maatraa Forms of उ and ऊ with र

र + उ = रु

र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding
symbols of their own in Devanagari. These sounds are represented by placing a
dot, the nuqta, underneath a symbol which is phonetically similar. The resulting
symbols represent sounds from other languages such as Persian, Arabic and
English.

4.2.20.1 Foreign Sounds

Letter   Approximation
क़        like k, but pronounced in the back of the mouth
ख़        velar fricative, like "Bach" in German
ग़        velar sound similar to ख़, but voiced
ज़        just as English z, as in "zoo"
झ़        similar to the s in English "vision"
फ़        just as English f

Table 4.11 Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the
non-nuqta forms, though: ज़ and फ़.
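For retrieval, the nuqta letters matter because the same letter can reach an index in more than one form: as a single precomposed code point, as a base letter followed by the nuqta sign, or with the nuqta dropped altogether by the writer. The sketch below, which uses only Python's standard `unicodedata` module, shows one possible way to make such variants comparable. The function names `normalize_nuqta` and `strip_nuqta` are illustrative and not taken from the source, and whether folding the nuqta away is desirable is an assumption that depends on the collection.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def normalize_nuqta(text: str) -> str:
    """Bring precomposed and decomposed nuqta letters to one encoding (NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_nuqta(text: str) -> str:
    """Fold nuqta letters onto their base letters. Lossy: z becomes j, f becomes ph.

    Note: this also folds the few other Devanagari letters that decompose
    to base + nuqta, which may or may not be wanted.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(NUQTA, "")


precomposed = "\u095b"     # the letter za as one code point
sequence = "\u091c\u093c"  # ja followed by the nuqta sign

# Both spellings compare equal after normalization.
print(normalize_nuqta(precomposed) == normalize_nuqta(sequence))

# zyada and jyada collapse to the same index term once the nuqta is stripped.
zyada = "\u095b\u094d\u092f\u093e\u0926\u093e"
jyada = "\u091c\u094d\u092f\u093e\u0926\u093e"
print(strip_nuqta(zyada) == jyada)
```

Normalizing keeps the z/j distinction while removing encoding differences; stripping gives higher recall at the cost of merging words that differ only by the nuqta.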

4.2.20.2 Conjuncts

Any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ. Devanagari provides two means of
suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This
method is much more common. The halant is typically used only when
typographical difficulties make it hard to use conjuncts.
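In Unicode text both devices are stored with the same halant character (named VIRAMA, U+094D); whether a ligature or a visible halant appears is decided at rendering time. The following minimal sketch illustrates this. The helper names `conjunct` and `explicit_halant` are hypothetical. Using ZERO WIDTH NON-JOINER (U+200C) after the halant to request the visible form follows the Unicode Standard's convention, and the actual appearance depends on the font and rendering engine.

```python
import unicodedata

VIRAMA = "\u094d"  # halant
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER


def conjunct(first: str, second: str) -> str:
    """Encode a two-consonant cluster; a renderer may draw it as a ligature."""
    return first + VIRAMA + second


def explicit_halant(first: str, second: str) -> str:
    """Encode the same cluster but ask for a visible halant on the first letter."""
    return first + VIRAMA + ZWNJ + second


ksha = conjunct("\u0915", "\u0937")  # ka + ssa
for ch in ksha:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two encodings differ as strings, so an indexer has to decide
# whether to treat them as the same term (for example by removing ZWNJ).
print(explicit_halant("\u0915", "\u0937").replace(ZWNJ, "") == ksha)
```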

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct
contains a vertical line. The vertical line is deleted, and then the modified
consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ + त = फ्त (मुफ्त)
क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical
line, they can still be the first letter of a horizontal conjunct. The curve on
the right side is shortened and adjoined to the following consonant.

4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical
conjuncts with the following consonant. The first consonant is written on top of
the second consonant. For example:

ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and deserve attention. If a nasal consonant
is the first member of a conjunct, it may be written either as a regular
conjunct (e.g. न + द = न्द, हिन्दी) or with an anusvar, which is a dot written
above the horizontal line, to the right side of the preceding consonant or
vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could
alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant;
in other words, it is articulated in the same location in the mouth as the
following consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant ड is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar
represents a consonant which is the first letter of a conjunct, whereas the bindu
and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be
considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not
considered a bindu, since it represents a consonant that is the first member of a
conjunct.
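Because both spellings of a nasal conjunct are in use, a query typed one way can miss documents written the other way. One possible normalization, sketched below, rewrites a nasal consonant plus halant as an anusvar whenever the next consonant belongs to the same place of articulation, which is exactly the homorganic condition described above. The table `HOMORGANIC` and the function `to_anusvar` are illustrative names, not part of the source, and the sketch deliberately ignores aspects such as nasals before semivowels or sibilants.

```python
VIRAMA = "\u094d"   # halant
ANUSVAR = "\u0902"  # DEVANAGARI SIGN ANUSVARA

# Each nasal consonant mapped to the stops articulated in the same place.
HOMORGANIC = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # velar nasal before ka kha ga gha
    "\u091e": "\u091a\u091b\u091c\u091d",  # palatal nasal before ca cha ja jha
    "\u0923": "\u091f\u0920\u0921\u0922",  # retroflex nasal before tta ttha dda ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # dental nasal before ta tha da dha
    "\u092e": "\u092a\u092b\u092c\u092d",  # labial nasal before pa pha ba bha
}


def to_anusvar(text: str) -> str:
    """Replace nasal + halant with an anusvar before a homorganic stop."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if (
            ch in HOMORGANIC
            and i + 2 < len(text)
            and text[i + 1] == VIRAMA
            and text[i + 2] in HOMORGANIC[ch]
        ):
            out.append(ANUSVAR)
            i += 2  # skip the nasal and the halant, keep the following stop
        else:
            out.append(ch)
            i += 1
    return "".join(out)


hindi_conjunct = "\u0939\u093f\u0928\u094d\u0926\u0940"  # spelled with the conjunct
hindi_anusvar = "\u0939\u093f\u0902\u0926\u0940"         # spelled with the anusvar
print(to_anusvar(hindi_conjunct) == hindi_anusvar)
```

Applying the same function to both documents and queries makes the two spellings of a word map to one index term, while words that already use the anusvar are left unchanged.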

Conjuncts with र

As the first member of a conjunct, र appears as a small hook or sickle above
and to the right of the following consonant:

र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to
the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)

Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted v, as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are
treated as a single unit, and a maatraa such as ि is placed before the entire
conjunct.

There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam (।), a vertical line which
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam [59][60].

4.3 Unicode and Fonts

Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, which is a
7-bit encoding of the characters used in English (it can represent at most 128
characters). With the passage of time, the need was felt for a single encoding
that could contain enough characters to accommodate all the languages of the
world. To enable sharing of information, this encoding would need to be a
universally accepted standard. That standard is Unicode. Unicode assigns each
character a code point in the range U+0000 to U+10FFFF, which is enough to give
a unique number to every character in all known languages.

There is another international standard, ISO 10646 of the International
Organization for Standardization (ISO), which defines the Universal Character
Set (UCS). Fortunately, the participants of both projects (ISO and Unicode)
realized around 1991 that two different unified character sets were not what
the world needed. They joined their efforts and worked together on creating a
single encoding. Both projects still exist and publish their respective
standards independently, but they have agreed to keep the encodings of the
Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value
is represented in bits when it is stored in a computer file or transmitted over
the Internet. The Unicode Standard defines three encoding forms for this
purpose. They allow the same data to be transmitted in a byte-, word- or
double-word-oriented format (i.e. in 8, 16 or 32 bits per code unit). All three
encoding forms encode the same common character repertoire and can be
efficiently transformed into one another without loss of data. The three
encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. It transforms all Unicode
characters into a variable-length encoding of bytes. It has the advantages that
the Unicode characters corresponding to the familiar ASCII set have the same
byte values as in ASCII, and that Unicode characters transformed into UTF-8 can
be used with much existing software without extensive rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact: all the
heavily used characters fit into a single 16-bit code unit, while all other
characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width,
single-code-unit access to characters is desired. Each Unicode character is
encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format.
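The practical difference between the three encoding forms can be seen by encoding the same Hindi word with each of them. The sketch below uses Python's built-in codecs; the little-endian variants are chosen only so that no byte order mark is added to the counts, and the helper name `encoded_sizes` is illustrative.

```python
# The word "Hindi" in Devanagari: six code points.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

CODECS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding form."""
    return {codec: len(text.encode(codec)) for codec in CODECS}


sizes = encoded_sizes(word)
for codec, size in sizes.items():
    print(f"{codec:10s} {size:3d} bytes")

# Every form round-trips to the same string, i.e. no data is lost.
for codec in CODECS:
    assert word.encode(codec).decode(codec) == word
```

For Devanagari text each character takes three bytes in UTF-8, two in UTF-16 and four in UTF-32, so UTF-16 is the most compact of the three for pure Hindi text, while UTF-8 is more compact for text that is mostly ASCII markup.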

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme. The
original design allowed sequences of 2 to 6 bytes, but the current definition
(RFC 3629) limits a sequence to 4 bytes; the most commonly used characters,
including all of Devanagari, need at most three. Non-ASCII characters are
encoded as follows:

Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII
characters stored in bytes have their most significant bit clear). This allows
an application to differentiate between ASCII and non-ASCII characters: bytes
representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII
character indicates how many bytes follow for this character. All further bytes
in the multibyte sequence carry the remaining bits of the character's code
point [61].
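The byte structure described above can be inspected directly. The following sketch prints the bits of each byte of an encoded character and derives the sequence length from the first byte alone. The function names are illustrative, and the length rule shown is the current four-byte form of UTF-8.

```python
def utf8_bits(ch: str) -> list:
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]


def sequence_length(first_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: plain ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: one continuation byte follows
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: two continuation bytes follow
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: three continuation bytes follow
        return 4
    raise ValueError("continuation or invalid byte, not a lead byte")


ka = "\u0915"  # DEVANAGARI LETTER KA
print(utf8_bits("A"))  # one byte, most significant bit clear
print(utf8_bits(ka))   # three bytes, most significant bit set in each
print(sequence_length(ka.encode("utf-8")[0]))
```

Because continuation bytes always start with the bits 10, a parser that lands in the middle of a Devanagari character can recognize that it is not at a character boundary and resynchronize.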

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from
left to right. They are all abugidas, in which most symbols stand for a
consonant plus an inherent vowel (usually the sound a). Word-initial vowels in
many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected
consonant. Absence of the inherent vowel, when that occurs, is frequently marked
with a special sign. In the Unicode Standard this sign is denoted by the
Sanskrit word virāma. In some languages another designation is preferred. In
Hindi, for example, the word hal refers to the character itself and halant
refers to the consonant that has its inherent vowel suppressed; in Tamil, the
word puḷḷi is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character, with
its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in
the south, and from Pakistan in the west to the easternmost islands of
Indonesia, are derived from the ancient Brahmi script. The oldest lengthy
inscriptions of India, the edicts of Ashoka from the third century BCE, were
written in two scripts, Kharoshthi and Brahmi. These are both ultimately of
Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written
from right to left, was supplanted by Brahmi and its derivatives. The
descendants of Brahmi spread with myriad changes throughout the subcontinent
and outlying islands. There are said to be some 200 different scripts deriving
from it. By the eleventh century, the modern script known as Devanagari was in
ascendancy in India proper as the major script of Sanskrit literature. This
northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan;
the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all
encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which
facilitates transliteration to some degree, is based on the Indian national
standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala
has a virama-based model but is not structurally mapped to ISCII. Tibetan stands
apart, using a subjoined consonant model for conjoined consonants, reflecting
its somewhat different structure and usage. The Limbu script makes use of an
explicit encoding of syllable-final consonants. Many of the character names in
this group of scripts represent the same sounds, and naming conventions are
similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa
and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. In the Unicode
Standard, the principles of the Indic scripts are therefore covered in some
detail in the introduction to the Devanagari script; its introductions to the
remaining Indic scripts are abbreviated but highlight any differences from
Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and
1986.

The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0 to F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic
scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada and Malayalam. This parallel code layout emphasizes the
structural similarities of the Brahmi scripts and follows the stated intention
of the Indian coding standards to enable one-to-one mappings between analogous
coding positions in different scripts in the family. Sinhala, Tibetan, Thai,
Lao, Khmer, Myanmar and other scripts depart to a greater extent from the
Devanagari structural pattern, so the Unicode Standard does not attempt to
provide any direct mappings for these scripts to the Devanagari order.
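The parallel code layout means that an analogous character in another script of the family sits at the same offset within its own block. A rough sketch of what this makes possible is shown below for Devanagari (block starting at U+0900) and Bengali (block starting at U+0980). The function `shift_block` is a hypothetical helper, and the result is only a code-position mapping: some positions are unassigned in the target block and spelling conventions differ between languages, so this is not a transliteration system.

```python
DEVANAGARI_BASE = 0x0900
BENGALI_BASE = 0x0980
BLOCK_SIZE = 0x80


def shift_block(text: str, src: int = DEVANAGARI_BASE, dst: int = BENGALI_BASE) -> str:
    """Move every character of the source block to the same offset in the target block."""
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # leave spaces, digits, Latin text etc. untouched
    return "".join(out)


nda = "\u0928\u094d\u0926"  # Devanagari na + virama + da
print([f"U+{ord(c):04X}" for c in shift_block(nda)])
```

The same idea, applied in reverse, lets a retrieval system map text in several Indic scripts onto one script before indexing, at the cost of the inaccuracies noted above.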

In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts
encoded with ISCII-1991 may be automatically converted to Unicode code points
and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built
on phonological principles and tends to correspond quite closely to
pronunciation.

The orthographic syllable is built up of alphabetic pieces, the actual letters
of the Devanagari script. These pieces consist of three distinct character
types: consonant letters, independent vowels and dependent vowel signs. In a
text sequence, these characters are stored in logical (phonetic) order [62].
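Logical order can be observed by listing the code points of a word whose written form places a vowel sign to the left of its consonant. In the sketch below, the short-i sign of the first syllable is drawn before the consonant on the page, yet it is stored after it. The helper `describe` is an illustrative name; the character names and general categories come from Python's `unicodedata` module.

```python
import unicodedata

# The word "Hindi" in Devanagari, written here as escapes so the
# stored order is unambiguous.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"


def describe(text: str) -> list:
    """Return (code point, general category, name) for each stored character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


for code_point, category, name in describe(word):
    print(code_point, category, name)

# Consonant letters have category Lo; dependent vowel signs and the
# virama are combining marks (Mc or Mn) that follow their consonant.
categories = [category for _, category, _ in describe(word)]
```

For indexing, this means that tokenizers and stemmers can work on the stored sequence without worrying about the visual reordering, which is left to the rendering engine.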

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in
their daily lives. They are sending and receiving e-mails, searching for
information, reading e-papers, blogging and launching Web sites in their own
languages. Two American IT companies, Microsoft and Google, have played a big
role in making this possible.

A decade ago there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The concept of a new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved
to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode
font Mangal in its operating system in 2001. "Since then, Hindi Unicode support
has been a part of all subsequent upgrades of Microsoft's operating systems.
Also, providing Input Method Editor facilities gives users the option to use
different types of keyboards," says Meghashyam Karanam, product manager, vision
and localization, at Microsoft India. The earlier system could incorporate only
127 characters, which is not enough for the Devanagari script used for Hindi.
The Unicode system, even with a single 16-bit code unit, can incorporate about
65,000 characters. As most computers in India use Microsoft's operating system,
this ensured that the Hindi font was available to most of them as they upgraded
the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft
Office 2007 is also available. "It includes Hindi language interface packs that
allow users to create documents and communicate with others in Hindi. Users can
also navigate using the menus and toolbars that are in Hindi. We have received a
very good response from the Hindi users," says Karanam. Urdu language support
is available in Windows Vista and Office 2007. Another Microsoft initiative is
Project Bhasha, which was launched in 2003 and now provides support for 13
Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered
with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi
portal. "Webdunia also provided support for the Hindi version of Microsoft
Office as well as for language interface packs," says Jaideep Karnik, general
manager for content and localization at webdunia.com. The Indore, Madhya
Pradesh-based company has an office in the United States and helps major
software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the
California-based company has launched various products in the past two years.
With the Google Hindi and Urdu search engines, one can search all the Hindi and
Urdu Web pages available on the Internet, including those that are not in a
Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada,
Malayalam and Telugu to name a few, Gmail in five languages, and Google
transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that
Google has added to its offerings," says Rahul Roy-Chowdhury, product manager
at Google India. To use the search function, "users can type Hindi words in
Roman script and a drop-down menu suggests several Hindi phrases. By selecting
the appropriate query, users can search for Hindi content without even typing
in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in English, hit the space bar and get the same word in a different
language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language
is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of
the popular Hindi newspapers have a Web presence now. "In the last two years,
page views of navbharattimes.com have increased significantly, and half of them
come through Google, as Net users generally search for a specific news item or
query," says Nagar. Yahoo, with headquarters in California, formed a
partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi
portal. "The Jagran relationship helps us gain significant traction among
Indian Internet users. From all the audience measures for this product, this has
been a resounding success," says Gopal Krishna, head of Yahoo's work for
emerging market audiences. Since Yahoo and Jagran started working together,
page views have "grown to about 14 million from one million a year and a half
earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering there
are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and
Urdu on the Internet? "The major challenge is Internet penetration and PC
prices. The moment we have better Internet penetration, especially in smaller
towns, and PC prices go down, Hindi and Indian languages can flourish on the
Net," says Karnik of webdunia.com. India had more than 49 million Internet
users in June 2008, of which about 9 million used the Internet regularly,
according to a study by Juxtconsult India, a research company. "There is a big
opportunity in Indian languages. Studies showed that only 28 percent of Indian
Web surfers preferred English on the Web, but as good quality content in Indian
languages was not easily available, they did not visit many local language
sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India.
In order to get the widest audience reach, one has to look at Hindi, because in
a country of over a billion people, English is spoken by less than 80 million
people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The
Web is the democratization of access to information," he says, adding that the
Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we
look at how the Internet developed in the U.S., it may provide a useful analogy.
First came content, which was mostly produced by people who had a passion for
putting up content they cared about. Traffic and monetization was not the
motive. Second came growing readership as people started discovering content.
This set off a virtuous cycle in which content eventually became a viable,
monetizable business. Third were the application developers, who could now
focus on moving the online experience beyond passive consumption of information
to interactivity, community building, service delivery and a host of other
innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a
long time. And I believe it has recently entered phase two" [63].

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in the year
1978. It was collected for a comparative study of American, British and Indian
English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal
agency for the development of Indian language corpora. It has coordinated with
various Indian agencies and universities to develop corpora of more than 45
million words in the Scheduled Languages of India, which is also a part of the
TDIL program. The Enabling Minority Language Engineering (EMILLE) program
provides a corpus architecture and tools for Asian languages. It has a
monolingual corpus which contains approximately 96,157,000 words and a parallel
corpus consisting of 200,000 words of text in English, which helps in the
translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of "One Million Pages"
of knowledge-based text. Mahatma Gandhi International University has started
the project "Hindi Samghraha" to build a repository of Hindi words and a
dialect mapping of Hindi. The Department of Information Technology of the
Government of India has started a project for developing Indian language
corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium
project for building parallel annotated corpora under the leadership of Dr.
Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also
English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date from 1980 and involved
prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST
Mumbai and CDAC Pune. In the late 1990s, many new projects were undertaken by
IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University,
Kolkata. TDIL started a consortium-mode project in April 2008 for building
computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni
(University of Hyderabad). The goal of this project is to build children's
stories using multimedia and e-learning content.

4.5.2 Anglabharati

IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R. M. K. Sinha.
It is a rule-based system with approximately 1750 rules and 54,000 lexical
words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available at http://anglahindi.iitk.ac.in. The Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with new
approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can translate
text from one Indian language into another, based on the grammar rules of
Panini's Ashtadhyayi. It is developed at the International Institute of
Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit
Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for the translation of government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra software
is available in desktop, network and web-based forms. It is based on the
Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as
well as the Hindi grammar. Initially it was domain-specific, covering Personnel
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present it
also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it
is also available for English to other Indian languages such as Gujarati,
Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha
Secretariat (the Rajya Sabha being the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based
machine translation system for English to Hindi. UNL is a United Nations
project for developing an interlingua for the world's languages. The UNL-based
machine translation system is being developed under the leadership of Prof.
Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). The project is
funded by the Karnataka Government and is applicable in the domain of
government circulars.

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 114

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translation generated by the system, the user can feed new words, phrases and sentences back to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for translating English into Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat/.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates an English sentence into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre Processing, Translation Engine and Post Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the backing of Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it goes deep into Indian language content and tries to return appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for the major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this need. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working on the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As one result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. The space is now so crowded that it is difficult to know who is really winning. However, we attempt a brief overview of the current scenario. The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website      2008 Stats    2009 Stats
Google       37            35
Yahoo        32            25
Rediff       7             4
Orkut        6             7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats    2009 Stats
Google       81            76
Yahoo        7             7
Wikipedia    3             6
English      3             4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats    2009 Stats
Google       65            34
Yahoo        12            29
Rediff       4             15
Teluguone    2             0
Guruji       1             18
Raftaar      1             0.2
Hindi        1             NA
Webdunia     1             0.7
Khoj         0.7           2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. But just saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                        Google    Yahoo    Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                         0         0        0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च               0         0        0

Table 4.16 Improved recall

Hindi Query                                        Google    Yahoo    Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                            8820      92       12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                            331       10       1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च               708       50       1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध                    37100     7400     93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are seldom followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in dialectal change not only in speaking but in writing also. Every language attaches markers to a root word to construct new words: English uses s, es and ing, and Hindi uses MAATRAAS (vowel signs such as ा). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
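The idea of reducing morphological variants to one canonical form can be sketched with a very light suffix stripper. This is only an illustration under stated assumptions: the suffix list and the minimum stem length are hypothetical choices made for the Yojnaa example above, not the stemmer used in this study, and the stems it produces need not be dictionary words.

```python
# Plural and oblique suffixes to strip, longest first. This short list is an
# illustrative assumption, not a complete inventory of Hindi suffixes.
SUFFIXES = [
    "\u093f\u092f\u094b\u0902",  # ियों
    "\u093f\u092f\u093e\u0902",  # ियां
    "\u0913\u0902",              # ओं
    "\u090f\u0902",              # एं
    "\u090f\u0901",              # एँ
    "\u094b\u0902",              # ों
    "\u0947\u0902",              # ें
]

MIN_STEM_LENGTH = 2  # hypothetical guard so very short words are left alone


def light_stem(word: str) -> str:
    """Strip the first matching suffix and return the remaining stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for variant in ("योजनाओं", "योजनाएं", "योजना"):
        print(variant, "->", light_stem(variant))
```

With such a function applied both at indexing time and at query time, योजनाओं and योजनाएं would be matched against the same root योजना.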

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.

There are numerous words that are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways. When information is searched for the standard keyword स्कूल (school) and for a non-standard, phonetically equivalent Hindi keyword, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in a phonetic variety of words; for example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any of these forms for searching, and search engines should support all of them.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with the least repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
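One way a search engine could support phonetically equivalent spellings is to map each keyword to a normalized key before indexing and matching. The sketch below is an assumption-laden illustration, not a method described in this study: it folds three common sources of variation (the nukta dot, chandrabindu versus anusvara, and a nasal consonant conjunct versus anusvara) and deliberately ignores others, such as the added initial vowel in इस्कूल. The function name `spelling_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below a consonant, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# A nasal consonant (ङ ञ ण न म) plus virama, directly before another consonant.
NASAL_CONJUNCT = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")


def spelling_key(word: str) -> str:
    """Map common spelling variants of one word to a shared key."""
    text = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # ज़ and ज become the same
    text = text.replace(CHANDRABINDU, ANUSVARA)  # ँ and ं become the same
    return NASAL_CONJUNCT.sub(ANUSVARA, text)    # न्द and ंद become the same


if __name__ == "__main__":
    print(spelling_key("आन्दोलन") == spelling_key("आंदोलन"))
    print(spelling_key("महँगाई") == spelling_key("महंगाई"))
```

The key is crude on purpose: it can conflate words that are genuinely different, so in practice it would serve to widen recall and would be followed by ranking on the original spelling.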

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous Words

Ambiguous words reduce the relevancy of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means "common"; however, in Hindi it also means "mango", so the query can also be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर (tax) has synonyms such as ब्याज, शुल्क, सूद, महसूल and टैक्स in one context; in another context कर (hand or arm) has synonyms such as हस्त and बाहु; and in yet another context कर (to do) corresponds to करना.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways; it has affected the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of which language the words belong to. Most Indians use these words in English rather than in their native language. English has influenced Hindi not only in speaking but in writing too. This becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from the set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No.   Query in Hindi                          Meaning in English
1        भारत वर्षावन                             Indian rain forest
1.1      भारतीय वर्षावनों                          Indian rain forests
2        हवाई दुर्घटना के कारण                      Reason for air crash
2.1      हवाई दुर्घटनाओं के कारण                    Reason for air crashes
3        भारत में बोली जाने वाली भाषा               Language spoken in India
3.1      भारत में बोली जाने वाली भाषाएं              Languages spoken in India
3.2      भारत में बोली जाने वाली भाषाओं              Languages spoken in India
4        विलुप्त होने पर झील                        Lake on the verge of extinction
4.1      विलुप्त होने पर झीलें                       Lakes on the verge of extinction
5        प्राकृतिक आपदा                            Natural calamity
5.1      प्राकृतिक आपदाएं                          Natural calamities
6        पक्षी की प्रजाति                           Bird species
6.1      पक्षियों की प्रजाति                         Bird's species
6.2      पक्षियों की प्रजातियां                       Bird's species
7        कृषि समस्या                              Agriculture problem
7.1      कृषि समस्याओं                            Agriculture problems
8        कीटनाशक का इस्तेमाल                       Use of pesticide
8.1      कीटनाशकों का इस्तेमाल                      Use of pesticides
9        मानसिक रोग                              Mental illness
9.1      मानसिक रोगियों                           Mental patients
9.2      मानसिक रोगी                             Mental patient
10       ग्रामीण विकास योजना                       Policy for village
10.1     ग्रामीण विकास योजनाओं                     Policies for village
10.2     ग्रामीण विकास योजनाएं                      Policies for village
11       प्रमुख कृषि केंद्र                           Major agricultural office
11.1     प्रमुख कृषि केन्द्रों                         Major agricultural offices

Table 418 Effect of morphological factors on Hindi queries

S No Root

word

s

Listing of Keywords Morphological

variants

Documents Returned

Google Bing Guruji Google Bing Guruji

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

Table 4.19 Precision values of the three search engines

S. No.   Query                                   Precision@10
                                                 Google    Bing    Guruji
1.1      भारत वर्षावन                             0.5       0.3     0.1
1.2      भारतीय वर्षावनों                          0.3       0.1     0.1
2.1      हवाई दुर्घटना के कारण                      0.5       0.3     0.1
2.2      हवाई दुर्घटनाओं के कारण                    0.3       0.3     0.1
3.1      भारत में बोली जाने वाली भाषा               0.9       0.5     0.5
3.2      भारत में बोली जाने वाली भाषाएं              0.5       0.4     0.3
3.3      भारत में बोली जाने वाली भाषाओं              0.5       0.2     0.2
4.1      विलुप्त होने पर झील                        0.5       0.3     0.2
4.2      विलुप्त होने पर झीलें                       0.5       0.3     0
5.1      प्राकृतिक आपदा                            1         0.4     0.3
5.2      प्राकृतिक आपदाएं                          0.6       0.6     0.1
6.1      पक्षी की प्रजाति                           1         0.7     0.2
6.2      पक्षियों की प्रजाति                         0.9       0.6     0.1
6.3      पक्षियों की प्रजातियां                       0.9       0.6     0.1
7.1      कृषि समस्या                              0.7       0.3     0.2
7.2      कृषि समस्याओं                            0.7       0.4     0.2
8.1      कीटनाशक का इस्तेमाल                       1         0.8     0.2
8.2      कीटनाशकों का इस्तेमाल                      0.9       0.6     0.2
9.1      मानसिक रोग                              0.7       0.6     0.4
9.2      मानसिक रोगियों                           0.9       0.6     0.3
9.3      मानसिक रोगी                             0.9       0.5     0.4
10.1     ग्रामीण विकास योजना                       0.9       0.7     0.3
10.2     ग्रामीण विकास योजनाओं                     1         0.6     0
10.3     ग्रामीण विकास योजनाएं                      0.8       0.6     0.2
11.1     प्रमुख कृषि केंद्र                           0.5       0.4     0
11.2     प्रमुख कृषि केन्द्रों                         0.4       0.2     0.2

Figure 4.1 Precision values of the three search engines (y-axis: precision, 0 to 1; x-axis: query numbers 1.1 to 11.1; series: Google, Bing, Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word because, in general, we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

From the above results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, it can be said that the relevancy of results is also higher for Google than for the other two search engines, which indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
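The precision value over the top 10 results can be computed as sketched below. The sketch assumes that each returned result has been judged relevant or not relevant by hand, and that a query returning fewer than 10 results is still divided by 10; the text does not state how such cases were handled, so that convention is an assumption.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged in top if judged) / k


if __name__ == "__main__":
    # Hypothetical judgments for one query: 5 of the first 10 results relevant.
    judgments = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judgments))
```

A value of 0.5 in Table 4.19 therefore corresponds to five relevant documents among the first ten returned.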

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any of these forms for searching, and search engines should support all of them.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 420 results of the search engine on Phonetic nature of Hindi language

Hindi Query

With Bold

Standard

Keywords

Phonetic variations of the Keywords Google Results

for query having

keywords

No of

Results

Precision

10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05

Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

In the above table and figure it can be clearly seen that search engines return a handful of documents for various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with the least repetition. From the precision chart it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR (documents returned and precision over the top 10 results, P@10)

S. No.   Query                    Google     P@10    Bing    P@10    Guruji   P@10
1        Standard Hindi word and synonyms: आभूषण, गहना
1.1      सोने के आभूषण             217000     0.8     3250    0.7     381      0.5
1.2      सोने के गहने               188000     0.8     2590    0.6     389      0.5
1.3      सोने के जेवर               78900      0.8     1670    0.7     311      0.6
1.4      सोने के अलंकार             9490       0.5     633     0.4     70       0
1.5      सोने के आभरण              493        0.5     38      0.3     1        0
2        Standard Hindi word: बादल
2.1      काले बादल                 233000     0.7     7510    0.7     733      0.3
2.2      काले मेघ                  40700      0.9     1500    0.8     99       0.6
2.3      काले जलधर                1570       0.6     54      0.6     2        0.2
3        Standard Hindi word and synonyms: स्त्री, नारी
3.1      स्त्री सशक्तिकरण            9950       0.9     1570    0.7     760      0.6
3.2      नारी सशक्तिकरण            29300      0.9     1910    0.9     736      0.4
3.3      महिला सशक्तिकरण           96300      0.9     5160    0.8     1091     0.3
3.4      औरत सशक्तिकरण            7670       0.8     680     0.7     510      0.2
4        Standard Hindi word: अहंकार
4.1      सिकंदर का अहंकार           1990       0.4     18      0.3     60       0.6
4.2      सिकंदर का अभिमान          2400       0.6     304     0.2     16       0.1
4.3      सिकंदर का घमंड             495        0.5     54      0.6     9        0.1
5        Standard Hindi word: वृक्ष
5.1      वृक्ष लगाओ                 6960       1       698     0.8     29       0.5
5.2      पेड़ लगाओ                  13400      1       1080    0.9     143      0.9
5.3      दरख्त लगाओ               481        0.5     19      0.5     0        0
6        Standard Hindi word: आंख
6.1      आंख दान                  34000      0.7     3690    0.5     312      0.3
6.2      नेत्र दान                   77500      1       3240    1       159      0.9
6.3      चक्षु दान                  2450       0.8     427     0.1     36       0.1

Figure 4.3 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. In the average case Google still stands first, which means that its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by replacing a keyword with its synonym, equivalent results can be obtained. Therefore it is evident that synonyms of keywords play an important role in Hindi information retrieval.
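Using keywords together with their synonyms can be sketched as a simple query expansion step. The synonym table below is a hypothetical stand-in that contains only the आभूषण example from the text; a real system would draw on a lexical resource such as a WordNet for Hindi. The function name `expand_query` is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical synonym table; the single entry follows the example in the text.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query plus one variant per synonym substitution."""
    terms = query.split()
    variants = [query]
    for position, term in enumerate(terms):
        for alternative in synonyms.get(term, []):
            swapped = terms[:position] + [alternative] + terms[position + 1:]
            variants.append(" ".join(swapped))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

Each variant can be submitted separately and the result lists merged, which is one way to collect the documents that Table 4.21 shows are spread across synonym forms.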

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present below five randomly selected ones. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns hold the English meaning of the query with respect to the ambiguous keyword in each context.


Table 422 List of randomly selected ambiguous queries

Ambiguous queries mentioned above in the figure are tested for results against

three search engines Google Bing and Guruji Results are shown below in tables

Table 423 Ambiguity test for Google

Table 424 Ambiguity test for Bing

SNo Query For keyword

as

In English For keyword as In English

1 न यी क स न ऩस द ह स न (Gold) Women like Gold स न (To sleep)

Women like to

sleep

2 आभ र गो की ऩस द आभ

(common)

Common manlsquos

choice आभ (Mango) Mango is

peoplelsquos choice

3 फ र षवक सऔय ऩ षण फ र (Children) Child Development

and Nutrition फ र(Hair) Hair

Development and

Nutrition

4 सऩ यो क पन पन (Art) Art of snake

charmers पन(Snake head) Snake charmerlsquos

snake head

5 म दध भ क र षवन श क र

(Aggregate)

Aggregate

destruction in wars क र(family) Destruction of

families in war

Query Ambiguous

keyword Documents

returned

Google Other

Context Results Found

Context Context

1 स न 50800 Gold 5 To sleep 2 3

2 आभ 488000 Common 3 Mango 3 4

3 फ र 2900000 Children 7 Hair 3 0

4 पन 184 Art 0 Snake head 10 0

5 क र 17800 Aggregate 2 Family 3 5

Query Ambiguous

keyword Documents

returned

Bing Other

Context Results Found

Context Context

1 स न 2680 Gold 2 To sleep 2 6

2 आभ 17800 Common 3 Mango 3 4

3 फ र 4030 Children 6 Hair 2 2

4 पन 25 Art 0 Snake head 9 1

5 क र 1900 Aggregate 0 Family 2 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 134

Search engine: Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

Table 4.25 Ambiguity test for Guruji

From the results in the above tables, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables, the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, that is, documents which contain the keyword in some other, non-required context. Since every search engine returns documents in different contexts, it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

In another query, सपेरों का फन (art of snake charmers), the keyword has a second context (the snake charmer's snake head). The retrieved documents are expected to be in the context (art), but the results show that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
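The "Other Context" counts can be turned into a simple deviation figure per engine. The sketch below uses the counts reported for query 5 in the three ambiguity tables; dividing by 10 assumes that the first 10 results of each engine were judged, and the function names are illustrative only.

```python
# Counts for query 5 (keyword with the senses "aggregate" and "family"),
# copied from the three ambiguity tables above.
TOP_K = 10  # assumption: the first 10 results of each engine were judged

results = {
    "Google": {"Aggregate": 2, "Family": 3, "Other": 5},
    "Bing":   {"Aggregate": 0, "Family": 2, "Other": 8},
    "Guruji": {"Aggregate": 0, "Family": 0, "Other": 10},
}

def deviation(counts, top_k=TOP_K):
    """Share of judged results that fall in neither listed context."""
    return counts["Other"] / top_k

def intended_share(counts, intended, top_k=TOP_K):
    """Share of judged results that match the context the user meant."""
    return counts[intended] / top_k

for engine, counts in results.items():
    print(engine, deviation(counts), intended_share(counts, "Aggregate"))
```

For this query the intended context is "Aggregate", so a high deviation together with a low intended share indicates that the engine did not resolve the ambiguity.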


4.13 Discussion: Influence of English on Hindi Information Retrieval

The English language influences Hindi not only in speech but also in writing. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: The English word "exercise" is written in Hindi as (एकसयस इज). This word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | standard keyword | equivalent 1 | equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query: documents are also returned, in large numbers, for every phonetic variant of the keyword. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetically variant English keyword written in Hindi script. In the above table, the Google transliteration column shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents have nevertheless been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be concluded that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing the Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, in both cases without changing the meaning of the original Hindi query. Example:

The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
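The two-level transformation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `transform`, the lookup table `ENGLISH_EQUIVALENTS` and the Devanagari spellings of the English words are hypothetical, and the software described in the next chapter may work differently. Unlike the single Level 1 example in the text, the sketch enumerates every possible single-word replacement.

```python
from itertools import combinations

# Hypothetical lookup: Hindi keyword -> English keyword written in Devanagari.
# The spellings of the English words are assumed for illustration.
ENGLISH_EQUIVALENTS = {
    "विदेशी": "फॉरेन",         # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",    # nivesh  -> "investment"
}

def transform(query, equivalents=ENGLISH_EQUIVALENTS):
    """Return (level1, level2) lists of query variants.

    Level 1 replaces exactly one Hindi keyword with its English equivalent;
    level 2 replaces more than one keyword. Word order is left unchanged.
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in equivalents]
    level1, level2 = [], []
    for r in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, r):
            variant = [equivalents[w] if i in positions else w
                       for i, w in enumerate(words)]
            (level1 if r == 1 else level2).append(" ".join(variant))
    return level1, level2

level1, level2 = transform("विदेशी निवेश भारत में")
print(level1)
print(level2)
```

Words that have no entry in the lookup table (here भारत and में) are passed through untouched, so the meaning of the query is preserved as long as the table maps each keyword to a true equivalent.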


Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses (chart of the Google result counts for Hindi Query 1 to 6 and their Level 1 and Level 2 variants)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values (tabular representation)

Precision at 10 is listed for the Hindi query (HQ), Level 1 (L1) and Level 2 (L2), in that order.

Hindi query | Influenced query, Level 1 | Influenced query, Level 2 | Google (HQ, L1, L2) | Bing (HQ, L1, L2) | AltaVista (HQ, L1, L2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
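The precision values in the table above are precision at 10, that is, the fraction of the first 10 returned documents judged relevant. A minimal sketch of that calculation follows; the relevance judgments in `judged` are made up for illustration, while the Bing values are copied from the fifth row of the precision table.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that are relevant.

    relevance: list of booleans in rank order (True = relevant).
    The denominator is always k, so missing results count as not relevant.
    """
    return sum(relevance[:k]) / k

# Hypothetical judgments: 9 of the first 10 results are relevant.
judged = [True] * 9 + [False]
p10 = precision_at_k(judged)

# Precision at 10 for Hindi query 5 on Bing, from the precision table.
bing_hq5 = {"HQ": 0.9, "L1": 0.8, "L2": 0.4}
drop = bing_hq5["HQ"] - bing_hq5["L2"]

print(p10, round(drop, 2))
```

Comparing the value for the original query with the values for its Level 1 and Level 2 variants shows whether a transformation kept, improved or reduced the relevance of the top results.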

Figure 4.5 Analysis of inter-precision values

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming the queries into variants that include English keywords written in Hindi. The transformation of queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns. For most Hindi queries and their transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, Google returns 9 relevant documents among the first 10, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant. The same pattern largely repeats for the rest of the Hindi queries shown in the table above, although precision drops for some Level 2 variants on Bing and AltaVista (for example from 0.9 to 0.4 for the fifth query).

Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, because the Hindi user may not be aware that such information exists on the web and may be unable to formulate the variant query based on English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

[Figure 4.5: inter-precision chart. Horizontal axis: HQ1 to HQ6 (HQ = Hindi query), each followed by its L1 and L2 (L = level) variants. Series: Google, Bing, AltaVista. The plotted values are those listed in Table 4.28.]

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, software has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
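The idea of solving the problem at the interface level can be illustrated with a small sketch: the interface sends each query variant to a search engine and merges the results. Everything named here (`search_with_variants`, `fake_search`, the query keys and the result identifiers) is hypothetical; the actual tool is described in the next chapter.

```python
from typing import Callable, Iterable, List

def search_with_variants(
    variants: Iterable[str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Send every query variant to `search` and merge the result URLs,
    keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for query in variants:
        for url in search(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Stand-in for a real search engine call (made-up data).
def fake_search(query: str) -> List[str]:
    index = {
        "q-hindi": ["u1", "u2"],
        "q-level1": ["u2", "u3"],
        "q-level2": ["u4"],
    }
    return index.get(query, [])

merged = search_with_variants(["q-hindi", "q-level1", "q-level2"], fake_search)
print(merged)
```

Because the expansion happens outside the search engine, the engine itself needs no changes; the cost is one request per variant.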


4.2.3 Vowels in Dependent (maatraa) Form

When a vowel follows a consonant, it is written in its maatraa form, which is appended to the consonant. Maatraa forms never appear at the beginning of a word or after another vowel. The first vowel, अ, has no maatraa form. Instead, it is the default (inherent) vowel: it is assumed to be present unless the maatraa form of another vowel is explicitly appended to the consonant. In Sanskrit, the vowel अ is pronounced at the end of a word. In Hindi, however, it is not pronounced, except at the end of single-letter words. The following table lists each vowel in its independent form, its corresponding dependent form, and how it appears with the consonant क (k).

Independent | Dependent | With क
अ | (none) | क
आ | ा | का
इ | ि | कि
ई | ी | की
उ | ु | कु
ऊ | ू | कू
ऋ | ृ | कृ
ए | े | के
ऐ | ै | कै
ओ | ो | को
औ | ौ | कौ

Table 4.2 Maatraa Forms of Vowels
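In Unicode text, a consonant with a maatraa is stored as the consonant code point followed by the code point of the dependent vowel sign. The sketch below builds the syllables of the table above from the letter क; the variable names are illustrative.

```python
import unicodedata

KA = "\u0915"  # क, DEVANAGARI LETTER KA

# Dependent vowel signs (maatraa), keyed by the independent vowel.
MAATRAA = {
    "\u0906": "\u093E",  # आ -> ा
    "\u0907": "\u093F",  # इ -> ि
    "\u0908": "\u0940",  # ई -> ी
    "\u0909": "\u0941",  # उ -> ु
    "\u090A": "\u0942",  # ऊ -> ू
    "\u090B": "\u0943",  # ऋ -> ृ
    "\u090F": "\u0947",  # ए -> े
    "\u0910": "\u0948",  # ऐ -> ै
    "\u0913": "\u094B",  # ओ -> ो
    "\u0914": "\u094C",  # औ -> ौ
}

# The bare consonant carries the inherent vowel अ; the others add a sign.
syllables = [KA] + [KA + sign for sign in MAATRAA.values()]
names = [unicodedata.name(sign) for sign in MAATRAA.values()]

print(" ".join(syllables))
print(names[0])
```

Note that the sign ि is stored after the consonant even though it is displayed to its left, which matters when text is compared or indexed code point by code point.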


4.2.4 Allophones

As mentioned earlier, the distinction between the vowels इ and ई is the duration of the pronunciation of the vowel: the former is shorter and the latter longer. In practice, however, the vowel इ is pronounced more like the English "i" in the word "it", as described in the corresponding text. The same holds for the vowels उ and ऊ.

4.2.5 Final Schwa

The schwa अ is normally not pronounced at the end of a word. Thus कान is pronounced "kaan", not "kaana". An exception occurs when a word ends in a conjunct. In this case the word may be pronounced with a slight final schwa, as in मित्र, literally "mitr" but often pronounced like "mitr(a)", with a soft final schwa.

4.2.6 Monophthongs versus Diphthongs

Native English speakers should be careful not to pronounce the Hindi vowels that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a glide like the English "o" in the word "low". Many vowel letters in English can represent diphthongs. Thus, whereas English may represent a diphthong with the letter "i", as in the word "site", in Devanagari this diphthong would be more precisely transcribed as two monophthongs, आ and ई: साईट.

4.2.7 Schwa Syncope

Sometimes the inherent vowel is not pronounced, despite its implicit presence and the lack of any modifying diacritic. This phenomenon is called schwa syncope or, alternatively, schwa deletion. For instance, consider the word नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if the word were written नम्कीन ("namkeen"). There is no rule which can predict this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants (consonants that carry a pronounced vowel). Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".

Occasionally the schwa is not totally deleted but is very slightly pronounced.

4.2.8 Schwa Pronunciation in Context

The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the English "e" in the word "bed", but only in certain contexts, namely when two अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".

4.2.9 Nasalization of Vowels

All vowels in Hindi can be nasalized, except for ऋ. Nasalization is indicated by either the symbol ं or the symbol ँ. The former symbol is called bindu ("dot") and the latter is called chandrabindu ("moon and dot"). The bindu is used when part or all of the vowel symbol extends above the horizontal line. The chandrabindu is used when no part of the vowel symbol extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.

The following examples summarize the use of the bindu and chandrabindu:

अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं

A special diacritic (ॉ) is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.

4.2.10 Consonants: Velar Consonants

Letter | Description
क | unaspirated k
ख | aspirated k
ग | unaspirated g
घ | aspirated g
ङ | n as in sing

Table 4.3 Consonants: Velar Consonants

Note that the velar nasal consonant does not appear as the first letter of any word.

4.2.11 Palatal Consonants

Letter | Description
च | unaspirated ch, as in cheese
छ | aspirated ch
ज | unaspirated j
झ | aspirated j
ञ | n as in punch

Table 4.4 Palatal Consonants

4.2.12 Retroflex Consonants

Letter | Description
ट | like t, but retroflex and unaspirated
ठ | like t, but retroflex and aspirated
ड | like d, but retroflex and unaspirated
ढ | like d, but retroflex and aspirated
ण | like n, but retroflex

Table 4.5 Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, such as the "t" in the word "better" or the "d" in "bedding" in American English. The Hindi flaps, however, are retroflex.

4.2.13 Dental Consonants

Letter | Description
त | like t, but dental and unaspirated
थ | like t, but dental and aspirated
द | like d, but dental and unaspirated
ध | like d, but dental and aspirated
न | like n in name, but dental

Table 4.6 Dental Consonants

4.2.14 Labial Consonants

Letter | Description
प | like p, but unaspirated
फ | like p, but aspirated
ब | like b, but unaspirated
भ | like b, but aspirated
म | m

Table 4.7 Labial Consonants

4.2.15 Semivowels

Letter | Description
य | y as in young
र | like r, but often rolled
ल | l as in lip
व | either w or v

Table 4.8 Semivowels


The Hindi "r" sound is typically a flap. However, some speakers may occasionally trill the "r" sound, or may even occasionally pronounce it closer to an unflapped approximant, like the English "r" in "red".

4.2.16 Sibilants

Letter | Description
श | sh as in shave
ष | like sh, but retroflex
स | s as in save

Table 4.9 Sibilants

4.2.17 Glottal

Letter | Description
ह | like h, but voiced

Table 4.10 Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of atomic, discrete sounds: sounds belonging to different phonemes can produce a difference in meaning when spoken, whereas sounds belonging to the same phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the "p" in the word "pit" and the "p" in the word "spit" are pronounced distinctly: the former is aspirated, the latter is unaspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are generally not aware of this distinction, even though they make it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds, but will typically pronounce words consistently. This situation is not quite the same as that of "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced "jo". There is some variation, however, in words such as ज़्यादा: some speakers pronounce it "zyada" and some "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation sounds much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.

Capital Letters

Devanagari has no capital letters.

Special Maatraa Forms of उ and ऊ with र

र + उ = रु
र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols of their own in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages, such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter | Approximation
क़ | like k, but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़, but voiced
ज़ | just as English z, as in zoo
झ़ | similar to the s in English vision
फ़ | just as English f

Table 4.11 Foreign Sounds

Only two of the borrowed sounds, ज़ and फ़, are typically pronounced distinctly from the non-nuqta forms, though.
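Because the nuqta is a separate combining mark in Unicode, the same letter can reach an index in more than one form: as a single precomposed code point, as a base letter plus nuqta, or with the nuqta omitted altogether. The sketch below shows one way to bring these forms together with Python's standard `unicodedata` module. The helper names `canonical` and `strip_nuqta` are illustrative, and whether to drop the nuqta at all is a design decision that this chapter does not prescribe.

```python
import unicodedata

NUQTA = "\u093C"               # DEVANAGARI SIGN NUKTA
precomposed = "\u095C"         # ड़ as one code point (DEVANAGARI LETTER DDDHA)
decomposed = "\u0921" + NUQTA  # ड followed by the combining nuqta

def canonical(text):
    """Unicode NFC form; the precomposed nuqta letters are excluded from
    composition, so NFC yields base letter + nuqta for both spellings."""
    return unicodedata.normalize("NFC", text)

def strip_nuqta(text):
    """Fold nuqta letters onto their base letters (loses the distinction)."""
    return unicodedata.normalize("NFD", text).replace(NUQTA, "")

print(precomposed == decomposed)                         # raw strings differ
print(canonical(precomposed) == canonical(decomposed))   # equal after NFC
print(strip_nuqta("\u095B") == "\u091C")                 # ज़ folded to ज
```

Normalizing both the indexed text and the query in the same way lets a match succeed regardless of how the writer typed the letter.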


4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and the modified consonant symbol is then conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ़ + त = फ़्त (मुफ़्त)
क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.


4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either as a regular conjunct (e.g. न + द = न्द, हिन्दी) or with an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
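The homorganic rule for the anusvar suggests a simple way to treat spellings such as हिंदी and हिन्दी as the same keyword: rewrite each anusvar as the nasal consonant of the following consonant's class, followed by the halant. The sketch below is an illustration only; the function name and the class table are assumptions. It covers the five stop classes alone, and it cannot tell an anusvar from a bindu that marks vowel nasalization before a stop consonant.

```python
VIRAMA = "\u094D"    # ् (halant)
ANUSVARA = "\u0902"  # ं

# Map every stop consonant to the nasal of its place of articulation.
HOMORGANIC_NASAL = {}
for nasal, members in [
    ("ङ", "कखगघ"),   # velar
    ("ञ", "चछजझ"),   # palatal
    ("ण", "टठडढ"),   # retroflex
    ("न", "तथदध"),   # dental
    ("म", "पफबभ"),   # labial
]:
    for consonant in members:
        HOMORGANIC_NASAL[consonant] = nasal

def expand_anusvara(word):
    """Replace an anusvar before a stop consonant with nasal + halant."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and nxt in HOMORGANIC_NASAL:
            out.append(HOMORGANIC_NASAL[nxt] + VIRAMA)
        else:
            out.append(ch)  # other characters, and a word-final dot, are kept
    return "".join(out)

print(expand_anusvara("हिंदी"), expand_anusvara("अंडा"))
```

Applying the same function to both the query and the indexed text would make the two spellings compare equal without changing how either is displayed.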

Conjuncts with र

As the first member of a conjunct, र appears as a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)

Four consonants, ट, ठ, ड and ढ, do not have a vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit, so a maatraa written to the left of a consonant (such as ि) is placed before the entire conjunct.

There are hundreds of conjuncts, but most are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam (।), a vertical line which terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.

[59][60]

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters of the English language (it can store at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode defines a code space of 1,114,112 code points (U+0000 to U+10FFFF, which fit comfortably in a 32-bit integer) and can potentially give a unique number to each character in all languages known to man.

There is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.


4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms that specify how a particular character is represented in bits. The three encoding forms allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format (the Unicode Standard expands it as Unicode Transformation Format).
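As a small illustration of the three encoding forms, the sketch below encodes the Devanagari letter क (U+0915) with Python's built-in codecs. The big-endian variants are chosen here only so that no byte order mark is added; that choice is an assumption of the example, not something the text prescribes.

```python
ka = "\u0915"  # क, DEVANAGARI LETTER KA

utf8 = ka.encode("utf-8")       # variable length: three bytes for this letter
utf16 = ka.encode("utf-16-be")  # one 16-bit code unit
utf32 = ka.encode("utf-32-be")  # one 32-bit code unit

for name, data in [("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)]:
    print(name, len(data), "bytes:", data.hex())

# All three forms decode back to the same character without loss.
decoded = {utf8.decode("utf-8"), utf16.decode("utf-16-be"), utf32.decode("utf-32-be")}
print(decoded == {ka})
```

The same code point therefore occupies 3, 2 or 4 bytes depending on the encoding form, which is why a Hindi page takes more bytes in UTF-8 than the same number of ASCII characters would.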

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes, but the current definition in RFC 3629 limits them to 4); the most commonly used characters, however, are at most three bytes long. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
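The role of the first byte can be sketched in a few lines. The function below is a simplified illustration (the name `utf8_sequence_length` is hypothetical): it reads the sequence length from the leading bits of the first byte and does not reject every lead byte that the current UTF-8 definition forbids, such as 0xC0, 0xC1 or values above 0xF4.

```python
def utf8_sequence_length(first_byte):
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:           # 0xxxxxxx: ASCII, single byte
        return 1
    if 0xC0 <= first_byte < 0xE0:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= first_byte < 0xF0:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= first_byte < 0xF8:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and cannot start a sequence.
    raise ValueError("not a UTF-8 lead byte: 0x%02X" % first_byte)

encoded = "\u0915".encode("utf-8")   # क
length = utf8_sequence_length(encoded[0])
all_high_bit_set = all(byte & 0x80 for byte in encoded)

print(length, all_high_bit_set)
```

Every byte of the encoded Devanagari letter has its most significant bit set, which is the property that keeps such bytes from being mistaken for ASCII characters.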

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the

Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
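The parallel code layout can be illustrated with a naive script-to-script mapping that simply shifts each code point by the distance between two block starts. The sketch below is an assumption-laden illustration, not a real transliterator: the helper name shift_script is invented here, the block starts 0x0900 (Devanagari) and 0x0980 (Bengali) are the Unicode block boundaries, and positions that are unassigned in the target block are left unchanged.

```python
import unicodedata

DEVANAGARI_START = 0x0900
BENGALI_START = 0x0980
BLOCK_SIZE = 0x80  # each of these Indic blocks spans 128 code points


def shift_script(text: str, src_start: int, dst_start: int) -> str:
    """Map characters to the analogous position in another Indic block.

    Characters outside the source block, or whose target position is
    unassigned, are copied through unchanged.
    """
    result = []
    for ch in text:
        cp = ord(ch)
        if src_start <= cp < src_start + BLOCK_SIZE:
            candidate = chr(cp - src_start + dst_start)
            # unicodedata.name returns the default ("") for unassigned code points
            result.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            result.append(ch)
    return "".join(result)


# KA, MA, LA in Devanagari (U+0915, U+092E, U+0932) land on
# KA, MA, LA in Bengali (U+0995, U+09AE, U+09B2).
word = "\u0915\u092e\u0932"
print(shift_script(word, DEVANAGARI_START, BENGALI_START))
```

Because spelling conventions and letter inventories differ between languages, such a shift helps transliteration only to a degree, which matches the hedged wording of the standard.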

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G (Extended Character Set for Vedic). Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
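Logical order can be made visible by listing the code points of a word with Python's standard unicodedata module. The sketch below uses the word for "Hindi" as an example chosen for this illustration; the helper name code_points is likewise made up.

```python
import unicodedata


def code_points(word: str):
    """Return (code point, Unicode name) pairs in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]


# The word "Hindi" in Devanagari, written with escapes so the stored order is explicit:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

for cp, name in code_points(hindi):
    print(cp, name)
```

The vowel sign I is drawn to the left of the letter HA on screen, yet it is stored after HA because it is pronounced after it. The virama between NA and DA suppresses the inherent vowel of NA, so the two consonants form a cluster within one orthographic syllable.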

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information,

reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier ASCII-based system could encode only 128 characters, which is not enough for the Devanagari script used for Hindi. Unicode, in its original 16-bit design, could encode about 65,000 characters, and the current standard has room for more than a million. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,

Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few, Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in the script of a different language. Roy-Chowdhury explains the process of adding a new language: "Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds. The efforts of

Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts

agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering

(EMILLE) program provides corpora, architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in the translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository of Hindi words and for dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been under way since 1980, involving prominent institutions such as IIT Kanpur, University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R.M.K. Sinha. It is

a rule-based system and has approximately 1750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R.M.K. Sinha developed Anubharti during 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for translation of Government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to

provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent both the English and the Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (Rajya Sabha is the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti is a rule-based system that also uses statistical approaches. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K.B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.aukbc.org/research_areas/nlp/demo/mat

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl

4.5.11 Punjabi to Hindi Machine Translation System

During 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The

system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT – Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian language content and tries to return the most appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides an automated translation facility from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has been aware of the need for Indian language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al. 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also putting effort into the

proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines were launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass a judgment on any

of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return '0' search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is shown to return '0' documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. But just saying we have a recall problem may not be sufficient. The next obvious question is 'how much of a problem is it?' In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs and by how much.
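One standard way to quantify the problem is the recall measure: the share of all relevant documents that a search actually returns. The sketch below is a minimal illustration under stated assumptions. The document identifiers and the two result lists are invented for the example; they are not the data or the experimental procedure of this study.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear among the retrieved ones."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical judged set of relevant documents for one information need.
relevant_docs = {"d1", "d2", "d3", "d4"}

# Hypothetical results for two spellings of the same query.
results_variant_a = []                    # the engine reports 0 results
results_variant_b = ["d1", "d3", "d9"]    # a rephrased query finds some of them

print(recall(results_variant_a, relevant_docs))  # 0.0
print(recall(results_variant_b, relevant_docs))  # 0.5
```

Comparing recall across spelling and morphological variants of one query shows how much retrievable content a single surface form misses.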

Table 415 Problems faced while search in Hindi Low recall

Table 416 Improved Recall

48 Factors affecting performance of Hindi search

481 Morphological Factors

Hindi language is morphologically rich language It has well defined

morphological structure and well defined grammar But the grammatical and

language structural standard is least followed due to various reasons One of the

reasons is the language diversity in India Including Hindi there are about 28

Languages spoken in India and Hindi being the National Language of India is

influenced by the regional languages which results a change in dialects not only in

Hindi Query Google Yahoo Guruji

वलडम टर ड स नटय आत कव दी हरभ 0 0 0

इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम 0 0 0

वलडम टर ड सटय आतॊकी हभर

8820 92 12

वलडम टर ड सटय आतॊकीअटक

331 10 1

इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम

708 50 1

बायतीम सॊटथान टवाटम शिा औय िोध 37100 7400 93

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 121

speaking but writing also Every language uses some markers like (English

language uses s es ing and ऐ म ा ओ MAATRAAS in Hindi language) are

used with a root word and new words are constructed For ex (Planning in

English) Yojnaaon म जन ओ Yojnaayein म जन ए in Hindi are the morphological

variants of root word Yojnaa म जन It is desirableto combine all the

morphological variants of the words in a single canonical form The process is

called as word stemming and this canonical form is called as root word or base

word
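The stemming step described above can be sketched as a simple suffix stripper. This is a minimal illustration and not the method used in the experiments: the suffix list covers only the plural endings of the Yojnaa example (written here in standard Unicode Devanagari), and the names `SUFFIXES` and `light_stem` are hypothetical.

```python
# Illustrative plural endings only; a usable Hindi stemmer needs a far larger list.
SUFFIXES = ("ओं", "एँ", "एं")


def light_stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


root = "योजना"                          # Yojnaa
variants = ["योजना", "योजनाओं", "योजनाएँ"]  # Yojnaa, Yojnaaon, Yojnaayein

for variant in variants:
    print(variant, "->", light_stem(variant))
```

The `min_stem_len` guard is a design choice to avoid reducing very short words to nothing; rule-based strippers of this kind tend to over-stem or under-stem on irregular forms.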

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabet. Together, these have resulted in several spellings of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible spelling variations.

Numerous words are phonetically equivalent but are written differently. The word "school" can be written in Hindi in different ways (सक र सक र सक र). When a search is made for the standard keyword for school (स्कूल) and for a non-standard, phonetically equivalent keyword (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by the other regional languages, which adds to the phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India, and more than two thousand results are found for the Hindi word ISKOOL (इस्कूल). Search engines should be capable of retrieving results for the phonetic equivalents of the keywords entered, because a user may search with any of them.

No particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results; that is, a different set of documents is retrieved, with very little repetition. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss information relevant to his or her needs.
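One way a search front end could treat phonetically equivalent spellings alike is to map each word to a normalized key and group words that share a key. The sketch below is an assumption-laden illustration, not the approach evaluated in this chapter: the folding rules (nukta removal, chandrabindu to anusvara, long to short i and u) are chosen only for demonstration, and the names `phonetic_key` and `group_variants` and the sample spellings are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Illustrative folding rules (assumptions, not an established standard).
_FOLD = str.maketrans({
    "\u0901": "\u0902",  # chandrabindu -> anusvara
    "\u0940": "\u093F",  # vowel sign long i -> short i
    "\u0942": "\u0941",  # vowel sign long u -> short u
    "\u0908": "\u0907",  # letter long i -> short i
    "\u090A": "\u0909",  # letter long u -> short u
})
NUKTA = "\u093C"


def phonetic_key(word):
    """Return a normalized key shared by spellings that differ only in the folded marks."""
    decomposed = unicodedata.normalize("NFD", word)  # splits precomposed nukta letters
    return decomposed.replace(NUKTA, "").translate(_FOLD)


def group_variants(words):
    """Group words by their phonetic key."""
    groups = defaultdict(list)
    for word in words:
        groups[phonetic_key(word)].append(word)
    return dict(groups)


# Hypothetical spellings of "angreji" (English) and "school".
samples = ["अंग्रेजी", "अंग्रेज़ी", "अँग्रेज़ी", "स्कूल", "स्कुल"]
groups = group_variants(samples)
for key, members in groups.items():
    print(key, members)
```

Folding of this kind trades precision for recall: it merges genuine spelling variants, but it can also merge distinct words that differ only in vowel length, so any real rule set would need to be checked against a dictionary.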

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous words

Ambiguous words reduce the relevancy of the results, as the examples below show. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, because it also means "to sleep". In the context of this query सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous, and finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कर means "tax" in one context (with synonyms बम ज श लक स द भहस ऱ ट कस), "hand or arm" in another (with synonyms हसत फ ह आच शफय), and "to do" (करना) in a third.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Speakers are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of which language the words belong to. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists queries randomly selected from that set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 gives examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental Patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S.No | Root word(s) | Listing of keywords (Google, Bing, Guruji) | Morphological variant | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | |
1.1 | | | ब यत वष मवन | 50500 | 4680 | 485
1.2 | | | ब यत मवष मवनो | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | | |
2.1 | | | द घमटन | 133000 | 2410 | 284
2.2 | | | द घमटन ओ | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | | | |
3.1 | | | ब ष | 161000 | 8330 | 961
3.2 | | | ब ष ए | 6200 | 935 | 188
3.3 | | | ब ष ओ | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | | | |
4.1 | | | झ र | 4740 | 278 | 25
4.2 | | | झ र | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | | | |
5.1 | | | आऩद | 102000 | 4030 | 410
5.2 | | | आऩद ए | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | | | |
6.1 | | | ऩ परज तत | 48200 | 1670 | 98
6.2 | | | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | | | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | | |
7.1 | | | सभसम | 584000 | 30200 | 1889
7.2 | | | सभसम ओ | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | | |
8.1 | | | कीटन शक | 36300 | 1360 | 333
8.2 | | | कीटन शको | 35800 | 800 | 270
9 | य ग | य गोय ग य ग य ग | | | |
9.1 | | | य ग | 205000 | 21600 | 1423
9.2 | | | य चगमो | 128000 | 3280 | 239
9.3 | | | य ग | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | | | |
10.1 | | | म जन | 673000 | 18500 | 3343
10.2 | | | म जन ओ | 669000 | 6020 | 990
10.3 | | | म जन ए | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | | | |
11.1 | | | क दर | 261000 | 11300 | 655
11.2 | | | क नदरो | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines

S.No | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines [bar chart: precision (0 to 1) of Google, Bing and Guruji for queries 1.1, 2.1, 3.1, 3.3, 4.2, 5.2, 6.2, 7.1, 8.1, 9.1, 9.3, 10.2 and 11.1]

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with root words, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. Overall, the comparison of the three search engines in the tables above shows that the quantity of results generally increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines, where the precision value is calculated over the top 10 results of each search engine. On close observation, the precision of Google is high for almost all queries. Since Google, as mentioned above, indexes keywords in their root form, the relevancy of its results is also higher than that of the other two search engines. This indicates that morphological variations in keywords affect not only the quantity but also the quality of results.
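The precision values reported here are computed over the top 10 results. A minimal sketch of that calculation follows; the function name `precision_at_k` and the sample relevance judgements are hypothetical, and dividing by k even when fewer than k results are returned is an assumption, since the text does not state how short result lists were handled.

```python
def precision_at_k(judgements, k=10):
    """Precision over the top k ranked results.

    judgements: list of booleans, one per ranked result (best first),
    True when the result was judged relevant to the query.
    Assumption: missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for relevant in judgements[:k] if relevant) / k


# Hypothetical judgements: 9 of the first 10 results are relevant.
judged = [True] * 9 + [False]
score = precision_at_k(judged)
print(score)
```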

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for the phonetic equivalents of the keywords entered. A user may search with any of these forms, and search engines should support all of them.

The following queries were randomly selected from the set of 50 queries tested on the Google search engine. Table 4.20 shows the number of results and the precision offered by Google.

Table 4.20 Results of the search engine on phonetic nature of Hindi language

Query No. | Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Google query having keywords | No. of results | Precision@10
1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | | ससजिमो जहयीर | 632 | 0.9
 | | | सिबजमो िहयीर | 194 | 0.9
2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | | आसभ न भहॉगाई | 1040 | 1.0
 | | | आसभ भह ग ई | 14 | 0.6
 | | | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | | भरशटाचायआज दी | 214 | 0.6
 | | | बयषटाचाय आज दी | 447 | 0.7
 | | | भरषटट च यआिादी | 1090000 | 0.9
 | | | भरशट च य आिादी | 1040 | 0.7
 | | | बयषटट च यआिादी | 1190 | 0.8
4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | | अनन हज य आॊदोरन | 85100 | 0.8
 | | | अनन हज य क आॉदोरन | 78 | 0.6
 | | | अनना हिाय क आनद रन | 399 | 0.5
 | | | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | | फयोिगायी | 80600 | 1.0
 | | | फयोिगायी | 170 | 0.7
 | | | फयोिगायी | 30 | 0.5

Figure 4.2 Precision Charts for Phonetic nature of Hindi language [five pie charts, one each for Query No. 1 to Query No. 5, showing the share of each phonetic variant of the query]

The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing keywords to fetch Hindi web data, and each phonetically equivalent keyword in a query produces different results; that is, a different set of documents is retrieved, with very little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss information relevant to his or her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values obtained from the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S.No | Query | Standard Hindi word and synonyms | Documents returned: Google | Per 10 | Bing | Per 10 | Guruji | Per 10
1 | स न क आब षण | आब षणगहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines

The examples above show that using synonyms of Hindi keywords improves information retrieval for a query in Hindi. Both the quantity and the quality of the documents returned are affected by the use of synonyms.

The table and figure also show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first; that is, its precision values are higher than those of Bing and Guruji. Thus, replacing a keyword with its synonym yields equivalent results, and synonyms of keywords play an important role in Hindi information retrieval.
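The idea of replacing a keyword with its synonyms can be sketched as a simple query expansion step. The code below is an illustration only: the synonym dictionary is a hypothetical stand-in for a real lexical resource, and the Hindi spellings (written in standard Unicode Devanagari) and the name `expand_with_synonyms` are assumptions rather than part of the original experiments.

```python
# Hypothetical synonym dictionary; a real system would use a lexical resource.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per keyword synonym."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = " ".join(words[:i] + [alternative] + words[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


query = "सोने के आभूषण"  # "gold ornaments"
expanded = expand_with_synonyms(query)
for variant in expanded:
    print(variant)
```

Each variant would be submitted as a separate search and the result lists merged; the sketch replaces whole words only and does not handle inflected forms.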

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English under each context of the ambiguous keyword.


Table 4.22 List of randomly selected ambiguous queries

S.No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like Gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child Development and Nutrition | बाल (Hair) | Hair Development and Nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, unwanted context. All the search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers), whose other context is "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the ambiguity of keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word "exercise" is written in Hindi as एकसयस इज. The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English Word | Google Transliteration | Standard Hindi Keyword | Phonetic Equivalent 1 | Phonetic Equivalent 2 | Google results: Transliteration | Standard | Equivalent 1 | Equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

Table 4.26 shows that the search engine returns documents for single-keyword queries and also for every phonetic variant of the keyword, often in very large numbers. People evidently have their own ways of writing these words in Hindi, and no standard is followed for storing Hindi data on the web; documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google Transliteration column (set in bold in the original table) shows the keywords obtained with the Google transliteration tool. The table shows that transliteration does not give the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of the documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi, without changing the meaning of the query. The queries are transformed at two levels: at the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, again without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Thus the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented in Table 4.27.
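The two-level transformation can be sketched as follows. This is not the software described in the next chapter; it is a minimal illustration under stated assumptions: the mapping from Hindi keywords to English words in Hindi script is a hypothetical two-entry dictionary, the spellings are written in standard Unicode Devanagari, and, unlike the single Level 1 variant shown in the example, the sketch generates every single-word replacement.

```python
from itertools import combinations

# Hypothetical mapping: Hindi keyword -> English word written in Hindi script.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",        # videshi -> foreign
    "निवेश": "इन्वेस्टमेंट",   # nivesh -> investment
}


def transform(query, mapping=ENGLISH_IN_HINDI):
    """Return (level1, level2) lists of query variants.

    level1: exactly one mapped keyword replaced.
    level2: two or more mapped keywords replaced.
    """
    words = query.split()
    positions = [i for i, word in enumerate(words) if word in mapping]

    def replace_at(chosen):
        return " ".join(
            mapping[word] if i in chosen else word for i, word in enumerate(words)
        )

    level1 = [replace_at({i}) for i in positions]
    level2 = [
        replace_at(set(combo))
        for size in range(2, len(positions) + 1)
        for combo in combinations(positions, size)
    ]
    return level1, level2


query = "विदेशी निवेश भारत में"  # videshi nivesh Bharat mein
level1, level2 = transform(query)
print(level1)
print(level2)
```

Because only dictionary entries are substituted and word order is untouched, the meaning of the query is preserved by construction, provided each mapping entry is a true equivalent.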

Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: Tabular representation

No. | In English | Hindi Query | Influenced Hindi Query: Level 1 | Level 2 | Google search results: Hindi Query | Level 1 | Level 2
1 | Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
2 | Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
3 | Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
4 | Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
5 | Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
6 | Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed Queries into two equivalent senses [six charts, one each for Hindi Query 1 to Hindi Query 6, comparing the number of results for the Hindi query, Level 1 and Level 2]

The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of documents retrieved is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve web results.

Table 4.28 Analysis of precision values: Tabular representation

Precision@10 is given for the Hindi query (HQ) and its Level 1 (L1) and Level 2 (L2) variants.

Hindi Query | Level 1 | Level 2 | Google: HQ | L1 | L2 | Bing: HQ | L1 | L2 | AltaVista: HQ | L1 | L2
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

Figure 4.5 Analysis of Inter precision values [Inter Precision Chart: precision of Google, Bing and AltaVista for each Hindi query (HQ1 to HQ6) and its Level 1 (L1) and Level 2 (L2) variants]

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns: for every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; a similar pattern holds for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, users may miss relevant information, because they may not be aware that such information exists on the web and may be unable to formulate the query variations that arise from English influence. The table also indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in queries, the scope of searching in Hindi and of obtaining relevant information can be increased.

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
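The idea of transforming a query at the interface level can be illustrated with a minimal sketch. It is not the tool described in the next chapter: the variant table, the function name `expand_query` and the word-by-word substitution strategy are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical variant table: each Hindi keyword maps to alternatives a user
# might not think to type (here, English loanwords written in Devanagari).
# The entries are illustrative only and are not taken from the thesis tool.
VARIANTS = {
    "रक्तदान": ["रक्तदान", "ब्लड डोनेशन"],
    "स्वास्थ्य": ["स्वास्थ्य", "हेल्थ"],
}


def expand_query(query: str) -> list[str]:
    """Return every query obtained by substituting known keyword variants."""
    options = [VARIANTS.get(word, [word]) for word in query.split()]
    return [" ".join(choice) for choice in product(*options)]
```

An interface layer could send each expanded query to the search engine and merge the result lists, leaving the engine itself unchanged.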


4.2.4 Allophones

As mentioned earlier, the distinction between the vowels इ and ई is the duration of the pronunciation of the vowel: the former is shorter and the latter longer. In practice, however, the vowel इ is pronounced more like the English "i" as in the word "it", as described in the corresponding text. The same holds for the vowels उ and ऊ.

4.2.5 Final Schwa

The schwa अ is normally not pronounced at the end of a word. Thus कान is pronounced "kaan", not "kaana". An exception occurs when a word ends in a conjunct. In this case the word may be pronounced with a slight final schwa, as in मित्र, literally "mitr" but often pronounced like "mitr(a)", with a soft final schwa.

4.2.6 Monophthongs versus Diphthongs

Native English speakers should be careful not to pronounce the Hindi vowels that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a glide like the English "o" in the word "low". Many vowel letters in English can represent diphthongs. Thus, whereas English may represent a diphthong with the letter "i", as in the word "site", in Devanagari this diphthong would be more precisely transcribed as two monophthongs, आ and ई: साईट.

4.2.7 Schwa Syncope

Sometimes the inherent vowel is not pronounced despite its implicit presence and the lack of any modifying diacritic. This phenomenon is called schwa syncope or, alternatively, schwa deletion. For instance, consider the word नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if the word were written नम्कीन ("namkeen"). There is no rule which can predict this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants. Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".

Occasionally the schwa will not be totally deleted, but will be very slightly pronounced.

4.2.8 Schwa Pronunciation in Context

The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the English "e" in the word "bed", but only in certain contexts, namely when अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".

4.2.9 Nasalization of Vowels

All vowels in Hindi can be nasalized, except for ऋ. Nasalization is indicated either by the symbol ं or by the symbol ँ. The former is called bindu ("dot") and the latter chandrabindu ("moon and dot"). The bindu is used when part or all of the vowel symbol extends above the horizontal line. The chandrabindu is used when no part of the vowel symbol extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.

The following examples summarize the use of the bindu and chandrabindu:

अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं

कँ काँ किं कीं कुँ कूँ कें कैं कों कौं
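Because the bindu may replace the chandrabindu in modern writing, the same word can reach an index in two spellings. A minimal sketch of one way to fold them together follows; the function name `fold_nasalization` is an assumption, and mapping every chandrabindu to a bindu is a simplification that discards the orthographic distinction described above.

```python
CHANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU
BINDU = "\u0902"         # DEVANAGARI SIGN ANUSVARA (the dot written above the line)


def fold_nasalization(text: str) -> str:
    """Map chandrabindu to bindu so both spellings index to the same key."""
    return text.replace(CHANDRABINDU, BINDU)
```

Applying the same folding to both the query and the indexed text keeps the two sides comparable.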

A special diacritic is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.

4.2.10 Consonants: Velar Consonants

Letter   Description
क        unaspirated k
ख        aspirated k
ग        unaspirated g
घ        aspirated g
ङ        n as in "sing"

Table 4.3 Consonants: Velar Consonants

Note that the velar nasal consonant does not appear as the first letter of any word.

4.2.11 Palatal Consonants

Letter   Description
च        unaspirated ch, as in "cheese"
छ        aspirated ch
ज        unaspirated j
झ        aspirated j
ञ        n as in "punch"

Table 4.4 Palatal Consonants

4.2.12 Retroflex Consonants

Letter   Description
ट        like t, but retroflex and unaspirated
ठ        like t, but retroflex and aspirated
ड        like d, but retroflex and unaspirated
ढ        like d, but retroflex and aspirated
ण        like n, but retroflex

Table 4.5 Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, such as the "t" in "better" or the "d" in "bedding" in American English. The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

Letter   Description
त        like t, but dental and unaspirated
थ        like t, but dental and aspirated
द        like d, but dental and unaspirated
ध        like d, but dental and aspirated
न        like n in "name", but dental

Table 4.6 Dental Consonants

4.2.14 Labial Consonants

Letter   Description
प        like p, but unaspirated
फ        like p, but aspirated
ब        like b, but unaspirated
भ        like b, but aspirated
म        m

Table 4.7 Labial Consonants

4.2.15 Semivowels

Letter   Description
य        y as in "young"
र        like r, but often rolled
ल        l as in "lip"
व        either w or v

Table 4.8 Semivowels

The Hindi "r" sound is typically a flap. However, some speakers may trill the "r" sound occasionally, or may even occasionally pronounce it closer to an unflapped approximant sound, as in the English "r" in "red".

4.2.16 Sibilants

Letter   Description
श        sh as in "shave"
ष        like sh, but retroflex
स        s as in "save"

Table 4.9 Sibilants

4.2.17 Glottal

Letter   Description
ह        like h, but voiced

Table 4.10 Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of atomic, discrete sounds: substituting a member of one phoneme for a member of another can produce a difference in meaning, whereas substituting members of the same phoneme for one another cannot. A phone is simply a distinct sound. For instance, in English the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "other sound").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge, behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.


Capital Letters

Devanagari has no capital letters.

Special Matraa Forms of उ and ऊ with र

र + उ = रु

र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages, such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter   Approximation
क़        like k, but pronounced in the back of the mouth
ख़        velar fricative, like "Bach" in German
ग़        velar sound similar to ख़, but voiced
ज़        just as English z, as in "zoo"
झ़        similar to the s in English "vision"
फ़        just as English f

Table 4.11 Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
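Since most nuqta letters are pronounced like their plain counterparts, and writers often omit the dot, a retrieval system may want to treat both spellings as one key. The sketch below shows one possible normalization using Python's standard `unicodedata` module; the function name `strip_nuqta` is an assumption, and dropping every nuqta is a deliberate simplification (it also merges the flaps ड़ and ढ़ with ड and ढ).

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nuqta(text: str) -> str:
    """Decompose precomposed nuqta letters, then drop the nuqta mark."""
    # NFD splits a precomposed letter such as U+095B into its base
    # consonant followed by the combining nuqta.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(NUQTA, "")
```

With this folding, a query typed with ज़ and a document written with ज reduce to the same string before matching.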


4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
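In Unicode text the two written forms share one stored form: a conjunct is the first consonant, the virama (halant) character U+094D, then the second consonant, and the font decides whether to draw a ligature or a visible halant. A minimal sketch, with helper names `make_conjunct` and `code_points` chosen here for illustration:

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (the halant)


def make_conjunct(first: str, second: str) -> str:
    """Join two consonants into the stored form of a conjunct."""
    return first + VIRAMA + second


def code_points(text: str) -> list[str]:
    """List each character as U+XXXX followed by its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]
```

Calling `code_points` on a word such as हिन्दी makes the hidden virama between न and द visible, which is useful when two visually identical strings fail to match.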

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द    हिन्दी
च + छ = च्छ    अच्छा
स + त = स्त    नमस्ते
ल + ल = ल्ल    बिल्ली
म + ब = म्ब    लम्बा
फ + त = फ्त    मुफ्त
क + य = क्य    क्यों

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.


4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट    छुट्टी
ट + ठ = ट्ठ    चिट्ठी

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
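The homorganic rule means the two spellings of a word such as हिन्दी and हिंदी can be reconciled mechanically. The following is a minimal sketch under stated assumptions: the function name `expand_anusvara` and the lookup table are illustrative, only the five stop classes are handled, and a dot before any other letter (or at the end of a word) is left unchanged because it may mark vowel nasalization instead.

```python
ANUSVARA = "\u0902"  # the dot written above the line
VIRAMA = "\u094d"    # halant, joins the nasal to the following consonant

# Each class of stops (varga) paired with the nasal of the same class.
VARGA_NASAL = {
    "कखगघ": "ङ",  # velar
    "चछजझ": "ञ",  # palatal
    "टठडढ": "ण",  # retroflex
    "तथदध": "न",  # dental
    "पफबभ": "म",  # labial
}
NASAL_FOR = {c: nasal for stops, nasal in VARGA_NASAL.items() for c in stops}


def expand_anusvara(word: str) -> str:
    """Rewrite an anusvar as the explicit nasal conjunct it stands for."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and nxt in NASAL_FOR:
            out.append(NASAL_FOR[nxt] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)
```

Normalizing both queries and documents in one direction (either expanding the anusvar, as here, or collapsing nasal conjuncts to an anusvar) lets the two spellings match.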

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म    शर्मा
र + ट + ई = र्टी    पार्टी

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र    शुक्रिया
म + र = म्र    उम्र

Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:

ट + र = ट्र    राष्ट्र

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit, so a maatraa that is written before a consonant is placed before the entire conjunct.

There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam (।), a vertical line which terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam [59][60].

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a code point in the range U+0000 to U+10FFFF, which fits in a 32-bit unit, and can thus give a unique number to each character in all known languages.

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.


4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
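The difference between the three encoding forms can be seen by measuring how many bytes one character occupies in each. A minimal sketch using Python's built-in codecs; the function name `encoded_sizes` is an assumption made for illustration.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Number of bytes used for one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The -le variants are used so that no byte-order mark is prepended.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }
```

For the Devanagari letter क (U+0915) this gives 3 bytes in UTF-8, 2 in UTF-16 and 4 in UTF-32, whereas an ASCII letter takes 1, 2 and 4 bytes respectively, so Hindi text is larger in UTF-8 than in UTF-16 while remaining losslessly convertible between the forms.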


UTF stands for UCS Transformation Format (also expanded as Unicode Transformation Format).

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes); the most commonly used characters, however, need at most three bytes. Non-ASCII characters are encoded as follows:

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes the sequence contains. The remaining bytes of the multibyte sequence carry the rest of the bits that encode the actual character [61].
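The role of the first byte can be made concrete with a short sketch that reads the sequence length from the leading bits. The function names are assumptions made for illustration, and the code only inspects bit patterns: it does not reject overlong or otherwise invalid sequences as a full decoder would.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total sequence length announced by a UTF-8 lead byte."""
    if lead < 0x80:            # 0xxxxxxx: ASCII, single byte
        return 1
    if 0xC0 <= lead < 0xE0:    # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead < 0xF0:    # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead < 0xF8:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80
```

For example, the letter क is stored as the bytes E0 A4 95: the lead byte E0 announces a three-byte sequence, and A4 and 95 are continuation bytes, all with the most significant bit set.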

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.


4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
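Because the script occupies one contiguous block, a program can recognize Devanagari text with a simple range check on code points. A minimal sketch follows; the function names and the ratio measure are assumptions made for illustration, and the check says nothing about which language (Hindi, Marathi, Nepali and so on) the text is written in.

```python
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari block U+0900 to U+097F."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


def devanagari_ratio(text: str) -> float:
    """Share of non-space characters that fall in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(ch) for ch in chars) / len(chars)
```

A query interface could use such a ratio to decide whether a query was typed in Devanagari or in Roman script before choosing how to process it.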

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
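The parallel layout can be illustrated by shifting a character by the distance between two script blocks. This is a rough sketch only: the function name `shift_script` is an assumption, and the result is a naive transliteration, since not every position is assigned in every script and some letters have no counterpart.

```python
# Starting code point of each 128-position block in the Unicode Standard.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}


def shift_script(text: str, source: str, target: str) -> str:
    """Move characters of the source block to the same offset in the target."""
    lo = BLOCK_START[source]
    delta = BLOCK_START[target] - lo
    return "".join(
        chr(ord(ch) + delta) if lo <= ord(ch) < lo + 0x80 else ch
        for ch in text
    )
```

For instance, the Devanagari letter क (U+0915) maps to the Bengali letter ক (U+0995) and to the Gujarati letter ક (U+0A95), because each sits at offset 0x15 in its block.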

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
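Logical order means that the stored sequence follows pronunciation even when the display does not: in कि the vowel sign is drawn to the left of the consonant, yet the consonant is stored first. A minimal sketch that labels the three character types is given below; the function name `character_type` is an assumption, and the ranges cover only the core letters and vowel signs of the block, so later additions fall under "other".

```python
def character_type(ch: str) -> str:
    """Classify a character by its position in the Devanagari block."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant letter"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    return "other"


def stored_types(text: str) -> list[str]:
    """Character types in the order in which the text is stored."""
    return [character_type(ch) for ch in text]
```

For कि the stored order is consonant letter, then dependent vowel sign, which is why string operations such as prefix matching on Hindi text behave phonetically and not visually.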

4.4 Indian Languages on internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate more than 65,000 characters. As most computers in India use Microsoft's operating system, it ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few, Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar.

Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

―There is a big opportunity in Indian languages Studies showed that only 28

percent of Indian Web surfers preferred English on the Web but as good quality

content in Indian languages was not easily available they did not visit many local

language sites says Mrutyunjay Mishra co-founder of Juxtconsult IT experts

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 110

agree ―Localization is the key to success in countries like India In order to get

the widest audience reach one has to look at Hindi because in a country of over a

billion people English is spoken by less than 80 million people says Krishna of

Yahoo Googlelsquos Roy-Chowdhury agrees with him ―The Web is the

democratization of access to information he says adding that the Internet is not

a luxury but a powerful tool to improve life But is Hindi earning enough revenue

to be dubbed successful on the Web ―Thatlsquos a tough question Right now it is not

much says Mishra But Roy-Chowdhury thinks revenue is bound to come once

Hindi reaches a critical volume ―If we look at how the Internet developed in the

US it may provide a useful analogy First came content which was mostly

produced by people who had a passion for putting up content they cared about

Traffic and monetization was not the motive Second came growing readership as

people started discovering content This set off a virtuous cycle in which content

eventually became a viable monetizable business Third were the application

developers who could now focus on moving the online experience beyond passive

consumption of information to interactivity community building service delivery

and a host of other innovations Roy- Chowdhury says ―Indialsquos market was

stuck in phase one for a long time And I believe it has recently entered phase

two [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the scheduled languages of India, as part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus of 200,000 words of English text, which helps in translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International Hindi University has started the project 'Hindi Samghraha', a repository for a database of Hindi words and for dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from the 1980s and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. Various Indian institutes, for example IIT Guwahati, have adopted the Anglabharti architecture to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software translates text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was conceived by the Indian Government in 1996 for translating government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology supported only English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (the Rajya Sabha is the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines rule-based and statistical approaches. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates an English sentence into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawler technology based on proprietary algorithms. For any query, it searches deep into Indian language content and tries to return the appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation and collection and of content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has been aware of the need for Indian language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing it. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working toward the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As a result, IndoWordNet was developed for Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario. The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by JuxtConsult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website   | 2008 Stats        | 2009 Stats
Google    | 37                | 35
Yahoo     | 32                | 25
Rediff    | 7                 | 4
Orkut     | 6                 | 7
More info | India Online 2008 | India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search: English

Website   | 2008 Stats        | 2009 Stats
Google    | 81                | 76
Yahoo     | 7                 | 7
Wikipedia | 3                 | 6
English   | 3                 | 4
More info | India Online 2008 | India Online 2009 [65]

Table 4.13: Info Search: English

4.6.1.3 Info Search: Local Language

Website   | 2008 Stats        | 2009 Stats
Google    | 65                | 34
Yahoo     | 12                | 29
Rediff    | 4                 | 15
Teluguone | 2                 | 0
Guruji    | 1                 | 18
Raftaar   | 1                 | 0.2
Hindi     | 1                 | NA
Webdunia  | 1                 | 0.7
Khoj      | 0.7               | 2
More info | India Online 2008 | India Online 2009 [65]

Table 4.14: Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns 0 documents in Table 4.15; however, a small rephrasing of the query, shown in Table 4.16, returns results containing these keywords. But just saying we have a recall problem may not be sufficient. The next obvious question is 'how much of a problem is it?' In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs and by how much.

Hindi Query                                      | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                        | 0      | 0     | 0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च              | 0      | 0     | 0

Table 4.15: Problems faced while searching in Hindi: low recall

Hindi Query                                      | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                           | 8820   | 92    | 12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                           | 331    | 10    | 1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च               | 708    | 50    | 1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध                  | 37100  | 7400  | 93

Table 4.16: Improved recall

4.8 Factors Affecting the Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards of the language are rarely followed, for various reasons. One of the reasons is the language diversity of India. India has 22 scheduled languages, including Hindi, and Hindi, being an official language of India, is influenced by the regional languages, which results in dialectal variation not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses maatraas and endings such as ओं and एँ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, 'plan'). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
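The stemming idea can be illustrated with a minimal suffix-stripping sketch. This is not the stemmer used by any of the search engines discussed here; the suffix list and the function name `stem` are assumptions chosen only to cover the Yojnaa example, and a practical Hindi stemmer would need a much larger rule set.

```python
# Illustrative plural/oblique endings (an assumption, not an exhaustive list).
SUFFIXES = ("ओं", "एँ", "एं", "ों", "ें")


def stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# The three forms from the example above collapse to one canonical form.
variants = ["योजना", "योजनाओं", "योजनाएँ"]
print({word: stem(word) for word in variants})
```

Indexing and querying on the output of such a function would let all three forms retrieve the same documents, at the cost of occasionally conflating unrelated words that happen to share an ending.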

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these result in multiple spellings of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning 'English') has several possible spellings. There are numerous words which are phonetically equivalent but vary in writing. The word 'school' can be written in Hindi in several different ways. When information is searched for with the standard keyword for 'school' (स्कूल) and with a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words; for example, the English word 'school' (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
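One way to support phonetically equivalent spellings is to map each keyword to a normalized key before matching. The sketch below is only an illustration under stated assumptions: the three rules (drop the nukta, treat chandrabindu as anusvara, and write a nasal consonant plus halant as anusvara) and the name `phonetic_key` are ours, and the rules are deliberately crude, so they can also merge words that are genuinely different.

```python
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
HALANT = "\u094d"
# The five nasal consonants: nga, nya, nna, na, ma.
NASAL_CONSONANTS = set("\u0919\u091e\u0923\u0928\u092e")


def phonetic_key(word):
    """Return a spelling-insensitive key for a Devanagari word."""
    # NFD splits precomposed nukta letters so the nukta can be removed.
    text = unicodedata.normalize("NFD", word).replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    out = []
    i = 0
    while i < len(text):
        is_nasal_cluster = (
            text[i] in NASAL_CONSONANTS
            and i + 2 < len(text)
            and text[i + 1] == HALANT
        )
        if is_nasal_cluster:
            # A half nasal before another consonant is written as anusvara.
            out.append(ANUSVARA)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


pairs = [("आन्दोलन", "आंदोलन"), ("महँगाई", "महंगाई"), ("आज़ादी", "आजादी")]
for a, b in pairs:
    print(a, b, phonetic_key(a) == phonetic_key(b))
```

A search system could index documents under `phonetic_key(term)` and apply the same function to query terms, so that either spelling of a pair retrieves the same set of documents.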

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ('ornament' in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
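A common way to handle synonyms at query time is query expansion. The following is a minimal sketch, assuming a hand-built synonym dictionary; the dictionary holds only the ornament example, and the function names and the AND/OR output syntax are hypothetical rather than the query language of any particular search engine.

```python
# Toy synonym dictionary (illustrative; a real system would use a
# resource such as a wordnet or a curated thesaurus).
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return one list of alternatives per query term (term first, then synonyms)."""
    return [[term] + synonyms.get(term, []) for term in query.split()]


def to_boolean_query(groups):
    """Join alternatives with OR and terms with AND (hypothetical syntax)."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups)


groups = expand_query("आभूषण दुकान")
print(to_boolean_query(groups))
```

Expansion of this kind raises recall but can lower precision, because near-synonyms differ in nuance, so an expanded query is usually ranked with the original term weighted higher.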

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. 'to sleep'. In the context of the above query the word सोना means gold, but the query can also be interpreted as 'women like to sleep'.

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means 'common'; however, in Hindi it also means 'mango', so the query can be interpreted as 'mango is people's choice'.

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कर means 'tax' in one context, with synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स; 'hand or arm' in another context, with synonyms such as हस्त and बाँह; and 'to do' (करना) in yet another.
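A very simple way to pick a sense is to count how many context words of the query overlap with cue words known for each sense. The sketch below is only an illustration of that idea; the cue-word sets, sense labels and function name are invented for this example, and real disambiguation needs far richer context than a short query usually provides.

```python
# Hypothetical cue words for two senses of one ambiguous word.
SENSE_CUES = {
    "सोना": {
        "gold": {"चांदी", "आभूषण", "कीमत", "गहना"},
        "sleep": {"नींद", "रात", "बिस्तर", "जल्दी"},
    },
}


def disambiguate(word, query, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the query, or None."""
    context = set(query.split()) - {word}
    best_sense, best_overlap = None, 0
    for sense, cue_words in cues.get(word, {}).items():
        overlap = len(context & cue_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("सोना", "सोना चांदी की कीमत"))
print(disambiguate("सोना", "रात को जल्दी सोना"))
print(disambiguate("सोना", "नारी को सोना पसंद है"))
```

The third query contains no cue word for either sense, so the function returns `None`; this reflects the point made above that such a query remains ambiguous without further context.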

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways. It has affected the pronunciation of Hindi words, and many English words have been localized in India. Some of these words appear as if they were native Hindi words, and Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S. No. | Query in Hindi                  | Meaning in English
1      | भारत वर्षा वन                     | Indian rain forest
1.1    | भारतीय वर्षा वनों                  | Indian rain forests
2      | हवाई दुर्घटना के कारण              | Reason for air crash
2.1    | हवाई दुर्घटनाओं के कारण            | Reason for air crashes
3      | भारत में बोली जाने वाली भाषा        | Language spoken in India
3.1    | भारत में बोली जाने वाली भाषाएँ       | Languages spoken in India
3.2    | भारत में बोली जाने वाली भाषाओं      | Languages spoken in India
4      | विलुप्त होने पर झील                | Lake on the verge of extinction
4.1    | विलुप्त होने पर झीलें               | Lakes on the verge of extinction
5      | प्राकृतिक आपदा                   | Natural calamity
5.1    | प्राकृतिक आपदाएँ                  | Natural calamities
6      | पक्षी की प्रजाति                   | Bird species
6.1    | पक्षियों की प्रजाति                 | Bird's species
6.2    | पक्षियों की प्रजातियाँ               | Bird's species
7      | कृषि समस्या                      | Agriculture problem
7.1    | कृषि समस्याओं                    | Agriculture problems
8      | कीटनाशक का इस्तेमाल              | Use of pesticide
8.1    | कीटनाशकों का इस्तेमाल             | Use of pesticides
9      | मानसिक रोग                      | Mental illness
9.1    | मानसिक रोगियों                   | Mental patients
9.2    | मानसिक रोगी                     | Mental patient
10     | ग्रामीण विकास योजना               | Policy for village
10.1   | ग्रामीण विकास योजनाओं             | Policies for village
10.2   | ग्रामीण विकास योजनाएँ              | Policies for village
11     | प्रमुख कृषि केंद्र                   | Major agricultural office
11.1   | प्रमुख कृषि केन्द्रों                 | Major agricultural offices

Table 4.18: Effect of morphological factors on Hindi queries

S No Root

word

s

Listing of Keywords Morphological

variants

Documents Returned

Google Bing Guruji Google Bing Guruji

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

Table 4.19: Precision values of the three search engines

S. No. | Query                          | Precision@10: Google | Bing | Guruji
1.1    | भारत वर्षा वन                    | 0.5 | 0.3 | 0.1
1.2    | भारतीय वर्षा वनों                 | 0.3 | 0.1 | 0.1
2.1    | हवाई दुर्घटना के कारण             | 0.5 | 0.3 | 0.1
2.2    | हवाई दुर्घटनाओं के कारण           | 0.3 | 0.3 | 0.1
3.1    | भारत में बोली जाने वाली भाषा       | 0.9 | 0.5 | 0.5
3.2    | भारत में बोली जाने वाली भाषाएँ      | 0.5 | 0.4 | 0.3
3.3    | भारत में बोली जाने वाली भाषाओं     | 0.5 | 0.2 | 0.2
4.1    | विलुप्त होने पर झील               | 0.5 | 0.3 | 0.2
4.2    | विलुप्त होने पर झीलें              | 0.5 | 0.3 | 0
5.1    | प्राकृतिक आपदा                  | 1   | 0.4 | 0.3
5.2    | प्राकृतिक आपदाएँ                 | 0.6 | 0.6 | 0.1
6.1    | पक्षी की प्रजाति                  | 1   | 0.7 | 0.2
6.2    | पक्षियों की प्रजाति                | 0.9 | 0.6 | 0.1
6.3    | पक्षियों की प्रजातियाँ              | 0.9 | 0.6 | 0.1
7.1    | कृषि समस्या                     | 0.7 | 0.3 | 0.2
7.2    | कृषि समस्याओं                   | 0.7 | 0.4 | 0.2
8.1    | कीटनाशक का इस्तेमाल             | 1   | 0.8 | 0.2
8.2    | कीटनाशकों का इस्तेमाल            | 0.9 | 0.6 | 0.2
9.1    | मानसिक रोग                     | 0.7 | 0.6 | 0.4
9.2    | मानसिक रोगियों                  | 0.9 | 0.6 | 0.3
9.3    | मानसिक रोगी                    | 0.9 | 0.5 | 0.4
10.1   | ग्रामीण विकास योजना              | 0.9 | 0.7 | 0.3
10.2   | ग्रामीण विकास योजनाओं            | 1   | 0.6 | 0
10.3   | ग्रामीण विकास योजनाएँ             | 0.8 | 0.6 | 0.2
11.1   | प्रमुख कृषि केंद्र                  | 0.5 | 0.4 | 0
11.2   | प्रमुख कृषि केन्द्रों                | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (x-axis: query number, 1.1 to 11.1; y-axis: precision, 0 to 1; series: P Google, P Bing, P Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.
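The size of this effect can be made concrete by dividing the number of documents returned for a variant by the number returned for its root form. The sketch below does this for three Bing rows of Table 4.18 (2.1/2.2, 5.1/5.2 and 8.1/8.2); the function name and the choice of rows are ours and serve only as an illustration.

```python
# (documents for root form, documents for variant form), Bing, from Table 4.18.
BING_HITS = {
    "2.1 vs 2.2": (2410, 420),
    "5.1 vs 5.2": (4030, 64),
    "8.1 vs 8.2": (1360, 800),
}


def variant_share(root_hits, variant_hits):
    """Fraction of the root-form result count that the variant still retrieves."""
    return variant_hits / root_hits


for pair, (root_hits, variant_hits) in BING_HITS.items():
    print(pair, round(variant_share(root_hits, variant_hits), 2))
```

A share well below 1 means that a user who types the inflected form sees only a small part of what the root form would have retrieved.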

It has also been observed that only Google shows listing of morphological

variants of root words where as Bing and Guruji show only listing of root word

supplied in almost all the sample queries listed above in the table

From the above results it is evident that only Google indexes the documents

keyword in their root form Bing and Guruji do not index in that form that is the

reason number of documents retrieved in their case is less in comparison to

Google The overall comparison of results from the three search engines in tables

above show that in general the quantity of results retrieved increased when the

keywords are used in their root form In case of search engines the quality of

results is more important than the quantity Figure 41and table 419 shows the

comparison of the precision values of the three search engines The precision

value is calculated by taking the top 10 results of the search engines On closely

observing the results we can say that precision value in case of Google is high in

almost all queries As mentioned above Google does its indexing in the root form

of keywords it can be said that that relevancy of the results is also high in Google

in comparison to other two search engines which denotes that not only quantity

but the quality of results is also affected by the morphological variations in the

keywords
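The precision measure used throughout this chapter can be sketched in a few lines. This is a minimal illustration, not the tool used in the study: the function name `precision_at_k` and the convention of dividing by k even when fewer than k results are returned are assumptions made here.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that were judged relevant.

    `relevance` is a list of booleans in rank order (True = relevant).
    Assumption: missing results (fewer than k returned) count as
    non-relevant, so the denominator is always k.
    """
    top = relevance[:k]
    return sum(top) / k


# Nine of the first ten results judged relevant.
judged = [True] * 9 + [False]
print(precision_at_k(judged))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns only a handful of documents, which matches the way very small result sets are scored low in the tables.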

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any spelling of a keyword, and the search engine should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

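Table 4.20 Results of the search engine on the phonetic nature of Hindi language

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
Query variant submitted to Google | No. of results | Precision@10
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
Query variant submitted to Google | No. of results | Precision@10
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
Query variant submitted to Google | No. of results | Precision@10
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
Query variant submitted to Google | No. of results | Precision@10
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
Query variant submitted to Google | No. of results | Precision@10
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5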

Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing a keyword to fetch Hindi web data. For every phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved, with minimal repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is the same or nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
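One way to see why spelling variants fragment the results is to map each variant to a common key. The sketch below is only an illustration under stated assumptions: the function `phonetic_key` and its two rules (treat chandrabindu as bindu, ignore the nuqta) are hypothetical simplifications chosen here, not the normalization used by any search engine or by the software described in this thesis.

```python
import unicodedata

CHANDRABINDU = "\u0901"  # nasalization sign (moon and dot)
ANUSVARA = "\u0902"      # bindu / anusvar (dot)
NUKTA = "\u093c"         # subscript dot used for borrowed sounds


def phonetic_key(word):
    """Map common spelling variants of a Hindi word to one key.

    Assumed rules: chandrabindu is treated as bindu, and the nuqta
    is dropped, so that variants differing only in these marks collide.
    """
    # NFD splits precomposed nuqta letters (e.g. U+095B) into base + nuqta.
    decomposed = unicodedata.normalize("NFD", word)
    out = []
    for ch in decomposed:
        if ch == NUKTA:
            continue
        if ch == CHANDRABINDU:
            ch = ANUSVARA
        out.append(ch)
    return "".join(out)


variants = ["महँगाई", "महंगाई"]
print({phonetic_key(v) for v in variants})
```

A key of this kind could be used to group variants before querying; it deliberately ignores harder cases such as vowel-length differences or dropped conjuncts.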

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below show the comparison of precision values across the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S. No. | Query | Google: documents returned | Google: Precision@10 | Bing: documents returned | Bing: Precision@10 | Guruji: documents returned | Guruji: Precision@10

1. Query: स न क आब षण (standard Hindi word and synonyms: आब षणगहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2. Query: क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3. Query: सतर सशिकतकयण (standard Hindi word and synonyms: सतर न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4. Query: मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5. Query: व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6. Query: आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines

From the examples above it is observed that using synonyms of Hindi keywords improves information retrieval for a Hindi query. Not only the quantity of documents returned but also their quality is affected by the use of synonyms.

The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be a smaller document collection or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It thus becomes clear that replacing a keyword with its synonym can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
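The idea of issuing one query per synonym can be sketched as follows. This is a hypothetical illustration: the function `expand_with_synonyms`, the tiny `SYNONYMS` dictionary and the sample query are assumptions made for the example, not the lexicon or code used in the experiments.

```python
def expand_with_synonyms(query, synonyms):
    """Return the original query plus one variant per known synonym.

    Each variant replaces exactly one keyword; the original query
    always comes first and duplicates are skipped.
    """
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            variant = " ".join(words[:i] + [alt] + words[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


# Illustrative synonym list for the word "ornament" mentioned in the text.
SYNONYMS = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}

for variant in expand_with_synonyms("सोने के आभूषण", SYNONYMS):
    print(variant)
```

Each variant would then be submitted separately and its top results judged, as was done manually for the rows of Table 4.21.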

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.


Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. The results make it clear that all the search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art" rather than the other context (snake charmer's snake head). However, Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address ambiguity in keywords in order to obtain better results.
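The "deviation from relevance" read off the ambiguity tables can be expressed as simple shares of the judged results. The sketch below is an illustration only; the function name `context_shares` and the choice to treat the first listed context as the intended one are assumptions. The counts are those reported for query 5 in the three ambiguity tables.

```python
def context_shares(intended, alternate, other):
    """Share of judged results falling in each context.

    Returns None when no results were judged (e.g. nothing retrieved).
    """
    total = intended + alternate + other
    if total == 0:
        return None
    return {
        "intended": intended / total,
        "alternate": alternate / total,
        "other": other / total,
    }


# Query 5 (ambiguous keyword with contexts "aggregate" / "family"),
# counts as (intended, alternate, other) among the top 10 results.
query5 = {"Google": (2, 3, 5), "Bing": (0, 2, 8), "Guruji": (0, 0, 10)}
for engine, counts in query5.items():
    print(engine, context_shares(*counts))
```

Under this reading, only the "intended" share counts toward precision for the user's actual information need; the other two shares measure how far the engine drifted.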

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word "exercise" is written in Hindi as एकसयस इज. The word "exercise" (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the phonetic variations shown in this example, a variety of popular keywords and queries from various domains were tested in the experiments.

The table below shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

From the above table it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords. People evidently have their own ways of writing these words in Hindi, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the first Hindi form in each row is the keyword obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. Even so, 1540 documents are retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, without changing the meaning of the original Hindi query.

Example: the English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
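The two-level transformation described above can be sketched as a simple word substitution. This is a hypothetical illustration and not the software discussed in the next chapter: the function `transform_levels`, the `LOANWORDS` dictionary and the particular Devanagari spellings of "foreign" and "investment" are assumptions made for the example.

```python
def transform_levels(query, loanwords):
    """Return (level1, level2) variants of a Hindi query.

    Level 1 replaces only the first keyword that has an English
    equivalent written in Hindi script; level 2 replaces every such
    keyword. Words without an entry are left unchanged.
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in loanwords]
    if not replaceable:
        return query, query
    level1 = list(words)
    first = replaceable[0]
    level1[first] = loanwords[words[first]]
    level2 = [loanwords.get(w, w) for w in words]
    return " ".join(level1), " ".join(level2)


# Illustrative mapping: Hindi keyword -> English word in Hindi script.
LOANWORDS = {"विदेशी": "फारेन", "निवेश": "इनवेस्टमेंट"}

level1, level2 = transform_levels("विदेशी निवेश भारत में", LOANWORDS)
print(level1)
print(level2)
```

When a query contains only one replaceable keyword, both levels coincide, so only one extra variant would be submitted in that case.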

Table 4.27 Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Queries transformed into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each showing the original query, Level 1 and Level 2)

From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of documents retrieved is considerable. For search engines, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values (tabular representation)

Query | Form | Query text | Precision@10: Google | Precision@10: Bing | Precision@10: AltaVista
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1.0 | 1.0 | 1.0
4 | Hindi query | सयक यदव य य जग यम जन | 1.0 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 1 | कयपशनभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 2 | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0

Figure 4.5 Analysis of inter-precision values (Inter Precision Chart; HQ = Hindi Query, L = Level; series: Google, Bing, AltaVista)

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the retrieved data. The relevance of the retrieved data can be seen in the precision columns. For a Hindi query and its transformed variations, the degree of relevance of the documents is generally very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; a similar pattern holds for most of the other Hindi queries shown in the table above. Without transformation of Hindi queries the user may miss relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.

The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and its users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level could cause performance and throughput problems. However, this problem can be solved at the interface level. Therefore, to reduce the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].

this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants. Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".

Occasionally the schwa will not be totally deleted but will be very slightly pronounced.

4.2.8 Schwa Pronunciation in Context

The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the English "e" in the word "bed", but only in certain contexts, namely when अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".

4.2.9 Nasalization of Vowels

All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated by either the symbol "ं" or the symbol "ँ". The former symbol is called bindu ("dot") and the latter symbol is called chandrabindu ("moon and dot"). The bindu is used when part or all of the vowel symbol extends above the horizontal line. The chandrabindu is used when no part of the vowel symbol extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.

The following examples summarize the use of the bindu and chandrabindu:

अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं

A special diacritic is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.

4.2.10 Consonants: Velar Consonants

Letter  Description
क  unaspirated "k"
ख  aspirated "k"
ग  unaspirated "g"
घ  aspirated "g"
ङ  "n" as in "sing"

Table 4.3 Consonants: Velar Consonants

Note that the velar nasal consonant does not appear as the first letter of any word.

4.2.11 Palatal Consonants

Letter  Description
च  un-aspirated "ch" as in "cheese"
छ  aspirated "ch"
ज  un-aspirated "j"
झ  aspirated "j"
ञ  "n" as in "punch"

Table 4.4 Palatal Consonants

4.2.12 Retroflex Consonants

Letter  Description
ट  like "t" but retroflex and un-aspirated
ठ  like "t" but retroflex and aspirated
ड  like "d" but retroflex and un-aspirated
ढ  like "d" but retroflex and aspirated
ण  like "n" but retroflex

Table 4.5 Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as the "t" in the word "better" or the "d" in "bedding" in American English. The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

Letter  Description
त  like "t" but dental and un-aspirated
थ  like "t" but dental and aspirated
द  like "d" but dental and un-aspirated
ध  like "d" but dental and aspirated
न  like "n" in "name" but dental

Table 4.6 Dental Consonants

4.2.14 Labial Consonants

Letter  Description
प  like "p" but un-aspirated
फ  like "p" but aspirated
ब  like "b" but un-aspirated
भ  like "b" but aspirated
म  "m"

Table 4.7 Labial Consonants

4.2.15 Semivowels

Letter  Description
य  "y" as in "young"
र  like "r" but often rolled
ल  "l" as in "lip"
व  either "w" or "v"

Table 4.8 Semivowels

The Hindi "r" sound is typically a flap. However, some speakers may trill the "r" sound occasionally, or may even occasionally pronounce it closer to an unflapped approximant sound, as in the English "r" in "red".

4.2.16 Sibilants

Letter  Description
श  "sh" as in "shave"
ष  like "sh" but retroflex
स  "s" as in "save"

Table 4.9 Sibilants

4.2.17 Glottal

Letter  Description
ह  like "h" but voiced

Table 4.10 Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of sounds: sounds from different phonemes can produce a difference in meaning, whereas sounds within the same phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the "p" in the word "pit" and in the word "spit" are pronounced distinctly: the former is aspirated, the latter is unaspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from Greek, "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce the distinction fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.

Capital Letters

Devanagari has no capital letters.

Special Maatraa Forms of उ and ऊ with र

र + उ = रु
र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter  Approximation
क़  like "k" but pronounced in the back of the mouth
ख़  velar fricative, like "Bach" in German
ग़  velar sound similar to ख़ but voiced
ज़  just as English "z" as in "zoo"
झ़  similar to the "s" in English "vision"
फ़  just as English "f"

Table 4.11 Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.

4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ + त = फ्त (मुफ्त)
क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.

4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above the horizontal line to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
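The homorganic rule for the anusvar lends itself to a small worked example. The sketch below is an illustration written for this section, not part of any library: the function names and the table `VARGAS` are assumptions, and only the five stop classes described in the consonant tables above are handled.

```python
ANUSVARA = "\u0902"
HALANT = "\u094d"

# (first consonant, last consonant, class nasal) for each stop class.
VARGAS = [
    ("\u0915", "\u0919", "\u0919"),  # velar: ka..nga
    ("\u091a", "\u091e", "\u091e"),  # palatal: cha..nya
    ("\u091f", "\u0923", "\u0923"),  # retroflex: tta..nna
    ("\u0924", "\u0928", "\u0928"),  # dental: ta..na
    ("\u092a", "\u092e", "\u092e"),  # labial: pa..ma
]


def homorganic_nasal(consonant):
    """Return the nasal of the class that `consonant` belongs to, or None."""
    for first, last, nasal in VARGAS:
        if first <= consonant <= last:
            return nasal
    return None


def expand_anusvara(word):
    """Rewrite each anusvar as an explicit nasal + halant where a class applies."""
    out = []
    for i, ch in enumerate(word):
        if ch == ANUSVARA and i + 1 < len(word):
            nasal = homorganic_nasal(word[i + 1])
            if nasal is not None:
                out.append(nasal + HALANT)
                continue
        out.append(ch)
    return "".join(out)


print(expand_anusvara("हिंदी"), expand_anusvara("अंडा"))
```

Applying such a rewrite to both documents and queries would let the two spellings of the same word match; an anusvar before a semivowel, sibilant or ह is left untouched because no class nasal applies there.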

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)

Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष

त + त = त्त

त + र = त्र

ज + ञ = ज्ञ

द + द = द्द

द + ध = द्ध

द + य = द्य

द + व = द्व

श + र = श्र

ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct. There are hundreds of conjuncts, but most are easily discernible.

Punctuation

Hindi has one native punctuation sign, the viraam (।), a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam [59][60].
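For retrieval purposes, the practical consequence is that a sentence splitter for Hindi text has to accept both the viraam and the borrowed marks as sentence terminators. The sketch below is one simple way to do this with a regular expression; the function name is an assumption for illustration, and the rule (terminator followed by whitespace) is deliberately naive.

```python
import re

DANDA = "\u0964"  # DEVANAGARI DANDA, the viraam

# A boundary is whitespace that follows a viraam, full stop,
# question mark or exclamation mark.
_BOUNDARY = re.compile(r"(?<=[\u0964.?!])\s+")


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


sample = "राम घर गया" + DANDA + " क्या सीता आई? हाँ."
for sentence in split_sentences(sample):
    print(sentence)
```

Abbreviations and numbers containing periods would be split incorrectly by this rule, so it should be read as a starting point only.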

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, a 7-bit encoding of the characters used in English (it can represent at most 128 characters). Over time, the need was felt for a single encoding with enough room to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns a unique number, called a code point, to each character. Its code space runs from U+0000 to U+10FFFF (over 1.1 million code points), which is enough to cover the characters of all known languages.
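As a small illustration of "a number for each character", the following sketch prints the code point and the standard character name for a Latin letter and a Devanagari letter. It uses only Python's built-in `ord` and the standard `unicodedata` module; the helper name `describe` is an assumption for this example.

```python
import unicodedata


def describe(ch):
    """Return the code point (in U+XXXX notation) and Unicode name of ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


for ch in ["A", "\u0939"]:  # Latin capital A and Devanagari letter HA
    print(ch, *describe(ch))
```

The Latin letter falls inside the 128 values that ASCII can represent, while the Devanagari letter does not, which is why a larger code space is required.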

There is also another international standard, ISO 10646, from the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit code units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. It transforms all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format (the Unicode Standard also expands it as Unicode Transformation Format).
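The trade-off between the three encoding forms can be made concrete by encoding the same text in each and comparing sizes. The sketch below uses Python's built-in codecs; the little-endian variants are chosen only so that no byte order mark is added to the counts, and the helper name `encoded_sizes` is an assumption for this example.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding form."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word Hindi in Devanagari, 6 code points

print(encoded_sizes("Hindi"))  # ASCII text
print(encoded_sizes(hindi))    # Devanagari text

# All three forms are lossless: decoding returns the original text.
for name in ENCODINGS:
    assert hindi.encode(name).decode(name) == hindi
```

For Devanagari text, UTF-16 is the most compact of the three (two bytes per character), UTF-8 needs three bytes per character, and UTF-32 always needs four.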

4.3.2 UTF-8

UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme. The original design allowed sequences of 2 to 6 bytes, but the current definition limits a sequence to 4 bytes, and the most commonly used characters, including all of Devanagari, need at most three. Non-ASCII characters are encoded as follows.

A non-ASCII character is encoded as a sequence of several bytes, each of which has its most significant bit set. This means that all bytes representing non-ASCII characters are invalid under the ASCII encoding (since every ASCII character is stored in a byte whose most significant bit is not set). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence indicates how many bytes follow for this character. The further bytes in the sequence, together with the spare bits of the first byte, encode the actual character [61].
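To make the byte layout concrete, here is a hand-written encoder for a single code point, checked against Python's built-in UTF-8 codec. It is a teaching sketch under stated assumptions: the function name is invented for this example, it follows the current 4-byte limit, and it does not reject surrogate code points as a production encoder would.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        # ASCII: one byte, most significant bit clear.
        return bytes([cp])
    if cp < 0x800:
        # Lead byte 110xxxxx announces one continuation byte.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Lead byte 1110xxxx announces two continuation bytes.
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Lead byte 11110xxx announces three continuation bytes.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


ha = 0x0939  # DEVANAGARI LETTER HA
encoded = utf8_encode(ha)
print(encoded.hex(" "))

# Every byte of a non-ASCII character has its top bit set.
assert all(b & 0x80 for b in encoded)
# The result agrees with the built-in codec.
assert encoded == chr(ha).encode("utf-8")
```

Each continuation byte has the form 10xxxxxx, so a decoder that lands in the middle of a sequence can tell that it is not at a character boundary.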

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound a). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch of this family includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
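The role of the virama can be seen directly in how a conjunct is stored: it is not a separate character but the sequence consonant, virama, consonant, which the font then renders as one shape. The short sketch below checks this with Python's standard `unicodedata` module; the helper name `make_conjunct` is an assumption for illustration.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari virama (halant)


def make_conjunct(first, second):
    """Join two consonants by suppressing the inherent vowel of the first."""
    return first + VIRAMA + second


ksha = make_conjunct("\u0915", "\u0937")  # KA + virama + SSA

print(unicodedata.name(VIRAMA))           # official character name
print(unicodedata.combining(VIRAMA) > 0)  # True: it is a combining mark
print(ksha, len(ksha))                    # one visible conjunct, three code points
```

For a search system this means that matching must operate on the stored code point sequence, not on the number of visible glyphs.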

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. For this reason the Unicode Standard covers the principles of the Indic scripts in some detail in its introduction to the Devanagari script; its introductions to the other Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
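Because the nine scripts share one layout, a character can be mapped to its counterpart in another script by adding the difference between the block starting points. The sketch below illustrates the idea; the table of block starts reflects the Unicode code charts, while the function name and the decision to leave all other characters untouched are assumptions for this example. It is not a full transliterator: some positions are unassigned in some scripts, so the result may contain code points with no character.

```python
# Starting code point of each 128-position script block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text, target, source="devanagari"):
    """Move characters of the source block to the same offset in the target block."""
    start = BLOCK_START[source]
    delta = BLOCK_START[target] - start
    return "".join(
        chr(ord(ch) + delta) if start <= ord(ch) < start + 0x80 else ch
        for ch in text
    )


ka = "\u0915"  # DEVANAGARI LETTER KA
print(ka, shift_script(ka, "bengali"), shift_script(ka, "gujarati"))
```

This positional mapping is what the "one-to-one mappings between analogous coding positions" in the paragraph above amounts to in practice; real transliteration additionally has to handle letters that exist in one script but not in another.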

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, "Extended Character Set for Vedic". Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
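Logical order has a visible consequence: in the word Hindi, the short i vowel sign is drawn to the left of its consonant but stored after it. The sketch below lists the stored sequence of that word and labels each character with one of the types named above. The code point ranges are a simplified reading of the Devanagari code chart and the function name is an assumption for this example; signs added to the block in later versions of the standard fall under "other".

```python
def classify(ch):
    """Label a Devanagari character with a simplified character type."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if cp == 0x094D:
        return "virama"
    return "other"


# The word Hindi as stored: HA, vowel sign I, NA, virama, DA, vowel sign II.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch in word:
    print(f"U+{ord(ch):04X}", classify(ch))
```

The output follows pronunciation order (h, i, n, d, ī) rather than left-to-right visual order, which is the property a tokenizer or stemmer for Hindi can rely on.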

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgradations of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier 7-bit system could represent only 128 characters, which is not enough for the Devanagari script used for Hindi; Unicode, even in its original 16-bit design, had room for more than 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support for 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language:

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers now have a Web presence. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of whom about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora totalling more than 45 million words in the Scheduled Languages of India, an effort that is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in translation for Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of "One Million Pages" of knowledge-based text. Mahatma Gandhi International University has started the project "Hindi Samghraha" to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Efforts in this field have been under way since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. In the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and a Sanskrit-Hindi MT system under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharati

IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004, with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent the English as well as the Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was only for English to Hindi translation, but it is now also available for translation from English into other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with a system-generated translation, the user can supply new words, phrases and sentences to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based approach with statistical methods. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine Guruji

Guruji.com is the first Indian language search engine. It was founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the support of Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides an automated translation facility from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasha is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language in WikiBhasha is English, and the target language can be any Indian local language.

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 116

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development in corpus creation/collection and content syndication. Moreover, it provides a language consultancy service. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also making efforts towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides the standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario. The following fall into the localized Indian search engine category:

Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. Merely saying that there is a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16 Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are rarely followed, for various reasons. One of the reasons is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in dialect changes not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses maatraas (ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
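The stemming step described above can be sketched as simple suffix stripping. This is only a minimal illustration: the suffix list `SUFFIXES` and the function `stem` are hypothetical names introduced here, the list covers just a few plural/oblique endings, and a practical Hindi stemmer would need a far larger rule set.

```python
# Illustrative inflectional suffixes (an assumption of this sketch, not a full list).
SUFFIXES = ("ओं", "एं", "एँ", "ों", "ें")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least two characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# The three forms of "Yojnaa" from the text reduce to the same root word.
for form in ("योजना", "योजनाओं", "योजनाएं"):
    print(form, "->", stem(form))
```

Indexing documents and queries under `stem(word)` rather than the surface form would let a query with any one variant match documents containing the others.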

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. The variety in the alphabet, the different dialects, and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.

There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र, सक र, सक र). When information is searched for the standard keyword for school (सक र) and for a non-standard phonetically equivalent keyword (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by the other regional languages, which results in phonetic variety of words. For example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should be able to support all phonetically equivalent words.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with very little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
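One way to make phonetically equivalent spellings match is to map each word to a normalized key before indexing and querying. The sketch below is an assumption-laden illustration, not a method taken from this study: the function name `phonetic_key` and the particular rules (dropping the nukta, treating chandrabindu and half nasal consonants as anusvara, and conflating long and short i/u vowel signs) are choices made here for demonstration, and they will over-merge some genuinely different words.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ् (halant)

# A nasal consonant (ङ ञ ण न म) followed by a virama, i.e. a half nasal.
NASAL_CLUSTER = re.compile("[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA)


def phonetic_key(word: str) -> str:
    """Return a spelling-insensitive key for a Devanagari word."""
    text = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)  # ँ -> ं
    text = NASAL_CLUSTER.sub(ANUSVARA, text)     # न् + consonant -> ं + consonant
    text = text.replace("\u0940", "\u093f")      # ी -> ि
    text = text.replace("\u0942", "\u0941")      # ू -> ु
    return text


# Three common spellings of "aandolan" share a single key.
variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
print(len({phonetic_key(v) for v in variants}))
```

Grouping index terms by such a key trades some precision for recall, which is the trade-off this section argues search engines need to address.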

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
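A retrieval system can compensate for synonym choice by expanding the query. The following is a minimal sketch under stated assumptions: the dictionary `SYNONYMS` holds only the ornament example from the text, the sample query is illustrative, and `expand_query` is a hypothetical helper, not part of any search engine discussed here.

```python
# Toy synonym dictionary built from the ornament example (an assumption of this sketch).
SYNONYMS = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}


def expand_query(query: str, synonyms=SYNONYMS) -> list:
    """Return the original query followed by one variant per synonym substitution."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variants.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return variants


for variant in expand_query("सोने के आभूषण"):
    print(variant)
```

Each variant can be submitted separately and the result lists combined, so that documents using any of the synonyms are reachable from one user query.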

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango. So the query can also be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning of a word depends on the context of the sentence. For example, कर (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कर means hand or arm (हसत फ ह आच शफय); and in yet another context कर means to do (करना).
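A simple way to pick a sense from context is to count overlaps between the other query words and cue words associated with each sense (a simplified Lesk-style heuristic, named here as a swapped-in illustration rather than a technique used in this study). The sense inventory `SENSES` and its cue words are hypothetical and hand-made for the example.

```python
# Hypothetical sense inventory: each sense of a keyword lists a few cue words.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "धातु", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर", "आराम"},
    }
}


def disambiguate(keyword, context_words, senses=SENSES):
    """Return the sense whose cue words overlap the context most, or None if none do."""
    context = set(context_words)
    scores = {sense: len(cues & context) for sense, cues in senses[keyword].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("सोना", ["गहना", "कीमत"]))
print(disambiguate("सोना", ["रात", "नींद"]))
print(disambiguate("सोना", ["नारी", "पसंद"]))
```

The last call returns `None` because the query "Women like gold" carries no cue for either sense, which is exactly why such queries remain ambiguous for a search engine.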

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, and it has affected the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director, department, etc. are used even by uneducated Indians without being aware of which language those words belong to. Most Indians use these words in English rather than in their native language. English has its influence over Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from the set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Birds' species
6.2 | ऩकषमोकीपरज ततम | Birds' species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S.No. | Root word(s) | Listing of keyword morphological variants (Google, Bing, Guruji)
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन
3 | ब ष | ब ष ब ष ओ ब ष ब ष
4 | झ र | झ रझ रो झ र झ र
5 | आऩद | आऩद आऩद ओ आऩद आऩद
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
7 | सभसम | सभसम ए सभसम ओ सभसम सभसम सभसम
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक
9 | य ग | य गोय ग य ग य ग
10 | म जन | म जन ओ म जन म जन म जन
11 | क दर | क दरीमक दर क दर क दर

Documents returned:

S.No. | Keyword(s) | Google | Bing | Guruji
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines

S.No. | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (x-axis: query number; y-axis: precision, 0 to 1; series: P Google, P Bing, P Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word because, in general, better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why the number of documents they retrieve is smaller than for Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, it can be said that the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of the results.
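Precision over the top 10 results, as used above, is the number of relevant documents among the first ten returned, divided by ten. The sketch below is one possible way to compute it; the function name `precision_at_k` is introduced here, the relevance judgements are supplied by hand, and dividing by k even when fewer than k results are returned is an assumption of this sketch, since the text does not state how such cases were handled.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a ranked list of truthy/falsy judgements, best result first.
    Missing results (fewer than k returned) count as non-relevant.
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


# Five relevant documents among the first ten gives a precision of 0.5.
judgements = [True, True, False, True, False, True, False, False, True, False]
print(precision_at_k(judgements))
```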

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should be able to support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

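Table 4.20 Results of the search engine on phonetic nature of Hindi language

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
Google query having keywords | No. of results | Precision@10
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
Google query having keywords | No. of results | Precision@10
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
Google query having keywords | No. of results | Precision@10
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
Google query having keywords | No. of results | Precision@10
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
Google query having keywords | No. of results | Precision@10
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5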

Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query: Query No. 1 to Query No. 5)

In the above table and figure it can be clearly seen that search engines return a handful of documents for various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with very little overlap. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
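Because each phonetic variant retrieves a largely different set of documents, the result lists for all variants of a query can be combined into one list. The following is a minimal sketch of such a merge, assuming each result is identified by a URL string; `merge_results` is a hypothetical helper that interleaves the ranked lists and drops repeats, and it is not the software developed in this study.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked result lists rank by rank, keeping each URL only once."""
    merged, seen = [], set()
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Placeholder URLs standing in for the results of two spelling variants.
variant_a = ["doc-a", "doc-b"]
variant_b = ["doc-b", "doc-c", "doc-d"]
print(merge_results([variant_a, variant_b]))
```

Interleaving by rank keeps the top result of every variant near the top of the merged list, so no single spelling dominates.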

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, show the comparison of precision values across the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

Query groups (S.No. | Query | Standard Hindi words and synonyms):
1 | स न क आब षण | आब षण गहन
2 | क र फ दर | फ दर
3 | सतर सशिकतकयण | सतर न यी
4 | मसक दयक अ हक य | अ हक य
5 | व रग ओ | व
6 | आ खद न | आ ख

Documents returned and precision:

S.No. | Query | Google | Precision@10 | Bing | Precision@10 | Guruji | Precision@10
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines (x-axis: query number, 1.1 to 6.3; y-axis: precision, 0 to 1; series: Google, Bing, Guruji)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better-quality data than Guruji. In the average case Google still stands first, which means that the precision values for Google are higher than those for Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Therefore it is evident that synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the query in English with respect to the ambiguous keyword in each context.

Table 4.22 List of randomly selected ambiguous queries

S.No. | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in the table above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | NA | Snake head | NA | NA
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the tables, the last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unintended context. The results make it clear that all the search engines return documents in different contexts. Therefore it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing, and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers), whose other context is "snake charmer's snake head", the retrieved documents are expected to be in the context "art". From the results above, however, it can be seen that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

English has its influence over Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: the English word exercise is written in Hindi as एकसयस इज. The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English Words

Google Transliteration

Standard Hindi Keywords

Phonetic Equivalents Search Engine Google

Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800

Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710

Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820

Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100

Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000

0 537000

University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270

Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300

Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160

Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट

3240 38000 639 1120

specialist टऩशसअशरटट

सऩ मशममरसट

सऩ शमरसट सऩ श मरसट

143 1440 51300 9820

Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8

Table 4.26: Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents for a single-keyword query, and that documents are also returned for every phonetic variant of the keyword, often in very large numbers. It can also be seen that people have their own ways of representing Hindi words, and that no standard is followed for storing Hindi data on the web. Documents are likewise retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table above that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be concluded that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

Table 4.26 (continued): Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query. The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, again without changing the meaning of the original Hindi query.

Example: An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" ("प य न") and तनव श means "investment" ("इनव सटभट"). The query is transformed at the two levels as:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
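The two-level transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `transform_query`, the `EQUIVALENTS` lookup and the Devanagari spellings used in it are hypothetical and are not taken from the software described in the next chapter.

```python
from itertools import combinations

# Hypothetical lookup: Hindi keyword -> English equivalent written in Devanagari.
EQUIVALENTS = {
    "विदेशी": "फॉरेन",          # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",     # nivesh  -> "investment"
}


def transform_query(query, equivalents=EQUIVALENTS):
    """Return (level1, level2) lists of query variants.

    Level 1 replaces exactly one Hindi keyword with its English equivalent;
    level 2 replaces more than one keyword.
    """
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in equivalents]

    def variants(n):
        # All variants in which exactly n replaceable keywords are swapped.
        result = []
        for positions in combinations(replaceable, n):
            new_tokens = list(tokens)
            for i in positions:
                new_tokens[i] = equivalents[tokens[i]]
            result.append(" ".join(new_tokens))
        return result

    level1 = variants(1)
    level2 = [v for n in range(2, len(replaceable) + 1) for v in variants(n)]
    return level1, level2


# "videshi nivesh Bharat mein" (foreign investment in India)
level1, level2 = transform_query("विदेशी निवेश भारत में")
```

Note that this sketch enumerates every single-keyword replacement at level 1, whereas the example in the text shows only one level-1 variant per query; which variants to submit to the search engine is a design choice left open here.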

Figure 4.4: Transformed queries into two equivalent senses [chart of result counts for Hindi Queries 1 to 6, each shown as the original Hindi query, Level 1 and Level 2]

Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28: Analysis of precision values (tabular representation); Precision@10 for each Hindi query and its influenced variants

Query | Form | Google | Bing | AltaVista
सव सथऔययकतद न | Hindi query | 0.9 | 0.9 | 0.9
ह लथऔययकतद न | Level 1 | 0.9 | 0.8 | 0.9
ह लथऔयबरड_ड न शन | Level 2 | 0.9 | 0.8 | 0.8
यकतच ऩक इर ज | Hindi query | 0.8 | 0.8 | 0.8
बरडपर शयक इर ज | Level 1 | 0.9 | 0.7 | 0.7
बरडपर शयक टरीटभट | Level 2 | 0.7 | 0.5 | 0.5
रदमचचककतसक | Hindi query | 0.9 | 0.9 | 0.9
ह टमचचककतसक | Level 1 | 0.8 | 0.7 | 0.7
क रड मम र िजसट | Level 2 | 1 | 1 | 1
सयक यदव य य जग यम जन | Hindi query | 1 | 0.8 | 0.6
सयक यदव य य जग यसकीभ | Level 1 | 0.9 | 0.8 | 0.7
सयक यदव य एमपरॉमभटसकीभ | Level 2 | 0.8 | 0.8 | 0.8
षवद श तनव शब यतभ | Hindi query | 0.9 | 0.9 | 0.9
प य नतनव शब यतभ | Level 1 | 0.8 | 0.8 | 0.9
प य नइनव सटभटब यतभ | Level 2 | 0.8 | 0.4 | 0.4
भरषटट च यभ कतब यत | Hindi query | 1 | 1 | 1
कयपशनभ कतब यत | Level 1 | 1 | 1 | 1
कयपशनफरीइ रडम | Level 2 | 1 | 1 | 1

Figure 4.5: Analysis of Inter precision values

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined out against Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can also be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 out of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, since a Hindi user may not be aware of the presence of such information on the web and may be unable to formulate the query variation that reflects English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
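The precision figures used above are precision at the first ten results. As a minimal sketch, assuming the relevance of each retrieved document has already been judged by hand (the function name `precision_at_k` and the sample judgments are illustrative, not taken from the study):

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k retrieved documents that are relevant.

    `relevance` is a list of booleans in rank order, one per retrieved
    document (True = judged relevant).
    """
    top_k = relevance[:k]
    if not top_k:
        return 0.0
    # Dividing by k (not len(top_k)) penalizes result lists shorter than k.
    return sum(top_k) / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
judgments = [True] * 9 + [False]
score = precision_at_k(judgments)
```

With nine relevant documents among the first ten, the function returns 0.9, which corresponds to the entries written as 0.9 in the precision table.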

[Figure 4.5 chart: Inter Precision Chart for HQ1 to HQ6, each with L1 and L2 (HQ = Hindi Query, L = Level); series: Google, Bing, AltaVista]

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

The relevant information can be mined out by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].

extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.

The following examples summarize the use of the bindu and chandrabindu

अ आ इ ईउ ऊ ए ऐ ओ औ

क क कक की क क क क कोकौ

A special diacritic is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.

4.2.10 Consonants: Velar Consonants

Letter  Description
क  unaspirated k
ख  aspirated k
ग  unaspirated g
घ  aspirated g
ङ  n as in sing

Table 4.3: Consonants: Velar Consonants

Note that the velar nasal consonant does not appear as the first letter of any word.

4.2.11 Palatal Consonants

Letter  Description
च  unaspirated ch as in cheese
छ  aspirated ch
ज  unaspirated j
झ  aspirated j
ञ  n as in punch

Table 4.4: Palatal Consonants

4.2.12 Retroflex Consonants

Letter  Description
ट  like t but retroflex and unaspirated
ठ  like t but retroflex and aspirated
ड  like d but retroflex and unaspirated
ढ  like d but retroflex and aspirated
ण  like n but retroflex

Table 4.5: Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as in the American English pronunciation of the "t" in "better" or the "d" in "bedding". The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

Letter  Description
त  like t but dental and unaspirated
थ  like t but dental and aspirated
द  like d but dental and unaspirated
ध  like d but dental and aspirated
न  like n in name but dental

Table 4.6: Dental Consonants

4.2.14 Labial Consonants

Letter  Description
प  like p but unaspirated
फ  like p but aspirated
ब  like b but unaspirated
भ  like b but aspirated
म  m

Table 4.7: Labial Consonants

4.2.15 Semivowels

Letter  Description
य  y as in young
र  like r but often rolled
ल  l as in lip
व  either w or v

Table 4.8: Semivowels

The Hindi "r" sound is typically a flap. However, some speakers may trill the "r" sound occasionally, or may even occasionally pronounce it closer to an unflapped approximant sound, as in the English "r" in "red".

4.2.16 Sibilants

Letter  Description
श  sh as in shave
ष  like sh but retroflex
स  s as in save

Table 4.9: Sibilants

4.2.17 Glottal

Letter  Description
ह  like h but voiced

Table 4.10: Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of atomic, discrete sounds: sounds belonging to different phonemes can produce a difference in meaning when spoken, whereas sounds within one phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.

Capital Letters

Devanagari has no capital letters.

Special Matraa Forms of उ and ऊ with र

र + उ = रु
र + ऊ = रू

4.2.20 Borrowed Sounds

There are six additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter  Approximation
क़  like k but pronounced in the back of the mouth
ख़  velar fricative, like "Bach" in German
ग़  velar sound similar to ख़ but voiced
ज़  just as English z as in zoo
झ़  similar to the s in English vision
फ़  just as English f

Table 4.11: Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
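Because nuqta letters are often typed without the dot, a retrieval system may want to treat the nuqta and non-nuqta spellings of a word as the same keyword. The following is a minimal sketch of that idea using Python's standard `unicodedata` module; the function name `strip_nuqta` is hypothetical, and whether such folding is desirable depends on the application.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nuqta(text):
    """Fold nuqta letters onto their base consonants (e.g. ज़ -> ज)."""
    # NFD splits precomposed nuqta letters (such as U+095B) into
    # base consonant + U+093C, so one replace handles both encodings.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


# "zyada" typed with a nuqta folds to the spelling without it.
with_nuqta = "\u091c\u093c\u094d\u092f\u093e\u0926\u093e"   # ज़्यादा
without_nuqta = strip_nuqta(with_nuqta)                      # ज्यादा
```

Folding is lossy by design: it trades the distinction between the two spellings for a better chance of matching documents written either way.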

4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ़ + त = फ़्त (मुफ़्त)
क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.

4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, as in हिन्दी) or using an anusvar, which is a dot written above the horizontal line to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
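The homorganic rule described above is mechanical enough to sketch in code: an anusvar followed by a consonant of one of the five classes can be rewritten as that class's nasal consonant plus a halant, so that both spellings of a word map to one form. This is a heuristic sketch only; the function name `expand_anusvar` is hypothetical, and the code does not try to tell an anusvar from a bindu that merely nasalizes a vowel.

```python
VIRAMA = "\u094d"    # halant
ANUSVAR = "\u0902"   # the dot written above the line

# Nasal consonant for each consonant class (velar, palatal, retroflex,
# dental, labial), keyed by the non-nasal members of the class.
NASAL_FOR = {}
for nasal, members in (
    ("ङ", "कखगघ"),
    ("ञ", "चछजझ"),
    ("ण", "टठडढ"),
    ("न", "तथदध"),
    ("म", "पफबभ"),
):
    for consonant in members:
        NASAL_FOR[consonant] = nasal


def expand_anusvar(text):
    """Rewrite anusvar + consonant as homorganic nasal + halant + consonant."""
    out = []
    for i, ch in enumerate(text):
        following = text[i + 1] if i + 1 < len(text) else ""
        if ch == ANUSVAR and following in NASAL_FOR:
            out.append(NASAL_FOR[following] + VIRAMA)
        else:
            # Anusvar before other letters (or at the end) is left unchanged.
            out.append(ch)
    return "".join(out)


hindi = expand_anusvar("\u0939\u093f\u0902\u0926\u0940")   # हिंदी -> हिन्दी
```

Applying the same function to both the query and the indexed text would let हिंदी and हिन्दी match each other, which is one way of reducing the spelling variation discussed in this chapter.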

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)

Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit:

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct.

There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign, the viraam (।), which is a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.

[59][60]

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns a unique number (a code point) to each character; its code space runs from U+0000 to U+10FFFF, so any code point fits in a 32-bit value, which is enough to give a unique number to each character in all known languages.

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format.
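The difference between the three encoding forms can be seen by encoding the same character in each of them. The sketch below uses only Python's built-in codecs; the helper name `encoded_sizes` is introduced here for illustration.

```python
def encoded_sizes(ch):
    """Return the number of bytes `ch` occupies in each encoding form."""
    # The "-be" (big-endian) codec names are used so that no byte-order
    # mark is prepended and the sizes reflect the character alone.
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


ka = "\u0915"                  # DEVANAGARI LETTER KA
sizes_ka = encoded_sizes(ka)   # three bytes in UTF-8, two in UTF-16, four in UTF-32
sizes_a = encoded_sizes("A")   # an ASCII letter needs one byte in UTF-8

# All three forms round-trip without loss of data.
round_trips = all(ka.encode(name).decode(name) == ka
                  for name in ("utf-8", "utf-16-be", "utf-32-be"))
```

For Hindi text this means UTF-8 uses three bytes per Devanagari character, while UTF-16 uses two; the choice between them is a trade-off between ASCII compatibility and storage size.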

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes, but the current standard restricts UTF-8 to 4); the most commonly used characters are at most three bytes long. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
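How the first byte announces the length of a sequence can be sketched as follows. The function name `utf8_sequence_length` is chosen for this example; it classifies a lead byte by its bit pattern only and does not perform the full validation that a real decoder would.

```python
def utf8_sequence_length(lead_byte):
    """Return the total length in bytes of the UTF-8 sequence that
    starts with `lead_byte` (an integer from 0 to 255)."""
    if lead_byte < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC0 <= lead_byte < 0xE0:    # 110xxxxx: one continuation byte follows
        return 2
    if 0xE0 <= lead_byte < 0xF0:    # 1110xxxx: two continuation bytes follow
        return 3
    if 0xF0 <= lead_byte < 0xF8:    # 11110xxx: three continuation bytes follow
        return 4
    # 10xxxxxx is a continuation byte; 0xF8 and above are not used.
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)


encoded = "\u0915".encode("utf-8")          # DEVANAGARI LETTER KA
length = utf8_sequence_length(encoded[0])   # the lead byte announces 3 bytes
# Every byte of a non-ASCII character has its most significant bit set.
all_high_bit = all(b & 0x80 for b in encoded)
```

Because continuation bytes always start with the bits 10, a program can also find the start of a character from any position in a byte stream by skipping backwards over such bytes.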

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
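Since the Devanagari block occupies the contiguous range U+0900 to U+097F, a program can test whether a character belongs to it with a simple range check. The helper names below (`is_devanagari`, `devanagari_share`) are chosen for this sketch; such a check could be used, for example, to decide whether a query was typed in Hindi script.

```python
import unicodedata

DEVANAGARI_FIRST = 0x0900
DEVANAGARI_LAST = 0x097F


def is_devanagari(ch):
    """True if the character lies in the Devanagari block U+0900 to U+097F."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST


def devanagari_share(text):
    """Fraction of non-space characters in `text` that are Devanagari."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(c) for c in chars) / len(chars)


name_of_ka = unicodedata.name("\u0915")        # name from the Unicode database
share = devanagari_share("\u0915\u092e ab")    # two Devanagari, two Latin letters
```

The range check covers only the basic block named in the heading above; text that uses characters outside it would need a wider test.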

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0 to F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
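The practical effect of the parallel code layout can be illustrated with a small sketch: because analogous letters sit at the same offset within each script's block, shifting a code point by the distance between two blocks yields the corresponding letter in the other script. The function name `shift_block` is hypothetical, and the approach is only approximate, since not every position is assigned in every script.

```python
DEVANAGARI_BASE = 0x0900
BENGALI_BASE = 0x0980
GUJARATI_BASE = 0x0A80
BLOCK_SIZE = 0x80


def shift_block(text, source_base, target_base):
    """Map characters from one Indic block to the same offset in another."""
    out = []
    for ch in text:
        cp = ord(ch)
        if source_base <= cp < source_base + BLOCK_SIZE:
            out.append(chr(cp - source_base + target_base))
        else:
            out.append(ch)   # characters outside the source block are kept
    return "".join(out)


# DEVANAGARI LETTER KA (U+0915) maps to BENGALI LETTER KA (U+0995)
# and to GUJARATI LETTER KA (U+0A95).
bengali_ka = shift_block("\u0915", DEVANAGARI_BASE, BENGALI_BASE)
gujarati_ka = shift_block("\u0915", DEVANAGARI_BASE, GUJARATI_BASE)
```

A production transliterator would need an explicit table for positions that are unassigned or used differently in the target script; the offset arithmetic only shows why the common layout makes such tables small.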

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, "Extended Character Set for Vedic". Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.

The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
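Logical order can be observed directly by listing the code points of a word. In the sketch below the word हिन्दी is built from explicit code points so that the stored order is unambiguous; the variable names are illustrative.

```python
import unicodedata

# The word "Hindi" in Devanagari, written out as stored code points.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Character names as recorded in the Unicode character database.
names = [unicodedata.name(ch) for ch in word]

# The vowel sign I is drawn to the left of HA on screen, but it is stored
# after HA because it is pronounced after it.
index_of_ha = names.index("DEVANAGARI LETTER HA")
index_of_sign_i = names.index("DEVANAGARI VOWEL SIGN I")

# The conjunct is stored as NA + VIRAMA + DA, not as a single character.
conjunct = names[2:5]
```

The same listing shows the three character types named above: consonant letters (HA, NA, DA) and dependent vowel signs (I, II), joined where needed by the virama.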

4.4 Indian Languages on internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on

the Internet ―There was mismatch of fonts and keyboard layouts which made it

impossible to read any Hindi document if the user did not have the same fonts

There was chaos more than 50 fonts and 20 keyboards were being used and if

two users were following different styles there was no way to read the other

personlsquos documents But the advent of Unicode support for Hindi and Urdu

changed all that The concept of new character encoding from Unicode

Consortiummdasha nonprofit in California whose members include Google IBM

Oracle Microsoft Sun MicrosystemsYahooand the Government of Indiamdash

proved to be a boon for Indian languages Microsoft incorporated the Hindi

Unicode font Mangal in its operating system in 2001 ―Since then the Hindi

Unicode support has been a part of all subsequent up gradations of Microsoftlsquos

operating systems Also providing Input Method Editor Facilities give users the

option to use different types of keyboards says Meghashyam Karanam product

manager vision and localization at Microsoft India The earlier system could

incorporate only 127 characters which is not enough for the Hindi

Devnagariscript The Unicode system can incorporate up to 65000 characters As

most computers in India use Microsoftlsquos operating system it ensured that the

Hindi font was available to most of them as they upgraded the operating software

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,

Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language: "Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds. The efforts of

Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts

agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978, and was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is the nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora totalling more than 45 million words in the scheduled languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Bombay, IIIT Hyderabad, AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology, which translates from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004, with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by the Chinmaya International Foundation (CIF). It is designed as a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software translates text from one Indian language into another, based on the grammar rules of Panini's Ashtadhyayi. It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent both the English and the Hindi grammar. Initially it was domain-specific, covering personnel administration, specifically Gazette notifications, office orders, office memorandums and circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology supported only English-to-Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (the Rajya Sabha is the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applied in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with a system-generated translation, the user can supply new words, phrases and sentences to the system and obtain a revised translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti combines rule-based and statistical approaches. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net. [24]

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.aukbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation scheme for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgibin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine. It was founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it searches deep into Indian language content and tries to return the most appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasha is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language is English and the target language can be any Indian language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection, and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, and high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has long been aware of the need for language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Avinash Chopde developed the Indian Languages Transliteration scheme (ITRANS), which represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As a result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India. [64]

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario. Search engines in the localized Indian category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website      2008 Stats    2009 Stats
Google       37            35
Yahoo        32            25
Rediff       7             4
Orkut        6             7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website      2008 Stats    2009 Stats
Google       81            76
Yahoo        7             7
Wikipedia    3             6
English      3             4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website      2008 Stats    2009 Stats
Google       65            34
Yahoo        12            29
Rediff       4             15
Teluguone    2             0
Guruji       1             18
Raftaar      1             0.2
Hindi        1             NA
Webdunia     1             0.7
Khoj         0.7           2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies, using present-day popular techniques, shows a severe problem of low recall when accessing information with Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return '0' search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns '0' documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. But merely saying that there is a recall problem is not sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted a number of experiments to determine at what levels this recall problem occurs, and by how much.
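To make "quantifying the problem" concrete, recall and precision can be computed from a set of retrieved documents and a set of documents judged relevant. The Python sketch below is a minimal illustration under stated assumptions: the document identifiers are hypothetical, relevance judgments are assumed to be available, and the function names are chosen here and are not taken from the experiments described in this chapter.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the engine actually returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked results that are relevant.

    The denominator is k even when fewer than k results are returned,
    so an engine that returns nothing scores 0.
    """
    relevant = set(relevant)
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


# Hypothetical example: four relevant documents exist, but the query
# spelling used by the searcher matches only one of them.
relevant_docs = ["d1", "d3", "d4", "d5"]
returned_docs = ["d1", "d2"]

print(recall(returned_docs, relevant_docs))          # 0.25
print(precision_at_k(returned_docs, relevant_docs))  # 0.1
```

A query that returns zero results has a recall of 0 even when relevant documents exist, which is the situation described above.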

Table 4.15 Problems faced while searching in Hindi: Low recall

Hindi Query                                        Google    Yahoo    Guruji
वलडम टर ड स नटय आत कव दी हरभ                        0         0        0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम              0         0        0

Table 4.16 Improved Recall

Hindi Query                                        Google    Yahoo    Guruji
वलडम टर ड सटय आतॊकी हभर                             8820      92       12
वलडम टर ड सटय आतॊकीअटक                              331       10       1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                     708       50       1
बायतीम सॊटथान टवाटम शिा औय िोध                       37100     7400     93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are rarely followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, India has 22 scheduled languages and many more regional ones, and Hindi, as an official language of India, is influenced by the regional languages; this leads to dialectal variation not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, and Hindi uses MAATRAAS such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, 'planning' in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
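The idea of reducing morphological variants to one canonical form can be sketched as a light suffix-stripping stemmer. The following Python code is only an illustration: the suffix list is a small, hypothetical sample of Hindi plural and oblique endings chosen here, not the rule set of any stemmer evaluated in this chapter, and a practical stemmer would need many more rules and exceptions.

```python
# Hypothetical (suffix, replacement) rules for a few common plural endings.
SUFFIX_RULES = [
    ("ियों", "ी"),  # e.g. रोगियों -> रोगी
    ("ियां", "ी"),
    ("ियाँ", "ी"),
    ("ओं", ""),     # e.g. योजनाओं -> योजना
    ("एं", ""),     # e.g. योजनाएं -> योजना
    ("एँ", ""),
    ("ों", ""),     # e.g. झीलों -> झील
]
# Try longer suffixes first so that "ियों" wins over "ों".
SUFFIX_RULES.sort(key=lambda rule: len(rule[0]), reverse=True)

MIN_STEM_LENGTH = 2  # avoid stripping very short words down to nothing


def light_stem(word):
    """Return a canonical form by stripping the first matching suffix."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


for variant in ["योजना", "योजनाओं", "योजनाएं"]:
    print(variant, "->", light_stem(variant))
```

With these rules all three forms of Yojnaa map to the same root, so a search index built on stems would treat them as one keyword.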

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these result in several spellings of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations. There are numerous words that are phonetically equivalent but are written differently. The word school can be written in Hindi in different ways, the standard form being स्कूल. When information is searched for the standard keyword स्कूल and for a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words; for example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any of these forms for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e., a different set of documents is retrieved with little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
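One common way to let phonetically equivalent spellings match is to map each keyword to a normalized key before indexing and searching. The Python sketch below illustrates this with three assumed rules (drop the nukta, treat chandrabindu as anusvara, and write a half nasal consonant before a consonant of its own class as anusvara). The rule set and the function name `spelling_key` are assumptions made for this example; they are not the method used in this study, and real systems need further rules, for instance for vowel-length variation.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
HALANT = "\u094d"

# Nasal consonant -> range of consonants of the same class (varga).
HOMORGANIC = {
    "\u0919": "\u0915-\u0918",  # NGA before KA..GHA
    "\u091e": "\u091a-\u091d",  # NYA before CA..JHA
    "\u0923": "\u091f-\u0922",  # NNA before TTA..DDHA
    "\u0928": "\u0924-\u0927",  # NA  before TA..DHA
    "\u092e": "\u092a-\u092d",  # MA  before PA..BHA
}
NASAL_CLUSTER = re.compile(
    "|".join(f"{nasal}{HALANT}(?=[{rng}])" for nasal, rng in HOMORGANIC.items())
)


def spelling_key(word):
    """Map spelling variants of a Devanagari word to one comparison key."""
    text = unicodedata.normalize("NFD", word)      # split precomposed nukta letters
    text = text.replace(NUKTA, "")                 # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)    # अँ -> अं
    text = NASAL_CLUSTER.sub(ANUSVARA, text)       # अङ्ग -> अंग
    return unicodedata.normalize("NFC", text)


variants = [
    "\u0905\u0902\u0917\u094d\u0930\u0947\u091c\u093c\u0940",  # anusvara + nukta
    "\u0905\u0901\u0917\u094d\u0930\u0947\u091c\u0940",        # chandrabindu
    "\u0905\u0919\u094d\u0917\u094d\u0930\u0947\u091c\u0940",  # half NGA
]
print({spelling_key(v) for v in variants})  # a single key for all three
```

Indexing documents and queries by such a key trades a little precision for recall: variants of one word collapse together, at the risk of occasionally merging two distinct words.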

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
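A retrieval system can compensate for synonym choice by expanding each query term with its known synonyms. The sketch below is a minimal Python illustration: the synonym table holds only the ornament example from the text (in the spellings assumed here), the sample query is invented, and the AND/OR output is a generic boolean notation rather than the syntax of any particular search engine.

```python
# Tiny illustrative synonym table; a real system would use a lexical
# resource such as a WordNet for Hindi.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms):
    """Return, for each query term, the list of alternatives to search for."""
    groups = []
    for term in query.split():
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        groups.append(alternatives)
    return groups


def to_boolean_query(groups):
    """Render the expanded groups as a generic boolean query string."""
    parts = []
    for group in groups:
        parts.append("(" + " OR ".join(group) + ")" if len(group) > 1 else group[0])
    return " AND ".join(parts)


groups = expand_query("सोने के आभूषण", SYNONYMS)
print(to_boolean_query(groups))
```

Terms without an entry in the table are left unchanged, so the expansion never removes anything the user typed.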

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of the results, as the examples below show. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous because it has another meaning, i.e. to sleep. In the context of the query above, सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the query above it means common; however, in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर means tax in one context, with synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स; in another context कर means hand or arm, with synonyms हस्त, बाहु, आच शफय; and in yet another context it means to do (करना).
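The role of context in resolving such ambiguity can be illustrated with a very small overlap-based disambiguator, in the spirit of the simplified Lesk approach. Everything in the Python sketch below is an assumption made for illustration: the sense inventory and its clue words are invented here, and the scoring is deliberately naive.

```python
# Hypothetical sense inventory: each sense of an ambiguous word is
# described by a few clue words that tend to occur with it.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "धातु", "कीमत"},
        "sleep": {"नींद", "रात", "बिस्तर", "आराम"},
    },
}


def disambiguate(word, query, senses):
    """Pick the sense whose clue words overlap most with the query context.

    Returns None when the word is unknown or the context gives no
    unique winner.
    """
    context = set(query.split()) - {word}
    scores = {
        sense: len(context & clues)
        for sense, clues in senses.get(word, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


print(disambiguate("सोना", "सोना की कीमत", SENSES))          # gold
print(disambiguate("सोना", "रात को सोना", SENSES))           # sleep
print(disambiguate("सोना", "नारी को सोना पसंद है", SENSES))   # None
```

The last call shows why short queries are hard: the example query from the text contains no clue word for either sense, so the ambiguity cannot be resolved from the query alone.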

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi text on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No.    Query in Hindi                  Meaning in English
1         ब यतवष मवन                      Indian rain forest
1.1       ब यत मवष मवनो                   Indian rain forests
2         हव ईद घमटन क क यण               Reason for air crash
2.1       हव ईद घमटन ओ क क यण             Reason for air crashes
3         ब यतभफ रीज न व रीब ष             Language spoken in India
3.1       ब यतभफ रीज न व रीब ष ए           Languages spoken in India
3.2       ब यतभफ रीज न व रीब ष ओ           Languages spoken in India
4         षवर पतह न ऩयझ र                 Lake on the verge of extinction
4.1       षवर पतह न ऩयझ र                 Lakes on the verge of extinction
5         पर क ततकआऩद                     Natural calamity
5.1       पर क ततकआऩद ए                   Natural calamities
6         ऩ कीपरज तत                      Bird species
6.1       ऩकषमोकीपरज तत                   Bird's species
6.2       ऩकषमोकीपरज ततम                  Bird's species
7         क षषसभसम                        Agriculture problem
7.1       क षषसभसम ओ                      Agriculture problems
8         कीटन शकक इसत भ र                Use of pesticide
8.1       कीटन शकोक इसत भ र               Use of pesticides
9         भ नमसकय ग                       Mental illness
9.1       भ नमसकय चगमो                    Mental patients
9.2       भ नमसकय ग                       Mental Patient
10        गर भ णषवक सम जन                 Policy for village
10.1      गर भ णषवक सम जन ओ               Policies for village
10.2      गर भ णषवक सम जन ए               Policies for village
11        परभ खक षषक दर                   Major agricultural office
11.1      परभ खक षषक नदरो                 Major agricultural offices
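One simple way to read the counts in Table 4.18 is to compute, for each search engine, the share of hits that remains when the root-form query is replaced by a morphological variant. The Python sketch below does this for query pair 1 (rows 1.1 and 1.2), using the document counts reported in Table 4.18; the function and variable names are illustrative and are not part of the original study.

```python
# Documents returned for query 1.1 (root form) and 1.2 (plural variant),
# as reported in Table 4.18.
ROOT_HITS = {"Google": 50500, "Bing": 4680, "Guruji": 485}
VARIANT_HITS = {"Google": 40400, "Bing": 680, "Guruji": 61}


def retained_share(root_hits, variant_hits):
    """Share of the root-form hit count that the variant query still returns."""
    return {
        engine: variant_hits[engine] / root_hits[engine]
        for engine in root_hits
        if root_hits[engine] > 0
    }


for engine, share in retained_share(ROOT_HITS, VARIANT_HITS).items():
    print(f"{engine}: {share:.0%} of the root-form hits")
```

For this pair the variant keeps roughly four fifths of the hits on Google but only about one seventh on Bing and one eighth on Guruji, which suggests that the engines differ considerably in how they handle the variant form.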


Table 418 Effect of morphological factors on Hindi queries

S No Root

word

s

Listing of Keywords Morphological

variants

Documents Returned

Google Bing Guruji Google Bing Guruji

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

S.No | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1
1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1
2.1 | हवाई दुर्घटना के कारण | 0.5 | 0.3 | 0.1
2.2 | हवाई दुर्घटनाओं के कारण | 0.3 | 0.3 | 0.1
3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5
3.2 | भारत में बोली जाने वाली भाषाएँ | 0.5 | 0.4 | 0.3
3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2
4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2
4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0
5.1 | प्राकृतिक आपदा | 1 | 0.4 | 0.3
5.2 | प्राकृतिक आपदाएँ | 0.6 | 0.6 | 0.1
6.1 | पक्षी की प्रजाति | 1 | 0.7 | 0.2
6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1
6.3 | पक्षियों की प्रजातियाँ | 0.9 | 0.6 | 0.1
7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2
7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2
8.1 | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2
8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2
9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4
9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3
9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4
10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3
10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0
10.3 | ग्रामीण विकास योजनाएँ | 0.8 | 0.6 | 0.2
11.1 | प्रमुख कृषि केन्द्र | 0.5 | 0.4 | 0
11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2

Table 4.19: Precision values of the three search engines

Figure 4.1: Precision values of the three search engines (chart not reproduced; x-axis: query numbers 1.1 to 11.1, y-axis: precision from 0 to 1; series: Google, Bing, Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents by the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists the morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the table above.
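The root-form observation can be illustrated with a small suffix stripper that reduces a plural or oblique Hindi noun to a candidate root before the query is submitted. This is only a minimal sketch: the rule list `SUFFIX_RULES` and the function `to_root` are hypothetical names introduced here, the rules cover just the kinds of variants seen in Table 4.18, and nothing is implied about how any search engine actually indexes.

```python
# Illustrative suffix rules (an assumption for this sketch): each pair is
# (suffix to remove, text to put back so the result looks like a root form).
# Longer suffixes come first so that "ियों" is tried before "ों".
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ाओं", "ा"),   # भाषाओं -> भाषा
    ("ाएँ", "ा"),   # भाषाएँ -> भाषा
    ("ाएं", "ा"),   # same plural written with an anusvara
    ("ों", ""),     # कीटनाशकों -> कीटनाशक
]


def to_root(word):
    """Return a candidate root for a Hindi noun by stripping one suffix."""
    for suffix, replacement in SUFFIX_RULES:
        # The length guard keeps very short words from being emptied.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


def query_to_root(query):
    """Apply to_root to every whitespace-separated keyword of a query."""
    return " ".join(to_root(token) for token in query.split())
```

Because the rules are purely orthographic, they will mis-handle some words (for example nouns whose root ends in a short इ, or function words such as क्यों), so a real system would need a dictionary check or a proper morphological analyser.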

The above results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On close observation, the precision of Google is the highest for almost all queries. Since, as mentioned above, Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
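The precision values in Table 4.19 are computed over the top 10 results. A minimal sketch of that calculation follows, assuming the relevance of each result has already been judged by hand and recorded as a list of booleans. The function name `precision_at_k` is introduced here for illustration, and dividing by k even when fewer than k results are returned is an assumption of this sketch.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top-k results that were judged relevant.

    judgements: list of booleans in rank order, True meaning relevant.
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = sum(1 for judged in judgements[:k] if judged)
    return relevant / k


# Example: 5 relevant documents among the first 10 gives 0.5,
# the value reported for Google on query 1.1.
example = [True, True, False, True, False, True, False, False, True, False]
print(precision_at_k(example))
```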

4.10 Discussion: Phonetic Nature of the Hindi Language and Spelling Variations

Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may type any of these spellings, and the search engine should support all of them.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of the Hindi language

Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Google results: query having keywords | No. of results | Precision@10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05

Figure 4.2: Precision charts for the phonetic nature of the Hindi language (charts not reproduced; one chart per query, Query No. 1 to Query No. 5, each comparing the query with its phonetic variants)

The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Every phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with minimal repetition. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
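One way to reduce the spread of results across spelling variants is to map each variant to a common key before comparison. The sketch below assumes Unicode-encoded Devanagari and collapses three kinds of variation that appear in Table 4.20: nuqta versus no nuqta, chandrabindu versus anusvara, and a half nasal consonant versus an anusvara. The function name `spelling_key` and the table `HOMORGANIC` are introduced here for illustration; this is not the method used by any of the search engines tested.

```python
import re
import unicodedata

NUQTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"  # the halant sign

# Each nasal consonant mapped to the stops articulated in the same place.
HOMORGANIC = {
    "ङ": "कखगघ",
    "ञ": "चछजझ",
    "ण": "टठडढ",
    "न": "तथदध",
    "म": "पफबभ",
}


def spelling_key(word):
    """Return a normalised key shared by common spelling variants."""
    # NFD splits precomposed nuqta letters into base letter + nuqta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUQTA, "").replace(CHANDRABINDU, ANUSVARA)
    # A half nasal before a stop of its own class is written as anusvara.
    for nasal, stops in HOMORGANIC.items():
        text = re.sub(f"{nasal}{VIRAMA}(?=[{stops}])", ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
print({spelling_key(v) for v in variants})  # a single shared key
```

The key is meant only for matching variants against each other. It deliberately discards distinctions (such as the nuqta) that can matter for meaning, so it should not be shown to users as a corrected spelling.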

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the documents returned and the precision values for the three search engines.

S.No | Query | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10
1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5.1 | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6.1 | आँख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | नेत्रदान | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Table 4.21: Effect of word synonyms on Hindi IR (in each group the first row is the standard Hindi word and the following rows are its synonyms; P@10 is the precision over the top 10 results)

Figure 4.3: Comparison of precision values against three search engines

From the examples above, it is observed that using synonyms of Hindi keywords improves information retrieval for a Hindi query. Using synonyms affects not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines and Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into a synonym can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
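The synonym effect described above can be exploited by generating one query per synonym combination and submitting each of them. The following is a minimal sketch, assuming a hand-built synonym dictionary; `SYNONYMS` and `expand_query` are hypothetical names, and the single dictionary entry simply mirrors the first group of Table 4.21.

```python
from itertools import product

# Hypothetical synonym dictionary, keyed by the standard Hindi keyword.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms):
    """Return the original query followed by its synonym variants."""
    # For each token, list the token itself plus any known synonyms.
    options = [[token] + synonyms.get(token, []) for token in query.split()]
    # The Cartesian product enumerates every combination of choices.
    return [" ".join(choice) for choice in product(*options)]


for variant in expand_query("सोने के आभूषण", SYNONYMS):
    print(variant)
```

The number of variants grows multiplicatively with the number of keywords that have synonyms, so a practical tool would probably cap the number of variants or rank them before querying.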

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected ones below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.

S.No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

Table 4.22: List of randomly selected ambiguous queries

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Query | Ambiguous keyword | Documents returned (Google) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Bing) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

Table 4.25: Ambiguity test for Guruji

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in another, unintended context. The results make clear that all search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

In another query, सपेरों का फन (art of snake charmers), which also has the context "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. Search engines therefore need to address ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on the Hindi language has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word "exercise" is written in Hindi as एकसयस इज. The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table presents a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English Words

Google Transliteration

Standard Hindi Keywords

Phonetic Equivalents Search Engine Google

Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800

Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710

Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820

Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100

Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000

0 537000

University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270

Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300

Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160

Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट

3240 38000 639 1120

specialist टऩशसअशरटट

सऩ मशममरसट

सऩ शमरसट सऩ श मरसट

143 1440 51300 9820

Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8

Table 4.26: Keywords along with their phonetic variants written in Hindi

The table above shows that the search engine returns documents for every single-keyword query: documents are returned, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are likewise retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, without changing the meaning of the original Hindi query. Example:

The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
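The two-level transformation described above can be sketched in a few lines. This is an illustrative reading of the description, not the software discussed in the next chapter: the dictionary `LOANWORDS` and the function `transform_levels` are hypothetical, and the Hindi-script spellings of the English words are assumptions (the experiments above show that several spellings of each loanword circulate on the web).

```python
# Hypothetical mapping from a Hindi keyword to its English equivalent
# written in Hindi script.
LOANWORDS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query, loanwords):
    """Return [level-1 query, level-2 query] for a Hindi query.

    Level 1 replaces the first replaceable keyword only.
    Level 2 replaces every replaceable keyword and is produced only
    when the query has more than one replaceable keyword.
    """
    tokens = query.split()
    positions = [i for i, token in enumerate(tokens) if token in loanwords]
    variants = []
    if positions:
        level1 = list(tokens)
        first = positions[0]
        level1[first] = loanwords[tokens[first]]
        variants.append(" ".join(level1))
    if len(positions) > 1:
        level2 = [loanwords.get(token, token) for token in tokens]
        variants.append(" ".join(level2))
    return variants


for level, variant in enumerate(transform_levels("विदेशी निवेश भारत में", LOANWORDS), 1):
    print(f"Level {level}: {variant}")
```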

Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

Figure 4.4: Queries transformed into two equivalent senses (charts not reproduced; one chart per query, Hindi Query 1 to Hindi Query 6, each showing the number of Google results for the original query, Level 1 and Level 2)

Influenced Hindi Query Google

In English Hindi Query Level 1 Level 2 Search Results

Health and blood donation

सव सथ औय यकतद न

हलथ औय यकतद न

हलथ औय जरि_िोनिन

1520 8360 1020

Treatment for Blood pressure

यकतच ऩ क इर ज

जरिपरिय क इर ज

जरिपरिय क टरीटभट

95800 37200 2150

Cardiologist रदम चचककतसक हाट िॉकटय काडिमोरासिटट 241000 99100 1770

Government Employment Policy

सयक य दव य य जग य म जन

सयक य दव य य जग य टकीभ

सयक य दव य एमपतरॉमभट सकीभ

1840000 51600 93

Foreign investment in

India

षवद श तनव श ब यत भ

पायन तनव श ब यत भ

पायनइनवटटभट ब यत भ

658000 527 66

Corruption free India

भरषटट च य भ कत ब यत

कयतिन भ कत ब यत

कयतिन फरी इॊडिमा

181000 19700 47000

From the table and figure above, it is evident that documents are returned for the original as well as the transformed Hindi queries, and the number of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28: Analysis of precision values (tabular representation)

Precision@10 for each Hindi query (HQ) and its Level 1 (L1) and Level 2 (L2) variants; the queries are those listed in Table 4.27, in the same order.

Query | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2
HQ1 Health and blood donation | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8
HQ2 Treatment for blood pressure | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5
HQ3 Cardiologist | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1
HQ4 Government employment policy | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8
HQ5 Foreign investment in India | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4
HQ6 Corruption free India | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

Figure 4.5: Analysis of inter-precision values (chart not reproduced; x-axis: HQ1 to HQ6, each followed by its levels L1 and L2, where HQ = Hindi query and L = level; series: Google, Bing, AltaVista)

From the tables and figures above, it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न (health and blood donation), 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the pattern repeats for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because the Hindi user may not be aware that such information is present on the web and may be unable to formulate the variant query based on the factor of English influence. From the table above it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


ञ | n as in punch

Table 4.4: Palatal Consonants

4.2.12 Retroflex Consonants

Letter | Description
ट | like t, but retroflex and unaspirated
ठ | like t, but retroflex and aspirated
ड | like d, but retroflex and unaspirated
ढ | like d, but retroflex and aspirated
ण | like n, but retroflex

Table 4.5: Retroflex Consonants

Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as the t in the word "better" or the d in "bedding" in American English. The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

Letter | Description
त | like t, but dental and unaspirated
थ | like t, but dental and aspirated
द | like d, but dental and unaspirated
ध | like d, but dental and aspirated
न | like n in name, but dental

Table 4.6: Dental Consonants

4.2.14 Labial Consonants

Letter | Description
प | like p, but unaspirated
फ | like p, but aspirated
ब | like b, but unaspirated
भ | like b, but aspirated
म | m

Table 4.7: Labial Consonants

4.2.15 Semivowels

Letter | Description
य | y as in young
र | like r, but often rolled
ल | l as in lip
व | either w or v

Table 4.8: Semivowels

The Hindi r sound is typically a flap. However, some speakers may occasionally trill the r sound, or may even pronounce it closer to an unflapped approximant, like the English r in "red".

4.2.16 Sibilants

Letter | Description
श | sh as in shave
ष | like sh, but retroflex
स | s as in save

Table 4.9: Sibilants

4.2.17 Glottal

Letter | Description
ह | like h, but voiced

Table 4.10: Glottal

4.2.18 Allophony of w and v in Hindi

A phoneme is an equivalence class of atomic, discrete sounds which can produce a difference in meaning when spoken, yet cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the p in the word "spit" and the p in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").

In Hindi, the sounds associated with the English letters w and v are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as w or v. Some speakers may even pronounce an intermediate sound.

Semi-Allophones j and z in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English j and z sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of w and v, since technically the z sound can be represented distinctly from the j sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज़्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English t or d in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation sounds much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing t or d are therefore transcribed with retroflex approximations.

Capital Letters

Devanagari has no capital letters.

Special Matraa Forms of उ and ऊ with र

र + उ = रु
र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter | Approximation
क़ | like k, but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़, but voiced
ज़ | just as English z, as in "zoo"
झ़ | similar to the s in English "vision"
फ़ | just as English f

Table 4.11: Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.

4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ + त = फ्त (मुफ्त)
क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.

42204 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical

conjuncts with the following consonant The first consonant is written on top of

the second consonant For example

ट + ट = टट छ टटी

ट + ठ = टठ चचटठी

42205 Other Conjuncts

Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either as a regular conjunct (e.g. न + द = न्द, as in हिन्दी) or with an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in a word such as हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
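Because हिन्दी and हिंदी are the same word, a retrieval system usually needs to map both spellings to one form before indexing. The homorganic rule above can be sketched as follows; this is an illustrative normalizer under the assumption that only the five stop classes are handled, not a complete treatment of Hindi orthography.

```python
ANUSVARA = "\u0902"
VIRAMA = "\u094D"

# (stops of one articulation class, nasal of that class), as code points.
CLASSES = [
    (range(0x0915, 0x0919), 0x0919),  # velar     क ख ग घ -> ङ
    (range(0x091A, 0x091E), 0x091E),  # palatal   च छ ज झ -> ञ
    (range(0x091F, 0x0923), 0x0923),  # retroflex ट ठ ड ढ -> ण
    (range(0x0924, 0x0928), 0x0928),  # dental    त थ द ध -> न
    (range(0x092A, 0x092E), 0x092E),  # labial    प फ ब भ -> म
]
HOMORGANIC = {chr(cp): chr(nasal) for stops, nasal in CLASSES for cp in stops}

def expand_anusvara(word: str) -> str:
    """Rewrite an anusvar before a stop as the explicit nasal conjunct."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and following in HOMORGANIC:
            out.append(HOMORGANIC[following] + VIRAMA)
        else:
            out.append(ch)  # includes a bindu that only nasalizes a vowel
    return "".join(out)

print(expand_anusvara("\u0939\u093F\u0902\u0926\u0940"))  # anusvar before dental द
print(expand_anusvara("\u0905\u0902\u0921\u093E"))        # anusvar before retroflex ड
```

A word-final dot, as in हैं, has no following stop and is left unchanged, which matches the distinction between anusvar and bindu drawn above.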

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)

र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)

म + र = म्र (उम्र)

Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted v, as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष

त + त = त्त

त + र = त्र

ज + ञ = ज्ञ

द + द = द्द

द + ध = द्ध

द + य = द्य

द + व = द्व

श + र = श्र

ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, so a maatraa that is written to the left of a consonant (such as ि) is placed before the entire conjunct. There are hundreds of conjuncts, but most are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam, a vertical line (।) which terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.
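For text processing this means a Hindi sentence splitter has to accept both terminators. A minimal sketch using Python's `re` module follows; the function name is illustrative, and real text would also need handling for abbreviations and numerals.

```python
import re

DANDA = "\u0964"  # DEVANAGARI DANDA, the viraam

# Split after a viraam, period, question mark or exclamation mark
# that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[\u0964.?!])\s+")

def split_sentences(text: str) -> list[str]:
    """Split Hindi text into sentences on the viraam or borrowed punctuation."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]

sample = "\u092F\u0939 \u0939\u093F\u0902\u0926\u0940 \u0939\u0948\u0964 \u0915\u094D\u092F\u093E?"
print(split_sentences(sample))
```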

[59][60]

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode gives a unique number, called a code point, to each character. Code points range from U+0000 to U+10FFFF and fit within 32 bits, which leaves room for the characters of all languages known to man.

There is in fact another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit code units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms, as defined by the Unicode Consortium, are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format (the name is also expanded as Unicode Transformation Format).
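The trade-off between the three forms is easy to observe with Python's built-in codecs. The sketch below is only an illustration; the helper name is made up for this example, and the big-endian codec variants are used so that no byte-order mark is counted.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Number of bytes one character occupies in each Unicode encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }

for ch in ("A", "\u0939", "\U0001F600"):  # Latin A, Devanagari ह, an emoji
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# All three forms carry the same repertoire and convert without loss.
text = "\u0939\u093F\u0928\u094D\u0926\u0940"
assert text.encode("utf-16-be").decode("utf-16-be") == text
assert text.encode("utf-8").decode("utf-8") == text
```

A Devanagari letter takes three bytes in UTF-8 but two in UTF-16, which is one reason Hindi corpora are noticeably larger on disk in UTF-8 than their character count suggests.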

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme of 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to 4). The most commonly used characters, including all of Devanagari, need at most three bytes. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes the sequence contains and carries the leading bits of the character. All further bytes in the multibyte sequence begin with the bits 10 and carry six more bits of the character each. [61]
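The scheme just described can be written out by hand in a few lines. This is a sketch for illustration, checked against Python's own codec; it follows the 4-byte limit of RFC 3629 and, as a simplifying assumption, does not reject surrogate code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0x80:                       # ASCII: a single byte, top bit clear
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def sequence_length(lead: int) -> int:
    """Read the total sequence length from the first byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte or invalid lead byte")

encoded = utf8_encode(0x0939)           # Devanagari ह
print(encoded.hex(" "), sequence_length(encoded[0]))
assert encoded == "\u0939".encode("utf-8")
```

Every byte of the Devanagari sequence has its top bit set, so none of them can be confused with an ASCII character by byte-oriented software.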

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
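The parallel layout means that a rough script-to-script mapping is just an addition of block offsets. The sketch below assumes the Unicode block start addresses listed in the table and a made-up function name; it is a positional mapping only, and positions that are unassigned in the target block are left untouched. It should not be read as a linguistically accurate transliteration.

```python
import unicodedata

# Start of each 128-position block that follows the ISCII-derived layout.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def map_script(text: str, target: str) -> str:
    """Map Devanagari characters to the analogous positions of another block."""
    offset = BLOCK_START[target] - BLOCK_START["devanagari"]
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            candidate = chr(cp + offset)
            # Keep the original if the target position has no assigned character.
            if unicodedata.name(candidate, ""):
                ch = candidate
        out.append(ch)
    return "".join(out)

hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"   # हिन्दी
print(map_script(hindi, "bengali"))
print(map_script(hindi, "gujarati"))
```

For cross-language retrieval this kind of mapping can serve as a cheap first step, for example to match proper names across scripts, before any language-specific rules are applied.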

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order. [62]
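The (((C)C)C)V structure can be approximated with a regular expression over the Devanagari block. The pattern below is a simplified sketch: the character ranges are taken from the Unicode code chart, while the grouping is an assumption made for illustration and ignores rarer signs.

```python
import re

CONS = r"[\u0915-\u0939\u0958-\u095F]\u093C?"   # consonant, optional nuqta
SYLLABLE = re.compile(
    rf"(?:{CONS}\u094D)*{CONS}"      # zero or more C + virama, then the core C
    rf"(?:[\u093E-\u094C]|\u094D)?"  # dependent vowel sign, or a final halant
    rf"[\u0901-\u0903]?"             # chandrabindu, anusvar or visarga
    rf"|[\u0904-\u0914][\u0901-\u0903]?"  # independent vowel
)

def syllables(word: str) -> list[str]:
    """Split a Devanagari word into orthographic syllables."""
    return SYLLABLE.findall(word)

hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"   # हिन्दी
print(syllables(hindi))

# Logical order: the sign ि is drawn to the left of ह,
# yet it is stored after it in the text sequence.
print([f"U+{ord(ch):04X}" for ch in syllables(hindi)[0]])
```

Segmenting at this level, rather than at individual code points, keeps a conjunct and its vowel sign together, which is usually what stemming, n-gram indexing or cursor movement need.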

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on the Internet. There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents. But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium (a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India) proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, work which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus, which contains approximately 96,157,000 words, and a parallel corpus, which consists of 200,000 words of text in English and supports translation involving Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus and a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Efforts in this field date back to 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Bombay, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R.M.K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on the grammar rules of Panini's Ashtadhyayi. It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent English as well as Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with a system-generated translation, the user can supply new words, phrases and sentences to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based approach with statistical methods. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net. [24]

4.5.9 Tamil-Hindi MAT System

The K.B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT: Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian language content and tries to return the most appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu, and is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that supports the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development for corpus creation/collection and content syndication, and provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is used in the DTP sector, in corporate offices and in the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has been aware of this need. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As one outcome, IndoWordNet was developed for Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for it. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, probably more than 10-15 Indian local search engines were launched, some small and some large, some with substantial funding and some with none. The space is now so crowded that it is difficult to tell who is really winning. The following is a brief overview of the current scenario.

The localized Indian search engines include Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. Localized versions of the big players Google, Yahoo and MSN are also available.

Each of these Indian search engines has come forward with its own USP (Unique Selling Proposition). It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website     2008 Stats   2009 Stats
Google      37           35
Yahoo       32           25
Rediff      7            4
Orkut       6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search: English

Website     2008 Stats   2009 Stats
Google      81           76
Yahoo       7            7
Wikipedia   3            6
English     3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search: English

4.6.1.3 Info Search: Local Language

Website     2008 Stats   2009 Stats
Google      65           34
Yahoo       12           29
Rediff      4            15
Teluguone   2            0
Guruji      1            18
Raftaar     1            0.2
Hindi       1            NA
Webdunia    1            0.7
Khoj        0.7          2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords do exist in the second set of search results. Stating that there is a recall problem is not sufficient, however. The next obvious question is how much of a problem it is; in other words, the problem needs to be quantified. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs and by how much.

Table 4.15: Problems faced while searching in Hindi: low recall

Google   Yahoo   Guruji   Hindi Query
0        0       0        वलडम टर ड स नटय आत कव दी हरभ
0        0       0        इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम

Table 4.16: Improved recall

Google   Yahoo   Guruji   Hindi Query
8820     92      12       वलडम टर ड सटय आतॊकी हभर
331      10      1        वलडम टर ड सटय आतॊकीअटक
708      50      1        इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम
37100    7400    93       बायतीम सॊटथान टवाटम शिा औय िोध

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and a well-defined grammar. In practice, however, the grammatical and structural standards of the language are rarely followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages. This results in dialectal variation not only in speech but also in writing. Every language uses markers (English uses -s, -es and -ing; Hindi uses MAATRAAS such as ऐ म ा ओ) that are attached to a root word to construct new words. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
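The stemming idea described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used in this study: the suffix list `PLURAL_SUFFIXES` and the function name `light_stem` are assumptions introduced here, and a usable Hindi stemmer would need a far larger, linguistically validated suffix inventory.

```python
import unicodedata

# Hypothetical, deliberately tiny suffix list (plural/oblique endings only).
PLURAL_SUFFIXES = ["ओं", "एं", "एँ", "ों", "ें"]


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip one known suffix so that variants map to a common root form."""
    word = unicodedata.normalize("NFC", word)
    # Try longer suffixes first so the most specific ending wins.
    for suffix in sorted(PLURAL_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएं"]:
        print(variant, "->", light_stem(variant))
```

With this list, the three Yojnaa forms from the example reduce to the same string, which is the property an index or query processor would rely on.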

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. Together, these factors result in several spellings of the same word; the Hindi word अंग्रेजी (angrējī, meaning English) is one example with several possible spelling variations.

Numerous words are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword school (सक र) and for the non-standard phonetic equivalent सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety of words: for example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Every phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with little repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
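One way a search system could treat phonetically equivalent spellings as the same keyword is to reduce each word to a normalized key before matching. The sketch below is an assumption made for illustration, not a description of how any of the search engines discussed here work: the function `phonetic_key` and its three rules (drop the nukta, treat chandrabindu as anusvara, treat a half nasal consonant as anusvara) are hypothetical and intentionally crude.

```python
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म


def phonetic_key(word: str) -> str:
    """Map common spelling variants of a Hindi word to one comparison key."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    # A nasal consonant followed by virama is often written as anusvara.
    for nasal in NASALS:
        text = text.replace(nasal + VIRAMA, ANUSVARA)
    return text


if __name__ == "__main__":
    variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
    print({v: phonetic_key(v) for v in variants})
```

The key is only meant for comparing two spellings; it over-merges some genuinely different words, so a real system would combine it with a lexicon or with corpus frequencies.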

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
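Synonym handling of this kind is often implemented as query expansion: each keyword that has known synonyms is replaced in turn to produce additional queries. The following is a minimal sketch under stated assumptions; the dictionary `SYNONYMS` holds only the ornament example from the text, and `expand_query` is a hypothetical helper that ignores inflection and word sense.

```python
# Illustrative synonym table; a real system would draw on a resource
# such as a WordNet-style lexicon rather than a hand-written dict.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: dict = SYNONYMS) -> list:
    """Return the original query followed by one variant per synonym."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is one way to recover documents that use a different synonym from the one the user typed.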

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous because it has another meaning, "to sleep". In the context of this query सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर means tax in one context (with synonyms बम ज श लक स द भहस ऱ ट कस), hand or arm in another context (with synonyms हसत फ ह आच शफय), and "to do" (करना) in yet another context.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians without being aware of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set that throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No   Meaning in English                  Query in Hindi
1      Indian rain forest                  ब यतवष मवन
1.1    Indian rain forests                 ब यत मवष मवनो
2      Reason for air crash                हव ईद घमटन क क यण
2.1    Reason for air crashes              हव ईद घमटन ओ क क यण
3      Language spoken in India            ब यतभफ रीज न व रीब ष
3.1    Languages spoken in India           ब यतभफ रीज न व रीब ष ए
3.2    Languages spoken in India           ब यतभफ रीज न व रीब ष ओ
4      Lake on the verge of extinction     षवर पतह न ऩयझ र
4.1    Lakes on the verge of extinction    षवर पतह न ऩयझ र
5      Natural calamity                    पर क ततकआऩद
5.1    Natural calamities                  पर क ततकआऩद ए
6      Bird species                        ऩ कीपरज तत
6.1    Bird's species                      ऩकषमोकीपरज तत
6.2    Bird's species                      ऩकषमोकीपरज ततम
7      Agriculture problem                 क षषसभसम
7.1    Agriculture problems                क षषसभसम ओ
8      Use of pesticide                    कीटन शकक इसत भ र
8.1    Use of pesticides                   कीटन शकोक इसत भ र
9      Mental illness                      भ नमसकय ग
9.1    Mental patients                     भ नमसकय चगमो
9.2    Mental patient                      भ नमसकय ग
10     Policy for village                  गर भ णषवक सम जन
10.1   Policies for village                गर भ णषवक सम जन ओ
10.2   Policies for village                गर भ णषवक सम जन ए
11     Major agricultural office           परभ खक षषक दर
11.1   Major agricultural offices          परभ खक षषक नदरो

Table 4.18: Effect of morphological factors on Hindi queries

Each numbered group gives the root word(s) and the listing of keywords and morphological variants shown by Google, Bing and Guruji (Hindi text as extracted), followed by the documents returned for each query keyword.

S.No   Google    Bing     Guruji   Keyword (documents returned per search engine)

1      Root word(s) and listings: ब यत वष मवन; ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1    50500     4680     485      ब यत वष मवन
1.2    40400     680      61       ब यत मवष मवनो

2      Root word(s) and listings: द घमटन; द घमटन द घमटन ओ द घमटन द घमटन
2.1    133000    2410     284      द घमटन
2.2    117000    420      23       द घमटन ओ

3      Root word(s) and listings: ब ष; ब ष ब ष ओ ब ष ब ष
3.1    161000    8330     961      ब ष
3.2    6200      935      188      ब ष ए
3.3    6090      441      356      ब ष ओ

4      Root word(s) and listings: झ र; झ रझ रो झ र झ र
4.1    4740      278      25       झ र
4.2    1270      28       1        झ र

5      Root word(s) and listings: आऩद; आऩद आऩद ओ आऩद आऩद
5.1    102000    4030     410      आऩद
5.2    1160      64       20       आऩद ए

6      Root word(s) and listings: ऩ परज तत; ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1    48200     1670     98       ऩ परज तत
6.2    47600     1150     84       ऩकषमोपरज तत
6.3    33800     747      25       ऩकषमोपरज ततम

7      Root word(s) and listings: सभसम; सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1    584000    30200    1889     सभसम
7.2    584000    7150     1356     सभसम ओ

8      Root word(s) and listings: कीटन शक; कीटन शक कीटन शको कीटन शक कीटन शक
8.1    36300     1360     333      कीटन शक
8.2    35800     800      270      कीटन शको

9      Root word(s) and listings: य ग य गोय ग य ग य ग
9.1    205000    21600    1423     य ग
9.2    128000    3280     239      य चगमो
9.3    112000    6280     647      य ग

10     Root word(s) and listings: म जन; म जन ओ म जन म जन म जन
10.1   673000    18500    3343     म जन
10.2   669000    6020     990      म जन ओ
10.3   673000    2860     416      म जन ए

11     Root word(s) and listings: क दर क दरीमक दर क दर क दर
11.1   261000    11300    655      क दर
11.2   29500     1850     105      क नदरो

Table 4.19: Precision values of the three search engines (Precision@10)

S.No   Google   Bing   Guruji   Query
1.1    0.5      0.3    0.1      ब यतवष मवन
1.2    0.3      0.1    0.1      ब यत मवष मवनो
2.1    0.5      0.3    0.1      हव ईद घमटन क क यण
2.2    0.3      0.3    0.1      हव ईद घमटन ओ क क यण
3.1    0.9      0.5    0.5      ब यतभफ रीज न व रीब ष
3.2    0.5      0.4    0.3      ब यतभफ रीज न व रीब ष ए
3.3    0.5      0.2    0.2      ब यतभफ रीज न व रीब ष ओ
4.1    0.5      0.3    0.2      षवर पतह न ऩयझ र
4.2    0.5      0.3    0        षवर पतह न ऩयझ र
5.1    1        0.4    0.3      पर क ततकआऩद
5.2    0.6      0.6    0.1      पर क ततकआऩद ए
6.1    1        0.7    0.2      ऩ कीपरज तत
6.2    0.9      0.6    0.1      ऩकषमोकीपरज तत
6.3    0.9      0.6    0.1      ऩकषमोकीपरज ततम
7.1    0.7      0.3    0.2      क षषसभसम
7.2    0.7      0.4    0.2      क षषसभसम ओ
8.1    1        0.8    0.2      कीटन शकक इसत भ र
8.2    0.9      0.6    0.2      कीटन शकोक इसत भ र
9.1    0.7      0.6    0.4      भ नमसकय ग
9.2    0.9      0.6    0.3      भ नमसकय चगमो
9.3    0.9      0.5    0.4      भ नमसकय ग
10.1   0.9      0.7    0.3      गर भ णषवक सम जन
10.2   1        0.6    0        गर भ णषवक सम जन ओ
10.3   0.8      0.6    0.2      गर भ णषवक सम जन ए
11.1   0.5      0.4    0        परभ खक षषक दर
11.2   0.4      0.2    0.2      परभ खक षषक नदरो

Figure 4.1: Precision values of the three search engines (precision from 0 to 1 plotted against query number, with series P Google, P Bing and P Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents using the root word because, in general, better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, the precision of Google is high for almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
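The precision measure over the top 10 results can be written out explicitly. The sketch below assumes that each returned result has been judged relevant or not relevant by a human assessor; the function name `precision_at_k` and the boolean-list input format are assumptions made for illustration, since the judging procedure itself is not detailed in this section.

```python
def precision_at_k(judgments: list, k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `judgments` holds one boolean per returned result, in rank order.
    If fewer than k results were returned, the missing positions
    count as non-relevant, so the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = sum(1 for judged_relevant in judgments[:k] if judged_relevant)
    return relevant / k


if __name__ == "__main__":
    # Five of the first ten results judged relevant.
    sample = [True, False, True, True, False, False, True, False, True, False]
    print(precision_at_k(sample))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns only a handful of documents, which matches the way very small result sets appear with low precision in the tables.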

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

For each Hindi query (standard keywords were set in bold in the original), the phonetic variations of the keywords are listed, followed by the Google results for each keyword combination. Hindi text is shown as extracted.

No. of results   Precision@10   Keywords in Google query

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
97               0.9            ससजिमोिहयीर
632              0.9            ससजिमो जहयीर
194              0.9            सिबजमो िहयीर

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
35300            1.0            आसभ न भहॊगाई
1040             1.0            आसभ न भहॉगाई
14               0.6            आसभ भह ग ई
563              0.7            आसभाॊ भह ग ई

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
211000           0.8            भरषटाचायआज दी
214              0.6            भरशटाचायआज दी
447              0.7            बयषटाचाय आज दी
1090000          0.9            भरषटट च यआिादी
1040             0.7            भरशट च य आिादी
1190             0.8            बयषटट च यआिादी

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
84700            0.3            अनन हज य आनदोरन
85100            0.8            अनन हज य आॊदोरन
78               0.6            अनन हज य क आॉदोरन
399              0.5            अनना हिाय क आनद रन
3260000          1.0            अनन हिाय क आनद रन

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
9650             0.9            फयोिगायी
80600            1.0            फयोिगायी
170              0.7            फयोिगायी
30               0.5            फयोिगायी

Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart per query; values as extracted: Query No. 1 (series 1, 1.1, 1.2): 34, 33, 33; Query No. 2 (series 2, 2.1, 2.2, 2.3): 31, 30, 18, 21; Query No. 3 (series 3, 3.1 to 3.5): 18, 13, 15, 20, 16, 18; Query No. 4: values not legible in the source; Query No. 5 (series 5, 5.1, 5.2, 5.3): 29, 32, 23, 16)

The table and figure above show that the search engine returns only a handful of documents for some phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Every phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with little repetition. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 compare the number of documents returned and the precision values for the three search engines.

Table 4.21: Effect of word synonyms on Hindi IR

Each numbered group gives the query with its standard Hindi word and synonyms (Hindi text as extracted), followed by the documents returned and the precision over the top 10 results ("Per 10") for each search engine.

S.No   Google    Per 10   Bing    Per 10   Guruji   Per 10   Query

1      Query, standard Hindi word and synonyms: स न क आब षण; आब षणगहन
1.1    217000    0.8      3250    0.7      381      0.5      स न क आब षण
1.2    188000    0.8      2590    0.6      389      0.5      स न क गहन
1.3    78900     0.8      1670    0.7      311      0.6      स न क ज वय
1.4    9490      0.5      633     0.4      70       0        स न क अर क य
1.5    493       0.5      38      0.3      1        0        स न क आबयण

2      Query, standard Hindi word and synonyms: क र फ दर; फ दर
2.1    233000    0.7      7510    0.7      733      0.3      क र फ दर
2.2    40700     0.9      1500    0.8      99       0.6      क र भ घ
2.3    1570      0.6      54      0.6      2        0.2      क र जरधय

3      Query, standard Hindi word and synonyms: सतर सशिकतकयण; सतर न यी
3.1    9950      0.9      1570    0.7      760      0.6      सतर सशिकतकयण
3.2    29300     0.9      1910    0.9      736      0.4      न यीसशिकतकयण
3.3    96300     0.9      5160    0.8      1091     0.3      भहहर सशिकतकयण
3.4    7670      0.8      680     0.7      510      0.2      औयतसशिकतकयण

4      Query, standard Hindi word and synonyms: मसक दयक अ हक य; अ हक य
4.1    1990      0.4      18      0.3      60       0.6      मसक दयक अ हक य
4.2    2400      0.6      304     0.2      16       0.1      मसक दयक अमबभ न
4.3    495       0.5      54      0.6      9        0.1      मसक दयक घभ ड

5      Query, standard Hindi word and synonyms: व रग ओ; व
5.1    6960      1        698     0.8      29       0.5      व रग ओ
5.2    13400     1        1080    0.9      143      0.9      ऩ डरग ओ
5.3    481       0.5      19      0.5      0        0        दयखतरग ओ

6      Query, standard Hindi word and synonyms: आ खद न; आ ख
6.1    34000     0.7      3690    0.5      312      0.3      आ खद न
6.2    77500     1        3240    1        159      0.9      न तरद न
6.3    2450      0.8      427     0.1      36       0.1      च द न

Figure 4.3: Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It is also clear that changing a keyword into a synonym can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented in Table 4.22. The second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context of the ambiguous keyword.

Table 4.22: List of randomly selected ambiguous queries

S.No   Query                     For keyword as     In English                         For keyword as      In English
1      नारी को सोना पसंद है       सोना (Gold)        Women like gold                    सोना (To sleep)     Women like to sleep
2      आम लोगों की पसंद          आम (Common)        Common man's choice                आम (Mango)          Mango is people's choice
3      बाल विकास और पोषण        बाल (Children)     Child development and nutrition    बाल (Hair)          Hair development and nutrition
4      सपेरों का फन              फन (Art)           Art of snake charmers              फन (Snake head)     Snake charmer's snake head
5      युद्ध में कुल विनाश        कुल (Aggregate)    Aggregate destruction in wars      कुल (Family)        Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                50800                Gold        5               To sleep     2               3
2       आम                  488000               Common      3               Mango        3               4
3       बाल                 2900000              Children    7               Hair         3               0
4       फन                  184                  Art         0               Snake head   10              0
5       कुल                 17800                Aggregate   2               Family       3               5

Table 4.24: Ambiguity test for Bing

Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                2680                 Gold        2               To sleep     2               6
2       आम                  17800                Common      3               Mango        3               4
3       बाल                 4030                 Children    6               Hair         2               2
4       फन                  25                   Art         0               Snake head   9               1
5       कुल                 1900                 Aggregate   0               Family       2               8

Table 4.25: Ambiguity test for Guruji

Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                109                  Gold        0               To sleep     0               10
2       आम                  6756                 Common      3               Mango        0               7
3       बाल                 635                  Children    5               Hair         2               3
4       फन                  No results found     Art         n/a             Snake head   n/a             n/a
5       कुल                 84                   Aggregate   0               Family       0               10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, non-required context. All search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For the query सपेरों का फन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the non-required context (snake head) and Bing returns 9 such documents, whereas Guruji fails to retrieve even a single document. In such scenarios it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
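The counts in the ambiguity tables can be turned into a simple per-engine measure. The sketch below is an illustration only: the `AmbiguityRow` record and the `intended_share` function are names assumed here, and the three sample rows are the figures reported for query 5 (aggregate destruction in wars) in the ambiguity tables for Google, Bing and Guruji.

```python
from dataclasses import dataclass


@dataclass
class AmbiguityRow:
    engine: str
    intended: int       # top-10 results in the intended sense
    other_sense: int    # top-10 results in the competing sense
    other_context: int  # top-10 results relevant to neither sense


def intended_share(row: AmbiguityRow) -> float:
    """Share of the judged results that match the intended sense."""
    total = row.intended + row.other_sense + row.other_context
    return row.intended / total if total else 0.0


# Query 5 as reported for each engine in the ambiguity tables.
QUERY_5 = [
    AmbiguityRow("Google", 2, 3, 5),
    AmbiguityRow("Bing", 0, 2, 8),
    AmbiguityRow("Guruji", 0, 0, 10),
]

if __name__ == "__main__":
    for row in QUERY_5:
        print(row.engine, intended_share(row))
```

A low share for every engine on the same query is a sign that the query itself is ambiguous rather than that one engine indexes poorly.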


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word exercise is written in Hindi as एकसयस इज. The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 lists a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

The four result counts are the Google results for the four Hindi forms of each word, in the order listed: Google transliteration, standard Hindi keyword, and phonetic equivalents (Hindi text as extracted).

English word   Google results (four Hindi forms, in order)    Hindi forms
Woman          42200      101000     3520       34800         वोभन व भन व भ न व भ न
Insurance      1220       246000     1390       3710          इनसयाॊस इ शम यस इ शम य स इनशम य नस
Cancer         1880000    35700      6490       1820          क सय क सय क नसय क नसय
Hospital       404000     263200     2440       100           हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर
Corruption     1890       368000     1110       58            कोररसततओन कयपशन कयऩशन क यपशन
Computer       4450000    1040000    2610000    537000        कॊ तमटय कमपम टय कमपम टय क पम टय
University     1540       1070000    1420       3270          उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी
Director       4600       735000     97000      8300          डियकटय ड मय कटय ड इय कटय ड मय कटय
Accident       26300      125000     2510       5160          एसकसिट एकस डट एकस ड नट ऐिकसडट
Parliament     3240       38000      639        1120          ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
Specialist     143        1440       51300      9820          टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट
Expert         269000     8730       182        8             एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम

From the above table it can be observed that search engine does return documents

for single keyword query documents for all phonetic variants of the keywords are

also returned which are huge in number It can also be seen that that people have

their own way of representing the Hindi words and no standard is followed for

storing Hindi data on web Also the documents are retrieved for every

phonetically variant English Keyword written in Hindi script In the above table

the column with bold Hindi entries shows the keywords which are obtained by

using Google transliteration tool The table 426 shows that transliteration does

not provide correct Hindi word in most of the cases For example the correct

transliteration for word University should be म तनवमसमटीwhereas Google

transliteration provides the word उतनव मसमतम which is completely wrong It is

clearly evident from the figure above that 1540 documents have been retrieved for

the wrong keyword उतनव मसमतम (University) and the same follows for other single

word queries Insurance इनस य स

ParliamentऩमरमअभटCorruptionक रम िपतओनPolicy

ऩ मरसम Specialistसऩ मसअमरसट It can be analyzed that Hindi website developers

make use of unchecked and non standard transliteration which makes the Hindi IR

process a difficult task

In the next example Multiword Hindi queries are selected to test the effect

of English influence on Hindi IR on precision and quantity of documents

retrieved The Hindi query is transformed into it variants by the software (design

and working discussed in next chapter) by replacing the Hindi keywords with

English keywords written in Hindi without changing the meaning of the query

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is left unchanged.

Example: the English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms that mix English and Hindi while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
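The two-level transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software described in the next chapter: the lexicon `LOANWORDS`, the function name `transform_query` and the rule "Level 1 replaces the first replaceable keyword" are all hypothetical choices made for the example.

```python
# Hypothetical lexicon mapping Hindi keywords to English words written in
# Devanagari. The entries are illustrative only; a real tool would need a
# much larger, curated dictionary.
LOANWORDS = {
    "विदेशी": "फॉरेन",            # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",        # nivesh  -> "investment"
}


def transform_query(query, lexicon=LOANWORDS):
    """Return (level1, level2) variants of a whitespace-separated query.

    Level 1 replaces only the first keyword found in the lexicon.
    Level 2 replaces every keyword found in the lexicon.
    A query with no replaceable keyword is returned unchanged twice.
    """
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in lexicon]
    if not replaceable:
        return query, query

    level1 = list(tokens)
    first = replaceable[0]
    level1[first] = lexicon[tokens[first]]

    level2 = [lexicon.get(tok, tok) for tok in tokens]
    return " ".join(level1), " ".join(level2)


if __name__ == "__main__":
    l1, l2 = transform_query("विदेशी निवेश भारत में")
    print("Level 1:", l1)
    print("Level 2:", l2)
```

Each variant can then be submitted to a search engine alongside the original query, which is the idea behind the interface-level approach discussed in this section.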


Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses (chart of Google search results for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of documents retrieved is considerable. For search engines, however, the quality of results is more important than the quantity. Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values, measured over the first 10 results. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
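Precision over the first 10 results can be computed as in the following minimal sketch. The function name `precision_at_k` and the list of relevance judgements are hypothetical; the judgements themselves have to come from a human assessor, as in the experiments reported here.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results judged relevant.

    relevance: list of booleans, one per result, best-ranked first.
    If fewer than k results are returned, the missing ones count as
    non-relevant (the divisor stays k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


if __name__ == "__main__":
    # Hypothetical judgements: 9 of the first 10 documents are relevant.
    judgements = [True] * 9 + [False]
    print(precision_at_k(judgements))
```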

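Table 4.28 Analysis of precision values (tabular representation)

Query | Variant | Precision at 10: Google | Bing | AltaVista
सव सथऔययकतद न | Hindi query | 0.9 | 0.9 | 0.9
ह लथऔययकतद न | Level 1 | 0.9 | 0.8 | 0.9
ह लथऔयबरड_ड न शन | Level 2 | 0.9 | 0.8 | 0.8
यकतच ऩक इर ज | Hindi query | 0.8 | 0.8 | 0.8
बरडपर शयक इर ज | Level 1 | 0.9 | 0.7 | 0.7
बरडपर शयक टरीटभट | Level 2 | 0.7 | 0.5 | 0.5
रदमचचककतसक | Hindi query | 0.9 | 0.9 | 0.9
ह टमचचककतसक | Level 1 | 0.8 | 0.7 | 0.7
क रड मम र िजसट | Level 2 | 1 | 1 | 1
सयक यदव य य जग यम जन | Hindi query | 1 | 0.8 | 0.6
सयक यदव य य जग यसकीभ | Level 1 | 0.9 | 0.8 | 0.7
सयक यदव य एमपरॉमभटसकीभ | Level 2 | 0.8 | 0.8 | 0.8
षवद श तनव शब यतभ | Hindi query | 0.9 | 0.9 | 0.9
प य नतनव शब यतभ | Level 1 | 0.8 | 0.8 | 0.9
प य नइनव सटभटब यतभ | Level 2 | 0.8 | 0.4 | 0.4
भरषटट च यभ कतब यत | Hindi query | 1 | 1 | 1
कयपशनभ कतब यत | Level 1 | 1 | 1 | 1
कयपशनफरीइ रडम | Level 2 | 1 | 1 | 1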


Figure 4.5 Analysis of inter-precision values (chart of precision for HQ1 to HQ6 and their L1 and L2 variants on Google, Bing and AltaVista; HQ = Hindi Query, L = Level)

From the above tables and figures it can be seen that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns. For the Hindi queries and their transformed variations, the degree of relevance of the documents is generally very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries the user may miss the chance of retrieving relevant information, since the Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variation that reflects the influence of English. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].


थ    like t, but dental and aspirated
द    like d, but dental and un-aspirated
ध    like d, but dental and aspirated
न    like n in "name", but dental

Table 4.6 Dental Consonants

4.2.14 Labial Consonants

Letter    Description
प    like p, but un-aspirated
फ    like p, but aspirated
ब    like b, but un-aspirated
भ    like b, but aspirated
म    m

Table 4.7 Labial Consonants

4.2.15 Semivowels

Letter    Description
य    y as in "young"
र    like r, but often rolled
ल    l as in "lip"
व    either w or v

Table 4.8 Semivowels


The Hindi r sound is typically a flap. However, some speakers may occasionally trill the r sound, or may even pronounce it closer to an unflapped approximant, as in the English r in "red".

4.2.16 Sibilants

Letter    Description
श    sh as in "shave"
ष    like sh, but retroflex
स    s as in "save"

Table 4.9 Sibilants

4.2.17 Glottal

Letter    Description
ह    like h, but voiced

Table 4.10 Glottal

4.2.18 Allophony of w and v in Hindi

A phoneme is an equivalence class of discrete sounds: exchanging one phoneme for another can produce a difference in meaning, whereas exchanging members of the same class for one another cannot. A phone is simply a distinct sound. For instance, in English the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are generally not aware of this distinction even though they make it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones j and z in Hindi

Likewise, Hindi speakers do not generally maintain a strict distinction between the English "j" and "z" sounds, but will typically pronounce words consistently. This situation is not quite the same as that of "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced "jo". There is some variation, however, in words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation lies between the Devanagari retroflex and dental positions, although to Hindi speakers the English pronunciation sounds much closer to the retroflex one. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.


Capital Letters

Devanagari has no capital letters.

Special Matraa Forms of उ and ऊ with र

र + उ = रु
र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter    Approximation
क़    like k, but pronounced in the back of the mouth
ख़    velar fricative, like "Bach" in German
ग़    velar sound similar to ख़, but voiced
ज़    just as English z, as in "zoo"
झ़    similar to the s in English "vision"
फ़    just as English f

Table 4.11 Foreign Sounds

Only two of the borrowed sounds, though, are typically pronounced distinctly from the non-nuqta forms: ज़ and फ़.
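Because many writers omit the nuqta, the same word can appear on the web both with and without it (compare the "zyada" and "jyada" pronunciations above). One possible way to match such variants, shown here only as a sketch and not as the method used in this study, is to fold nuqta forms to their base letters before comparing keywords. The function name `strip_nuqta` is hypothetical; `unicodedata.normalize` is the Python standard-library call.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nuqta(text):
    """Fold nuqta letters to their base letters.

    NFD decomposition splits precomposed nuqta letters such as
    U+095B (za) into base letter + U+093C, so both the precomposed
    and the two-code-point spellings are handled the same way.
    Note that this also folds other nuqta letters (for example ड़),
    which may or may not be wanted in a real system.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


if __name__ == "__main__":
    za_precomposed = "\u095b"       # ज़ as a single code point
    za_combining = "\u091c\u093c"   # ज followed by the nuqta sign
    print(strip_nuqta(za_precomposed) == strip_nuqta(za_combining))
```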


4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

- The halant (्), a diacritical subscript, e.g. क्.
- A conjunct, a ligature formed by conjoining two consonant symbols. This method is much more common. The halant is typically used only when typographical difficulties make it hard to use conjuncts.

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted and the modified consonant symbol is then conjoined to the second consonant symbol. For example:

न + द = न्द    हिन्दी
च + छ = च्छ    अच्छा
स + त = स्त    नमस्ते
ल + ल = ल्ल    बिल्ली
म + ब = म्ब    लम्बा
फ़ + त = फ़्त    मुफ़्त
क + य = क्य    क्यों

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.


4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट    छुट्टी
ट + ठ = ट्ठ    चिट्ठी

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or using an anusvar, which is a dot written above the horizontal line to the right of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated at the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.
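For retrieval purposes, the two spellings of a nasal conjunct are variants of the same keyword. The sketch below shows one possible way to fold the conjunct spelling into the anusvar spelling so that both match; it is an illustrative assumption, not part of the tool described in this study. The table `HOMORGANIC` and the function `to_anusvara` are hypothetical names, and only a nasal followed by a stop of its own class is rewritten, following the homorganic rule stated above.

```python
import re

VIRAMA = "\u094d"    # halant
ANUSVARA = "\u0902"  # anusvar

# Nasal consonant -> stops articulated at the same place (same varga).
HOMORGANIC = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # velar nasal: ka kha ga gha
    "\u091e": "\u091a\u091b\u091c\u091d",  # palatal nasal: ca cha ja jha
    "\u0923": "\u091f\u0920\u0921\u0922",  # retroflex nasal: retroflex stops
    "\u0928": "\u0924\u0925\u0926\u0927",  # dental nasal: ta tha da dha
    "\u092e": "\u092a\u092b\u092c\u092d",  # labial nasal: pa pha ba bha
}


def to_anusvara(word):
    """Rewrite 'nasal + halant + homorganic stop' as 'anusvar + stop'."""
    for nasal, stops in HOMORGANIC.items():
        pattern = f"{nasal}{VIRAMA}(?=[{stops}])"
        word = re.sub(pattern, ANUSVARA, word)
    return word


if __name__ == "__main__":
    hindi_conjunct = "\u0939\u093f\u0928\u094d\u0926\u0940"  # conjunct spelling of "Hindi"
    hindi_anusvar = "\u0939\u093f\u0902\u0926\u0940"         # anusvar spelling of "Hindi"
    print(to_anusvara(hindi_conjunct) == hindi_anusvar)
```

Applying the same folding to both the query and the indexed text lets either spelling retrieve documents written with the other.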

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.

Conjuncts with र

As the first member of a conjunct, र appears as a small hook or sickle above and to the right of the following consonant:

र + म = र्म    शर्मा
र + ट + ई = र्टी    पार्टी

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र    शुक्रिया
म + र = म्र    उम्र

Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted v, as follows:

ट + र = ट्र    राष्ट्र

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and a maatraa that is written before a consonant is placed before the entire conjunct.

There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam (।), a vertical line which terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.

[59][60]

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, a 7-bit encoding of the characters used in English (it can represent at most 128 characters). Over time, the need was felt for a single encoding with enough capacity to accommodate all the languages of the world. To enable the sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a unique number (code point) in the range U+0000 to U+10FFFF, and can therefore potentially give a unique number to each character in all languages known to man.

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants in both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.


4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32-bit units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 transforms all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format.
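The difference between the three encoding forms can be seen by encoding the same Devanagari word in each of them. The following sketch uses Python's built-in codecs; the helper name `encoded_sizes` is hypothetical, and the little-endian codec names are chosen only so that no byte-order mark is added to the output.

```python
def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding form."""
    sizes = {}
    for form in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(form)
        # Every form decodes back to the same text without loss of data.
        assert data.decode(form) == text
        sizes[form] = len(data)
    return sizes


if __name__ == "__main__":
    # The word "Hindi" in Devanagari: six code points from U+0900 to U+097F.
    hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
    for form, size in encoded_sizes(hindi).items():
        print(form, size, "bytes")
```

Each Devanagari code point takes three bytes in UTF-8, one 16-bit unit in UTF-16 and one 32-bit unit in UTF-32, whereas an ASCII letter takes one byte in UTF-8.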

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme. In the original design a sequence could range from 2 to 6 bytes; the current definition of UTF-8 (RFC 3629) allows at most 4 bytes, and the most commonly used characters, including Devanagari, need no more than three. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence indicates how many bytes the sequence contains. The remaining bits of the first byte, together with the low six bits of each following byte, encode the actual character [61].
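The byte structure described above can be checked directly. The sketch below is a simplified illustration (it does not validate sequences fully); the function names `utf8_sequence_length` and `is_continuation` are hypothetical.

```python
def utf8_sequence_length(first_byte):
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, single byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 leading byte")


def is_continuation(byte):
    """True for the 10xxxxxx bytes that follow a leading byte."""
    return byte >> 6 == 0b10


if __name__ == "__main__":
    encoded = "\u0915".encode("utf-8")   # DEVANAGARI LETTER KA
    print([f"{b:08b}" for b in encoded])
    print(utf8_sequence_length(encoded[0]))
```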

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread, with myriad changes, throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch of these scripts includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.


4.3.4 Devanagari: U+0900 to U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0 to F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
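The parallel code layout means that a letter can be mapped to its counterpart in another Indic script by shifting its code point by the distance between the two blocks. The sketch below illustrates this under stated assumptions: the table `BLOCK_START` lists only four of the blocks, the function name `shift_script` is hypothetical, and no check is made that the target position is actually assigned, since not every position is filled in every script.

```python
# Starting code points of some Indic blocks in the Unicode Standard.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def shift_script(text, src, dst):
    """Map characters of block `src` to the analogous positions in `dst`.

    Characters outside the source block (spaces, digits, Latin letters)
    are copied unchanged.
    """
    lo = BLOCK_START[src]
    offset = BLOCK_START[dst] - lo
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(chr(cp + offset) if lo <= cp < lo + BLOCK_SIZE else ch)
    return "".join(out)


if __name__ == "__main__":
    ka = "\u0915"  # DEVANAGARI LETTER KA
    print(shift_script(ka, "devanagari", "gujarati"))
    print(shift_script(ka, "devanagari", "bengali"))
```

This is a positional mapping only; it is not a full transliteration, because scripts differ in which positions they use and in their spelling conventions.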

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991, Annex G (Extended Character Set for Vedic). Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
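Logical order can be observed by listing the code points of a short syllable. In the sketch below (the helper name `code_points` is hypothetical; `unicodedata.name` is the Python standard-library call), the vowel sign of the syllable "hi" is drawn to the left of the consonant but stored after it, and the conjunct "ksha" is stored as consonant, virama, consonant.

```python
import unicodedata


def code_points(text):
    """Return (U+XXXX, Unicode name) for each character in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    samples = {
        "hi": "\u0939\u093f",          # consonant HA + vowel sign I
        "ksha": "\u0915\u094d\u0937",  # KA + virama + SSA (a conjunct)
    }
    for label, text in samples.items():
        print(label, text)
        for cp, name in code_points(text):
            print("   ", cp, name)
```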

44 Indian Languages on internet

Rise of Hindi Urdu and other Indian languages on the Web has lead

millions of non-English speaking Indians to discover uses of the Internet in their

daily lives They are sending and receiving e-mails searching for information

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 107

reading e-papers blogging and launching Web sites in their own languages Two

American IT companies Microsoft and Google have played a big role in making

this possible

A decade ago there were many problems involved in using Indian languages on

the Internet ―There was mismatch of fonts and keyboard layouts which made it

impossible to read any Hindi document if the user did not have the same fonts

There was chaos more than 50 fonts and 20 keyboards were being used and if

two users were following different styles there was no way to read the other

personlsquos documents But the advent of Unicode support for Hindi and Urdu

changed all that The concept of new character encoding from Unicode

Consortiummdasha nonprofit in California whose members include Google IBM

Oracle Microsoft Sun MicrosystemsYahooand the Government of Indiamdash

proved to be a boon for Indian languages Microsoft incorporated the Hindi

Unicode font Mangal in its operating system in 2001 ―Since then the Hindi

Unicode support has been a part of all subsequent up gradations of Microsoftlsquos

operating systems Also providing Input Method Editor Facilities give users the

option to use different types of keyboards says Meghashyam Karanam product

manager vision and localization at Microsoft India The earlier system could

incorporate only 127 characters which is not enough for the Hindi

Devnagariscript The Unicode system can incorporate up to 65000 characters As

most computers in India use Microsoftlsquos operating system it ensured that the

Hindi font was available to most of them as they upgraded the operating software

In 2004 the Hindi version of Microsoft Office 2003 which included Word

Excel PowerPoint and Outlook was launched Now the Hindi version of

Microsoft Office 2007 is also available ―It includes Hindi language interface

packs that allow users to create documents and communicate with others in Hindi

Users can also navigate using the menus and toolbars that are in Hindi We have

received a very good response from the Hindi users says Karanam Urdu

language support is available in Windows Vista and Office 2007 Another

Microsoft initiative is Project Bhasha which was launched in 2003 and now

provides support to 13 Indian languages such as Hindi Tamil Kannada Punjabi

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 108

Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products. If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language: "Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds. The efforts of

Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles. What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts

agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life. But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is the nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering

(EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus containing approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of English text, which helps in translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository for a Hindi word database and for dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date back to 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is

a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application that is available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to

provide translation tools to government agencies. Mantra software is available in all forms: desktop, network and web-based. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent both English and Hindi grammar. Initially it was domain-specific, covering personnel administration, specifically gazette notifications, office orders, office memorandums and circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was only for English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable to the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with a system-generated translation, the user can supply new words, phrases and sentences to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based approach with a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net. [24]

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The

system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the assistance of Sequoia Capital. guruji.com uses crawl technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return appropriate results. The guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government in e-governance programs. It has developed software for multilingual content creation for publishing newspapers and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has long been aware of the need for language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also making efforts for the

proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As a result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India. [64]

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any

of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents with these keywords do exist, since the second search returns results. But just saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs and by how much.
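One way to make "how much" concrete is to compute recall for each query against a set of documents judged relevant. The following is a minimal sketch, assuming that such a relevant set is available (for example, pooled from the results of several rephrased queries); the function name and the document identifiers are hypothetical and are not taken from the experiments reported here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the engine actually returned."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when no document is relevant")
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical document ids: four documents are relevant to the information need.
relevant_docs = {"d1", "d2", "d3", "d4"}

# A query spelling that matches nothing retrieves no documents at all.
print(recall([], relevant_docs))            # 0.0
# A rephrased query finds two of the four relevant documents.
print(recall(["d1", "d2"], relevant_docs))  # 0.5
```

On the open web the full relevant set is unknown, so any such figure is only relative to the pool of documents that was judged.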

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                          Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                        0        0       0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च              0        0       0

Table 4.16 Improved recall

Hindi Query                                          Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                            8820     92      12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                            331      10      1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च               708      50      1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध                    37100    7400    93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are seldom followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in dialect changes not only in speaking but also in writing. Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, while Hindi uses maatraas (vowel signs) and suffixes. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, "plan" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
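The idea of stemming can be illustrated with a light suffix-stripping routine. This is only a sketch under stated assumptions: the suffix list is a small hand-picked set chosen for the examples in this chapter, the names `SUFFIX_RULES`, `stem` and `group_by_stem` are hypothetical, and the Devanagari spellings are our reading of the example words.

```python
# Illustrative (suffix, replacement) rules, tried in order; not an exhaustive list.
SUFFIX_RULES = [
    ("ियों", "ी"),  # पक्षियों -> पक्षी
    ("ियां", "ी"),
    ("ओं", ""),     # योजनाओं -> योजना
    ("एं", ""),     # योजनाएं -> योजना
    ("एँ", ""),
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),     # झीलें -> झील
]


def stem(word, min_stem_len=2):
    """Strip the first matching plural/oblique suffix, keeping a minimum stem length."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


def group_by_stem(words):
    """Group surface forms that share the same stem."""
    groups = {}
    for w in words:
        groups.setdefault(stem(w), []).append(w)
    return groups


print(group_by_stem(["योजना", "योजनाओं", "योजनाएं", "झील", "झीलों"]))
```

A rule list like this over-generalizes; for instance, a root ending in a short "ि" would be mapped to a long "ी" by the first two rules, which is why practical stemmers usually combine suffix rules with a lexicon.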

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together, the variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in multiple spellings of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible spelling variations. There are numerous words that are phonetically equivalent but vary in writing. The word "school" can be written in Hindi in several different ways. When information is searched for the standard keyword for school (स्कूल) and for a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words; for example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the

Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any variant as a keyword, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved, with very little overlap. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
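Support for phonetically equivalent keywords can be approximated by mapping every spelling to a normalized key and matching on the key. The sketch below is an assumption-laden illustration, not a description of any search engine: the folding rules (dropping the nukta, merging chandrabindu with anusvara, shortening long vowels, dropping the halant, and removing a prothetic "इ" before "स्") and the name `phonetic_key` are our own choices, and the rules are deliberately aggressive.

```python
import unicodedata

NUKTA = "\u093c"
VIRAMA = "\u094d"  # halant
FOLD = str.maketrans({
    "\u0901": "\u0902",  # chandrabindu -> anusvara
    "\u0940": "\u093f",  # long vowel sign ii -> short i
    "\u0942": "\u0941",  # long vowel sign uu -> short u
    "\u0908": "\u0907",  # independent vowel II -> I
    "\u090a": "\u0909",  # independent vowel UU -> U
})


def phonetic_key(word):
    """Return a crude spelling-insensitive key for a Devanagari word."""
    text = unicodedata.normalize("NFD", word)   # split precomposed nukta letters
    text = text.replace(NUKTA, "")
    if text.startswith("\u0907\u0938\u094d"):   # prothetic "इ" before "स्"
        text = text[1:]
    text = text.translate(FOLD)
    return text.replace(VIRAMA, "")


variants = ["स्कूल", "सकूल", "इस्कूल"]
print({v: phonetic_key(v) for v in variants})   # all three share one key
```

Such a key trades precision for recall: unrelated words can collide under it, so in practice it would be used to propose candidate variants rather than to merge terms blindly.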

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
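A retrieval system can compensate for the user's choice of word by expanding the query with synonyms. The following is a minimal sketch, assuming a small hand-built synonym table; the table contents follow our reading of the example above, and `SYNONYMS` and `expand_query` are hypothetical names.

```python
# Hypothetical synonym table; a real system would draw on a resource such as a WordNet.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym substitution."""
    terms = query.split()
    variants = [query]
    for i, term in enumerate(terms):
        for alt in synonyms.get(term, []):
            variants.append(" ".join(terms[:i] + [alt] + terms[i + 1:]))
    return variants


for q in expand_query("आभूषण"):
    print(q)
```

Each variant can be submitted separately and the result lists merged, or the variants can be joined with the engine's OR operator where one is available.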

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold

(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. "to sleep". In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query: (In English) The common people's choice

(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर in the sense of "tax" has the synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स; in another context कर means "hand" or "arm", with synonyms such as हस्त and बाहु; and in yet another context कर relates to the verb "to do" (करना).
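Sense selection from context can be sketched with a simplified overlap method in the spirit of the Lesk algorithm: each sense carries a set of cue words, and the sense whose cues overlap most with the other query terms is chosen. The cue words below are invented for illustration, and `SENSES` and `disambiguate` are hypothetical names.

```python
# Hypothetical cue words for each sense of an ambiguous keyword.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "चांदी", "कीमत", "खरीदना"},
        "to sleep": {"रात", "नींद", "बिस्तर", "जल्दी"},
    },
}


def disambiguate(word, query_terms, senses=SENSES):
    """Pick the sense with the largest cue overlap; None if the context gives no cue."""
    context = set(query_terms) - {word}
    scores = {sense: len(cues & context) for sense, cues in senses[word].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("सोना", ["सोना", "कीमत", "आज"]))              # gold
print(disambiguate("सोना", ["रात", "जल्दी", "सोना"]))             # to sleep
print(disambiguate("सोना", ["नारी", "को", "सोना", "पसंद", "है"]))  # None
```

The last call mirrors the query discussed above: with no cue word in the context, the method cannot decide, which is exactly the situation in which a search engine returns a mixture of both senses.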

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speaking but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No.   Query in Hindi                      Meaning in English
1        भारत वर्षावन                          Indian rain forest
1.1      भारतीय वर्षावनों                       Indian rain forests
2        हवाई दुर्घटना का कारण                  Reason for air crash
2.1      हवाई दुर्घटनाओं का कारण                Reason for air crashes
3        भारत में बोली जाने वाली भाषा            Language spoken in India
3.1      भारत में बोली जाने वाली भाषाएं           Languages spoken in India
3.2      भारत में बोली जाने वाली भाषाओं          Languages spoken in India
4        विलुप्त होने पर झील                    Lake on the verge of extinction
4.1      विलुप्त होने पर झीलें                   Lakes on the verge of extinction
5        प्राकृतिक आपदा                        Natural calamity
5.1      प्राकृतिक आपदाएं                      Natural calamities
6        पक्षी की प्रजाति                       Bird species
6.1      पक्षियों की प्रजाति                     Birds' species
6.2      पक्षियों की प्रजातियां                   Birds' species
7        कृषि समस्या                          Agriculture problem
7.1      कृषि समस्याओं                        Agriculture problems
8        कीटनाशक का इस्तेमाल                  Use of pesticide
8.1      कीटनाशकों का इस्तेमाल                 Use of pesticides
9        मानसिक रोग                          Mental illness
9.1      मानसिक रोगियों                       Mental patients
9.2      मानसिक रोगी                         Mental patient
10       ग्रामीण विकास योजना                   Policy for village
10.1     ग्रामीण विकास योजनाओं                 Policies for village
10.2     ग्रामीण विकास योजनाएं                  Policies for village
11       प्रमुख कृषि केंद्र                       Major agricultural office
11.1     प्रमुख कृषि केन्द्रों                     Major agricultural offices

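Table 4.18 Effect of morphological factors on Hindi queries

For each root word, the first line gives the keywords listed by Google, Bing and Guruji; the lines below it give each morphological variant and the number of documents returned.

S. No.   Morphological variant     Documents returned: Google   Bing     Guruji

1   Root words: भारत, वर्षावन. Keywords listed: भारत, वर्षावन, वन (assignment to individual engines not recoverable from the source)
1.1      भारत वर्षावन              50500    4680     485
1.2      भारतीय वर्षावनों           40400    680      61

2   Root word: दुर्घटना. Keywords listed: Google दुर्घटना, दुर्घटनाओं; Bing दुर्घटना; Guruji दुर्घटना
2.1      दुर्घटना                  133000   2410     284
2.2      दुर्घटनाओं                117000   420      23

3   Root word: भाषा. Keywords listed: Google भाषा, भाषाओं; Bing भाषा; Guruji भाषा
3.1      भाषा                     161000   8330     961
3.2      भाषाएं                   6200     935      188
3.3      भाषाओं                   6090     441      356

4   Root word: झील. Keywords listed: Google झील, झीलों; Bing झील; Guruji झील
4.1      झील                      4740     278      25
4.2      झीलें                    1270     28       1

5   Root word: आपदा. Keywords listed: Google आपदा, आपदाओं; Bing आपदा; Guruji आपदा
5.1      आपदा                     102000   4030     410
5.2      आपदाएं                   1160     64       20

6   Root words: पक्षी, प्रजाति. Keywords listed: Google पक्षी, पक्षियों, प्रजाति, प्रजातियों; Bing पक्षी, प्रजाति; Guruji पक्षी, प्रजाति
6.1      पक्षी प्रजाति              48200    1670     98
6.2      पक्षियों प्रजाति            47600    1150     84
6.3      पक्षियों प्रजातियां          33800    747      25

7   Root word: समस्या. Keywords listed: Google समस्याएं, समस्याओं, समस्या; Bing समस्या; Guruji समस्या
7.1      समस्या                   584000   30200    1889
7.2      समस्याओं                 584000   7150     1356

8   Root word: कीटनाशक. Keywords listed: Google कीटनाशक, कीटनाशकों; Bing कीटनाशक; Guruji कीटनाशक
8.1      कीटनाशक                 36300    1360     333
8.2      कीटनाशकों                35800    800      270

9   Root word: रोग. Keywords listed: Google रोग, रोगों; Bing रोग; Guruji रोग
9.1      रोग                      205000   21600    1423
9.2      रोगियों                   128000   3280     239
9.3      रोगी                     112000   6280     647

10  Root word: योजना. Keywords listed: Google योजनाओं, योजना; Bing योजना; Guruji योजना
10.1     योजना                    673000   18500    3343
10.2     योजनाओं                  669000   6020     990
10.3     योजनाएं                   673000   2860     416

11  Root word: केंद्र. Keywords listed: Google केंद्र, केंद्रीय; Bing केंद्र; Guruji केंद्र
11.1     केंद्र                     261000   11300    655
11.2     केन्द्रों                   29500    1850     105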


Figure 4.1 Precision values of the three search engines (chart; vertical axis: precision from 0 to 1; horizontal axis: query numbers 1.1 to 11.1; series: Google, Bing, Guruji)

Table 4.19 Precision values of the three search engines

S. No. | Query | Precision @10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2


It has been observed that all three search engines return more documents when the query is submitted with the keyword in its root form. This supports searching with root words, since keywords in their root form generally give better results.

It has also been observed that only Google lists the morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. Overall, the tables above show that the quantity of results generally increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On close observation, Google's precision is high for almost all queries. Since Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which shows that morphological variation in keywords affects not only the quantity but also the quality of results.
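The precision figure used throughout this chapter can be sketched as follows. This is a minimal illustration, assuming that each of the top 10 results has been judged manually as relevant or not; the function name and the sample judgments are hypothetical and are not taken from the study.

```python
def precision_at_k(relevance, k=10):
    """Return the fraction of the top-k results that were judged relevant.

    `relevance` is a list of booleans in rank order (True = relevant).
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the denominator stays k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


# Hypothetical judgments for one query: 9 of the first 10 results relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Keeping the denominator fixed at k penalises engines that return fewer than ten results, which matches how a zero is reported when an engine finds nothing.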

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and the search engine should support all of its phonetically equivalent forms.

The following queries were randomly selected from a set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

Hindi Query

With Bold

Standard

Keywords

Phonetic variations of the Keywords Google Results

for query having

keywords

No of

Results

Precision

10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05


Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, each covering the query and its phonetic variants)

The table and figure above show that search engines return a handful of documents for various phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with very little repetition. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
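Because each spelling retrieves a largely different set of documents, one way to cover them is to issue every combination of spellings. The sketch below only illustrates that idea; the function name and the two sample spellings (anusvara versus nasal-conjunct forms) are assumptions chosen for illustration, not the variant lists used in the experiments.

```python
from itertools import product


def expand_query(keywords, variants):
    """Return every query obtained by swapping keywords for their spelling variants.

    `keywords` is the query as a list of words; `variants` maps a keyword to
    its alternative spellings. Keywords without an entry are kept unchanged.
    """
    choices = [[word] + list(variants.get(word, [])) for word in keywords]
    return [" ".join(combo) for combo in product(*choices)]


# Illustrative spellings only: anusvara and nasal-conjunct forms of two words.
spelling_variants = {
    "हिंदी": ["हिन्दी"],
    "आंदोलन": ["आन्दोलन"],
}
queries = expand_query(["हिंदी", "आंदोलन"], spelling_variants)
```

The number of generated queries grows as the product of the variant counts, so a practical tool would probably cap the variants per keyword.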

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.


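Table 4.21 Effect of word synonyms on Hindi IR

S. No. | Query | Google: documents returned | Google: precision @10 | Bing: documents returned | Bing: precision @10 | Guruji: documents returned | Guruji: precision @10
1 | स न क आब षण (standard Hindi word: आब षण; synonyms: गहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi word: सतर; synonyms: न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1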


Figure 4.3 Comparison of precision values against three search engines

The examples above show that using synonyms of Hindi keywords improves information retrieval for a query in the Hindi language. Synonyms affect not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first, that is, its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
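The "average case" comparison can be checked with a few lines of arithmetic. The sketch below uses the precision values of the five variants of query 1 in Table 4.21, assuming the extracted figures such as 08 and 05 stand for 0.8 and 0.5; the variable names are illustrative.

```python
from statistics import mean

# Precision @10 for the five variants of query 1, read from Table 4.21
# (assumption: the extracted values 08, 05, ... stand for 0.8, 0.5, ...).
precision = {
    "Google": [0.8, 0.8, 0.8, 0.5, 0.5],
    "Bing": [0.7, 0.6, 0.7, 0.4, 0.3],
    "Guruji": [0.5, 0.5, 0.6, 0.0, 0.0],
}

# Average precision per engine, rounded to two decimals for reporting.
average = {engine: round(mean(values), 2) for engine, values in precision.items()}

# Engines ordered from highest to lowest average precision.
ranking = sorted(average, key=average.get, reverse=True)
```

For this query the averages come to 0.68 for Google, 0.54 for Bing and 0.32 for Guruji, which agrees with the ordering described in the text.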

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for each context of the ambiguous keyword.


Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, unwanted context. All three search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the unwanted context "snake head" and Bing returns 9, whereas Guruji fails to retrieve even a single document. Search engines therefore need to address ambiguity in keywords in order to return better results.
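The per-context counts in the ambiguity tables can be tallied as sketched below. This is only an illustration, assuming each of the top 10 documents has been labelled by hand with the context in which the keyword occurs; the function name and label strings are hypothetical.

```python
from collections import Counter


def context_breakdown(labels, intended):
    """Summarise hand-assigned context labels for an ambiguous query.

    `labels` holds one context label per retrieved document; `intended` is
    the context the user meant. Returns the per-context counts and the share
    of documents that match the intended context.
    """
    counts = Counter(labels)
    if not labels:
        return counts, 0.0
    return counts, counts[intended] / len(labels)


# Google's top 10 for query 5 in Table 4.23: 2 "aggregate", 3 "family", 5 other.
labels = ["aggregate"] * 2 + ["family"] * 3 + ["other"] * 5
counts, intended_share = context_breakdown(labels, intended="aggregate")
```

Here `counts["other"]` reproduces the "Other context" column, while `intended_share` shows how little of the result list matches the sense the user had in mind.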


4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested.

The table below shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Hindi forms as extracted (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

Table 4.26 Keywords along with their phonetic variants written in Hindi

The table above shows that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, often in large numbers. People evidently have their own ways of writing such words in Hindi, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the first Hindi form in each row (printed in bold in the original table) is the keyword obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.

The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, again without changing the meaning of the original Hindi query. Example:

The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Thus the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries that mix English and Hindi while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
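The two-level transformation can be sketched as below. This is not the software described in the next chapter; it is a minimal illustration that uses romanised stand-ins for the example query, and the function name and replacement dictionary are assumptions. Unlike the single level-1 variant shown in the example, the sketch enumerates every possible single replacement.

```python
from itertools import combinations


def transform_levels(words, english_equivalents):
    """Split query variants into level 1 (one word replaced) and level 2 (several).

    `english_equivalents` maps a Hindi keyword to the English word that
    replaces it (written in Hindi script in practice).
    """
    replaceable = [i for i, w in enumerate(words) if w in english_equivalents]
    levels = {1: [], 2: []}
    for count in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, count):
            variant = [english_equivalents[w] if i in positions else w
                       for i, w in enumerate(words)]
            levels[1 if count == 1 else 2].append(" ".join(variant))
    return levels


# Romanised stand-ins for the example query "videshi nivesh Bharat mein".
levels = transform_levels(
    ["videshi", "nivesh", "bharat", "mein"],
    {"videshi": "foreign", "nivesh": "investment"},
)
```

With two replaceable keywords the sketch yields two level-1 variants and one level-2 variant; words without an English equivalent are left untouched, so the meaning of the query is preserved.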


Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4 Queries transformed into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each showing the result counts for the Hindi query, Level 1 and Level 2)


The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and in considerable numbers. Since the quality of results is more important than the quantity for search engines, Table 4.28 and Figure 4.5 present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve web results.

Table 4.28 Analysis of precision values (tabular representation)

Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google precision @10 (Hindi query, Level 1, Level 2) | Bing precision @10 (Hindi query, Level 1, Level 2) | AltaVista precision @10 (Hindi query, Level 1, Level 2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1


Figure 4.5 Analysis of inter-precision values (Inter Precision Chart; horizontal axis: HQ1 to HQ6, each followed by L1 and L2, where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista)

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. Transforming the queries increased the amount of data retrieved, and the precision columns show that the retrieved data is relevant. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant; a similar pattern holds for the rest of the Hindi queries in the table above. Without query transformation, the user may miss relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variants that reflect English influence. The table above also indicates that English-influenced Hindi information is present on the web and is increasing day by day. Including English keywords written in Hindi script in the query can therefore widen the scope of searching in Hindi and of retrieving relevant information.


The structure of the Hindi language makes Hindi IR more difficult. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level may cause performance and throughput problems. However, this problem can be solved at the interface level. To reduce the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of web search in the Hindi language [66][67].


The Hindi "r" sound is typically a flap. However, some speakers may occasionally trill the "r" sound, or may even pronounce it closer to an unflapped approximant, like the English "r" in "red".

4.2.16 Sibilants

Letter | Description
श | "sh" as in "shave"
ष | like "sh", but retroflex
स | "s" as in "save"

Table 4.9 Sibilants

4.2.17 Glottal

Letter | Description
ह | like "h", but voiced

Table 4.10 Glottal

4.2.18 Allophony of "w" and "v" in Hindi

A phoneme is an equivalence class of atomic, discrete sounds that can produce a difference in meaning when spoken, yet cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English, the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones that are both members of the same phoneme are called allophones (from Greek, "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.

Thus, व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज़्यादा: some speakers pronounce this as "zyada" and some as "jyada".
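For retrieval, this j/z variation means the same word may be stored with or without the nuqta. One possible way to make the two spellings match is to strip the nuqta before comparison; the sketch below illustrates this with Python's standard `unicodedata` module. The function name is hypothetical, and the approach is an assumption of this example rather than something the search engines are known to do.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nuqta(text):
    """Remove the nuqta so that dotted and undotted spellings compare equal."""
    # NFD splits precomposed letters such as U+095B (za) into letter + nuqta.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


dotted = "\u091c\u093c\u094d\u092f\u093e\u0926\u093e"    # zyada, written with nuqta
undotted = "\u091c\u094d\u092f\u093e\u0926\u093e"        # jyada, written without
same_word = strip_nuqta(dotted) == strip_nuqta(undotted)
```

This folding discards a real distinction for speakers who do separate "z" from "j", so it is better suited to matching than to display.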

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tongue on the alveolar ridge behind the top teeth. This place of articulation lies between the Devanagari retroflex and dental positions, although the English pronunciation sounds much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.

Capital Letters

Devanagari has no capital letters.

Special Matraa Forms of उ and ऊ with र

र + उ = रु
र + ऊ = रू

4.2.20 Borrowed Sounds

There are six additional sounds used in Hindi that have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol that is phonetically similar. These symbols represent sounds from other languages, such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter | Approximation
क़ | like "k", but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़, but voiced
ज़ | just as English "z", as in "zoo"
झ़ | similar to the "s" in English "vision"
फ़ | just as English "f"

Table 4.11 Foreign Sounds

Only two of the borrowed sounds, ज़ and फ़, are typically pronounced distinctly from the non-nuqta forms.


4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically used only when typographical difficulties make it hard to use conjuncts.

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and the modified consonant symbol is then conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ़ + त = फ़्त (मुफ़्त)
क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ़ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
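In Unicode text, a conjunct is not stored as a separate character. It is stored as the first consonant, the halant, and the second consonant, and the font decides whether to draw a ligature. The sketch below shows this with the standard `unicodedata` module; the helper name `conjunct` is illustrative.

```python
import unicodedata

VIRAMA = "\u094d"  # the halant; Unicode calls it DEVANAGARI SIGN VIRAMA


def conjunct(*consonants):
    """Join consonants with the halant so that a font can render them as a conjunct."""
    return VIRAMA.join(consonants)


nd = conjunct("\u0928", "\u0926")     # na + da, as in the first example above
codes = [unicodedata.name(ch) for ch in nd]
```

Because the stored sequence is the same whichever glyph the font draws, a halant form and a ligature form of the same conjunct match as search keys.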


4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either as a regular conjunct (e.g. न + द = न्द, as in हिन्दी) or with an anusvar, which is a dot written above the horizontal line to the right of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated at the same place in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant that is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
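Since the nasal conjunct and the anusvar spell the same word, an index could fold one into the other. The sketch below rewrites a nasal consonant plus halant as an anusvar only when the next consonant belongs to the same place of articulation, following the homorganic rule described above. The table of consonant groups and the function name are assumptions of this example.

```python
import re

HALANT = "\u094d"
ANUSVARA = "\u0902"

# Each nasal consonant and the stops of its own place of articulation.
HOMORGANIC = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # velar nga before ka kha ga gha
    "\u091e": "\u091a\u091b\u091c\u091d",  # palatal nya before ca cha ja jha
    "\u0923": "\u091f\u0920\u0921\u0922",  # retroflex nna before tta ttha dda ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # dental na before ta tha da dha
    "\u092e": "\u092a\u092b\u092c\u092d",  # labial ma before pa pha ba bha
}

# One alternative per nasal: nasal + halant, looked ahead by a homorganic stop.
_PATTERN = re.compile("|".join(
    f"{nasal}{HALANT}(?=[{following}])" for nasal, following in HOMORGANIC.items()
))


def to_anusvara(text):
    """Rewrite nasal + halant before a homorganic stop as an anusvar."""
    return _PATTERN.sub(ANUSVARA, text)
```

Applied to the two examples above, the conjunct spellings of "Hindi" and "anda" become their anusvar spellings, while a conjunct such as न + य is left unchanged because य is not a dental stop.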

Conjuncts with य

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 100

As the first member of a conjunct यappears like a small hook or sickle above

and to the right of the following consonant

य + भ = भम शभ म

य + ट + ई = टम ऩ टी

As the second member of a conjunct यis indicated by a diagonal line adjoined to

the vertical line of the preceding consonant

क + य = कर श ककरम

भ + य = मर उमर

Four consonants ट ठ ड ढ do not have any vertical line so they indicate a

following यwith the symbol like an inverted v as follows

ट + य = टर य षटटर

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष

त + त = त्त

त + र = त्र

ज + ञ = ज्ञ

द + द = द्द

द + ध = द्ध

द + य = द्य

द + व = द्व

श + र = श्र

ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct. There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam (।), a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.

[59][60]
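Since a sentence may end either with the viraam or with a borrowed mark, a program that segments Hindi text has to accept both. The following Python sketch shows one way to do this with a regular expression. The function name `split_sentences` is a hypothetical helper, and the rule is deliberately simple: it would, for instance, also split after the period of an abbreviation.

```python
import re

# The viraam is encoded as U+0964 DEVANAGARI DANDA.
SENTENCE_END = re.compile(r"(?<=[\u0964.?!])\s*")


def split_sentences(text: str) -> list:
    """Split text after a viraam, period, question mark or exclamation mark."""
    parts = SENTENCE_END.split(text)
    return [part.strip() for part in parts if part.strip()]
```

The terminating mark stays attached to its sentence, so the original text can be rebuilt by joining the pieces with spaces.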

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a code point in the range U+0000 to U+10FFFF, which fits within 32 bits and can potentially give a unique number to each character in all languages known to man.
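The idea of "a number for each character" can be inspected directly. The short Python sketch below prints the code point of a character and checks whether it falls in the 7-bit ASCII range; the helper names `code_point` and `is_ascii` are illustrative only.

```python
def code_point(ch: str) -> str:
    """Return the Unicode code point of a single character as 'U+XXXX'."""
    return f"U+{ord(ch):04X}"


def is_ascii(ch: str) -> bool:
    """ASCII is a 7-bit code, so its characters are numbered 0 to 127."""
    return ord(ch) < 128


if __name__ == "__main__":
    for ch in ("A", "\u0915", "\u0939"):  # Latin A, Devanagari ka and ha
        print(ch, code_point(ch), "ASCII" if is_ascii(ch) else "non-ASCII")
```

The Latin letter keeps the number it has in ASCII, while the Devanagari letters receive numbers well above 127.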

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms, as defined by the Unicode Consortium, are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format.
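The difference between the three encoding forms is easy to observe by encoding the same text in each and counting bytes. The sketch below does this in Python; `encoded_sizes` is a hypothetical helper name, and the little-endian codec variants are chosen only so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def round_trips(text: str) -> bool:
    """Check that converting between the forms loses no data."""
    return all(
        text.encode(codec).decode(codec) == text
        for codec in ("utf-8", "utf-16-le", "utf-32-le")
    )


if __name__ == "__main__":
    for sample in ("A", "\u0915"):  # Latin A and Devanagari ka
        print(sample, encoded_sizes(sample), round_trips(sample))
```

A Latin letter takes 1, 2 and 4 bytes respectively, whereas a Devanagari letter takes 3, 2 and 4, which is why Hindi text is often more compact in UTF-16 than in UTF-8.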

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme. The original design allowed sequences of 2 to 6 bytes, but the current definition of UTF-8 limits a sequence to at most 4 bytes, and the most commonly used characters, including those of Devanagari, are only up to three bytes long. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows the application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
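The byte-level structure described above can be made visible by printing the bits of an encoded character. In the Python sketch below, `utf8_bits` and `sequence_length` are illustrative names; the second function reads the length of a sequence from its first byte, assuming the current four-byte limit of UTF-8.

```python
def utf8_bits(ch: str) -> list:
    """Return each UTF-8 byte of `ch` as an 8-character bit string."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]


def sequence_length(lead: int) -> int:
    """Return the total sequence length announced by a lead byte."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: one continuation byte follows
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: two continuation bytes follow
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: three continuation bytes follow
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


if __name__ == "__main__":
    print(utf8_bits("\u0915"))  # Devanagari ka, U+0915
```

For the Devanagari letter क every byte has its top bit set, the first byte begins with 1110 (a three-byte sequence), and the two bytes that follow begin with 10, the marker of a continuation byte.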

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages, another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
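In the virama model described above, a conjunct is not stored as a separate character: it is the sequence consonant, virama, consonant, and the font decides how to draw it. The Python sketch below builds such sequences; `conjunct` is a hypothetical helper name, and whether a given sequence is displayed as a ligature depends on the font and rendering engine in use.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def conjunct(*consonants: str) -> str:
    """Join consonants with viramas, suppressing each inherent vowel but the last."""
    return VIRAMA.join(consonants)


if __name__ == "__main__":
    ksha = conjunct("\u0915", "\u0937")  # ka + virama + ssa
    print(ksha, [f"U+{ord(c):04X}" for c in ksha])
    # The virama is a nonspacing combining mark (general category Mn).
    print(unicodedata.name(VIRAMA), unicodedata.category(VIRAMA))
```

The conjunct क्ष therefore occupies three code points in a text file even though it is displayed as one shape, which matters when counting characters or matching query terms.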

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
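Because the nine scripts share one layout, a character can be moved from the Devanagari block to the analogous position in another block by adding a fixed offset. The Python sketch below illustrates this under stated assumptions: `shift_script` is a hypothetical helper, the block start values are those of the Unicode code charts, and no check is made that the target position is actually assigned (several positions are empty in some scripts, notably Tamil), so this is a rough mapping rather than a full transliteration.

```python
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F

# First code point of each block that follows the same ISCII-based layout.
BLOCK_START = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text: str, target: str) -> str:
    """Move Devanagari characters to the same relative position in `target`."""
    offset = BLOCK_START[target] - DEVANAGARI_START
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            out.append(chr(cp + offset))
        else:
            out.append(ch)  # spaces, digits, Latin text and so on
    return "".join(out)


if __name__ == "__main__":
    print(shift_script("\u0915\u092E\u0932", "gujarati"))  # the word kamal
```

A practical converter would add a table of exceptions for the positions where a script has no counterpart to the Devanagari character.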

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
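Logical order means that the stored sequence follows pronunciation rather than the visual arrangement. The vowel sign ि is drawn to the left of its consonant, yet it is stored after it. The Python sketch below lists the stored code points of a string with their official Unicode names; `logical_order` is an illustrative helper name.

```python
import unicodedata


def logical_order(text: str) -> list:
    """Return (code point, Unicode name) pairs in the order they are stored."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    # The syllable "hi": the letter ha followed by the dependent vowel sign i.
    for code, name in logical_order("\u0939\u093F"):
        print(code, name)
```

Software that compares or indexes Hindi text therefore works on the stored sequence, and leaves the reordering of glyphs to the rendering engine.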

4.4 Indian Languages on internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium (a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India) proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides a corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. In the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H), and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was conceived by the Indian Government in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent the English as well as the Hindi grammar. Initially, it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present, it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government, and it is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility to the user: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses statistical approaches. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.aukbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates an English sentence into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre-Processing, Translation Engine and Post-Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine Guruji

Guruji.com is the first Indian language search engine. It was founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the assistance of Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides an automated translation facility from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, and it has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.1.3 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian-language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.1.3.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines must have been launched, some small and some big, some with huge funding and some with none. The space is now so crowded that it is difficult to know who is really winning. Nevertheless, we attempt a brief overview of the current scenario. The following fall into the localized Indian search engine category: Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days ago. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website    2008 Stats    2009 Stats
Google     37            35
Yahoo      32            25
Rediff     7             4
Orkut      6             7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website      2008 Stats    2009 Stats
Google       81            76
Yahoo        7             7
Wikipedia    3             6
English      3             4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website      2008 Stats    2009 Stats
Google       65            34
Yahoo        12            29
Rediff       4             15
Teluguone    2             0
Guruji       1             18
Raftaar      1             0.2
Hindi        1             NA
Webdunia     1             0.7
Khoj         0.7           2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed with Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian-language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. Merely stating that there is a recall problem is not sufficient, though. The next obvious question is "how much of a problem is it?" In other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels the recall problem occurs, and by how much.


Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                              Google    Yahoo    Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                         0         0        0
इन्डियन इंस्टिट्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च              0         0        0

Table 4.16 Improved recall

Hindi Query                                              Google    Yahoo    Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                             8820      92       12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                             331       10       1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च              708       50       1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध                    37100     7400     93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, grammatical and structural standards are seldom followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, as the official language, is influenced by the regional languages, which changes its dialects not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, while Hindi uses MAATRAAS such as एँ, याँ and ओं. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
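The idea of reducing morphological variants to one canonical form can be sketched with a simple suffix-replacement stemmer. This is only an illustration under stated assumptions: the rule list `SUFFIX_RULES` and the function `stem` are hypothetical, cover a handful of plural and oblique endings, and are not the stemmer used by any of the search engines discussed here.

```python
# Hypothetical, illustrative subset of Hindi plural/oblique endings.
# Each pair maps an inflectional ending to the ending of the root form.
SUFFIX_RULES = [
    ("ाओं", "ा"),   # योजनाओं -> योजना
    ("ाएँ", "ा"),   # योजनाएँ -> योजना
    ("ाएं", "ा"),   # same ending written with anusvara
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ियाँ", "ी"),
    ("ों", ""),     # कीटनाशकों -> कीटनाशक
    ("ें", ""),     # झीलें -> झील
]


def stem(word: str, min_stem: int = 2) -> str:
    """Replace the longest matching ending; leave the word unchanged otherwise."""
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        # Require a minimum remaining length so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["योजना", "योजनाओं", "योजनाएँ"]:
        print(w, "->", stem(w))
```

A rule list of this kind over-stems or under-stems some words, so a practical system would normally combine it with a lexicon of known root words.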

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabet. Together these produce several spellings of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible spelling variations.

Numerous words are phonetically equivalent but written differently. The word "school" can be written in Hindi in several ways (स्कूल and its variant spellings). When the standard keyword and a non-standard phonetic equivalent are searched, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety; for example, the English word "school" (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered: a user may use any of these forms, and the search engine should support all of them.

Moreover, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with very little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her needs.
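One way a retrieval system could treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalized key before matching. The sketch below is an assumption for illustration only: the function `phonetic_key` and its three rules (drop the nukta, write chandrabindu as anusvara, write a half nasal consonant before another consonant as anusvara) are hypothetical and cover only a few common sources of variation.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# A nasal consonant (ङ ञ ण न म) followed by virama, directly before a consonant.
HALF_NASAL = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")


def phonetic_key(text: str) -> str:
    """Map spelling variants of a Devanagari word to one comparison key."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", text)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    return HALF_NASAL.sub(ANUSVARA, text)


if __name__ == "__main__":
    for w in ["आन्दोलन", "आंदोलन", "आँदोलन", "आज़ादी", "आजादी"]:
        print(w, "->", phonetic_key(w))
```

The key is meant for comparison only, not for display: it deliberately discards distinctions that a reader would still want to see in the original spelling.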

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.
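The effect of synonyms on a query can be illustrated by generating one query variant per synonym substitution. This is a minimal sketch under assumptions: the table `SYNONYMS` and the function `expand_query` are hypothetical, and a real system would draw synonyms from a lexical resource rather than a hand-written dictionary.

```python
from typing import Dict, List

# Hypothetical synonym table for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by one variant per synonym substitution."""
    terms = query.split()
    variants = [query]
    for i, term in enumerate(terms):
        for alt in synonyms.get(term, []):
            variant = " ".join(terms[:i] + [alt] + terms[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    for q in expand_query("सोने के आभूषण", SYNONYMS):
        print(q)
```

Each variant can then be submitted separately, or the variants can be combined into a single disjunctive query where the search engine supports it.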

4.8.4 Ambiguous Words

Ambiguous words reduce the relevancy of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the query above, सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the query above it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task, because one word has more than one meaning and the meaning depends on the context of the sentence. For example, कर means "tax" in one context, with synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स; "hand" or "arm" in another context, with synonyms such as हस्त and बाहु; and "to do" (करना) in yet another context.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the factors described above on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No.   Query in Hindi                      Meaning in English
1        भारत वर्षावन                        Indian rain forest
1.1      भारतीय वर्षावनों                    Indian rain forests
2        हवाई दुर्घटना का कारण               Reason for air crash
2.1      हवाई दुर्घटनाओं का कारण             Reason for air crashes
3        भारत में बोली जाने वाली भाषा         Language spoken in India
3.1      भारत में बोली जाने वाली भाषाएँ       Languages spoken in India
3.2      भारत में बोली जाने वाली भाषाओं       Languages spoken in India
4        विलुप्त होने पर झील                  Lake on the verge of extinction
4.1      विलुप्त होने पर झीलें                 Lakes on the verge of extinction
5        प्राकृतिक आपदा                       Natural calamity
5.1      प्राकृतिक आपदाएँ                     Natural calamities
6        पक्षी की प्रजाति                      Bird species
6.1      पक्षियों की प्रजाति                   Birds' species
6.2      पक्षियों की प्रजातियाँ                 Birds' species
7        कृषि समस्या                          Agriculture problem
7.1      कृषि समस्याओं                        Agriculture problems
8        कीटनाशक का इस्तेमाल                  Use of pesticide
8.1      कीटनाशकों का इस्तेमाल                Use of pesticides
9        मानसिक रोग                           Mental illness
9.1      मानसिक रोगियों                       Mental patients
9.2      मानसिक रोगी                          Mental patient
10       ग्रामीण विकास योजना                  Policy for village
10.1     ग्रामीण विकास योजनाओं                Policies for village
10.2     ग्रामीण विकास योजनाएँ                Policies for village
11       प्रमुख कृषि केंद्र                     Major agricultural office
11.1     प्रमुख कृषि केन्द्रों                  Major agricultural offices


Table 4.18 Effect of morphological factors on Hindi queries

(a) Listing of keywords by each search engine

S. No.   Root word(s)        Google                                   Bing               Guruji
1        भारत, वर्षावन       भारत, वर्षावन, वर्षा, वन                 भारत, वर्षावन      भारत, वर्षावन
2        दुर्घटना            दुर्घटना, दुर्घटनाओं                      दुर्घटना           दुर्घटना
3        भाषा                भाषा, भाषाओं                             भाषा               भाषा
4        झील                 झील, झीलों                               झील                झील
5        आपदा                आपदा, आपदाओं                             आपदा               आपदा
6        पक्षी, प्रजाति       पक्षी, पक्षियों, प्रजाति, प्रजातियों        पक्षी, प्रजाति      पक्षी, प्रजाति
7        समस्या              समस्या, समस्याएँ, समस्याओं                समस्या             समस्या
8        कीटनाशक             कीटनाशक, कीटनाशकों                       कीटनाशक            कीटनाशक
9        रोग                 रोग, रोगों, रोगी                          रोग                रोग
10       योजना               योजना, योजनाओं                           योजना              योजना
11       केंद्र               केंद्र, केंद्रीय                            केंद्र              केंद्र

(b) Documents returned for each morphological variant

S. No.   Morphological variant      Google     Bing      Guruji
1.1      भारत वर्षावन               50500      4680      485
1.2      भारतीय वर्षावनों           40400      680       61
2.1      दुर्घटना                   133000     2410      284
2.2      दुर्घटनाओं                 117000     420       23
3.1      भाषा                       161000     8330      961
3.2      भाषाएँ                     6200       935       188
3.3      भाषाओं                     6090       441       356
4.1      झील                        4740       278       25
4.2      झीलें                       1270       28        1
5.1      आपदा                       102000     4030      410
5.2      आपदाएँ                     1160       64        20
6.1      पक्षी प्रजाति               48200      1670      98
6.2      पक्षियों प्रजाति            47600      1150      84
6.3      पक्षियों प्रजातियाँ          33800      747       25
7.1      समस्या                     584000     30200     1889
7.2      समस्याओं                   584000     7150      1356
8.1      कीटनाशक                    36300      1360      333
8.2      कीटनाशकों                  35800      800       270
9.1      रोग                        205000     21600     1423
9.2      रोगियों                    128000     3280      239
9.3      रोगी                       112000     6280      647
10.1     योजना                      673000     18500     3343
10.2     योजनाओं                    669000     6020      990
10.3     योजनाएँ                    673000     2860      416
11.1     केंद्र                      261000     11300     655
11.2     केन्द्रों                   29500      1850      105


Table 4.19 Precision values of the three search engines

S. No.   Query                               Precision@10
                                             Google    Bing    Guruji
1.1      भारत वर्षावन                        0.5       0.3     0.1
1.2      भारतीय वर्षावनों                    0.3       0.1     0.1
2.1      हवाई दुर्घटना का कारण               0.5       0.3     0.1
2.2      हवाई दुर्घटनाओं का कारण             0.3       0.3     0.1
3.1      भारत में बोली जाने वाली भाषा         0.9       0.5     0.5
3.2      भारत में बोली जाने वाली भाषाएँ       0.5       0.4     0.3
3.3      भारत में बोली जाने वाली भाषाओं       0.5       0.2     0.2
4.1      विलुप्त होने पर झील                  0.5       0.3     0.2
4.2      विलुप्त होने पर झीलें                 0.5       0.3     0
5.1      प्राकृतिक आपदा                       1         0.4     0.3
5.2      प्राकृतिक आपदाएँ                     0.6       0.6     0.1
6.1      पक्षी की प्रजाति                      1         0.7     0.2
6.2      पक्षियों की प्रजाति                   0.9       0.6     0.1
6.3      पक्षियों की प्रजातियाँ                 0.9       0.6     0.1
7.1      कृषि समस्या                          0.7       0.3     0.2
7.2      कृषि समस्याओं                        0.7       0.4     0.2
8.1      कीटनाशक का इस्तेमाल                  1         0.8     0.2
8.2      कीटनाशकों का इस्तेमाल                0.9       0.6     0.2
9.1      मानसिक रोग                           0.7       0.6     0.4
9.2      मानसिक रोगियों                       0.9       0.6     0.3
9.3      मानसिक रोगी                          0.9       0.5     0.4
10.1     ग्रामीण विकास योजना                  0.9       0.7     0.3
10.2     ग्रामीण विकास योजनाओं                1         0.6     0
10.3     ग्रामीण विकास योजनाएँ                0.8       0.6     0.2
11.1     प्रमुख कृषि केंद्र                     0.5       0.4     0
11.2     प्रमुख कृषi केन्द्रों                  0.4       0.2     0.2

Figure 4.1 Precision values of the three search engines (chart of precision for Google, Bing and Guruji by query number; vertical axis from 0 to 1)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents by root word because, in general, better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

The results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On close observation, Google's precision is the highest in almost all queries. Since, as noted above, Google indexes keywords in their root form, the relevancy of its results is also high compared with the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of the results.
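The precision value over the top 10 results can be computed as the fraction of those results judged relevant. The sketch below assumes that relevance judgments are available as a list of booleans in rank order; the function name `precision_at_k` is chosen here for illustration.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevant` holds one judgment per result, in rank order. If fewer than
    k results were returned, the missing ranks count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevant[:k] if r) / k


if __name__ == "__main__":
    # Five relevant documents among the top ten results.
    judgments = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned penalizes queries that return fewer than ten documents, which matches the way a user experiences a short result list.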

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. A user may use any of these forms, and the search engine should support all of them.

The following queries were randomly selected from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

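Table 4.20 Results of the search engine on phonetic nature of Hindi language

Query 1
Hindi query (standard keywords in bold): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर

Google results for query having keywords    No. of results    Precision@10
ससजिमोिहयीर                                97                0.9
ससजिमो जहयीर                               632               0.9
सिबजमो िहयीर                               194               0.9

Query 2
Hindi query (standard keywords in bold): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई

Google results for query having keywords    No. of results    Precision@10
आसभ न भहॊगाई                               35300             1.0
आसभ न भहॉगाई                               1040              1.0
आसभ भह ग ई                                 14                0.6
आसभाॊ भह ग ई                               563               0.7

Query 3
Hindi query (standard keywords in bold): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी

Google results for query having keywords    No. of results    Precision@10
भरषटाचायआज दी                              211000            0.8
भरशटाचायआज दी                              214               0.6
बयषटाचाय आज दी                             447               0.7
भरषटट च यआिादी                             1090000           0.9
भरशट च य आिादी                             1040              0.7
बयषटट च यआिादी                             1190              0.8

Query 4
Hindi query (standard keywords in bold): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन

Google results for query having keywords    No. of results    Precision@10
अनन हज य आनदोरन                            84700             0.3
अनन हज य आॊदोरन                            85100             0.8
अनन हज य क आॉदोरन                          78                0.6
अनना हिाय क आनद रन                         399               0.5
अनन हिाय क आनद रन                          3260000           1.0

Query 5
Hindi query (standard keywords in bold): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी

Google results for query having keywords    No. of results    Precision@10
फयोिगायी                                   9650              0.9
फयोिगायी                                   80600             1.0
फयोिगायी                                   170               0.7
फयोिगायी                                   30                0.5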


Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, each comparing the standard query with its phonetic variants)

The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with very little overlap. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.


Table 4.21 Effect of word synonyms on Hindi IR

Documents returned and precision over the top 10 results (P@10) for each search engine.

S. No.   Query                     Google     P@10    Bing     P@10    Guruji    P@10
1        सोने के आभूषण (standard Hindi word: आभूषण; synonym: गहने)
1.1      सोने के आभूषण             217000     0.8     3250     0.7     381       0.5
1.2      सोने के गहने              188000     0.8     2590     0.6     389       0.5
1.3      सोने के जेवर              78900      0.8     1670     0.7     311       0.6
1.4      सोने के अलंकार            9490       0.5     633      0.4     70        0
1.5      सोने के आभरण              493        0.5     38       0.3     1         0
2        काले बादल (standard Hindi word: बादल)
2.1      काले बादल                 233000     0.7     7510     0.7     733       0.3
2.2      काले मेघ                  40700      0.9     1500     0.8     99        0.6
2.3      काले जलधर                 1570       0.6     54       0.6     2         0.2
3        स्त्री सशक्तिकरण (standard Hindi word: स्त्री; synonym: नारी)
3.1      स्त्री सशक्तिकरण           9950       0.9     1570     0.7     760       0.6
3.2      नारी सशक्तिकरण            29300      0.9     1910     0.9     736       0.4
3.3      महिला सशक्तिकरण           96300      0.9     5160     0.8     1091      0.3
3.4      औरत सशक्तिकरण             7670       0.8     680      0.7     510       0.2
4        सिकंदर का अहंकार (standard Hindi word: अहंकार)
4.1      सिकंदर का अहंकार           1990       0.4     18       0.3     60        0.6
4.2      सिकंदर का अभिमान           2400       0.6     304      0.2     16        0.1
4.3      सिकंदर का घमंड             495        0.5     54       0.6     9         0.1
5        वृक्ष लगाओ (standard Hindi word: वृक्ष)
5.1      वृक्ष लगाओ                 6960       1       698      0.8     29        0.5
5.2      पेड़ लगाओ                  13400      1       1080     0.9     143       0.9
5.3      दरख्त लगाओ                481        0.5     19       0.5     0         0
6        आँख दान (standard Hindi word: आँख)
6.1      आँख दान                   34000      0.7     3690     0.5     312       0.3
6.2      नेत्रदान                   77500      1       3240     1       159       0.9
6.3      चक्षु दान                  2450       0.8     427      0.1     36        0.1


Figure 4.3 Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It is thus clear that replacing a keyword with a synonym can yield equivalent results. Synonyms of keywords therefore play an important role in a Hindi information retrieval system.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context of the ambiguous keyword.


Table 4.22 List of randomly selected ambiguous queries

S. No.   Query                     For keyword as     In English                          For keyword as       In English
1        नारी को सोना पसंद है      सोना (Gold)        Women like gold                     सोना (To sleep)      Women like to sleep
2        आम लोगों की पसंद          आम (Common)        Common man's choice                 आम (Mango)           Mango is people's choice
3        बाल विकास और पोषण         बाल (Children)     Child development and nutrition     बाल (Hair)           Hair development and nutrition
4        सपेरों का फन               फन (Art)           Art of snake charmers               फन (Snake head)      Snake charmer's snake head
5        युद्ध में कुल विनाश         कुल (Aggregate)    Aggregate destruction in wars       कुल (Family)         Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below. In each table, "Results found" is the number of the top 10 results in the given context, and "Other context" is the number of results in neither of the two contexts.

Table 4.23 Ambiguity test for Google

Query   Ambiguous keyword   Documents returned   Context      Results found   Context       Results found   Other context
1       सोना                50800                Gold         5               To sleep      2               3
2       आम                  488000               Common       3               Mango         3               4
3       बाल                 2900000              Children     7               Hair          3               0
4       फन                  184                  Art          0               Snake head    10              0
5       कुल                 17800                Aggregate    2               Family        3               5

Table 4.24 Ambiguity test for Bing

Query   Ambiguous keyword   Documents returned   Context      Results found   Context       Results found   Other context
1       सोना                2680                 Gold         2               To sleep      2               6
2       आम                  17800                Common       3               Mango         3               4
3       बाल                 4030                 Children     6               Hair          2               2
4       फन                  25                   Art          0               Snake head    9               1
5       कुल                 1900                 Aggregate    0               Family        2               8


Table 4.25 Ambiguity test for Guruji

Query   Ambiguous keyword   Documents returned   Context      Results found   Context       Results found   Other context
1       सोना                109                  Gold         0               To sleep      0               10
2       आम                  6756                 Common       3               Mango         0               7
3       बाल                 635                  Children     5               Hair          2               3
4       फन                  No results found     Art          NA              Snake head    NA              NA
5       कुल                 84                   Aggregate    0               Family        0               10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keywords in some other, unwanted context. The results make clear that all the search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), which has the alternative reading "snake charmer's snake head", the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
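The "deviation from relevance" read off the "Other context" column can be expressed as a share of the top 10 results. The sketch below uses the three per-query counts reported in the ambiguity tables; the function name `context_shares` and its dictionary keys are hypothetical labels chosen for illustration.

```python
from typing import Dict


def context_shares(intended: int, alternate: int, other: int) -> Dict[str, float]:
    """Split the inspected results into shares per context.

    intended  : results in the context the user meant
    alternate : results in the second known sense of the keyword
    other     : results in neither context (deviation from relevance)
    """
    total = intended + alternate + other
    if total == 0:
        raise ValueError("no results to analyse")
    return {
        "intended": intended / total,
        "alternate": alternate / total,
        "other": other / total,
    }


if __name__ == "__main__":
    # Query 5 on Google: Aggregate 2, Family 3, Other context 5.
    print(context_shares(2, 3, 5))
```

Only the "intended" share counts towards precision for the query as the user meant it; the "alternate" and "other" shares together show how much of the result page was lost to ambiguity.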


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word "exercise" is written in Hindi as एक्सरसाइज, and this word has several phonetic variations in Hindi script. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

Hindi forms are listed in the order: Google transliteration, standard Hindi keyword, phonetic equivalent 1, phonetic equivalent 2. The Google result counts follow the same order.

English word   Hindi forms                                         Google results
Woman          वोभन, व भन, व भ न, व भ न                            42200, 101000, 3520, 34800
Insurance      इनसयाॊस, इ शम यस, इ शम य स, इनशम य नस               1220, 246000, 1390, 3710
Cancer         क सय, क सय, क नसय, क नसय                            1880000, 35700, 6490, 1820
Hospital       हॉसटऩटर, ह िसऩटर, ह िसऩटर, ह सऩ टर                   404000, 263200, 2440, 100
Corruption     कोररसततओन, कयपशन, कयऩशन, क यपशन                      1890, 368000, 1110, 58
Computer       कॊ तमटय, कमपम टय, कमपम टय, क पम टय                    4450000, 1040000, 2610000, 537000
University     उननवशसतम, म तनवमसमटी, म तनव मसमटी, म तनवसमटी          1540, 1070000, 1420, 3270
Director       डियकटय, ड मय कटय, ड इय कटय, ड मय कटय                  4600, 735000, 97000, 8300
Accident       एसकसिट, एकस डट, एकस ड नट, ऐिकसडट                      26300, 125000, 2510, 5160
Parliament     ऩशरअभट, ऩ मरमम भट, ऩ मरमभट, ऩ मरमम भट                  3240, 38000, 639, 1120
Specialist     टऩशसअशरटट, सऩ मशममरसट, सऩ शमरसट, सऩ श मरसट            143, 1440, 51300, 9820
Expert         एकसऩट, एकसऩटम, ऐकसऩटम, ऐकसऩटम                         269000, 8730, 182, 8
Policy         ऩोशरटम, ऩॉमरस, ऩ मरस, ऩ मरस                            609, 395000, 69700, 289000

The table above shows that the search engine returns documents not only for the single-keyword query itself but also for all phonetic variants of the keyword, often in large numbers. People evidently have their own ways of writing words in Hindi script, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the first Hindi form in each row (bold in the original) is the keyword obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" is यूनिवर्सिटी, whereas Google transliteration provides उनिवर्सिति, which is completely wrong. Even so, 1540 documents were retrieved for this wrong keyword, and the same holds for other single-word queries such as Insurance, Parliament, Corruption, Policy and Specialist. This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR in terms of precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi without changing the meaning of the query.


The queries are converted at two levels. In the first level, one Hindi word is replaced by its English equivalent, and in the second level, more than one word is replaced by the English equivalent, without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query for the two levels is transformed as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)

फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent senses containing a mixture of both English and Hindi, where the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
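The two-level transformation described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the software described in the next chapter: the `LOANWORDS` lexicon and the `transform_query` function are hypothetical names, and the lexicon contains just the two keywords from the example.

```python
from itertools import combinations

# Hypothetical lexicon: Hindi keyword -> English loanword written in Devanagari.
LOANWORDS = {
    "विदेशी": "फॉरेन",          # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",     # nivesh  -> "investment"
}


def transform_query(query, lexicon):
    """Return {1: [...], 2: [...]} with Level 1 and Level 2 query variants.

    Level 1 replaces exactly one Hindi keyword; Level 2 replaces more than one.
    """
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in lexicon]
    levels = {1: [], 2: []}
    for size in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, size):
            variant = [
                lexicon[tok] if i in chosen else tok
                for i, tok in enumerate(tokens)
            ]
            levels[1 if size == 1 else 2].append(" ".join(variant))
    return levels


if __name__ == "__main__":
    for level, variants in transform_query("विदेशी निवेश भारत में", LOANWORDS).items():
        for variant in variants:
            print(f"Level {level}: {variant}")
```

Note that this sketch enumerates every single-keyword replacement at Level 1, whereas the tables in this chapter list one Level 1 variant per query; which replacement is chosen there is not specified in this section.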


Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation). The last three columns give the number of Google search results for the Hindi query and its Level 1 and Level 2 variants.

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (Google result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants).


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore, Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

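Table 4.28: Analysis of precision values (tabular representation). Precision is measured over the first 10 results.

Query variant | Google | Bing | AltaVista
Hindi query: स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
Level 1: हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
Level 2: हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
Hindi query: रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
Level 1: ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
Level 2: ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
Hindi query: हृदय चिकित्सक | 0.9 | 0.9 | 0.9
Level 1: हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
Level 2: कार्डियोलॉजिस्ट | 1 | 1 | 1
Hindi query: सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6
Level 1: सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
Level 2: सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
Hindi query: विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
Level 1: फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
Level 2: फॉरेन इन्वेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
Hindi query: भ्रष्टाचार मुक्त भारत | 1 | 1 | 1
Level 1: करप्शन मुक्त भारत | 1 | 1 | 1
Level 2: करप्शन फ्री इंडिया | 1 | 1 | 1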


Figure 4.5: Analysis of inter-precision values

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined out against Hindi queries by transforming them into their variants, that is, by including English keywords written in Hindi. The transformation of queries resulted in an increase of retrieved data. The relevance of the retrieved data can also be seen in the precision columns. For most Hindi queries and their transformed variations, the degree of relevance of documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the values for the rest of the Hindi queries are shown in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, as the Hindi user may not be aware of the presence of such information on the web and may be unable to formulate the query variations based on English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and getting relevant information can be increased.
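The precision values discussed above ("9 of the first 10 documents are relevant") correspond to precision over the top 10 results. A minimal sketch of that computation follows; the function name `precision_at_k` and the list of relevance judgements are illustrative assumptions, since this section does not describe how the judgements were recorded.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k retrieved documents that were judged relevant.

    relevance: list of booleans in ranked order (True = relevant).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(1 for judged_relevant in top_k if judged_relevant) / k


if __name__ == "__main__":
    # Hypothetical judgements: 9 of the first 10 results are relevant.
    judgements = [True] * 9 + [False]
    print(precision_at_k(judgements))
```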

(Figure 4.5, inter-precision chart: precision of Google, Bing and AltaVista for each Hindi query (HQ1 to HQ6) and its Level 1 (L1) and Level 2 (L2) variants; the plotted values are those of Table 4.28.)


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

The relevant information can be mined out by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters like Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user to search for such information, a software tool has been developed (a detailed description is given in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


speakers. Two distinct phones which are both members of the same phoneme are called allophones (from Greek, "different sounds").

In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters, yet pronounce aspiration consistently.

Thus, व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.

Semi-Allophones: "j" and "z" in Hindi

Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज़्यादा: some speakers pronounce this as "zyada" and some as "jyada".

4.2.19 English Alveolar Consonants

There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.


Capital Letters

Devanagari has no capital letters.

Special Matraa Forms of उ and ऊ with र

र + उ = रु

र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter | Approximation
क़ | like "k", but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़, but voiced
ज़ | just as English "z", as in "zoo"
झ़ | similar to the "s" in English "vision"
फ़ | just as English "f"

Table 4.11: Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
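Because most nuqta forms are not pronounced distinctly, the same word may appear on the web with or without the nuqta (for example the "zyada"/"jyada" spellings mentioned earlier). One way a retrieval interface could conflate such spellings is to fold the nuqta away before matching. The sketch below assumes Unicode text and uses Python's standard `unicodedata` module; the function name `fold_nuqta` is hypothetical and is not taken from the software described in this thesis.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def fold_nuqta(text):
    """Return text with every nuqta removed, so that ज़ and ज compare equal."""
    # NFD splits precomposed letters such as U+095B (ज़) into base letter + nuqta.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


if __name__ == "__main__":
    with_nuqta = "\u091c\u093c\u094d\u092f\u093e\u0926\u093e"   # ज़्यादा ("zyada")
    precomposed = "\u095b\u094d\u092f\u093e\u0926\u093e"        # same word, precomposed ज़
    without_nuqta = "\u091c\u094d\u092f\u093e\u0926\u093e"      # ज्यादा ("jyada")
    print(fold_nuqta(with_nuqta) == fold_nuqta(precomposed) == without_nuqta)
```

Folding loses the distinction for the two sounds that are pronounced differently (ज़ and फ़), so it trades a little precision for recall.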


4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)

च + छ = च्छ (अच्छा)

स + त = स्त (नमस्ते)

ल + ल = ल्ल (बिल्ली)

म + ब = म्ब (लम्बा)

फ़ + त = फ़्त (मुफ़्त)

क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ़ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.


4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)

ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus, the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
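Since the same word can be spelled with a nasal conjunct or with an anusvar, a search for one spelling will miss documents that use the other. The homorganic rule above suggests a simple normalization: rewrite a nasal consonant plus halant as an anusvar when the next consonant belongs to the same articulation class. The following is a sketch under that assumption; the `VARGAS` table and the function name `to_anusvar` are illustrative and are not part of the thesis software.

```python
import re

VIRAMA = "\u094d"    # halant
ANUSVAR = "\u0902"   # DEVANAGARI SIGN ANUSVARA

# Nasal consonant -> stop consonants articulated at the same place (homorganic).
VARGAS = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # ङ with क ख ग घ (velar)
    "\u091e": "\u091a\u091b\u091c\u091d",  # ञ with च छ ज झ (palatal)
    "\u0923": "\u091f\u0920\u0921\u0922",  # ण with ट ठ ड ढ (retroflex)
    "\u0928": "\u0924\u0925\u0926\u0927",  # न with त थ द ध (dental)
    "\u092e": "\u092a\u092b\u092c\u092d",  # म with प फ ब भ (labial)
}


def to_anusvar(text):
    """Rewrite nasal + halant before a homorganic stop as an anusvar."""
    for nasal, stops in VARGAS.items():
        text = re.sub(f"{nasal}{VIRAMA}(?=[{stops}])", ANUSVAR, text)
    return text


if __name__ == "__main__":
    print(to_anusvar("\u0939\u093f\u0928\u094d\u0926\u0940"))  # हिन्दी
    print(to_anusvar("\u0905\u0923\u094d\u0921\u093e"))        # अण्डा
```

Applying the same function to both the query and the indexed text would let the two spellings match; nasal conjuncts that are not homorganic are left untouched.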

Conjuncts with र


As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)

र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)

म + र = म्र (उम्र)

Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष

त + त = त्त

त + र = त्र

ज + ञ = ज्ञ

द + द = द्द

द + ध = द्ध

द + य = द्य

द + व = द्व

श + र = श्र

ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit, and a maatraa that is written to the left, such as the short इ maatraa (ि), is placed before the entire conjunct. There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign, the viraam (।), which is a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam. [59][60]

4.3 Unicode and fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters in the English language (it can store at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a standard accepted universally. That standard is Unicode. Unicode assigns a unique number (a code point, in the range U+0000 to U+10FFFF) to each character, which is potentially enough for every character in all languages known to man; code points can be stored in 8-, 16- or 32-bit code units.

Actually, there is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized in around 1991 that two different unified character sets were not exactly what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.


4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms that define how a particular character will be represented in bits while being transmitted. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8, 16 or 32 bits). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms, as defined by the Unicode Consortium, are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
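The difference between the three encoding forms can be observed directly with Python's built-in codecs. The sketch below encodes the same text in each form and reports the size in bytes; the helper name `encoded_sizes` is an illustrative assumption, and the big-endian codec variants are chosen only so that no byte order mark is added to the output.

```python
FORMS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding form."""
    sizes = {}
    for form in FORMS:
        encoded = text.encode(form)
        # Each form round-trips to the same text without loss of data.
        assert encoded.decode(form) == text
        sizes[form] = len(encoded)
    return sizes


if __name__ == "__main__":
    ha = "\u0939"  # ह, DEVANAGARI LETTER HA
    for form, size in encoded_sizes(ha).items():
        print(f"{form}: {size} byte(s), {ha.encode(form).hex(' ')}")
```

For a Devanagari letter, UTF-8 needs three bytes, UTF-16 one 16-bit code unit and UTF-32 one 32-bit code unit, which is why Hindi text stored as UTF-8 is larger than the same number of ASCII characters.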


UTF stands for UCS Transformation Format.

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4); the most commonly used characters are at most three bytes long. Non-ASCII characters are encoded as follows:

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows the application to differentiate between ASCII and non-ASCII characters. Bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. The remaining bits of the first byte and all further bytes in the multibyte sequence are used to encode the actual character. [61]
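The role of the first byte can be illustrated with a small function that reads the sequence length from the leading bits. This is a simplified sketch: the name `utf8_sequence_length` is an assumption, and the function only inspects the bit pattern; it does not reject overlong or otherwise ill-formed sequences as a full decoder would.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting with lead_byte has."""
    if lead_byte & 0x80 == 0x00:   # 0xxxxxxx: ASCII, single byte
        return 1
    if lead_byte & 0xE0 == 0xC0:   # 110xxxxx: two-byte sequence
        return 2
    if lead_byte & 0xF0 == 0xE0:   # 1110xxxx: three-byte sequence
        return 3
    if lead_byte & 0xF8 == 0xF0:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


if __name__ == "__main__":
    encoded = "\u0939".encode("utf-8")            # ह
    print(utf8_sequence_length(encoded[0]))       # length announced by the lead byte
    print(all(byte & 0x80 for byte in encoded))   # every byte has its top bit set
```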

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch of the Brahmi family includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.


4.3.4 Devanagari: U+0900 to U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0 to F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
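The parallel code layout means that a character in one of these nine blocks has its analogue at the same offset in the others. The sketch below exploits that by shifting code points between blocks. The block start values are the Unicode block boundaries; the function name `shift_script` is hypothetical, and the result is only a rough approximation of transliteration, because some positions are unassigned in some scripts and the function makes no attempt to detect that.

```python
# First code point of each 128-position Unicode block that follows the ISCII layout.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source="devanagari", target="bengali"):
    """Map each character of the source block to the same offset in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    shifted = []
    for char in text:
        code_point = ord(char)
        if src <= code_point < src + BLOCK_SIZE:
            shifted.append(chr(code_point - src + dst))
        else:
            shifted.append(char)  # leave spaces, digits, Latin text untouched
    return "".join(shifted)


if __name__ == "__main__":
    ka = "\u0915"  # क, DEVANAGARI LETTER KA
    for script in ("bengali", "gurmukhi", "tamil"):
        print(script, shift_script(ka, target=script))
```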

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order. [62]
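Logical order can be seen by listing the code points of a word. In हिन्दी the short "i" vowel sign is drawn to the left of ह, but it is stored after it, and the न्द conjunct is stored as consonant, virama, consonant. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is an illustrative assumption.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point, in stored order."""
    return [f"U+{ord(char):04X} {unicodedata.name(char)}" for char in text]


if __name__ == "__main__":
    word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी
    for line in describe(word):
        print(line)
```

Because storage follows pronunciation rather than visual layout, string matching for retrieval can operate on the stored sequence without knowing how a font renders the conjuncts.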

4.4 Indian Languages on internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents. But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium (a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India) proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, it ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

―There is a big opportunity in Indian languages Studies showed that only 28

percent of Indian Web surfers preferred English on the Web but as good quality

content in Indian languages was not easily available they did not visit many local

language sites says Mrutyunjay Mishra co-founder of Juxtconsult IT experts

agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering

(EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus containing approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of English text, which supports translation involving Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started the Indian Language Corpora Initiative (ILCI) for developing Indian language corpora. ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from the 1980s and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Bombay, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and a Sanskrit-Hindi MT system under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology, which translates from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is

a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by the Chinmaya International Foundation (CIF). It is described as a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for translating government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to

provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent both the English and the Hindi grammar. Initially it was domain specific, covering Personnel Administration (specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars); gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology supported only English-to-Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence the system generates, the user can supply new words, phrases and sentences to the system and obtain a revised translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti combines a rule-based approach with statistical approaches. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The

system architecture consists of three processing modules: Pre-Processing, Translation Engine and Post-Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has long been aware of the need for language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Avinash Chopde developed Indian Languages Transliteration (ITRANS), which represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the

proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As one result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. The space is so crowded right now that it is difficult to know who is really winning. However, we attempt a brief overview of the current scenario. Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass a judgment on any

of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search English

4.6.1.3 Info Search Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search Local Language

4.7 Problems Faced while Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information with Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query "वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला" (world trade center aatank-waadi hamlaa) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. Merely saying that we have a recall problem is not sufficient, though. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.15: Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

Table 4.16: Improved recall

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but its grammatical and structural standards are rarely followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in dialectal variation not only in speech but in writing also. Every language attaches markers to a root word to construct new words: English uses -s, -es and -ing, and Hindi uses suffixes and maatraas such as ओं and एँ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, plan). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
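The stemming step described above can be sketched as a small suffix-stripping routine. This is only an illustration under stated assumptions: the suffix list `SUFFIX_RULES`, the minimum stem length and the function names are hypothetical and are not the stemmer used in this study. A production stemmer would need a far larger rule set and an exception list.

```python
# Illustrative (suffix, replacement) rules, longest suffix first.
# The list is an assumption for demonstration, not the stemmer used in this study.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ियाँ", "ी"),
    ("ियां", "ी"),
    ("ओं", ""),     # योजनाओं -> योजना
    ("एँ", ""),     # योजनाएँ -> योजना
    ("एं", ""),
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),
]

MIN_STEM_LENGTH = 2  # leave the word alone if the stem would become too short


def stem(word: str) -> str:
    """Strip at most one known plural/oblique suffix and return the root form."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return word


def stem_query(query: str) -> str:
    """Stem every whitespace-separated keyword of a query."""
    return " ".join(stem(token) for token in query.split())


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएँ"]:
        print(variant, "->", stem(variant))
```

Applying the same function to both the indexed keywords and the query keywords is what lets all variants of a word meet in one canonical form.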

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together, the variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.

There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in several different ways. When information is searched for the standard keyword स्कूल (school) and for a non-standard, phonetically equivalent Hindi keyword, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by the other regional languages, which results in phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by a large part of the population of India in different states. For the

Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all of its phonetically equivalent forms.

Moreover, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved with little repetition. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
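One way to make phonetically equivalent spellings meet is to map each keyword to a normalized matching key before indexing and searching. The sketch below is a minimal illustration, not the method evaluated in this chapter. The three rules (drop the nukta, treat chandrabindu as anusvara, rewrite a nasal consonant plus virama before a stop as anusvara) and the name `spelling_key` are assumptions chosen for demonstration. The key is lossy by design and is meant for matching, not for display.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्

# A nasal consonant + virama directly before a stop consonant is a common
# alternative spelling of the anusvara (आन्दोलन vs आंदोलन).
NASAL_CONJUNCT = re.compile("[ङञणनम]" + VIRAMA + "(?=[क-घच-झट-ढत-धप-भ])")


def spelling_key(text: str) -> str:
    """Return a matching key that is identical for common spelling variants."""
    text = unicodedata.normalize("NFD", text)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)  # ँ -> ं
    text = NASAL_CONJUNCT.sub(ANUSVARA, text)    # न् + stop -> ं
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for word in ["अंग्रेजी", "अँग्रेज़ी", "आन्दोलन", "आंदोलन"]:
        print(word, "->", spelling_key(word))
```

Variants that this rule set does not cover, such as a prothetic vowel (ISKOOL for school), would need additional rules or a phonetic code.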

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
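A retrieval system can reduce the effect of synonym choice by expanding each query keyword with its known synonyms. The following is a minimal sketch under stated assumptions: the `SYNONYMS` table holds a single illustrative entry (a real system would draw on a lexical resource such as a Hindi WordNet), and the `OR` syntax of the generated query is hypothetical and would have to match the target search engine.

```python
from typing import Dict, List

# Illustrative synonym table; the single entry is an example, not a full lexicon.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_terms(query: str) -> List[List[str]]:
    """For each keyword, return the keyword followed by its known synonyms."""
    return [[token] + SYNONYMS.get(token, []) for token in query.split()]


def to_boolean_query(query: str) -> str:
    """Render the expanded query, OR-ing together the alternatives of a keyword."""
    parts = []
    for group in expand_terms(query):
        if len(group) == 1:
            parts.append(group[0])
        else:
            parts.append("(" + " OR ".join(group) + ")")
    return " ".join(parts)


if __name__ == "__main__":
    print(to_boolean_query("सोने के आभूषण"))
```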

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold

(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query: (In English) The common people's choice

(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कर means tax in one context (with synonyms such as ब्याज, शुल्क, सूद, महसूल and टैक्स), hand or arm in another (हस्त, बाहु), and to do (करना) in a third.
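The role of context in choosing a sense can be illustrated with a simplified overlap heuristic in the spirit of the Lesk algorithm: each sense carries a small set of signature words, and the sense sharing the most words with the rest of the query wins. The `SENSES` table, its signature words and the function name are hypothetical and chosen only for this sketch. When no sense has a unique non-zero overlap, the word is reported as unresolved, which is exactly the situation of the first query above.

```python
from typing import Dict, Optional, Set

# Hypothetical sense inventory: word -> sense label -> signature (context) words.
SENSES: Dict[str, Dict[str, Set[str]]] = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "धातु", "कीमत"},
        "sleep": {"नींद", "रात", "बिस्तर"},
    },
    "आम": {
        "common": {"लोग", "लोगों", "जनता", "आदमी"},
        "mango": {"फल", "पेड़", "मीठा", "रस"},
    },
}


def disambiguate(word: str, query: str) -> Optional[str]:
    """Return the sense whose signature overlaps most with the query context.

    Returns None when the word is unknown or the context gives no unique winner.
    """
    context = set(query.split()) - {word}
    scores = {
        sense: len(signature & context)
        for sense, signature in SENSES.get(word, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


if __name__ == "__main__":
    print(disambiguate("आम", "आम लोगों की पसंद"))
    print(disambiguate("सोना", "नारी को सोना पसंद है"))
```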

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways. It has affected the pronunciation of Hindi words, and many English words have been localized in India. Some of them appear as if they were native Hindi words, and Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without being aware of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from that set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Birds' species
6.2 | ऩकषमोकीपरज ततम | Birds' species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.17: List of Hindi queries

Table 4.18: Effect of morphological factors on Hindi queries

S No Root

word

s

Listing of Keywords Morphological

variants

Documents Returned

Google Bing Guruji Google Bing Guruji

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

[Figure 4.1: Precision values of the three search engines. Y-axis: precision (0 to 1); x-axis: query number; series: Google, Bing, Guruji]

S.No | Query | Google | Bing | Guruji (Precision@10)
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.0
5.1 | पर क ततकआऩद | 1.0 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1.0 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1.0 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1.0 | 0.6 | 0.0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0.0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Table 4.19: Precision values of the three search engines

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all of the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of the results.
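The precision measure used above (precision over the top 10 results) can be computed as follows. This is a minimal sketch: the function name and the representation of relevance judgements as a list of booleans are assumptions, and the judgements themselves would come from manual assessment of each returned page. The sketch divides by k even when an engine returns fewer than k results, which is one common convention.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevance` holds one judgement per returned result, in rank order.
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


if __name__ == "__main__":
    # Hypothetical judgements for one query: 5 of the top 10 results are relevant.
    judgements = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judgements))
```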

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all of its phonetically equivalent forms.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5

Figure 4.2: Precision charts for the phonetic nature of Hindi language (one pie chart per query, Query No. 1 to Query No. 5, with one slice per phonetic variant of the query)

In the above table and figure it can be clearly seen that search engines return documents for each of the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved with little repetition. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same or nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
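One way to group such spelling variants is to fold them to a common key before comparison. The sketch below is an illustration under stated assumptions, not the method used in this study: it treats nuqta and non-nuqta letters as equivalent and treats chandrabindu as anusvar, two kinds of variation that appear among the queries of Table 4.20. The function names `phonetic_key` and `group_variants` are hypothetical.

```python
import unicodedata

NUKTA = "\u093C"         # DEVANAGARI SIGN NUKTA
CHANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU
ANUSVARA = "\u0902"      # DEVANAGARI SIGN ANUSVARA


def phonetic_key(word):
    """Fold some common Devanagari spelling variants to one comparison key."""
    # NFD splits precomposed nuqta letters (e.g. U+095B) into base + nuqta.
    decomposed = unicodedata.normalize("NFD", word)
    # Assumption: nuqta and non-nuqta spellings denote the same word.
    decomposed = decomposed.replace(NUKTA, "")
    # Assumption: chandrabindu and anusvar spellings denote the same word.
    return decomposed.replace(CHANDRABINDU, ANUSVARA)


def group_variants(words):
    """Group words that share the same folded key."""
    groups = {}
    for word in words:
        groups.setdefault(phonetic_key(word), []).append(word)
    return groups


# Three spellings of one word: with nuqta, without nuqta, precomposed nuqta.
variants = [
    "\u0906\u091C\u093C\u093E\u0926\u0940",
    "\u0906\u091C\u093E\u0926\u0940",
    "\u0906\u095B\u093E\u0926\u0940",
]
print(len(group_variants(variants)))
```

Such folding is deliberately lossy, so it suits grouping candidate queries rather than deciding that two words are truly the same.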

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word (आब षण) (Ornament in English) has the following commonly spoken synonyms: गहन ज वयअर क य

Table 4.21 and Figure 4.3 below show the comparison of precision values across the three search engines.

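Table 4.21: Effect of word synonyms on Hindi IR

S. No. | Query | Google documents | Google Precision@10 | Bing documents | Bing Precision@10 | Guruji documents | Guruji Precision@10
1 | स न क आब षण (standard Hindi word and synonyms: आब षणगहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0.0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0.0
2 | क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi word and synonyms: सतर न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0.0
6 | आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1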

Figure 4.3: Comparison of precision values against three search engines

From the examples above it is observed that using synonyms of Hindi keywords improves information retrieval for a query in the Hindi language. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines, and the Guruji search engine returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. In the average case Google still ranks first, that is, its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Therefore it is evident that synonyms of keywords play an important role in a Hindi information retrieval system.
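The substitution of synonyms described above can be sketched as a simple query expansion step. The synonym table below is a hypothetical, hand-built example modelled on the ornament example in this section (the Devanagari spellings are assumed for illustration); a real system would need a proper Hindi thesaurus, and this is not the tool developed in this study.

```python
# Hypothetical synonym table; spellings are assumed for illustration only.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    variants = [query]
    for position, word in enumerate(words):
        for synonym in synonyms.get(word, []):
            replaced = words[:position] + [synonym] + words[position + 1:]
            variant = " ".join(replaced)
            if variant not in variants:
                variants.append(variant)
    return variants


# Each variant would be submitted to the search engine as a separate query.
for variant in expand_with_synonyms("सोने के आभूषण"):
    print(variant)
```

Submitting every variant and merging the result lists trades extra requests for wider coverage, which matches the observation that different synonyms retrieve different documents.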

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in another context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.

[Figure 4.3 chart: precision values from 0 to 1 for sub-queries 1.1 to 6.3 of Table 4.21; series: Google, Bing, Guruji]

Table 4.22: List of randomly selected ambiguous queries

S. No. | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed in the table above are tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the above tables, the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in another, non-required context. From the results it is clear that all search engines return documents in different contexts. Therefore it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the column labeled "Other Context" signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing, and all 10 documents for Guruji.

For another query, सऩ यो क पन (art of snake charmers), which has another context (snake charmer's snake head), the retrieved documents are expected to be in the context (art). From the above results, however, it can be seen that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

The English language influences Hindi not only in speech but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval, as the following example explains.

Example: The English word exercise is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following the kind of phonetic variation shown in this example, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

English word | Hindi forms: Google transliteration (bold in the original), then standard Hindi keyword and phonetic equivalents | Google result counts (in the same order)
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

From the above table it can be observed that the search engine returns documents for each single-keyword query; documents are also returned for all phonetic variants of the keywords, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained by using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is clearly evident from the table above that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved in Hindi IR. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing the Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.
The queries are converted at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:

प य न तनव श ब यत भ (Foreign nivesh Bharat mein)

प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
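The two-level transformation can be sketched as below. This is only an illustration under stated assumptions and is not the software described in the next chapter: the mapping `ENGLISH_IN_HINDI` is a hypothetical two-entry dictionary with assumed Devanagari spellings, Level 1 is taken to mean exactly one replaced keyword, and Level 2 two or more.

```python
from itertools import combinations

# Hypothetical mapping from Hindi keywords to English words in Devanagari.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",          # foreign
    "निवेश": "इन्वेस्टमेंट",      # investment
}


def transform_query(query, mapping=ENGLISH_IN_HINDI):
    """Return (level1, level2) lists of transformed queries."""
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in mapping]
    level1, level2 = [], []
    for count in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, count):
            variant = " ".join(
                mapping[word] if i in chosen else word
                for i, word in enumerate(words)
            )
            # One replacement is Level 1; two or more are Level 2.
            (level1 if count == 1 else level2).append(variant)
    return level1, level2


level1, level2 = transform_query("विदेशी निवेश भारत में")
print(level1)
print(level2)
```

Unlike the single Level 1 form shown in the example above, this sketch enumerates every single-keyword replacement, so a query with two replaceable keywords yields two Level 1 variants.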

Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each comparing the Hindi query with its Level 1 and Level 2 forms)

From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

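Table 4.28: Analysis of precision values (tabular representation)

Query | Form | Query text | Precision@10 Google | Bing | AltaVista
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1.0 | 1.0 | 1.0
4 | Hindi query | सयक यदव य य जग यम जन | 1.0 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 1 | कयपशनभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 2 | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0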

Figure 4.5: Analysis of inter-precision values (Inter Precision Chart; x-axis: HQ1 to HQ6, each with L1 and L2, where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista)

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries results in an increase in the data retrieved. The relevance of the retrieved data can also be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 out of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries the user may miss the chance of retrieving relevant information, as the Hindi user may not be aware of the presence of such information on the web and may be unable to formulate the query variations that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally people do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

The relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters like Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (a detailed description is given in the next chapter) which acts as an interface between the user and search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].

Capital Letters

Devanagari has no capital letters.

Special Matraa Forms of उ and ऊ with र

र + उ = रु

र + ऊ = रू

4.2.20 Borrowed Sounds

There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.

4.2.20.1 Foreign Sounds

Letter | Approximation
क़ | like k, but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़, but voiced
ज़ | just as English z, as in "zoo"
झ़ | similar to the s in English "vision"
फ़ | just as English f

Table 4.11: Foreign Sounds

Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.

4.2.20.2 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
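In Unicode text the same idea appears at the encoding level: a conjunct is not stored as a separate character but as its consonants joined by the virama (halant) code point U+094D, and the font renders the ligature. A minimal sketch follows; the helper name `conjunct` is hypothetical.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def conjunct(*consonants):
    """Join consonants with the virama so that a font can render a conjunct."""
    return VIRAMA.join(consonants)


nda = conjunct("\u0928", "\u0926")  # na + da
# Show the rendered conjunct and the code points that make it up.
print(nda, [f"U+{ord(ch):04X}" for ch in nda])
```

Whether the result is displayed as a ligature or as a consonant with a visible halant depends on the font and rendering engine, not on the stored text.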

4.2.20.3 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted and then the modified consonant symbol is conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)

च + छ = च्छ (अच्छा)

स + त = स्त (नमस्ते)

ल + ल = ल्ल (बिल्ली)

म + ब = म्ब (लम्बा)

फ़ + त = फ़्त (मुफ़्त)

क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.

4.2.20.4 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)

ट + ठ = ट्ठ (चिट्ठी)

4.2.20.5 Other Conjuncts

Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above the horizontal line to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
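The homorganic rule above can be expressed as a small lookup. The sketch below is illustrative only; the function names are hypothetical, and it covers only the five stop classes (velar, palatal, retroflex, dental, labial) that have a class nasal in Devanagari.

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA (halant)

# Class nasal -> the stops articulated at the same place in the mouth.
HOMORGANIC = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # velar: nga before ka kha ga gha
    "\u091E": "\u091A\u091B\u091C\u091D",  # palatal: nya before ca cha ja jha
    "\u0923": "\u091F\u0920\u0921\u0922",  # retroflex: nna before tta ttha dda ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # dental: na before ta tha da dha
    "\u092E": "\u092A\u092B\u092C\u092D",  # labial: ma before pa pha ba bha
}


def nasal_for(consonant):
    """Return the nasal an anusvar stands for before `consonant`, or None."""
    if len(consonant) != 1:
        return None
    for nasal, stops in HOMORGANIC.items():
        if consonant in stops:
            return nasal
    return None


def expand_anusvar(word):
    """Rewrite anusvar + stop as the explicit nasal conjunct."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        nasal = nasal_for(nxt) if ch == ANUSVARA else None
        # Anything that is not an anusvar before a stop is copied unchanged.
        out.append(nasal + VIRAMA if nasal else ch)
    return "".join(out)


# The anusvar spelling of "Hindi" becomes the conjunct spelling.
print(expand_anusvar("\u0939\u093F\u0902\u0926\u0940"))
```

A dot that marks vowel nasalization (the bindu case above) is left untouched by this sketch, because no stop consonant follows it.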

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)

र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)

म + र = म्र (उम्र)

Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted v, as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष

त + त = त्त

त + र = त्र

ज + ञ = ज्ञ

द + द = द्द

द + ध = द्ध

द + य = द्य

द + व = द्व

श + र = श्र

ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct.

There are hundreds of conjuncts, but most conjuncts are easily discernible.

Punctuation

Hindi has one punctuation sign, the viraam, which is a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.

[59][60]

4.3 Unicode and Fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time the need was felt for a single encoding that could contain enough characters to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a unique number (code point) in the range U+0000 to U+10FFFF, which fits within 32 bits and can potentially cover every character in all languages known to man.

Actually there is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms that specify how a particular character is represented in bits. The three encoding forms allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

Incidentally, UTF stands for UCS Transformation Format.
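The difference between the three encoding forms can be seen by encoding the same characters in each of them. The short sketch below uses Python's built-in codecs; the helper name `encoded_sizes` is hypothetical, and the big-endian codec variants are chosen only so that no byte order mark is added to the output.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each encoding form."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


ka = "\u0915"  # DEVANAGARI LETTER KA
for name in ENCODINGS:
    encoded = ka.encode(name)
    # Print the byte count and the bytes themselves in hexadecimal.
    print(f"{name:10} {len(encoded)} bytes  {encoded.hex(' ')}")

# An ASCII letter and a character outside the 16-bit range, for comparison.
print(encoded_sizes("A"))
print(encoded_sizes("\U0001F600"))
```

A Devanagari letter takes three bytes in UTF-8, two in UTF-16 and four in UTF-32, so Hindi text is usually more compact in UTF-16 than in UTF-8, while ASCII text is most compact in UTF-8.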

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme of 2 to 4 bytes (the original UTF-8 design allowed sequences of up to 6 bytes, but the current standard limits them to 4); the most commonly used characters need at most three bytes. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
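The role of the first byte can be illustrated with a small decoder helper. This is a minimal sketch assuming the current four-byte limit of UTF-8; the function names are hypothetical, and the code does not validate continuation bytes or reject overlong forms.

```python
def utf8_sequence_length(first_byte):
    """Return the total length of the UTF-8 sequence starting with first_byte."""
    if first_byte < 0x80:           # 0xxxxxxx: ASCII, single byte
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid first byte")


def split_characters(data):
    """Split UTF-8 bytes into one byte string per encoded character."""
    chunks, i = [], 0
    while i < len(data):
        length = utf8_sequence_length(data[i])
        chunks.append(data[i:i + length])
        i += length
    return chunks


# One ASCII letter followed by DEVANAGARI LETTER KA (U+0915).
print(split_characters("A\u0915".encode("utf-8")))
```

Because every byte of a multibyte sequence has its most significant bit set, a parser that only looks for ASCII delimiters can pass such bytes through unchanged.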

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar, and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
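The parallel code layout described above can be illustrated with a small sketch. It assumes only the published Unicode block offsets (Devanagari at U+0900, Bengali at U+0980); the function name shift_devanagari and the choice of Bengali as the target are illustrative assumptions, not part of any standard API.

```python
# Illustrative sketch only: block start values come from the Unicode code
# charts; the function name and the default target block are assumptions.
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F
BENGALI_START = 0x0980


def shift_devanagari(text, target_start=BENGALI_START):
    """Map each Devanagari code point to the same relative position in
    another ISCII-derived block; leave all other characters unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            out.append(chr(cp - DEVANAGARI_START + target_start))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # DEVANAGARI LETTER KA + VOWEL SIGN AA, shifted into the Bengali block
    print(shift_devanagari("\u0915\u093E"))
```

Not every position is assigned in every block, so a raw offset shift can land on an unassigned code point; a practical transliterator would need per-script exception tables on top of this.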

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
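As a rough illustration of logical-order storage, the sketch below lists the stored code points of a short orthographic syllable and assigns each one to a character type. The range boundaries in char_type are a simplifying assumption covering only the core Devanagari letters and signs, and both function names are hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def char_type(ch):
    """Classify a character using simplified, assumed range boundaries."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if cp == 0x094D:
        return "virama"
    return "other"


if __name__ == "__main__":
    # KA + VIRAMA + SSA + VOWEL SIGN I: a CCV orthographic syllable.
    # The vowel sign I is drawn to the left of the cluster when rendered,
    # but it is stored last, after the consonants it follows phonetically.
    syllable = "\u0915\u094D\u0937\u093F"
    for code, name in describe(syllable):
        print(code, name)
    print([char_type(ch) for ch in syllable])
```

The point of the example is that the stored sequence follows pronunciation, so any reordering for display is left to the rendering engine.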

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier 7-bit encoding could accommodate only about 127 characters, which is not enough for the Devanagari script used for Hindi, whereas a 16-bit Unicode encoding can accommodate about 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software. In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few, Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar.

Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators, and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by the Chinmaya International Foundation (CIF). It is described as a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was conceived by the Government of India in 1996 for translating government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent both the English and the Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology handled only English-to-Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines rule-based and statistical approaches. It is used for translating English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre-Processing, Translation Engine and Post-Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Guruji: Indian Language Search Engine

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian-language content and tries to return the appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development in corpus creation and collection and in content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has long been aware of the need for Indian language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages TRANSliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working on the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords occur in the second search result. But just saying we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.
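One common way to quantify the problem is the standard recall measure: the fraction of relevant documents that a search actually retrieves. The following is a minimal sketch assuming a hand-built set of relevant document identifiers is available; the function name and the sample identifiers are illustrative and do not come from the experiments reported here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear among the retrieved ones.

    Returns 0.0 when no relevant documents are known, to avoid dividing by zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


if __name__ == "__main__":
    # Hypothetical document identifiers for a single query
    relevant_docs = ["d1", "d2", "d3", "d4"]
    print(recall(["d1", "d2", "d9"], relevant_docs))  # two of four relevant found
    print(recall([], relevant_docs))                  # a query returning 0 results
```

A query that returns 0 results has a recall of zero by this measure, which is the situation shown for the first queries in Table 4.15.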

Table 4.15: Problems faced while searching in Hindi: low recall

| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0 |
| इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0 |

Table 4.16: Improved recall

| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12 |
| वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1 |
| इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1 |
| बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93 |

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, its grammatical and structural standards are often not followed, for various reasons. One reason is the linguistic diversity of India: including Hindi, about 28 languages are spoken in India, and Hindi, the official language of India, is influenced by the regional languages, which changes its dialects not only in speech but also in writing.

Every language uses markers that attach to a root word to form new words: English uses -s, -es and -ing, and Hindi uses matras (MAATRAAS) such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
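The stemming idea described above can be sketched with a simple suffix-stripping routine. This is only an illustration under stated assumptions: the suffix list and the function name `stem` are hypothetical and far from exhaustive, and the routine is not the stemmer used in this study.

```python
import unicodedata

# Illustrative plural/oblique suffixes (assumption: a tiny, non-exhaustive list).
SUFFIXES = ("ियों", "ओं", "एँ", "एं", "ों", "ें")


def stem(word, min_stem_length=2):
    """Strip the longest matching suffix, keeping at least min_stem_length code points."""
    word = unicodedata.normalize("NFC", word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for variant in ("योजना", "योजनाओं", "योजनाएँ"):
        print(variant, "->", stem(variant))
```

A rule list of this kind over-stems some words, so a practical stemmer would normally combine such rules with a dictionary of known root words.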

4.8.2 Phonetic nature of Hindi Language and Spelling variations

Spelling variations in Hindi can be attributed mainly to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabet. Together, these factors produce several spellings of the same word. For example, the Hindi word अ गर ज (angrējī, meaning "English") has several possible spelling variations.

There are numerous words that are phonetically equivalent but differ in writing. The word "school" can be written in Hindi in different ways (सक र, सक र, सक र). When the standard keyword सक र and its non-standard phonetic equivalent सक र are searched, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words: for example, the English word "school" (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for the phonetic equivalents of the keywords entered: a user may search with any of the variants, and the search engine should support all of them.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved, with very little repetition. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
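One common way to make phonetically equivalent spellings match is to map every spelling to a normalized key before comparison. The sketch below is illustrative only: the rules chosen (dropping the nukta, treating chandrabindu as anusvara, shortening long i and u matras) and the name `phonetic_key` are assumptions, not rules taken from this study.

```python
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
LONG_TO_SHORT = {"\u0940": "\u093f", "\u0942": "\u0941"}  # ii -> i, uu -> u matras


def phonetic_key(word):
    """Return a rough comparison key for spelling variants of one word."""
    text = unicodedata.normalize("NFD", word)  # split precomposed nukta letters
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    for long_matra, short_matra in LONG_TO_SHORT.items():
        text = text.replace(long_matra, short_matra)
    return text


if __name__ == "__main__":
    print(phonetic_key("महँगाई") == phonetic_key("महंगाई"))
```

Two query keywords can then be treated as variants of each other when their keys are equal; variants that add a leading vowel, such as the ISKOOL form, would need additional rules.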

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of the results, as the following examples show. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना ("gold") is ambiguous, because it has another meaning, "to sleep". In the context of this query सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and its meaning depends on the context of the sentence. For example, कर means "tax" in one context (synonyms: बम ज श लक स द भहस ऱ ट कस), "hand or arm" in another (synonyms: हसत फ ह आच शफय), and "to do" (करना) in yet another.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists queries randomly selected from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

| S. No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Bird's species |
| 6.2 | ऩकषमोकीपरज ततम | Bird's species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |

Table 4.18: Effect of morphological factors on Hindi queries

| S. No. | Root word / keyword | Listing of keywords and morphological variants (Google; Bing; Guruji) | Documents returned: Google | Bing | Guruji |
|---|---|---|---|---|---|
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | |
| 1.1 | ब यत वष मवन | | 50500 | 4680 | 485 |
| 1.2 | ब यत मवष मवनो | | 40400 | 680 | 61 |
| 2 | द घमटन | द घमटन द घमटन ओ; द घमटन; द घमटन | | | |
| 2.1 | द घमटन | | 133000 | 2410 | 284 |
| 2.2 | द घमटन ओ | | 117000 | 420 | 23 |
| 3 | ब ष | ब ष ब ष ओ; ब ष; ब ष | | | |
| 3.1 | ब ष | | 161000 | 8330 | 961 |
| 3.2 | ब ष ए | | 6200 | 935 | 188 |
| 3.3 | ब ष ओ | | 6090 | 441 | 356 |
| 4 | झ र | झ र झ रो; झ र; झ र | | | |
| 4.1 | झ र | | 4740 | 278 | 25 |
| 4.2 | झ र | | 1270 | 28 | 1 |
| 5 | आऩद | आऩद आऩद ओ; आऩद; आऩद | | | |
| 5.1 | आऩद | | 102000 | 4030 | 410 |
| 5.2 | आऩद ए | | 1160 | 64 | 20 |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत; ऩ परज तत; ऩ परज तत | | | |
| 6.1 | ऩ परज तत | | 48200 | 1670 | 98 |
| 6.2 | ऩकषमोपरज तत | | 47600 | 1150 | 84 |
| 6.3 | ऩकषमोपरज ततम | | 33800 | 747 | 25 |
| 7 | सभसम | सभसम ए सभसम ओ सभसम; सभसम; सभसम | | | |
| 7.1 | सभसम | | 584000 | 30200 | 1889 |
| 7.2 | सभसम ओ | | 584000 | 7150 | 1356 |
| 8 | कीटन शक | कीटन शक कीटन शको; कीटन शक; कीटन शक | | | |
| 8.1 | कीटन शक | | 36300 | 1360 | 333 |
| 8.2 | कीटन शको | | 35800 | 800 | 270 |
| 9 | य ग | य गो य ग; य ग; य ग | | | |
| 9.1 | य ग | | 205000 | 21600 | 1423 |
| 9.2 | य चगमो | | 128000 | 3280 | 239 |
| 9.3 | य ग | | 112000 | 6280 | 647 |
| 10 | म जन | म जन ओ म जन; म जन; म जन | | | |
| 10.1 | म जन | | 673000 | 18500 | 3343 |
| 10.2 | म जन ओ | | 669000 | 6020 | 990 |
| 10.3 | म जन ए | | 673000 | 2860 | 416 |
| 11 | क दर | क दरीम क दर; क दर; क दर | | | |
| 11.1 | क दर | | 261000 | 11300 | 655 |
| 11.2 | क नदरो | | 29500 | 1850 | 105 |

Table 4.19: Precision values of the three search engines

| S. No. | Query | Precision@10: Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1: Precision values of the three search engines (x-axis: query number, 1.1 to 11.1; y-axis: precision, 0 to 1; series: P Google, P Bing, P Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists the morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

From the above results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
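The precision values used throughout this chapter are computed over the top 10 results. A minimal sketch of that calculation follows; the function name `precision_at_k` and the example relevance judgments are assumptions made for illustration, not data from the experiments.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    relevance: list of booleans in rank order (True = relevant).
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


if __name__ == "__main__":
    # Hypothetical judgments for one query: 5 of the top 10 are relevant.
    judgments = [True, False, True, True, False, False, True, False, True, False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned means that an engine returning only a few results is not rewarded for it.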

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for the phonetic equivalents of the keywords entered: a user may search with any of the variants, and the search engine should support all of them.

The following queries were randomly selected from the set of 50 queries tested on the Google search engine. Table 4.20 shows the results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

| Hindi query (with standard keywords) | Phonetic variations of the keywords | Query keywords submitted to Google | No. of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |

Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, comparing the phonetic variants of each query)

The table and figure above clearly show that search engines return documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved, with very little repetition. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
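The observation that phonetic variants retrieve different document sets with very little repetition can be quantified as the overlap between two result lists. The sketch below uses the Jaccard coefficient; the function name and the placeholder URLs are hypothetical, and this measure was not necessarily the one used in the experiments.

```python
def jaccard_overlap(results_a, results_b):
    """Share of distinct documents that two result lists have in common."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Placeholder result lists for two spellings of the same keyword.
    variant_1 = ["doc-a", "doc-b", "doc-c"]
    variant_2 = ["doc-c", "doc-d", "doc-e"]
    print(jaccard_overlap(variant_1, variant_2))
```

A value near 0 means the two spellings reach almost disjoint parts of the web, which is the situation described above.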

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.

Table 4.21: Effect of word synonyms on Hindi IR

| S. No. | Query | Standard Hindi word / synonyms | Documents returned: Google | P@10 | Bing | P@10 | Guruji | P@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | स न क आब षण | आब षणगहन | | | | | | |
| 1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर | | | | | | |
| 2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | |
| 4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | |
| 5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | |
| 6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |

Figure 4.3: Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results. Therefore, synonyms of keywords play an important role in Hindi information retrieval.
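Replacing a keyword by its synonyms, as done by hand in the experiments above, can be sketched as a small query-expansion routine. The synonym table, the example query and the function name `expand_query` are illustrative assumptions; a real system would draw synonyms from a lexical resource.

```python
# Hypothetical synonym table (assumption: hand-picked for illustration only).
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    variants = [query]
    for position, word in enumerate(words):
        for synonym in synonyms.get(word, []):
            variant = words[:position] + [synonym] + words[position + 1:]
            variants.append(" ".join(variant))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, so that documents using any of the synonyms are reached.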

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries in Table 4.22. The second column contains the queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective sense of the ambiguous keyword.

Table 4.22: List of randomly selected ambiguous queries

| S. No. | Query | For keyword as | In English | For keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25: Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in another, non-required context. The numbers in this column signify the deviation from relevance, and they show that search engines underperform when supplied with ambiguous queries. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

In another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the non-required context (snake head), Bing returns 9, and Guruji fails to retrieve even a single document. It is therefore important for search engines to address ambiguity in keywords in order to obtain better results.
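The "deviation from relevance" read off the "Other context" column can be expressed as a share of the top results examined. The sketch below is a hedged illustration: the function name `context_shares` is hypothetical, and the counts passed to it are the query 5 rows of Tables 4.23 to 4.25.

```python
def context_shares(intended, alternate, other):
    """Split the examined results into shares per context.

    Returns None when no results were examined (e.g. "No results found").
    """
    total = intended + alternate + other
    if total == 0:
        return None
    return {
        "intended": intended / total,
        "alternate": alternate / total,
        "other": other / total,
    }


if __name__ == "__main__":
    # Query 5 counts (intended sense, other sense, other context) per engine.
    query_5 = {"Google": (2, 3, 5), "Bing": (0, 2, 8), "Guruji": (0, 0, 10)}
    for engine, counts in query_5.items():
        print(engine, context_shares(*counts))
```

Reporting shares rather than raw counts keeps engines comparable even when fewer than ten results could be examined.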

4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Based on phonetic variations of this kind, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi (Google result counts in parentheses)

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 |
|---|---|---|---|---|
| Woman | वोभन (42200) | व भन (101000) | व भ न (3520) | व भ न (34800) |
| Insurance | इनसयाॊस (1220) | इ शम यस (246000) | इ शम य स (1390) | इनशम य नस (3710) |
| Cancer | क सय (1880000) | क सय (35700) | क नसय (6490) | क नसय (1820) |
| Hospital | हॉसटऩटर (404000) | ह िसऩटर (263200) | ह िसऩटर (2440) | ह सऩ टर (100) |
| Corruption | कोररसततओन (1890) | कयपशन (368000) | कयऩशन (1110) | क यपशन (58) |
| Computer | कॊ तमटय (4450000) | कमपम टय (1040000) | कमपम टय (2610000) | क पम टय (537000) |
| University | उननवशसतम (1540) | म तनवमसमटी (1070000) | म तनव मसमटी (1420) | म तनवसमटी (3270) |
| Director | डियकटय (4600) | ड मय कटय (735000) | ड इय कटय (97000) | ड मय कटय (8300) |
| Accident | एसकसिट (26300) | एकस डट (125000) | एकस ड नट (2510) | ऐिकसडट (5160) |
| Parliament | ऩशरअभट (3240) | ऩ मरमम भट (38000) | ऩ मरमभट (639) | ऩ मरमम भट (1120) |
| Specialist | टऩशसअशरटट (143) | सऩ मशममरसट (1440) | सऩ शमरसट (51300) | सऩ श मरसट (9820) |
| Expert | एकसऩट (269000) | एकसऩटम (8730) | ऐकसऩटम (182) | ऐकसऩटम (8) |
| Policy | ऩोशरटम (609) | ऩॉमरस (395000) | ऩ मरस (69700) | ऩ मरस (289000) |

From the above table it can be observed that the search engine returns documents not only for the single-keyword query but also for all phonetic variants of the keyword, and these are large in number. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This indicates that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query. The queries are converted at two levels: at the first level, one Hindi word is replaced by its English equivalent, and at the second level, more than one word is replaced, again without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein), where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Thus the original Hindi query supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
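The two-level transformation described above can be sketched as follows. This is not the software described in the next chapter; it is a minimal illustration in which the mapping table, the particular transliterated spellings and the function name `transform` are all assumptions.

```python
# Hypothetical mapping from Hindi keywords to English words written in
# Devanagari (assumption: spellings chosen for illustration only).
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform(query, mapping=ENGLISH_IN_HINDI):
    """Return (level_1, level_2) variants of a Hindi query.

    Level 1 replaces the first replaceable keyword only; level 2 replaces
    every replaceable keyword. A level is None when it cannot be formed.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in mapping]
    level_1 = level_2 = None
    if replaceable:
        first = replaceable[0]
        level_1 = " ".join(
            mapping[word] if i == first else word for i, word in enumerate(words)
        )
    if len(replaceable) > 1:
        level_2 = " ".join(mapping.get(word, word) for word in words)
    return level_1, level_2


if __name__ == "__main__":
    print(transform("विदेशी निवेश भारत में"))
```

Words without an entry in the mapping are left untouched, so the meaning of the query is preserved as long as the mapping itself is meaning-preserving.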

Table 4.27: Queries transformed into two equivalent variants containing a mixture of English and Hindi (tabular representation, Google search results)

| In English | Hindi query | Level 1 | Level 2 | Results: Hindi query | Level 1 | Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4: Queries transformed into two equivalent variants (one chart per query, Hindi Query 1 to Hindi Query 6, each showing the original Hindi query, Level 1 and Level 2)

From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. Since for search engines the quality of results is more important than the quantity, Table 4.28 and Figure 4.5 present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.

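Table 4.28: Analysis of precision values (tabular representation)

| Query | Variant | Query text | Precision@10: Google | Bing | AltaVista |
|---|---|---|---|---|---|
| 1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9 |
| 1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9 |
| 1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8 |
| 2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8 |
| 2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7 |
| 2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5 |
| 3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9 |
| 3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7 |
| 3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1 |
| 4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6 |
| 4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7 |
| 4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8 |
| 5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9 |
| 5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9 |
| 5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4 |
| 6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1 |
| 6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1 |
| 6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1 |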

Figure 4.5: Analysis of inter-precision values (x-axis: HQ1 to HQ6, each followed by L1 and L2, where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista)

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, users may miss relevant information, because they may not be aware of the presence of such information on the web and may be unable to formulate the query variants that reflect the influence of English. The table above indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard for writing Hindi, which widens the gap between Hindi web data and its users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


42202 Conjuncts

Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:

The halant (्), a diacritical subscript, e.g. क्.

A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically used only when typographical difficulties make it difficult to use conjuncts.

42203 Horizontal Conjuncts

Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and the modified consonant symbol is then conjoined to the second consonant symbol. For example:

न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ + त = फ्त (मुफ्त)
क + य = क्य (क्यों)

Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
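In Unicode text, a conjunct is encoded by placing the halant (called VIRAMA, U+094D) between the consonants; whether it is drawn as a horizontal or vertical ligature or with a visible halant is left to the font. The short sketch below illustrates this encoding; the helper name `conjunct` is an assumption made for the example.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)


def conjunct(*consonants):
    """Join consonants with the halant so that only the last keeps its inherent vowel."""
    return VIRAMA.join(consonants)


if __name__ == "__main__":
    nda = conjunct("न", "द")
    print(nda, [hex(ord(ch)) for ch in nda])
    print("हि" + nda + "ी")
```

This is why a keyword containing a conjunct and the same keyword typed with an explicit halant compare as equal code-point sequences, even if they are displayed differently.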

42204 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)

42205 Other Conjuncts

Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or using an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated at the same place in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
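Because the same word can be written with an anusvar or with a full nasal conjunct, a search for one spelling can miss documents that use the other. The following Python sketch shows one way to derive the conjunct spelling from the anusvar spelling using the homorganic rule described above. The function name `anusvara_to_conjunct` and the table `HOMORGANIC` are assumptions made for illustration; an anusvar before a letter that is not one of the listed stops is left unchanged.

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA (halant)

# Map every stop consonant to the nasal articulated at the same place.
HOMORGANIC = {}
for nasal, stops in [
    ("ङ", "कखगघ"),  # velar
    ("ञ", "चछजझ"),  # palatal
    ("ण", "टठडढ"),  # retroflex
    ("न", "तथदध"),  # dental
    ("म", "पफबभ"),  # labial
]:
    for stop in stops:
        HOMORGANIC[stop] = nasal


def anusvara_to_conjunct(word):
    """Rewrite each anusvar as the homorganic nasal plus halant."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and nxt in HOMORGANIC:
            out.append(HOMORGANIC[nxt] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)
```

A query interface could submit both the original word and the output of `anusvara_to_conjunct`, so that either spelling in a document matches.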

Conjuncts with र

As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)

Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted v, as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit:

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, so a maatraa that is written before its consonant (such as ि) is placed before the entire conjunct. There are hundreds of conjuncts, but most are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam (।), a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.

[59][60]

4.3 Unicode and Fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters of the English language (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode defines a code space of 1,114,112 code points (U+0000 to U+10FFFF), each of which fits in a 32-bit value, so it can potentially give a unique number to every character in all known languages.

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format (the Unicode Standard expands it as Unicode Transformation Format).
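The difference between the three encoding forms can be seen by encoding a single Devanagari letter with Python's built-in codecs. This is a small illustration only; the big-endian codec names are chosen so that no byte order mark is added, and the variable names are arbitrary.

```python
text = "\u0915"  # DEVANAGARI LETTER KA, code point U+0915

utf8 = text.encode("utf-8")        # variable length, 8-bit code units
utf16 = text.encode("utf-16-be")   # one 16-bit code unit for this letter
utf32 = text.encode("utf-32-be")   # always one 32-bit code unit

sizes = {"utf-8": len(utf8), "utf-16": len(utf16), "utf-32": len(utf32)}

# All three forms decode back to the same character without loss.
round_trip = (
    utf8.decode("utf-8"),
    utf16.decode("utf-16-be"),
    utf32.decode("utf-32-be"),
)
```

For this letter, UTF-8 needs three bytes, UTF-16 two and UTF-32 four, which is why Devanagari text is usually somewhat larger in UTF-8 than in UTF-16.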

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to 4 bytes so that it matches the Unicode code space). The most commonly used characters, including all of Devanagari, need at most three bytes. Non-ASCII characters are encoded as follows.

A non-ASCII character is encoded as a sequence of several bytes, each of which has its most significant bit set. This means that all bytes representing non-ASCII characters are invalid under the ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence indicates how many bytes the sequence contains: the number of leading 1 bits in the first byte equals the total length of the sequence. Every following byte begins with the bits 10, and its remaining six bits carry part of the code point [61].
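The byte layout described above can be made concrete with a short hand-written encoder. This is a teaching sketch rather than a replacement for Python's built-in codec: the function name `utf8_encode` is an assumption, and the sketch does not reject surrogate code points or values above U+10FFFF.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 bytes (1 to 4 bytes)."""
    if code_point < 0x80:
        # ASCII range: a single byte with the top bit clear.
        return bytes([code_point])
    if code_point < 0x800:
        # Lead byte 110xxxxx, then one continuation byte 10xxxxxx.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Lead byte 1110xxxx, then two continuation bytes.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Lead byte 11110xxx, then three continuation bytes.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


ka = utf8_encode(0x0915)  # DEVANAGARI LETTER KA
```

The whole Devanagari block lies between U+0800 and U+FFFF, so every Devanagari character takes the three-byte branch.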

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
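The virama-based model means that a conjunct is not a separate character in Unicode: it is stored as a consonant, the virama and the next consonant, and the font decides how to draw the ligature. A minimal Python sketch follows; the helper name `make_conjunct` is an assumption, while `unicodedata.name` is part of the standard library.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (the halant)


def make_conjunct(first, second):
    """Join two consonants by placing a virama after the first one."""
    return first + VIRAMA + second


conjunct = make_conjunct("\u0928", "\u0926")  # NA + DA, rendered as a ligature
names = [unicodedata.name(ch) for ch in conjunct]
```

For retrieval this matters because a conjunct that looks like one glyph on screen occupies three code points in the stored text, so string lengths and character offsets do not correspond to what the user sees.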

4.3.4 Devanagari: U+0900-U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0-F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
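The parallel code layout can be demonstrated by shifting a character by the distance between two script blocks. The sketch below is a rough illustration, not a transliteration tool: the function name `shift_script` is an assumption, and some positions are unassigned or used differently in particular scripts, so a real converter needs an exception table.

```python
# First code point of each 128-position block in the Unicode Standard.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text, source, target):
    """Move characters of the source block to the same offset in the target block."""
    lo = BLOCK_START[source]
    delta = BLOCK_START[target] - lo
    return "".join(
        chr(ord(ch) + delta) if lo <= ord(ch) < lo + 0x80 else ch
        for ch in text
    )


bengali_ka = shift_script("\u0915", "devanagari", "bengali")
gujarati_ka = shift_script("\u0915", "devanagari", "gujarati")
```

Characters outside the source block, such as spaces and Latin letters, are passed through unchanged.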

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
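The orthographic syllable structure can be approximated with a regular expression over the stored code points. The sketch below is a simplification under stated assumptions: the pattern and the function name `syllables` are illustrative, the character ranges cover only the core Devanagari letters and signs, and special cases such as joiner characters are ignored.

```python
import re

# A consonant letter, optionally followed by a nukta.
C = "[\u0915-\u0939\u0958-\u095F]\u093C?"

SYLLABLE = re.compile(
    # Zero or more "consonant + virama" pairs, then the core consonant,
    # then an optional vowel sign or final virama, then nasal/visarga signs.
    "(?:" + C + "\u094D)*" + C + "[\u093E-\u094C\u094D]?[\u0901-\u0903]*"
    # Or an independent vowel with optional nasal/visarga signs.
    "|[\u0904-\u0914][\u0901-\u0903]*"
)


def syllables(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    return SYLLABLE.findall(word)


hindi = syllables("\u0939\u093F\u0928\u094D\u0926\u0940")
```

Note that in the first syllable the vowel sign ि is stored after ह in the text sequence, even though it is drawn to the left of the consonant. This is the logical order mentioned above, and it is the order a tokenizer or indexer actually sees.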

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Devanagari script used for Hindi. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font became available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support for 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers now have a Web presence. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What challenges still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of whom about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good-quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two" [63].

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpora, architecture and tools for Asian languages. It has a monolingual corpus containing approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of English text accompanied by translations into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been noted since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Bombay, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for translating government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent English as well as Hindi grammar. Initially it was domain-specific, covering Personnel Administration (specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars); gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was available only for English-to-Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with a translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti combines rule-based and statistical approaches. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

During 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre-Processing, Translation Engine and Post-Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian-language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language in WikiBhasa is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, and has also developed high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, probably more than 10-15 Indian local search engines were launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website     2008 Stats  2009 Stats
Google      37          35
Yahoo       32          25
Rediff      7           4
Orkut       6           7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website     2008 Stats  2009 Stats
Google      81          76
Yahoo       7           7
Wikipedia   3           6
English     3           4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website     2008 Stats  2009 Stats
Google      65          34
Yahoo       12          29
Rediff      4           15
Teluguone   2           0
Guruji      1           18
Raftaar     1           0.2
Hindi       1           NA
Webdunia    1           0.7
Khoj        0.7         2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. But just saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs, and by how much.
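To make "quantifying the problem" concrete, recall can be computed as the fraction of relevant documents that a search actually returns. The following is a minimal sketch in Python; the function name `recall` and the document identifiers are hypothetical and are not measurements from this study.

```python
def recall(retrieved, relevant):
    """Return the fraction of relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical judgments: four documents are relevant to the information need.
relevant_docs = {"d1", "d2", "d3", "d4"}

# The original spelling of the query returns nothing; a rephrased query returns three hits.
original_hits = []
rephrased_hits = ["d1", "d2", "d9"]

print(recall(original_hits, relevant_docs))   # 0.0
print(recall(rephrased_hits, relevant_docs))  # 0.5
```

On the web the full set of relevant documents is not known, so in practice such a figure can only be estimated, for example relative to the pooled results of several query variants.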

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                        Google  Yahoo  Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                  0       0      0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च      0       0      0

Table 4.16 Improved recall

Hindi Query                                        Google  Yahoo  Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                      8820    92     12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                      331     10     1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च      708     50     1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध             37100   7400   93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are rarely followed, for various reasons. One of the reasons is the language diversity in India. Including Hindi, about 28 languages are spoken in India, and Hindi, being an official language of India, is influenced by the regional languages, which results in dialectal variation not only in speaking but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, and Hindi uses MAATRAAS such as ए, ा and ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
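The stemming step described above can be sketched as simple suffix stripping. This is only an illustration under stated assumptions: the suffix list below is a small, hand-picked set of plural and oblique endings, and the function name `stem` is hypothetical. It is not the stemmer used by any of the search engines discussed in this chapter.

```python
# Illustrative plural/oblique suffixes only; a real stemmer needs a much larger list.
SUFFIXES = ("ियों", "ओं", "एं", "एँ", "ों")


def stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# All three forms of "Yojnaa" reduce to the same root word.
for form in ("योजना", "योजनाओं", "योजनाएं"):
    print(form, "->", stem(form))
```

Such a rule list over-stems some words (for instance, nouns whose root itself ends in a vowel sign that the plural replaces), so a production stemmer would combine it with a lexicon or with exception rules.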

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together, the variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) can be spelled in several ways. There are numerous words that are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways. When information is searched for the standard keyword स्कूल (school) and for a non-standard Hindi phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety of words; for example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word इस्कूल, more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may search with any spelling, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query changes the results, i.e. a different set of documents is retrieved with little repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his or her needs.
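One way a system might treat phonetically equivalent spellings alike is to map each keyword to a normalized key before matching. The sketch below is an assumption-laden illustration, not a description of how Google, Bing or Guruji work: it folds only three kinds of variation (nukta, chandrabindu versus anusvara, and a half nasal consonant written instead of anusvara), and the function name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"

# A nasal consonant plus virama before another consonant is often written as anusvara.
HALF_NASAL = re.compile("[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])")


def phonetic_key(word):
    """Return a normalized spelling that several common variants share."""
    text = unicodedata.normalize("NFD", word)      # split precomposed nukta letters
    text = text.replace(NUKTA, "")                 # ignore nukta dots
    text = text.replace(CHANDRABINDU, ANUSVARA)    # treat both nasal marks alike
    text = HALF_NASAL.sub(ANUSVARA, text)          # half nasal consonant -> anusvara
    return unicodedata.normalize("NFC", text)


# Three spellings of the same word (movement) collapse to one key.
for spelling in ("आन्दोलन", "आंदोलन", "आँदोलन"):
    print(spelling, "->", phonetic_key(spelling))
```

Indexing documents and queries under the same key would let a query in one spelling match documents written in another, at the cost of occasionally merging words that are genuinely different.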

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.
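A retrieval system can compensate for synonym choice by expanding the query with known synonyms. The following minimal sketch assumes a small hand-written synonym table; in practice the table might come from a lexical resource such as a WordNet. The names `SYNONYMS` and `expand_query` are hypothetical.

```python
# Hypothetical synonym table; a real system would draw on a thesaurus or a WordNet.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query plus one variant per synonym of each known keyword."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alt in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return variants


# Prints the original query followed by three synonym variants.
for variant in expand_query("सोने के आभूषण"):
    print(variant)
```

The results of the variants can then be merged, which trades some precision for better recall.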

4.8.4 Ambiguous Words

Ambiguous words reduce the relevancy of the results, as the examples below show clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task, because one word can have more than one meaning and its meaning depends on the context of the sentence. For example, कर can mean tax (with synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स) in one context, hand or arm (with synonyms such as हस्त and बाहु) in another, and "to do" (करना) in yet another.
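The role of context in resolving such ambiguity can be illustrated with a very small overlap-based disambiguator, in the spirit of simplified Lesk. Everything here is an assumption made for illustration: the clue words in `SENSE_CLUES` are invented, and `disambiguate` is a hypothetical helper, not a component of any search engine discussed here.

```python
# Hypothetical clue words for each sense; real systems would use a lexical resource.
SENSE_CLUES = {
    "सोना": {
        "gold": {"आभूषण", "गहने", "कीमत", "धातु"},
        "to sleep": {"रात", "नींद", "बिस्तर"},
    },
}


def disambiguate(keyword, query, sense_clues=SENSE_CLUES):
    """Pick the sense whose clue words overlap most with the query, or None on a tie."""
    context = set(query.split()) - {keyword}
    scores = {
        sense: len(clues & context)
        for sense, clues in sense_clues.get(keyword, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


print(disambiguate("सोना", "सोना आभूषण कीमत"))       # gold
print(disambiguate("सोना", "रात को सोना"))            # to sleep
print(disambiguate("सोना", "नारी को सोना पसंद है"))   # None: the query has no clue word
```

The last call shows why the example query is hard: nothing in it favours either sense, so a system needs additional evidence (user history, domain, or clarification) to choose.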

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the native equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No.  Query in Hindi                      Meaning in English
1       भारत वर्षावन                         Indian rain forest
1.1     भारतीय वर्षावनों                     Indian rain forests
2       हवाई दुर्घटना के कारण                Reason for air crash
2.1     हवाई दुर्घटनाओं के कारण              Reason for air crashes
3       भारत में बोली जाने वाली भाषा         Language spoken in India
3.1     भारत में बोली जाने वाली भाषाएं       Languages spoken in India
3.2     भारत में बोली जाने वाली भाषाओं       Languages spoken in India
4       विलुप्त होने पर झील                  Lake on the verge of extinction
4.1     विलुप्त होने पर झीलें                Lakes on the verge of extinction
5       प्राकृतिक आपदा                       Natural calamity
5.1     प्राकृतिक आपदाएं                     Natural calamities
6       पक्षी की प्रजाति                     Bird species
6.1     पक्षियों की प्रजाति                  Birds' species
6.2     पक्षियों की प्रजातियां               Birds' species
7       कृषि समस्या                          Agriculture problem
7.1     कृषि समस्याओं                        Agriculture problems
8       कीटनाशक का इस्तेमाल                  Use of pesticide
8.1     कीटनाशकों का इस्तेमाल                Use of pesticides
9       मानसिक रोग                           Mental illness
9.1     मानसिक रोगियों                       Mental patients
9.2     मानसिक रोगी                          Mental patient
10      ग्रामीण विकास योजना                  Policy for village
10.1    ग्रामीण विकास योजनाओं                Policies for village
10.2    ग्रामीण विकास योजनाएं                Policies for village
11      प्रमुख कृषि केंद्र                   Major agricultural office
11.1    प्रमुख कृषि केन्द्रों                Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S No Root

word

s

Listing of Keywords Morphological

variants

Documents Returned

Google Bing Guruji Google Bing Guruji

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

Table 4.19 Precision values of the three search engines

S. No.  Query                               Precision@10
                                            Google  Bing  Guruji
1.1     भारत वर्षावन                         0.5     0.3   0.1
1.2     भारतीय वर्षावनों                     0.3     0.1   0.1
2.1     हवाई दुर्घटना के कारण                0.5     0.3   0.1
2.2     हवाई दुर्घटनाओं के कारण              0.3     0.3   0.1
3.1     भारत में बोली जाने वाली भाषा         0.9     0.5   0.5
3.2     भारत में बोली जाने वाली भाषाएं       0.5     0.4   0.3
3.3     भारत में बोली जाने वाली भाषाओं       0.5     0.2   0.2
4.1     विलुप्त होने पर झील                  0.5     0.3   0.2
4.2     विलुप्त होने पर झीलें                0.5     0.3   0.0
5.1     प्राकृतिक आपदा                       1.0     0.4   0.3
5.2     प्राकृतिक आपदाएं                     0.6     0.6   0.1
6.1     पक्षी की प्रजाति                     1.0     0.7   0.2
6.2     पक्षियों की प्रजाति                  0.9     0.6   0.1
6.3     पक्षियों की प्रजातियां               0.9     0.6   0.1
7.1     कृषि समस्या                          0.7     0.3   0.2
7.2     कृषि समस्याओं                        0.7     0.4   0.2
8.1     कीटनाशक का इस्तेमाल                  1.0     0.8   0.2
8.2     कीटनाशकों का इस्तेमाल                0.9     0.6   0.2
9.1     मानसिक रोग                           0.7     0.6   0.4
9.2     मानसिक रोगियों                       0.9     0.6   0.3
9.3     मानसिक रोगी                          0.9     0.5   0.4
10.1    ग्रामीण विकास योजना                  0.9     0.7   0.3
10.2    ग्रामीण विकास योजनाओं                1.0     0.6   0.0
10.3    ग्रामीण विकास योजनाएं                0.8     0.6   0.2
11.1    प्रमुख कृषि केंद्र                   0.5     0.4   0.0
11.2    प्रमुख कृषि केन्द्रों                0.4     0.2   0.2

Figure 4.1 Precision values of the three search engines (series: P Google, P Bing, P Guruji; x-axis: query number; y-axis: precision from 0 to 1)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word because, in general, we get better results with keywords in their root form.
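The size of this effect can be expressed as the ratio between the number of documents returned for a variant and for its root form. The sketch below uses Google counts taken from Table 4.18 (rows 1.1/1.2, 2.1/2.2 and 5.1/5.2); the function name `variant_to_root_ratio` is a hypothetical helper introduced only for this illustration.

```python
# Google document counts (root form, variant form) as reported in Table 4.18.
GOOGLE_COUNTS = {
    "1": (50500, 40400),    # rows 1.1 and 1.2
    "2": (133000, 117000),  # rows 2.1 and 2.2
    "5": (102000, 1160),    # rows 5.1 and 5.2
}


def variant_to_root_ratio(root_count, variant_count):
    """Return how many documents the variant finds per document found by the root form."""
    if root_count <= 0:
        raise ValueError("root_count must be positive")
    return variant_count / root_count


for query_no, (root, variant) in GOOGLE_COUNTS.items():
    print(query_no, round(variant_to_root_ratio(root, variant), 3))
```

A ratio well below 1, as for query 5, indicates that a user who types the variant form sees only a small fraction of what the root form would have retrieved.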

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the table above.

From the above results, it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why the number of documents they retrieve is smaller than for Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, it can be said that the relevancy of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
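The precision value over the top 10 results can be computed as in the following minimal sketch. The relevance judgments are hypothetical, and the convention of dividing by k even when fewer than k results are returned is an assumption chosen here so that an engine returning nothing scores 0.

```python
def precision_at_k(relevance, k=10):
    """Return the fraction of the top-k results judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for is_relevant in top if is_relevant) / k


# Hypothetical judgments for the first ten results of one query (True = relevant).
judged = [True, True, False, True, False, True, False, False, True, False]
print(precision_at_k(judged))  # 0.5
```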

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may search with any spelling, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20 Results of the search engine on phonetic nature of Hindi language

Hindi Query

With Bold

Standard

Keywords

Phonetic variations of the Keywords Google Results

for query having

keywords

No of

Results

Precision

10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05

Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

In the above table and figure, it can be clearly seen that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query changes the results, i.e. a different set of documents is retrieved with little repetition. From the precision charts, it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his or her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.

Table 4.21 and Figure 4.3 compare the precision values across the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

Each group starts with the query built on the standard Hindi word, followed by the same query with a synonym substituted. "Docs" is the number of documents returned and "P@10" is the precision over the top 10 results.

S. No.  Query                 Google Docs  P@10  Bing Docs  P@10  Guruji Docs  P@10
1       सोने के आभूषण (standard word: आभूषण)
1.1     सोने के आभूषण         217000       0.8   3250       0.7   381          0.5
1.2     सोने के गहने           188000       0.8   2590       0.6   389          0.5
1.3     सोने के जेवर           78900        0.8   1670       0.7   311          0.6
1.4     सोने के अलंकार         9490         0.5   633        0.4   70           0.0
1.5     सोने के आभरण          493          0.5   38         0.3   1            0.0
2       काले बादल (standard word: बादल)
2.1     काले बादल             233000       0.7   7510       0.7   733          0.3
2.2     काले मेघ               40700        0.9   1500       0.8   99           0.6
2.3     काले जलधर             1570         0.6   54         0.6   2            0.2
3       स्त्री सशक्तिकरण (standard word: स्त्री)
3.1     स्त्री सशक्तिकरण       9950         0.9   1570       0.7   760          0.6
3.2     नारी सशक्तिकरण        29300        0.9   1910       0.9   736          0.4
3.3     महिला सशक्तिकरण       96300        0.9   5160       0.8   1091         0.3
3.4     औरत सशक्तिकरण         7670         0.8   680        0.7   510          0.2
4       सिकंदर का अहंकार (standard word: अहंकार)
4.1     सिकंदर का अहंकार      1990         0.4   18         0.3   60           0.6
4.2     सिकंदर का अभिमान      2400         0.6   304        0.2   16           0.1
4.3     सिकंदर का घमंड        495          0.5   54         0.6   9            0.1
5       वृक्ष लगाओ (standard word: वृक्ष)
5.1     वृक्ष लगाओ             6960         1.0   698        0.8   29           0.5
5.2     पेड़ लगाओ              13400        1.0   1080       0.9   143          0.9
5.3     दरख्त लगाओ            481          0.5   19         0.5   0            0.0
6       आंख दान (standard word: आंख)
6.1     आंख दान               34000        0.7   3690       0.5   312          0.3
6.2     नेत्र दान               77500        1.0   3240       1.0   159          0.9
6.3     चक्षु दान               2450         0.8   427        0.1   36           0.1

Figure 4.3 Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. On average Google still ranks first, that is, its precision values are higher than those of Bing and Guruji in this case. Thus, it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Therefore, synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.

Table 4.22 List of randomly selected ambiguous queries

S. No.  Query                    Keyword as       In English                        Keyword as         In English
1       नारी को सोना पसंद है      सोना (Gold)       Women like gold                   सोना (To sleep)     Women like to sleep
2       आम लोगों की पसंद         आम (Common)      Common man's choice               आम (Mango)         Mango is people's choice
3       बाल विकास और पोषण       बाल (Children)    Child development and nutrition   बाल (Hair)          Hair development and nutrition
4       सपेरों का फन              फन (Art)          Art of snake charmers             फन (Snake head)    Snake charmer's snake head
5       युद्ध में कुल विनाश        कुल (Aggregate)   Aggregate destruction in wars     कुल (Family)        Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query  Ambiguous keyword  Documents returned  Context    Results found  Context     Results found  Other context
1      सोना                50800               Gold       5              To sleep    2              3
2      आम                 488000              Common     3              Mango       3              4
3      बाल                 2900000             Children   7              Hair        3              0
4      फन                 184                 Art        0              Snake head  10             0
5      कुल                 17800               Aggregate  2              Family      3              5

Table 4.24 Ambiguity test for Bing

Query  Ambiguous keyword  Documents returned  Context    Results found  Context     Results found  Other context
1      सोना                2680                Gold       2              To sleep    2              6
2      आम                 17800               Common     3              Mango       3              4
3      बाल                 4030                Children   6              Hair        2              2
4      फन                 25                  Art        0              Snake head  9              1
5      कुल                 1900                Aggregate  0              Family      2              8

Table 4.25 Ambiguity test for Guruji

Query  Ambiguous keyword  Documents returned  Context    Results found  Context     Results found  Other context
1      सोना                109                 Gold       0              To sleep    0              10
2      आम                 6756                Common     3              Mango       0              7
3      बाल                 635                 Children   5              Hair        2              3
4      फन                 No results found    Art        NA             Snake head  NA             NA
5      कुल                 84                  Aggregate  0              Family      0              10

From the results in the above tables, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the above tables, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in another, unwanted context. From the results it is clear that all search engines return documents in different contexts. Therefore, it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

In another query, सपेरों का फन (art of snake charmers), which has the alternative reading "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, the results show that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9 such documents, whereas Guruji fails to retrieve even a single document. In this scenario, it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word exercise is written in Hindi as एक्सरसाइज, and this word has several phonetic variations in Hindi script. Following the kinds of phonetic variation in this example, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (one count per Hindi form, in the same order)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 261000 0 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8


Table 4.26: Keywords along with their phonetic variants written in Hindi

The table above shows that the search engine returns documents for single-keyword queries, and that it also returns documents, in large numbers, for every phonetic variant of a keyword. People evidently have their own ways of writing Hindi words, and no standard is followed for storing Hindi data on the web: documents are retrieved for every phonetically variant English keyword written in Hindi script. The column with bold Hindi entries (Google Transliteration) shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi without changing the meaning of the query.

Table 4.26 (continued)
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000


The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, in both cases without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent forms that mix English and Hindi, while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
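The two-level substitution can be sketched in a few lines of Python. This is only an illustration, not the software described in the next chapter: the function name `transform_query`, the romanized tokens and the small `equivalents` dictionary are assumptions made for the example, and Level 2 is taken here to mean that every word with a known English equivalent is replaced.

```python
def transform_query(query, equivalents):
    """Return (level1, level2) variants of a whitespace-tokenized query.

    Level 1 replaces only the first word that has an English equivalent;
    Level 2 replaces every word that has one. Words without an entry in
    `equivalents` are kept unchanged, so the meaning is preserved.
    """
    words = query.split()
    level1 = list(words)
    for i, word in enumerate(words):
        if word in equivalents:
            level1[i] = equivalents[word]
            break  # Level 1 stops after a single replacement
    level2 = [equivalents.get(word, word) for word in words]
    return " ".join(level1), " ".join(level2)


# Hypothetical lookup table, romanized for readability.
equivalents = {"videshi": "foreign", "nivesh": "investment"}
level1, level2 = transform_query("videshi nivesh bharat mein", equivalents)
print(level1)  # foreign nivesh bharat mein
print(level2)  # foreign investment bharat mein
```

In a real system the dictionary would map Devanagari keywords to English words written in Devanagari; romanized tokens are used here only to keep the sketch independent of any particular spelling.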


Table 4.27: Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation)

No. | In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2
1 | Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
2 | Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
3 | Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
4 | Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
5 | Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
6 | Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses [chart of Google result counts for Hindi Queries 1 to 6, each shown for the original query, Level 1 and Level 2]


The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity, so Table 4.28 and Figure 4.5 present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve web results.

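Table 4.28: Analysis of precision values (tabular representation)

Query form | Query text | Precision@10: Google | Precision@10: Bing | Precision@10: AltaVista
HQ1 | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
HQ1, Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
HQ1, Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
HQ2 | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
HQ2, Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
HQ2, Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
HQ3 | रदमचचककतसक | 0.9 | 0.9 | 0.9
HQ3, Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
HQ3, Level 2 | क रड मम र िजसट | 1.0 | 1.0 | 1.0
HQ4 | सयक यदव य य जग यम जन | 1.0 | 0.8 | 0.6
HQ4, Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
HQ4, Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
HQ5 | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
HQ5, Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
HQ5, Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
HQ6 | भरषटट च यभ कतब यत | 1.0 | 1.0 | 1.0
HQ6, Level 1 | कयपशनभ कतब यत | 1.0 | 1.0 | 1.0
HQ6, Level 2 | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0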


Figure 4.5: Analysis of inter-precision values

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. Transforming the queries retrieved additional data, and the relevance of that data can be seen in the precision columns. For a Hindi query and its transformed variations, the degree of relevance of the documents is in most cases close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the pattern repeats for the other Hindi queries in the table above. Without query transformation, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variations that arise from English influence. The table above indicates that English-influenced Hindi information is present on the web and is growing day by day. Including English keywords written in Hindi script in the query widens the scope of searching in Hindi and of retrieving relevant information.
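The precision values discussed above are the fraction of the first 10 returned documents that were judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the list of judgments are illustrative assumptions, since the thesis does not show how the values were computed.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k results judged relevant.

    `relevance` is a list of booleans in ranked order (True = relevant).
    Dividing by k rather than by the number of results returned means
    that a short result list is penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(1 for judged in top_k if judged) / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))  # 0.9
```

Under this definition, a query that returns no documents at all scores 0.0.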

[Figure 4.5 chart: inter-precision values for HQ1 to HQ6 (HQ = Hindi Query) and their Level 1 (L1) and Level 2 (L2) variants; series: Google, Bing, AltaVista]


The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi orthography, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level would cause performance and throughput problems. This problem can, however, be solved at the interface level. To reduce the effort a Hindi user needs to find such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


42204 Vertical Conjuncts

Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:

ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)

42205 Other Conjuncts

Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, as in हिन्दी) or using an anusvar, which is a dot written above the horizontal line to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.

Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus, the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.

Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
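Because both spellings of a nasal conjunct are in use, a retrieval system may want to treat them as the same word. The following is a minimal sketch of one way to do that, assuming Unicode-encoded input and a hypothetical helper named `to_anusvara`; it rewrites a nasal consonant plus virama as an anusvar only when the next consonant belongs to the same articulation class, mirroring the homorganic rule described above. It is an illustration, not a description of what any search engine actually does.

```python
import re

ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA

# Each nasal consonant mapped to the stops of its own articulation class.
HOMORGANIC = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # velar nasal before ka kha ga gha
    "\u091E": "\u091A\u091B\u091C\u091D",  # palatal nasal before ca cha ja jha
    "\u0923": "\u091F\u0920\u0921\u0922",  # retroflex nasal before tta ttha dda ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # dental nasal before ta tha da dha
    "\u092E": "\u092A\u092B\u092C\u092D",  # labial nasal before pa pha ba bha
}


def to_anusvara(text):
    """Rewrite nasal + virama as an anusvar when the next stop is homorganic."""
    for nasal, stops in HOMORGANIC.items():
        # The lookahead keeps the following consonant in place.
        pattern = re.compile(f"{nasal}{VIRAMA}(?=[{stops}])")
        text = pattern.sub(ANUSVARA, text)
    return text


print(to_anusvara("हिन्दी") == "हिंदी")  # True
print(to_anusvara("अण्डा") == "अंडा")    # True
```

Applying the same function to both the query and the indexed text would make the two spellings match; text that is already written with an anusvar passes through unchanged.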

Conjuncts with र

As the first member of a conjunct, र appears as a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)

Four consonants, ट, ठ, ड and ढ, do not have a vertical line, so they indicate a following र with a symbol like an inverted v, as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit:

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced ग्य (gya) in Hindi. Conjuncts are treated as a single unit, so a maatraa that is written before its consonant is placed before the entire conjunct. There are hundreds of conjuncts, but most are easily discernible.

Punctuation

Hindi has one native punctuation sign, the viraam (।), a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam [59][60].

4.3 Unicode and fonts

Computers store characters by assigning a number to each one; this process is known as encoding. Most of us are familiar with ASCII, a 7-bit encoding of the characters used in English that can represent at most 128 characters. Over time, the need was felt for a single encoding with enough capacity to accommodate all the languages of the world. To enable the sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns a unique number (a code point in the range U+0000 to U+10FFFF) to each character, which is enough to cover the characters of all known languages; any code point fits in a 32-bit value.

There is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.


4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when it is stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit code units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. It transforms all Unicode characters into a variable-length encoding of bytes. Its advantages are that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format.
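The trade-off between the three encoding forms can be seen by encoding one Hindi word in each of them. The sketch below uses only Python's built-in codecs; the helper name `encoded_sizes` is an assumption for illustration, and the little-endian codec variants are chosen so that no byte order mark is added to the byte counts.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each encoding form."""
    sizes = {}
    for form in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(form)
        # Converting back loses nothing, whichever form is used.
        assert data.decode(form) == text
        sizes[form] = len(data)
    return sizes


word = "हिन्दी"  # six code points, all in the Devanagari block
print(len(word))            # 6
print(encoded_sizes(word))  # {'utf-8': 18, 'utf-16-le': 12, 'utf-32-le': 24}
```

For Devanagari text, UTF-16 is the most compact of the three (two bytes per code point), while UTF-8 needs three bytes per code point and UTF-32 four.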

4.3.2 UTF-8

UTF-8 has the benefit that ASCII characters are still represented as a single byte, which provides compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme of 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current definition in RFC 3629 limits them to 4). The most commonly used characters, including those of Devanagari, need at most three bytes. Non-ASCII characters are encoded as follows.

Each non-ASCII character is encoded as a sequence of several bytes, each of which has the most significant bit set. All bytes representing non-ASCII characters are therefore invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The leading bits of the first byte of a multibyte sequence indicate how many bytes the sequence contains. The remaining bits of the first byte, together with the low six bits of each following byte, encode the actual character [61].
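The byte-level properties described above can be checked directly. The following sketch, whose function names `sequence_length` and `utf8_bits` are assumptions for illustration, reads the sequence length from the leading byte and prints the bit pattern of each byte of a Devanagari letter.

```python
def sequence_length(first_byte):
    """Number of bytes in the UTF-8 sequence that starts with `first_byte`."""
    if first_byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("not a valid leading byte (continuation bytes are 10xxxxxx)")


def utf8_bits(ch):
    """Bit patterns of the UTF-8 bytes of a single character."""
    return [f"{byte:08b}" for byte in ch.encode("utf-8")]


print(utf8_bits("A"))  # ['01000001']
print(utf8_bits("ह"))  # ['11100000', '10100100', '10111001']
print(sequence_length("ह".encode("utf-8")[0]))  # 3
```

Every byte of the Devanagari letter has its most significant bit set, while the ASCII letter does not, so the two can never be confused.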

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virama. In some languages, another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.


4.3.4 Devanagari: U+0900-U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0-F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
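Logical order can be made visible by listing the code points of a word as they are stored. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is an assumption for illustration. Note that the vowel sign for short i is stored after its consonant, even though it is drawn to the left of it.

```python
import unicodedata


def describe(text):
    """List (code point, Unicode character name) pairs in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


word = "हिन्दी"
for code_point, name in describe(word):
    print(code_point, name)
# U+0939 DEVANAGARI LETTER HA
# U+093F DEVANAGARI VOWEL SIGN I
# U+0928 DEVANAGARI LETTER NA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0926 DEVANAGARI LETTER DA
# U+0940 DEVANAGARI VOWEL SIGN II

# Every character of the word lies in the Devanagari block.
print(all(0x0900 <= ord(ch) <= 0x097F for ch in word))  # True
```

The conjunct of NA and DA is not a separate code point: it is stored as NA, VIRAMA, DA, and the rendering engine and font produce the joined shape.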

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos; more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could accommodate only 128 characters, which is not enough for the Hindi Devanagari script. Unicode, by contrast, was originally designed for about 65,000 characters and now provides more than a million code points. As most computers in India use Microsoft's operating system, this ensured that the Hindi font became available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10800 articles.

What challenges still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of whom about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora totalling more than 45 million words in the Scheduled Languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology, Government of India, has started the Indian Language Corpora Initiative (ILCI) to develop Indian language corpora. ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Efforts in this field date back to 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology for translating from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on the grammar rules of Panini's Ashtadhyayi. It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for translating government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliament proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility to the user: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based approach with a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net. [24]

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat/.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text for text manipulation. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine: Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has long been aware of the need for language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working to promote language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India. [64]

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. Nevertheless, we attempt a brief overview of the current scenario.

The following fall into the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search: English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search: English

4.6.1.3 Info Search: Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search: Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist.

Simply saying that we have a recall problem may not be sufficient. The next obvious question is: how much of a problem is it? In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs and by how much.
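The recall problem discussed above can be stated with the standard information retrieval definitions. In the sketch below, R (the set of documents relevant to a query) and A (the set of documents the search engine returns) are notation introduced here for illustration; they do not appear in the original study.

```latex
\[
\text{Recall} = \frac{|R \cap A|}{|R|}, \qquad
\text{Precision} = \frac{|R \cap A|}{|A|}
\]
```

Under these definitions, a query that returns 0 documents although relevant documents exist has an empty set A and therefore a recall of 0, which is the situation the zero-result queries illustrate.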


Table 4.15: Problems faced while searching in Hindi: Low recall

Google   Yahoo   Guruji   Hindi Query
0        0       0        वलडम टर ड स नटय आत कव दी हरभ
0        0       0        इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम

Table 4.16: Improved Recall

Google   Yahoo   Guruji   Hindi Query
8820     92      12       वलडम टर ड सटय आतॊकी हभर
331      10      1        वलडम टर ड सटय आतॊकीअटक
708      50      1        इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम
37100    7400    93       बायतीम सॊटथान टवाटम शिा औय िोध

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, grammatical and structural standards are often not followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes its dialects not only in speaking but in writing also. Every language attaches markers to a root word to construct new words: English uses s, es and ing, while Hindi uses maatraas (vowel signs) and suffixes such as ओं and एँ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
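The stemming step described above can be sketched as a light suffix stripper. This is only an illustrative sketch: the suffix list and the name `light_stem` are assumptions made here, not the stemmer used by any of the search engines studied, and a real Hindi stemmer needs a far larger rule set.

```python
# Illustrative plural/oblique suffixes (an assumption, not from the study).
SUFFIXES = ["ियों", "ियाँ", "ओं", "एँ", "एं", "ों", "ें"]


def light_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least two characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# The variants from the text collapse to one canonical form.
variants = ["योजना", "योजनाओं", "योजनाएँ"]
stems = {light_stem(v) for v in variants}
```

Indexing and querying on the stem rather than the surface form lets a query for any one variant match documents containing the others.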

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible spelling variations. There are numerous words that are phonetically equivalent but written differently. The word "school" can be written in Hindi in several different ways. When information is searched with the standard keyword स्कूल (school) and with a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by the other regional languages, which results in phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different set of results, i.e. a different set of documents is retrieved with very little overlap. The native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.
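One way to make phonetically equivalent spellings match is to map each keyword to a normalized key before indexing and searching. The sketch below is a minimal illustration under stated assumptions: the three rules (drop the nukta, treat chandrabindu as anusvara, and rewrite a nasal consonant plus virama before a consonant as anusvara) and the name `phonetic_key` are choices made here and are deliberately crude; they are not the method of any search engine discussed in this chapter.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
# A nasal consonant (ङ ञ ण न म) + virama directly before another consonant.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(text: str) -> str:
    """Return a crude spelling-insensitive key for a Devanagari string."""
    text = unicodedata.normalize("NFD", text)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)  # ँ -> ं
    text = NASAL_CLUSTER.sub(ANUSVARA, text)     # न्द -> ंद
    return unicodedata.normalize("NFC", text)


# Three spellings of "aandolan" (movement) share one key.
keys = {phonetic_key(w) for w in ["आन्दोलन", "आंदोलन", "आँदोलन"]}
```

Because the same function is applied to both the indexed terms and the query terms, over-aggressive rewriting does not break matching; it only risks conflating words that are genuinely different.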

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
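A retrieval system can compensate for the user's choice of word by expanding the query with known synonyms. The following is a minimal sketch, assuming a small hand-built synonym table; the dictionary `SYNONYMS` and the function `expand_query` are hypothetical names introduced for illustration, and a real system might draw synonyms from a resource such as a WordNet for Hindi.

```python
# Hypothetical synonym table using the ornament example from the text.
SYNONYMS = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}


def expand_query(query: str, synonyms: dict = SYNONYMS) -> list:
    """Return the original query plus one variant per known synonym."""
    terms = query.split()
    variants = [query]
    for i, term in enumerate(terms):
        for alternative in synonyms.get(term, []):
            variants.append(" ".join(terms[:i] + [alternative] + terms[i + 1:]))
    return variants


expanded = expand_query("आभूषण")
```

Each variant can be submitted separately and the result lists merged, so that documents using any of the synonyms become reachable from a single user query.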

4.8.4 Ambiguous words

Ambiguous words reduce the relevancy of the results. The examples below show this aspect clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango. So the above query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. A word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कर in the sense of tax has synonyms such as ब्याज, शुल्क, सूद, महसूल and टैक्स; in another context कर means hand or arm, with synonyms such as हस्त and बाहु; and in yet another context it means to do (करना).

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No   Meaning in English                 Query in Hindi
1      Indian rain forest                 ब यतवष मवन
1.1    Indian rain forests                ब यत मवष मवनो
2      Reason for air crash               हव ईद घमटन क क यण
2.1    Reason for air crashes             हव ईद घमटन ओ क क यण
3      Language spoken in India           ब यतभफ रीज न व रीब ष
3.1    Languages spoken in India          ब यतभफ रीज न व रीब ष ए
3.2    Languages spoken in India          ब यतभफ रीज न व रीब ष ओ
4      Lake on the verge of extinction    षवर पतह न ऩयझ र
4.1    Lakes on the verge of extinction   षवर पतह न ऩयझ र
5      Natural calamity                   पर क ततकआऩद
5.1    Natural calamities                 पर क ततकआऩद ए
6      Bird species                       ऩ कीपरज तत
6.1    Birds' species                     ऩकषमोकीपरज तत
6.2    Birds' species                     ऩकषमोकीपरज ततम
7      Agriculture problem                क षषसभसम
7.1    Agriculture problems               क षषसभसम ओ
8      Use of pesticide                   कीटन शकक इसत भ र
8.1    Use of pesticides                  कीटन शकोक इसत भ र
9      Mental illness                     भ नमसकय ग
9.1    Mental patients                    भ नमसकय चगमो
9.2    Mental patient                     भ नमसकय ग
10     Policy for village                 गर भ णषवक सम जन
10.1   Policies for village               गर भ णषवक सम जन ओ
10.2   Policies for village               गर भ णषवक सम जन ए
11     Major agricultural office          परभ खक षषक दर
11.1   Major agricultural offices         परभ खक षषक नदरो


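Table 4.18: Effect of morphological factors on Hindi queries

Each block gives the root word(s), the keywords listed by the search engines, and then one row per morphological variant with the number of documents returned.

S.No    Google    Bing     Guruji   Morphological variant

1       Root words: ब यत वष मवन
        Keywords listed (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1     50500     4680     485      ब यत वष मवन
1.2     40400     680      61       ब यत मवष मवनो

2       Root word: द घमटन
        Keywords listed: Google: द घमटन द घमटन ओ; Bing: द घमटन; Guruji: द घमटन
2.1     133000    2410     284      द घमटन
2.2     117000    420      23       द घमटन ओ

3       Root word: ब ष
        Keywords listed: Google: ब ष ब ष ओ; Bing: ब ष; Guruji: ब ष
3.1     161000    8330     961      ब ष
3.2     6200      935      188      ब ष ए
3.3     6090      441      356      ब ष ओ

4       Root word: झ र
        Keywords listed: Google: झ रझ रो; Bing: झ र; Guruji: झ र
4.1     4740      278      25       झ र
4.2     1270      28       1        झ र

5       Root word: आऩद
        Keywords listed: Google: आऩद आऩद ओ; Bing: आऩद; Guruji: आऩद
5.1     102000    4030     410      आऩद
5.2     1160      64       20       आऩद ए

6       Root words: ऩ परज तत
        Keywords listed: Google: ऩ ऩकषमोपरज ततपरज ततमोपरज तत; Bing: ऩ परज तत; Guruji: ऩ परज तत
6.1     48200     1670     98       ऩ परज तत
6.2     47600     1150     84       ऩकषमोपरज तत
6.3     33800     747      25       ऩकषमोपरज ततम

7       Root word: सभसम
        Keywords listed: Google: सभसम ए सभसम ओ सभसम; Bing: सभसम; Guruji: सभसम
7.1     584000    30200    1889     सभसम
7.2     584000    7150     1356     सभसम ओ

8       Root word: कीटन शक
        Keywords listed: Google: कीटन शक कीटन शको; Bing: कीटन शक; Guruji: कीटन शक
8.1     36300     1360     333      कीटन शक
8.2     35800     800      270      कीटन शको

9       Root word: य ग
        Keywords listed: Google: य गोय ग; Bing: य ग; Guruji: य ग
9.1     205000    21600    1423     य ग
9.2     128000    3280     239      य चगमो
9.3     112000    6280     647      य ग

10      Root word: म जन
        Keywords listed: Google: म जन ओ म जन; Bing: म जन; Guruji: म जन
10.1    673000    18500    3343     म जन
10.2    669000    6020     990      म जन ओ
10.3    673000    2860     416      म जन ए

11      Root word: क दर
        Keywords listed: Google: क दरीमक दर; Bing: क दर; Guruji: क दर
11.1    261000    11300    655      क दर
11.2    29500     1850     105      क नदरो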


Table 4.19: Precision values of the three search engines (Precision@10)

S.No    Google   Bing   Guruji   Query
1.1     0.5      0.3    0.1      ब यतवष मवन
1.2     0.3      0.1    0.1      ब यत मवष मवनो
2.1     0.5      0.3    0.1      हव ईद घमटन क क यण
2.2     0.3      0.3    0.1      हव ईद घमटन ओ क क यण
3.1     0.9      0.5    0.5      ब यतभफ रीज न व रीब ष
3.2     0.5      0.4    0.3      ब यतभफ रीज न व रीब ष ए
3.3     0.5      0.2    0.2      ब यतभफ रीज न व रीब ष ओ
4.1     0.5      0.3    0.2      षवर पतह न ऩयझ र
4.2     0.5      0.3    0        षवर पतह न ऩयझ र
5.1     1        0.4    0.3      पर क ततकआऩद
5.2     0.6      0.6    0.1      पर क ततकआऩद ए
6.1     1        0.7    0.2      ऩ कीपरज तत
6.2     0.9      0.6    0.1      ऩकषमोकीपरज तत
6.3     0.9      0.6    0.1      ऩकषमोकीपरज ततम
7.1     0.7      0.3    0.2      क षषसभसम
7.2     0.7      0.4    0.2      क षषसभसम ओ
8.1     1        0.8    0.2      कीटन शकक इसत भ र
8.2     0.9      0.6    0.2      कीटन शकोक इसत भ र
9.1     0.7      0.6    0.4      भ नमसकय ग
9.2     0.9      0.6    0.3      भ नमसकय चगमो
9.3     0.9      0.5    0.4      भ नमसकय ग
10.1    0.9      0.7    0.3      गर भ णषवक सम जन
10.2    1        0.6    0        गर भ णषवक सम जन ओ
10.3    0.8      0.6    0.2      गर भ णषवक सम जन ए
11.1    0.5      0.4    0        परभ खक षषक दर
11.2    0.4      0.2    0.2      परभ खक षषक नदरो

Figure 4.1: Precision values of the three search engines [chart of precision (0 to 1) against query number for Google, Bing and Guruji]


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the tables above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high in almost all queries. Since, as noted above, Google appears to index keywords in their root form, the relevancy of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
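The precision values discussed above are computed over the top 10 results of each engine. A minimal sketch of that calculation follows; the function name `precision_at_k` and the representation of relevance judgments as a list of booleans are assumptions made for illustration, since the source does not describe how the judgments were recorded.

```python
def precision_at_k(relevance: list, k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevance` holds one boolean per returned result, in rank order.
    Missing results (fewer than k returned) count as non-relevant.
    """
    top = relevance[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


# Example: 5 of the first 10 results are relevant.
judgments = [True, True, False, True, False, False, True, False, True, False]
score = precision_at_k(judgments)
```

Dividing by k rather than by the number of results actually returned penalizes an engine that returns fewer than 10 documents, which matches the way a score of 0 is reported for queries with no useful results.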

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and precision offered by Google.

Table 4.20: Results of the search engine on phonetic nature of Hindi language

Each block gives the Hindi query (standard keywords in bold in the original), the phonetic variations of its keywords, and then one row per query variant submitted to Google.

No. of results   Precision@10   Google results for query having keywords

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
97               0.9            ससजिमोिहयीर
632              0.9            ससजिमो जहयीर
194              0.9            सिबजमो िहयीर

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
35300            1.0            आसभ न भहॊगाई
1040             1.0            आसभ न भहॉगाई
14               0.6            आसभ भह ग ई
563              0.7            आसभाॊ भह ग ई

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
211000           0.8            भरषटाचायआज दी
214              0.6            भरशटाचायआज दी
447              0.7            बयषटाचाय आज दी
1090000          0.9            भरषटट च यआिादी
1040             0.7            भरशट च य आिादी
1190             0.8            बयषटट च यआिादी

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
84700            0.3            अनन हज य आनदोरन
85100            0.8            अनन हज य आॊदोरन
78               0.6            अनन हज य क आॉदोरन
399              0.5            अनना हिाय क आनद रन
3260000          1.0            अनन हिाय क आनद रन

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
9650             0.9            फयोिगायी
80600            1.0            फयोिगायी
170              0.7            फयोिगायी
30               0.5            फयोिगायी

Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart each for Query No. 1 to Query No. 5)

From the above table and figure it can be seen that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in a query the results vary, i.e. a different set of documents is retrieved with little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
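Two of the spelling variations seen in the table, anusvara versus chandrabindu and a consonant written with or without a nukta, can be folded together mechanically. The sketch below is only an illustration under that assumption; the function name `normalize_hindi` is hypothetical, the rule set is deliberately tiny, and it is not the software described in the next chapter.

```python
import unicodedata

CHANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU
ANUSVARA = "\u0902"      # DEVANAGARI SIGN ANUSVARA
NUKTA = "\u093c"         # DEVANAGARI SIGN NUKTA


def normalize_hindi(text):
    """Map some phonetically equivalent spellings to one canonical form."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    decomposed = unicodedata.normalize("NFD", text)
    # Treat chandrabindu and anusvara as the same nasalisation mark.
    decomposed = decomposed.replace(CHANDRABINDU, ANUSVARA)
    # Drop the nukta so that borrowed-sound letters match their base letter.
    return decomposed.replace(NUKTA, "")


# Two spellings of the same word collapse to one key.
same = normalize_hindi("\u092e\u0939\u0901\u0917\u093e\u0908") == \
    normalize_hindi("\u092e\u0939\u0902\u0917\u093e\u0908")
```

Such a key could be used to group variant queries before they are sent to a search engine; which variations are safe to merge is a linguistic decision that this sketch does not settle.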

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर and अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values obtained from three search engines.

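Table 4.21: Effect of word synonyms on Hindi IR

S. No. | Query (standard Hindi word and synonyms) | Google: documents returned | Google: Precision@10 | Bing: documents returned | Bing: Precision@10 | Guruji: documents returned | Guruji: Precision@10
1 | स न क आब षण (आब षण, गहन) | | | | | |
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (फ दर) | | | | | |
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (सतर, न यी) | | | | | |
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (अ हक य) | | | | | |
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (व) | | | | | |
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (आ ख) | | | | | |
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1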

Figure 4.3: Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines, while Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing provide better results than Guruji, and on average Google ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results, and it is evident that synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.


Table 4.22: List of randomly selected ambiguous queries

S. No. | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in the table above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unintended context. Since all the search engines return documents in differing contexts, it can be said that they underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

The English language influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (in the same order as the four Hindi forms)
Woman | वोभन | व भन | व भ न | व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609, 395000, 69700, 289000

From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query but also for all phonetic variants of the keyword, often in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The "Google transliteration" column (bold Hindi entries in the original table) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. It is evident from the table that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. The Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

The queries are transformed at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, in both cases without changing the meaning of the original Hindi query. Example:

The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Thus the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
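The two-level transformation described above can be sketched as follows. This is an illustrative sketch only: the function `transform_query` and the small dictionary of English equivalents are assumptions made here, and the actual software is described in the next chapter. Romanised words stand in for the Hindi-script keywords.

```python
from itertools import combinations


def transform_query(query, english_equivalents):
    """Return (level1, level2) variants of a whitespace-separated query.

    level1: variants with exactly one keyword replaced.
    level2: variants with two or more keywords replaced.
    `english_equivalents` maps a Hindi keyword to the English word
    written in Hindi script (a hypothetical hand-built dictionary).
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in english_equivalents]

    def variant(indices):
        return " ".join(
            english_equivalents[w] if i in indices else w
            for i, w in enumerate(words)
        )

    level1 = [variant({i}) for i in replaceable]
    level2 = [
        variant(set(combo))
        for size in range(2, len(replaceable) + 1)
        for combo in combinations(replaceable, size)
    ]
    return level1, level2


# Romanised stand-ins for the example query used in the text.
equivalents = {"videshi": "foreign", "nivesh": "investment"}
level1, level2 = transform_query("videshi nivesh bharat mein", equivalents)
```

With two replaceable keywords this yields two Level 1 variants and one Level 2 variant; the example in the text shows only one of the Level 1 variants.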

Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (Google result counts for Hindi Query 1 to Hindi Query 6 and their Level 1 and Level 2 variants)

From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.

Table 4.28: Analysis of precision values (tabular representation)

Query | Variant | Query text | Google Precision@10 | Bing Precision@10 | AltaVista Precision@10
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1
4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1
6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1

Figure 4.5: Analysis of inter-precision values (inter-precision chart for Google, Bing and AltaVista; HQ = Hindi query, L = level)

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the data retrieved, and the relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the pattern is similar for the rest of the Hindi queries shown in the table above.

Without transformation of Hindi queries the user may miss relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variants that reflect the influence of English. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.

The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and its users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].


As the first member of a conjunct, र appears as a small hook or sickle above and to the right of the following consonant:

र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)

As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:

क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)

Four consonants, ट, ठ, ड and ढ, do not have a vertical line, so they indicate a following र with a symbol like an inverted v, as follows:

ट + र = ट्र (राष्ट्र)

Special Conjuncts

Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.

क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म

The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct. There are hundreds of conjuncts, but most are easily discernible.

Punctuation

Hindi has one punctuation sign of its own, the viraam, a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography periods are also used in place of the viraam.

[59][60]

4.3 Unicode and Fonts

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time the need was felt for a single encoding that could contain enough characters to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns a unique number (code point) to each character; its code space runs from U+0000 to U+10FFFF, so any code point fits in 32 bits, and it can potentially give a unique number to each character in all languages known to man.

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.

UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

UTF stands for UCS Transformation Format.
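The difference between the three encoding forms can be observed by counting the code units each one needs for the same character. The following sketch uses only Python's built-in codecs; the helper name `code_units` is an assumption introduced here for illustration.

```python
def code_units(text):
    """Number of code units needed for `text` in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


ascii_letter = code_units("A")           # a character from the ASCII range
devanagari_ha = code_units("\u0939")     # DEVANAGARI LETTER HA
emoji = code_units("\U0001F600")         # a character outside the 16-bit range

# All three forms carry the same repertoire and convert without loss.
round_trip = "\u0939".encode("utf-16-le").decode("utf-16-le") == "\u0939"
```

A Devanagari letter occupies three bytes in UTF-8 but a single code unit in UTF-16 and UTF-32, which is why the choice of encoding form affects the storage size of Hindi text.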

4.3.2 UTF-8

UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes); the most commonly used characters are at most three bytes long. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
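The byte-level rules above can be checked on a single Devanagari letter. The sketch below is illustrative; the helper `utf8_sequence_length` is a name introduced here, and it accepts only the lead-byte patterns for sequences of one to four bytes.

```python
def utf8_sequence_length(lead_byte):
    """Total length in bytes of the UTF-8 sequence starting with lead_byte."""
    if lead_byte < 0x80:             # 0xxxxxxx: single-byte ASCII
        return 1
    if 0xC0 <= lead_byte < 0xE0:     # 110xxxxx: one continuation byte follows
        return 2
    if 0xE0 <= lead_byte < 0xF0:     # 1110xxxx: two continuation bytes follow
        return 3
    if 0xF0 <= lead_byte < 0xF8:     # 11110xxx: three continuation bytes follow
        return 4
    raise ValueError("continuation byte or invalid lead byte")


encoded = "\u0939".encode("utf-8")   # DEVANAGARI LETTER HA, U+0939
length = utf8_sequence_length(encoded[0])
all_high_bits_set = all(byte & 0x80 for byte in encoded)
ascii_unchanged = "Hindi".encode("utf-8") == "Hindi".encode("ascii")
```

The letter is stored as the three bytes E0 A4 B9: the lead byte announces a three-byte sequence, and every byte has its most significant bit set, so none of them can be confused with an ASCII character.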

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few often reveals structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred: in Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch of the Brahmi-derived scripts includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.
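The parallel code layout described above means that analogous letters in the nine scripts sit at the same offset within their respective 128-position Unicode blocks. The sketch below illustrates this with a naive offset mapping; the function `to_script` and the `BLOCK_START` table are names introduced here, and the mapping is only a rough aid, since not every position is assigned in every script and a script-faithful transliteration needs more than an offset.

```python
import unicodedata

# First code point of each script's Unicode block.
BLOCK_START = {
    "Devanagari": 0x0900, "Bengali": 0x0980, "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80, "Oriya": 0x0B00, "Tamil": 0x0B80,
    "Telugu": 0x0C00, "Kannada": 0x0C80, "Malayalam": 0x0D00,
}


def to_script(text, target):
    """Map Devanagari characters to the analogous position in `target`."""
    offset = BLOCK_START[target] - BLOCK_START["Devanagari"]
    result = []
    for ch in text:
        code_point = ord(ch)
        if 0x0900 <= code_point <= 0x097F:
            mapped = chr(code_point + offset)
            # Keep the original character if the target position is unassigned.
            result.append(mapped if unicodedata.name(mapped, "") else ch)
        else:
            result.append(ch)
    return "".join(result)


bengali_ka = to_script("\u0915", "Bengali")        # DEVANAGARI LETTER KA
bengali_word = to_script("\u0915\u092e\u0932", "Bengali")
```

Here the Devanagari letter KA (U+0915) lands on the Bengali letter KA (U+0995), exactly 0x80 positions later.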

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence these characters are stored in logical (phonetic) order [62].
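Logical order can be seen by listing the code points of a Hindi word. In the sketch below (the helper name `describe` is an assumption for illustration), the word हिन्दी is stored with the short-i vowel sign after the consonant it belongs to, even though the sign is drawn to the left of that consonant, and the conjunct is stored as consonant, virama, consonant.

```python
import unicodedata


def describe(text):
    """List each character of `text` as (code point, Unicode name)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


word = "\u0939\u093f\u0928\u094d\u0926\u0940"   # the word "Hindi" in Devanagari
characters = describe(word)
in_devanagari_block = all(0x0900 <= ord(ch) <= 0x097F for ch in word)
```

The resulting list reads HA, VOWEL SIGN I, NA, SIGN VIRAMA, DA, VOWEL SIGN II, which is the order of pronunciation rather than the visual order of the glyphs.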

44 Indian Languages on internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization at Microsoft India. The earlier 7-bit system could represent only 128 characters, which is not enough for the Hindi Devanagari script. Unicode's Basic Multilingual Plane alone can hold more than 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font became available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, an effort which is also part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpora, architecture and tools for Asian languages. It has a monolingual corpus which contains approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in the translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' for a repository of a Hindi word database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been recorded since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system and has approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application which is available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti during 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliament proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility to the user: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

During 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre Processing, Translation Engine and Post Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT: Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and also provides an automated translation facility from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player which assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation and collection and of content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government in e-governance programs. It has developed software for multilingual content creation for publishing newspapers and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has been aware of the need for Indian language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). Also, Indian languages TRANSliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working for the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. Here are some that fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Websites     2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem while dealing with Indian languages due to multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords do occur in the search results. But just saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs and by how much.
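
One common way to quantify such a problem is the standard recall measure: the fraction of the relevant documents that a query actually retrieves. The sketch below is only an illustration of that definition; the function name `recall` and the document identifiers are hypothetical and are not taken from the experiments reported in this chapter.

```python
def recall(retrieved, relevant):
    """Return the fraction of relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(set(retrieved) & relevant) / len(relevant)

if __name__ == "__main__":
    # Hypothetical judgments: four documents are known to be relevant.
    relevant_docs = {"d1", "d2", "d3", "d4"}
    print(recall([], relevant_docs))                  # query that returns nothing
    print(recall(["d1", "d3", "d9"], relevant_docs))  # rephrased query
```

A query that returns no documents has recall 0 even when relevant documents exist, which is the situation described above for Indian language queries.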

Hindi Query                                      Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                     0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम            0        0       0

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                      Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                          8820     92      12
वलडम टर ड सटय आतॊकीअटक                           331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                   708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                    37100    7400    93

Table 4.16 Improved Recall

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but grammatical and structural standards are seldom followed, for various reasons. One of the reasons is the language diversity in India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in dialect changes not only in speaking but in writing also. Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses MAATRAAS (such as ऐ, य, ा, ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, "plan" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. The process is called word stemming, and this canonical form is called the root word or base word.
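
A minimal sketch of such stemming is given below. It is a naive suffix stripper written for illustration, not the stemmer used in this study: the suffix list `PLURAL_SUFFIXES`, the minimum stem length and the function name `stem` are assumptions, and a practical Hindi stemmer would need a much larger rule set.

```python
# Illustrative plural and oblique endings only; the list is an assumption.
PLURAL_SUFFIXES = ("ओं", "एं", "एँ", "ों")
MIN_STEM_LENGTH = 2  # avoid stripping very short words down to nothing

def stem(word):
    """Strip one known suffix so that variants share a canonical form."""
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

if __name__ == "__main__":
    for variant in ("योजना", "योजनाओं", "योजनाएं"):
        print(variant, "->", stem(variant))
```

With this sketch the three forms of Yojnaa discussed above reduce to the same string, so an index keyed on the stem would match a query written in any of the three forms.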

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. The variety in the alphabet, different dialects, and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning English) has several possible spelling variations. There are numerous words which are phonetically equivalent but vary in writing. The word "school" in Hindi can be written in different ways (सक र, सक र, सक र). When information is searched for the standard keyword for school, स्कूल, and for a non-standard Hindi phonetic equivalent, सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words; for example, the English word "school" (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population of India in different states. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keyword to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved, with little overlap. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his or her use.
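
One way a search engine could treat phonetically equivalent spellings as the same keyword is to map each word to a coarse key before matching. The sketch below is a crude illustration under stated assumptions, not an established algorithm: the folding rules (dropping the nukta and virama, merging long and short i and u, treating candrabindu as anusvara) and the name `phonetic_key` are chosen here for the example, and the three spellings of "school" are plausible variants supplied for illustration.

```python
import unicodedata

# Folding table: a deliberately coarse, assumed set of equivalences.
FOLD = str.maketrans({
    "\u0901": "\u0902",  # candrabindu -> anusvara
    "\u0940": "\u093F",  # long vowel sign II -> short I
    "\u0942": "\u0941",  # long vowel sign UU -> short U
    "\u093C": None,      # drop nukta
    "\u094D": None,      # drop virama, so conjunct and split forms match
})

def phonetic_key(word):
    """Return a coarse key shared by common spelling variants of a word."""
    # NFD splits precomposed nukta letters into base letter plus nukta.
    return unicodedata.normalize("NFD", word).translate(FOLD)

if __name__ == "__main__":
    variants = ["स्कूल", "सकूल", "स्कुल", "इस्कूल"]
    for v in variants:
        print(v, "->", phonetic_key(v))
```

The first three variants share one key, while इस्कूल keeps a different key because of its leading vowel; handling such prothetic vowels would need an additional rule.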

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word also often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
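
A retrieval system can reduce the effect of synonym choice by expanding each query term with its known synonyms. The following is a minimal sketch of that idea: the `SYNONYMS` dictionary is a hypothetical one-entry stand-in for a real lexical resource, and `expand_query` is an illustrative name.

```python
# Hypothetical synonym table; a real system would use a lexical resource.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}

def expand_query(query):
    """Return, for each query term, the list of alternatives to OR together."""
    return [[term] + SYNONYMS.get(term, []) for term in query.split()]

if __name__ == "__main__":
    for alternatives in expand_query("आभूषण कीमत"):
        print(" OR ".join(alternatives))
```

Each inner list can be submitted as an OR group, so a document that uses गहना is still retrieved for a query that uses आभूषण.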

4.8.4 Ambiguous Words

Ambiguous words reduce the relevancy of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango. So the above query can be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning of a word depends on the context of the sentence. For example, कर (tax) has the synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स in one context; in another context कर means hand or arm (हसत फ ह आच शफय); and in yet another context it means to do (करना).
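
The role of context can be illustrated with a very small overlap-based disambiguator, in the spirit of dictionary-overlap methods. This is only a sketch: the cue words in `SENSES` are hypothetical examples chosen for illustration, and `disambiguate` is an illustrative name rather than a component of any system described in this chapter.

```python
# Hypothetical cue words for each sense of an ambiguous term.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "धातु", "कीमत"},
        "sleep": {"नींद", "रात", "बिस्तर", "आराम"},
    },
}

def disambiguate(word, context):
    """Pick the sense whose cue words overlap most with the context."""
    senses = SENSES.get(word)
    if not senses:
        return None
    scores = {sense: len(cues & set(context)) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue word the query stays ambiguous.
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    print(disambiguate("सोना", "सोना आभूषण कीमत".split()))
    print(disambiguate("सोना", "रात को सोना".split()))
    print(disambiguate("सोना", "नारी को सोना पसंद है".split()))
```

The third query is the example discussed above: it contains no cue word for either sense, so the sketch returns no decision, which mirrors why such queries lower the relevancy of results.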

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without their being aware of the language these words come from. Most Indians use these words in English rather than in their native language. English has its influence on Hindi not only in speaking but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from that set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Birds' species
6.2 | ऩकषमोकीपरज ततम | Birds' species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18: Effect of morphological factors on Hindi queries

S.No | Root word(s) / morphological variant | Keywords listed (Google, Bing, Guruji) | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | |
1.1 | ब यत वष मवन | | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | |
2.1 | द घमटन | | 133000 | 2410 | 284
2.2 | द घमटन ओ | | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | | |
3.1 | ब ष | | 161000 | 8330 | 961
3.2 | ब ष ए | | 6200 | 935 | 188
3.3 | ब ष ओ | | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | | |
4.1 | झ र | | 4740 | 278 | 25
4.2 | झ र | | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | | |
5.1 | आऩद | | 102000 | 4030 | 410
5.2 | आऩद ए | | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | | |
6.1 | ऩ परज तत | | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | |
7.1 | सभसम | | 584000 | 30200 | 1889
7.2 | सभसम ओ | | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | |
8.1 | कीटन शक | | 36300 | 1360 | 333
8.2 | कीटन शको | | 35800 | 800 | 270
9 | य ग | य ग य गोय ग य ग य ग | | |
9.1 | य ग | | 205000 | 21600 | 1423
9.2 | य चगमो | | 128000 | 3280 | 239
9.3 | य ग | | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | | |
10.1 | म जन | | 673000 | 18500 | 3343
10.2 | म जन ओ | | 669000 | 6020 | 990
10.3 | म जन ए | | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | | |
11.1 | क दर | | 261000 | 11300 | 655
11.2 | क नदरो | | 29500 | 1850 | 105

Table 4.19: Precision values of the three search engines

S.No | Query | Precision (top 10): Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1.0 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1.0 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1.0 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1.0 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (chart of precision, on a scale of 0 to 1, per query for Google, Bing and Guruji)

It was observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with root words, because in general better results are obtained with keywords in their root form.
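The effect of querying with the root form can be illustrated with a light suffix-stripping stemmer. The sketch below is only an assumption about how a plural variant could be reduced to its root: the rule list and the names `SUFFIX_RULES`, `to_root` and `root_query` are hypothetical, cover just a few plural endings, and are not the method used by any of the search engines studied.

```python
# Hypothetical (suffix, replacement) rules, longest suffix first.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -ियों -> -ी   (रोगियों -> रोगी)
    ("\u093e\u0913\u0902", "\u093e"),        # -ाओं -> -ा    (समस्याओं -> समस्या)
    ("\u093e\u090f\u0902", "\u093e"),        # -ाएं -> -ा    (भाषाएं -> भाषा)
    ("\u093e\u090f\u0901", "\u093e"),        # -ाएँ -> -ा    (भाषाएँ -> भाषा)
    ("\u094b\u0902", ""),                    # -ों  -> none  (कीटनाशकों -> कीटनाशक)
]


def to_root(word):
    """Strip the first matching plural suffix; leave very short words alone."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


def root_query(query):
    """Reduce every whitespace-separated keyword of a query to its root form."""
    return " ".join(to_root(word) for word in query.split())


if __name__ == "__main__":
    print(root_query("कृषि समस्याओं"))
```

A rule list this small over-strips or misses many words; it is meant only to show why a variant and its root can be mapped to the same query.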

It was also observed that only Google lists morphological variants of the root word, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form; Bing and Guruji do not, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results matters more than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines, where precision is calculated over the top 10 results returned. On close inspection, Google's precision is high for almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
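The precision figure used in this chapter is the fraction of the first 10 results judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the convention of dividing by `k` even when fewer than `k` results are returned are assumptions made for illustration.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of booleans in ranked order, one per returned
    result. Missing results (fewer than k returned) count as not relevant.
    """
    top = relevance[:k]
    return sum(1 for is_relevant in top if is_relevant) / k


if __name__ == "__main__":
    judged = [True] * 9 + [False]   # 9 of the first 10 results are relevant
    print(precision_at_k(judged))
```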

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and the search engine should support all phonetically equivalent forms.
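One way to treat phonetically equivalent spellings alike is to map each spelling to a coarse comparison key. The sketch below is an illustrative assumption, not the behaviour of any search engine tested here: the function `phonetic_key` is hypothetical, and it handles only three common sources of variation (nukta, chandrabindu versus anusvara, and a half nasal consonant written out instead of an anusvara).

```python
import unicodedata

CHANDRABINDU = "\u0901"   # ँ
ANUSVARA = "\u0902"       # ं
NUKTA = "\u093c"          # ़
VIRAMA = "\u094d"         # ्
NASALS = "\u0919\u091e\u0923\u0928\u092e"   # ङ ञ ण न म


def phonetic_key(word):
    """Return a coarse key shared by common spelling variants of a word."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    # A nasal consonant followed by virama is treated like an anusvara.
    for nasal in NASALS:
        text = text.replace(nasal + VIRAMA, ANUSVARA)
    return text


if __name__ == "__main__":
    variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
    print({phonetic_key(v) for v in variants})
```

The key is deliberately lossy: it can also merge words that are genuinely different, so it suits widening a search rather than exact matching.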

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of the Hindi language

Query | Query submitted to Google | No. of results | Precision (top 10)
1 | Hindi query (standard keywords): ससजिमो भ िहयीर ऩद थम; phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर | |
1 | ससजिमोिहयीर | 97 | 0.9
1 | ससजिमो जहयीर | 632 | 0.9
1 | सिबजमो िहयीर | 194 | 0.9
2 | Hindi query (standard keywords): आसभान छ त भहॊगाई; phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई | |
2 | आसभ न भहॊगाई | 35300 | 1.0
2 | आसभ न भहॉगाई | 1040 | 1.0
2 | आसभ भह ग ई | 14 | 0.6
2 | आसभाॊ भह ग ई | 563 | 0.7
3 | Hindi query (standard keywords): भरषटाचाय स आिादी; phonetic variations of the keywords: भरशट च य बयषटट च य आज दी | |
3 | भरषटाचायआज दी | 211000 | 0.8
3 | भरशटाचायआज दी | 214 | 0.6
3 | बयषटाचाय आज दी | 447 | 0.7
3 | भरषटट च यआिादी | 1090000 | 0.9
3 | भरशट च य आिादी | 1040 | 0.7
3 | बयषटट च यआिादी | 1190 | 0.8
4 | Hindi query (standard keywords): अननाहिाय क आनदोरन; phonetic variations of the keywords: अनन हज य आ द रनआ द रन | |
4 | अनन हज य आनदोरन | 84700 | 0.3
4 | अनन हज य आॊदोरन | 85100 | 0.8
4 | अनन हज य क आॉदोरन | 78 | 0.6
4 | अनना हिाय क आनद रन | 399 | 0.5
4 | अनन हिाय क आनद रन | 3260000 | 1.0
5 | Hindi query (standard keywords): फयोिगायी सभसम सभ ध न; phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी | |
5 | फयोिगायी | 9650 | 0.9
5 | फयोिगायी | 80600 | 1.0
5 | फयोिगायी | 170 | 0.7
5 | फयोिगायी | 30 | 0.5

Figure 4.2: Precision charts for the phonetic nature of the Hindi language (one pie chart per query, Query No. 1 to Query No. 5, with one slice per phonetic variant)

The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. The results vary for every phonetically equivalent keyword in the query, i.e. a different set of documents is retrieved for each variant, with little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3 below compare the precision values across the three search engines.

Table 4.21: Effect of word synonyms on Hindi IR

S.No | Query (group rows: query, then standard Hindi word and synonyms) | Google documents | Google precision (top 10) | Bing documents | Bing precision (top 10) | Guruji documents | Guruji precision (top 10)
1 | स न क आब षण; आब षणगहन | | | | | |
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर; फ दर | | | | | |
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण; सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य; अ हक य | | | | | |
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ; व | | | | | |
5.1 | व रग ओ | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न; आ ख | | | | | |
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3: Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Synonyms affect not only the quantity of documents returned but also their quality.

The table and figure above also show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Replacing a keyword with its synonym can thus retrieve equivalent results, so synonyms of keywords play an important role in Hindi information retrieval.
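Generating one query per synonym, as was done by hand for the queries in Table 4.21, can be sketched as follows. This is a minimal illustration under stated assumptions: the function `expand_with_synonyms` and the small synonym dictionary are hypothetical, and a real system would draw synonyms from a lexical resource rather than a hard-coded table.

```python
def expand_with_synonyms(tokens, synonyms):
    """Return the original query plus one variant per synonym substitution.

    `tokens` is the query split into keywords; `synonyms` maps a keyword
    to a list of alternatives. Only one keyword is replaced per variant.
    """
    variants = [" ".join(tokens)]
    for position, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variant = list(tokens)
            variant[position] = alternative
            variants.append(" ".join(variant))
    return variants


# Illustrative data only: "ornament" and three everyday synonyms.
SYNONYMS = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}
QUERY = ["सोने", "के", "आभूषण"]

if __name__ == "__main__":
    for variant in expand_with_synonyms(QUERY, SYNONYMS):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is how the per-synonym rows of the table were obtained in spirit.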

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.

Table 4.22: List of randomly selected ambiguous queries

S.No | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unintended context. All the search engines return documents in different contexts, so it can be said that they underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the issue of ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word "exercise" is written in Hindi as एकसयस इज. This word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword and phonetic equivalents | Google results (transliteration) | Google results (standard keyword) | Google results (equivalent 1) | Google results (equivalent 2)
Woman | वोभन | व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000

The table shows that the search engine returns documents for every single-keyword query, and that it also returns documents, often in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web: documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google Transliteration column (set in bold in the original table) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम, which is completely wrong. The table nevertheless shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved in Hindi IR. The Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged. Example:

An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
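The two-level transformation described above can be sketched in a few lines. This is an illustrative assumption and not the software described in the next chapter: the function `transform_query`, the choice of the first replaceable keyword for Level 1, and the small equivalents table are all hypothetical.

```python
def transform_query(tokens, english_equivalents):
    """Return (level1, level2) variants of a Hindi query.

    Level 1 replaces one keyword (here: the first replaceable one) with its
    English equivalent written in Devanagari; Level 2 replaces every
    replaceable keyword, and is produced only if there is more than one.
    A level that cannot be formed is returned as None.
    """
    replaceable = [i for i, tok in enumerate(tokens) if tok in english_equivalents]
    level1 = level2 = None
    if replaceable:
        first = replaceable[0]
        level1 = " ".join(
            english_equivalents[tok] if i == first else tok
            for i, tok in enumerate(tokens)
        )
    if len(replaceable) > 1:
        level2 = " ".join(english_equivalents.get(tok, tok) for tok in tokens)
    return level1, level2


# Illustrative data only: "videshi nivesh Bharat mein".
QUERY = ["विदेशी", "निवेश", "भारत", "में"]
ENGLISH_EQUIVALENTS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}

if __name__ == "__main__":
    level1, level2 = transform_query(QUERY, ENGLISH_EQUIVALENTS)
    print(level1)
    print(level2)
```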

Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, showing the number of results for the Hindi query, Level 1 and Level 2)

The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.

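Table 4.28: Analysis of precision values (tabular representation)

Query | Form | Query text | Precision (top 10): Google | Bing | AltaVista
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1.0 | 1.0 | 1.0
4 | Hindi query | सयक यदव य य जग यम जन | 1.0 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 1 | कयपशनभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 2 | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0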

Figure 4.5: Analysis of inter-precision values (inter-precision chart of precision, on a scale of 0 to 1, for each Hindi query (HQ1 to HQ6) and its Level 1 (L1) and Level 2 (L2) variants; series: Google, Bing, AltaVista)

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. Transforming the queries retrieves additional data, and the relevance of that data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant; the pattern repeats for the rest of the Hindi queries in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variants that reflect the influence of English. The table also indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.

The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


ह + भ = हभ

The conjunct ज + ञ = ज्ञ is pronounced as gya in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct. There are hundreds of conjuncts, but most are easily discernible.
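In Unicode text a conjunct is stored as its consonants joined by the virama sign (U+094D), and the font draws the sequence as a single unit. The short sketch below illustrates this for ज + ञ; the helper name `make_conjunct` is hypothetical and used only for illustration.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, suppresses the inherent vowel


def make_conjunct(*consonants):
    """Join consonants with the virama so that they render as one conjunct."""
    return VIRAMA.join(consonants)


if __name__ == "__main__":
    conjunct = make_conjunct("\u091c", "\u091e")  # ज + ञ
    print(conjunct, len(conjunct))
    for ch in conjunct:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Although the conjunct is displayed as one shape, it is three code points long, which matters when keywords are compared or counted character by character.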

Punctuation

Hindi has one punctuation sign, the viraam (।), a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam. [59][60]

4.3 Unicode and Fonts

Computers store characters by assigning a number to each one; this process is known as encoding. Most of us are familiar with ASCII, a 7-bit encoding of the characters used in English (it can represent at most 128 characters). Over time the need was felt for a single encoding with enough room to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns a unique number (code point) to each character, in a code space running from U+0000 to U+10FFFF, which is large enough to cover the characters of all languages in use.

There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in code units of 8, 16 or 32 bits). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. It transforms all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
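The difference between the three encoding forms can be seen by encoding a single Devanagari letter with each of them. The sketch below uses Python's built-in codecs; the helper name `encoded_sizes` is hypothetical, and the big-endian codec variants are chosen only so that no byte-order mark is added to the output.

```python
ENCODING_FORMS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each encoding form."""
    return {name: len(text.encode(name)) for name in ENCODING_FORMS}


if __name__ == "__main__":
    letter = "\u0939"  # ह, DEVANAGARI LETTER HA
    for name in ENCODING_FORMS:
        data = letter.encode(name)
        # All three forms decode back to the same character without loss.
        assert data.decode(name) == letter
        print(name, data.hex(" "), len(data), "bytes")
```

For this letter the three forms need three, two and four bytes respectively, while an ASCII letter would need one, two and four.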

UTF stands for UCS Transformation Format (also expanded as Unicode Transformation Format).

4.3.2 UTF-8

UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size; the most commonly used characters, however, are at most three bytes long. Non-ASCII characters are encoded as follows.

A non-ASCII character is encoded as a sequence of several bytes, each of which has its most significant bit set. This means that all bytes representing non-ASCII characters are invalid under the ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
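The role of the first byte can be made concrete with a small sketch that reads the sequence length from the leading bits of a byte. The function name `utf8_sequence_length` is hypothetical; the bit patterns are those of the UTF-8 encoding form (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx for lead bytes and 10xxxxxx for continuation bytes).

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes the UTF-8 sequence starting with `first_byte` has."""
    if first_byte < 0x80:            # 0xxxxxxx: a single ASCII byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")


if __name__ == "__main__":
    data = "\u0939A".encode("utf-8")   # ह followed by ASCII 'A'
    print(data.hex(" "))
    print(utf8_sequence_length(data[0]))               # lead byte of ह
    print([bool(byte & 0x80) for byte in data])        # MSB set only for non-ASCII bytes
```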

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred: in Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands; there are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
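As a small illustration of the virama described in this section, the Python sketch below builds a Devanagari conjunct from its stored code points and inspects them with the standard `unicodedata` module. The variable names are chosen here for illustration, and the choice of the KA + SSA conjunct is an assumption made only for the example.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)

# KA alone carries the inherent vowel; KA + VIRAMA suppresses it.
dead_ka = KA + VIRAMA
# KA + VIRAMA + SSA is the stored form of a conjunct that fonts
# usually draw as a single ligature.
conjunct = KA + VIRAMA + SSA

for ch in conjunct:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The virama is a non-spacing combining mark, as the text describes.
is_combining = unicodedata.category(VIRAMA) == "Mn"
print(is_combining)
```

Although the conjunct is normally displayed as one shape, it remains three code points in storage, which matters for any search system that compares or tokenizes Hindi text.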


4.3.4 Devanagari: U+0900-U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0-F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
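The parallel code layout can be sketched in Python as a simple code-point shift between script blocks. This is a minimal sketch, not a transliteration system: the function name `shift_script` and the `BLOCK_START` table are hypothetical names introduced here, and the block start values are taken from the Unicode code charts rather than from the text above.

```python
import unicodedata

# Start of each script's Unicode block (values from the Unicode code charts).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text: str, source: str, target: str) -> str:
    """Move each character of the source script to the analogous target position.

    Hypothetical helper: it only shifts code points and does not check that
    the target position is assigned or linguistically appropriate.
    """
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        # Each of these blocks spans 128 code points (offsets 0x00-0x7F).
        out.append(chr(dst + offset) if 0 <= offset < 0x80 else ch)
    return "".join(out)


bengali_ka = shift_script("\u0915", "devanagari", "bengali")
tamil_ka = shift_script("\u0915", "devanagari", "tamil")
print(unicodedata.name(bengali_ka))
print(unicodedata.name(tamil_ka))
```

Because not every position is assigned in every script, a shift of this kind can land on an unassigned code point; a practical converter would need per-script exception tables.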

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G (Extended Character Set for Vedic). Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
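The three character types and the logical storage order can be made concrete with a short Python sketch. The function `character_type` is a hypothetical helper written for this illustration, and its code-point ranges are a simplification that covers only the core Devanagari letters.

```python
import unicodedata


def character_type(ch: str) -> str:
    """Classify a Devanagari character into the three types named in the text.

    The ranges are a simplification covering only the core letters.
    """
    cp = ord(ch)
    if 0x0905 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant letter"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    return "other"


# The syllable "ki": the vowel sign I is drawn to the LEFT of KA on screen,
# but it is stored AFTER the consonant, following the phonetic order.
syllable = "\u0915\u093F"
stored = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), character_type(ch))
    for ch in syllable
]
for entry in stored:
    print(entry)

# The independent vowel I, used when the vowel starts a word.
print(character_type("\u0907"))
```

The gap between display order and storage order is one reason why legacy font encodings, which stored glyphs in visual order, cannot be searched reliably with Unicode queries.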

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system, even in its original 16-bit form, can incorporate about 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few. Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus that contains approximately 96,157,000 words and a parallel corpus of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository for a Hindi word database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Efforts in this field have been under way since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system and has approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. To develop automated translation systems for regional languages, the Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004, with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which it uses to represent both the English and the Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was only for English-to-Hindi translation, but it is now also available for translation from English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti combines rule-based and statistical approaches. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT: Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and also provides an automated translation facility from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasha is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language in WikiBhasha is English and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development in corpus creation and collection and in content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government in e-governance programs. It has developed software for multilingual content creation for publishing newspapers, and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has also been aware of the need for language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology, under the Ministry of Communication and Information Technology, is also working for the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide funds for research to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario.

Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by JuxtConsult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website      2008 Stats          2009 Stats
Google       37                  35
Yahoo        32                  25
Rediff       7                   4
Orkut        6                   7
More Info    India Online 2008   India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats          2009 Stats
Google       81                  76
Yahoo        7                   7
Wikipedia    3                   6
English      3                   4
More Info    India Online 2008   India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats          2009 Stats
Google       65                  34
Yahoo        12                  29
Rediff       4                   15
Teluguone    2                   0
Guruji       1                   18
Raftaar      1                   0.2
Hindi        1                   NA
Webdunia     1                   0.7
Khoj         0.7                 2
More Info    India Online 2008   India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem while dealing with Indian languages, owing to multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. But just saying that we have a recall problem may not be sufficient. The next obvious question is: how much of a problem is it? In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16 Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, grammatical and structural standards are rarely followed, for various reasons. One reason is the linguistic diversity of India: including Hindi, about 28 languages are spoken in the country, and Hindi, as the official language, is influenced by the regional languages. This results in dialectal variation not only in speech but also in writing.

Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses maatraas (vowel signs) and suffixes such as ओं and एँ. For example, योजनाओं (yojnaaon) and योजनाएँ (yojnaayein) are morphological variants of the root word योजना (yojnaa, "planning"). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
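The stemming step described above can be sketched as simple suffix stripping. This is only a minimal illustration: the rule list `SUFFIX_RULES` and the function `stem` are assumptions made for this example, not the stemmer used in the study, and a handful of plural suffixes cannot cover Hindi morphology in general.

```python
# Illustrative (suffix, replacement) rules, longest suffix first.
# The list is an assumption for this sketch, not a complete Hindi stemmer.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ाओं", "ा"),   # योजनाओं -> योजना
    ("ाएँ", "ा"),   # योजनाएँ -> योजना
    ("ों", ""),     # झीलों -> झील
]


def stem(word, min_stem_len=2):
    """Return a canonical form by applying the first matching suffix rule."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Refuse to strip when almost nothing of the word would remain.
            if len(candidate) >= min_stem_len:
                return candidate
    return word


# The root word and its two variants from the text collapse to one form.
variants = ["योजना", "योजनाओं", "योजनाएँ"]
print({stem(w) for w in variants})
```

Indexing and querying on `stem(word)` instead of the surface form would let a query in any of the three forms match documents containing the others, at the cost of occasional over-stemming.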

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

Spelling variations can be attributed mainly to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabets. Together, these factors produce several spellings of the same word; the Hindi word अंग्रेज़ी (angrējī, "English"), for example, has a number of possible spelling variations.

Numerous words are phonetically equivalent but differ in writing. The word "school" can be written in Hindi in several ways. When the standard keyword स्कूल and a non-standard phonetic equivalent are searched separately, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words: the English word "school" (स्कूल in Hindi) is pronounced and written as इस्कूल (iskool) by a large part of the population in different states of India, and more than two thousand results are found for इस्कूल. Search engines should therefore be capable of retrieving results for the phonetic equivalents of the keywords entered. A user may search with any of these spellings, and the search engine should support all of them.

Moreover, no standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved, with little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
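One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalised key. The sketch below is an assumption-laden illustration, not the method of this study: the function `phonetic_key` and its four rules (drop the nukta, merge chandrabindu with anusvara, write a half nasal consonant as anusvara, shorten long i and u vowel signs) are chosen for the example only.

```python
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
NASAL_CONSONANTS = "\u0919\u091e\u0923\u0928\u092e"  # nga, nya, nna, na, ma


def phonetic_key(word):
    """Map a Devanagari word to a rough key shared by common spelling variants."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    # A half nasal consonant (consonant + virama) is often written as anusvara.
    for nasal in NASAL_CONSONANTS:
        text = text.replace(nasal + VIRAMA, ANUSVARA)
    # Long and short i/u vowel signs are frequently interchanged.
    text = text.replace("\u0940", "\u093f").replace("\u0942", "\u0941")
    return text


spellings = ["आन्दोलन", "आंदोलन", "आँदोलन"]
print(len({phonetic_key(s) for s in spellings}))  # the three spellings share one key
```

These rules are deliberately coarse, so they can also merge words that are genuinely different. A variant such as इस्कूल, which adds a leading vowel, is not covered and would need an additional rule.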

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, ज़ेवर and अलंकार.
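Synonym handling can be sketched as query expansion: each keyword that has known synonyms yields additional query variants. The table `SYNONYMS`, the function `expand_query` and the sample query are hypothetical and exist only for this illustration; a real system would draw on a thesaurus or a Hindi WordNet.

```python
# Hypothetical synonym table for this sketch only.
SYNONYMS = {
    "आभूषण": ["गहना", "ज़ेवर", "अलंकार"],
}


def expand_query(query):
    """Return the original query followed by one variant per synonym substitution."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            variant = " ".join(words[:i] + [synonym] + words[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


for variant in expand_query("सोने के आभूषण"):
    print(variant)
```

The substitution is purely lexical: it does not inflect the synonym to agree with the rest of the query, so some variants may be ungrammatical even though they remain usable as keyword queries.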

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous because it also means "to sleep". In the context of the query सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous, and finding the correct sense of a word in a given context is an intricate task: a word has more than one meaning, and its meaning depends on the context of the sentence. For example, कर means "tax" in one context (with synonyms such as शुल्क, महसूल and टैक्स), "hand" or "arm" in another (with synonyms such as हस्त), and "to do" (करना) in a third.
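A very simple way to pick a sense is to count how many context cue words of each sense occur in the query. The sketch below is an illustration under stated assumptions: the sense inventory `SENSES`, its cue words and the function `disambiguate` are invented for this example and are not part of the study.

```python
# Hypothetical sense inventory: each sense of a keyword lists context cue words.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "चाँदी", "कीमत"},
        "to sleep": {"रात", "नींद", "बिस्तर", "जल्दी"},
    },
}


def disambiguate(keyword, query):
    """Return the sense whose cue words overlap most with the query.

    Returns None when the keyword is unknown, no cue word occurs,
    or two senses tie.
    """
    context = set(query.split()) - {keyword}
    scores = {
        sense: len(cues & context)
        for sense, cues in SENSES.get(keyword, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


print(disambiguate("सोना", "सोना चाँदी की कीमत"))
print(disambiguate("सोना", "रात को जल्दी सोना"))
print(disambiguate("सोना", "नारी को सोना पसंद है"))
```

The third call returns `None` because the query from the text contains no cue word for either sense, which is exactly the situation in which a search engine cannot tell the two readings apart.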

4.8.5 Influence of English on Hindi Information Retrieval

English has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by people with no formal education, without any awareness of which language the words come from. Most Indians use these words in English rather than in their native language.

English influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S.No | Root word(s) | Keywords listed (Google, Bing, Guruji) | Morphological variant | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | |
1.1 | | | ब यत वष मवन | 50500 | 4680 | 485
1.2 | | | ब यत मवष मवनो | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | | |
2.1 | | | द घमटन | 133000 | 2410 | 284
2.2 | | | द घमटन ओ | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | | | |
3.1 | | | ब ष | 161000 | 8330 | 961
3.2 | | | ब ष ए | 6200 | 935 | 188
3.3 | | | ब ष ओ | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | | | |
4.1 | | | झ र | 4740 | 278 | 25
4.2 | | | झ र | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | | | |
5.1 | | | आऩद | 102000 | 4030 | 410
5.2 | | | आऩद ए | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | | | |
6.1 | | | ऩ परज तत | 48200 | 1670 | 98
6.2 | | | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | | | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | | |
7.1 | | | सभसम | 584000 | 30200 | 1889
7.2 | | | सभसम ओ | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | | |
8.1 | | | कीटन शक | 36300 | 1360 | 333
8.2 | | | कीटन शको | 35800 | 800 | 270
9 | य ग | य गोय ग य ग य ग | | | |
9.1 | | | य ग | 205000 | 21600 | 1423
9.2 | | | य चगमो | 128000 | 3280 | 239
9.3 | | | य ग | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | | | |
10.1 | | | म जन | 673000 | 18500 | 3343
10.2 | | | म जन ओ | 669000 | 6020 | 990
10.3 | | | म जन ए | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | | | |
11.1 | | | क दर | 261000 | 11300 | 655
11.2 | | | क नदरो | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines

S.No | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (x-axis: query numbers 1.1 to 11.1; y-axis: precision from 0 to 1; series: Google, Bing, Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists the morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the tables above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which is why the number of documents they retrieve is smaller than for Google. Overall, the comparison of results from the three search engines shows that the quantity of results retrieved generally increases when the keywords are used in their root form.

For search engines, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. A close look at the results shows that the precision for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variations in the keywords affect not only the quantity but also the quality of results.
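The precision measure used here, precision over the top 10 results, is the number of relevant documents among the first 10 divided by 10. A minimal sketch follows; the function name `precision_at_k` and the sample relevance judgements are assumptions for illustration and are not data from the study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` holds one 0/1 judgement per ranked result. If fewer than
    k results were returned, the missing positions count as not relevant.
    """
    return sum(relevance[:k]) / k


# Hypothetical judgements for the first ten results of one query.
judged = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
print(precision_at_k(judged))
```

Dividing by `k` rather than by the number of results actually returned penalises queries that return fewer than ten documents, which suits a comparison in which some engines return very few results.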

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for the phonetic equivalents of the keywords entered. A user may search with any of these spellings, and the search engine should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. Table 4.20 shows the number of results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

No. | Hindi query (standard keywords) | Phonetic variations of the keywords | Keywords in the Google query | No. of results | Precision@10
1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | | ससजिमो जहयीर | 632 | 0.9
 | | | सिबजमो िहयीर | 194 | 0.9
2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | | आसभ न भहॉगाई | 1040 | 1.0
 | | | आसभ भह ग ई | 14 | 0.6
 | | | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | | भरशटाचायआज दी | 214 | 0.6
 | | | बयषटाचाय आज दी | 447 | 0.7
 | | | भरषटट च यआिादी | 1090000 | 0.9
 | | | भरशट च य आिादी | 1040 | 0.7
 | | | बयषटट च यआिादी | 1190 | 0.8
4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | | अनन हज य आॊदोरन | 85100 | 0.8
 | | | अनन हज य क आॉदोरन | 78 | 0.6
 | | | अनना हिाय क आनद रन | 399 | 0.5
 | | | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | | फयोिगायी | 80600 | 1.0
 | | | फयोिगायी | 170 | 0.7
 | | | फयोिगायी | 30 | 0.5

Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart for each of Query No. 1 to Query No. 5)

The table and figure above show that search engines return only a handful of documents for several of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved, with little overlap. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, ज़ेवर and अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S.No | Query | Standard Hindi word and synonyms | Google: documents returned | Precision@10 | Bing: documents returned | Precision@10 | Guruji: documents returned | Precision@10
1 | स न क आब षण | आब षणगहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines (x-axis: query numbers 1.1 to 6.3; y-axis: precision from 0 to 1; series: Google, Bing, Guruji)

From the examples above, it is observed that using the synonyms of Hindi keywords improves information retrieval for a Hindi query. Using synonyms affects not only the quantity of documents returned but also their quality.

The table and figure also show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It thus becomes clear that changing a keyword into its synonym yields equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context of the ambiguous keyword.


Table 4.22 List of randomly selected ambiguous queries

S.No | Query | Keyword in first context | In English | Keyword in second context | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | NA | Snake head | NA | NA
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. It is clear from the results that all search engines return documents in different contexts, so search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art" rather than the other context (snake charmer's snake head). However, the results show that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the ambiguity of keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word "exercise" is written in Hindi as एक्सरसाइज़, and this word has several phonetic variations in Hindi. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results (in the same order)
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

Table 4.26 shows that the search engine returns documents for a single-keyword query and also returns a large number of documents for every phonetic variant of the keyword. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are likewise retrieved for every phonetic variant of an English keyword written in Hindi script.

The first Hindi form in each row is the keyword obtained with the Google transliteration tool. The table shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University is यूनिवर्सिटी, whereas Google transliteration provides a form that is completely wrong. Nevertheless, 1540 documents were retrieved for this wrong keyword, and the same holds for other single-word queries such as Insurance, Parliament, Corruption, Policy and Specialist. This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi without changing the meaning of the query.

The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, in both cases without changing the meaning of the original Hindi query. For example, the English query "Foreign investment in India" can be written in Hindi as विदेशी निवेश भारत में, where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as follows:

Level 1: फॉरेन निवेश भारत में (foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (foreign investment Bharat mein)

The original Hindi query विदेशी निवेश भारत में (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
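The two-level transformation can be sketched as a dictionary lookup over the query words. This is a minimal illustration and not the software described in the next chapter: the mapping `ENGLISH_IN_HINDI` and the function `transform` are hypothetical, and choosing the first replaceable keyword for level 1 is an assumption made here.

```python
# Hypothetical mapping from Hindi keywords to English words written in Devanagari.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform(query):
    """Return the (level 1, level 2) variants of a Hindi query.

    Level 1 replaces only the first replaceable keyword; level 2 replaces
    every replaceable keyword. All other words and the word order are kept.
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in ENGLISH_IN_HINDI]
    level1 = list(words)
    if replaceable:
        first = replaceable[0]
        level1[first] = ENGLISH_IN_HINDI[words[first]]
    level2 = [ENGLISH_IN_HINDI.get(w, w) for w in words]
    return " ".join(level1), " ".join(level2)


level1, level2 = transform("विदेशी निवेश भारत में")
print(level1)
print(level2)
```

Each variant would then be submitted as a separate query alongside the original, so that documents written with either the Hindi or the English-derived keyword can be retrieved.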

Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi: tabular representation

In English | Hindi query | Level 1 | Level 2 | Google results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Queries transformed into two equivalent senses (one chart for each of Hindi Query 1 to Hindi Query 6, comparing the Hindi query, Level 1 and Level 2)

The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

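Table 4.28 Analysis of precision values: tabular representation

No. | Query form | Query | Precision@10: Google | Bing | AltaVista
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1
4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1
6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1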

Figure 4.5 Analysis of inter-precision values (HQ = Hindi query, L = level; series: Google, Bing, AltaVista)

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries increased the amount of data retrieved, and the relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant. A similar pattern is seen for most of the remaining Hindi queries in the table above.

Without transformation of the Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that account for English influence. The table suggests that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

The relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
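The idea of transforming a query at the interface level can be sketched as follows. This is only an illustration under stated assumptions: the lookup table `ENGLISH_IN_HINDI`, its entries, and the function `query_variants` are hypothetical and do not describe the software presented in the next chapter.

```python
from itertools import product

# Hypothetical lookup: Hindi keyword -> English equivalents written in Devanagari.
ENGLISH_IN_HINDI = {
    "स्वास्थ्य": ["हेल्थ"],
    "रक्तदान": ["ब्लड डोनेशन"],
}


def query_variants(query):
    """Return the original query followed by every keyword-substituted variant."""
    # For each word, offer the word itself plus any known English-in-Hindi forms.
    options = [[word] + ENGLISH_IN_HINDI.get(word, []) for word in query.split()]
    return [" ".join(choice) for choice in product(*options)]


variants = query_variants("स्वास्थ्य और रक्तदान")
for v in variants:
    print(v)
```

Each variant could then be submitted to a search engine in turn and the result lists merged, which keeps the transformation outside the engine itself.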

4.3.1 Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms that specify how a particular character is represented in bits. The three encoding forms allow the same data to be transmitted in a byte-, word-, or double-word-oriented format (i.e., in 8-, 16-, or 32-bit code units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
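As a small illustration of the three encoding forms, the sketch below encodes one ASCII letter, one Devanagari letter, and one character outside the 16-bit range, and reports the number of bytes each form needs. The sample characters and the helper name `encoded_sizes` are choices made for this example; the big-endian codec variants are used only so that no byte-order mark is counted.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

samples = {
    "LATIN CAPITAL LETTER A": "A",
    "DEVANAGARI LETTER KA": "\u0915",
    "GOTHIC LETTER AHSA": "\U00010330",  # needs a pair of 16-bit code units
}


def encoded_sizes(ch):
    """Return the byte length of one character in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {encoded_sizes(ch)}")
    # Each form decodes back to the same character, i.e. no data is lost.
    assert all(ch.encode(enc).decode(enc) == ch for enc in ENCODINGS)
```

The Devanagari letter takes three bytes in UTF-8 but a single 16-bit unit in UTF-16, which is one reason storage size for Hindi text depends on the chosen form.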

UTF stands for UCS Transformation Format.

4.3.2 UTF-8

UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers, and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme and range from 2 to 4 bytes in size under the current definition of UTF-8 (the original design allowed sequences of up to 6 bytes); the most commonly used characters are at most three bytes long. Non-ASCII characters are encoded as follows.

Each non-ASCII character is stored as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit cleared). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence carry the remaining bits of the actual character [61].
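The byte patterns described above can be inspected directly. The following sketch prints the UTF-8 bytes of a character in binary and reads the sequence length off the lead byte; the helper names `utf8_bits` and `sequence_length` are introduced here for illustration only.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of a character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


def sequence_length(lead_byte):
    """Infer the total length of a UTF-8 sequence from its first byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII, most significant bit clear
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")


ka = "\u0915"  # DEVANAGARI LETTER KA
print(utf8_bits("A"))  # ['01000001']
print(utf8_bits(ka))   # ['11100000', '10100100', '10010101']
print(sequence_length(ka.encode("utf-8")[0]))  # 3
```

Every byte of the Devanagari letter has its top bit set, while the ASCII letter keeps its single-byte value, which is the compatibility property the section describes.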

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the
Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages, another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script. Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi, and Tibetan; the southern branch includes scripts such as Malayalam and Tamil. The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar, and other scripts depart to a greater extent from the Devanagari structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
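The parallel code layout can be seen by shifting a Devanagari character to the same relative position in another script's block. The sketch below is an illustration only: the table `BLOCK_START` lists the starting code points of four Unicode blocks, and the function name `analogous_char` is an assumption of this example rather than a standard API.

```python
import unicodedata

# Starting code points of four Indic blocks; each block is 0x80 positions wide.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gujarati": 0x0A80,
    "Tamil": 0x0B80,
}


def analogous_char(ch, target_script):
    """Shift a Devanagari character to the same relative position in another block."""
    offset = ord(ch) - BLOCK_START["Devanagari"]
    if not 0 <= offset < 0x80:
        raise ValueError("not a character from the Devanagari block")
    return chr(BLOCK_START[target_script] + offset)


ka = "\u0915"  # DEVANAGARI LETTER KA
for script in ("Bengali", "Gujarati", "Tamil"):
    shifted = analogous_char(ka, script)
    print(script, f"U+{ord(shifted):04X}", unicodedata.name(shifted, "<unassigned>"))
```

Not every position is filled in every script (Tamil, for instance, has no letter at the position of Devanagari KHA), so a fixed offset is a rough aid to transliteration rather than a complete mapping.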

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991, Annex G: Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
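Logical order can be observed by listing the code points of a Hindi word. In the sketch below, the word हिन्दी is chosen as an example; the vowel sign I is stored after the consonant HA even though it is drawn to its left, and the virama between NA and DA marks the suppressed inherent vowel. The helper name `describe` is introduced for this illustration.

```python
import unicodedata

word = "\u0939\u093F\u0928\u094D\u0926\u0940"  # हिन्दी ("Hindi")


def describe(text):
    """Return (code point, general category, character name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]


for code, category, name in describe(word):
    print(code, category, name)
```

The stored sequence is consonant, vowel sign, consonant, virama, consonant, vowel sign, which follows pronunciation rather than visual layout; rendering software is responsible for reordering the glyphs.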

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu, and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging, and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo, and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software. In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint, and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,
Konkani, and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office, as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products. If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam, and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar, and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language: "Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate, and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google, and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles. What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life. But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery, and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British, and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus that contains approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in the translation of Bengali, Hindi, Punjabi, and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository for a Hindi word database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date from 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai, and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. TDIL started a consortium-mode project in April 2008 for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is
a rule-based system and has approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators, and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. To develop automated translation systems for regional languages, the Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati.

4.5.3 Anubharti

Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004, with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on the grammar rules of Panini's Ashtadhyayi. It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for the translation of government orders, notifications, circulars, and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available in all forms, such as desktop, network, and web-based. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially, it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums, and Circulars; gradually the domains were expanded. At present, it also covers domains such as Banking, Transportation, and Agriculture. Earlier, Mantra technology was only for English-to-Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali, and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB], and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility to the user: if the user is not satisfied with the translated sentence generated by the system, the user can provide feedback in the form of new words, phrases, and sentences and obtain a newly interpreted translation. The Shiva MT system is available at (http://ebmt.serc.iisc.ernet.in/mt/login.html).

Shakti is a system that combines a rule-based approach with a statistical approach. It is used for the translation of English to Indian languages (Hindi, Marathi, and Telugu). Users can access the Shakti MT system at (http://shakti.iiit.net) [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat/.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text for text manipulation. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre-Processing, Translation Engine, and Post-Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education, and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam, and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. It is based on a syllable-based conversion model. WikiBhasha is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasha is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization, and website localization. It is also involved in research and development for corpus creation and collection and for content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices, and the e-Governance program of the Government of Gujarat.

4.5.1.3 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). The Indian languages Transliteration scheme (ITRANS) was developed by Avinash Chopde; ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also making efforts towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.1.3.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Websites | 2008 Stats | 2009 Stats
Google   | 37         | 35
Yahoo    | 32         | 25
Rediff   | 7          | 4
Orkut    | 6          | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website   | 2008 Stats | 2009 Stats
Google    | 81         | 76
Yahoo     | 7          | 7
Wikipedia | 3          | 6
English   | 3          | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website   | 2008 Stats | 2009 Stats
Google    | 65         | 34
Yahoo     | 12         | 29
Rediff    | 4          | 15
Teluguone | 2          | 0
Guruji    | 1          | 18
Raftaar   | 1          | 0.2
Hindi     | 1          | NA
Webdunia  | 1          | 0.7
Khoj      | 0.7        | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query "world trade center aatank-waadi hamlaa" ("वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला") is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. Merely saying that we have a recall problem may not be sufficient. The next obvious question is "how much of a problem is it?" In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.


Table 4.15 Problems faced while searching in Hindi: Low recall

Hindi Query | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला | 0 | 0 | 0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च | 0 | 0 | 0

Table 4.16 Improved Recall

Hindi Query | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला | 8820 | 92 | 12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक | 331 | 10 | 1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च | 708 | 50 | 1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध | 37100 | 7400 | 93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are seldom followed, for various reasons. One of the reasons is the language diversity in India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the national language of India, is influenced by the regional languages, which results in dialectal change not only in speaking but also in writing. Every language uses markers (English uses s, es, ing; Hindi uses MAATRAAS such as ऐ, य, ा, ओ) that are attached to a root word to construct new words. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
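The stemming idea described above can be sketched as a simple suffix stripper. This is only a minimal illustration, assuming a small hand-picked list of plural and oblique endings; the names `SUFFIXES` and `stem` are hypothetical, and a practical Hindi stemmer would need a far larger rule set and exception handling.

```python
# Illustrative suffix list (an assumption, not a complete Hindi grammar).
SUFFIXES = ("ओं", "एं", "एँ", "ों", "ें")


def stem(word: str, min_stem_len: int = 2) -> str:
    """Strip one known suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of surface forms that reduce to it."""
    groups = {}
    for w in words:
        groups.setdefault(stem(w), []).append(w)
    return groups


if __name__ == "__main__":
    for stem_form, variants in group_by_stem(["योजना", "योजनाओं", "योजनाएं"]).items():
        print(stem_form, variants)
```

With such a mapping, the three variants of Yojnaa collapse into one index entry, so a query in any of the forms can be matched against documents containing the others.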

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning "English") has several possible spelling variations.

There are numerous words which are phonetically equivalent but vary in writing. The word "school" can be written in Hindi in several ways (स्कूल and its non-standard spellings). When information is searched for the standard keyword स्कूल and for a non-standard phonetically equivalent keyword, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety: for example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by much of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with little repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
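One way to let a search system treat phonetically equivalent spellings alike is to reduce each keyword to a normalised matching key. The following is a minimal sketch, assuming only three common sources of variation (nukta, chandrabindu versus anusvara, and a nasal consonant with halant versus anusvara); the function name `phonetic_key` is hypothetical, and the key is meant for matching only, not for display, since it deliberately over-normalises some words.

```python
import re
import unicodedata

NUKTA = "\u093C"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094D"

# A nasal consonant + virama directly before another consonant
# (e.g. the conjunct in "hindi" written with NA + VIRAMA).
NASAL_CONJUNCT = re.compile(
    "[\u0919\u091E\u0923\u0928\u092E]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word: str) -> str:
    """Return a normalised key so that common spelling variants compare equal."""
    s = unicodedata.normalize("NFD", word)   # splits precomposed nukta letters
    s = s.replace(NUKTA, "")                 # ignore the nukta dot
    s = s.replace(CHANDRABINDU, ANUSVARA)    # treat chandrabindu as anusvara
    s = NASAL_CONJUNCT.sub(ANUSVARA, s)      # nasal + halant becomes anusvara
    return s


if __name__ == "__main__":
    a = "\u0939\u093F\u0928\u094D\u0926\u0940"  # hindi written with NA + VIRAMA
    b = "\u0939\u093F\u0902\u0926\u0940"        # hindi written with anusvara
    print(phonetic_key(a) == phonetic_key(b))
```

Indexing documents under such keys, and applying the same function to the query, would let either spelling retrieve documents written with the other. Other kinds of variation mentioned above, such as an added initial vowel in ISKOOL, are not covered by this sketch.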

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous words

Ambiguous words reduce the relevancy of the results, as the examples below show. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. "to sleep". In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कर means "tax" in one context, with synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स; "hand or arm" in another context, with synonyms हस्त, बाहु, आच शफय; and "to do" (करना) in yet another context.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speaking but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from the set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No | Query in Hindi | Meaning in English
1    | भारत वर्षावन | Indian rain forest
1.1  | भारतीय वर्षावनों | Indian rain forests
2    | हवाई दुर्घटना के कारण | Reason for air crash
2.1  | हवाई दुर्घटनाओं के कारण | Reason for air crashes
3    | भारत में बोली जाने वाली भाषा | Language spoken in India
3.1  | भारत में बोली जाने वाली भाषाएं | Languages spoken in India
3.2  | भारत में बोली जाने वाली भाषाओं | Languages spoken in India
4    | विलुप्त होने पर झील | Lake on the verge of extinction
4.1  | विलुप्त होने पर झीलें | Lakes on the verge of extinction
5    | प्राकृतिक आपदा | Natural calamity
5.1  | प्राकृतिक आपदाएं | Natural calamities
6    | पक्षी की प्रजाति | Bird species
6.1  | पक्षियों की प्रजाति | Birds' species
6.2  | पक्षियों की प्रजातियां | Birds' species
7    | कृषि समस्या | Agriculture problem
7.1  | कृषि समस्याओं | Agriculture problems
8    | कीटनाशक का इस्तेमाल | Use of pesticide
8.1  | कीटनाशकों का इस्तेमाल | Use of pesticides
9    | मानसिक रोग | Mental illness
9.1  | मानसिक रोगियों | Mental patients
9.2  | मानसिक रोगी | Mental patient
10   | ग्रामीण विकास योजना | Policy for village
10.1 | ग्रामीण विकास योजनाओं | Policies for village
10.2 | ग्रामीण विकास योजनाएं | Policies for village
11   | प्रमुख कृषि केंद्र | Major agricultural office
11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

(a) Root words and listing of keywords

S.No | Root word(s) | Listing of keywords (Google, Bing, Guruji)
1  | भारत, वर्षावन | भारत वर्षावन, वर्षावन वन, वर्षावन वन, भारत वर्षावन, वन
2  | दुर्घटना | दुर्घटना, दुर्घटनाओं, दुर्घटना, दुर्घटना
3  | भाषा | भाषा, भाषाओं, भाषा, भाषा
4  | झील | झील, झीलों, झील, झील
5  | आपदा | आपदा, आपदाओं, आपदा, आपदा
6  | पक्षी, प्रजाति | पक्षी, पक्षियों, प्रजाति, प्रजातियों, प्रजाति, पक्षी प्रजाति, पक्षी प्रजाति
7  | समस्या | समस्याएं, समस्याओं, समस्या, समस्या, समस्या
8  | कीटनाशक | कीटनाशक, कीटनाशकों, कीटनाशक, कीटनाशक
9  | रोग | रोग, रोगों, रोगी, रोग, रोग
10 | योजना | योजनाओं, योजना, योजना, योजना
11 | केंद्र | केंद्र, केंद्रीय, केंद्र, केंद्र, केंद्र

(b) Documents returned for each morphological variant

S.No | Morphological variant | Google | Bing | Guruji
1.1  | भारत वर्षावन | 50500 | 4680 | 485
1.2  | भारतीय वर्षावनों | 40400 | 680 | 61
2.1  | दुर्घटना | 133000 | 2410 | 284
2.2  | दुर्घटनाओं | 117000 | 420 | 23
3.1  | भाषा | 161000 | 8330 | 961
3.2  | भाषाएं | 6200 | 935 | 188
3.3  | भाषाओं | 6090 | 441 | 356
4.1  | झील | 4740 | 278 | 25
4.2  | झीलें | 1270 | 28 | 1
5.1  | आपदा | 102000 | 4030 | 410
5.2  | आपदाएं | 1160 | 64 | 20
6.1  | पक्षी प्रजाति | 48200 | 1670 | 98
6.2  | पक्षियों प्रजाति | 47600 | 1150 | 84
6.3  | पक्षियों प्रजातियां | 33800 | 747 | 25
7.1  | समस्या | 584000 | 30200 | 1889
7.2  | समस्याओं | 584000 | 7150 | 1356
8.1  | कीटनाशक | 36300 | 1360 | 333
8.2  | कीटनाशकों | 35800 | 800 | 270
9.1  | रोग | 205000 | 21600 | 1423
9.2  | रोगियों | 128000 | 3280 | 239
9.3  | रोगी | 112000 | 6280 | 647
10.1 | योजना | 673000 | 18500 | 3343
10.2 | योजनाओं | 669000 | 6020 | 990
10.3 | योजनाएं | 673000 | 2860 | 416
11.1 | केंद्र | 261000 | 11300 | 655
11.2 | केन्द्रों | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines (Precision@10)

S.No | Query | Google | Bing | Guruji
1.1  | भारत वर्षावन | 0.5 | 0.3 | 0.1
1.2  | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1
2.1  | हवाई दुर्घटना के कारण | 0.5 | 0.3 | 0.1
2.2  | हवाई दुर्घटनाओं के कारण | 0.3 | 0.3 | 0.1
3.1  | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5
3.2  | भारत में बोली जाने वाली भाषाएं | 0.5 | 0.4 | 0.3
3.3  | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2
4.1  | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2
4.2  | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0
5.1  | प्राकृतिक आपदा | 1 | 0.4 | 0.3
5.2  | प्राकृतिक आपदाएं | 0.6 | 0.6 | 0.1
6.1  | पक्षी की प्रजाति | 1 | 0.7 | 0.2
6.2  | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1
6.3  | पक्षियों की प्रजातियां | 0.9 | 0.6 | 0.1
7.1  | कृषि समस्या | 0.7 | 0.3 | 0.2
7.2  | कृषि समस्याओं | 0.7 | 0.4 | 0.2
8.1  | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2
8.2  | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2
9.1  | मानसिक रोग | 0.7 | 0.6 | 0.4
9.2  | मानसिक रोगियों | 0.9 | 0.6 | 0.3
9.3  | मानसिक रोगी | 0.9 | 0.5 | 0.4
10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3
10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0
10.3 | ग्रामीण विकास योजनाएं | 0.8 | 0.6 | 0.2
11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0
11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (precision from 0 to 1 against query number; series: P Google, P Bing, P Guruji)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

From the above results it appears that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why the number of documents they retrieve is smaller than for Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high in almost all queries. Since, as mentioned above, Google does its indexing in the root form of keywords, it can be said that the relevancy of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
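The precision figures discussed above are computed over the top 10 results. As a minimal sketch of that calculation, assuming each result has already been judged relevant or not by a human assessor (the function name `precision_at_k` and the sample judgments are illustrative, not taken from the experiments):

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of booleans (or 0/1 values), one per returned
    result, in ranked order. If fewer than k results are returned, the
    missing positions count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


if __name__ == "__main__":
    # Hypothetical judgments for one query: 5 of the top 10 are relevant.
    judgments = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned means that an engine returning only one or two documents is not rewarded with an artificially high score.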

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

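Table 4.20 Results of the search engine on the phonetic nature of Hindi language

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो, सबज मो, जहयीर, जहरयर

Google query (keywords) | No. of results | Precision@10
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ, भह ग ई, भ ह ग ई

Google query (keywords) | No. of results | Precision@10
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य, बयषटट च य, आज दी

Google query (keywords) | No. of results | Precision@10
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य, आ द रन, आ द रन

Google query (keywords) | No. of results | Precision@10
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी, फय जग यी, फ य जा यी

Google query (keywords) | No. of results | Precision@10
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5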

Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, with one segment per phonetic variant of the query)

In the above table and figure it can be clearly seen that search engines return a handful of documents for various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with little repetition. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S.No | Query (synonym variant) | Google documents | Google Precision@10 | Bing documents | Bing Precision@10 | Guruji documents | Guruji Precision@10
1    | Query: सोने के आभूषण; standard Hindi word: आभूषण
1.1  | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2  | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3  | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4  | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5  | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2    | Query: काले बादल; standard Hindi word: बादल
2.1  | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2  | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3  | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3    | Query: स्त्री सशक्तिकरण; standard Hindi words: स्त्री, नारी
3.1  | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2  | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3  | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4  | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4    | Query: सिकंदर का अहंकार; standard Hindi word: अहंकार
4.1  | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2  | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3  | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5    | Query: वृक्ष लगाओ; standard Hindi word: वृक्ष
5.1  | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2  | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3  | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6    | Query: आंख दान; standard Hindi word: आंख
6.1  | आंख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2  | नेत्र दान | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3  | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines (precision from 0 to 1 against query numbers 1.1 to 6.3; series: Google, Bing, Guruji)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines, and that the Guruji search engine returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji, and on average Google still stands first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Synonyms of keywords therefore play an important role in the Hindi information retrieval process.
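The role of synonyms can be illustrated with a small query-expansion sketch. It assumes a hand-built synonym dictionary (here only the ornament example from this chapter) and performs plain token substitution; the names `SYNONYMS` and `expand_query` are hypothetical, and the sketch ignores inflection and word-sense ambiguity.

```python
# Hypothetical synonym dictionary; a real system would draw on a lexical
# resource such as a WordNet for Hindi.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार", "आभरण"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

Each variant could then be submitted separately and the result lists merged, which is one way to recover documents that use a different but equivalent word.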

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present below five randomly selected ambiguous queries. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the query in English with respect to the ambiguous keyword in each context.


Table 4.22 List of randomly selected ambiguous queries

S.No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the tables, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keywords in some other, non-required context. From the results it is clear that all the search engines return documents in different contexts. It can therefore be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

In another query, सपेरों का फन (art of snake charmers; the other context being "snake charmer's snake head"), the retrieved documents are expected to be in the context "art". However, from the results above it can be seen that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9 such documents, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
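The counts in Tables 4.23 to 4.25 can be summarised as the share of judged top-10 results that fall in the intended sense of the ambiguous keyword. The sketch below only re-tabulates the figures reported above; the tuple layout (intended sense, other sense, other context) and the function name `intended_sense_share` are assumptions made for illustration, and the query for which Guruji returned no results is skipped.

```python
# (intended sense, other sense, other context) per query, from Tables 4.23-4.25.
RESULTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_sense_share(rows):
    """Share of judged results that match the intended sense of the keyword."""
    scored = [row for row in rows if row is not None]
    judged = sum(sum(row) for row in scored)
    if judged == 0:
        return 0.0
    return sum(row[0] for row in scored) / judged


if __name__ == "__main__":
    for engine, rows in RESULTS.items():
        print(engine, round(intended_sense_share(rows), 2))
```

A low share for every engine is consistent with the observation that none of them distinguishes between the contexts of an ambiguous keyword.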


4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speaking but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word "exercise" is written in Hindi as एक्सरसाइज. The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (same order as the four Hindi forms)
Woman | वोभन | व भन | व भ न | व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609, 395000, 69700, 289000

Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine does return documents for single-keyword queries; documents are also returned for all phonetic variants of the keywords, and in large numbers. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the Google transliteration column (bold Hindi entries in the original table) shows the keywords obtained by using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be यूनिवर्सिटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table above that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved in Hindi IR. The Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:

The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.


Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi Query | Influenced Hindi Query, Level 1 | Influenced Hindi Query, Level 2 | Google Search Results (Hindi Query) | Google Search Results (Level 1) | Google Search Results (Level 2)

Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020

Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150

Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770

Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93

Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66

Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000

[Figure 4.4: Transformed queries into two equivalent senses. Chart of the result counts for Hindi Queries 1 to 6, each shown for the original Hindi query, Level 1, and Level 2.]
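The two-level transformation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software developed in this work: the function name `transform_query` and the two-entry lexicon `LOANWORDS` are hypothetical, and a usable system would need a much larger, curated dictionary of English equivalents written in Devanagari.

```python
from itertools import combinations

# Hypothetical lexicon: Hindi keyword -> English loanword written in Devanagari.
# The two entries mirror the "Foreign investment in India" example only.
LOANWORDS = {
    "विदेशी": "फॉरेन",        # foreign
    "निवेश": "इन्वेस्टमेंट",    # investment
}


def transform_query(query, lexicon=LOANWORDS):
    """Return (level1, level2) lists of query variants.

    Level 1 replaces exactly one Hindi keyword with its English equivalent;
    Level 2 replaces two or more. Word order and other words are unchanged.
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in lexicon]
    level1, level2 = [], []
    for n in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, n):
            variant = [lexicon[w] if i in positions else w
                       for i, w in enumerate(words)]
            (level1 if n == 1 else level2).append(" ".join(variant))
    return level1, level2


level1, level2 = transform_query("विदेशी निवेश भारत में")
for variant in level1 + level2:
    print(variant)
```

Each variant can then be submitted to a search engine in addition to the original query. Enumerating every combination grows quickly with the number of replaceable keywords, so a practical interface would probably cap the number of variants.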


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity; therefore, Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, are used for retrieving web results.

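Table 4.28: Analysis of precision values (tabular representation). HQ denotes the original Hindi query; L1 and L2 denote the Level 1 and Level 2 influenced Hindi queries. Values are Precision@10.

Query | Form | Google | Bing | AltaVista

स्वास्थ्य और रक्तदान | HQ1 | 0.9 | 0.9 | 0.9
हेल्थ और रक्तदान | L1 | 0.9 | 0.8 | 0.9
हेल्थ और ब्लड_डोनेशन | L2 | 0.9 | 0.8 | 0.8

रक्तचाप का इलाज | HQ2 | 0.8 | 0.8 | 0.8
ब्लडप्रेशर का इलाज | L1 | 0.9 | 0.7 | 0.7
ब्लडप्रेशर का ट्रीटमेंट | L2 | 0.7 | 0.5 | 0.5

हृदय चिकित्सक | HQ3 | 0.9 | 0.9 | 0.9
हार्ट चिकित्सक | L1 | 0.8 | 0.7 | 0.7
कार्डियोलॉजिस्ट | L2 | 1 | 1 | 1

सरकार द्वारा रोजगार योजना | HQ4 | 1 | 0.8 | 0.6
सरकार द्वारा रोजगार स्कीम | L1 | 0.9 | 0.8 | 0.7
सरकार द्वारा एम्प्लॉयमेंट स्कीम | L2 | 0.8 | 0.8 | 0.8

विदेशी निवेश भारत में | HQ5 | 0.9 | 0.9 | 0.9
फॉरेन निवेश भारत में | L1 | 0.8 | 0.8 | 0.9
फॉरेन इन्वेस्टमेंट भारत में | L2 | 0.8 | 0.4 | 0.4

भ्रष्टाचार मुक्त भारत | HQ6 | 1 | 1 | 1
करप्शन मुक्त भारत | L1 | 1 | 1 | 1
करप्शन फ्री इंडिया | L2 | 1 | 1 | 1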
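The "Precision 10" values in the table are read here as precision at rank 10, that is, the fraction of the first ten results judged relevant. A minimal sketch of that computation follows; the function name `precision_at_k` and the representation of judgments as a list of 0/1 values are assumptions made for illustration, not part of the original study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that were judged relevant.

    `relevance` holds one truthy/falsy judgment per result, in rank order.
    If fewer than k results were returned, the missing ones count as
    non-relevant, so the denominator is always k.
    """
    return sum(1 for judged in relevance[:k] if judged) / k


# Nine of the first ten documents judged relevant.
judgments = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned penalizes queries that retrieve fewer than ten documents, which seems appropriate when comparing query variants with very different result counts.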


Figure 4.5: Analysis of inter-precision values

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data, and the relevance of the retrieved data can be seen in the precision columns. For the Hindi queries and their transformed variations, the degree of relevance of the documents is generally very close, equal, or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; a similar pattern holds for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate a query variation that reflects English influence. The table above indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

[Inter-precision chart: precision values for HQ1 to HQ6 and their L1 and L2 variants, where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista. The plotted values are those listed in Table 4.28.]


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


UTF stands for UCS Transformation Format.

4.3.2 UTF-8

UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers, and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.

Non-ASCII characters are encoded using a variable-length scheme. The original design allowed sequences of 2 to 6 bytes, although UTF-8 as currently specified (RFC 3629) is limited to 4 bytes; the most commonly used characters, including those of Devanagari, are at most three bytes long. Non-ASCII characters are encoded as follows.

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.

The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. The remaining bits of the first byte, together with all further bytes in the multibyte sequence, encode the actual character [61].
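The byte layout described above can be checked directly in Python. The sketch below is illustrative only; the helper name `utf8_sequence_length` is an assumption introduced here, and it follows the four-byte limit of current UTF-8 rather than the older six-byte design.

```python
def utf8_sequence_length(first_byte):
    """Total length of a UTF-8 sequence, read from its leading byte."""
    if first_byte < 0x80:             # 0xxxxxxx: ASCII, single byte
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid leading byte")


for ch in ("A", "ह"):                 # Latin A and Devanagari letter HA
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X}", hex_bytes, utf8_sequence_length(encoded[0]))
    if len(encoded) > 1:
        # Every byte of a non-ASCII character has its top bit set.
        assert all(b & 0x80 for b in encoded)
```

The Devanagari letter ह (U+0939) is encoded as the three bytes E0 A4 B9, while the Latin letter A remains the single byte 41, which is why an ASCII-only document is already valid UTF-8.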

4.3.3 Unicode and Devanagari

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages, another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi, and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.


4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar, and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
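The parallel code layout can be illustrated by shifting code points from one block to the corresponding position in another. The sketch below is an assumption-laden illustration, not a proper transliterator: the function name `shift_script` and the small `BLOCK_START` table are introduced here, and a plain offset ignores script-specific spelling conventions and positions that are unassigned in the target block.

```python
import unicodedata

# Start of each 128-code-point block in the Unicode Standard.
BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "gujarati": 0x0A80}
BLOCK_SIZE = 0x80


def shift_script(text, source="devanagari", target="bengali"):
    """Map each character to the analogous position in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # Keep the original character if the target position is unassigned.
            if unicodedata.name(candidate, ""):
                ch = candidate
        out.append(ch)
    return "".join(out)


print(shift_script("भारत"))                      # Devanagari -> Bengali
print(shift_script("भारत", target="gujarati"))   # Devanagari -> Gujarati
```

For the word भारत the offset alone yields the Bengali and Gujarati spellings, because the consonants and vowel sign involved occupy the same relative positions in all three blocks.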

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
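Logical order can be observed by listing the code points of a short syllable. The following sketch uses only Python's standard `unicodedata` module; the helper name `describe` is introduced here purely for illustration.

```python
import unicodedata


def describe(text):
    """List the code points of `text` in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
            for ch in text]


# "कि": consonant KA followed by the dependent vowel sign I. The sign is
# drawn to the left of the consonant but stored after it.
for row in describe("\u0915\u093F"):
    print(row)

# "क्ष": KA + VIRAMA + SSA. The virama suppresses the inherent vowel of KA,
# and a renderer typically displays the three code points as one conjunct.
for row in describe("\u0915\u094D\u0937"):
    print(row)
```

Because storage follows pronunciation rather than visual layout, string operations such as searching or sorting work on the phonetic sequence, and the reordering of the vowel sign is left to the rendering engine.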

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu, and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging, and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo, and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, Input Method Editor facilities give users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate more than 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font became available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint, and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support for 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani, and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam, and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar, and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language:

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings, like Gmail, Search, Blogger, Translate, and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google, and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good-quality content in Indian languages was not easily available, they did not visit many local-language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery, and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British, and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus which contains approximately 96,157,000 words, and a parallel corpus consisting of 200,000 words of text in English, which helps in translation into Bengali, Hindi, Punjabi, and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository of a Hindi word database and for dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been noted since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai, and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. TDIL has run a consortium-mode project since April 2008 for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system and has approximately 1750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators, and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars, and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network, and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums, and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation, and Agriculture. Earlier, Mantra technology supported only English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali, and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB], and Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government, and it is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases, and sentences as feedback and obtain a newly translated sentence. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for the translation of English to Indian languages (Hindi, Marathi, and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine, and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Guruji: Indian Language Search Engine

Guruji.com is the first Indian-language search engine. It was founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian-language content and tries to return the most appropriate results. The Guruji search engine covers a range of content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English into Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu, and is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages: the source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that supports the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication, and it offers language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian-language software. It provides Indian-language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gujarati package for the Gujarati language, which is used in the DTP sector, in corporate offices and in the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. As a result, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian-language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communications and Information Technology is also working towards the proliferation of language technology in India. Other government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved, directly or indirectly, in research and development of language technology. These agencies help to develop important areas of research and provide research funds to development agencies. One outcome is IndoWordNet, which was developed for Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for it. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, probably more than 10-15 Indian local search engines were launched, some small and some large, some with substantial funding and some with none. The space is now so crowded that it is difficult to tell who is really winning. We nevertheless attempt a brief overview of the current scenario.

Search engines that fall into the localized Indian category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which was launched only recently. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by JuxtConsult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, using present-day popular techniques, shows a severe problem of low recall when information is accessed with Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 results for Indian-language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query "world trade center aatank-waadi hamlaa" (World Trade Center terrorist attack) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. Merely stating that there is a recall problem is not sufficient; the obvious next question is how much of a problem it is. In other words, the problem needs to be quantified. For this purpose we conducted a number of experiments to determine at what levels the recall problem occurs, and to what extent.

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16 Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, grammatical and structural standards are followed loosely, for various reasons. One reason is the linguistic diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages. This produces dialectal variation not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, while Hindi uses suffixes and maatraas such as एँ and ओं. For example, योजनाओं (yojnaaon) and योजनाएँ (yojnaayein) are morphological variants of the root word योजना (yojnaa, planning in English). It is desirable to conflate all morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
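The conflation of morphological variants into a root word can be illustrated with a light suffix-stripping stemmer. The Python sketch below is illustrative only: the rule list `RULES` and the function name `light_stem` are assumptions made for this example, not the stemmer used in this study, and a realistic Hindi stemmer would need a much larger suffix inventory and exception handling.

```python
from typing import List, Tuple

# Hypothetical (suffix, replacement) rules, checked in order.
# Longer suffixes come first so that they win over shorter ones.
RULES: List[Tuple[str, str]] = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ओं", ""),     # योजनाओं -> योजना
    ("एँ", ""),     # योजनाएँ -> योजना
    ("एं", ""),     # spelling variant of एँ written with anusvara
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),     # बातें -> बात
]


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip one plural/oblique suffix, if the remaining stem is long enough."""
    for suffix, replacement in RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_len:
            return word[:remaining] + replacement
    return word  # no rule applied: the word is treated as its own root


if __name__ == "__main__":
    for w in ["योजना", "योजनाओं", "योजनाएँ", "झीलों", "रोगियों"]:
        print(w, "->", light_stem(w))
```

With such a function, a query and the indexed documents could both be reduced to root forms before matching, so that योजनाओं and योजनाएँ retrieve the same documents as योजना. The `min_stem_len` guard is a simple safeguard against stripping very short words such as में.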

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabet. Together, these factors produce multiple spellings of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning English) has several possible spelling variations.

There are numerous words that are phonetically equivalent but written differently. The word school, for instance, can be written in Hindi in several ways. When a search is made for the standard keyword स्कूल (school) and for a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words: the English word school (स्कूल in Hindi) is pronounced and written as इस्कूल (ISKOOL) by the majority of the population in different states of India. For the word इस्कूल, more than two thousand results are found. Search engines should be able to retrieve results for phonetically equivalent forms of the keywords entered: a user may type any of these forms, and the search engine should support all of them.

Moreover, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved, with little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
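One way to make a search engine tolerant of such spelling variants is to map every keyword to a normalized key before matching. The following Python sketch is a minimal illustration under stated assumptions: the function name `phonetic_key` and the three normalization rules (dropping the nukta, treating chandrabindu as anusvara, and rewriting a nasal consonant plus halant before a consonant as anusvara) are chosen for this example and are not taken from the study. The rules are deliberately aggressive and will merge some words that are genuinely different.

```python
import re
import unicodedata

NUKTA = "\u093C"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094D"        # halant ्

# A nasal consonant + halant directly before another consonant is often
# written as an anusvara instead (आन्दोलन / आंदोलन).
_NASAL_CLUSTER = re.compile("[ङञणनम]" + VIRAMA + "(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Return a normalized form shared by common spelling variants."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = _NASAL_CLUSTER.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
    print({phonetic_key(v) for v in variants})  # a single shared key
```

If both the query keywords and the indexed terms are passed through the same key function, the three spellings of "andolan" above are matched as one term.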

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly used synonyms: गहना, जेवर and अलंकार.

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, because it also means "to sleep". In the context of the query, सोना means gold, but the query can equally be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common", but in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task, because one word can have more than one meaning and the meaning depends on the context of the sentence. For example, कर in the sense of "tax" has synonyms such as ब्याज, शुल्क, सूद, महसूल and टैक्स; in another context कर means "hand" or "arm", with synonyms such as हस्त and बाहु; and in yet another context it is a form of the verb "to do" (करना).

4.8.5 Influence of English on Hindi Information Retrieval

English has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated speakers, without awareness of which language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists queries randomly selected from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

For each root word, the unnumbered line gives the root word and the keywords listed by the search engines (morphological variants, in the column order Google, Bing, Guruji). The numbered lines give the documents returned for each keyword.

S.No | Keyword | Documents returned: Google | Bing | Guruji
1 | Root word: ब यत वष मवन. Keywords listed: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 | Root word: द घमटन. Keywords listed: द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3 | Root word: ब ष. Keywords listed: ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4 | Root word: झ र. Keywords listed: झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5 | Root word: आऩद. Keywords listed: आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6 | Root word: ऩ परज तत. Keywords listed: ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | Root word: सभसम. Keywords listed: सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8 | Root word: कीटन शक. Keywords listed: कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9 | Root word: य ग. Keywords listed: य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10 | Root word: म जन. Keywords listed: म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11 | Root word: क दर. Keywords listed: क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines

S.No | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (precision at 10, on a scale of 0 to 1, for Google, Bing and Guruji, plotted per query number)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents by root word because, in general, better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root word, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

The results above indicate that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. Precision is calculated over the top 10 results returned by each search engine. On close observation, the precision of Google is high for almost all queries. Since, as noted above, Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
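The precision measure used here, computed over the top 10 results, can be written as a short function. The Python sketch below assumes that each result has already been judged relevant or not relevant by hand; the function name `precision_at_k` and the sample judgments are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results that were judged relevant.

    `relevance` holds one manual judgment per result, in ranked order.
    If fewer than k results were returned, the missing positions count
    as non-relevant, because the divisor stays k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


if __name__ == "__main__":
    # Hypothetical judgments for one query: 5 of the top 10 are relevant.
    judgments = [True, True, False, True, False,
                 False, True, False, True, False]
    print(precision_at_k(judgments))
```

Under this definition, a precision value of 0.5 in Table 4.19 means that 5 of the first 10 results were judged relevant to the query.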

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be able to retrieve results for phonetically equivalent forms of the keywords entered. A user may type any of these forms, and the search engine should support all of them.

The following queries were randomly selected from the set of 50 queries tested on the Google search engine. The table and figure below show the results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

Query No. | Hindi query with standard keywords | Phonetic variations of the keywords | Keywords in Google query | No. of results | Precision@10
1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
  |  |  | ससजिमो जहयीर | 632 | 0.9
  |  |  | सिबजमो िहयीर | 194 | 0.9
2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
  |  |  | आसभ न भहॉगाई | 1040 | 1.0
  |  |  | आसभ भह ग ई | 14 | 0.6
  |  |  | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
  |  |  | भरशटाचायआज दी | 214 | 0.6
  |  |  | बयषटाचाय आज दी | 447 | 0.7
  |  |  | भरषटट च यआिादी | 1090000 | 0.9
  |  |  | भरशट च य आिादी | 1040 | 0.7
  |  |  | बयषटट च यआिादी | 1190 | 0.8
4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
  |  |  | अनन हज य आॊदोरन | 85100 | 0.8
  |  |  | अनन हज य क आॉदोरन | 78 | 0.6
  |  |  | अनना हिाय क आनद रन | 399 | 0.5
  |  |  | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
  |  |  | फयोिगायी | 80600 | 1.0
  |  |  | फयोिगायी | 170 | 0.7
  |  |  | फयोिगायी | 30 | 0.5

Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

The table and figure above show that search engines return documents for each of the phonetically equivalent Hindi queries, in widely varying numbers. No particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved, with little overlap. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly used synonyms: गहना, जेवर and अलंकार.

Table 4.21 and Figure 4.3 compare the precision values of the three search engines.

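Table 4.21 Effect of word synonyms on Hindi IR

For each query group, the unnumbered line gives the query and the standard Hindi word whose synonyms are substituted. The numbered lines give the documents returned and the precision at 10 for each search engine.

S.No | Query | Google: documents | Precision@10 | Bing: documents | Precision@10 | Guruji: documents | Precision@10
1 | Query: स न क आब षण. Standard word: आब षण
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | Query: क र फ दर. Standard word: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | Query: सतर सशिकतकयण. Standard word: सतर
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | Query: मसक दयक अ हक य. Standard word: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | Query: व रग ओ. Standard word: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | Query: आ खद न. Standard word: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1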

Figure 4.3 Comparison of precision values against three search engines (precision at 10, on a scale of 0 to 1, for Google, Bing and Guruji, plotted per query number)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google ranks first: its precision values are higher than those of Bing and Guruji. It is thus clear that equivalent results can be obtained by replacing a keyword with its synonym. Synonyms of keywords therefore play an important role in Hindi information retrieval.
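The role of synonyms can be illustrated by a simple query-expansion step that submits one variant of the query per synonym. The Python sketch below is a minimal illustration: the synonym table `SYNONYMS` is a hypothetical, hand-written dictionary that mirrors the ornament example in the text, and the function `expand_query` is an assumption made for this example, not the software developed in this study. It substitutes whole tokens only and does not adjust inflection, so some generated variants may be ungrammatical.

```python
from typing import Dict, List

# Hypothetical synonym table; a real system would load a lexical
# resource such as a WordNet for Hindi instead.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str,
                 synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query plus one variant per synonym substitution."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variant = " ".join(tokens[:i] + [alternative] + tokens[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

The results of all variants could then be merged and re-ranked, which is one way to recover documents that use a synonym instead of the keyword typed by the user.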

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective sense of the ambiguous keyword.


Table 4.22 List of randomly selected ambiguous queries

S.No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | NA | Snake head | NA | NA
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. Since all three search engines return documents in different contexts, it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unwanted context "snake head" and Bing returns 9, whereas Guruji fails to retrieve a single document. It is therefore important for search engines to address the ambiguity of keywords in order to obtain better results.
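One simple way to address keyword ambiguity is to guess the intended sense from the other words in the query. The Python sketch below illustrates the idea with a word-overlap heuristic in the spirit of the Lesk approach. It is a sketch under stated assumptions only: the clue lists in `SENSE_CLUES` are hypothetical and hand-written for this example, and the function `guess_sense` is not part of the study.

```python
from typing import Dict, Optional, Set

# Hypothetical context clues for each sense of an ambiguous keyword.
SENSE_CLUES: Dict[str, Dict[str, Set[str]]] = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "चाँदी", "कीमत"},
        "to sleep": {"रात", "नींद", "बिस्तर", "जल्दी"},
    },
}


def guess_sense(keyword: str, query: str,
                clues: Dict[str, Dict[str, Set[str]]] = SENSE_CLUES
                ) -> Optional[str]:
    """Pick the sense whose clue words overlap most with the query.

    Returns None when no clue word occurs in the query or when the two
    best senses are tied, i.e. when the query stays ambiguous.
    """
    context = set(query.split()) - {keyword}
    scores = {sense: len(words & context)
              for sense, words in clues.get(keyword, {}).items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


if __name__ == "__main__":
    print(guess_sense("सोना", "सोना चाँदी की कीमत"))
    print(guess_sense("सोना", "नारी को सोना पसंद है"))
```

For a query such as the first example in Table 4.22, the heuristic returns no sense, because the query contains no clue word. This matches the observation above that such queries are ambiguous for the search engines as well; in that case a system could ask the user to choose a sense or show results grouped by sense.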


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word exercise is written in Hindi as एक्सरसाइज, and this word has several phonetic variations in Hindi script. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Hindi forms (in order: Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results (in the same order)
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8

Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents not only for a single-keyword query but also for every phonetic variant of the keyword, often in very large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are also retrieved for every phonetic variant of an English keyword written in Hindi script.

In the above table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table above that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

Table 4.26 (continued):
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged.

Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as follows:

Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore, the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent variants that mix English and Hindi while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
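The two-level transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software described in the next chapter: the lexicon `ENGLISH_IN_HINDI`, the function name `transform_query` and the Devanagari spellings are hypothetical, and "level 2" is taken here to mean that every keyword found in the lexicon is replaced.

```python
# Hypothetical lexicon: Hindi keyword -> English keyword written in Devanagari.
# The spellings are illustrative; real web pages use many phonetic variants.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",           # videshi -> foreign
    "निवेश": "इन्वेस्टमेंट",      # nivesh  -> investment
}


def transform_query(query, lexicon, level):
    """Return the level-1 or level-2 variant of a whitespace-separated query.

    Level 1 replaces only the first keyword found in the lexicon.
    Level 2 replaces every keyword found in the lexicon (an assumption).
    """
    if level not in (1, 2):
        raise ValueError("level must be 1 or 2")
    replaced = 0
    out = []
    for word in query.split():
        if word in lexicon and (level == 2 or replaced == 0):
            out.append(lexicon[word])
            replaced += 1
        else:
            out.append(word)
    return " ".join(out)


QUERY = "विदेशी निवेश भारत में"  # "videshi nivesh Bharat mein"

if __name__ == "__main__":
    for level in (1, 2):
        print(level, transform_query(QUERY, ENGLISH_IN_HINDI, level))
```

Words that are not in the lexicon pass through unchanged, so the meaning of the query is preserved as long as the lexicon entries are true equivalents.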

Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Level 1 | Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (Google result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants)

From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity. Table 4.28 and Figure 4.5 therefore present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.

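Table 4.28: Analysis of precision values (tabular representation)

Query | Variant | Query text | Precision@10 Google | Precision@10 Bing | Precision@10 AltaVista
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1.0 | 1.0 | 1.0
4 | Hindi query | सयक यदव य य जग यम जन | 1.0 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 1 | कयपशनभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 2 | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0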

Figure 4.5: Analysis of inter-precision values

From the above tables and figures, it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the retrieved data. The relevance of the retrieved data can be seen in the precision columns. For most Hindi queries and their transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. The rest of the Hindi queries in the table above behave similarly.

Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variants that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.
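The precision values discussed above are precision at 10: the fraction of the first 10 returned documents that are judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the list of relevance judgments are illustrative assumptions, not part of the original experiments.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of booleans in rank order, one per returned
    document (True = relevant). Results missing from a short list count
    as non-relevant, because the divisor is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(1 for judged_relevant in top_k if judged_relevant) / k


if __name__ == "__main__":
    # Hypothetical judgments: 9 of the first 10 documents are relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of returned documents means a query that returns fewer than 10 results is penalised, which is one common convention; other conventions exist.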

Inter-precision chart for Figure 4.5. Series: Google, Bing, AltaVista. Horizontal axis: HQ1 to HQ6, each followed by L1 and L2 (HQ = Hindi query, L = level).

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].

Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch of Brahmi-derived scripts includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
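The parallel code layout can be illustrated with a small Python sketch. It assumes that each of the nine scripts occupies a 128-code-point Unicode block starting at the values listed in `BLOCK_START`, and it shifts characters by the difference between block starts. The function name `map_script` is hypothetical. This is a naive mapping, not a transliteration tool: some positions are unassigned or used differently in individual scripts, so the output is only meaningful for characters that exist at the analogous position in the target block.

```python
# Start of each script's 128-code-point Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def map_script(text, source, target):
    """Shift characters of the source block to the analogous target position.

    Characters outside the source block (spaces, digits, Latin letters)
    are copied unchanged. No check is made that the target code point
    is actually assigned in Unicode.
    """
    start = BLOCK_START[source]
    shift = BLOCK_START[target] - start
    return "".join(
        chr(ord(ch) + shift) if start <= ord(ch) < start + BLOCK_SIZE else ch
        for ch in text
    )


if __name__ == "__main__":
    # DEVANAGARI LETTER KA (U+0915) maps to BENGALI LETTER KA (U+0995).
    print(map_script("\u0915", "devanagari", "bengali"))
```

Because the mapping is a fixed offset, applying it from the target back to the source recovers the original text for characters inside the block.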

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G (Extended Character Set for Vedic). Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
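The three character types and the logical storage order can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only: the code point ranges are an approximation of the core Devanagari repertoire (additional letters and signs elsewhere in the block are reported as "other"), and the names `classify` and `describe` are hypothetical.

```python
import unicodedata

# Approximate ranges inside the Devanagari block (U+0900..U+097F).
INDEPENDENT_VOWELS = range(0x0904, 0x0915)      # U+0904..U+0914
CONSONANT_LETTERS = range(0x0915, 0x093A)       # U+0915..U+0939
DEPENDENT_VOWEL_SIGNS = range(0x093E, 0x094D)   # U+093E..U+094C
VIRAMA = 0x094D


def classify(ch):
    """Return the character type of a single Devanagari character."""
    cp = ord(ch)
    if cp in INDEPENDENT_VOWELS:
        return "independent vowel"
    if cp in CONSONANT_LETTERS:
        return "consonant letter"
    if cp in DEPENDENT_VOWEL_SIGNS:
        return "dependent vowel sign"
    if cp == VIRAMA:
        return "virama"
    return "other"


def describe(text):
    """List (code point, Unicode name, type) for each stored character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), classify(ch))
        for ch in text
    ]


if __name__ == "__main__":
    # The vowel sign I is drawn to the left of KA, but it is stored
    # after the consonant, in logical (phonetic) order.
    for row in describe("\u0915\u093F"):
        print(row)
```

The same function shows how a consonant cluster is stored: the sequence KA, virama, SSA is three code points in pronunciation order, even though it is usually rendered as a single conjunct.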

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of new character encoding from the Unicode Consortium (a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India) proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of whom about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.

"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978, collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, as part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus containing approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which supports translation involving Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha', a repository of a Hindi word database and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Efforts in this field date from 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. In the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology for English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H), and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent both the English and the Hindi grammar. Initially it was domain-specific, covering Personnel Administration (specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars); gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was only for English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (the Rajya Sabha is the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility to the user: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based approach with statistical methods. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4510 Anubadok

Anubadok is a software system for machine translation from English to

Bengali It is developed in Perl programming language which supports processing

of Unicode encoded and text for text manipulations The system uses the Penn

Treebank annotation system for part-of-speech tagging It translates the English

sentence into Unicode based Bengali text Users can access the system at

httpbengalinuxsourceforgenetcgibinanubadokindexpl

4511 Punjabi to Hindi Machine Translation System

During 2007 Josan and Lehal at the Punjab University Patiala designed

Punjabi to Hindi machine translation system The system is built on the paradigm

of foreign machine translation system such as RUSLAN and CESILKO The

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 115

system architecture consists of three processing modules Pre Processing

Translation Engine and Post Processing

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine: Guruji

Guruji.com is the first Indian-language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the backing of Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian-language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it offers language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian-language software. It provides Indian-language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is now so crowded that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario.

The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched only recently. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with its own USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search English

4.6.1.3 Info Search Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian-language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords do exist in the results of the second search. But simply saying that there is a recall problem is not sufficient. The next obvious question is how much of a problem it is. In other words, the problem needs to be quantified. For this purpose, we conducted several experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15: Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting the Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar. However, grammatical and structural standards are often not followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in dialectal variation not only in speaking but also in writing. Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses MAATRAAS (vowel signs) and suffixes. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
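The idea of word stemming described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any of the search engines studied here: the suffix list is a small hand-picked assumption, and the function name `stem` is hypothetical. A realistic Hindi stemmer would need a much larger rule set and exception handling.

```python
# Minimal suffix-stripping sketch for Hindi plural/oblique forms.
# The suffix table is an illustrative assumption, ordered longest first.
# Escapes are used so the combining vowel signs are unambiguous.
SUFFIXES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -iyon  -> -ii  (e.g. रोगियों -> रोगी)
    ("\u093e\u0913\u0902", "\u093e"),        # -aaon  -> -aa  (e.g. योजनाओं -> योजना)
    ("\u093e\u090f\u0902", "\u093e"),        # -aaen  -> -aa  (e.g. योजनाएं -> योजना)
    ("\u093e\u090f\u0901", "\u093e"),        # same ending written with chandrabindu
    ("\u094b\u0902", ""),                    # -on    -> root (e.g. वर्षावनों -> वर्षावन)
    ("\u0947\u0902", ""),                    # -en    -> root (e.g. झीलें -> झील)
]


def stem(word):
    """Return a crude canonical (root) form of a Hindi word."""
    for suffix, replacement in SUFFIXES:
        # Require at least two characters to remain, to avoid over-stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["योजनाओं", "योजनाएं", "योजना"]:
        print(w, "->", stem(w))
```

With such a function, the variants योजनाओं and योजनाएं would both be indexed and searched under the single root form योजना.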

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word; for example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible spellings. There are numerous words which are phonetically equivalent but vary in writing. The word "school" can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword for school (सक र) and for a non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by the other regional languages, which results in phonetic variety of words; for example, the English word "school" (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. For each phonetically equivalent keyword in a query, the results vary; i.e., a different set of documents is retrieved, with little repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
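One way a search system could treat phonetically equivalent spellings as the same keyword is to map each word to a normalized matching key before indexing and querying. The sketch below is an assumption-laden illustration, not a description of how Google, Bing or Guruji work: the function name `normalize_hindi` and the three rules (drop the nukta, fold chandrabindu into anusvara, fold a nasal consonant plus halant before a consonant into anusvara) are chosen only to show the idea, and the key deliberately over-merges some words.

```python
import re
import unicodedata

NUKTA = "\u093c"          # dot below, as in ज़
CHANDRABINDU = "\u0901"   # ँ
ANUSVARA = "\u0902"       # ं
HALANT = "\u094d"         # ्
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म

# A nasal consonant + halant directly before another consonant (U+0915..U+0939).
_NASAL_CLUSTER = re.compile("[" + NASALS + "]" + HALANT + "(?=[\u0915-\u0939])")


def normalize_hindi(word):
    """Return a crude spelling-insensitive key for a Devanagari word."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    # हिन्दी and हिंदी should produce the same key.
    return _NASAL_CLUSTER.sub(ANUSVARA, text)


if __name__ == "__main__":
    print(normalize_hindi("हिन्दी") == normalize_hindi("हिंदी"))
    print(normalize_hindi("आन्दोलन") == normalize_hindi("आँदोलन"))
```

The key is meant only for matching, not for display: a word such as जन्म is also rewritten by the nasal rule, which is harmless as long as the same function is applied to both documents and queries.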

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query, the word सोना (gold) is ambiguous, as it has another meaning, i.e. "to sleep". In the context of the above query, सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means "common"; however, in Hindi it also means "mango". So the above query can also be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर means "tax" (with synonyms such as शुल्क, महसूल and टैक्स) in one context, "hand or arm" (with synonyms such as हस्त and बाहु) in another, and "to do" (करना) in yet another.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways; it has affected the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by people with no formal education, without awareness of which language the words come from. Most Indians use these words in English rather than in their native language. English has its influence on Hindi not only in speaking but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18: Effect of morphological factors on Hindi queries

S.No | Root words | Listing of keywords (Google, Bing, Guruji)
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन
3 | ब ष | ब ष ब ष ओ ब ष ब ष
4 | झ र | झ रझ रो झ र झ र
5 | आऩद | आऩद आऩद ओ आऩद आऩद
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक
9 | य ग | य गोय ग य ग य ग
10 | म जन | म जन ओ म जन म जन म जन
11 | क दर | क दरीमक दर क दर क दर

S.No | Morphological variant | Documents returned: Google | Bing | Guruji
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105

Table 4.19: Precision values of the three search engines

S.No | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (x-axis: query numbers 1.1 to 11.1; y-axis: precision from 0 to 1; series: P Google, P Bing, P Guruji)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This supports searching for documents using the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form; Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results returned by each search engine. On close observation, the precision of Google is high for almost all queries. Since, as noted above, Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
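The precision over the top 10 results used in these comparisons can be written down as a short function. The sketch below assumes that each returned result has already been judged relevant or not relevant by hand; the function name `precision_at_k` and the list-of-booleans input format are illustrative choices, not part of the original experiments.

```python
def precision_at_k(relevant_flags, k=10):
    """Fraction of the top-k results that were judged relevant.

    relevant_flags: list of booleans in ranked order (True = relevant).
    If fewer than k results are returned, the missing positions count
    as non-relevant, which is the usual convention for precision at k.
    """
    top = relevant_flags[:k]
    if not top:
        return 0.0
    return sum(1 for flag in top if flag) / k


if __name__ == "__main__":
    # Five relevant documents among the first ten results.
    judged = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judged))
```

A value of 0.5 in Table 4.19 therefore corresponds to five relevant documents among the first ten returned.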

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

Table 4.20: Results of the search engine on phonetic nature of Hindi language

Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Google query | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5

Figure 4.2: Precision charts for phonetic nature of Hindi language (five panels, Query No. 1 to Query No. 5)

From the above table and figure, it can be clearly seen that search engines return only a handful of documents for various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. For each phonetically equivalent keyword in a query, the results vary; i.e., a different set of documents is retrieved, with little repetition. From the precision chart, it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.

Table 4.21: Effect of word synonyms on Hindi IR

S.No | Query | Standard Hindi words / synonyms | Google: documents returned | Precision@10 | Bing: documents returned | Precision@10 | Guruji: documents returned | Precision@10
1 | स न क आब षण | आब षणगहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3: Comparison of precision values against three search engines (x-axis: query numbers 1.1 to 6.3; y-axis: precision from 0 to 1; series: Google, Bing, Guruji)

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure, it can be observed that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. In the average case Google still stands first, meaning that its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into one of its synonyms, equivalent results can be obtained. Therefore, synonyms of keywords play an important role in Hindi information retrieval.
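The role of synonyms discussed above suggests a simple form of query expansion: issue the original query together with variants in which a keyword is replaced by each of its synonyms. The following Python sketch illustrates this under stated assumptions. The dictionary `SYNONYMS` is a tiny hand-made example built from the synonyms of आभूषण listed in this section, and the function name `expand_query` is hypothetical; a real system would draw synonyms from a lexical resource such as a Hindi WordNet.

```python
# Hand-made synonym table (an illustrative assumption, not a real lexicon).
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym."""
    tokens = query.split()
    variants = [query]
    for position, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            # Replace only the matched keyword; keep the other tokens as typed.
            replaced = tokens[:position] + [alternative] + tokens[position + 1:]
            variants.append(" ".join(replaced))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोना आभूषण"):
        print(variant)
```

The results of the variants could then be merged before ranking, so that a document using गहना is not missed when the user types आभूषण. The sketch substitutes words without adjusting grammatical agreement, which is acceptable for keyword-style queries but not for full sentences.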

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi; the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns hold the English meaning of the query for the respective context of the ambiguous keyword.


Table 4.22: List of randomly selected ambiguous queries

S.No | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the tables, the last column, labeled "Other context", holds the number of results which are not relevant to the query supplied, or which contain the keyword in some other, unwanted context. The results show that all search engines return documents in different contexts. It can therefore be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, the results show that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9 such documents, whereas Guruji fails to retrieve even a single document. In this scenario, it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
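One simple way a retrieval system might address keyword ambiguity is to pick the sense whose typical context words overlap most with the other words of the query or document, in the spirit of simplified Lesk-style disambiguation. The sketch below is only an illustration under assumptions: the sense inventory `SENSES` and its cue words are invented for this example, and the function name `disambiguate` is hypothetical.

```python
# Hypothetical sense inventory: for each ambiguous keyword, each sense is
# described by a small set of cue words that tend to occur with it.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "चांदी", "कीमत"},
        "to sleep": {"रात", "नींद", "बिस्तर", "आराम"},
    },
}


def disambiguate(keyword, context_words, senses=SENSES):
    """Return the sense with the largest cue-word overlap, or None."""
    context = set(context_words)
    scores = {
        sense: len(cues & context)
        for sense, cues in senses[keyword].items()
    }
    best = max(scores, key=scores.get)
    # With no overlapping cue word the context gives no evidence either way.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("सोना", ["सोना", "चांदी", "कीमत"]))
    print(disambiguate("सोना", ["नारी", "को", "पसंद", "है"]))
```

The second call illustrates the difficulty reported in this section: a short query such as नारी को सोना पसंद है contains no cue word for either sense, so the sketch cannot decide, and the search engine would need additional evidence (for example, from the user's earlier queries) to choose a context.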

4.13 Discussion: Influence of English on Hindi Information Retrieval

English has its influence on Hindi not only in speaking but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: the English word "exercise" is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the kinds of phonetic variation shown in this example, a variety of popular keywords and queries from various domains were tested in the experiments.

In the following figure a sample set of common and popular English keywords

along with their phonetic variants written in Hindi can be seen

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (transliteration) | Google results (standard) | Google results (equivalent 1) | Google results (equivalent 2)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8

Table 4.26: Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of each keyword. It can also be seen that people have their own ways of writing such words in Hindi and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries (the Google Transliteration column) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table that 1540 documents were nevertheless retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

Table 4.26 (continued): Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000


The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged. Example:

The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
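The two-level transformation described above can be sketched in Python. This is only an illustrative sketch and not the software discussed in the next chapter: the function name transform_query, the LOANWORDS mapping and the Devanagari spellings in it are assumptions made here (the spellings are reconstructed from the romanized form "videshi nivesh Bharat mein"), and Level 1 simply replaces the first replaceable word.

```python
# Hypothetical keyword map: Hindi word -> English loanword written in Devanagari.
# These spellings are illustrative assumptions, not data from the study.
LOANWORDS = {
    "विदेशी": "फॉरेन",          # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",     # nivesh  -> "investment"
}


def transform_query(query, loanwords):
    """Return the original query plus its Level 1 and Level 2 variants.

    Level 1 replaces one Hindi keyword (here: the first one found in the map).
    Level 2 replaces every keyword found in the map, and is produced only
    when more than one keyword can be replaced.
    """
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in loanwords]
    variants = {"original": query}
    if replaceable:
        level1 = list(tokens)
        first = replaceable[0]
        level1[first] = loanwords[tokens[first]]
        variants["level1"] = " ".join(level1)
    if len(replaceable) > 1:
        variants["level2"] = " ".join(loanwords.get(tok, tok) for tok in tokens)
    return variants


if __name__ == "__main__":
    for level, text in transform_query("विदेशी निवेश भारत में", LOANWORDS).items():
        print(level, "->", text)
```

A fuller tool would presumably enumerate every single-word replacement at Level 1 and try several spellings per loanword, since the phonetic variants of one English keyword can differ widely in how many documents they retrieve.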

Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Queries transformed into two equivalent senses (chart of Google result counts for Hindi Queries 1 to 6, each at the original form, Level 1 and Level 2)

From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity. Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28: Analysis of precision values (tabular representation)

Each search-engine column lists Precision@10 in the order: Hindi query, Level 1, Level 2.

Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1

Figure 4.5: Analysis of inter-precision values

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. A similar pattern holds for the rest of the Hindi queries, as shown in the table above.

Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variations that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.
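The precision values discussed here are precision at 10, that is, the fraction of the first ten results judged relevant. A minimal sketch of that computation follows. The function name and the list-of-booleans representation of relevance judgements are assumptions for illustration; the judgements themselves would still have to be made by a human assessor.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top-k results judged relevant.

    judgements: relevance flags (True/False) in ranked order.
    If fewer than k results were returned, the missing ones
    count as non-relevant, so the divisor is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for flag in judgements[:k] if flag) / k


if __name__ == "__main__":
    # 9 of the first 10 documents relevant, as reported for the first Hindi query.
    first_query = [True] * 9 + [False]
    print(precision_at_k(first_query))
```

Dividing by k rather than by the number of results returned means that a query which retrieves nothing scores 0, which matches the way a failed retrieval is treated as a poor result in this chapter.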

Inter-precision chart. HQ: Hindi Query; L: Level. Series: Google, Bing, AltaVista.

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


4.3.4 Devanagari: U+0900–U+097F

The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.

4.3.4.1 Standards

The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
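The parallel code layout can be illustrated with a small Python sketch that moves a character from one script block to the analogous position in another. The block start values are those of the Unicode code charts; the function name transpose and the decision to leave a character unchanged when the target position is unassigned are assumptions made for this illustration. Positional correspondence is only a rough transliteration aid, since not every position is filled in every script and later additions to a block need not line up.

```python
import unicodedata

# First code point of each Unicode block that follows the ISCII-derived layout.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def transpose(text, source, target):
    """Map characters of the source block to the same offset in the target block.

    Characters outside the source block are copied unchanged, as are
    characters whose analogous position is unassigned in the target script.
    """
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # Category "Cn" marks an unassigned code point.
            out.append(ch if unicodedata.category(candidate) == "Cn" else candidate)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    ka = "\u0915"  # DEVANAGARI LETTER KA
    for script in ("bengali", "gujarati", "tamil"):
        moved = transpose(ka, "devanagari", script)
        print(script, f"U+{ord(moved):04X}", unicodedata.name(moved))
```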

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G: Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
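Logical order and the three character types can be seen by inspecting the stored code points of a short syllable. In the sketch below, the function names are hypothetical and the ranges cover only the core letters of the Devanagari block as given in the Unicode code chart (later additions to the block fall under "other Devanagari"). The syllable कि is a convenient test case: the vowel sign is drawn to the left of the consonant, yet it is stored after it.

```python
import unicodedata


def char_type(ch):
    """Classify a character by its position in the Devanagari block (core ranges only)."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if 0x0900 <= cp <= 0x097F:
        return "other Devanagari"
    return "not Devanagari"


def describe(text):
    """List (code point, Unicode name, character type) in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), char_type(ch))
        for ch in text
    ]


if __name__ == "__main__":
    syllable = "\u0915\u093F"  # KA followed by VOWEL SIGN I, rendered as one syllable
    for row in describe(syllable):
        print(row)
```

Because the stored order follows pronunciation and not visual placement, string operations such as sorting or prefix matching on Hindi keywords work on the consonant first, which is what a search interface usually wants.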

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then the Hindi Unicode support has been a part of all subsequent upgradations of Microsoft's operating systems. Also, providing Input Method Editor facilities give users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system in its original 16-bit form can incorporate up to about 65,000 characters, and the current standard allows for over a million code points. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004 the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of whom about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two" [63].

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL programme. The Enabling Minority Language Engineering (EMILLE) programme provides corpus architecture and tools for Asian languages. It has a monolingual corpus containing approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository of a Hindi word database and for dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Efforts in this field have been under way since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University, Kolkata. Since April 2008 TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is a rule-based system with approximately 1750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. To develop automated translation systems for regional languages, the Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati.

4.5.3 Anubharti

Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, begun in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent English as well as Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was only for English-to-Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translation generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti is a system that combines a rule-based approach with a statistical approach. It is used for translating English into Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine Guruji

Guruji.com is the first Indian-language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian-language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English into Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu, and is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that supports the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication, and it offers language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian-language software. It provides Indian-language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian-language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As one result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan. 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi / Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched: some smaller and some bigger, some with huge funding and some with none. The space is so crowded right now that it is difficult to know who is really winning. Nevertheless, we attempt a brief overview of the current scenario.

The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            02
Hindi        1            NA
Webdunia     1            07
Khoj         07           2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian-language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem when dealing with Indian languages, caused by multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result.

Merely stating that there is a recall problem is not sufficient, however. The obvious next question is "how much of a problem is it?" In other words, the problem needs to be quantified. For this purpose we conducted a number of experiments to determine at what levels the recall problem occurs, and by how much.


Table 4.15 Problems faced while searching in Hindi: Low recall

Hindi Query                                         Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                       0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम            0        0       0

Table 4.16 Improved Recall

Hindi Query                                         Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                             8820     92      12
वलडम टर ड सटय आतॊकीअटक                              331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                     708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                       37100    7400    93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and a well-defined grammar. However, grammatical and structural standards are often not followed, for various reasons. One reason is the language diversity of India: including Hindi, about 28 languages are spoken in India, and Hindi, as the official language of India, is influenced by the regional languages, which leads to dialectal variation not only in speech but also in writing.

Every language attaches markers to a root word to construct new words: English uses -s, -es and -ing, and Hindi uses MAATRAAS such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
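The stemming step described above can be sketched as a small suffix-stripping routine. This is only a minimal illustration and not the method used in this study: the suffix list and the function name `stem` are assumptions chosen to cover the plural endings in the Yojnaa example, and a practical Hindi stemmer would need a much larger rule set.

```python
# Illustrative (assumed) list of Hindi plural endings; not an exhaustive rule set.
PLURAL_SUFFIXES = ("ओं", "एं", "एँ", "ों", "ें")


def stem(word, min_stem_len=2):
    """Strip one known plural ending, keeping at least min_stem_len characters."""
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # All three forms should print the same root word.
    for form in ("योजना", "योजनाओं", "योजनाएं"):
        print(form, "->", stem(form))
```

Mapping every variant to one root before indexing and querying means that a query containing any of the forms can match documents containing the others.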

4.8.2 Phonetic nature of Hindi Language and Spelling variations

Spelling variations can be attributed mainly to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabets. Together, these factors produce several spellings of the same word. The Hindi word अ गर ज (angrējī, meaning English), for example, has several possible spelling variations.

Numerous words are phonetically equivalent but written differently. The word "school" can be written in Hindi in different ways (सक र, सक र, सक र). When the standard keyword for school (सक र) and a non-standard phonetic equivalent (सक र) are searched separately, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words. For example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by a large part of the population in different states; for the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should therefore be able to retrieve results for words that are phonetically equivalent to the keywords entered: a user may use any variant of a keyword, and the search engine should support all of them.

Moreover, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved, with little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her needs.
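One way to make phonetically equivalent spellings match is to map each word to a normalized key before comparison. The sketch below is an assumption for illustration, not the approach evaluated in this chapter: the function name `phonetic_key` and the four rules (drop the nukta, treat chandrabindu as anusvara, replace a nasal consonant plus virama before a consonant with anusvara, shorten long i/u vowel signs) are hypothetical, and the key is deliberately lossy.

```python
import re
import unicodedata

NUKTA = "\u093C"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094D"        # ्

# A nasal consonant (ङ ञ ण न म) + virama, directly followed by another consonant.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091E\u0923\u0928\u092E]" + VIRAMA + "(?=[\u0915-\u0939])"
)
# Long vowel signs ी and ू mapped to their short counterparts ि and ु.
LONG_TO_SHORT = str.maketrans({"\u0940": "\u093F", "\u0942": "\u0941"})


def phonetic_key(word):
    """Return a lossy key under which common spelling variants coincide."""
    text = unicodedata.normalize("NFD", word)   # split precomposed nukta letters
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = NASAL_CLUSTER.sub(ANUSVARA, text)
    return text.translate(LONG_TO_SHORT)


if __name__ == "__main__":
    for variant in ("आन्दोलन", "आंदोलन", "आँदोलन"):
        print(variant, "->", phonetic_key(variant))
```

Because the key merges distinctions that are sometimes meaningful, it is better suited to grouping candidate variants for query expansion than to replacing the original keyword.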

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.
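Synonym handling of this kind is often implemented as query expansion: each keyword that has known synonyms produces additional query variants. The following is a minimal sketch under stated assumptions. The dictionary `SYNONYMS` and the function `expand_query` are hypothetical names, and the single entry is illustrative data for the "ornament" example rather than a real lexical resource.

```python
# Hypothetical, hand-made synonym table; a real system would use a lexical
# resource such as a WordNet-style database instead.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    tokens = query.split()
    variants = [query]
    for index, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            replaced = tokens[:index] + [alternative] + tokens[index + 1:]
            variants.append(" ".join(replaced))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, so that documents using any of the synonyms become retrievable.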

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of results, as the following examples show. Consider the query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, since it has another meaning, i.e. "to sleep". In the context of the query सोना means gold, but the query can equally be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common", but in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कय means tax in one context (synonyms: बम ज श लक स द भहस ऱ ट कस), hand or arm in another (हसत फ ह आच शफय) and "to do" (कयन) in a third.
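A simple way to pick a sense is to count how many cue words of each sense appear in the rest of the query, in the spirit of overlap-based (Lesk-style) disambiguation. The sketch below is an assumption for illustration only: the table `SENSE_CUES`, its cue words and the function `disambiguate` are hypothetical and are not taken from the experiments in this chapter.

```python
# Hypothetical cue words per sense of an ambiguous keyword.
SENSE_CUES = {
    "सोना": {
        "gold": {"आभूषण", "धातु", "भाव"},
        "sleep": {"नींद", "रात", "बिस्तर"},
    },
}


def disambiguate(keyword, query_tokens, senses=SENSE_CUES):
    """Return the sense with the most cue words in the query, or None if undecided."""
    context = set(query_tokens) - {keyword}
    scores = {
        sense: len(cues & context)
        for sense, cues in senses.get(keyword, {}).items()
    }
    best = max(scores.values(), default=0)
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None  # no evidence, or a tie between senses
    return winners[0]


if __name__ == "__main__":
    print(disambiguate("सोना", "रात को सोना जरूरी है".split()))
    print(disambiguate("सोना", "नारी को सोना पसंद है".split()))
```

For the query from the text, no cue word is present and the function returns None, which mirrors the point made above: the query alone does not determine the intended sense.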

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them now appear as if they were native Hindi words; speakers are sometimes unable to find a Hindi equivalent for the English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists queries randomly selected from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No   Query in Hindi                 Meaning in English
1      ब यतवष मवन                     Indian rain forest
1.1    ब यत मवष मवनो                  Indian rain forests
2      हव ईद घमटन क क यण              Reason for air crash
2.1    हव ईद घमटन ओ क क यण            Reason for air crashes
3      ब यतभफ रीज न व रीब ष           Language spoken in India
3.1    ब यतभफ रीज न व रीब ष ए         Languages spoken in India
3.2    ब यतभफ रीज न व रीब ष ओ         Languages spoken in India
4      षवर पतह न ऩयझ र                Lake on the verge of extinction
4.1    षवर पतह न ऩयझ र                Lakes on the verge of extinction
5      पर क ततकआऩद                    Natural calamity
5.1    पर क ततकआऩद ए                  Natural calamities
6      ऩ कीपरज तत                     Bird species
6.1    ऩकषमोकीपरज तत                  Bird's species
6.2    ऩकषमोकीपरज ततम                 Bird's species
7      क षषसभसम                       Agriculture problem
7.1    क षषसभसम ओ                     Agriculture problems
8      कीटन शकक इसत भ र               Use of pesticide
8.1    कीटन शकोक इसत भ र              Use of pesticides
9      भ नमसकय ग                      Mental illness
9.1    भ नमसकय चगमो                   Mental patients
9.2    भ नमसकय ग                      Mental patient
10     गर भ णषवक सम जन                Policy for village
10.1   गर भ णषवक सम जन ओ              Policies for village
10.2   गर भ णषवक सम जन ए              Policies for village
11     परभ खक षषक दर                  Major agricultural office
11.1   परभ खक षषक नदरो                Major agricultural offices


Table 4.18 Effect of morphological factors on Hindi queries

For each root word, the keywords listed by Google, Bing and Guruji (in that order) are given on one line, followed by the number of documents returned for each morphological variant.

S.No   Root word(s) / morphological variant        Documents returned
                                                   Google    Bing     Guruji
1      Root words: ब यत वष मवन
       Listing of keywords: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1    ब यत वष मवन                                 50500     4680     485
1.2    ब यत मवष मवनो                               40400     680      61

2      Root word: द घमटन
       Listing of keywords: द घमटन द घमटन ओ द घमटन द घमटन
2.1    द घमटन                                      133000    2410     284
2.2    द घमटन ओ                                    117000    420      23

3      Root word: ब ष
       Listing of keywords: ब ष ब ष ओ ब ष ब ष
3.1    ब ष                                         161000    8330     961
3.2    ब ष ए                                       6200      935      188
3.3    ब ष ओ                                       6090      441      356

4      Root word: झ र
       Listing of keywords: झ रझ रो झ र झ र
4.1    झ र                                         4740      278      25
4.2    झ र                                         1270      28       1

5      Root word: आऩद
       Listing of keywords: आऩद आऩद ओ आऩद आऩद
5.1    आऩद                                         102000    4030     410
5.2    आऩद ए                                       1160      64       20

6      Root words: ऩ परज तत
       Listing of keywords: ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1    ऩ परज तत                                    48200     1670     98
6.2    ऩकषमोपरज तत                                 47600     1150     84
6.3    ऩकषमोपरज ततम                                33800     747      25

7      Root word: सभसम
       Listing of keywords: सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1    सभसम                                        584000    30200    1889
7.2    सभसम ओ                                      584000    7150     1356

8      Root word: कीटन शक
       Listing of keywords: कीटन शक कीटन शको कीटन शक कीटन शक
8.1    कीटन शक                                     36300     1360     333
8.2    कीटन शको                                    35800     800      270

9      Root word: य ग
       Listing of keywords: य ग य गोय ग य ग य ग
9.1    य ग                                         205000    21600    1423
9.2    य चगमो                                      128000    3280     239
9.3    य ग                                         112000    6280     647

10     Root word: म जन
       Listing of keywords: म जन ओ म जन म जन म जन
10.1   म जन                                        673000    18500    3343
10.2   म जन ओ                                      669000    6020     990
10.3   म जन ए                                      673000    2860     416

11     Root word: क दर
       Listing of keywords: क दरीमक दर क दर क दर
11.1   क दर                                        261000    11300    655
11.2   क नदरो                                      29500     1850     105


Table 4.19 Precision values of the three search engines

S.No   Query                            Precision@10
                                        Google   Bing   Guruji
1.1    ब यतवष मवन                       0.5      0.3    0.1
1.2    ब यत मवष मवनो                    0.3      0.1    0.1
2.1    हव ईद घमटन क क यण                0.5      0.3    0.1
2.2    हव ईद घमटन ओ क क यण              0.3      0.3    0.1
3.1    ब यतभफ रीज न व रीब ष             0.9      0.5    0.5
3.2    ब यतभफ रीज न व रीब ष ए           0.5      0.4    0.3
3.3    ब यतभफ रीज न व रीब ष ओ           0.5      0.2    0.2
4.1    षवर पतह न ऩयझ र                  0.5      0.3    0.2
4.2    षवर पतह न ऩयझ र                  0.5      0.3    0
5.1    पर क ततकआऩद                      1        0.4    0.3
5.2    पर क ततकआऩद ए                    0.6      0.6    0.1
6.1    ऩ कीपरज तत                       1        0.7    0.2
6.2    ऩकषमोकीपरज तत                    0.9      0.6    0.1
6.3    ऩकषमोकीपरज ततम                   0.9      0.6    0.1
7.1    क षषसभसम                         0.7      0.3    0.2
7.2    क षषसभसम ओ                       0.7      0.4    0.2
8.1    कीटन शकक इसत भ र                 1        0.8    0.2
8.2    कीटन शकोक इसत भ र                0.9      0.6    0.2
9.1    भ नमसकय ग                        0.7      0.6    0.4
9.2    भ नमसकय चगमो                     0.9      0.6    0.3
9.3    भ नमसकय ग                        0.9      0.5    0.4
10.1   गर भ णषवक सम जन                  0.9      0.7    0.3
10.2   गर भ णषवक सम जन ओ                1        0.6    0
10.3   गर भ णषवक सम जन ए                0.8      0.6    0.2
11.1   परभ खक षषक दर                    0.5      0.4    0
11.2   परभ खक षषक नदरो                  0.4      0.2    0.2

Figure 4.1 Precision values of the three search engines (series: P Google, P Bing, P Guruji; horizontal axis: query number; vertical axis: precision from 0 to 1)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. Overall, the results from the three search engines in the tables above show that the quantity of results retrieved generally increases when keywords are used in their root form.

For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines; precision is calculated over the top 10 results of each search engine. On close observation, the precision of Google is high in almost all queries. Given that Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
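The precision measure used here, computed over the top 10 results, can be written as a short function. This is a minimal sketch under one assumption: the relevance judgments for each result are supplied by hand as booleans in ranked order. The name `precision_at_k` is illustrative, and the function divides by k even when fewer than k results are returned, which is one common convention.

```python
def precision_at_k(relevance_judgments, k=10):
    """Fraction of the top-k results judged relevant.

    relevance_judgments: booleans in ranked order, True meaning relevant.
    The denominator is always k, so a short result list is penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevance_judgments)[:k]
    return sum(1 for is_relevant in top_k if is_relevant) / k


if __name__ == "__main__":
    # Five relevant documents among the first ten results.
    judged = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judged))
```

A value of 0.5 in Table 4.19 therefore corresponds to five relevant documents among the first ten results.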

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered: a user may use any variant of a keyword, and the search engine should support all phonetically equivalent words.

The following queries were randomly selected from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision obtained from Google.

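Table 4.20 Results of the search engine on phonetic nature of Hindi language

Query 1 (standard keywords): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर

Query submitted to Google           No. of results   Precision@10
ससजिमोिहयीर                         97               0.9
ससजिमो जहयीर                        632              0.9
सिबजमो िहयीर                        194              0.9

Query 2 (standard keywords): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई

Query submitted to Google           No. of results   Precision@10
आसभ न भहॊगाई                        35300            1.0
आसभ न भहॉगाई                        1040             1.0
आसभ भह ग ई                          14               0.6
आसभाॊ भह ग ई                        563              0.7

Query 3 (standard keywords): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी

Query submitted to Google           No. of results   Precision@10
भरषटाचायआज दी                       211000           0.8
भरशटाचायआज दी                       214              0.6
बयषटाचाय आज दी                      447              0.7
भरषटट च यआिादी                      1090000          0.9
भरशट च य आिादी                      1040             0.7
बयषटट च यआिादी                      1190             0.8

Query 4 (standard keywords): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन

Query submitted to Google           No. of results   Precision@10
अनन हज य आनदोरन                     84700            0.3
अनन हज य आॊदोरन                     85100            0.8
अनन हज य क आॉदोरन                   78               0.6
अनना हिाय क आनद रन                  399              0.5
अनन हिाय क आनद रन                   3260000          1.0

Query 5 (standard keywords): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी

Query submitted to Google           No. of results   Precision@10
फयोिगायी                            9650             0.9
फयोिगायी                            80600            1.0
फयोिगायी                            170              0.7
फयोिगायी                            30               0.5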


Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query: Query No. 1 to Query No. 5)

The table and figure above show that search engines return only a handful of documents for some phonetically equivalent Hindi queries. No particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved, with little overlap. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her needs.

4.11 Discussion: Word Synonyms

As noted in Section 4.8.3, a word often has near-synonyms that differ from it only in nuances of meaning, and choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.


Table 4.21 Effect of word synonyms on Hindi IR

For each query, the first line gives the query and its standard Hindi word and synonyms; the numbered lines give the documents returned and the precision at 10 (P@10) for each search engine.

S.No   Query                      Google    P@10   Bing    P@10   Guruji   P@10
1      स न क आब षण (standard word and synonyms: आब षणगहन)
1.1    स न क आब षण                217000    0.8    3250    0.7    381      0.5
1.2    स न क गहन                  188000    0.8    2590    0.6    389      0.5
1.3    स न क ज वय                 78900     0.8    1670    0.7    311      0.6
1.4    स न क अर क य               9490      0.5    633     0.4    70       0
1.5    स न क आबयण                 493       0.5    38      0.3    1        0

2      क र फ दर (standard word and synonyms: फ दर)
2.1    क र फ दर                   233000    0.7    7510    0.7    733      0.3
2.2    क र भ घ                    40700     0.9    1500    0.8    99       0.6
2.3    क र जरधय                   1570      0.6    54      0.6    2        0.2

3      सतर सशिकतकयण (standard word and synonyms: सतर न यी)
3.1    सतर सशिकतकयण               9950      0.9    1570    0.7    760      0.6
3.2    न यीसशिकतकयण               29300     0.9    1910    0.9    736      0.4
3.3    भहहर सशिकतकयण              96300     0.9    5160    0.8    1091     0.3
3.4    औयतसशिकतकयण                7670      0.8    680     0.7    510      0.2

4      मसक दयक अ हक य (standard word and synonyms: अ हक य)
4.1    मसक दयक अ हक य             1990      0.4    18      0.3    60       0.6
4.2    मसक दयक अमबभ न             2400      0.6    304     0.2    16       0.1
4.3    मसक दयक घभ ड               495       0.5    54      0.6    9        0.1

5      व रग ओ (standard word and synonyms: व)
5.1    व रग ओ                     6960      1      698     0.8    29       0.5
5.2    ऩ डरग ओ                    13400     1      1080    0.9    143      0.9
5.3    दयखतरग ओ                   481       0.5    19      0.5    0        0

6      आ खद न (standard word and synonyms: आ ख)
6.1    आ खद न                     34000     0.7    3690    0.5    312      0.3
6.2    न तरद न                    77500     1      3240    1      159      0.9
6.3    च द न                      2450      0.8    427     0.1    36       0.1


Figure 4.3 Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi: the use of synonyms affects not only the quantity of documents returned but also their quality.

The table and figure also show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google ranks first, i.e. its precision values are higher than those of Bing and Guruji. Replacing a keyword with a synonym can thus yield equivalent results, so synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.

Table 4.22 List of randomly selected ambiguous queries

S.No   Query                    For keyword as     In English                        For keyword as       In English
1      नारी को सोना पसंद है      सोना (Gold)         Women like gold                   सोना (To sleep)       Women like to sleep
2      आम लोगों की पसंद          आम (Common)        Common man's choice               आम (Mango)           Mango is people's choice
3      बाल विकास और पोषण        बाल (Children)     Child development and nutrition   बाल (Hair)           Hair development and nutrition
4      सपेरों का फन              फन (Art)           Art of snake charmers             फन (Snake head)      Snake charmer's snake head
5      युद्ध में कुल विनाश         कुल (Aggregate)    Aggregate destruction in wars     कुल (Family)         Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query   Ambiguous   Documents returned   Context      Results   Context      Results   Other
        keyword     (Google)                          found                  found     context
1       सोना         50800                Gold         5         To sleep     2         3
2       आम          488000               Common       3         Mango        3         4
3       बाल         2900000              Children     7         Hair         3         0
4       फन          184                  Art          0         Snake head   10        0
5       कुल         17800                Aggregate    2         Family       3         5

Table 4.24 Ambiguity test for Bing

Query   Ambiguous   Documents returned   Context      Results   Context      Results   Other
        keyword     (Bing)                            found                  found     context
1       सोना         2680                 Gold         2         To sleep     2         6
2       आम          17800                Common       3         Mango        3         4
3       बाल         4030                 Children     6         Hair         2         2
4       फन          25                   Art          0         Snake head   9         1
5       कुल         1900                 Aggregate    0         Family       2         8


Table 4.25 Ambiguity test for Guruji

Query   Ambiguous   Documents returned   Context      Results   Context      Results   Other
        keyword     (Guruji)                          found                  found     context
1       सोना         109                  Gold         0         To sleep     0         10
2       आम          6756                 Common       3         Mango        0         7
3       बाल         635                  Children     5         Hair         2         3
4       फन          No results found     Art          na        Snake head   na        na
5       कुल         84                   Aggregate    0         Family       0         10

The results in the tables show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unintended context. Since all the search engines return documents from different contexts, it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. Search engines therefore need to address the issue of ambiguity in keywords in order to obtain better results.


4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word exercise is written in Hindi as एकसयस इज and has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Hindi forms, in order: Google transliteration; standard Hindi keyword; phonetic equivalents | Google result counts, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8

Table 4.26: Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents not only for the single-keyword query but also for all phonetic variants of the keyword, and in large numbers. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetically variant English keyword written in Hindi script. In the above table, the Google transliteration column (bold Hindi entries) shows the keywords obtained by using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration for the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be concluded that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.
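One way to cope with such unstandardized spellings is to fold orthographic variants of a keyword onto a common key before comparing them. The following is a minimal sketch in Python and is not the tool described in the next chapter. The names `fold_variant` and `group_variants`, and the particular folding rules (dropping the nukta, treating chandrabindu as anusvara, treating candra O as AA, removing zero-width joiners), are assumptions chosen purely for illustration.

```python
import unicodedata
from collections import defaultdict

# Hypothetical folding rules (illustrative only, not a standard).
FOLD = {
    0x093C: None,    # nukta: dropped
    0x0901: 0x0902,  # chandrabindu -> anusvara
    0x0949: 0x093E,  # vowel sign candra O -> vowel sign AA
    0x200C: None,    # zero-width non-joiner: dropped
    0x200D: None,    # zero-width joiner: dropped
}

def fold_variant(word: str) -> str:
    """Return a spelling-insensitive key for a Devanagari word."""
    # NFD splits precomposed nukta letters (e.g. U+0958) into base + nukta,
    # so the nukta rule above applies to both encodings.
    return unicodedata.normalize("NFD", word).translate(FOLD)

def group_variants(words):
    """Group words whose folded keys coincide."""
    groups = defaultdict(list)
    for word in words:
        groups[fold_variant(word)].append(word)
    return dict(groups)

if __name__ == "__main__":
    # "hospital" spelled with candra O and with AA (properly encoded Unicode).
    variants = [
        "\u0939\u0949\u0938\u094d\u092a\u093f\u091f\u0932",
        "\u0939\u093e\u0938\u094d\u092a\u093f\u091f\u0932",
    ]
    for key, members in group_variants(variants).items():
        print(key, members)
```

Folding of this kind only catches variants that differ in the listed signs; variants that differ in whole letters would need a phonetic comparison instead.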

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

Table 4.26 (continued): Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.

Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query / Level 1 / Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 / 8360 / 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 / 37200 / 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 / 99100 / 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 / 51600 / 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 / 527 / 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 / 19700 / 47000

[Figure 4.4: Transformed queries into two equivalent senses. Chart of the Google result counts for Hindi Query 1 to Hindi Query 6, each shown with its Level 1 and Level 2 variants.]
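The two-level transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the software discussed in the next chapter: the function name `query_variants` and the `english_equivalents` dictionary are hypothetical, and the Devanagari strings in the demo are properly encoded renderings assumed from the romanized query "videshi nivesh Bharat mein".

```python
from itertools import combinations

def query_variants(query, english_equivalents):
    """Return (level1, level2) lists of transformed queries.

    level1: exactly one Hindi keyword replaced by its English
            equivalent written in Hindi script.
    level2: more than one keyword replaced.
    """
    tokens = query.split()
    # Positions of keywords for which an English equivalent is known.
    replaceable = [i for i, tok in enumerate(tokens) if tok in english_equivalents]
    level1, level2 = [], []
    for count in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, count):
            variant = " ".join(
                english_equivalents[tok] if i in chosen else tok
                for i, tok in enumerate(tokens)
            )
            (level1 if count == 1 else level2).append(variant)
    return level1, level2

if __name__ == "__main__":
    # Hypothetical lookup table: Hindi keyword -> English word in Devanagari.
    equivalents = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
    level1, level2 = query_variants("विदेशी निवेश भारत में", equivalents)
    print("Level 1:", level1)
    print("Level 2:", level2)
```

Unlike the single Level 1 variant shown in the example, this sketch enumerates every single-keyword replacement; a real tool might restrict the variants to the most common loanwords.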

From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

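Table 4.28: Analysis of precision values (tabular representation). Precision is measured over the first 10 results; each engine column lists the values for the Hindi query / Level 1 / Level 2.

Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 / 0.8 / 1.0 | 0.9 / 0.7 / 1.0 | 0.9 / 0.7 / 1.0
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1.0 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1.0 / 1.0 / 1.0 | 1.0 / 1.0 / 1.0 | 1.0 / 1.0 / 1.0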

Figure 4.5: Analysis of inter-precision values

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined against Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is in most cases very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. A similar pattern holds for the remaining Hindi queries, as shown in the table above.

Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, since the Hindi user may not be aware that such information is present on the web and may be unable to formulate a variant query based on English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

[Figure 4.5: Inter-precision chart. Horizontal axis: HQ1 to HQ6 (HQ = Hindi Query), each followed by L1 and L2 (L = Level); series: Google, Bing, AltaVista. The plotted values are those of Table 4.28.]
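The precision values reported above count the relevant documents among the first 10 results of each query. A minimal sketch of that measure follows; it assumes that each retrieved document has already been judged relevant or not (the text does not say how the judgments were made), and the name `precision_at_k` is chosen here for illustration.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k results judged relevant.

    relevance: sequence of truthy/falsy judgments in ranked order.
    Results missing from a short list count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k

if __name__ == "__main__":
    # Nine of the first ten documents judged relevant.
    judgments = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned penalizes a query variant that yields fewer than ten documents, which matters for the rarer Level 2 variants.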

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

The relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, software has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
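The interface-level idea can be illustrated with a thin wrapper that sends every variant of a query to a search function and merges the ranked results. This is only a sketch under stated assumptions: `search_with_variants` and `search_fn` are hypothetical names, `search_fn` stands in for whatever call returns a ranked list of URLs, and no real search engine API is used.

```python
from itertools import zip_longest
from typing import Callable, Iterable, List

def search_with_variants(variants: Iterable[str],
                         search_fn: Callable[[str], List[str]],
                         limit: int = 10) -> List[str]:
    """Query search_fn once per variant and merge the ranked lists.

    Results are interleaved rank by rank (round robin) so that the top
    hits of every variant survive, and duplicate URLs are kept once.
    """
    result_lists = [search_fn(variant) for variant in variants]
    merged, seen = [], set()
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]

if __name__ == "__main__":
    # Stand-in for a search engine: canned results per query string.
    canned = {
        "original query": ["u1", "u2"],
        "level 1 variant": ["u2", "u3", "u4"],
    }
    print(search_with_variants(canned.keys(), lambda q: canned[q]))
```

Because the merging happens outside the search engine, the engine itself needs no changes, which is the point of solving the problem at the interface level.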

Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.

4.3.4.2 Encoding Principles

The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
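The logical-order storage described above can be inspected directly with Python's standard `unicodedata` module. The sketch below is illustrative only: `classify` and `describe` are hypothetical helper names, and the code point ranges are a simplification covering only the basic Devanagari letters. The virama (U+094D), which joins consonants into a cluster, is not named in the text above and is included as an additional assumption.

```python
import unicodedata

def classify(ch: str) -> str:
    """Rough character type for basic Devanagari (simplified ranges)."""
    cp = ord(ch)
    if 0x0905 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if cp == 0x094D:
        return "virama"
    return "other"

def describe(text: str):
    """List (code point, Unicode name, type) in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), classify(ch))
        for ch in text
    ]

if __name__ == "__main__":
    # KA + VOWEL SIGN I: the sign is drawn to the left of the consonant
    # but stored after it, i.e. in logical (phonetic) order.
    for row in describe("\u0915\u093f"):
        print(row)
    # A CCCV orthographic syllable: SA, TA, RA joined by viramas, then II.
    for row in describe("\u0938\u094d\u0924\u094d\u0930\u0940"):
        print(row)
```

The second example shows the (((C)C)C)V structure as it is stored: three consonants, each cluster link marked by a virama, followed by one dependent vowel sign.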

4.4 Indian Languages on the Internet

The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages.

Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgradations of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font became available to most of them as they upgraded the operating software.

In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya.

In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar.

Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of whom about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978, and was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in the translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' for a repository of Hindi word databases and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Efforts in this field have been noticed since 1980, involving prominent institutions such as IIT Kanpur, University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were initiated by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University Kolkata. TDIL started a consortium-mode project in April 2008 for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti during 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was conceived by the Indian Government in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.

4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti is a rule-based system that also uses a statistical approach. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

During 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine: Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, and assisted by Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasha is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasha is English, and the target language can be any Indian local language.

4.5.12.4 Webdunia

Webdunia is an important private player which assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation and collection, and in content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also making efforts for the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.1.3.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, some 10-15 Indian local search engines must have been launched: some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. Nevertheless, we attempt a brief overview of the current scenario. The following fall into the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More Info: India online 2008, India online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More Info: India online 2008, India online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More Info: India online 2008, India online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian-language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. Merely saying that there is a recall problem is not sufficient, however. The next obvious question is "how much of a problem is it?" In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15 Problems faced while searching in Hindi: Low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16 Improved Recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are seldom followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which leads to dialectal changes not only in speaking but in writing also. Every language attaches markers to a root word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) are morphological variants of the root word Yojnaa (म जन; "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
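The stemming step described above can be illustrated with a minimal suffix-stripping sketch. It is not the stemmer used in this study: the suffix list covers only the two plural endings of the Yojnaa example (plus one common spelling variant), and the function name `light_stem` and the minimum stem length are assumptions made for illustration.

```python
# Illustrative plural endings (an assumed, deliberately tiny list):
# "ओं" (-on), "एं" and "एँ" (-ein), written as code points for clarity.
SUFFIXES = ("\u0913\u0902", "\u090f\u0902", "\u090f\u0901")


def light_stem(word, min_stem_len=2):
    """Strip one known suffix so that morphological variants share a root form."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word  # no known suffix: treat the word as its own root


YOJNAA = "\u092f\u094b\u091c\u0928\u093e"   # योजना (root word)
YOJNAAON = YOJNAA + "\u0913\u0902"          # योजनाओं
YOJNAAYEIN = YOJNAA + "\u090f\u0902"        # योजनाएं

for variant in (YOJNAA, YOJNAAON, YOJNAAYEIN):
    print(variant, "->", light_stem(variant))
```

A real Hindi stemmer needs a much larger suffix inventory and rules for oblique forms; the sketch only shows how variants collapse to one canonical form that can then be used as an index key.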

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian-language alphabet. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, there are several possible spelling variations for the Hindi word अ गर ज (angrējī, meaning "English"). There are numerous words which are phonetically equivalent but vary in writing. The word "school" can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword for school (सक र) and for a non-standard phonetically equivalent keyword (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by the other regional languages, which results in phonetic variety in words; for example, the English word "school" (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. Users may search with any of these keywords, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query yields a different result set, i.e. a different set of documents is retrieved, with little overlap. The native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.
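One way to make phonetically equivalent spellings match is to map each keyword to a normalized key before comparison. The sketch below is only an illustration under stated assumptions, not a method evaluated in this study: the three rules (dropping the nukta, folding chandrabindu into anusvara, and writing a half nasal consonant as anusvara) and the name `phonetic_key` are chosen for the example, and the test word "andolan" is an assumed rendering of a keyword of the kind discussed here.

```python
import re
import unicodedata

NUKTA = "\u093c"          # dot below, as in ज़
CHANDRABINDU = "\u0901"   # ँ
ANUSVARA = "\u0902"       # ं
VIRAMA = "\u094d"         # ्
NASALS = "\u0919\u091e\u0923\u0928\u092e"   # ङ ञ ण न म
CONSONANTS = "\u0915-\u0939"                # क to ह

# A nasal consonant + virama before another consonant is often written as anusvara.
HALF_NASAL = re.compile(f"[{NASALS}]{VIRAMA}(?=[{CONSONANTS}])")


def phonetic_key(word):
    """Return a normalized key shared by common spelling variants of a word."""
    text = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)  # ँ -> ं
    text = HALF_NASAL.sub(ANUSVARA, text)        # न्द -> ंद
    return unicodedata.normalize("NFC", text)


ANDOLAN_VARIANTS = [
    "\u0906\u0928\u094d\u0926\u094b\u0932\u0928",  # आन्दोलन
    "\u0906\u0902\u0926\u094b\u0932\u0928",        # आंदोलन
    "\u0906\u0901\u0926\u094b\u0932\u0928",        # आँदोलन
]
print({phonetic_key(v) for v in ANDOLAN_VARIANTS})
```

The key is meant for matching only, not for display: rules this coarse will also merge some words that are genuinely different, so a production system would need a checked rule set.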

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly used synonyms: गहन ज वयअर क य.

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of results, as the examples below show. Consider the following query:

(In English) (Women like gold)
(In Hindi) (न यी क स न ऩस द ह )

In this query the word स न (gold) is ambiguous, as it has another meaning, i.e. "to sleep". In the context of the above query the word स न means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) (The common people's choice)
(In Hindi) (आभ र गो की ऩस द)

Here the word आभ is ambiguous. In the above query it means "common"; however, in Hindi it also means "mango", so the query can also be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कय (hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कय (to do) corresponds to कयन.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways; it has affected the pronunciation of Hindi words, and many English words have been localized in India. Some of these words appear as if they were native Hindi words, and Indians are sometimes unable to find the native equivalent of the English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without any awareness of which language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speaking but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S. No. | Root word(s) | Listing of keywords (Google, Bing, Guruji; as extracted) | Variant No. | Morphological variant | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | 1.1 | ब यत वष मवन | 50500 | 4680 | 485
 | | | 1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | 2.1 | द घमटन | 133000 | 2410 | 284
 | | | 2.2 | द घमटन ओ | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | 3.1 | ब ष | 161000 | 8330 | 961
 | | | 3.2 | ब ष ए | 6200 | 935 | 188
 | | | 3.3 | ब ष ओ | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | 4.1 | झ र | 4740 | 278 | 25
 | | | 4.2 | झ र | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | 5.1 | आऩद | 102000 | 4030 | 410
 | | | 5.2 | आऩद ए | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | 6.1 | ऩ परज तत | 48200 | 1670 | 98
 | | | 6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
 | | | 6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | 7.1 | सभसम | 584000 | 30200 | 1889
 | | | 7.2 | सभसम ओ | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | 8.1 | कीटन शक | 36300 | 1360 | 333
 | | | 8.2 | कीटन शको | 35800 | 800 | 270
9 | य ग | य गोय ग य ग य ग | 9.1 | य ग | 205000 | 21600 | 1423
 | | | 9.2 | य चगमो | 128000 | 3280 | 239
 | | | 9.3 | य ग | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | 10.1 | म जन | 673000 | 18500 | 3343
 | | | 10.2 | म जन ओ | 669000 | 6020 | 990
 | | | 10.3 | म जन ए | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | 11.1 | क दर | 261000 | 11300 | 655
 | | | 11.2 | क नदरो | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines

S. No. | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines
[Figure: precision (0 to 1) by query number for P Google, P Bing and P Guruji.]

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents by root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

From these results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high for almost all queries. Since, as mentioned above, Google does its indexing in the root form of keywords, the relevance of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
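The precision value over the top 10 results is the number of relevant results among the first 10 divided by 10. A minimal sketch of that calculation follows; the function name `precision_at_k` and the sample relevance judgements are hypothetical and are not data from the experiments.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance_flags[:k]
    return sum(1 for flag in top_k if flag) / k


# Hypothetical judgements for the first 10 results of one query
# (True = relevant to the query, False = not relevant).
judgements = [True, True, False, True, False, False, True, False, True, False]
print(precision_at_k(judgements))  # 5 relevant results out of 10 -> 0.5
```

Dividing by k rather than by the number of results returned means that a query returning fewer than 10 documents is penalised, which is one common convention; other conventions exist.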

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. Users may search with any of these keywords, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

Table 4.20 Results of the search engine on Phonetic nature of Hindi language

No. | Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision@10
1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | | ससजिमो जहयीर | 632 | 0.9
 | | | सिबजमो िहयीर | 194 | 0.9
2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | | आसभ न भहॉगाई | 1040 | 1.0
 | | | आसभ भह ग ई | 14 | 0.6
 | | | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | | भरशटाचायआज दी | 214 | 0.6
 | | | बयषटाचाय आज दी | 447 | 0.7
 | | | भरषटट च यआिादी | 1090000 | 0.9
 | | | भरशट च य आिादी | 1040 | 0.7
 | | | बयषटट च यआिादी | 1190 | 0.8
4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | | अनन हज य आॊदोरन | 85100 | 0.8
 | | | अनन हज य क आॉदोरन | 78 | 0.6
 | | | अनना हिाय क आनद रन | 399 | 0.5
 | | | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | | फयोिगायी | 80600 | 1.0
 | | | फयोिगायी | 170 | 0.7
 | | | फयोिगायी | 30 | 0.5

Figure 4.2 Precision Charts for Phonetic nature of Hindi language
[Figure: pie charts for Query No. 1 to Query No. 5, one slice per phonetic variant of the query.]

In the above table and figure it can be clearly seen that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query yields a different result set, i.e. a different set of documents is retrieved, with little overlap. From the precision chart it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. The native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly used synonyms: गहन ज वयअर क य.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S. No. | Query | Standard Hindi words | Variant No. | Query with synonym | Google: documents | P@10 | Bing: documents | P@10 | Guruji: documents | P@10
1 | स न क आब षण | आब षणगहन | 1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
 | | | 1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
 | | | 1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
 | | | 1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
 | | | 1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | 2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
 | | | 2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
 | | | 2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | 3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
 | | | 3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
 | | | 3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
 | | | 3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | 4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
 | | | 4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
 | | | 4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | 5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
 | | | 5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
 | | | 5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | 6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
 | | | 6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
 | | | 6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that changing a keyword into its synonym yields equivalent results. Therefore, synonyms of keywords play an important role in Hindi information retrieval.
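Replacing a keyword by its synonyms, as done for the queries in Table 4.21, can be sketched as a simple query expansion. The sketch is an illustration only, not the software used in this study: the function name `expand_query`, the small synonym dictionary and the Devanagari strings (an assumed rendering of the "gold ornaments" query) are all assumptions.

```python
def expand_query(query, synonyms):
    """Return the query followed by variants with one keyword swapped for a synonym."""
    words = query.split()
    variants = [query]
    for position, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            swapped = words[:position] + [alternative] + words[position + 1:]
            variants.append(" ".join(swapped))
    return variants


# Hypothetical synonym dictionary: "ornament" and two commonly used synonyms.
SYNONYMS = {"आभूषण": ["गहने", "जेवर"]}
QUERY = "सोने के आभूषण"  # assumed rendering of "gold ornaments"

for variant in expand_query(QUERY, SYNONYMS):
    print(variant)
```

Each variant can be submitted separately and the result lists merged; how duplicates are merged and ranked is a separate design decision that the sketch does not address.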

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi; the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in another context. The fourth and sixth columns give the English meaning of the query for the respective sense of the ambiguous keyword.

Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed in the table above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | na | Snake head | na | na
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the above tables, the last column, labeled "Other context", holds the number of results which are not relevant to the query supplied, i.e. documents which contain the keyword in some other, unwanted context. From the results it is clear that all search engines return documents in different contexts. Therefore it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art", but the results show that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speaking but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. The word "exercise" (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Hindi forms as extracted (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents not only for the single-keyword query but also for every phonetic variant of the keyword, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the first Hindi entry in each row (bold in the original) is the keyword obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table above that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR in terms of precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (design and working discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Thus the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
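The two-level transformation can be sketched as follows. This is not the software described in the next chapter, only a hedged illustration: the function name `transform_levels` and the replacement dictionary are hypothetical, and romanized tokens stand in for the Devanagari keywords so that the example matches the romanization given above.

```python
def transform_levels(query, english_equivalents):
    """Return [level-1 query, level-2 query] for a Hindi query.

    Level 1 replaces the first replaceable keyword only; level 2 replaces
    every replaceable keyword. Level 2 is omitted when only one keyword
    has an English equivalent.
    """
    words = query.split()
    hits = [i for i, word in enumerate(words) if word in english_equivalents]
    if not hits:
        return []

    level1 = list(words)
    level1[hits[0]] = english_equivalents[words[hits[0]]]
    variants = [" ".join(level1)]

    if len(hits) > 1:
        level2 = [english_equivalents.get(word, word) for word in words]
        variants.append(" ".join(level2))
    return variants


# Hypothetical dictionary of Hindi keywords and their English equivalents
# (in practice the English words would be written in Devanagari script).
EQUIVALENTS = {"videshi": "foreign", "nivesh": "investment"}

for level, variant in enumerate(transform_levels("videshi nivesh bharat mein", EQUIVALENTS), 1):
    print(f"Level {level}: {variant}")
```

The sketch always replaces the first matching keyword at level 1; choosing which keyword to replace first (for example, the one whose English form is most frequent on the web) is a design choice left open here.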


Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

Query in English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)

Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520, 8360, 1020

Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800, 37200, 2150

Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000, 99100, 1770

Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000, 51600, 93

Foreign investment in India | विदेशी निवेश भारत में | फारेन निवेश भारत में | फारेन इनवेस्टमेंट भारत में | 658000, 527, 66

Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000, 19700, 47000

Figure 4.4 Transformed queries into two equivalent senses (chart of Google result counts for Hindi Query 1 to Hindi Query 6, each with its Level 1 and Level 2 variants)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

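Table 4.28 Analysis of precision values (tabular representation)

Query set | Variant | Query | Precision@10 Google | Precision@10 Bing | Precision@10 AltaVista

HQ1 | Hindi query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9

HQ1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9

HQ1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8

HQ2 | Hindi query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8

HQ2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7

HQ2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5

HQ3 | Hindi query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9

HQ3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7

HQ3 | Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1

HQ4 | Hindi query | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6

HQ4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7

HQ4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8

HQ5 | Hindi query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9

HQ5 | Level 1 | फारेन निवेश भारत में | 0.8 | 0.8 | 0.9

HQ5 | Level 2 | फारेन इनवेस्टमेंट भारत में | 0.8 | 0.4 | 0.4

HQ6 | Hindi query | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1

HQ6 | Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1

HQ6 | Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1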


Figure 4.5 Analysis of inter-precision values

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the retrieved data. The relevance of the retrieved data can be seen in the precision columns. For a Hindi query and its transformed variations, the degree of relevance of the documents is in most cases close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; a similar pattern holds for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, because the Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variation based on the factor of English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
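The precision values discussed above are precision at 10, that is, the fraction of the first ten retrieved documents that are judged relevant. A minimal sketch of that calculation is given below; the function name `precision_at_k` and the list-of-booleans input format are assumptions made for illustration, and the relevance judgments themselves must still come from a human assessor.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results that are relevant.

    relevance: list of booleans in rank order, True if the document
    at that rank was judged relevant. Ranks missing from a short
    result list count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant_in_top_k = sum(1 for judged in relevance[:k] if judged)
    return relevant_in_top_k / k


if __name__ == "__main__":
    # Illustrative judgments: 9 of the first 10 documents are relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of returned documents keeps the scores of different queries comparable even when an engine returns fewer than ten results.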

[Figure 4.5: Inter-precision chart. Horizontal axis: HQ1 to HQ6 (HQ = Hindi Query), each followed by L1 and L2 (L = Level). Series: Google, Bing, AltaVista. The plotted values are those listed in Table 4.28.]


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
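The idea of handling query variants at the interface level, rather than inside the search engine, can be illustrated with the short sketch below. It is not the software described in the next chapter: `search_with_variants`, `fake_search` and `FAKE_INDEX` are hypothetical names, and the search backend is a stand-in dictionary rather than a call to any real search engine API.

```python
from typing import Callable, Iterable, List


def search_with_variants(
    variants: Iterable[str],
    search: Callable[[str], List[str]],
    limit: int = 10,
) -> List[str]:
    """Send every query variant to `search` and merge the results.

    Results keep the order in which they are first seen, and a
    document returned for several variants is listed only once.
    """
    seen = set()
    merged = []
    for query in variants:
        for document in search(query):
            if document not in seen:
                seen.add(document)
                merged.append(document)
    return merged[:limit]


# Stand-in backend used only for this demonstration (hypothetical data).
FAKE_INDEX = {
    "q-original": ["doc1", "doc2"],
    "q-level1": ["doc2", "doc3"],
}


def fake_search(query: str) -> List[str]:
    """Return the stored results for a query, or an empty list."""
    return FAKE_INDEX.get(query, [])


if __name__ == "__main__":
    print(search_with_variants(["q-original", "q-level1", "q-level2"], fake_search))
```

Because the backend is passed in as a function, the same interface layer could sit in front of any search engine without changing the engine itself, which is the point made in the paragraph above.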


reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.

A decade ago there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of new character encoding from the Unicode Consortium (a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India) proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, it ensured that the Hindi font was available to most of them as they upgraded the operating software.

In 2004 the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.

"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978, and was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in translation into Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' for a repository of a Hindi word database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been noted since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Bombay, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a consortium-mode project for building computational tools and a Sanskrit-Hindi MT system under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharati

IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004, with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, initiated in 1996 for the translation of Government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent the English as well as the Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of the University of Hyderabad has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The translation system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text for text manipulation. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates an English sentence into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with assistance from Sequoia Capital. Guruji.com uses crawl technology based on proprietary algorithms. For any query, it goes deep into Indian-language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides an automated translation facility from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation and collection, and in content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government recognized the need for Indian language technology early. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS) was developed by Avinash Chopde; ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et al. 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also making efforts for the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defense Research and Development Organization), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. The following fall into the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats

Google | 37 | 35

Yahoo | 32 | 25

Rediff | 7 | 4

Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website | 2008 Stats | 2009 Stats

Google | 81 | 76

Yahoo | 7 | 7

Wikipedia | 3 | 6

English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website | 2008 Stats | 2009 Stats

Google | 65 | 34

Yahoo | 12 | 29

Rediff | 4 | 15

Teluguone | 2 | 0

Guruji | 1 | 18

Raftaar | 1 | 02

Hindi | 1 | NA

Webdunia | 1 | 07

Khoj | 07 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies using present-day popular techniques shows a severe problem of low recall when accessing information with Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages, because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa", "वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला", is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. But just saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 120

Table 4.15: Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards of the language are seldom followed, for various reasons. One of the reasons is the language diversity in India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in changes of dialect not only in speaking but also in writing. Every language attaches markers to a root word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) in Hindi are morphological variants of the root word Yojnaa (म जन), "planning" in English. It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
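The stemming idea described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a tiny hand-written list of plural suffixes; the names `PLURAL_SUFFIXES` and `stem_hindi` are hypothetical and this is not the stemmer used in the study. A practical Hindi stemmer would need a far larger rule set.

```python
# Hypothetical suffix list: a handful of Hindi plural endings, for illustration only.
PLURAL_SUFFIXES = ("ओं", "एँ", "एं", "ों")


def stem_hindi(word: str) -> str:
    """Return the word with one known plural suffix removed, if present."""
    for suffix in PLURAL_SUFFIXES:
        # Keep at least two characters so very short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


root = "योजना"  # Yojnaa
# Yojnaaon, Yojnaayein and the root itself
variants = [root + PLURAL_SUFFIXES[0], root + PLURAL_SUFFIXES[1], root]
stems = {stem_hindi(w) for w in variants}
print(stems)
```

All three variants collapse to the same root, so a query written with any of them could be matched against the same index entry.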

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the alphabets of Indian languages. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अ गर ज (angrējī, meaning "English") can be spelled in several ways.

There are numerous words that are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword school (सक र) and for a non-standard phonetic equivalent (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words; for example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for the phonetic equivalents of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved, with little repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his or her needs.
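One way to reduce such spelling variation is to normalize keywords before matching. The sketch below is a hedged illustration, not the method of this study: it assumes just two rules (dropping the nukta dot and writing chandrabindu as anusvara), and the function name `normalize_spelling` is hypothetical. It does not handle variants such as ISKOOL, which add a whole syllable.

```python
import unicodedata

NUKTA = "\u093c"          # combining dot used in letters such as za and fa
CHANDRABINDU = "\u0901"   # nasalisation mark, often written as anusvara instead
ANUSVARA = "\u0902"


def normalize_spelling(word: str) -> str:
    """Map a Devanagari word to a simplified spelling (two assumed rules only)."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base letter + nukta.
    decomposed = unicodedata.normalize("NFD", word)
    without_nukta = decomposed.replace(NUKTA, "")
    return without_nukta.replace(CHANDRABINDU, ANUSVARA)


# "azaadi" written with and without the nukta
with_nukta = "\u0906\u095b\u093e\u0926\u0940"
plain = "\u0906\u091c\u093e\u0926\u0940"
# "mahangai" written with chandrabindu and with anusvara
with_chandrabindu = "\u092e\u0939\u0901\u0917\u093e\u0908"
with_anusvara = "\u092e\u0939\u0902\u0917\u093e\u0908"

print(normalize_spelling(with_nukta) == normalize_spelling(plain))
print(normalize_spelling(with_chandrabindu) == normalize_spelling(with_anusvara))
```

Applying the same normalization to both the query and the indexed keywords would let these spelling pairs match each other.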

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन ज वयअर क य

4.8.4 Ambiguous words

Ambiguous words reduce the relevancy of the results, as the examples below show. Consider the following query:

(In English) (Women like gold)

(In Hindi) (न यी क स न ऩस द ह )

In this query the word स न (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word स न means gold, but the query can also be interpreted as "women like to sleep".

Another query: (In English) (The common people's choice)

(In Hindi) (आभ र गो की ऩस द)

Here the word आभ is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कय (hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कय (to do) corresponds to कयन.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


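Table 4.18: Effect of morphological factors on Hindi queries

Each group is headed by its root word(s) and the keyword listings shown by Google, Bing and Guruji; the rows give the documents returned for each morphological variant.

S. No. | Morphological variant | Google | Bing | Guruji
1 | Root words: ब यत वष मवन | Listings: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 | Root word: द घमटन | Listings: द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3 | Root word: ब ष | Listings: ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4 | Root word: झ र | Listings: झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5 | Root word: आऩद | Listings: आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6 | Root words: ऩ परज तत | Listings: ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | Root word: सभसम | Listings: सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8 | Root word: कीटन शक | Listings: कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9 | Root word: य ग | Listings: य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10 | Root word: म जन | Listings: म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11 | Root word: क दर | Listings: क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105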


Table 4.19: Precision values of the three search engines

S. No. | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (x-axis: query numbers 1.1 to 11.1; y-axis: precision from 0 to 1; series: P Google, P Bing, P Guruji)


It has been observed that all three search engines return more documents when a query with the root word is submitted. This justifies searching for documents by root word because, in general, we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high in almost all queries. Since, as mentioned above, Google appears to index keywords in their root form, the relevancy of its results is also higher than that of the other two search engines, which indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
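The precision values discussed above are computed over the top 10 results. As a minimal sketch, assuming each result has already been judged relevant or not by hand, the calculation looks as follows; the function name `precision_at_k` and the sample judgments are illustrative only.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    relevance: list of booleans in ranked order, True if the result is relevant.
    Missing results (fewer than k returned) count as non-relevant.
    """
    top = relevance[:k]
    return sum(top) / k


# Hypothetical judgments: 5 of the first 10 results are relevant.
judged = [True, True, False, True, False, False, True, False, True, False]
print(precision_at_k(judged))
```

Dividing by k rather than by the number of results returned means an engine that returns only a few documents is not rewarded for a short result list.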

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for the phonetic equivalents of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

Each group is headed by the Hindi query (with its standard keywords) and the phonetic variations of those keywords; the rows give the Google results for each query variant.

Query variant | No. of results | Precision@10
Query 1: ससजिमो भ िहयीर ऩद थम | Phonetic variations: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9
Query 2: आसभान छ त भहॊगाई | Phonetic variations: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7
Query 3: भरषटाचाय स आिादी | Phonetic variations: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8
Query 4: अननाहिाय क आनदोरन | Phonetic variations: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0
Query 5: फयोिगायी सभसम सभ ध न | Phonetic variations: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5


Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart each for Query No. 1 to Query No. 5)

In the above table and figure it can be seen that the search engine returns documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved, with little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his or her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन ज वयअर क य

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.


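Table 4.21: Effect of word synonyms on Hindi IR

Each group is headed by the query, its standard Hindi word and the synonyms listed for it; the rows give the documents returned and the Precision@10 for each search engine.

S. No. | Query | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10
1 | Query: स न क आब षण | Standard Hindi word: आब षण | Synonyms: गहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | Query: क र फ दर | Standard Hindi word: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | Query: सतर सशिकतकयण | Standard Hindi word: सतर | Synonyms: न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | Query: मसक दयक अ हक य | Standard Hindi word: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | Query: व रग ओ | Standard Hindi word: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | Query: आ खद न | Standard Hindi word: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1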


Figure 4.3: Comparison of precision values against three search engines (x-axis: query numbers 1.1 to 6.3; y-axis: precision from 0 to 1; series: Google, Bing, Guruji)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, it can be seen that Google and Bing provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It thus becomes clear that comparable results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in Hindi information retrieval.
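The use of synonyms discussed above can be sketched as a simple query expansion step. This is an illustrative sketch under stated assumptions, not the software developed in this work: the function name `expand_query` and the small synonym dictionary are hypothetical, and the Devanagari spellings are a reconstruction of the ornament example (query 1 of Table 4.21).

```python
from itertools import product


def expand_query(tokens, synonyms):
    """Return every query obtained by swapping tokens for their listed synonyms."""
    # For each token, the original word comes first, followed by its synonyms.
    choices = [[tok] + synonyms.get(tok, []) for tok in tokens]
    return [" ".join(combo) for combo in product(*choices)]


# Hypothetical synonym list for "ornament", following the example in the text.
synonyms = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}
queries = expand_query(["सोने", "के", "आभूषण"], synonyms)
for q in queries:
    print(q)
```

Each expanded query can then be submitted separately and the result lists merged, which is one way to reach documents that use a different synonym from the one the user typed.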

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present below five randomly selected queries. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.


Table 4.22: List of randomly selected ambiguous queries

S. No. | Query | Keyword as | In English | Keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the above tables, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. The results make clear that all the search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सऩ यो क पन (art of snake charmers), whose other reading is "snake charmer's snake head", the retrieved documents are expected to be in the context of art. From the results, however, it can be seen that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the issue of ambiguity in keywords in order to obtain better results.


4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word exercise is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains have been tested.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

From the above table it can be observed that the search engine returns documents not only for the single-keyword query but also for all phonetic variants of the keyword, and in large numbers. It can also be seen that people have their own ways of representing Hindi words, and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the first Hindi form in each row is the keyword obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table above that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:

प य न तनव श ब यत भ (Foreign nivesh Bharat mein)

प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
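The two-level transformation described above can be illustrated with a short sketch. It is not the software described in the next chapter: the function name `transform_levels`, the lookup table and the Devanagari spellings are assumptions made for illustration, and Level 2 is simplified here to "replace every keyword that has an English equivalent".

```python
def transform_levels(tokens, english_equivalents):
    """Return (level1, level2) variants of a tokenised Hindi query.

    level1 replaces only the first keyword that has an English equivalent;
    level2 replaces every keyword that has one.
    """
    level1, replaced = [], False
    for tok in tokens:
        if not replaced and tok in english_equivalents:
            level1.append(english_equivalents[tok])
            replaced = True
        else:
            level1.append(tok)
    level2 = [english_equivalents.get(tok, tok) for tok in tokens]
    return " ".join(level1), " ".join(level2)


# Hypothetical lookup table: Hindi keyword -> English word written in Devanagari.
equivalents = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
tokens = ["विदेशी", "निवेश", "भारत", "में"]  # videshi nivesh Bharat mein
level1, level2 = transform_levels(tokens, equivalents)
print(level1)  # Foreign nivesh Bharat mein
print(level2)  # Foreign investment Bharat mein
```

The original query and both variants can then be submitted to the search engine and their results compared, as is done for the sample queries in this section.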


Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4: Transformed queries into two equivalent senses (one chart each for Hindi Query 1 to Hindi Query 6, comparing the original query with Level 1 and Level 2)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. Since for search engines the quality of results is more important than the quantity, Table 4.28 and Figure 4.5 present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

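Table 4.28: Analysis of precision values (tabular representation)

Each row lists a Hindi query (HQ) with its Level 1 (L1) and Level 2 (L2) influenced variants, followed by the Precision@10 of the three forms for each search engine.

Hindi query / Level 1 / Level 2 | Google (HQ, L1, L2) | Bing (HQ, L1, L2) | AltaVista (HQ, L1, L2)
सव सथऔययकतद न / ह लथऔययकतद न / ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज / बरडपर शयक इर ज / बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक / ह टमचचककतसक / क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन / सयक यदव य य जग यसकीभ / सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ / प य नतनव शब यतभ / प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत / कयपशनभ कतब यत / कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1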


Figure 4.5: Analysis of inter-precision values (inter-precision chart; HQ = Hindi query, L = level; series: Google, Bing, AltaVista)

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 out of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same pattern holds for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware of the presence of such information on the web and may be unable to formulate the query variations that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (a detailed description is given in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in Hindi [66][67].


Konkani, Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products. If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few, Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language: "Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus containing approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in the translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' to build a repository of a Hindi word database and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date back to 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University, Kolkata. TDIL started a consortium-mode project in April 2008 for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. To develop automated translation systems for regional languages, the Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent the English as well as the Hindi grammar. Initially it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable to the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at (http://ebmt.serc.iisc.ernet.in/mt/login.html). Shakti is a rule-based system that also uses a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at (http://shakti.iiit.net) [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat/

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with support from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian-language content and tries to return appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation/collection and content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers and has also developed high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government has long been aware of the need for language technology. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating very fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario. The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days ago. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by JuxtConsult (2008-2009 Report)

4.6.1.1 Most Used Websites

| Website | 2008 Stats | 2009 Stats |
| --- | --- | --- |
| Google | 37 | 35 |
| Yahoo | 32 | 25 |
| Rediff | 7 | 4 |
| Orkut | 6 | 7 |

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

| Website | 2008 Stats | 2009 Stats |
| --- | --- | --- |
| Google | 81 | 76 |
| Yahoo | 7 | 7 |
| Wikipedia | 3 | 6 |
| English | 3 | 4 |

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

| Website | 2008 Stats | 2009 Stats |
| --- | --- | --- |
| Google | 65 | 34 |
| Yahoo | 12 | 29 |
| Rediff | 4 | 15 |
| Teluguone | 2 | 0 |
| Guruji | 1 | 18 |
| Raftaar | 1 | 0.2 |
| Hindi | 1 | NA |
| Webdunia | 1 | 0.7 |
| Khoj | 0.7 | 2 |

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords do exist in the second search result. But just saying that we have a recall problem may not be sufficient. The next obvious question is 'how much of a problem is it?' In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs and by how much.


Table 4.15 Problems faced while searching in Hindi: Low recall

| Hindi Query | Google | Yahoo | Guruji |
| --- | --- | --- | --- |
| वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला | 0 | 0 | 0 |
| इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0 |

Table 4.16 Improved Recall

| Hindi Query | Google | Yahoo | Guruji |
| --- | --- | --- | --- |
| वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12 |
| वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1 |
| इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1 |
| बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93 |

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. However, the grammatical and structural standards are rarely followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being an official language of India, is influenced by the regional languages, which changes its dialects not only in speech but in writing as well. Every language attaches markers to a root word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, in Hindi, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, 'planning' in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
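Word stemming as described above can be illustrated with a very small suffix-stripping routine. This is a minimal sketch under stated assumptions: the suffix list is a short illustrative selection chosen for this example (not a complete description of Hindi morphology), and the function name `stem` and the minimum stem length are hypothetical choices.

```python
# Illustrative plural/oblique suffixes (an assumption, far from complete),
# ordered longest first so that the longest match is stripped.
SUFFIXES = [
    "\u093f\u092f\u094b\u0902",  # ियों
    "\u0913\u0902",              # ओं
    "\u090f\u0902",              # एं
    "\u090f\u0901",              # एँ
    "\u094b\u0902",              # ों
    "\u0947\u0902",              # ें
]

def stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

root = "\u092f\u094b\u091c\u0928\u093e"  # योजना (Yojnaa)
for variant in (root, root + "\u0913\u0902", root + "\u090f\u0902"):
    print(variant, "->", stem(variant))
```

With this list, the three forms Yojnaa, Yojnaaon and Yojnaayein all reduce to the same root, so a query and a document using different variants can be matched on the stem.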

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together, these have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning English) has several possible spelling variations.

There are numerous words that are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword for school, स्कूल, and for its non-standard Hindi phonetic equivalent, सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population of India in different states. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
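One way to treat some spelling variants as equivalent is to map each keyword to a normalized key before comparison. The sketch below is only an illustration: the folding rules in `FOLD` are assumptions chosen for this example (they are not a standard), and `spelling_key` is a hypothetical helper name. It uses Unicode NFD normalization from the Python standard library so that precomposed nukta letters and base letter plus nukta are handled alike.

```python
import unicodedata

# Illustrative folding rules (assumptions, not a standard): characters that
# are often interchanged in practice are mapped to one representative.
FOLD = {
    "\u0901": "\u0902",  # chandrabindu -> anusvara
    "\u0940": "\u093f",  # vowel sign II -> vowel sign I
    "\u0942": "\u0941",  # vowel sign UU -> vowel sign U
    "\u0908": "\u0907",  # letter II -> letter I
    "\u090a": "\u0909",  # letter UU -> letter U
}
NUKTA = "\u093c"

def spelling_key(word):
    """Return a key that is equal for words differing only by the folded marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(FOLD.get(ch, ch) for ch in decomposed if ch != NUKTA)

with_nukta = "\u0905\u0902\u0917\u094d\u0930\u0947\u095b\u0940"     # अंग्रेज़ी
without_nukta = "\u0905\u0902\u0917\u094d\u0930\u0947\u091c\u0940"  # अंग्रेजी
print(spelling_key(with_nukta) == spelling_key(without_nukta))
```

Character folding of this kind does not cover variants that add a sound, such as the ISKOOL form; those would still need an explicit list of phonetic equivalents.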

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) (Women like gold)

(In Hindi) (नारी को सोना पसंद है)

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query: (In English) (The common people's choice)

(In Hindi) (आम लोगों की पसंद)

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango. So the above query can be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर (tax) has the synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स in one context; in another context कर (hand or arm) has the synonyms हस्त, बाहु, आच, शफय; and in yet another context कर (to do) corresponds to करना.
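The dependence of word sense on context can be illustrated with a simple overlap count between the query words and cue words for each sense. This is a toy sketch, not a full word sense disambiguation method: the `SENSES` table and its cue words are hypothetical entries invented for this example, and `pick_sense` is an assumed helper name.

```python
# Hypothetical cue words per sense (assumptions for illustration only).
SENSES = {
    "सोना": {
        "gold": {"गहना", "आभूषण", "कीमत"},
        "to sleep": {"रात", "नींद", "बिस्तर"},
    },
}

def pick_sense(word, context_tokens, senses=SENSES):
    """Return the sense whose cue words overlap the context most, or None."""
    candidates = senses.get(word)
    if not candidates:
        return None
    context = set(context_tokens)
    scores = {sense: len(cues & context) for sense, cues in candidates.items()}
    best = max(scores, key=scores.get)
    # No overlapping cue word means the query stays ambiguous.
    return best if scores[best] > 0 else None

print(pick_sense("सोना", "रात को जल्दी सोना".split()))
print(pick_sense("सोना", "नारी को सोना पसंद है".split()))
```

The second call has no cue word in its context, which mirrors the example above: the short query alone does not say whether gold or sleeping is meant.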

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways; it has affected the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians without being aware of the language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too. This becomes more evident in Hindi literature on the web. The influence of English on the Hindi language has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

| S No | Query in Hindi | Meaning in English |
| --- | --- | --- |
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Bird's species |
| 6.2 | ऩकषमोकीपरज ततम | Bird's species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |


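Table 4.18 Effect of morphological factors on Hindi queries

| S No | Root words | Listing of keywords (Google, Bing, Guruji) | Morphological variant | Documents returned: Google | Documents returned: Bing | Documents returned: Guruji |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | | |
| 1.1 | | | ब यत वष मवन | 50500 | 4680 | 485 |
| 1.2 | | | ब यत मवष मवनो | 40400 | 680 | 61 |
| 2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | | | |
| 2.1 | | | द घमटन | 133000 | 2410 | 284 |
| 2.2 | | | द घमटन ओ | 117000 | 420 | 23 |
| 3 | ब ष | ब ष ब ष ओ ब ष ब ष | | | | |
| 3.1 | | | ब ष | 161000 | 8330 | 961 |
| 3.2 | | | ब ष ए | 6200 | 935 | 188 |
| 3.3 | | | ब ष ओ | 6090 | 441 | 356 |
| 4 | झ र | झ रझ रो झ र झ र | | | | |
| 4.1 | | | झ र | 4740 | 278 | 25 |
| 4.2 | | | झ र | 1270 | 28 | 1 |
| 5 | आऩद | आऩद आऩद ओ आऩद आऩद | | | | |
| 5.1 | | | आऩद | 102000 | 4030 | 410 |
| 5.2 | | | आऩद ए | 1160 | 64 | 20 |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | | | | |
| 6.1 | | | ऩ परज तत | 48200 | 1670 | 98 |
| 6.2 | | | ऩकषमोपरज तत | 47600 | 1150 | 84 |
| 6.3 | | | ऩकषमोपरज ततम | 33800 | 747 | 25 |
| 7 | सभसम | सभसम ए सभसम ओ सभसम सभसम सभसम | | | | |
| 7.1 | | | सभसम | 584000 | 30200 | 1889 |
| 7.2 | | | सभसम ओ | 584000 | 7150 | 1356 |
| 8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | | | |
| 8.1 | | | कीटन शक | 36300 | 1360 | 333 |
| 8.2 | | | कीटन शको | 35800 | 800 | 270 |
| 9 | य ग | य गोय ग य ग य ग | | | | |
| 9.1 | | | य ग | 205000 | 21600 | 1423 |
| 9.2 | | | य चगमो | 128000 | 3280 | 239 |
| 9.3 | | | य ग | 112000 | 6280 | 647 |
| 10 | म जन | म जन ओ म जन म जन म जन | | | | |
| 10.1 | | | म जन | 673000 | 18500 | 3343 |
| 10.2 | | | म जन ओ | 669000 | 6020 | 990 |
| 10.3 | | | म जन ए | 673000 | 2860 | 416 |
| 11 | क दर | क दरीमक दर क दर क दर | | | | |
| 11.1 | | | क दर | 261000 | 11300 | 655 |
| 11.2 | | | क नदरो | 29500 | 1850 | 105 |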


Table 4.19 Precision values of the three search engines

| S No | Query | Precision@10: Google | Precision@10: Bing | Precision@10: Guruji |
| --- | --- | --- | --- | --- |
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1 Precision values of the three search engines (y-axis: precision from 0 to 1; x-axis: query numbers 1.1 to 11.1; series: P Google, P Bing, P Guruji)

All three search engines return more documents when the query is submitted with the keyword in its root form. This supports searching with root words, because in general the keywords give better results in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in Table 4.18.

From these results it appears that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. Overall, the comparison across the three search engines in the tables above shows that the quantity of results generally increases when keywords are used in their root form.

For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. Precision is calculated over the top 10 results of each search engine. On close observation, the precision of Google is high in almost all queries. Since Google, as mentioned above, indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. Morphological variation in the keywords therefore affects not only the quantity but also the quality of the results.
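The precision values reported in Table 4.19 are precision at 10, i.e. the fraction of the top 10 results that are relevant. The following Python fragment is a minimal sketch of that calculation; the function name and the sample relevance judgments are illustrative assumptions and are not data from the study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that were judged relevant.

    `relevance` is a list of booleans, one per result in ranked order.
    The judgments used below are hypothetical; the raw judgments behind
    Table 4.19 are not given in the text.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    # Divide by k rather than len(top): missing results count as non-relevant.
    return sum(1 for judged_relevant in top if judged_relevant) / k


# Illustrative judgments for one query: 5 of the top 10 results are relevant.
judged = [True, False, True, True, False, False, True, False, True, False]
print(precision_at_k(judged))
```

Dividing by k even when fewer than k results are returned is a design choice made here so that an engine returning a single relevant document does not score a perfect precision.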

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and the search engine should support all phonetically equivalent forms.

The following queries were randomly selected from the set of 50 queries tested on the Google search engine. Table 4.20 shows the number of results and the precision offered by Google.

Table 4.20. Results of the search engine on the phonetic nature of Hindi language

Columns: Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision at 10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05

Figure 4.2. Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, with one entry per phonetic variant)

The table and figure above show that the search engine returns a handful of documents for various phonetically equivalent Hindi queries. No particular standard exists for writing a keyword to fetch Hindi web data. The results vary for every phonetically equivalent keyword in the query, i.e. a different set of documents is retrieved, with very little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
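One way to picture the problem is to enumerate spelling variants of a keyword mechanically. The Python sketch below handles only two kinds of variation, both chosen here as assumptions for illustration: anusvara versus chandrabindu, and the presence or absence of a nukta. The variations in Table 4.20 are broader than this, and the function name is hypothetical.

```python
import unicodedata
from itertools import product

ANUSVARA = "\u0902"      # nasalisation dot written above the letter
CHANDRABINDU = "\u0901"  # "moon-dot" nasalisation sign
NUKTA = "\u093C"         # dot below, used for borrowed sounds such as z and f


def spelling_variants(word):
    """Return the sorted spelling variants of a Devanagari word.

    Only two substitution rules are applied (an assumption of this sketch):
    anusvara <-> chandrabindu, and nukta kept or dropped.
    """
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta,
    # so both encodings of the same letter are treated alike.
    decomposed = unicodedata.normalize("NFD", word)
    options = []
    for ch in decomposed:
        if ch in (ANUSVARA, CHANDRABINDU):
            options.append((ANUSVARA, CHANDRABINDU))
        elif ch == NUKTA:
            options.append((NUKTA, ""))
        else:
            options.append((ch,))
    return sorted({"".join(parts) for parts in product(*options)})


# A word with one nasalisation sign yields two spellings.
for variant in spelling_variants("महंगाई"):
    print(variant)
```

Each variant could then be submitted as a separate query and the result lists merged, which is the kind of support the discussion above asks of search engines.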

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and it often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 compare the precision values of the three search engines for such synonyms.

Table 4.21. Effect of word synonyms on Hindi IR (documents returned and precision at 10)

S. No. | Query | Google documents | Google precision | Bing documents | Bing precision | Guruji documents | Guruji precision
1 | स न क आब षण | standard Hindi word and synonyms: आब षण, गहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | standard Hindi word: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | standard Hindi word and synonyms: सतर, न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | standard Hindi word: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | standard Hindi word: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | standard Hindi word: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3. Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Synonyms affect not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns the most documents and Guruji the fewest; the reason may be that Guruji has fewer documents available or indexes them poorly. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It is thus clear that changing a keyword into a synonym yields equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
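The role of synonyms can be sketched as a simple query-expansion step: every keyword that has known synonyms produces one additional query per synonym. The Python fragment below is a minimal illustration under stated assumptions. The synonym table, the sample query and the function name are hypothetical, and the sketch ignores Hindi inflection, which a real tool would have to handle.

```python
def expand_with_synonyms(query, synonyms):
    """Return the original query followed by one variant per synonym.

    `synonyms` maps a keyword to a list of alternatives. Words are matched
    exactly, so inflected forms are not recognised (a limitation of this sketch).
    """
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            candidate = " ".join(words[:i] + [alternative] + words[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


# Hypothetical synonym table built from the ornament example in the text.
SYNONYMS = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}
QUERY = "सोने के आभूषण"  # illustrative query, "gold ornaments"

queries = expand_with_synonyms(QUERY, SYNONYMS)
for q in queries:
    print(q)
```

Each expanded query would be submitted separately, and the precision of each result list could then be compared as in Table 4.21.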

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in another context. The fourth and sixth columns give the meaning of the query in English for the respective context of the ambiguous keyword.


Table 4.22. List of randomly selected ambiguous queries

S. No. | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23. Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 (results found) | Context 2 (results found) | Other context
1 | सोना | 50800 | Gold (5) | To sleep (2) | 3
2 | आम | 488000 | Common (3) | Mango (3) | 4
3 | बाल | 2900000 | Children (7) | Hair (3) | 0
4 | फन | 184 | Art (0) | Snake head (10) | 0
5 | कुल | 17800 | Aggregate (2) | Family (3) | 5

Table 4.24. Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 (results found) | Context 2 (results found) | Other context
1 | सोना | 2680 | Gold (2) | To sleep (2) | 6
2 | आम | 17800 | Common (3) | Mango (3) | 4
3 | बाल | 4030 | Children (6) | Hair (2) | 2
4 | फन | 25 | Art (0) | Snake head (9) | 1
5 | कुल | 1900 | Aggregate (0) | Family (2) | 8

Table 4.25. Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 (results found) | Context 2 (results found) | Other context
1 | सोना | 109 | Gold (0) | To sleep (0) | 10
2 | आम | 6756 | Common (3) | Mango (0) | 7
3 | बाल | 635 | Children (5) | Hair (2) | 3
4 | फन | No results found | Art (n/a) | Snake head (n/a) | n/a
5 | कुल | 84 | Aggregate (0) | Family (0) | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column of each table, labelled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, unwanted context. All the search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art" rather than in the other context (snake charmer's snake head). However, Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
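The "Other context" figures can be checked with simple arithmetic. Every complete row of the ambiguity tables is consistent with the assumption that the three counts partition the top 10 results, so the other-context count is 10 minus the results found in the two listed contexts. The Python sketch below applies this to query 5; the function name is hypothetical, and the partition rule is inferred from the tables rather than stated in the text.

```python
TOP_K = 10  # the counts in the ambiguity tables sum to 10 per row

# (results in first context, results in second context) for query 5,
# copied from the Google, Bing and Guruji ambiguity tables.
QUERY_5 = {"Google": (2, 3), "Bing": (0, 2), "Guruji": (0, 0)}


def other_context(context_1, context_2, k=TOP_K):
    """Number of top-k results that fall outside both listed contexts."""
    if context_1 < 0 or context_2 < 0 or context_1 + context_2 > k:
        raise ValueError("context counts must be non-negative and sum to at most k")
    return k - context_1 - context_2


for engine, (first, second) in QUERY_5.items():
    print(engine, other_context(first, second))
```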

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Based on phonetic variations of this kind, a variety of popular keywords and queries from various domains were tested experimentally.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26. Keywords along with their phonetic variants written in Hindi

English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each of the four forms
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8

Table 4.26 shows that the search engine returns documents for a single-keyword query, and that it also returns documents, often in large numbers, for every phonetic variant of the keyword. People evidently have their own ways of writing these words in Hindi, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The column with bold Hindi entries, the first Hindi form in each row, holds the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not give the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are used to test the effect of English influence on the precision and quantity of documents retrieved. The software (whose design and working are discussed in the next chapter) transforms each Hindi query into its variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

Table 4.26 (continued)

Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged.

Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as follows:

Level 1: फॉरेन निवेश भारत में (foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (foreign investment Bharat mein)

The original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent queries that mix English and Hindi while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
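The two-level transformation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the software described in the next chapter; the lookup table of English equivalents, the function name and the exact Devanagari spellings are assumptions.

```python
from itertools import combinations


def transform_levels(query, equivalents):
    """Build level-1 and level-2 variants of a Hindi query.

    Level 1 replaces exactly one keyword with its English equivalent written
    in Devanagari; level 2 replaces two or more keywords. `equivalents` is a
    hypothetical lookup table supplied by the caller.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in equivalents]

    def substitute(positions):
        return " ".join(
            equivalents[word] if i in positions else word
            for i, word in enumerate(words)
        )

    level_1 = [substitute({i}) for i in replaceable]
    level_2 = [
        substitute(set(combo))
        for size in range(2, len(replaceable) + 1)
        for combo in combinations(replaceable, size)
    ]
    return level_1, level_2


# Assumed equivalents for the "foreign investment in India" example.
EQUIVALENTS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
QUERY = "विदेशी निवेश भारत में"

level_1, level_2 = transform_levels(QUERY, EQUIVALENTS)
print(level_1)
print(level_2)
```

With two replaceable keywords the sketch yields two level-1 variants and one level-2 variant. The example in the text shows only one level-1 variant, so a tool may choose to keep a single representative per level.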

Table 4.27. Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation; Google search results)

In English | Hindi query | Level 1 | Level 2 | Search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4. Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each comparing the Hindi query with Level 1 and Level 2)

The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the number of documents retrieved is considerable. For search engines, however, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve the web results.

Table 4.28. Analysis of precision values (tabular representation; precision at 10)

Query | Variant | Query text | Google | Bing | AltaVista
HQ1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
HQ1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
HQ1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
HQ2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
HQ2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
HQ2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
HQ3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
HQ3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
HQ3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1
HQ4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
HQ4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
HQ4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
HQ5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
HQ5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
HQ5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
HQ6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1
HQ6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
HQ6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1

Figure 4.5. Analysis of inter-precision values (HQ: Hindi query; L: level; series: Google, Bing, AltaVista)

The tables and figures above show that Hindi data of a similar nature can be mined for a Hindi query by transforming the query into variants that include English keywords written in Hindi script. The transformation of queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. The other Hindi queries in the table behave similarly.

Without such transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the variant queries that reflect the influence of English. The table also indicates that English-influenced Hindi information is present on the web and is increasing day by day. Including English keywords written in Hindi script in the query can therefore widen the scope of searching in Hindi and of obtaining relevant information.


Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard way of writing Hindi, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor look for keyword equivalents, because they may run into performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. This problem can, however, be solved at the interface level. To reduce the effort a Hindi user must spend to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of a web search in Hindi [66][67].


Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar.

Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first Indian corpus. It was developed for Indian English under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is the nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, work that is also part of the TDIL programme. The Enabling Minority Language Engineering (EMILLE) programme provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus of 200,000 words of English text, which helps in the translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of "One Million Pages" of knowledge-based text. Mahatma Gandhi International University has started the project "Hindi Samghraha" to build a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology, Government of India, has started the Indian Language Corpora Initiative (ILCI) to develop Indian language corpora. ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been noted since 1980 and involved prominent institutions such as IIT Kanpur, University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology for translating from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by the Chinmaya International Foundation (CIF). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was conceived by the Government of India in 1996 for translating government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent both the English and the Hindi grammar. Initially it was domain-specific, covering personnel administration, specifically Gazette notifications, office orders, office memorandums and circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology supported only English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (Rajya Sabha is the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with a translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates an English sentence into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it searches deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development for corpus creation and collection and for content syndication. Moreover, it offers language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working for the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in the research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides the standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating very fast. In the last year alone, more than 10-15 Indian local search engines must have been launched, some small and some big, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt a brief overview of the current scenario.

The following fall into the localized Indian search engine category: Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. They are at the testing stage, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. But merely saying that there is a recall problem is not sufficient. The obvious next question is: how much of a problem is it? In other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15 Problems faced while searching in Hindi: Low recall

Hindi Query | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला | 0 | 0 | 0
इन्डियन इंस्टिट्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च | 0 | 0 | 0

Table 4.16 Improved Recall

Hindi Query | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला | 8820 | 92 | 12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक | 331 | 10 | 1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च | 708 | 50 | 1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are seldom followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes dialects not only in speaking but in writing also. Every language attaches markers to a root word to construct new words: English uses s, es and ing, while Hindi uses maatraas (vowel signs) and suffixes such as ा and ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
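The stemming idea described above can be sketched as a small rule-based suffix stripper. This is only a minimal illustration, not the method used by any of the search engines discussed: the rule list `SUFFIX_RULES`, the function name `light_stem` and the `min_stem_len` guard are assumptions made for this example, and the rules cover just a few common plural and oblique endings.

```python
# Illustrative (hypothetical) suffix rules: (ending, replacement).
# Longer endings come first so that they win over shorter ones.
SUFFIX_RULES = [
    ("\u093e\u0913\u0902", "\u093e"),        # -ाओं  -> -ा  (योजनाओं -> योजना)
    ("\u093e\u090f\u0902", "\u093e"),        # -ाएं  -> -ा  (योजनाएं -> योजना)
    ("\u093e\u090f\u0901", "\u093e"),        # -ाएँ  -> -ा
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -ियों -> -ी  (रोगियों -> रोगी)
    ("\u093f\u092f\u093e\u0902", "\u0940"),  # -ियां -> -ी
    ("\u093f\u092f\u093e\u0901", "\u0940"),  # -ियाँ -> -ी
    ("\u094b\u0902", ""),                    # -ों   -> ""  (झीलों -> झील)
    ("\u0947\u0902", ""),                    # -ें   -> ""  (झीलें -> झील)
]


def light_stem(word, min_stem_len=2):
    """Map a Hindi word to a canonical form by rewriting one known ending."""
    for ending, replacement in SUFFIX_RULES:
        # Keep at least min_stem_len code points so short words are left alone.
        if word.endswith(ending) and len(word) - len(ending) >= min_stem_len:
            return word[: len(word) - len(ending)] + replacement
    return word


# Variants of the chapter's example all reduce to the same canonical form.
for variant in ["योजना", "योजनाओं", "योजनाएं"]:
    print(variant, "->", light_stem(variant))
```

A rule list this small will both miss variants and conflate unrelated words, so a production stemmer would need a larger rule set or a morphological analyzer.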

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. The variety in the alphabet, the different dialects, and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning English) has several possible spelling variations. Numerous words are phonetically equivalent but vary in writing. The word "school" can be written in Hindi in different ways (स्कूल, सकूल, स्कुल). When information is searched using the standard keyword स्कूल and the non-standard phonetic equivalent सकूल, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which produces phonetic variety in words. For example, the English word "school" (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with very little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.
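One way a retrieval system could treat phonetically equivalent spellings alike is to reduce each keyword to a normalized key before matching. The sketch below is an assumption-laden illustration, not a description of how Google, Bing or Guruji work: the function name `phonetic_key` and the three rules it applies (dropping the nukta, folding chandrabindu into anusvara, and rewriting a nasal consonant plus virama before a stop consonant as anusvara) are choices made for this example.

```python
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म


def phonetic_key(word):
    """Return a crude spelling-insensitive key for a Devanagari word."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "").replace(CHANDRABINDU, ANUSVARA)
    out = []
    i = 0
    while i < len(text):
        is_half_nasal = (
            text[i] in NASALS
            and i + 2 < len(text)
            and text[i + 1] == VIRAMA
            # Only before a stop consonant (क..भ); nasal + nasal is kept as is.
            and "\u0915" <= text[i + 2] <= "\u092d"
            and text[i + 2] not in NASALS
        )
        if is_half_nasal:
            out.append(ANUSVARA)  # e.g. न् + द  ->  ं + द
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Three spellings of "aandolan" collapse to one key.
print({phonetic_key(w) for w in ["आन्दोलन", "आंदोलन", "आँदोलन"]})
```

Because the key is deliberately lossy, it is suited to grouping candidate spellings for retrieval rather than to displaying text.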

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
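Synonym handling of this kind is often implemented as query expansion. The following minimal sketch assumes a hand-built synonym dictionary; the `SYNONYMS` table (seeded with the ornament example above) and the function `expand_query` are hypothetical names introduced only for illustration.

```python
# Toy synonym dictionary (an assumption for this sketch), seeded with the
# chapter's example: synonyms of the word for "ornament".
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            # Replace only the word at position i, keep the rest unchanged.
            variants.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return variants


for variant in expand_query("सोने के आभूषण"):
    print(variant)
```

Each variant could then be submitted separately and the result lists merged, which is the manual procedure the synonym experiments in this chapter follow.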

4.8.4 Ambiguous Words

Ambiguous words reduce the relevancy of the results. The examples below show this aspect clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the query above, सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the query above आम means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर means "tax" in one context (with synonyms such as शुल्क, सूद, महसूल and टैक्स), "hand or arm" in another (with synonyms such as हस्त and बाहु), and "to do" (करना) in a third.
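A very simple way to pick a sense from context is to count how many context words overlap with cue words for each sense, in the spirit of dictionary-overlap disambiguation. The sketch below is an illustration under stated assumptions: the cue lists in `SENSE_CUES` and the function `guess_sense` are invented for this example and are not taken from any lexical resource.

```python
# Hypothetical cue words per sense of the ambiguous keyword सोना.
SENSE_CUES = {
    "सोना": {
        "gold": {"आभूषण", "गहने", "चांदी", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर"},
    },
}


def guess_sense(keyword, query_words, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the query, or None."""
    context = set(query_words)
    scores = {
        sense: len(words & context)
        for sense, words in cues.get(keyword, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    # No cue matched, or two senses tie: the query stays ambiguous.
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


print(guess_sense("सोना", "सोना चांदी कीमत".split()))
print(guess_sense("सोना", "नारी को सोना पसंद है".split()))
```

The second call returns `None` because the example query from this section contains no cue word for either sense, which is exactly why such queries are hard for a search engine.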

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, who may be unaware of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes more evident when we consider Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from the set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No. | Query in Hindi | Meaning in English
1 | भारत वर्षावन | Indian rain forest
1.1 | भारतीय वर्षावनों | Indian rain forests
2 | हवाई दुर्घटना के कारण | Reason for air crash
2.1 | हवाई दुर्घटनाओं के कारण | Reason for air crashes
3 | भारत में बोली जाने वाली भाषा | Language spoken in India
3.1 | भारत में बोली जाने वाली भाषाएं | Languages spoken in India
3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India
4 | विलुप्त होने पर झील | Lake on the verge of extinction
4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction
5 | प्राकृतिक आपदा | Natural calamity
5.1 | प्राकृतिक आपदाएं | Natural calamities
6 | पक्षी की प्रजाति | Bird species
6.1 | पक्षियों की प्रजाति | Birds' species
6.2 | पक्षियों की प्रजातियां | Birds' species
7 | कृषि समस्या | Agriculture problem
7.1 | कृषि समस्याओं | Agriculture problems
8 | कीटनाशक का इस्तेमाल | Use of pesticide
8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides
9 | मानसिक रोग | Mental illness
9.1 | मानसिक रोगियों | Mental patients
9.2 | मानसिक रोगी | Mental patient
10 | ग्रामीण विकास योजना | Policy for village
10.1 | ग्रामीण विकास योजनाओं | Policies for village
10.2 | ग्रामीण विकास योजनाएं | Policies for village
11 | प्रमुख कृषि केंद्र | Major agricultural office
11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

Listing of keywords by search engine:

S. No. | Root word(s) | Google | Bing | Guruji
1 | भारत वर्षावन | भारत वर्षावन, वर्षावन, वन | वर्षावन, वन, भारत वर्षावन | वन
2 | दुर्घटना | दुर्घटना, दुर्घटनाओं | दुर्घटना | दुर्घटना
3 | भाषा | भाषा, भाषाओं | भाषा | भाषा
4 | झील | झील, झीलों | झील | झील
5 | आपदा | आपदा, आपदाओं | आपदा | आपदा
6 | पक्षी प्रजाति | पक्षी, पक्षियों, प्रजाति, प्रजातियों | पक्षी प्रजाति | पक्षी प्रजाति
7 | समस्या | समस्याएं, समस्याओं, समस्या | समस्या | समस्या
8 | कीटनाशक | कीटनाशक, कीटनाशकों | कीटनाशक | कीटनाशक
9 | रोग | रोगों, रोगी | रोग | रोग
10 | योजना | योजनाओं, योजना | योजना | योजना
11 | केंद्र | केंद्रीय, केंद्र | केंद्र | केंद्र

Documents returned for each morphological variant:

S. No. | Morphological variant | Google | Bing | Guruji
1.1 | भारत वर्षावन | 50500 | 4680 | 485
1.2 | भारतीय वर्षावनों | 40400 | 680 | 61
2.1 | दुर्घटना | 133000 | 2410 | 284
2.2 | दुर्घटनाओं | 117000 | 420 | 23
3.1 | भाषा | 161000 | 8330 | 961
3.2 | भाषाएं | 6200 | 935 | 188
3.3 | भाषाओं | 6090 | 441 | 356
4.1 | झील | 4740 | 278 | 25
4.2 | झीलें | 1270 | 28 | 1
5.1 | आपदा | 102000 | 4030 | 410
5.2 | आपदाएं | 1160 | 64 | 20
6.1 | पक्षी प्रजाति | 48200 | 1670 | 98
6.2 | पक्षियों प्रजाति | 47600 | 1150 | 84
6.3 | पक्षियों प्रजातियां | 33800 | 747 | 25
7.1 | समस्या | 584000 | 30200 | 1889
7.2 | समस्याओं | 584000 | 7150 | 1356
8.1 | कीटनाशक | 36300 | 1360 | 333
8.2 | कीटनाशकों | 35800 | 800 | 270
9.1 | रोग | 205000 | 21600 | 1423
9.2 | रोगियों | 128000 | 3280 | 239
9.3 | रोगी | 112000 | 6280 | 647
10.1 | योजना | 673000 | 18500 | 3343
10.2 | योजनाओं | 669000 | 6020 | 990
10.3 | योजनाएं | 673000 | 2860 | 416
11.1 | केंद्र | 261000 | 11300 | 655
11.2 | केन्द्रों | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines

S. No. | Query | Precision@10 Google | Bing | Guruji
1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1
1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1
2.1 | हवाई दुर्घटना के कारण | 0.5 | 0.3 | 0.1
2.2 | हवाई दुर्घटनाओं के कारण | 0.3 | 0.3 | 0.1
3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5
3.2 | भारत में बोली जाने वाली भाषाएं | 0.5 | 0.4 | 0.3
3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2
4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2
4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0
5.1 | प्राकृतिक आपदा | 1 | 0.4 | 0.3
5.2 | प्राकृतिक आपदाएं | 0.6 | 0.6 | 0.1
6.1 | पक्षी की प्रजाति | 1 | 0.7 | 0.2
6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1
6.3 | पक्षियों की प्रजातियां | 0.9 | 0.6 | 0.1
7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2
7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2
8.1 | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2
8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2
9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4
9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3
9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4
10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3
10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0
10.3 | ग्रामीण विकास योजनाएं | 0.8 | 0.6 | 0.2
11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0
11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (chart of precision, from 0 to 1, by query number for Google, Bing and Guruji)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents using the root word because, in general, we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the table above.
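The size of the drop can be made concrete by dividing the number of documents returned for a variant by the number returned for its root word. The short sketch below uses the counts reported in Table 4.18 for the root word of query 2 (row 2.1) and its plural variant (row 2.2); the function name `variant_to_root_ratio` is introduced here only for illustration.

```python
# Document counts copied from Table 4.18, rows 2.1 (root) and 2.2 (variant).
ROOT_COUNTS = {"Google": 133000, "Bing": 2410, "Guruji": 284}
VARIANT_COUNTS = {"Google": 117000, "Bing": 420, "Guruji": 23}


def variant_to_root_ratio(root_counts, variant_counts):
    """Fraction of the root-word result count that the variant still returns."""
    return {
        engine: variant_counts[engine] / root_counts[engine]
        for engine in root_counts
    }


ratios = variant_to_root_ratio(ROOT_COUNTS, VARIANT_COUNTS)
for engine, ratio in ratios.items():
    print(f"{engine}: {ratio:.2f}")
```

For this one query the variant keeps about 88% of Google's result count but only about 17% of Bing's and 8% of Guruji's, which is consistent with the observation that the variant form hurts Bing and Guruji far more than Google.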

The above results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since Google, as mentioned above, indexes keywords in their root form, the relevancy of its results is also high in comparison with the other two search engines. This shows that morphological variations in the keywords affect not only the quantity but also the quality of results.
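The precision value over the top 10 results can be computed as the fraction of those results judged relevant. The sketch below shows the calculation; the function name `precision_at_k` and the choice to divide by `k` even when fewer than `k` results are returned are assumptions of this example, since the chapter does not state how short result lists were handled.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that were judged relevant.

    `relevance` is a list of booleans in rank order (True = relevant).
    Missing results beyond the end of the list count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged_relevant in relevance[:k] if judged_relevant) / k


# Five relevant documents among the top ten give a precision of 0.5.
judgements = [True, True, False, True, False, True, False, False, True, False]
print(precision_at_k(judgements))
```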

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. Table 4.20 shows the results and the precision offered by Google.

Table 4.20 Results of the search engine on phonetic nature of Hindi language

Hindi query (standard keywords in bold) | Phonetic variations of the keywords
1. ससजिमो भ िहयीर ऩद थम | सिबजमो, सबज मो, जहयीर, जहरयर
2. आसभान छ त भहॊगाई | आसभ, भह ग ई, भ ह ग ई
3. भरषटाचाय स आिादी | भरशट च य, बयषटट च य, आज दी
4. अननाहिाय क आनदोरन | अनन हज य, आ द रन, आ द रन
5. फयोिगायी सभसम सभ ध न | फ य जग यी, फय जग यी, फ य जा यी

Query | Google results for query having keywords | No. of results | Precision@10
1 | ससजिमोिहयीर | 97 | 0.9
1 | ससजिमो जहयीर | 632 | 0.9
1 | सिबजमो िहयीर | 194 | 0.9
2 | आसभ न भहॊगाई | 35300 | 1.0
2 | आसभ न भहॉगाई | 1040 | 1.0
2 | आसभ भह ग ई | 14 | 0.6
2 | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचायआज दी | 211000 | 0.8
3 | भरशटाचायआज दी | 214 | 0.6
3 | बयषटाचाय आज दी | 447 | 0.7
3 | भरषटट च यआिादी | 1090000 | 0.9
3 | भरशट च य आिादी | 1040 | 0.7
3 | बयषटट च यआिादी | 1190 | 0.8
4 | अनन हज य आनदोरन | 84700 | 0.3
4 | अनन हज य आॊदोरन | 85100 | 0.8
4 | अनन हज य क आॉदोरन | 78 | 0.6
4 | अनना हिाय क आनद रन | 399 | 0.5
4 | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी | 9650 | 0.9
5 | फयोिगायी | 80600 | 1.0
5 | फयोिगायी | 170 | 0.7
5 | फयोिगायी | 30 | 0.5

Figure 4.2 Precision charts for phonetic nature of Hindi language (pie charts for Query No. 1 to Query No. 5)

In the above table and figure, it can be clearly seen that search engines return a handful of documents for various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with very little overlap. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

Documents returned and precision at 10 (P@10) for each query:

S. No. | Query | Google | P@10 | Bing | P@10 | Guruji | P@10
1 | सोने के आभूषण (standard Hindi words and synonyms: आभूषण, गहने)
1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | काले बादल (standard Hindi word: बादल)
2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | स्त्री सशक्तिकरण (standard Hindi words and synonyms: स्त्री, नारी)
3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | सिकंदर का अहंकार (standard Hindi word: अहंकार)
4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | वृक्ष लगाओ (standard Hindi word: वृक्ष)
5.1 | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आंख दान (standard Hindi word: आंख)
6.1 | आंख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | नेत्र दान | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the above table and figure, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.

Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 134

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, unintended context. The results show that all the search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario, it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
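As a rough illustration of how the "Other Context" counts can be compared across engines, the sketch below averages the share of off-context results per engine. The counts are copied from the ambiguity tests for Google, Bing and Guruji. The figure of 10 inspected results per query is inferred from the table rows (each sums to 10), and the function name and the decision to skip Guruji's empty result for query 4 are choices made here for illustration, not part of the original experiment.

```python
# "Other Context" counts for queries 1-5, copied from the ambiguity tests.
# None marks a query for which the engine returned no results at all.
OTHER_CONTEXT = {
    "Google": [3, 4, 0, 0, 5],
    "Bing": [6, 4, 2, 1, 8],
    "Guruji": [10, 7, 3, None, 10],
}

RESULTS_INSPECTED = 10  # assumption: each table row sums to 10 inspected results


def mean_other_context_share(counts, inspected=RESULTS_INSPECTED):
    """Average fraction of inspected results that fall in an unintended context."""
    usable = [c for c in counts if c is not None]  # skip queries with no results
    if not usable:
        return None
    return sum(usable) / (inspected * len(usable))


for engine, counts in OTHER_CONTEXT.items():
    print(engine, mean_other_context_share(counts))
```

On these counts the share is 0.24 for Google, 0.42 for Bing and 0.75 for Guruji (the last over the four queries that returned results), which matches the ordering described in the discussion.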


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too. This becomes especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic variation shown in this example, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

The last four columns give the number of Google results for the four Hindi forms, in column order.

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results (transliteration) | Results (standard) | Results (equivalent 1) | Results (equivalent 2)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8

From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, often in large numbers. It can also be seen that people have their own ways of writing Hindi words, and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is wrong. The table shows that 1540 documents were nevertheless retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the number of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.

Table 4.26 (continued): Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged. Example:

An English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, where the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
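The two-level transformation described above can be sketched as a dictionary-driven token replacement. This is only a minimal illustration and not the software described in the next chapter: the lookup table, the function names and the rule that Level 2 replaces every keyword found in the table are assumptions made here.

```python
# Hypothetical lookup: Hindi keyword -> English loanword written in Devanagari.
ENGLISH_EQUIVALENTS = {
    "विदेशी": "फॉरेन",          # videshi -> foreign
    "निवेश": "इन्वेस्टमेंट",     # nivesh -> investment
}


def transform(query, replacements, max_replacements):
    """Replace up to max_replacements keywords, scanning left to right."""
    out, used = [], 0
    for token in query.split():
        if used < max_replacements and token in replacements:
            out.append(replacements[token])
            used += 1
        else:
            out.append(token)
    return " ".join(out)


def query_levels(query, replacements=ENGLISH_EQUIVALENTS):
    """Return the Level 1 (one word replaced) and Level 2 (all known words replaced) variants."""
    level1 = transform(query, replacements, 1)
    level2 = transform(query, replacements, len(query.split()))
    return level1, level2


if __name__ == "__main__":
    for variant in query_levels("विदेशी निवेश भारत में"):
        print(variant)
```

A whitespace tokenizer is enough for this example; inflected keywords would first need the kind of stemming discussed elsewhere in this chapter before the lookup can match them.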


Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses
[Bar chart: Google search results for Hindi Query 1 to 6 and their Level 1 and Level 2 variants]


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of documents retrieved is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

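Table 4.28 Analysis of precision values (tabular representation)

Precision at 10 for each Hindi query (HQ) and its Level 1 (L1) and Level 2 (L2) influenced variants:

Query | Variant | Google | Bing | AltaVista
स्वास्थ्य और रक्तदान | HQ | 0.9 | 0.9 | 0.9
हेल्थ और रक्तदान | L1 | 0.9 | 0.8 | 0.9
हेल्थ और ब्लड_डोनेशन | L2 | 0.9 | 0.8 | 0.8
रक्तचाप का इलाज | HQ | 0.8 | 0.8 | 0.8
ब्लडप्रेशर का इलाज | L1 | 0.9 | 0.7 | 0.7
ब्लडप्रेशर का ट्रीटमेंट | L2 | 0.7 | 0.5 | 0.5
हृदय चिकित्सक | HQ | 0.9 | 0.9 | 0.9
हार्ट चिकित्सक | L1 | 0.8 | 0.7 | 0.7
कार्डियोलॉजिस्ट | L2 | 1 | 1 | 1
सरकार द्वारा रोजगार योजना | HQ | 1 | 0.8 | 0.6
सरकार द्वारा रोजगार स्कीम | L1 | 0.9 | 0.8 | 0.7
सरकार द्वारा एम्प्लॉयमेंट स्कीम | L2 | 0.8 | 0.8 | 0.8
विदेशी निवेश भारत में | HQ | 0.9 | 0.9 | 0.9
फॉरेन निवेश भारत में | L1 | 0.8 | 0.8 | 0.9
फॉरेन इन्वेस्टमेंट भारत में | L2 | 0.8 | 0.4 | 0.4
भ्रष्टाचार मुक्त भारत | HQ | 1 | 1 | 1
करप्शन मुक्त भारत | L1 | 1 | 1 | 1
करप्शन फ्री इंडिया | L2 | 1 | 1 | 1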


Figure 4.5 Analysis of inter-precision values

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for a Hindi query by transforming the query into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the same pattern is seen for the rest of the Hindi queries in the table above. Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the variant query that reflects English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
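The precision values discussed above count how many of the first 10 retrieved documents are relevant. A minimal sketch of that calculation follows; the relevance judgments in the example are placeholders, since the text reports only that 9 of the first 10 documents were relevant, not which ones.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k retrieved documents that were judged relevant.

    relevance: sequence of booleans in ranked order (True = relevant).
    Positions missing from a short result list count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


if __name__ == "__main__":
    # Placeholder judgments: 9 relevant documents among the first 10.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of documents actually returned means that a query with fewer than 10 results is penalized, which is one common convention; the source does not say which convention was used.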

[Figure 4.5 chart: "Inter Precision Chart" showing precision values for HQ1 to HQ6 and their L1 and L2 variants (HQ = Hindi Query, L = Level), with series for Google, Bing and AltaVista]


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard conventions for writing Hindi, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because they would face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords were implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
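One way to picture an interface-level layer of this kind is a thin wrapper that sends every query variant to a search engine and merges the results. The sketch below is an assumption about how such a layer could work, not a description of the tool developed in this study: the function name, the `search` callable and the demo index are all hypothetical.

```python
from typing import Callable, Iterable, List


def search_with_variants(
    variants: Iterable[str],
    search: Callable[[str], List[str]],
    per_query: int = 10,
) -> List[str]:
    """Run every query variant and merge the results, keeping first-seen order."""
    seen = set()
    merged: List[str] = []
    for query in variants:
        for url in search(query)[:per_query]:
            if url not in seen:  # skip documents already returned by another variant
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    # Stand-in for a real search engine call; the index is invented for the demo.
    fake_index = {
        "query A": ["doc1", "doc2"],
        "query B": ["doc2", "doc3"],
    }
    print(search_with_variants(["query A", "query B"], lambda q: fake_index.get(q, [])))
```

Because the search engine is passed in as a callable, the transformation logic stays outside the engine itself, which is the point of solving the problem at the interface level.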


agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life. But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]

4.5 Development of Language Corpora in Indian Languages

The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop more than 45 million words of corpora in the Scheduled Languages of India, an effort that is also part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpora, architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which supports translation involving Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that serves as a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' for a repository of Hindi words and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field date back to 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology for translating English into Indian languages, under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1750 rules and 54000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by the Chinmaya International Foundation (CIF). It is a fully-automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. The Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent the English as well as the Hindi grammar. Initially it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was available only for English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliament proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (the Rajya Sabha being the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for translation from English into Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.aukbc.org/research_areas/nlp/demo/mat

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates an English sentence into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine Guruji

Guruji.com is the first Indian language search engine. It was founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the backing of Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English into Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this need: since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. As a result, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defense Research and Development Organization), the Department of Atomic Energy, the All India Council of Technical Education and the UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As one result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, some 10 to 15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario.

Here are some of those that fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search: English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search: English

4.6.1.3 Info Search: Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents with these keywords exist in the second search result. But just saying that there is a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, the problem needs to be quantified. For this purpose, many experiments were conducted to determine at what levels this recall problem occurs, and by how much.


Table 4.15: Problems faced while searching in Hindi (low recall)

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standard of the language is rarely followed, for various reasons. One of the reasons is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages. This results in a change of dialect not only in speech but also in writing.

Every language uses markers that are attached to a root word to construct new words (English uses -s, -es and -ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) in Hindi are morphological variants of the root word Yojnaa (म जन), 'planning' in English. It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
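The reduction of morphological variants to one canonical form can be sketched as a light suffix-stripping stemmer. This is only a minimal illustration under stated assumptions, not the method used by any of the search engines discussed here: the suffix list, the function name `stem` and the minimum stem length are hypothetical choices that cover just a few plural markers.

```python
import unicodedata

# Hypothetical, deliberately small list of Hindi plural/oblique suffixes.
# A usable stemmer would need a larger, linguistically validated list.
SUFFIXES = ("ियों", "ओं", "एँ", "एं", "ों", "ें")


def stem(word, min_stem_length=2):
    """Strip the longest matching suffix, keeping at least min_stem_length code points."""
    word = unicodedata.normalize("NFC", word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


# The root word and two of its plural forms map to the same canonical form.
for variant in ("योजना", "योजनाओं", "योजनाएँ"):
    print(variant, "->", stem(variant))
```

Stripping only the longest matching suffix keeps the sketch predictable; it will both over-stem and under-stem real text, which is why the suffix list is labeled as an assumption.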

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these result in several spellings of the same word. The Hindi word अ गर ज (angrējī, meaning 'English'), for example, has several possible spelling variations.

There are numerous words which are phonetically equivalent but vary in writing. The word 'school' can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword सक र (school) and for the non-standard phonetically equivalent keyword सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words. For example, the English word 'school' (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population of India in different states. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. The user may use any of these keywords for searching, and the search engine should support all phonetically equivalent words.

Also, no particular standard exists for writing a keyword to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved, with very little repetition. The native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her use.

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic 'dictionary' meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ('ornament' in English) has the following commonly spoken synonyms: गहन ज वयअर क य

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) न यी क स न ऩस द ह

In this query the word स न (gold) is ambiguous, as it has another meaning, 'to sleep'. In the context of the above query the word स न means gold, but the query can also be interpreted as 'women like to sleep'.

Another query:

(In English) The common people's choice
(In Hindi) आभ र गो की ऩस द

Here the word आभ is ambiguous. In the above query it means 'common'; however, in Hindi it also means 'mango', so the query can be interpreted as 'mango is people's choice'.

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning of a word depends on the context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कय means 'hand' or 'arm' (हसत फ ह आच शफय), and in yet another context it means 'to do' (कयन).

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without being aware of the language these words come from. Most Indians use these words in English rather than in their native language. English has influenced Hindi not only in speech but in writing too, which becomes more evident when we look at Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


Table 4.18: Effect of morphological factors on Hindi queries

S.No | Root word(s) | Listing of keywords (Google, Bing, Guruji)
S.No | Morphological variant | Documents returned: Google | Bing | Guruji

1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61

2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23

3 | ब ष | ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356

4 | झ र | झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1

5 | आऩद | आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20

6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25

7 | सभसम | सभसम ए सभसम ओ सभसम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356

8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270

9 | य ग | य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647

10 | म जन | म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416

11 | क दर | क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105


Table 4.19: Precision values of the three search engines

S.No | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (precision from 0 to 1 for each query number; series: P Google, P Bing, P Guruji)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists the morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

From the above results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why the number of documents they retrieve is smaller than for Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since Google, as mentioned above, indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
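The precision value over the top 10 results can be computed as the fraction of those 10 results judged relevant. The sketch below assumes that relevance judgments are available as a list of booleans in rank order; the function name `precision_at_k` and the sample judgments are illustrative, not taken from the experiments.

```python
def precision_at_k(relevance_judgments, k=10):
    """Fraction of the top-k results judged relevant.

    relevance_judgments: booleans in rank order (True = relevant).
    Ranks missing because fewer than k results were returned count as
    non-relevant, so an engine that returns nothing scores 0.
    """
    top_k = relevance_judgments[:k]
    return sum(1 for judged_relevant in top_k if judged_relevant) / k


# Illustrative judgments: 9 of the first 10 results are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results actually returned matches the way an engine with very few results receives a low precision value in the tables.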

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. The user may use any of these keywords for searching, and the search engine should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and precision offered by Google.

Table 4.20: Results of the search engine on phonetic nature of Hindi language

Google results for query having keywords | No. of results | Precision@10

Hindi query 1 (standard keywords): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Hindi query 2 (standard keywords): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Hindi query 3 (standard keywords): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Hindi query 4 (standard keywords): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Hindi query 5 (standard keywords): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5


Figure 4.2: Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, with one segment per phonetic variant of the query)

In the above table and figure it can be clearly seen that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing a keyword to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved, with very little repetition. From the precision chart it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is the same or nearly equal. The native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her use.
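One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a coarse normalization key before comparison. The sketch below is an assumption-laden illustration, not a complete phonetic model of Hindi: the function name `phonetic_key` and the three rules (drop the nukta, write chandrabindu as anusvara, write a nasal consonant plus virama before another consonant as anusvara) are choices made for this example.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्
NASALS = "ङञणनम"


def phonetic_key(word):
    """Map common spelling variants of a Devanagari word to one coarse key."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    # A nasal consonant + virama before another consonant becomes anusvara.
    text = re.sub(f"[{NASALS}]{VIRAMA}(?=[\u0915-\u0939])", ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


# Three spellings of the word for 'movement' share a single key.
for spelling in ("आन्दोलन", "आंदोलन", "आँदोलन"):
    print(spelling, "->", phonetic_key(spelling))
```

The key is deliberately coarse: it will also merge some spellings that a linguist would keep apart, so it suits recall-oriented query expansion better than exact matching.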

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic 'dictionary' meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ('ornament' in English) has the following commonly spoken synonyms: गहन ज वयअर क य

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.


Table 4.21: Effect of word synonyms on Hindi IR

S.No | Query | Standard Hindi words and synonyms
S.No | Query | Google: documents returned | Precision@10 | Bing: documents returned | Precision@10 | Guruji: documents returned | Precision@10

1 | स न क आब षण | आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2 | क र फ दर | फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3 | सतर सशिकतकयण | सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4 | मसक दयक अ हक य | अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5 | व रग ओ | व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6 | आ खद न | आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1


Figure 4.3: Comparison of precision values against three search engines (precision from 0 to 1 for queries 1.1 to 6.3; series: Google, Bing, Guruji)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest. The reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. In the average case Google still stands first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Therefore, synonyms of keywords play an important role in the process of Hindi information retrieval.
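Replacing a keyword by its synonyms can be sketched as a simple query expansion step. The code below is a minimal sketch under stated assumptions: the function name `expand_with_synonyms` and the small synonym dictionary are hypothetical stand-ins for whatever lexical resource is actually available, and only one keyword is replaced at a time.

```python
def expand_with_synonyms(query, synonyms):
    """Return the original query followed by variants with one keyword replaced.

    synonyms: dict mapping a keyword to a list of alternative words
    (a hypothetical resource supplied by the caller).
    """
    words = query.split()
    variants = [query]
    for position, word in enumerate(words):
        for alternative in synonyms.get(word, ()):
            candidate = " ".join(words[:position] + [alternative] + words[position + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


# Illustrative dictionary for the word 'ornament'; not an authoritative list.
synonym_dictionary = {"आभूषण": ["गहने", "जेवर"]}
for variant in expand_with_synonyms("सोने के आभूषण", synonym_dictionary):
    print(variant)
```

Each variant can then be submitted as a separate query and the result lists compared or merged, which mirrors how the synonym queries in the table were evaluated one by one.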

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present below five randomly selected ambiguous queries. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the query in English with respect to the ambiguous keyword in each context.


Table 4.22: List of randomly selected ambiguous queries

S.No | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the tables, the last column, labeled 'Other context', holds the number of results which are not relevant to the query supplied, or which contain the keyword in another, non-required context. From the results it is clear that all search engines return documents in different contexts. Therefore it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the 'Other context' column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the 'Other context' column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सऩ यो क पन (art of snake charmers; in the other context, snake charmer's snake head), the retrieved documents are expected to be in the context 'art'. From the results, however, it can be seen that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.


4.13 Discussion: Influence of English on Hindi Information retrieval

English has influenced Hindi not only in speech but in writing too, which becomes more evident when we look at Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval, as the following example shows.

Example: the English word 'exercise' is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000

From the above table it can be observed that, for a single-keyword query, the search engine returns documents not only for the keyword itself but also for all of its phonetic variants, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the first Hindi form in each row is the keyword obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word 'University' should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is clearly evident from the table that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This indicates that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, without changing the meaning of the original Hindi query. For example:

The English query 'Foreign investment in India' can be written in Hindi as षवद श तनव श ब यत भ, where the Hindi keyword षवद श means 'foreign' (प य न) and तनव श means 'investment' (इनव सटभट). The query is transformed at the two levels as:

प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query षवद श तनव श ब यत भ ('videshi nivesh Bharat mein') supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
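The two-level transformation can be sketched as follows. This is only an illustrative sketch, not the software described in the next chapter: the function name `transform_query` and the mapping from Hindi keywords to English words written in Devanagari are hypothetical, and the choice to replace the first replaceable keyword at level 1 is an assumption.

```python
def transform_query(query, english_in_hindi):
    """Return [level_1, level_2] variants of a Hindi query.

    english_in_hindi: dict mapping a Hindi keyword to an English word
    written in Devanagari (a hypothetical resource).
    Level 1 replaces the first replaceable keyword only; level 2
    replaces every replaceable keyword and is produced only when more
    than one keyword can be replaced.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in english_in_hindi]
    if not replaceable:
        return []

    level_1 = list(words)
    first = replaceable[0]
    level_1[first] = english_in_hindi[words[first]]
    variants = [" ".join(level_1)]

    if len(replaceable) > 1:
        level_2 = [english_in_hindi.get(word, word) for word in words]
        variants.append(" ".join(level_2))
    return variants


# Illustrative mapping for the 'Foreign investment in India' example.
mapping = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
for variant in transform_query("विदेशी निवेश भारत में", mapping):
    print(variant)
```

Words without an entry in the mapping are left untouched, so the transformed query keeps its word order and its meaning.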


Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, comparing the original query with Level 1 and Level 2)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

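Table 4.28: Analysis of precision values (tabular representation)

Query form | Query | Precision@10: Google | Bing | AltaVista

Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8

Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5

Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
Level 2 | क रड मम र िजसट | 1 | 1 | 1

Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8

Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4

Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1
Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1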


Figure 4.5: Analysis of inter-precision values (inter-precision chart over HQ1 to HQ6, each followed by L1 and L2, where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista)

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can also be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. The same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries the user may miss the chance of retrieving relevant information, as the Hindi user may not be aware of the presence of such information on the web and may be unable to formulate the query variations arising from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].


(EMILLE) program provides the corpora, architecture and tools for Asian languages. It has a monolingual corpus which contains approximately 96157000 words, and a parallel corpus consisting of 200000 words of text in English, which helps in the translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha', a repository for a Hindi word database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.

4.5.1 Machine Translation in India

Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been noted since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai, and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai, and Jadavpur University Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.

4.5.2 Anglabharti

IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is a rule-based system with approximately 1,750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators, and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R.M.K. Sinha developed Anubharti during 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started during 1996 for translating government orders, notifications, circulars, and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network, and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent English as well as Hindi grammar. Initially it was domain-specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums, and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation, and Agriculture. Earlier, Mantra technology was only for English-to-Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali, and Telugu. MANTRA-Rajyasabha is a system for translating parliament proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB], and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the system-generated translation, the user can supply new words, phrases, and sentences to the system and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti combines a rule-based approach with a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi, and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai has developed a machine-aided Tamil-to-Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates an English sentence into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl

4.5.11 Punjabi to Hindi Machine Translation System

During 2007, Josan and Lehal at Punjabi University, Patiala designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre-Processing, Translation Engine, and Post-Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search

4.5.12.1 Engine Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawl technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education, and more.

4.5.12.2 Google

Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam, and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. In WikiBhasa, the source language is English and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization, and website localization. It is also involved in research and development of corpus creation and collection and of content syndication. Moreover, it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices, and the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments, and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education, and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate, and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario.

Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo, and lemmefind.in, along with Ask Laila, which launched a couple of days back. We also have localized versions of the big giants Google, Yahoo, and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Websites     2008 Stats          2009 Stats
Google       37                  35
Yahoo        32                  25
Rediff       7                   4
Orkut        6                   7
More Info    India online 2008   India online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats          2009 Stats
Google       81                  76
Yahoo        7                   7
Wikipedia    3                   6
English      3                   4
More Info    India online 2008   India online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats          2009 Stats
Google       65                  34
Yahoo        12                  29
Rediff       4                   15
Teluguone    2                   0
Guruji       1                   18
Raftaar      1                   0.2
Hindi        1                   NA
Webdunia     1                   0.7
Khoj         0.7                 2
More Info    India online 2008   India online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo, and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला") is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. Merely saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs and by how much.

Table 4.15 Problems faced while searching in Hindi: Low recall

Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                        0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम            0        0       0

Table 4.16 Improved Recall

Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                             8820     92      12
वलडम टर ड सटय आतॊकीअटक                              331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                   708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                      37100    7400    93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are rarely followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the national language of India, is influenced by the regional languages, which results in dialect changes not only in speaking but in writing also. Every language attaches markers to a root word to construct new words (English uses -s, -es, and -ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) in Hindi are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
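The stemming step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix list and the minimum stem length are hypothetical choices made for this example, not the rules used by any of the search engines or tools discussed in this chapter, and a practical Hindi stemmer would need a much larger, validated rule set.

```python
import unicodedata

# Illustrative plural/oblique endings only (an assumption for this sketch).
SUFFIXES = ("ओं", "एँ", "एं", "ों", "ें")
# Assumed guard so that very short words such as "में" are left untouched.
MIN_STEM_LENGTH = 2


def stem(word: str) -> str:
    """Strip one known suffix from a Hindi word, if the remainder is long enough."""
    word = unicodedata.normalize("NFC", word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stem_query(query: str) -> str:
    """Stem every whitespace-separated keyword of a query."""
    return " ".join(stem(token) for token in query.split())
```

With these assumptions, the variants योजनाओं and योजनाएँ both reduce to the root योजना, so a query containing either form could be matched against documents indexed under the root word.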

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. The variety in the alphabet, the different dialects, and the influence of regional and foreign languages have resulted in spelling variations of the same word; for example, the Hindi word अंग्रेजी (angrējī, meaning English) has several possible spellings. Numerous words are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the single standard keyword for school (सक र) and for a non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words; for example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population of India in different states. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e., a different set of documents is retrieved with little repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
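One way to make phonetically equivalent spellings comparable is to map them to a common normalized form before matching. The Python sketch below is a minimal illustration, assuming three normalization rules chosen for this example (dropping the nukta, treating chandrabindu as anusvara, and writing a nasal consonant plus virama before a stop of the same class as anusvara). These rules are assumptions, not a standard, and they cover only a small part of the variation described above.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"

# Nasal consonant -> stops of the same articulation class (assumed rule set).
HOMORGANIC = {
    "ङ": "कखगघ",
    "ञ": "चछजझ",
    "ण": "टठडढ",
    "न": "तथदध",
    "म": "पफबभ",
}


def normalize_spelling(text: str) -> str:
    """Map common spelling variants of a Hindi keyword to one canonical form."""
    # NFC splits precomposed nukta letters (e.g. U+095B) into base letter + nukta.
    text = unicodedata.normalize("NFC", text)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    for nasal, stops in HOMORGANIC.items():
        # Replace "nasal + virama" only when a stop of the same class follows.
        text = re.sub(f"{nasal}{VIRAMA}(?=[{stops}])", ANUSVARA, text)
    return text
```

Under these rules आन्दोलन, आंदोलन, and आँदोलन all normalize to the same string, while a geminate such as अन्ना is left unchanged. Two keywords can then be treated as phonetic variants when their normalized forms are equal.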

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
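Synonym-based query expansion can be sketched as follows. The dictionary here is a hypothetical one-entry thesaurus built from the आभूषण example; the function name and data structure are assumptions for illustration and do not describe the tool presented in the next chapter. The sketch substitutes whole keywords only and does not adjust inflection.

```python
from itertools import product
from typing import Dict, List, Optional

# Hypothetical mini-thesaurus; a real system would load a lexical resource.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Optional[Dict[str, List[str]]] = None) -> List[str]:
    """Return the original query followed by every synonym-substituted variant."""
    table = SYNONYMS if synonyms is None else synonyms
    # For each keyword, the options are the keyword itself plus its synonyms.
    options = [[token, *table.get(token, [])] for token in query.split()]
    return [" ".join(choice) for choice in product(*options)]
```

Each variant returned by `expand_query` could be submitted to a search engine separately and the result lists merged, which is one way an interface layer might widen the search without changing the engine itself.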

4.8.4 Ambiguous Words

Ambiguous words reduce the relevancy of the results. The examples below show this clearly. Consider the following query:

(In English) (Women like gold)

(In Hindi) (नारी को सोना पसंद है)

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e., to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) (The common people's choice)

(In Hindi) (आम लोगों की पसंद)

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in other contexts it means hand or arm (हसत फ ह आच शफय) or to do (कयन).
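A very simple way to pick a sense from context is to count how many context words of the query overlap with cue words associated with each sense. The sketch below assumes a tiny, hand-made sense inventory for सोना; the cue words are hypothetical and serve only to illustrate the idea, not to describe a method evaluated in this study.

```python
# Hypothetical sense inventory with hand-picked cue words (illustration only).
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "भाव", "चांदी", "खरीदना"},
        "to sleep": {"रात", "नींद", "बिस्तर", "देर"},
    },
}


def disambiguate(keyword, query, senses=SENSES):
    """Return the sense whose cue words overlap most with the query, or None."""
    context = set(query.split()) - {keyword}
    scores = {
        sense: len(cues & context) for sense, cues in senses[keyword].items()
    }
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    # Report a sense only when exactly one sense has supporting context words.
    return winners[0] if best > 0 and len(winners) == 1 else None
```

For the query नारी को सोना पसंद है the function returns None, because no cue word occurs in the query; this mirrors the point made above that the query alone does not settle the intended meaning.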

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director, and department are used even by uneducated Indians without being aware of the language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No    Query in Hindi                  Meaning in English
1       ब यतवष मवन                     Indian rain forest
1.1     ब यत मवष मवनो                   Indian rain forests
2       हव ईद घमटन क क यण              Reason for air crash
2.1     हव ईद घमटन ओ क क यण            Reason for air crashes
3       ब यतभफ रीज न व रीब ष            Language spoken in India
3.1     ब यतभफ रीज न व रीब ष ए          Languages spoken in India
3.2     ब यतभफ रीज न व रीब ष ओ          Languages spoken in India
4       षवर पतह न ऩयझ र                 Lake on the verge of extinction
4.1     षवर पतह न ऩयझ र                 Lakes on the verge of extinction
5       पर क ततकआऩद                    Natural calamity
5.1     पर क ततकआऩद ए                  Natural calamities
6       ऩ कीपरज तत                      Bird species
6.1     ऩकषमोकीपरज तत                  Birds' species
6.2     ऩकषमोकीपरज ततम                 Birds' species
7       क षषसभसम                       Agriculture problem
7.1     क षषसभसम ओ                     Agriculture problems
8       कीटन शकक इसत भ र               Use of pesticide
8.1     कीटन शकोक इसत भ र              Use of pesticides
9       भ नमसकय ग                      Mental illness
9.1     भ नमसकय चगमो                   Mental patients
9.2     भ नमसकय ग                      Mental patient
10      गर भ णषवक सम जन                Policy for village
10.1    गर भ णषवक सम जन ओ              Policies for village
10.2    गर भ णषवक सम जन ए              Policies for village
11      परभ खक षषक दर                  Major agricultural office
11.1    परभ खक षषक नदरो                Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S.No    Morphological variant    Documents returned: Google    Bing    Guruji

1       Root words: ब यत वष मवन
        Listing of keywords (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1     ब यत वष मवन              50500     4680     485
1.2     ब यत मवष मवनो            40400     680      61

2       Root word: द घमटन
        Listing of keywords (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
2.1     द घमटन                   133000    2410     284
2.2     द घमटन ओ                 117000    420      23

3       Root word: ब ष
        Listing of keywords (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
3.1     ब ष                      161000    8330     961
3.2     ब ष ए                    6200      935      188
3.3     ब ष ओ                    6090      441      356

4       Root word: झ र
        Listing of keywords (Google, Bing, Guruji): झ रझ रो झ र झ र
4.1     झ र                      4740      278      25
4.2     झ र                      1270      28       1

5       Root word: आऩद
        Listing of keywords (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
5.1     आऩद                      102000    4030     410
5.2     आऩद ए                    1160      64       20

6       Root words: ऩ परज तत
        Listing of keywords (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1     ऩ परज तत                 48200     1670     98
6.2     ऩकषमोपरज तत              47600     1150     84
6.3     ऩकषमोपरज ततम             33800     747      25

7       Root word: सभसम
        Listing of keywords (Google, Bing, Guruji): सभसम ए सभसम ओ सभसम सभसम सभसम
7.1     सभसम                     584000    30200    1889
7.2     सभसम ओ                   584000    7150     1356

8       Root word: कीटन शक
        Listing of keywords (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
8.1     कीटन शक                  36300     1360     333
8.2     कीटन शको                 35800     800      270

9       Root word: य ग
        Listing of keywords (Google, Bing, Guruji): य गोय ग य ग य ग
9.1     य ग                      205000    21600    1423
9.2     य चगमो                   128000    3280     239
9.3     य ग                      112000    6280     647

10      Root word: म जन
        Listing of keywords (Google, Bing, Guruji): म जन ओ म जन म जन म जन
10.1    म जन                     673000    18500    3343
10.2    म जन ओ                   669000    6020     990
10.3    म जन ए                   673000    2860     416

11      Root word: क दर
        Listing of keywords (Google, Bing, Guruji): क दरीमक दर क दर क दर
11.1    क दर                     261000    11300    655
11.2    क नदरो                   29500     1850     105

Figure 4.1 Precision values of the three search engines (chart of precision, 0 to 1, per query number for Google, Bing, and Guruji)

Table 4.19 Precision values of the three search engines

S.No    Query                           Precision (top 10): Google    Bing    Guruji
1.1     ब यतवष मवन                     0.5     0.3     0.1
1.2     ब यत मवष मवनो                   0.3     0.1     0.1
2.1     हव ईद घमटन क क यण              0.5     0.3     0.1
2.2     हव ईद घमटन ओ क क यण            0.3     0.3     0.1
3.1     ब यतभफ रीज न व रीब ष            0.9     0.5     0.5
3.2     ब यतभफ रीज न व रीब ष ए          0.5     0.4     0.3
3.3     ब यतभफ रीज न व रीब ष ओ          0.5     0.2     0.2
4.1     षवर पतह न ऩयझ र                 0.5     0.3     0.2
4.2     षवर पतह न ऩयझ र                 0.5     0.3     0
5.1     पर क ततकआऩद                    1       0.4     0.3
5.2     पर क ततकआऩद ए                  0.6     0.6     0.1
6.1     ऩ कीपरज तत                      1       0.7     0.2
6.2     ऩकषमोकीपरज तत                  0.9     0.6     0.1
6.3     ऩकषमोकीपरज ततम                 0.9     0.6     0.1
7.1     क षषसभसम                       0.7     0.3     0.2
7.2     क षषसभसम ओ                     0.7     0.4     0.2
8.1     कीटन शकक इसत भ र               1       0.8     0.2
8.2     कीटन शकोक इसत भ र              0.9     0.6     0.2
9.1     भ नमसकय ग                      0.7     0.6     0.4
9.2     भ नमसकय चगमो                   0.9     0.6     0.3
9.3     भ नमसकय ग                      0.9     0.5     0.4
10.1    गर भ णषवक सम जन                0.9     0.7     0.3
10.2    गर भ णषवक सम जन ओ              1       0.6     0
10.3    गर भ णषवक सम जन ए              0.8     0.6     0.2
11.1    परभ खक षषक दर                  0.5     0.4     0
11.2    परभ खक षषक नदरो                0.4     0.2     0.2

It has been observed that all three search engines return more documents when a query with the root word is submitted. This justifies searching for documents using the root word because, in general, we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

The above results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why the number of documents they retrieve is smaller than for Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google appears to index keywords in their root form, the relevancy of its results is also higher than that of the other two search engines, which indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
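The precision measure used in this discussion can be written down directly. The sketch below assumes that each of the top 10 results has been judged relevant or not relevant by hand, and that precision is the number of relevant results divided by 10; the function names and the list-of-booleans input format are assumptions made for this illustration.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    relevance: list of booleans in rank order, True if the result is relevant.
    Missing results (fewer than k returned) count as not relevant.
    """
    top = relevance[:k]
    return sum(top) / k


def mean_precision(judgements, k=10):
    """Average precision-at-k over several queries (one relevance list each)."""
    if not judgements:
        return 0.0
    return sum(precision_at_k(r, k) for r in judgements) / len(judgements)
```

For example, a query for which 5 of the top 10 results are relevant gets a precision of 0.5, which is how the values in Table 4.19 are read in this chapter.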

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and precision offered by Google.

Table 4.20 Results of the search engine on phonetic nature of Hindi language

Columns: Google results for query having keywords    No. of results    Precision (top 10)

Query 1 (Hindi query with bold standard keywords): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर                  97         0.9
ससजिमो जहयीर                 632        0.9
सिबजमो िहयीर                 194        0.9

Query 2 (Hindi query with bold standard keywords): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई                 35300      1.0
आसभ न भहॉगाई                 1040       1.0
आसभ भह ग ई                   14         0.6
आसभाॊ भह ग ई                 563        0.7

Query 3 (Hindi query with bold standard keywords): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी                211000     0.8
भरशटाचायआज दी                214        0.6
बयषटाचाय आज दी               447        0.7
भरषटट च यआिादी               1090000    0.9
भरशट च य आिादी               1040       0.7
बयषटट च यआिादी               1190       0.8

Query 4 (Hindi query with bold standard keywords): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन              84700      0.3
अनन हज य आॊदोरन              85100      0.8
अनन हज य क आॉदोरन            78         0.6
अनना हिाय क आनद रन           399        0.5
अनन हिाय क आनद रन            3260000    1.0

Query 5 (Hindi query with bold standard keywords): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी                     9650       0.9
फयोिगायी                     80600      1.0
फयोिगायी                     170        0.7
फयोिगायी                     30         0.5

Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart each for Query No. 1 to Query No. 5)

In the above table and figure it can be seen that the search engine returns a handful of documents for various phonetically equivalent Hindi queries. No particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e., a different set of documents is retrieved with little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

Columns: S.No    Query    Google documents    Precision (top 10)    Bing documents    Precision (top 10)    Guruji documents    Precision (top 10)

1       Query: स न क आब षण; standard Hindi word and synonym: आब षण, गहन
1.1     स न क आब षण             217000    0.8     3250    0.7     381     0.5
1.2     स न क गहन               188000    0.8     2590    0.6     389     0.5
1.3     स न क ज वय              78900     0.8     1670    0.7     311     0.6
1.4     स न क अर क य            9490      0.5     633     0.4     70      0
1.5     स न क आबयण              493       0.5     38      0.3     1       0

2       Query: क र फ दर; standard Hindi word: फ दर
2.1     क र फ दर                233000    0.7     7510    0.7     733     0.3
2.2     क र भ घ                 40700     0.9     1500    0.8     99      0.6
2.3     क र जरधय                1570      0.6     54      0.6     2       0.2

3       Query: सतर सशिकतकयण; standard Hindi word and synonym: सतर, न यी
3.1     सतर सशिकतकयण           9950      0.9     1570    0.7     760     0.6
3.2     न यीसशिकतकयण           29300     0.9     1910    0.9     736     0.4
3.3     भहहर सशिकतकयण          96300     0.9     5160    0.8     1091    0.3
3.4     औयतसशिकतकयण            7670      0.8     680     0.7     510     0.2

4       Query: मसक दयक अ हक य; standard Hindi word: अ हक य
4.1     मसक दयक अ हक य          1990      0.4     18      0.3     60      0.6
4.2     मसक दयक अमबभ न          2400      0.6     304     0.2     16      0.1
4.3     मसक दयक घभ ड            495       0.5     54      0.6     9       0.1

5       Query: व रग ओ; standard Hindi word: व
5.1     व रग ओ                  6960      1       698     0.8     29      0.5
5.2     ऩ डरग ओ                 13400     1       1080    0.9     143     0.9
5.3     दयखतरग ओ                481       0.5     19      0.5     0       0

6       Query: आ खद न; standard Hindi word: आ ख
6.1     आ खद न                  34000     0.7     3690    0.5     312     0.3
6.2     न तरद न                 77500     1       3240    1       159     0.9
6.3     च द न                   2450      0.8     427     0.1     36      0.1

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 132

Figure 43 Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Synonyms of Hindi keywords affect not only the quantity of documents returned but also their quality.

From the table and figure above, Google returns more documents than the other two search engines, and Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji. In the average case Google still ranks first, that is, its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym can yield equivalent results. It is therefore evident that synonyms of keywords play an important role in Hindi information retrieval.
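The comparison above rests on precision over the first ten results. As a minimal sketch (not the software used in this study), the following Python computes precision at 10 from relevance judgements and expands a query by swapping one keyword for its synonyms. The function names, the synonym dictionary, and the relevance judgements are illustrative assumptions.

```python
from typing import Dict, List


def precision_at_k(relevant_flags: List[bool], k: int = 10) -> float:
    """Fraction of the first k results that were judged relevant."""
    top = relevant_flags[:k]
    if not top:
        return 0.0
    return sum(top) / k


def expand_with_synonyms(query: str, keyword: str,
                         synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query plus one variant per synonym of keyword."""
    variants = [query]
    for synonym in synonyms.get(keyword, []):
        variants.append(query.replace(keyword, synonym))
    return variants


# Hypothetical synonym list for the keyword of query 3 in Table 4.21.
SYNONYMS = {"स्त्री": ["नारी", "महिला", "औरत"]}

queries = expand_with_synonyms("स्त्री सशक्तिकरण", "स्त्री", SYNONYMS)

# Hypothetical judgements: 9 of the first 10 results judged relevant.
judgements = [True] * 9 + [False]

print(queries)
print(precision_at_k(judgements))
```

Each variant would be submitted to the search engine separately, and its first ten results judged by hand, to obtain one row of a table such as Table 4.21.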

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.

[Figure 4.3 chart: precision values from 0 to 1 on the vertical axis; query variants 1.1 to 6.3 on the horizontal axis; series: Google, Bing, Guruji]


Table 4.22: List of randomly selected ambiguous queries

S. No. | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing, and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, non-required context. All three search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario, it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
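The "Other context" counts can be turned into a simple deviation score per engine. The sketch below is only an illustration of that arithmetic: the counts are those reported for query 5 in Tables 4.23 to 4.25, while the function name and the choice to express the deviation as a share of the inspected results are assumptions.

```python
from typing import Dict, Optional, Tuple

# (context 1, context 2, other context) counts among the first 10 results
# for query 5, as reported in Tables 4.23 to 4.25.
QUERY_5_COUNTS: Dict[str, Tuple[int, int, int]] = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def off_context_share(counts: Tuple[int, int, int]) -> Optional[float]:
    """Share of inspected results that fall in neither listed context."""
    total = sum(counts)
    if total == 0:
        return None  # no results were returned, so no share is defined
    return counts[2] / total


for engine, counts in QUERY_5_COUNTS.items():
    print(engine, off_context_share(counts))
```

A share of 1.0, as for Guruji here, means that none of the inspected results matched either sense of the ambiguous keyword.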


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word exercise is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic variation shown in this example, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (in the same column order)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

Table 4.26: Keywords along with their phonetic variants written in Hindi

From the table above, it can be observed that the search engine returns documents for each single-keyword query, and that a large number of documents are also returned for every phonetic variant of the keyword. People evidently have their own ways of writing Hindi words, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google transliteration column of the table (set in bold in the original) shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम, which is completely wrong. It is evident from the table above that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged. Example:

The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
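The two-level transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the software described in the next chapter: the lookup table, the function name, and the rule that Level 1 replaces the first replaceable keyword while Level 2 replaces every replaceable keyword are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical lookup: Hindi keyword -> English equivalent in Devanagari.
ENGLISH_IN_HINDI: Dict[str, str] = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query: str, lookup: Dict[str, str]) -> List[str]:
    """Return [original, level 1, level 2] variants of a Hindi query.

    Level 1 replaces only the first replaceable keyword; level 2 replaces
    every replaceable keyword. Word order is left unchanged.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in lookup]
    level1 = list(words)
    if replaceable:
        first = replaceable[0]
        level1[first] = lookup[words[first]]
    level2 = [lookup.get(word, word) for word in words]
    return [query, " ".join(level1), " ".join(level2)]


variants = transform_levels("विदेशी निवेश भारत में", ENGLISH_IN_HINDI)
for variant in variants:
    print(variant)
```

Words without an entry in the lookup (here भारत and में) pass through untouched, so the meaning of the query is preserved as long as the lookup pairs are true equivalents.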


Table 4.27: Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (chart of Google result counts for Hindi queries 1 to 6 and their Level 1 and Level 2 variants)


From the table and figure above, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity. Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, are used for retrieving web results.

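Table 4.28: Analysis of precision values (tabular representation)

Query | Variant | Query text | Precision@10: Google | Precision@10: Bing | Precision@10: AltaVista
1 | Hindi query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
2 | Hindi query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
3 | Hindi query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
3 | Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1
4 | Hindi query | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6
4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
5 | Hindi query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
5 | Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
5 | Level 2 | फॉरेन इन्वेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
6 | Hindi query | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1
6 | Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1
6 | Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1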


Figure 4.5: Analysis of inter-precision values

From the tables and figures above, it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. Transforming the queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns: for every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variants that reflect English influence. From the table above, it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of retrieving relevant information can be increased.

[Figure 4.5 chart: inter-precision values for HQ1 to HQ6 and their L1 and L2 variants; HQ = Hindi Query, L = Level; series: Google, Bing, AltaVista]


The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may run into performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in Hindi [66][67].
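One way to picture such an interface layer is a function that sends every query variant to the engine and merges the answers. The sketch below is a hypothetical illustration only: `merged_search`, the round-robin merging rule, and the stand-in `fake_search` function with its made-up index are assumptions, and no real search engine API is called.

```python
from typing import Callable, Dict, Iterable, List


def merged_search(variants: Iterable[str],
                  search: Callable[[str], List[str]],
                  limit: int = 10) -> List[str]:
    """Query the engine once per variant and merge results without repeats.

    Results are interleaved round-robin so that every variant contributes
    to the top of the merged list.
    """
    if limit <= 0:
        return []
    result_lists = [search(variant) for variant in variants]
    merged: List[str] = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) >= limit:
                    return merged
    return merged


# Stand-in for a real search engine call: a fixed, made-up index.
FAKE_INDEX: Dict[str, List[str]] = {
    "q-hindi": ["doc1", "doc2"],
    "q-level1": ["doc2", "doc3"],
    "q-level2": ["doc4"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


print(merged_search(["q-hindi", "q-level1", "q-level2"], fake_search))
```

Because the expansion happens before the engine is called, the engine itself needs no change, which is the point of solving the problem at the interface level.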


a rule-based system and has approximately 1750 rules and 54000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators, and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.

4.5.3 Anubharti

Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.

4.5.4 Anusaaraka

Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation (FGH-MT) system. Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.

4.5.5 Mantra

The Machine Assisted Translation Tool (Mantra) was initiated by the Government of India in 1996 for translating government orders, notifications, circulars, and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network, and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent both English and Hindi grammar. Initially it was domain specific, covering Personnel Administration, specifically Gazette Notifications, Office Orders, Office Memorandums, and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation, and Agriculture. Earlier, Mantra technology handled only English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali, and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB], and Synopsis. The Rajya Sabha Secretariat (the Rajya Sabha being the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a machine translation system for English to Hindi based on the Universal Networking Language (UNL). UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system that provides a feedback facility to the user. If the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases, and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for translating English into Indian languages (Hindi, Marathi, and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: pre-processing, translation engine, and post-processing.

4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.12.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the assistance of Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it searches deep into Indian-language content and tries to return appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education, and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam, and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into multilingual pages. In WikiBhasa the source language is English and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization, and website localization. It is also involved in research and development on corpus creation and collection and on content syndication. In addition, it offers language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian-language software. It provides Indian-language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, and high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, in corporate offices, and in the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, the Indian Languages Transliteration scheme (ITRANS), developed by Avinash Chopde, represents Indian-language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology, under the Ministry of Communication and Information Technology, is also working toward the proliferation of language technology in India. Other Indian government ministries, departments, and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education, and the UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As one result, IndoWordNet was developed for Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate, and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario.

Search engines that fall into the localized Indian category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo, and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo, and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search: English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search: English

4.6.1.3 Info Search: Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, using present-day popular techniques, shows a severe problem of low recall when accessing information with Indian-language queries. For instance, popular web search engines such as Google, Yahoo, and Guruji often return 0 search results for Indian-language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords are found by the second search. Merely saying that there is a recall problem is not sufficient, though. The obvious next question is how severe the problem is; in other words, the problem needs to be quantified. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs and by how much.


Table 4.15: Problems faced while searching in Hindi: low recall

Hindi query | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला | 0 | 0 | 0
इन्डियन इंस्टिट्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च | 0 | 0 | 0

Table 4.16: Improved recall

Hindi query | Google | Yahoo | Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला | 8820 | 92 | 12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक | 331 | 10 | 1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च | 708 | 50 | 1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. However, the grammatical and structural standard of the language is rarely followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in changes of dialect not only in speech but also in writing. Every language uses markers that are attached to a root word to construct new words: English uses suffixes such as -s, -es, and -ing, and Hindi uses maatraas (vowel signs) such as ै, ि, ा, and ो. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, "plan" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
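Word stemming as described above can be illustrated with a very small suffix-stripping routine. This is a sketch under stated assumptions, not a complete Hindi stemmer: the suffix list covers only the endings seen in the योजना example, and the function name and minimum stem length are hypothetical choices.

```python
from typing import List

# Illustrative plural endings only; a usable stemmer needs a far longer list.
SUFFIXES: List[str] = ["ओं", "एं", "एँ"]


def stem(word: str, min_stem_length: int = 2) -> str:
    """Strip the first matching suffix, keeping at least min_stem_length
    code points of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


forms = ["योजना", "योजनाओं", "योजनाएं"]
roots = {stem(form) for form in forms}
print(roots)
```

All three forms reduce to the same root, so a query containing any one of them could be matched against documents containing the others.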

482 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. Together these factors result in several spellings of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.

There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in several different ways, with स्कूल as the standard form. When information is searched for the standard keyword स्कूल and for a non-standard Hindi phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by many people in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. A user may use any of these forms, and the search engine should support all of them.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved with little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her needs.
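One way to treat phonetically equivalent spellings alike is to reduce each word to a normalized key before matching. The sketch below is a minimal illustration under stated assumptions: the three rules (dropping the nukta, unifying nasal markers, and collapsing long and short i/u vowels) and the name `phonetic_key` are chosen for this example and are not a published standard or the method of any search engine discussed here.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्
NASALS = "ङञणनम"

# A nasal consonant + virama directly before another consonant (क..ह).
_NASAL_CLUSTER = re.compile(f"[{NASALS}]{VIRAMA}(?=[\u0915-\u0939])")

# Long vowels folded onto short ones: ी->ि, ू->ु, ई->इ, ऊ->उ (an aggressive, assumed rule).
_LONG_TO_SHORT = str.maketrans({"\u0940": "\u093f", "\u0942": "\u0941",
                                "\u0908": "\u0907", "\u090a": "\u0909"})


def phonetic_key(word):
    """Return a normalized key so that common spelling variants compare equal."""
    text = unicodedata.normalize("NFD", word)      # split precomposed nukta letters
    text = text.replace(NUKTA, "")                 # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)    # ँ -> ं
    text = _NASAL_CLUSTER.sub(ANUSVARA, text)      # न्द -> ंद
    return text.translate(_LONG_TO_SHORT)


if __name__ == "__main__":
    print(phonetic_key("हिन्दी") == phonetic_key("हिंदी"))
```

Keys of this kind trade precision for recall: the more rules are added, the more variants are matched, but the more unrelated words may collide.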

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, ज़ेवर, अलंकार.

4.8.4 Ambiguous words

Ambiguous words reduce the relevancy of the results, as the examples below show. Consider the following query:

(In English) Women like gold

(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query: (In English) The common people's choice

(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task: a word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कर means tax in one context (synonyms include शुल्क, महसूल and टैक्स), hand or arm in another (synonyms include हस्त and बाहु), and to do (करना) in a third.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by people with no formal education, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English has influenced Hindi not only in speech but in writing too, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


Table 4.18 Effect of morphological factors on Hindi queries

S. No. | Root word | Listing of keywords (Google, Bing, Guruji) | Morphological variant | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | |
1.1 | | | ब यत वष मवन | 50500 | 4680 | 485
1.2 | | | ब यत मवष मवनो | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | | |
2.1 | | | द घमटन | 133000 | 2410 | 284
2.2 | | | द घमटन ओ | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | | | |
3.1 | | | ब ष | 161000 | 8330 | 961
3.2 | | | ब ष ए | 6200 | 935 | 188
3.3 | | | ब ष ओ | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | | | |
4.1 | | | झ र | 4740 | 278 | 25
4.2 | | | झ र | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | | | |
5.1 | | | आऩद | 102000 | 4030 | 410
5.2 | | | आऩद ए | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | | | |
6.1 | | | ऩ परज तत | 48200 | 1670 | 98
6.2 | | | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | | | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | | |
7.1 | | | सभसम | 584000 | 30200 | 1889
7.2 | | | सभसम ओ | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | | |
8.1 | | | कीटन शक | 36300 | 1360 | 333
8.2 | | | कीटन शको | 35800 | 800 | 270
9 | य ग | य ग य गोय ग य ग य ग | | | |
9.1 | | | य ग | 205000 | 21600 | 1423
9.2 | | | य चगमो | 128000 | 3280 | 239
9.3 | | | य ग | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | | | |
10.1 | | | म जन | 673000 | 18500 | 3343
10.2 | | | म जन ओ | 669000 | 6020 | 990
10.3 | | | म जन ए | 673000 | 2860 | 416
11 | क दर | क दर क दरीमक दर क दर क दर | | | |
11.1 | | | क दर | 261000 | 11300 | 655
11.2 | | | क नदरो | 29500 | 1850 | 105


Figure 4.1 Precision values of the three search engines (chart: precision from 0 to 1 on the vertical axis against query numbers 1.1 to 11.1 on the horizontal axis; series P Google, P Bing, P Guruji)

Table 4.19 Precision values of the three search engines

S. No. | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists the morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

The above results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines, calculated over the top 10 results of each engine. On closely observing the results, we can say that the precision of Google is the highest in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevancy of its results is also higher than that of the other two search engines. This indicates that morphological variations in the keywords affect not only the quantity but also the quality of results.
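The precision values discussed above are computed over the top 10 results of each engine. As a minimal sketch of that calculation, the function below takes a list of manual relevance judgments, one per ranked result; the function name and the convention that missing results count as non-relevant are assumptions made for this example.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results judged relevant.

    relevance_flags: ranked list of True/False judgments, best result first.
    If fewer than k results were returned, the missing ones count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(bool(flag) for flag in relevance_flags[:k]) / k


if __name__ == "__main__":
    # Five relevant documents among the first ten results.
    judged = [True, True, False, True, False, True, False, False, True, False]
    print(precision_at_k(judged))
```

Under this convention an engine that returns no results at all for a query receives a precision of 0.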

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. A user may use any of these forms, and the search engine should support all of them.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

Hindi query with bold standard keywords | Phonetic variations of the keywords | Google results for query having keywords | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | | |
 | | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | | |
 | | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | | |
 | | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | | |
 | | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | | |
 | | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5

Figure 4.2 Precision charts for the phonetic nature of Hindi language (five pie charts, Query No. 1 to Query No. 5, each showing the share of every phonetic variant of the query)

In the above table and figure it can be clearly seen that the search engine returns documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved with little overlap. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is the same or nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his/her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, ज़ेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.


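Table 4.21 Effect of word synonyms on Hindi IR

S. No. | Query | Documents returned: Google | Precision@10 | Bing | Precision@10 | Guruji | Precision@10
1 | स न क आब षण (standard Hindi word: आब षण; synonyms: गहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi word: सतर; synonyms: न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1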


Figure 4.3 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. On average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by replacing a keyword with its synonym. Therefore, synonyms of keywords play an important role in Hindi information retrieval.
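Replacing a keyword with its synonyms, as discussed above, can be sketched as a small query expansion step. The example below is only an illustration: the dictionary `SYNONYMS`, its single entry (with assumed spellings) and the function name `expand_query` are hypothetical, and a real system would draw on a full Hindi thesaurus.

```python
# Hypothetical synonym dictionary; spellings are assumed for illustration.
SYNONYMS = {
    "आभूषण": ["गहने", "ज़ेवर", "अलंकार"],
}


def expand_query(query, synonyms):
    """Return the original query followed by variants with one keyword replaced."""
    tokens = query.split()
    variants = [query]
    for position, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            candidate = " ".join(tokens[:position] + [synonym] + tokens[position + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Each variant can then be submitted separately, which corresponds to the rows 1.1 to 1.5 style of comparison in Table 4.21.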

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present below five randomly selected ones. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context.


Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. It is clear that all the search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art", but the results show that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
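Sorting retrieved documents by the sense of an ambiguous keyword, as was done by hand for the tables above, could in principle be approximated by counting context cues. The sketch below is a simplified overlap heuristic in the spirit of the Lesk approach; the cue lists in `SENSES` and the function name `guess_sense` are hypothetical and were chosen only for illustration.

```python
# Hypothetical cue words for each sense of an ambiguous keyword.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहने", "धातु", "चांदी", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर", "आराम"},
    },
}


def guess_sense(keyword, context_tokens, senses):
    """Pick the sense whose cue words overlap most with the context, or None."""
    context = set(context_tokens)
    scores = {sense: len(cues & context) for sense, cues in senses[keyword].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    snippet = "रात को जल्दी सोना आराम देता है".split()
    print(guess_sense("सोना", snippet, SENSES))
    # The bare query carries no cue word, so no sense can be chosen.
    print(guess_sense("सोना", "नारी को सोना पसंद है".split(), SENSES))
```

The second call illustrates the difficulty described in this section: a short query often contains no word that signals which sense the user intends.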


4.13 Discussion: Influence of English on Hindi Information retrieval

English has influenced Hindi not only in speech but in writing too, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word exercise is written in Hindi as एक्सरसाइज, and this word has several phonetic variations in Hindi script. Based on phonetic variations of this kind, a variety of popular keywords and queries from various domains have been tested in the experiments.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (in the order of the four Hindi forms)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

From the above table it can be observed that the search engine returns documents not only for the single-keyword query itself: documents are also returned, in large numbers, for all phonetic variants of the keyword. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration for the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is clearly evident from the table above that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR in terms of precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.

The queries are converted at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by its English equivalent, without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इनवेस्टमेंट). The query is transformed for the two levels as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)

फॉरेन इनवेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
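The two-level transformation described above can be sketched as follows. This is not the software developed in this study (which is described in the next chapter); the mapping `ENGLISH_IN_DEVANAGARI`, its spellings and the function name `transform_levels` are assumptions made for illustration. Level 1 replaces the first mappable keyword, and Level 2 replaces every mappable keyword.

```python
# Hypothetical mapping from Hindi keywords to English words written in Devanagari.
ENGLISH_IN_DEVANAGARI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इनवेस्टमेंट",
}


def transform_levels(query, mapping):
    """Return (level1, level2) variants of a Hindi query.

    level1: only the first keyword found in the mapping is replaced.
    level2: every keyword found in the mapping is replaced.
    A query with no mappable keyword is returned unchanged at both levels.
    """
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in mapping]
    if not replaceable:
        return query, query
    level1 = list(tokens)
    level1[replaceable[0]] = mapping[tokens[replaceable[0]]]
    level2 = [mapping.get(token, token) for token in tokens]
    return " ".join(level1), " ".join(level2)


if __name__ == "__main__":
    first, second = transform_levels("विदेशी निवेश भारत में", ENGLISH_IN_DEVANAGARI)
    print(first)
    print(second)
```

When a query contains only one mappable keyword, the two levels coincide; the study's own tool may handle such cases differently.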


Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation

In English | Hindi query | Influenced Hindi query: Level 1 | Level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses (six charts, Hindi Query 1 to Hindi Query 6, each comparing the number of results for the Hindi query, Level 1 and Level 2)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values: tabular representation

Precision@10 is given per search engine for the Hindi query (HQ), Level 1 (L1) and Level 2 (L2), in that order.

Hindi query | Influenced Hindi query: Level 1 | Level 2 | Google: HQ, L1, L2 | Bing: HQ, L1, L2 | AltaVista: HQ, L1, L2
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1


Figure 4.5 Analysis of inter precision values (inter precision chart: precision of Google, Bing and AltaVista for each Hindi query, HQ1 to HQ6, and its Level 1 (L1) and Level 2 (L2) variants)

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the pattern is similar for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
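An interface-level solution of the kind outlined above can be pictured as a thin layer that sends every variant of a query to an unmodified search engine and merges the ranked lists. The sketch below is not the tool developed in this study; the function `merged_search` and its `search` callback (any function that returns a ranked list of result identifiers for a query) are assumptions made for illustration.

```python
from itertools import zip_longest


def merged_search(variants, search, limit=10):
    """Query the engine once per variant and interleave the ranked lists.

    variants: query strings (original query plus its transformed forms).
    search:   callable taking a query and returning a ranked list of result ids.
    Results are taken rank by rank across variants, skipping duplicates.
    """
    ranked_lists = [search(query) for query in variants]
    merged, seen = [], set()
    for same_rank in zip_longest(*ranked_lists):
        for result in same_rank:
            if result is not None and result not in seen:
                seen.add(result)
                merged.append(result)
    return merged[:limit]


if __name__ == "__main__":
    # Stand-in for a real engine: a fixed table of ranked results per query.
    fake_index = {"q1": ["a", "b"], "q2": ["b", "c", "d"]}
    print(merged_search(["q1", "q2"], lambda q: fake_index.get(q, []), limit=3))
```

Because the merging happens outside the engine, its indexing and ranking stay untouched, which is the motivation given above for working at the interface level rather than at the root level.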


provide the translation tools to government agencies. Mantra software is available in all forms, such as desktop, network and web based. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially it was domain specific, covering Personal Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliament proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.

4.5.6 UNL-based MT System between English, Hindi and Marathi

IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.

4.5.7 English-Kannada MT System

The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). This project is funded by the Karnataka Government and is applicable in the domain of government circulars.


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility to the user: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a revised translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.

Shakti is a rule-based system that also uses a statistical approach. It is used for the translation of English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai has developed a machine-aided Tamil-to-Hindi translation system. It is based on the Anusaaraka machine translation system and follows a lexicon translation approach, supplemented by a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging and translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. Its architecture consists of three processing modules: Pre-Processing, Translation Engine and Post-Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine: Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it searches deep into Indian language content and tries to return the most appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu, and is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation/collection and content syndication, and it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-Governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, ITRANS (Indian languages Transliteration), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working on the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As a result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for it. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating very fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. The space is so crowded right now that it is difficult to know who is really winning. Nevertheless, we attempt a brief overview of the current scenario.

The following fall into the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players: Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by JuxtConsult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website    2008 Stats    2009 Stats
Google     37            35
Yahoo      32            25
Rediff     7             4
Orkut      6             7

More info: India Online 2008, India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats    2009 Stats
Google       81            76
Yahoo        7             7
Wikipedia    3             6
English      3             4

More info: India Online 2008, India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats    2009 Stats
Google       65            34
Yahoo        12            29
Rediff       4             15
Teluguone    2             0
Guruji       1             18
Raftaar      1             0.2
Hindi        1             NA
Webdunia     1             0.7
Khoj         0.7           2

More info: India Online 2008, India Online 2009 [65]

Table 4.14 Info Search Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return "0" search results for Indian language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents with these keywords are found by the second search. Merely stating that there is a recall problem is not sufficient, though. The obvious next question is how large the problem is; in other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels the recall problem occurs and by how much.

Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                     Google    Yahoo    Guruji
वलडम टर ड स नटय आत कव दी हरभ                     0         0        0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम         0         0        0

Table 4.16 Improved recall

Hindi Query                                     Google    Yahoo    Guruji
वलडम टर ड सटय आतॊकी हभर                          8820      92       12
वलडम टर ड सटय आतॊकीअटक                           331       10       1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                 708       50       1
बायतीम सॊटथान टवाटम शिा औय िोध                   37100     7400     93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, grammatical and structural standards are often not followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, as the official language of India, is influenced by the regional languages, which results in dialectal variation not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses maatraas (ऐ म ा ओ). For example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) are morphological variants of the root word Yojnaa (म जन, "planning" in English). It is desirable to reduce all morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
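The stemming idea described above can be sketched as a light suffix stripper. This is only an illustration, not the procedure used by any of the search engines discussed: the suffix list `SUFFIXES`, the function name `light_stem` and the minimum stem length are assumptions chosen for the example, and the words are written in standard Unicode Devanagari.

```python
# A minimal suffix-stripping sketch for Hindi plural/oblique endings.
# SUFFIXES is a small illustrative list (an assumption), not a complete
# or linguistically validated inventory.
SUFFIXES = ("यों", "ओं", "एं", "एँ", "ों", "ें")


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip the longest matching suffix, keeping at least min_stem_len code points."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched: treat the word as already in root form


if __name__ == "__main__":
    for form in ("योजना", "योजनाओं", "योजनाएं"):
        print(form, "->", light_stem(form))
```

With this sketch the variants Yojnaaon and Yojnaayein both reduce to Yojnaa, so a query and a document using different forms can be matched on the same key. A plain stripper like this will mis-stem some words (for instance, it cannot restore a root ending in -ी from a plural ending in -ियों), which is why practical stemmers use larger rule sets or dictionaries.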

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these factors produce spelling variations of the same word. For example, the Hindi word अ गर ज (angrējī, meaning "English") has several possible spellings. Numerous words are phonetically equivalent but vary in writing. The word "school" can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword for school (सक र) and for a non-standard phonetically equivalent keyword (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words. For example, the English word "school" (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for all phonetically equivalent forms of the keywords entered, whichever form the user chooses.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved with little repetition. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
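One way to make phonetically equivalent spellings match is to map each word to a normalized key before indexing and querying. The sketch below is an assumption-laden illustration, not a description of how Google, Yahoo or Guruji work: the function name `spelling_key` and the three rules (drop the nukta, treat chandrabindu as anusvara, write a nasal consonant plus virama before a consonant as anusvara) are chosen for the example and cover only a few common variation patterns.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# A nasal consonant followed by virama (U+094D) and then another consonant,
# e.g. the न् in हिन्दी, is commonly also written with an anusvara (हिंदी).
NASAL_CLUSTER = re.compile("[ङञणनम]\u094d(?=[\u0915-\u0939])")


def spelling_key(word: str) -> str:
    """Return a crude key under which common spelling variants coincide."""
    decomposed = unicodedata.normalize("NFD", word)  # splits precomposed nukta letters
    without_nukta = decomposed.replace(NUKTA, "")
    nasal_unified = without_nukta.replace(CHANDRABINDU, ANUSVARA)
    return NASAL_CLUSTER.sub(ANUSVARA, nasal_unified)


if __name__ == "__main__":
    for variant in ("आन्दोलन", "आंदोलन", "आँदोलन"):
        print(variant, "->", spelling_key(variant))
```

The key deliberately throws information away, so it can also merge words that are genuinely different; it is meant for matching, not for display. Variants that differ in vowels or in an added initial vowel, such as the ISKOOL spelling of "school", are not handled by these rules and would need a broader phonetic scheme.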

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly spoken synonyms: गहन ज वयअर क य
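The role of synonyms in retrieval can be illustrated with a simple query expansion sketch. Everything named here is hypothetical: the toy thesaurus `SYNONYMS` (written in standard Unicode Devanagari, with entries chosen only for illustration) and the function `expand_query` stand in for a real lexical resource such as a WordNet-style database.

```python
from typing import Dict, List

# Hypothetical toy thesaurus; a real system would draw on a lexical database.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query: str, thesaurus: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query followed by variants with one term replaced by a synonym."""
    terms = query.split()
    variants = [query]
    for index, term in enumerate(terms):
        for synonym in thesaurus.get(term, []):
            variant = " ".join(terms[:index] + [synonym] + terms[index + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    for q in expand_query("सोने के आभूषण"):
        print(q)
```

Each variant can then be submitted separately and the result lists combined. The sketch substitutes words blindly and does not adjust grammatical agreement, so a practical system would also need morphological handling.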

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) न यी क स न ऩस द ह

In this query the word स न (gold) is ambiguous, as it also means "to sleep". In the context of the query the word स न means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आभ र गो की ऩस द

Here the word आभ is ambiguous. In the query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कय means "tax" in one context (with synonyms बम ज श लक स द भहस ऱ ट कस), "hand" or "arm" in another (हसत फ ह आच शफय), and "to do" (कयन) in yet another.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language they come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S.No.   Query in Hindi                   Meaning in English
1       ब यतवष मवन                       Indian rain forest
1.1     ब यत मवष मवनो                    Indian rain forests
2       हव ईद घमटन क क यण               Reason for air crash
2.1     हव ईद घमटन ओ क क यण             Reason for air crashes
3       ब यतभफ रीज न व रीब ष            Language spoken in India
3.1     ब यतभफ रीज न व रीब ष ए          Languages spoken in India
3.2     ब यतभफ रीज न व रीब ष ओ          Languages spoken in India
4       षवर पतह न ऩयझ र                  Lake on the verge of extinction
4.1     षवर पतह न ऩयझ र                  Lakes on the verge of extinction
5       पर क ततकआऩद                      Natural calamity
5.1     पर क ततकआऩद ए                    Natural calamities
6       ऩ कीपरज तत                       Bird species
6.1     ऩकषमोकीपरज तत                    Bird's species
6.2     ऩकषमोकीपरज ततम                   Bird's species
7       क षषसभसम                         Agriculture problem
7.1     क षषसभसम ओ                       Agriculture problems
8       कीटन शकक इसत भ र                 Use of pesticide
8.1     कीटन शकोक इसत भ र                Use of pesticides
9       भ नमसकय ग                        Mental illness
9.1     भ नमसकय चगमो                     Mental patients
9.2     भ नमसकय ग                        Mental patient
10      गर भ णषवक सम जन                  Policy for village
10.1    गर भ णषवक सम जन ओ                Policies for village
10.2    गर भ णषवक सम जन ए                Policies for village
11      परभ खक षषक दर                    Major agricultural office
11.1    परभ खक षषक नदरो                  Major agricultural offices


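Table 4.18 Effect of morphological factors on Hindi queries

For each root word, the first line gives the root word(s) and the keywords listed by Google, Bing and Guruji; the numbered lines give each morphological variant with the documents returned.

S.No.   Morphological variant    Documents returned
                                 Google     Bing      Guruji

1       Root words: ब यत वष मवन
        Keywords listed (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1     ब यत वष मवन              50500      4680      485
1.2     ब यत मवष मवनो            40400      680       61

2       Root word: द घमटन
        Keywords listed (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
2.1     द घमटन                   133000     2410      284
2.2     द घमटन ओ                 117000     420       23

3       Root word: ब ष
        Keywords listed (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
3.1     ब ष                      161000     8330      961
3.2     ब ष ए                    6200       935       188
3.3     ब ष ओ                    6090       441       356

4       Root word: झ र
        Keywords listed (Google, Bing, Guruji): झ रझ रो झ र झ र
4.1     झ र                      4740       278       25
4.2     झ र                      1270       28        1

5       Root word: आऩद
        Keywords listed (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
5.1     आऩद                      102000     4030      410
5.2     आऩद ए                    1160       64        20

6       Root words: ऩ परज तत
        Keywords listed (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1     ऩ परज तत                 48200      1670      98
6.2     ऩकषमोपरज तत              47600      1150      84
6.3     ऩकषमोपरज ततम             33800      747       25

7       Root word: सभसम
        Keywords listed (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1     सभसम                     584000     30200     1889
7.2     सभसम ओ                   584000     7150      1356

8       Root word: कीटन शक
        Keywords listed (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
8.1     कीटन शक                  36300      1360      333
8.2     कीटन शको                 35800      800       270

9       Root word: य ग
        Keywords listed (Google, Bing, Guruji): य ग य गोय ग य ग य ग
9.1     य ग                      205000     21600     1423
9.2     य चगमो                   128000     3280      239
9.3     य ग                      112000     6280      647

10      Root word: म जन
        Keywords listed (Google, Bing, Guruji): म जन ओ म जन म जन म जन
10.1    म जन                     673000     18500     3343
10.2    म जन ओ                   669000     6020      990
10.3    म जन ए                   673000     2860      416

11      Root word: क दर
        Keywords listed (Google, Bing, Guruji): क दरीमक दर क दर क दर
11.1    क दर                     261000     11300     655
11.2    क नदरो                   29500      1850      105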


Table 4.19 Precision values of the three search engines

S.No.   Query                            Precision (top 10)
                                         Google   Bing   Guruji
1.1     ब यतवष मवन                       0.5      0.3    0.1
1.2     ब यत मवष मवनो                    0.3      0.1    0.1
2.1     हव ईद घमटन क क यण               0.5      0.3    0.1
2.2     हव ईद घमटन ओ क क यण             0.3      0.3    0.1
3.1     ब यतभफ रीज न व रीब ष            0.9      0.5    0.5
3.2     ब यतभफ रीज न व रीब ष ए          0.5      0.4    0.3
3.3     ब यतभफ रीज न व रीब ष ओ          0.5      0.2    0.2
4.1     षवर पतह न ऩयझ र                  0.5      0.3    0.2
4.2     षवर पतह न ऩयझ र                  0.5      0.3    0
5.1     पर क ततकआऩद                      1        0.4    0.3
5.2     पर क ततकआऩद ए                    0.6      0.6    0.1
6.1     ऩ कीपरज तत                       1        0.7    0.2
6.2     ऩकषमोकीपरज तत                    0.9      0.6    0.1
6.3     ऩकषमोकीपरज ततम                   0.9      0.6    0.1
7.1     क षषसभसम                         0.7      0.3    0.2
7.2     क षषसभसम ओ                       0.7      0.4    0.2
8.1     कीटन शकक इसत भ र                 1        0.8    0.2
8.2     कीटन शकोक इसत भ र                0.9      0.6    0.2
9.1     भ नमसकय ग                        0.7      0.6    0.4
9.2     भ नमसकय चगमो                     0.9      0.6    0.3
9.3     भ नमसकय ग                        0.9      0.5    0.4
10.1    गर भ णषवक सम जन                  0.9      0.7    0.3
10.2    गर भ णषवक सम जन ओ                1        0.6    0
10.3    गर भ णषवक सम जन ए                0.8      0.6    0.2
11.1    परभ खक षषक दर                    0.5      0.4    0
11.2    परभ खक षषक नदरो                  0.4      0.2    0.2

Figure 4.1 Precision values of the three search engines [chart: precision (0 to 1) per query S.No. for Google, Bing and Guruji]


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word because, in general, keywords in their root form give better results.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On close observation, the precision of Google is high in almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of the results.
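The precision values discussed above are computed over the top 10 results. A minimal sketch of that calculation follows, assuming each of the top results has been judged relevant or not relevant by hand; the source does not describe the judging procedure, and the function name `precision_at_k` is chosen for the example.

```python
from typing import Sequence


def precision_at_k(relevant_flags: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    relevant_flags holds one judgment per returned result, in rank order.
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevant_flags)[:k]
    return sum(1 for flag in top_k if flag) / k


if __name__ == "__main__":
    judgments = [True, True, False, True, False, True, False, False, True, False]
    print(precision_at_k(judgments))
```

Under this definition, a value of 0.5 in Table 4.19 corresponds to five relevant documents among the first ten returned.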

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may choose any form of a keyword, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

Query 1 (standard keywords in bold in the original): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर

Keywords in Google query      No. of results    Precision (top 10)
ससजिमोिहयीर                   97                0.9
ससजिमो जहयीर                  632               0.9
सिबजमो िहयीर                  194               0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई

Keywords in Google query      No. of results    Precision (top 10)
आसभ न भहॊगाई                  35300             1.0
आसभ न भहॉगाई                  1040              1.0
आसभ भह ग ई                    14                0.6
आसभाॊ भह ग ई                   563               0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी

Keywords in Google query      No. of results    Precision (top 10)
भरषटाचायआज दी                 211000            0.8
भरशटाचायआज दी                 214               0.6
बयषटाचाय आज दी                447               0.7
भरषटट च यआिादी                1090000           0.9
भरशट च य आिादी                1040              0.7
बयषटट च यआिादी                1190              0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन

Keywords in Google query      No. of results    Precision (top 10)
अनन हज य आनदोरन               84700             0.3
अनन हज य आॊदोरन               85100             0.8
अनन हज य क आॉदोरन             78                0.6
अनना हिाय क आनद रन            399               0.5
अनन हिाय क आनद रन             3260000           1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी

Keywords in Google query      No. of results    Precision (top 10)
फयोिगायी                      9650              0.9
फयोिगायी                      80600             1.0
फयोिगायी                      170               0.7
फयोिगायी                      30                0.5


Figure 4.2 Precision charts for the phonetic nature of Hindi language [five pie charts, Query No. 1 to Query No. 5, one slice per phonetic variant of the query]

The table and figure above clearly show that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved with little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
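Because each phonetic variant retrieves a largely different set of documents, one possible remedy is to submit all variants and merge the result lists. The sketch below assumes the result lists have already been fetched by some other means and are identified by URL or document id; the function name `merge_variant_results` and the sample identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def merge_variant_results(results_by_variant: Dict[str, Iterable[str]]) -> List[str]:
    """Combine the result lists of all query variants, dropping repeated documents.

    Documents keep the order in which they are first seen, variant by variant.
    """
    seen = set()
    merged: List[str] = []
    for documents in results_by_variant.values():
        for doc_id in documents:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    # Hypothetical document ids for two spellings of the same keyword.
    sample = {
        "variant-1": ["doc-a", "doc-b", "doc-c"],
        "variant-2": ["doc-c", "doc-d"],
    }
    print(merge_variant_results(sample))
```

This simple concatenation favours the first variant; a fairer scheme might interleave the lists or re-rank the merged set, which is outside the scope of this sketch.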

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly spoken synonyms: गहन ज वयअर क य

Table 4.21 and Figure 4.3 compare the precision values of the three search engines.


Table 4.21 Effect of word synonyms on Hindi IR

For each numbered query, the first line gives the query and the standard Hindi word with its synonyms as extracted; the following lines give each synonym variant with documents returned and precision (top 10) per search engine.

S.No.   Query                    Google     P@10    Bing    P@10    Guruji   P@10

1       Query: स न क आब षण; standard word and synonyms: आब षणगहन
1.1     स न क आब षण              217000     0.8     3250    0.7     381      0.5
1.2     स न क गहन                188000     0.8     2590    0.6     389      0.5
1.3     स न क ज वय               78900      0.8     1670    0.7     311      0.6
1.4     स न क अर क य             9490       0.5     633     0.4     70       0
1.5     स न क आबयण               493        0.5     38      0.3     1        0

2       Query: क र फ दर; standard word and synonyms: फ दर फ दर
2.1     क र फ दर                 233000     0.7     7510    0.7     733      0.3
2.2     क र भ घ                  40700      0.9     1500    0.8     99       0.6
2.3     क र जरधय                 1570       0.6     54      0.6     2        0.2

3       Query: सतर सशिकतकयण; standard word and synonyms: सतर न यी
3.1     सतर सशिकतकयण             9950       0.9     1570    0.7     760      0.6
3.2     न यीसशिकतकयण             29300      0.9     1910    0.9     736      0.4
3.3     भहहर सशिकतकयण            96300      0.9     5160    0.8     1091     0.3
3.4     औयतसशिकतकयण              7670       0.8     680     0.7     510      0.2

4       Query: मसक दयक अ हक य; standard word and synonyms: अ हक य
4.1     मसक दयक अ हक य           1990       0.4     18      0.3     60       0.6
4.2     मसक दयक अमबभ न           2400       0.6     304     0.2     16       0.1
4.3     मसक दयक घभ ड             495        0.5     54      0.6     9        0.1

5       Query: व रग ओ; standard word and synonyms: व
5.1     व रग ओ                   6960       1       698     0.8     29       0.5
5.2     ऩ डरग ओ                  13400      1       1080    0.9     143      0.9
5.3     दयखतरग ओ                 481        0.5     19      0.5     0        0

6       Query: आ खद न; standard word and synonyms: आ ख
6.1     आ खद न                   34000      0.7     3690    0.5     312      0.3
6.2     न तरद न                  77500      1       3240    1       159      0.9
6.3     च द न                    2450       0.8     427     0.1     36       0.1


Figure 4.3 Comparison of precision values against three search engines [chart: precision (0 to 1) for Google, Bing and Guruji, by query S.No. 1.1 to 6.3]

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the table and figure above, it can be observed that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It thus becomes clear that replacing a keyword with a synonym can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in another context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.


Table 4.22 List of randomly selected ambiguous queries

S.No.   Query                    For keyword as       In English                       For keyword as       In English
1       न यी क स न ऩस द ह        स न (Gold)           Women like gold                  स न (To sleep)       Women like to sleep
2       आभ र गो की ऩस द          आभ (Common)          Common man's choice              आभ (Mango)           Mango is people's choice
3       फ र षवक सऔय ऩ षण         फ र (Children)       Child development and nutrition  फ र (Hair)           Hair development and nutrition
4       सऩ यो क पन               पन (Art)             Art of snake charmers            पन (Snake head)      Snake charmer's snake head
5       म दध भ क र षवन श         क र (Aggregate)      Aggregate destruction in wars    क र (Family)         Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in Tables 4.23 to 4.25.

Table 4.23 Ambiguity test for Google

Query   Ambiguous   Documents   Context      Results   Context       Results   Other
        keyword     returned                 found                   found     context
1       स न         50800       Gold         5         To sleep      2         3
2       आभ          488000      Common       3         Mango         3         4
3       फ र         2900000     Children     7         Hair          3         0
4       पन          184         Art          0         Snake head    10        0
5       क र         17800       Aggregate    2         Family        3         5

Table 4.24 Ambiguity test for Bing

Query   Ambiguous   Documents   Context      Results   Context       Results   Other
        keyword     returned                 found                   found     context
1       स न         2680        Gold         2         To sleep      2         6
2       आभ          17800       Common       3         Mango         3         4
3       फ र         4030        Children     6         Hair          2         2
4       पन          25          Art          0         Snake head    9         1
5       क र         1900        Aggregate    0         Family        2         8


Table 4.25 Ambiguity test for Guruji

Query   Ambiguous   Documents          Context      Results   Context       Results   Other
        keyword     returned                        found                   found     context
1       स न         109                Gold         0         To sleep      0         10
2       आभ          6756               Common       3         Mango         0         7
3       फ र         635                Children     5         Hair          2         3
4       पन          No results found   Art          na        Snake head    na        na
5       क र         84                 Aggregate    0         Family        0         10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. It is clear that all the search engines return documents in different contexts, so search engines can be said to underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सऩ यो क पन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the ambiguity of keywords in order to obtain better results.
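The "deviation from relevance" described for the ambiguity tests can be expressed as the share of judged results falling into each context. The sketch below uses the counts reported for query 5 in Tables 4.23 to 4.25; the names `QUERY_5_COUNTS` and `context_shares` are introduced here for illustration only.

```python
from typing import Dict, Tuple

# (intended context, other listed meaning, other context) among the top 10
# results for query 5, as reported in Tables 4.23 to 4.25.
QUERY_5_COUNTS: Dict[str, Tuple[int, int, int]] = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def context_shares(intended: int, alternate: int, other: int) -> Dict[str, float]:
    """Return the fraction of judged results that fall into each context."""
    total = intended + alternate + other
    if total == 0:
        raise ValueError("no judged results")
    return {
        "intended": intended / total,
        "alternate": alternate / total,
        "other": other / total,
    }


if __name__ == "__main__":
    for engine, counts in QUERY_5_COUNTS.items():
        print(engine, context_shares(*counts))
```

The "intended" share is simply the precision of the engine for the sense the user meant, so a high "other" share indicates that the engine did not separate the senses of the ambiguous keyword.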


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Hindi spellings, in order: Google transliteration, standard Hindi keyword, two phonetic equivalents | Google results for each spelling, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8


Table 4.26: Keywords along with their phonetic variants written in Hindi

The table above shows that the search engine returns documents for single-keyword queries, and that it also returns documents, often in huge numbers, for every phonetic variant of a keyword. People evidently have their own ways of writing Hindi words, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetically variant English keyword written in Hindi script. In the table above, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that in most cases transliteration does not produce the correct Hindi word. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम, which is completely wrong. Even so, the table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000 (row continued from Table 4.26)

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the number of documents retrieved. The software (its design and working are discussed in the next chapter) transforms each Hindi query into variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query. The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent. At the second level, more than one word is replaced by its English equivalent, again without changing the meaning of the original Hindi query. Example:

The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein). Here the Hindi keyword षवद श means "foreign" (प य न when the English word is written in Hindi script) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as follows:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query supplied by the user is thus transformed into two equivalent forms that mix English and Hindi, while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
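The two-level transformation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the software discussed in the next chapter: the function name `transform_query`, the dictionary of English equivalents and the Devanagari spellings of the English words are all hypothetical. Level 1 here contains every variant with exactly one keyword replaced, and Level 2 every variant with two or more keywords replaced.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def transform_query(
    tokens: List[str], english_equivalents: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (level_1, level_2) variants of a tokenised Hindi query."""
    # Positions of the keywords that have an English equivalent in Hindi script.
    replaceable = [i for i, tok in enumerate(tokens) if tok in english_equivalents]
    level_1: List[str] = []
    level_2: List[str] = []
    for count in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, count):
            variant = [
                english_equivalents[tok] if i in positions else tok
                for i, tok in enumerate(tokens)
            ]
            # Level 1 replaces exactly one keyword; Level 2 replaces two or more.
            (level_1 if count == 1 else level_2).append(" ".join(variant))
    return level_1, level_2


if __name__ == "__main__":
    query = "विदेशी निवेश भारत में".split()
    # Assumed Devanagari spellings of "foreign" and "investment".
    equivalents = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
    level_1, level_2 = transform_query(query, equivalents)
    print("Level 1:", level_1)
    print("Level 2:", level_2)
```

Words without an entry in the dictionary are left untouched, so the meaning of the query is preserved as long as the dictionary itself is reliable.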


Table 4.27: Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation)

In English | Hindi query | Level 1 | Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4: Queries transformed into two equivalent senses (bar chart of Google result counts for Hindi Queries 1 to 6, each shown in its original form, at Level 1 and at Level 2)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity, so Table 4.28 and Figure 4.5 below present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve the web results.

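Table 4.28: Analysis of precision values (tabular representation)

Query variant | Precision@10 Google | Precision@10 Bing | Precision@10 AltaVista
Hindi query: सव सथऔययकतद न | 0.9 | 0.9 | 0.9
Level 1: ह लथऔययकतद न | 0.9 | 0.8 | 0.9
Level 2: ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
Hindi query: यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
Level 1: बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
Level 2: बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
Hindi query: रदमचचककतसक | 0.9 | 0.9 | 0.9
Level 1: ह टमचचककतसक | 0.8 | 0.7 | 0.7
Level 2: क रड मम र िजसट | 1 | 1 | 1
Hindi query: सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
Level 1: सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
Level 2: सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
Hindi query: षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
Level 1: प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
Level 2: प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
Hindi query: भरषटट च यभ कतब यत | 1 | 1 | 1
Level 1: कयपशनभ कतब यत | 1 | 1 | 1
Level 2: कयपशनफरीइ रडम | 1 | 1 | 1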


Figure 4.5: Analysis of inter-precision values

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming the queries into variants that include English keywords written in Hindi script. Transforming the queries resulted in an increase in the data retrieved. The precision columns show the relevance of the retrieved data: for every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. The same pattern repeats for the rest of the Hindi queries in the table above. Without query transformation the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the English-influenced variants of the query. English-influenced Hindi information is present on the web and is increasing day by day. Including English keywords written in Hindi script in the query therefore widens the scope of searching in Hindi and of retrieving relevant information.

Inter-precision chart: x-axis HQ1 to HQ6 (HQ = Hindi Query), each followed by L1 and L2 (L = Level); series: Google, Bing, AltaVista.


Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor look for keyword equivalents, because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level could cause performance and throughput problems. However, the problem can be solved at the interface level. To reduce the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of a web search in the Hindi language [66][67].
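As a rough illustration of what solving the problem at the interface level could look like, the sketch below sends every variant of a query to a search function and merges the results without duplicates. It is not the tool described in the next chapter: `merged_search`, the `search` callable and the sample URLs are hypothetical placeholders, and a real interface would call an actual search engine instead of the in-memory stand-in used here.

```python
from typing import Callable, Dict, Iterable, List


def merged_search(
    variants: Iterable[str],
    search: Callable[[str], List[str]],
    per_query: int = 10,
) -> Dict[str, List[str]]:
    """Run every query variant and merge the result URLs.

    Returns a dict that maps each URL to the variants that retrieved it,
    in first-seen order (Python dicts preserve insertion order).
    """
    merged: Dict[str, List[str]] = {}
    for query in variants:
        # Only the top `per_query` results of each variant are considered.
        for url in search(query)[:per_query]:
            merged.setdefault(url, []).append(query)
    return merged


if __name__ == "__main__":
    # Stand-in for a real search engine call (assumption: returns ranked URLs).
    fake_index = {
        "original query": ["http://example.org/a", "http://example.org/b"],
        "level 1 variant": ["http://example.org/b", "http://example.org/c"],
    }
    results = merged_search(fake_index, lambda q: fake_index.get(q, []))
    for url, found_by in results.items():
        print(url, found_by)
```

Keeping the list of variants that retrieved each URL makes it possible to see which documents would have been missed by the original query alone.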


4.5.8 SHIVA and SHAKTI MT

Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly translated sentence. The Shiva MT system is available at (http://ebmt.serc.iisc.ernet.in/mt/login.html).

Shakti is a rule-based system that also uses a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at (http://shakti.iiit.net) [24].

4.5.9 Tamil-Hindi MAT System

The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat

4.5.10 Anubadok

Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports the processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl

4.5.11 Punjabi to Hindi Machine Translation System

In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre Processing, Translation Engine and Post Processing.

4.5.12 Contribution of Private Companies in Evolving the ILT

4.5.12.1 Indian Language Search Engine: Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, and assisted by Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query, it goes deep into Indian language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.12.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.12.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu, and is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language in WikiBhasa is English, and the target language can be any Indian local language.


4.5.12.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication, and it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.12.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-governance program of the Government of Gujarat.

4.5.13 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology, under the Ministry of Communication and Information Technology, is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved, directly and indirectly, in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for Indian languages on the pattern of the English WordNet.

4.5.13.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating very fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some small and some large, some with huge funding and some with none. The space is now so crowded that it is difficult to know who is really winning. The following is a brief overview of the current scenario.

Search engines that fall in the localized Indian category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search: English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search: English

4.6.1.3 Info Search: Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, reveals a severe problem of low recall when information is accessed with Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa", "वलडम टर ड स नटय आत कव दी हरभ", returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. Merely stating that there is a recall problem is not sufficient, though. The obvious next question is how much of a problem it is; in other words, the problem needs to be quantified. For this purpose, many experiments were conducted to determine at what levels the recall problem occurs, and by how much.

Table 4.15: Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, the grammatical and structural standards are rarely followed, for various reasons. One reason is the language diversity of India: including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes its dialects not only in speech but also in writing. Every language attaches markers to a root word to construct new words (English uses s, es and ing; Hindi uses maatraas such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) in Hindi are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
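The idea of reducing morphological variants to a root word can be illustrated with a very small suffix-stripping stemmer. This is a minimal sketch, assuming a hand-picked list of plural and oblique endings written in Unicode Devanagari; the rule list and the function name `light_stem` are hypothetical and do not describe the indexing used by any of the search engines discussed here. A light stemmer of this kind can over-stem some words, so a practical system would need a larger rule set and an exception list.

```python
from typing import List, Tuple

# Illustrative endings and the ending each one is replaced with.
# Longer suffixes come first so that they are tried before shorter ones.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ियाँ", "ी"),  # नदियाँ -> नदी
    ("ाओं", "ा"),   # योजनाओं -> योजना
    ("ाएँ", "ा"),   # योजनाएँ -> योजना
    ("ाएं", "ा"),   # योजनाएं -> योजना
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),     # झीलें -> झील
]


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip one known ending, keeping at least `min_stem_len` code points."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएँ"]:
        print(variant, "->", light_stem(variant))
```

The `min_stem_len` guard keeps very short function words such as में from being truncated.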

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these result in different spellings of the same word. For example, the Hindi word अ गर ज (angrējī, meaning English) has several possible spelling variations.

There are numerous words that are phonetically equivalent but written differently. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the single standard keyword school सक र and for the non-standard phonetically equivalent keyword सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words. For example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by a large part of the population in different states of India, and more than two thousand results are found for the Hindi word ISKOOL (इसक र). Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any of these keywords for searching, and search engines should support all phonetically equivalent words.

Moreover, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved, with very little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
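One simple way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalised key and group words that share a key. The sketch below is an assumption-laden illustration, not a method taken from this study: it handles only three common sources of variation in Unicode Devanagari (the nukta dot, chandrabindu versus anusvara, and a half nasal consonant versus anusvara), and the names `phonetic_key` and `group_variants` are hypothetical. The key is meant for matching only and is not a correct spelling to display.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List

NUKTA = "\u093c"         # dot that marks borrowed sounds, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# A nasal consonant + halant before another consonant (e.g. न् in आन्दोलन).
HALF_NASAL = re.compile("[ङञणनम]\u094d(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Return a normalised key shared by common spelling variants."""
    text = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # आज़ादी and आजादी share a key
    text = text.replace(CHANDRABINDU, ANUSVARA)  # महँगाई and महंगाई share a key
    return HALF_NASAL.sub(ANUSVARA, text)        # आन्दोलन and आंदोलन share a key


def group_variants(words: Iterable[str]) -> Dict[str, List[str]]:
    """Group words by their phonetic key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[phonetic_key(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    sample = ["आन्दोलन", "आंदोलन", "महँगाई", "महंगाई", "आज़ादी", "आजादी"]
    for key, variants in group_variants(sample).items():
        print(key, variants)
```

Variation in vowel length or inserted vowels, as in the school and ISKOOL example, is not covered by these rules and would need additional, carefully tested mappings.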

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly used synonyms: गहन ज वयअर क य

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:

(In English) Women like gold
(In Hindi) न यी क स न ऩस द ह

In this query the word स न (gold) is ambiguous, because it also means "to sleep". In the context of the query the word स न means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आभ र गो की ऩस द

Here the word आभ is ambiguous. In the query above it means "common"; in Hindi, however, it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task, because one word can have more than one meaning and the meaning depends on the context of the sentence. For example, कय means "tax" in one context (with synonyms बम ज श लक स द भहस ऱ ट कस), "hand or arm" in another context (हसत फ ह आच शफय) and "to do" (कयन) in yet another.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, who may not be aware which language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

A sample set of 50 queries was used to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Birds' species
6.2 | ऩकषमोकीपरज ततम | Birds' species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


Table 4.18: Effect of morphological factors on Hindi queries

Documents returned for each keyword:

S. No. | Root word(s) | Keyword | Google | Bing | Guruji
1.1 | ब यत वष मवन | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत वष मवन | ब यत मवष मवनो | 40400 | 680 | 61
2.1 | द घमटन | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन | द घमटन ओ | 117000 | 420 | 23
3.1 | ब ष | ब ष | 161000 | 8330 | 961
3.2 | ब ष | ब ष ए | 6200 | 935 | 188
3.3 | ब ष | ब ष ओ | 6090 | 441 | 356
4.1 | झ र | झ र | 4740 | 278 | 25
4.2 | झ र | झ र | 1270 | 28 | 1
5.1 | आऩद | आऩद | 102000 | 4030 | 410
5.2 | आऩद | आऩद ए | 1160 | 64 | 20
6.1 | ऩ परज तत | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩ परज तत | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩ परज तत | ऩकषमोपरज ततम | 33800 | 747 | 25
7.1 | सभसम | सभसम | 584000 | 30200 | 1889
7.2 | सभसम | सभसम ओ | 584000 | 7150 | 1356
8.1 | कीटन शक | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शक | कीटन शको | 35800 | 800 | 270
9.1 | य ग | य ग | 205000 | 21600 | 1423
9.2 | य ग | य चगमो | 128000 | 3280 | 239
9.3 | य ग | य ग | 112000 | 6280 | 647
10.1 | म जन | म जन | 673000 | 18500 | 3343
10.2 | म जन | म जन ओ | 669000 | 6020 | 990
10.3 | म जन | म जन ए | 673000 | 2860 | 416
11.1 | क दर | क दर | 261000 | 11300 | 655
11.2 | क दर | क नदरो | 29500 | 1850 | 105

Listing of keywords (morphological variants) shown by Google, Bing and Guruji, in that order, for each root word:

1 | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
2 | द घमटन द घमटन ओ द घमटन द घमटन
3 | ब ष ब ष ओ ब ष ब ष
4 | झ रझ रो झ र झ र
5 | आऩद आऩद ओ आऩद आऩद
6 | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
7 | सभसम ए सभसम ओ सभ सम सभसम सभसम
8 | कीटन शक कीटन शको कीटन शक कीटन शक
9 | य गोय ग य ग य ग
10 | म जन ओ म जन म जन म जन
11 | क दरीमक दर क दर क दर


Table 4.19: Precision values of the three search engines

S. No. | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (chart with precision from 0 to 1 on the y-axis, query numbers 1.1 to 11.1 on the x-axis, and the series P Google, P Bing and P Guruji)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents by root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists the morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On close observation, the precision of Google is high for almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
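The precision value described above can be sketched as follows. This is a minimal illustration, assuming that each of the top 10 results has already been judged relevant or not by hand; the function name `precision_at_k` and the sample judgments are hypothetical and not taken from the study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that were judged relevant.

    `relevance` is a list of booleans, one per returned result, in rank order.
    Missing results (fewer than k returned) count as non-relevant.
    """
    top = relevance[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


# Hypothetical judgments for one query: 9 of the first 10 results are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns fewer than ten documents, which matches reading the precision "per 10" results.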

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and the search engine should support all of its phonetically equivalent forms.


The following queries were randomly selected from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of the Hindi language

Columns: Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Google results for the query having those keywords: No. of results | Precision@10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05


Figure 4.2: Precision charts for the phonetic nature of the Hindi language (one chart per query, Query No. 1 to Query No. 5)

The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. The results vary for every phonetically equivalent keyword in the query, i.e. a different set of documents is retrieved with the least repetition. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word (आब षण), "ornament" in English, has the following commonly spoken synonyms: गहन ज वयअर क य

Table 4.21 and Figure 4.3 below compare the precision values across the three search engines.


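Table 4.21: Effect of word synonyms on Hindi IR

S. No. | Query | Google documents returned | Google Precision@10 | Bing documents returned | Bing Precision@10 | Guruji documents returned | Guruji Precision@10

1. Query: स न क आब षण (standard Hindi words and synonyms: आब षणगहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2. Query: क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3. Query: सतर सशिकतकयण (standard Hindi words: सतर न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4. Query: मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5. Query: व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6. Query: आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1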


Figure 4.3: Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Synonyms of Hindi keywords affect not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first: its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in another context. The fourth and sixth columns give the English meaning of the query for each context of the ambiguous keyword.

[Figure 4.3 chart: precision (0 to 1) of Google, Bing and Guruji for query variants 1.1 to 6.3]


Table 4.22: List of randomly selected ambiguous queries

S. No. | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in another, unwanted context. Because every search engine returns documents from different contexts, search engines can be said to underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सऩ यो क पन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address ambiguity in keywords in order to obtain better results.


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word "exercise" is written in Hindi as (एकसयस इज). It has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested.

The following table presents a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

English word | Hindi forms, in order: Google transliteration, standard Hindi keyword, phonetic equivalents | Google results for each Hindi form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

Table 4.26 shows that the search engine returns documents for single-keyword queries, and that it also returns documents, in large numbers, for every phonetic variant of each keyword. People evidently have their own ways of writing Hindi words, and no standard is followed for storing Hindi data on the web. Documents are also retrieved for every phonetic variant of an English keyword written in Hindi script. The column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. The table shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (design and working discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.


The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged. Example:

The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" ("प य न") and तनव श means "investment" ("इनव सटभट"). The query is transformed for the two levels as:

प य न तनव श ब यत भ (Foreign nivesh Bharat mein)

प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is thus transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
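The two-level transformation described above can be sketched as follows. This is only an illustration under stated assumptions and not the software discussed in the next chapter: the `LOANWORDS` table, the function name `transform_levels` and the Devanagari spellings are hypothetical, hand-written for the "foreign investment" example.

```python
# Hypothetical loanword table: Hindi keyword -> English word written in Devanagari.
LOANWORDS = {
    "विदेशी": "फॉरेन",          # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",    # nivesh  -> "investment"
}


def transform_levels(query, loanwords=LOANWORDS):
    """Return the Level 1 and Level 2 variants of a whitespace-separated query.

    Level 1 replaces only the first replaceable keyword; Level 2 replaces
    every replaceable keyword and is produced only when there is more than one.
    """
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in loanwords]
    variants = {}
    if replaceable:
        level1 = list(tokens)
        first = replaceable[0]
        level1[first] = loanwords[tokens[first]]
        variants["level1"] = " ".join(level1)
    if len(replaceable) > 1:
        variants["level2"] = " ".join(loanwords.get(t, t) for t in tokens)
    return variants


for level, variant in transform_levels("विदेशी निवेश भारत में").items():
    print(level, variant)
```

A query with no entry in the table yields an empty dictionary, so the caller can fall back to the original query unchanged.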


Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4: Transformed queries into two equivalent senses (Google result counts for Hindi Query 1 to 6 and their Level 1 and Level 2 variants)


The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. Since for search engines the quality of results is more important than the quantity, Table 4.28 and Figure 4.5 below present an analysis of the precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28: Analysis of precision values (tabular representation)

Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google Precision@10 (query, Level 1, Level 2) | Bing Precision@10 (query, Level 1, Level 2) | AltaVista Precision@10 (query, Level 1, Level 2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1


Figure 4.5: Analysis of inter-precision values

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. Transforming the queries increases the amount of data retrieved, and the relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant. The same pattern repeats for the rest of the Hindi queries in the table above. Without transformation of Hindi queries the user may miss relevant information, because the Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that reflect English influence. The table also suggests that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

[Figure 4.5 chart: inter-precision chart of Google, Bing and AltaVista for each Hindi query (HQ1 to HQ6) and its Level 1 (L1) and Level 2 (L2) variants]


Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].


system architecture consists of three processing modules: Pre-Processing, Translation Engine and Post-Processing.

4.5.1.2 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine

4.5.1.2.1 Guruji

Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it searches deep into Indian-language content and tries to return the appropriate output. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.

4.5.1.2.2 Google

The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.

4.5.1.2.3 Microsoft Indic Input Tool

Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu, and is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into multilingual pages. The source language in WikiBhasa is English and the target language can be any Indian local language(s).


4.5.1.2.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation/collection and content syndication, and it provides language consultancy. It has developed various applications in Indian languages such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.

4.5.1.2.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, in corporate offices and in the e-Governance program of the Government of Gujarat.

4.5.1.3 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970 the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also making efforts towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.1.3.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, probably more than 10 to 15 Indian local search engines were launched, some small and some big, some with huge funding and some with none. This space is now so crowded that it is difficult to know who is really winning. Nevertheless, we attempt a brief overview of the current scenario. The following fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched recently. There are also localized versions of the big players Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search English

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search English

4.6.1.3 Info Search Local Language

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search Local Language

4.7 Problems Faced while Searching in Hindi: Low Recall

A preliminary investigation of typical information access technologies, using present-day popular techniques, shows a severe problem of low recall when information is accessed with Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing the information exist. In reality these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa", "वलडम टर ड स नटय आत कव दी हरभ", returns 0 documents in Table 4.15, whereas a small rephrasing of the query, as in Table 4.16, returns results containing these keywords. Merely stating that there is a recall problem is not sufficient, however. The next obvious question is how much of a problem it is; in other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels the recall problem occurs and by how much.

Table 4.15: Problems faced while searching in Hindi: low recall

Hindi query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall

Hindi query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and a well-defined grammar. In practice, however, the grammatical and structural standards are rarely followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes its dialects not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, and Hindi uses MAATRAAS such as ऐ म ा ओ. For example, Yojnaaon म जन ओ and Yojnaayein म जन ए in Hindi are morphological variants of the root word Yojnaa म जन (planning in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
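Word stemming as described above can be sketched with a very small suffix-stripping routine. This is a minimal illustration, assuming a hand-picked suffix list; `SUFFIXES`, `stem` and `group_by_stem` are hypothetical names, and a real Hindi stemmer would need a far larger rule set and exception handling.

```python
# Hypothetical, deliberately tiny list of plural/oblique endings.
SUFFIXES = ("ओं", "एं", "एँ", "ों", "ें")


def stem(word, min_stem_length=2):
    """Strip the first matching suffix, keeping at least `min_stem_length` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of surface forms that reduce to it."""
    groups = {}
    for word in words:
        groups.setdefault(stem(word), []).append(word)
    return groups


# Yojnaa, Yojnaaon and Yojnaayein reduce to one canonical form.
print(group_by_stem(["योजना", "योजनाओं", "योजनाएं"]))
```

Indexing and querying on the stem rather than the surface form is what lets a query for the root word also match its morphological variants.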

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations in the language are the phonetic nature of Indian languages, the multiplicity of dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabets. Together these factors produce several spellings of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spellings.

Numerous words are phonetically equivalent but are written differently. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword school (सक र) and for a non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words. For example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इस्कूल) by much of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. A user may use any of these keywords for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keyword to fetch Hindi web data. For each phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved with little repetition. The native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
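One way to support phonetically equivalent spellings is to reduce every keyword to a normalized key before comparison. The sketch below is only an illustration of that idea: the function name `phonetic_key` and the three rules it applies (dropping the nukta, treating chandrabindu as anusvara, and writing a nasal consonant plus halant before another consonant as anusvara) are assumptions chosen for this example, not rules taken from this study.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# A nasal consonant followed by halant, directly before another consonant.
NASAL_CLUSTER = re.compile("[ङञणनम]\u094d(?=[\u0915-\u0939])")


def phonetic_key(word):
    """Return a crude spelling-insensitive key for a Devanagari word."""
    key = unicodedata.normalize("NFD", word)   # split precomposed nukta letters
    key = key.replace(NUKTA, "")               # ज़ -> ज
    key = key.replace(CHANDRABINDU, ANUSVARA)  # महँगाई -> महंगाई
    key = NASAL_CLUSTER.sub(ANUSVARA, key)     # आन्दोलन -> आंदोलन
    return key


if __name__ == "__main__":
    print(phonetic_key("आन्दोलन") == phonetic_key("आंदोलन"))
```

Such a key is intentionally lossy and can conflate distinct words, so it is better suited to generating candidate query variants than to replacing the user's keyword outright.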

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
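The effect of synonyms on a query can be illustrated by generating one query variant per synonym. The following is a minimal sketch: the dictionary `SYNONYMS`, the function name `expand_query` and the example query are assumptions made for illustration, and a usable system would draw its synonyms from a lexical resource rather than a hand-written table.

```python
# Hypothetical synonym table holding the single example from the text.
SYNONYMS = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym."""
    tokens = query.split()
    variants = [query]
    for position, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            replaced = tokens[:position] + [synonym] + tokens[position + 1:]
            variants.append(" ".join(replaced))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is the situation the synonym experiments in this chapter examine by hand.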

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous because it also means "to sleep". In the context of the query सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous, and finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर means tax in one context (with synonyms बम ज श लक स द भहस ऱ ट कस), hand or arm in another context (हसत फ ह आच शफय), and "to do" (करना) in yet another.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S. No | Root word / morphological variant | Listing of keywords (Google, Bing, Guruji) | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | |
1.1 | ब यत वष मवन | | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | |
2.1 | द घमटन | | 133000 | 2410 | 284
2.2 | द घमटन ओ | | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | | |
3.1 | ब ष | | 161000 | 8330 | 961
3.2 | ब ष ए | | 6200 | 935 | 188
3.3 | ब ष ओ | | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | | |
4.1 | झ र | | 4740 | 278 | 25
4.2 | झ र | | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | | |
5.1 | आऩद | | 102000 | 4030 | 410
5.2 | आऩद ए | | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | | |
6.1 | ऩ परज तत | | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | |
7.1 | सभसम | | 584000 | 30200 | 1889
7.2 | सभसम ओ | | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | |
8.1 | कीटन शक | | 36300 | 1360 | 333
8.2 | कीटन शको | | 35800 | 800 | 270
9 | य ग | य गोय ग य ग य ग | | |
9.1 | य ग | | 205000 | 21600 | 1423
9.2 | य चगमो | | 128000 | 3280 | 239
9.3 | य ग | | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | | |
10.1 | म जन | | 673000 | 18500 | 3343
10.2 | म जन ओ | | 669000 | 6020 | 990
10.3 | म जन ए | | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | | |
11.1 | क दर | | 261000 | 11300 | 655
11.2 | क नदरो | | 29500 | 1850 | 105

Table 4.19 Precision values of the three search engines

S. No | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (Google, Bing and Guruji, plotted per query number on a scale of 0 to 1)

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists the morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is higher in almost all queries. Since Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variations in the keywords affect not only the quantity but also the quality of the results.
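The precision value used throughout this chapter is the fraction of the top 10 results that are relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the list-of-flags input format are assumptions for illustration, since the relevance judgments in this study were made by inspecting the results manually.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the first k results judged relevant.

    relevance_flags is a list of 1 (relevant) or 0 (not relevant),
    ordered by the rank at which the search engine returned the result.
    """
    top_k = relevance_flags[:k]
    return sum(top_k) / k


if __name__ == "__main__":
    # Five relevant documents among the first ten gives 0.5.
    print(precision_at_k([1, 1, 0, 1, 0, 0, 1, 0, 1, 0]))
```

Dividing by k rather than by the number of results returned means that a query returning fewer than 10 documents is penalized for the missing ones, which is consistent with reporting a precision of 0 when nothing relevant is found.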

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. A user may use any of these keywords for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

No. | Query submitted to Google | No. of results | Precision@10

Query 1: ससजिमो भ िहयीर ऩद थम (phonetic variations of the keywords: सिबजमो, सबज मो, जहयीर, जहरयर)
1 | ससजिमोिहयीर | 97 | 0.9
1.1 | ससजिमो जहयीर | 632 | 0.9
1.2 | सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई (phonetic variations of the keywords: आसभ , भह ग ई, भ ह ग ई)
2 | आसभ न भहॊगाई | 35300 | 1.0
2.1 | आसभ न भहॉगाई | 1040 | 1.0
2.2 | आसभ भह ग ई | 14 | 0.6
2.3 | आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी (phonetic variations of the keywords: भरशट च य, बयषटट च य, आज दी)
3 | भरषटाचायआज दी | 211000 | 0.8
3.1 | भरशटाचायआज दी | 214 | 0.6
3.2 | बयषटाचाय आज दी | 447 | 0.7
3.3 | भरषटट च यआिादी | 1090000 | 0.9
3.4 | भरशट च य आिादी | 1040 | 0.7
3.5 | बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन (phonetic variations of the keywords: अनन हज य, आ द रन, आ द रन)
4 | अनन हज य आनदोरन | 84700 | 0.3
4.1 | अनन हज य आॊदोरन | 85100 | 0.8
4.2 | अनन हज य क आॉदोरन | 78 | 0.6
4.3 | अनना हिाय क आनद रन | 399 | 0.5
4.4 | अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न (phonetic variations of the keywords: फ य जग यी, फय जग यी, फ य जा यी)
5 | फयोिगायी | 9650 | 0.9
5.1 | फयोिगायी | 80600 | 1.0
5.2 | फयोिगायी | 170 | 0.7
5.3 | फयोिगायी | 30 | 0.5

Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keyword to fetch Hindi web data. For each phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved with little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S. No | Query | Standard Hindi word and synonyms | Google documents | Google precision@10 | Bing documents | Bing precision@10 | Guruji documents | Guruji precision@10
1 | स न क आब षण | आब षणगहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines (Google, Bing and Guruji, plotted per query number on a scale of 0 to 1)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns the most documents and Guruji the fewest. The reason may be that fewer documents are available to Guruji or that its indexing is poor. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It thus becomes clear that changing a keyword into its synonym can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.


Table 4.22 List of randomly selected ambiguous queries

S. No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unintended context. Since all the search engines return documents in several contexts, it can be said that they underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars) the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context of art rather than the other context (snake charmer's snake head). The results show, however, that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the ambiguity of keywords in order to obtain better results.
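The ambiguity tests in Tables 4.23 to 4.25 can be summarized as the share of the top 10 results that fall in the intended sense of the keyword. The sketch below applies that idea to the Google counts from Table 4.23; the names `GOOGLE_TOP10` and `intended_share` are assumptions introduced for this illustration.

```python
# (intended sense, alternate sense, other context) counts among the
# top 10 Google results for queries 1 to 5, copied from Table 4.23.
GOOGLE_TOP10 = {
    1: (5, 2, 3),
    2: (3, 3, 4),
    3: (7, 3, 0),
    4: (0, 10, 0),
    5: (2, 3, 5),
}


def intended_share(counts):
    """Share of inspected results that match the intended sense."""
    intended, alternate, other = counts
    total = intended + alternate + other
    return intended / total if total else 0.0


if __name__ == "__main__":
    for query_no, counts in GOOGLE_TOP10.items():
        print(query_no, intended_share(counts))
```

A share well below 1 for a query signals that the engine mixes senses of the ambiguous keyword, which is the underperformance discussed above.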

4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word exercise is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Based on such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (same order as the four Hindi forms)
Woman | वोभन | व भन | व भ न | व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609, 395000, 69700, 289000

The table above shows that the search engine returns documents for a single-keyword query, and that it also returns documents, in large numbers, for every phonetic variant of the keyword. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the second column holds the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi without changing the meaning of the query.

The queries are converted at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:

प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries that mix English and Hindi keywords while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
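The two-level transformation described above can be sketched as follows. This is an illustration only and not the software developed in this study: the mapping `LOANWORDS`, its Devanagari spellings of "foreign" and "investment", and the function name `transform_levels` are assumptions.

```python
# Hypothetical mapping from Hindi keywords to English words written in
# Devanagari; the spellings are assumed for this example.
LOANWORDS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}


def transform_levels(query, loanwords=LOANWORDS):
    """Return (level 1, level 2) variants of a Hindi query.

    Level 1 replaces only the first replaceable keyword; level 2
    replaces every keyword that has an English equivalent.
    """
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in loanwords]
    if not replaceable:
        return query, query
    first = replaceable[0]
    level1 = tokens[:first] + [loanwords[tokens[first]]] + tokens[first + 1:]
    level2 = [loanwords.get(token, token) for token in tokens]
    return " ".join(level1), " ".join(level2)


if __name__ == "__main__":
    level1, level2 = transform_levels("विदेशी निवेश भारत में")
    print(level1)
    print(level2)
```

Words without an entry in the mapping, such as भारत and में here, are left unchanged, so the meaning of the query is preserved at both levels.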

Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4 Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each showing the Hindi query, Level 1 and Level 2)

The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values: tabular representation

Query form | Query | Precision@10: Google | Bing | AltaVista
Hindi query 1 | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
Hindi query 2 | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
Hindi query 3 | रदमचचककतसक | 0.9 | 0.9 | 0.9
Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
Level 2 | क रड मम र िजसट | 1 | 1 | 1
Hindi query 4 | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
Hindi query 5 | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
Hindi query 6 | भरषटट च यभ कतब यत | 1 | 1 | 1
Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1

Figure 4.5 Analysis of inter precision values (Google, Bing and AltaVista, for each Hindi query HQ1 to HQ6 and its Level 1 and Level 2 variants)

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. The same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries the user may miss relevant information, because the Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variation based on English influence. The table above suggests that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.


Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

The relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at root level. This problem can, however, be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in Hindi [66][67].


4.5.1.2.4 Webdunia

Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication, and it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.

4.5.1.2.5 Modular InfoTech

Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for publishing newspapers, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-governance program of the Government of Gujarat.

4.5.1.3 Government Effort for Evolving Language Technology

The Indian government was aware of this fact. Since 1970 the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al. 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working for the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

4.5.1.3.1 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides the standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric, localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines must have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario. The following fall into the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants: Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

Website      2008 Stats    2009 Stats
Google       37            35
Yahoo        32            25
Rediff       7             4
Orkut        6             7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search: English

Website      2008 Stats    2009 Stats
Google       81            76
Yahoo        7             7
Wikipedia    3             6
English      3             4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search: English

4.6.1.3 Info Search: Local Language

Website      2008 Stats    2009 Stats
Google       65            34
Yahoo        12            29
Rediff       4             15
Teluguone    2             0
Guruji       1             18
Raftaar      1             0.2
Hindi        1             NA
Webdunia     1             0.7
Khoj         0.7           2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search: Local Language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords exist in the second search result. Merely stating that there is a recall problem is not sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs, and by how much.

Table 4.15: Problems faced while searching in Hindi: low recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall

Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar. However, the grammatical and structural standards of the language are rarely followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes its dialects not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, and Hindi uses maatraas such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the Hindi root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
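The stemming step described above can be sketched as a longest-suffix stripper. This is only a minimal illustration, not the stemmer used by any of the search engines discussed in this chapter: the suffix list, the `stem` function and the `min_stem_len` parameter are assumptions chosen to cover the Yojnaa example, and a practical Hindi stemmer would need a far larger, linguistically validated suffix inventory.

```python
# Hypothetical, deliberately tiny list of Hindi plural/oblique endings.
# Assumption for illustration only; real stemmers use many more suffixes.
SUFFIXES = sorted(
    [
        "ओं",  # योजनाओं -> योजना
        "एं",  # योजनाएं -> योजना
        "एँ",  # spelling variant of the ending above
        "ों",  # झीलों -> झील
    ],
    key=len,
    reverse=True,  # try the longest suffix first
)


def stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def group_by_root(words):
    """Group morphological variants under their (approximate) root word."""
    groups = {}
    for word in words:
        groups.setdefault(stem(word), []).append(word)
    return groups


if __name__ == "__main__":
    print(group_by_root(["योजना", "योजनाओं", "योजनाएं"]))
```

Grouping variants under one key is what allows an index to treat a query in the root form and a query in an inflected form as the same request.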

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations in the language can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word; the Hindi word अ गर ज (angrējī, meaning English), for example, has several possible spellings. Numerous words are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र, सक र, सक र). When information is searched for the standard keyword school (सक र) and for a non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by the other regional languages, which results in phonetic variety of words; for example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing a keyword to fetch Hindi web data. Each phonetically equivalent keyword in a query produces variation in the results, i.e., a different set of documents is retrieved with little overlap. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his or her use.
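One way a retrieval system could support phonetically equivalent spellings is to map every keyword to a normalized key before indexing and matching. The sketch below is an assumption made for illustration, not a method taken from this study: the function name `phonetic_key` and its three rules (drop the nukta, treat chandrabindu as anusvara, and write a nasal consonant plus virama before another consonant as anusvara) cover only a few common Devanagari spelling alternations.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्

# A nasal consonant (ङ ञ ण न म) + virama that is followed by another consonant.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word):
    """Return a normalized key so that common spelling variants compare equal."""
    text = unicodedata.normalize("NFD", word)         # split precomposed nukta letters
    text = text.replace(NUKTA, "")                    # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)       # आँ -> आं
    text = NASAL_CLUSTER.sub(ANUSVARA, text)          # न्द -> ंद
    return text


if __name__ == "__main__":
    variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
    print({phonetic_key(v) for v in variants})  # a single key if all variants match
```

The key is deliberately lossy: it is meant for matching index terms against query terms, not for display.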

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
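A retrieval system can compensate for the user's choice of word by issuing the query once per synonym. The following is a minimal sketch under stated assumptions: the `SYNONYMS` dictionary is a hand-made stand-in for a real lexical resource, `expand_query` is a hypothetical helper, and substitution is done on whole space-separated words without adjusting inflection.

```python
# Hypothetical synonym table; a real system would load this from a lexical resource.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    variants = [query]
    for position, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = " ".join(words[:position] + [alternative] + words[position + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

The results of the variants can then be merged, which is one way to recover documents that use a different but equivalent word.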

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query, the word सोना (gold) is ambiguous because it has another meaning, namely to sleep. In the context of the query above, सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the query above, आम means common; however, in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning of a word depends on the context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कय means hand or arm (हसत फ ह आच शफय), and in yet another it means to do (कयन).

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18: Effect of morphological factors on Hindi queries

S No Root

word

s

Listing of Keywords Morphological

variants

Documents Returned

Google Bing Guruji Google Bing Guruji

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

Figure 4.1: Precision values of the three search engines (precision from 0 to 1 plotted against query number; series: P Google, P Bing, P Guruji)

Table 4.19: Precision values of the three search engines

S.No | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

From these results, it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also high in comparison with the other two search engines. This indicates that morphological variations in the keywords affect not only the quantity but also the quality of results.
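The precision value over the top 10 results can be computed as the fraction of those results judged relevant. The sketch below assumes that relevance judgments are available as a list of booleans in rank order; the function name `precision_at_k` and the judgment format are illustrative and are not taken from the study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    relevance: list of booleans in rank order (True = relevant).
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(1 for judged_relevant in top_k if judged_relevant) / k


if __name__ == "__main__":
    # Hypothetical judgments: 9 of the first 10 results are relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```

Keeping the denominator fixed at k penalizes an engine that returns fewer than ten documents, which matches how a query with a single result cannot score a full precision at 10.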

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

Hindi Query

With Bold

Standard

Keywords

Phonetic variations of the Keywords Google Results

for query having

keywords

No of

Results

Precision

10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05

Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, covering the phonetic variants of each query)

From the table and figure above, it can be clearly seen that search engines return a handful of documents for various phonetically equivalent Hindi queries. No particular standard exists for writing a keyword to fetch Hindi web data. Each phonetically equivalent keyword in a query produces variation in the results, i.e., a different set of documents is retrieved with little overlap. From the precision chart, it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his or her use.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.

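Table 4.21: Effect of word synonyms on Hindi IR

S.No | Query | Standard Hindi words / synonyms | Google documents | Google Precision@10 | Bing documents | Bing Precision@10 | Guruji documents | Guruji Precision@10
1 | स न क आब षण | आब षणगहन | - | - | - | - | - | -
1.1 | स न क आब षण | - | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | - | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | - | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | - | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | - | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | - | - | - | - | - | -
2.1 | क र फ दर | - | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | - | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | - | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | - | - | - | - | - | -
3.1 | सतर सशिकतकयण | - | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | - | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | - | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | - | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | - | - | - | - | - | -
4.1 | मसक दयक अ हक य | - | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | - | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | - | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | - | - | - | - | - | -
5.1 | व रग ओ | - | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | - | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | - | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | - | - | - | - | - | -
6.1 | आ खद न | - | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | - | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | - | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1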

Figure 4.3: Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the table and figure above, it can be observed that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. In the average case, Google still stands first, meaning that its precision values are higher than those of Bing and Guruji. Thus it becomes clear that changing a keyword into its synonym yields equivalent results. Therefore, synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present below five randomly selected queries. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the queries in English with respect to the ambiguous keyword in context.

Table 4.22: List of randomly selected ambiguous queries

S.No | Query | Keyword in first context | In English | Keyword in other context | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables, the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, or that contain the keywords in another, non-required context. The results make clear that all search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context of art rather than the other context (snake charmer's snake head). The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario, it becomes important for search engines to address the issue of ambiguity in keywords to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: The English word exercise is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table (Table 4.26) shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English Words

Google Transliteration

Standard Hindi Keywords

Phonetic Equivalents Search Engine Google

Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800

Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710

Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820

Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100

Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000

0 537000

University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270

Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300

Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160

Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट

3240 38000 639 1120

specialist टऩशसअशरटट

सऩ मशममरसट

सऩ शमरसट सऩ श मरसट

143 1440 51300 9820

Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8

Table 4.26: Keywords along with their phonetic variants written in Hindi

From the table above, it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. It is evident from the table that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers use unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (design and working discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi without changing the meaning of the query.

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:

प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore, the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
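The two-level transformation can be sketched as a dictionary-driven substitution. This is not the software described in the next chapter; it is a minimal illustration in which `transform_query` is a hypothetical helper, the mapping from Hindi keywords to English words written in Devanagari is supplied by the caller, and the Devanagari spellings in the usage example are my own renderings of the romanized forms given above.

```python
from itertools import combinations


def transform_query(query, english_in_hindi):
    """Return (level1, level2) variants of a space-separated Hindi query.

    level1: variants with exactly one Hindi keyword replaced.
    level2: variants with more than one Hindi keyword replaced.
    english_in_hindi: dict mapping a Hindi keyword to the English word
    written in Hindi script (assumed to be provided by the caller).
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in english_in_hindi]

    def substitute(positions):
        return " ".join(
            english_in_hindi[word] if i in positions else word
            for i, word in enumerate(words)
        )

    level1 = [substitute({i}) for i in replaceable]
    level2 = [
        substitute(set(chosen))
        for size in range(2, len(replaceable) + 1)
        for chosen in combinations(replaceable, size)
    ]
    return level1, level2


if __name__ == "__main__":
    mapping = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}  # illustrative spellings
    first_level, second_level = transform_query("विदेशी निवेश भारत में", mapping)
    print(first_level)
    print(second_level)
```

Words without an entry in the mapping are left unchanged, so the meaning of the query is preserved as long as the mapping itself is correct.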

Figure 4.4: Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each showing the original Hindi query, Level 1 and Level 2)

Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi Query | Influenced Hindi Query, Level 1 | Influenced Hindi Query, Level 2 | Google results: Hindi Query | Google results: Level 1 | Google results: Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

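Table 4.28: Analysis of precision values (tabular representation); precision at 10

| Query | Variant | Google | Bing | AltaVista |
|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | Hindi query | 0.9 | 0.9 | 0.9 |
| हेल्थ और रक्तदान | Level 1 | 0.9 | 0.8 | 0.9 |
| हेल्थ और ब्लड_डोनेशन | Level 2 | 0.9 | 0.8 | 0.8 |
| रक्तचाप का इलाज | Hindi query | 0.8 | 0.8 | 0.8 |
| ब्लडप्रेशर का इलाज | Level 1 | 0.9 | 0.7 | 0.7 |
| ब्लडप्रेशर का ट्रीटमेंट | Level 2 | 0.7 | 0.5 | 0.5 |
| हृदय चिकित्सक | Hindi query | 0.9 | 0.9 | 0.9 |
| हार्ट चिकित्सक | Level 1 | 0.8 | 0.7 | 0.7 |
| कार्डियोलॉजिस्ट | Level 2 | 1 | 1 | 1 |
| सरकार द्वारा रोजगार योजना | Hindi query | 1 | 0.8 | 0.6 |
| सरकार द्वारा रोजगार स्कीम | Level 1 | 0.9 | 0.8 | 0.7 |
| सरकार द्वारा एम्प्लॉयमेंट स्कीम | Level 2 | 0.8 | 0.8 | 0.8 |
| विदेशी निवेश भारत में | Hindi query | 0.9 | 0.9 | 0.9 |
| फॉरेन निवेश भारत में | Level 1 | 0.8 | 0.8 | 0.9 |
| फॉरेन इन्वेस्टमेंट भारत में | Level 2 | 0.8 | 0.4 | 0.4 |
| भ्रष्टाचार मुक्त भारत | Hindi query | 1 | 1 | 1 |
| करप्शन मुक्त भारत | Level 1 | 1 | 1 | 1 |
| करप्शन फ्री इंडिया | Level 2 | 1 | 1 | 1 |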

Figure 4.5: Analysis of inter-precision values

From the above tables and figures, it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries increased the data retrieved overall, since each variant returns documents of its own. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant. A similar pattern holds for most of the remaining Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variation based on English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of retrieving relevant information can be increased.

[Chart for Figure 4.5: inter-precision chart; x-axis HQ1 to HQ6, each followed by L1 and L2 (HQ = Hindi Query, L = Level); series Google, Bing and AltaVista]

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
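The interface-level idea can be illustrated with a minimal Python sketch. It is not the software described in the next chapter; the lookup table `ENGLISH_IN_HINDI` and the function `query_variants` are hypothetical names introduced here, and the four dictionary entries are only illustrative examples of English keywords written in Devanagari.

```python
from itertools import product

# Hypothetical lookup: Hindi keyword -> English equivalent written in Devanagari.
# A real tool would need a much larger, curated dictionary.
ENGLISH_IN_HINDI = {
    "स्वास्थ्य": "हेल्थ",
    "रक्तदान": "ब्लड डोनेशन",
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def query_variants(query, lookup=ENGLISH_IN_HINDI):
    """Return the original query followed by every variant obtained by
    replacing any subset of its keywords with the English-in-Devanagari form."""
    options = []
    for word in query.split():
        choices = [word]              # always keep the original keyword
        if word in lookup:
            choices.append(lookup[word])
        options.append(choices)
    return [" ".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    for variant in query_variants("विदेशी निवेश भारत में"):
        print(variant)
```

Each variant would then be submitted to the search engine separately and the result lists combined, so the search engine itself needs no modification.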


proliferation of Language Technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of Language Technology. All these agencies help develop important areas of research and provide funds for research to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.

45131 TDIL Program

The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides the standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing Language Technology in India [64].

4.6 Search Engines Available in Hindi: Hindi Online Search Tools

The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines appear to have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to put forth a brief overview of the current scenario.

Here are some that fall in the localized Indian search engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.

Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult (2008-2009 Report)

4.6.1.1 Most Used Websites

| Website | 2008 stats | 2009 stats |
|---|---|---|
| Google | 37 | 35 |
| Yahoo | 32 | 25 |
| Rediff | 7 | 4 |
| Orkut | 6 | 7 |

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most used websites

4.6.1.2 Info Search English

| Website | 2008 stats | 2009 stats |
|---|---|---|
| Google | 81 | 76 |
| Yahoo | 7 | 7 |
| Wikipedia | 3 | 6 |
| English | 3 | 4 |

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info search English

4.6.1.3 Info Search Local Language

| Website | 2008 stats | 2009 stats |
|---|---|---|
| Google | 65 | 34 |
| Yahoo | 12 | 29 |
| Rediff | 4 | 15 |
| Teluguone | 2 | 0 |
| Guruji | 1 | 18 |
| Raftaar | 1 | 0.2 |
| Hindi | 1 | NA |
| Webdunia | 1 | 0.7 |
| Khoj | 0.7 | 2 |

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info search local language

4.7 Problems Faced While Searching in Hindi: Low Recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Tables 4.15 and 4.16 illustrate a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa", "वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला", is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. But just saying that we have a recall problem may not be sufficient. The next obvious question is "how much of a problem is it?" In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs and by how much.

Table 4.15: Problems faced while searching in Hindi: low recall

| Hindi query | Google | Yahoo | Guruji |
|---|---|---|---|
| वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला | 0 | 0 | 0 |
| इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च | 0 | 0 | 0 |

Table 4.16: Improved recall

| Hindi query | Google | Yahoo | Guruji |
|---|---|---|---|
| वर्ल्ड ट्रेड सेंटर आतंकी हमला | 8820 | 92 | 12 |
| वर्ल्ड ट्रेड सेंटर आतंकी अटैक | 331 | 10 | 1 |
| इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च | 708 | 50 | 1 |
| भारतीय संस्थान स्वास्थ्य शिक्षा और शोध | 37100 | 7400 | 93 |

4.8 Factors Affecting Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standard is seldom followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the national language of India, is influenced by the regional languages, which results in dialect changes not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, while Hindi uses MAATRAAS and suffixes such as ओं and एं. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological variants of the root word Yojnaa (योजना, plan in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
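As a rough illustration of word stemming, the sketch below strips a few common Hindi plural and oblique suffixes. The suffix list `SUFFIXES` and the function `light_stem` are assumptions made for this example; they are far from a complete Hindi stemmer and will over-strip or under-strip some words.

```python
# Illustrative plural/oblique suffixes, longest first (an assumption, not a
# complete inventory of Hindi inflections).
SUFFIXES = ["ियों", "ओं", "एं", "एँ", "ों", "ें"]


def light_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no known suffix: treat the word as its own root


if __name__ == "__main__":
    for form in ["योजना", "योजनाओं", "योजनाएं"]:
        print(form, "->", light_stem(form))
```

With this suffix list the three forms of Yojnaa used in the text reduce to the same string, which is what allows them to be matched against one index entry.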

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. Together these have resulted in spelling variations of the same word; the Hindi word अंग्रेजी (angrējī, meaning English) is one example with several possible spellings.

There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (three spellings are considered here). When information is searched for the standard keyword स्कूल (school) and for a non-standard phonetically equivalent spelling, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by much of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should be able to support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. For each phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with little repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
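One way to treat phonetically equivalent spellings as the same keyword is to reduce each word to a normalized key before comparison. The sketch below is only an assumption about which differences to fold (nukta, chandrabindu versus anusvara, half nasal consonants, long versus short i and u); the function name `phonetic_key` and the rule set are hypothetical and would need tuning against real query logs.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"

# Character-level folds (illustrative choices, not a standard).
_FOLD = str.maketrans({
    CHANDRABINDU: ANUSVARA,   # chandrabindu -> anusvara
    "\u0940": "\u093f",       # long ii matra -> short i matra
    "\u0942": "\u0941",       # long uu matra -> short u matra
    "\u0908": "\u0907",       # independent II -> I
    "\u090a": "\u0909",       # independent UU -> U
})

# A nasal consonant (nga, nya, nna, na, ma) + virama before another consonant
# is often written as an anusvara instead.
_HALF_NASAL = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word):
    """Return a normalized key; words with equal keys are treated as variants."""
    decomposed = unicodedata.normalize("NFD", word)  # splits precomposed nukta letters
    no_nukta = decomposed.replace(NUKTA, "")
    nasal_folded = _HALF_NASAL.sub(ANUSVARA, no_nukta)
    return nasal_folded.translate(_FOLD)


if __name__ == "__main__":
    variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
    print({v: phonetic_key(v) for v in variants})
```

A query interface could group candidate spellings by this key and submit each member of a group, instead of relying on the one spelling the user happened to type.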

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word also often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
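Synonym handling can be sketched as a simple query expansion step. The dictionary `SYNONYMS` and the function `expand_with_synonyms` below are hypothetical; the single entry mirrors the ornament example above, and a practical system would draw on a resource such as a Hindi WordNet instead of a hand-written table.

```python
# Hypothetical synonym table; the entry mirrors the example in the text.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query plus one variant per synonym of each keyword,
    replacing a single keyword at a time."""
    words = query.split()
    expanded = [query]
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            expanded.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return expanded


if __name__ == "__main__":
    for variant in expand_with_synonyms("सोने के आभूषण"):
        print(variant)
```

Replacing one keyword at a time keeps the number of variants linear in the number of synonyms, which matters when every variant costs a separate search request.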

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query, the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango. So the above query can also be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर means tax in one context (with synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स), hand or arm in another (with synonyms such as हस्त and बाहु), and to do (करना) in yet another.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by Indians with little formal education, without awareness of the language those words come from. Most Indians use these words in English rather than in their native language. English has influenced Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from the set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

| S.No | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | भारत वर्षावन | Indian rain forest |
| 1.1 | भारतीय वर्षावनों | Indian rain forests |
| 2 | हवाई दुर्घटना का कारण | Reason for air crash |
| 2.1 | हवाई दुर्घटनाओं का कारण | Reason for air crashes |
| 3 | भारत में बोली जाने वाली भाषा | Language spoken in India |
| 3.1 | भारत में बोली जाने वाली भाषाएं | Languages spoken in India |
| 3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India |
| 4 | विलुप्त होने पर झील | Lake on the verge of extinction |
| 4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction |
| 5 | प्राकृतिक आपदा | Natural calamity |
| 5.1 | प्राकृतिक आपदाएं | Natural calamities |
| 6 | पक्षी की प्रजाति | Bird species |
| 6.1 | पक्षियों की प्रजाति | Birds' species |
| 6.2 | पक्षियों की प्रजातियां | Birds' species |
| 7 | कृषि समस्या | Agriculture problem |
| 7.1 | कृषि समस्याओं | Agriculture problems |
| 8 | कीटनाशक का इस्तेमाल | Use of pesticide |
| 8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides |
| 9 | मानसिक रोग | Mental illness |
| 9.1 | मानसिक रोगियों | Mental patients |
| 9.2 | मानसिक रोगी | Mental patient |
| 10 | ग्रामीण विकास योजना | Policy for village |
| 10.1 | ग्रामीण विकास योजनाओं | Policies for village |
| 10.2 | ग्रामीण विकास योजनाएं | Policies for village |
| 11 | प्रमुख कृषि केंद्र | Major agricultural office |
| 11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices |

Table 4.18: Effect of morphological factors on Hindi queries

S No Root

word

s

Listing of Keywords Morphological

variants

Documents Returned

Google Bing Guruji Google Bing Guruji

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

Table 4.19: Precision values of the three search engines (precision at 10)

| S.No | Query | Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1 |
| 1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1 |
| 2.1 | हवाई दुर्घटना का कारण | 0.5 | 0.3 | 0.1 |
| 2.2 | हवाई दुर्घटनाओं का कारण | 0.3 | 0.3 | 0.1 |
| 3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5 |
| 3.2 | भारत में बोली जाने वाली भाषाएं | 0.5 | 0.4 | 0.3 |
| 3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2 |
| 4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2 |
| 4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0 |
| 5.1 | प्राकृतिक आपदा | 1 | 0.4 | 0.3 |
| 5.2 | प्राकृतिक आपदाएं | 0.6 | 0.6 | 0.1 |
| 6.1 | पक्षी की प्रजाति | 1 | 0.7 | 0.2 |
| 6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1 |
| 6.3 | पक्षियों की प्रजातियां | 0.9 | 0.6 | 0.1 |
| 7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2 |
| 7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2 |
| 8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2 |
| 9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4 |
| 9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3 |
| 9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4 |
| 10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3 |
| 10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0 |
| 10.3 | ग्रामीण विकास योजनाएं | 0.8 | 0.6 | 0.2 |
| 11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0 |
| 11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2 |

Figure 4.1: Precision values of the three search engines (y-axis: precision from 0 to 1; x-axis: query numbers 1.1 to 11.1; series P Google, P Bing, P Guruji)

It has been observed that all three search engines return more documents when the query with the root word is submitted. This justifies searching for documents with the root word because, in general, we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form, while Bing and Guruji do not; this would explain why the number of documents retrieved by them is lower than for Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is higher than for the other two search engines in almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher, which indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
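The precision value used throughout this chapter is the fraction of the top 10 results judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the representation of judgments as a list of booleans are assumptions made for illustration.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results that are relevant.

    relevance: list of booleans in rank order (True = judged relevant).
    If fewer than k results were returned, the missing ones count as
    non-relevant, because the divisor stays k.
    """
    top = relevance[:k]
    if not top:
        return 0.0
    return sum(top) / k


if __name__ == "__main__":
    judged = [True] * 9 + [False]   # 9 of the first 10 documents relevant
    print(precision_at_k(judged))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns only one or two documents, which matches the low values reported for queries with very few hits.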

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should be able to support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

Hindi Query

With Bold

Standard

Keywords

Phonetic variations of the Keywords Google Results

for query having

keywords

No of

Results

Precision

10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05

Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart for each of Query No. 1 to Query No. 5 and its phonetic variants)

In the above table and figure, it can be seen that search engines return documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. For each phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
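The observation that phonetic variants retrieve different document sets with little repetition can be quantified with a set overlap measure. The sketch below uses the Jaccard coefficient and a simple merge step; both helper names (`jaccard_overlap`, `merge_results`) and the use of URLs as document identifiers are assumptions for illustration, not the measurement procedure used in this study.

```python
def jaccard_overlap(results_a, results_b):
    """Jaccard coefficient of two result lists: 0 = disjoint, 1 = identical sets."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_results(result_lists):
    """Concatenate result lists, dropping repeats while preserving first-seen order."""
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    variant_1 = ["u1", "u2", "u3"]   # placeholder identifiers, not real results
    variant_2 = ["u3", "u4"]
    print(jaccard_overlap(variant_1, variant_2))
    print(merge_results([variant_1, variant_2]))
```

A low overlap between variants means the merged list is much longer than any single list, which is the gain a user misses when only one spelling is searched.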

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word also often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, show the comparison of precision values for the three search engines.

Table 4.21: Effect of word synonyms on Hindi IR (documents returned and precision at 10)

| S.No | Query | Google docs | Google P@10 | Bing docs | Bing P@10 | Guruji docs | Guruji P@10 |
|---|---|---|---|---|---|---|---|
| 1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5.1 | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6.1 | आंख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | नेत्र दान | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |

Figure 4.3: Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the above table and figure, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, Google and Bing provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym can retrieve equivalent results. Therefore, synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the queries in English with respect to the ambiguous keyword in context.

[Chart for Figure 4.3: precision at 10 (0 to 1) for queries 1.1 to 6.3; series Google, Bing and Guruji]

Table 4.22: List of randomly selected ambiguous queries

| S.No | Query | Keyword as | In English | Keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed above are tested for results against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25: Ambiguity test for Guruji

From the results in the above tables, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other Context", holds the number of results which are not relevant to the query supplied, i.e. documents which contain the keywords in some other, non-required context. From the results it is clear that all search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), which also has the reading "snake charmer's snake head", the retrieved documents are expected to be in the context of art. The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address ambiguity in keywords in order to obtain better results.
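The "deviation from relevance" described above can be expressed as shares of the top 10 results. The sketch below takes the three counts reported for query 5 in Tables 4.23 to 4.25 (first context, second context, other context) and converts them to fractions; the names `QUERY_5` and `context_shares` are introduced here only for illustration.

```python
# (first context, second context, other context) counts among the top 10
# results for query 5, as reported in Tables 4.23 to 4.25.
QUERY_5 = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def context_shares(counts):
    """Convert per-context result counts into fractions of the results examined."""
    total = sum(counts)
    if total == 0:
        return tuple(0.0 for _ in counts)
    return tuple(c / total for c in counts)


if __name__ == "__main__":
    for engine, counts in QUERY_5.items():
        print(engine, context_shares(counts))
```

The last value of each tuple is the share of results outside both meanings of the ambiguous keyword, so a larger value indicates a larger deviation from relevance.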

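| Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |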

4.13 Discussion: Influence of English on Hindi Information Retrieval

English has influenced Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word exercise is written in Hindi as एक्सरसाइज, and this word has several phonetic variations in Hindi spelling. Following such phonetic variations, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (transliteration) | Google results (standard) | Google results (equivalent 1) | Google results (equivalent 2)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

Table 4.26: Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the Google Transliteration column shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces the word उतनव मसमतम, which is completely wrong. Nevertheless, the table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged.

Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as follows:

फॉरेन निवेश भारत में (foreign nivesh Bharat mein)

फॉरेन इन्वेस्टमेंट भारत में (foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
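The two-level transformation described above can be sketched in Python as follows. This is only an illustrative sketch and not the software described in the next chapter: the lookup table `ENGLISH_IN_HINDI`, the function name `transform_query` and the Devanagari spellings used are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical lookup: Hindi keyword -> English equivalent written in Devanagari.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",          # foreign
    "निवेश": "इन्वेस्टमेंट",     # investment
}


def transform_query(query, lexicon=ENGLISH_IN_HINDI):
    """Return (level1, level2) lists of variant queries.

    Level 1 variants replace exactly one Hindi keyword;
    level 2 variants replace more than one keyword.
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in lexicon]
    level1, level2 = [], []
    for k in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, k):
            variant = [lexicon[w] if i in positions else w
                       for i, w in enumerate(words)]
            (level1 if k == 1 else level2).append(" ".join(variant))
    return level1, level2


level1, level2 = transform_query("विदेशी निवेश भारत में")
for q in level1 + level2:
    print(q)
```

For a query with two replaceable keywords the sketch yields two level 1 variants (one per keyword) and one level 2 variant; the example in the text shows one variant at each level.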


Table 4.27: Queries transformed into two equivalent variants containing a mixture of English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (chart of Google result counts for Hindi Query 1 to Hindi Query 6 and their Level 1 and Level 2 variants)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

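Table 4.28: Analysis of precision values (tabular representation)

Query form | Query | Precision@10 (Google) | Precision@10 (Bing) | Precision@10 (AltaVista)
Hindi Query 1 | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
Hindi Query 2 | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
Hindi Query 3 | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1
Hindi Query 4 | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6
Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
Hindi Query 5 | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
Level 2 | फॉरेन इन्वेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
Hindi Query 6 | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1
Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1
Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1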


Figure 4.5: Analysis of inter-precision values (inter-precision chart for Google, Bing and AltaVista; HQ = Hindi Query, L = Level; one group of HQ, L1 and L2 for each of the six queries)

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi script. Transforming the queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, users may miss relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate the variant queries that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.


The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user must make to find such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
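The idea of handling query variants at the interface level, rather than inside the search engine, can be illustrated with a short Python sketch. It is not the tool described in the next chapter; the function `search_with_variants`, its parameters and the stand-in back end `fake_search` are hypothetical and exist only to show how results for several variants might be merged without duplicates.

```python
def search_with_variants(query, make_variants, search, top_k=10):
    """Run the original query and its variants, merging results in order.

    make_variants: callable that returns a list of variant query strings.
    search: callable that returns a ranked list of document identifiers
            (a stand-in for a real search engine back end).
    Returns a list of (document, query_that_found_it) pairs.
    """
    merged, seen = [], set()
    for q in [query] + list(make_variants(query)):
        for doc in search(q)[:top_k]:
            if doc not in seen:          # keep the first occurrence only
                seen.add(doc)
                merged.append((doc, q))
    return merged


# Toy back end used only for demonstration.
def fake_search(q):
    index = {
        "original": ["doc1", "doc2"],
        "variant": ["doc2", "doc3"],
    }
    return index.get(q, [])


results = search_with_variants("original", lambda q: ["variant"], fake_search)
print(results)
```

Because the merging happens outside the search engine, the engine's own index and ranking remain untouched, which is the design motivation given in the text.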


of them. These are testing stages, and every start-up is adding new features and improving its services.

4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult, 2008-2009 Report

4.6.1.1 Most Used Websites

Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7

More info: India Online 2008, India Online 2009 [65]

Table 4.12: Most Used Websites

4.6.1.2 Info Search (English)

Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4

More info: India Online 2008, India Online 2009 [65]

Table 4.13: Info Search (English)

4.6.1.3 Info Search (Local Language)

Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search (Local Language)

4.7 Problems Faced while Searching in Hindi: Low Recall

A preliminary investigation into typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords are found by the second search. Merely stating that there is a recall problem may not be sufficient, however. The next obvious question is "how much of a problem is it?" In other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and to what extent.


Table 4.15: Problems faced while searching in Hindi: low recall

Hindi query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall

Hindi query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors Affecting the Performance of Hindi Search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. However, the grammatical and structural standards of the language are seldom followed, for various reasons. One of the reasons is the linguistic diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in dialectal changes not only in speech but also in writing. Every language uses markers that are attached to a root word to construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses maatraas and endings such as ओं and एं. For example, योजनाओं (yojnaaon) and योजनाएं (yojnaayein) are morphological variants of the root word योजना (yojnaa, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
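Reducing morphological variants to a root word can be illustrated with a very small suffix-stripping sketch in Python. The suffix list below is an assumption chosen to cover the endings mentioned in this section; it is far from a complete Hindi stemmer, and the names `SUFFIXES`, `stem` and `stem_query` are hypothetical.

```python
# Assumed, deliberately short list of plural endings (not exhaustive).
SUFFIXES = (
    "\u0913\u0902",  # ओं, as in योजनाओं
    "\u090f\u0902",  # एं, as in योजनाएं
    "\u090f\u0901",  # एँ, the chandrabindu spelling of the same ending
    "\u094b\u0902",  # ों, as in रोगों
    "\u0947\u0902",  # ें, as in झीलें
)


def stem(word, min_stem_len=2):
    """Strip the first matching ending, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def stem_query(query):
    """Apply stem() to every whitespace-separated keyword of a query."""
    return " ".join(stem(w) for w in query.split())


for variant in ("योजनाओं", "योजनाएं", "योजना"):
    print(variant, "->", stem(variant))
```

Applied to both the query and the indexed keywords, such a step would map योजनाओं and योजनाएं to the same canonical form योजना.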

4.8.2 Phonetic Nature of Hindi Language and Spelling Variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabets. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible spelling variations.

There are numerous words which are phonetically equivalent but differ in writing. The word "school" can be written in Hindi in different ways (सक र, सक र, सक र). When information is searched for the standard keyword school (स्कूल) and for a non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by a large part of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any such keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results; that is, a different set of documents is retrieved, with little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
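One way to make phonetically equivalent spellings comparable is to reduce each keyword to a normalised key before matching. The Python sketch below is a minimal illustration under stated assumptions: the three rules (dropping the nukta, treating chandrabindu as anusvara, and writing a half nasal consonant before another consonant as anusvara) are chosen for the example and are not the method used in this study; the name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot used for borrowed sounds, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# Half nasal consonant (ङ ञ ण न म + virama) followed by another consonant.
HALF_NASAL = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")


def phonetic_key(word):
    """Return a normalised key so that common spelling variants compare equal."""
    word = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    word = word.replace(NUKTA, "")               # ज़ -> ज
    word = word.replace(CHANDRABINDU, ANUSVARA)  # ँ -> ं
    word = HALF_NASAL.sub(ANUSVARA, word)        # न्द -> ंद
    return unicodedata.normalize("NFC", word)


variants = ["अंग्रेजी", "अंग्रेज़ी", "अँग्रेजी", "अङ्ग्रेजी"]
print({phonetic_key(v) for v in variants})
```

Under these rules the listed spellings of अंग्रेजी share one key, so a query in any one of them could be matched against documents written in the others. Vowel-length variants (for example short and long u) are not covered by this sketch.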

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
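Synonym-based query variation can be sketched as a simple dictionary lookup. In the Python sketch below the synonym table contains only the example given above, and the names `SYNONYMS` and `expand_with_synonyms` are hypothetical; a practical system would need a full Hindi thesaurus.

```python
# Hypothetical synonym table containing only the example from the text.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],   # ornament
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variants.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return variants


for q in expand_with_synonyms("सोने के आभूषण"):
    print(q)
```

Each variant replaces a single keyword, so the remaining words of the query are left unchanged.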

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold

(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice

(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query आम means "common"; however, in Hindi it also means "mango", so the query can also be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कर (tax) has the synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स in one context; कर (hand or arm) has the synonyms हस्त, बाहु and आच शफय in another context; and कर (to do) corresponds to करना in yet another context.
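The role of context in choosing a sense can be illustrated with a simplified, Lesk-style overlap count in Python. The sense inventory `SENSES` and its cue words are invented for this illustration, and `guess_sense` is a hypothetical name; this is a sketch of the general idea, not a technique evaluated in this study.

```python
# Hypothetical sense inventory: each sense lists a few context cue words.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "कीमत", "चांदी"},
        "to sleep": {"रात", "नींद", "बिस्तर", "जल्दी"},
    },
}


def guess_sense(query, keyword, senses=SENSES):
    """Pick the sense whose cue words overlap most with the other query words.

    Returns None when no cue word is present, i.e. the query stays ambiguous.
    """
    context = set(query.split()) - {keyword}
    scores = {sense: len(context & cues)
              for sense, cues in senses[keyword].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("सोना चांदी कीमत", "सोना"))        # cue words for "gold"
print(guess_sense("रात जल्दी सोना", "सोना"))         # cue words for "to sleep"
print(guess_sense("नारी को सोना पसंद है", "सोना"))   # no cue words in the query
```

For the example query from the text, none of the surrounding words points to either sense under this inventory, which reflects why such queries remain ambiguous for a keyword-based search engine.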

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways; it has even affected the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


Table 4.18: Effect of morphological factors on Hindi queries

S.No | Root word(s) | Listing of keywords (Google, Bing, Guruji) | Variant no. | Morphological variant | Documents returned (Google) | Documents returned (Bing) | Documents returned (Guruji)
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | 1.1 | ब यत वष मवन | 50500 | 4680 | 485
 | | | 1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | 2.1 | द घमटन | 133000 | 2410 | 284
 | | | 2.2 | द घमटन ओ | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | 3.1 | ब ष | 161000 | 8330 | 961
 | | | 3.2 | ब ष ए | 6200 | 935 | 188
 | | | 3.3 | ब ष ओ | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | 4.1 | झ र | 4740 | 278 | 25
 | | | 4.2 | झ र | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | 5.1 | आऩद | 102000 | 4030 | 410
 | | | 5.2 | आऩद ए | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | 6.1 | ऩ परज तत | 48200 | 1670 | 98
 | | | 6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
 | | | 6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभसम सभसम सभसम | 7.1 | सभसम | 584000 | 30200 | 1889
 | | | 7.2 | सभसम ओ | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | 8.1 | कीटन शक | 36300 | 1360 | 333
 | | | 8.2 | कीटन शको | 35800 | 800 | 270
9 | य ग | य गोय ग य ग य ग | 9.1 | य ग | 205000 | 21600 | 1423
 | | | 9.2 | य चगमो | 128000 | 3280 | 239
 | | | 9.3 | य ग | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | 10.1 | म जन | 673000 | 18500 | 3343
 | | | 10.2 | म जन ओ | 669000 | 6020 | 990
 | | | 10.3 | म जन ए | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | 11.1 | क दर | 261000 | 11300 | 655
 | | | 11.2 | क नदरो | 29500 | 1850 | 105


Table 4.19: Precision values of the three search engines

S.No | Query | Precision@10 (Google) | Precision@10 (Bing) | Precision@10 (Guruji)
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (precision of Google, Bing and Guruji, on a scale of 0 to 1, plotted against query number)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with root words, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all of the sample queries in the table above.

The above results suggest that only Google indexes document keywords in their root form; Bing and Guruji do not appear to do so, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results returned by each search engine. On closely observing the results, we can say that the precision of Google is high for almost all queries. Since, as mentioned above, Google appears to index keywords in their root form, it can be said that the relevance of results is also higher for Google than for the other two search engines, which indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
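The precision values reported in this chapter are computed over the top 10 results. As a minimal sketch, assuming that each returned result has been judged relevant or not relevant by hand, the calculation can be written in Python as follows (the function name `precision_at_k` is chosen for this illustration):

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    relevance: list of booleans in ranked order (True = relevant).
    Positions missing from a shorter result list count as not relevant.
    """
    top = relevance[:k]
    return sum(top) / k


# Nine of the first ten results judged relevant.
judgements = [True] * 9 + [False]
print(precision_at_k(judgements))
```

Dividing by k even when fewer than k results are returned is a design choice of this sketch; it penalises a search engine that returns very few documents.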

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any such keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20: Results of the search engine on phonetic nature of Hindi language

Hindi query (standard keywords) | Phonetic variations of the keywords | Keywords in the Google query | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5


Figure 4.2: Precision charts for phonetic nature of Hindi language (one chart for each of Query No. 1 to Query No. 5, covering the query and its phonetic variants)

In the above table and figure it can be clearly seen that search engines return documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results; that is, a different set of documents is retrieved, with little overlap. From the precision chart it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण ("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values across the three search engines.


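Table 4.21: Effect of word synonyms on Hindi IR

S.No | Query | Standard Hindi words and synonyms | Documents returned (Google) | Precision@10 (Google) | Documents returned (Bing) | Precision@10 (Bing) | Documents returned (Guruji) | Precision@10 (Guruji)
1 | स न क आब षण | आब षणगहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1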


Figure 4.3: Comparison of precision values against three search engines (precision of Google, Bing and Guruji, on a scale of 0 to 1, for queries 1.1 to 6.3)

From the examples above it is observed that using synonyms of Hindi keywords improves information retrieval for a query in the Hindi language. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better-quality results than Guruji, and on average Google still ranks first; that is, its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by replacing a keyword with its synonym, equivalent results can be obtained. Therefore synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present below five randomly selected queries. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.


Table 4.22: List of randomly selected ambiguous queries

S.No | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the above tables, the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, unintended context. From the results it is clear that all search engines return documents in different contexts. Therefore it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers), which can also be read in another context (snake charmer's snake head), the retrieved documents are expected to be in the context "art"; but from the above results it can be seen that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario, it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.


4.13 Discussion: Influence of English on Hindi information retrieval

English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word exercise is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table presents a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi (the four result columns give the Google result counts for the four Hindi forms, in the same order)

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results (transliteration) | Results (standard) | Results (equivalent 1) | Results (equivalent 2)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

From the above table it can be observed that the search engine returns documents not only for the single-keyword query itself but also for every phonetic variant of the keyword, and in large numbers. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetically variant English keyword written in Hindi script.

The "Google transliteration" column (set in bold in the original table) shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration for the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged.

Example: the English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein), where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
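The two-level transformation described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the software described in the next chapter: the function name `transform_query`, the `equivalents` dictionary and the romanized tokens standing in for Devanagari words are all hypothetical. Unlike the worked example, which lists a single Level 1 form, the sketch emits every possible single-word replacement at Level 1.

```python
from itertools import combinations


def transform_query(tokens, equivalents):
    """Build Level 1 and Level 2 variants of a tokenized query.

    tokens      -- list of query words, in order
    equivalents -- hypothetical dict mapping a Hindi word to its English
                   equivalent (which would be written in Hindi script)
    Level 1 replaces exactly one word; Level 2 replaces two or more.
    """
    replaceable = [i for i, tok in enumerate(tokens) if tok in equivalents]
    levels = {1: [], 2: []}
    for size in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, size):
            variant = [
                equivalents[tok] if i in chosen else tok
                for i, tok in enumerate(tokens)
            ]
            levels[1 if size == 1 else 2].append(" ".join(variant))
    return levels


# Romanized stand-ins for the Devanagari words of the example query.
query = ["videshi", "nivesh", "bharat", "mein"]
equivalents = {"videshi": "foreign", "nivesh": "investment"}
variants = transform_query(query, equivalents)
print(variants[1])  # one word replaced
print(variants[2])  # both replaceable words replaced
```

Each generated variant would then be submitted to the search engine as a separate query, which is how the result counts for the original, Level 1 and Level 2 forms can be compared.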


Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation; result counts are from Google)

In English | Hindi query | Level 1 | Level 2 | Results (Hindi query) | Results (Level 1) | Results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Queries transformed into two equivalent senses (chart of Google result counts for Hindi Query 1 to Hindi Query 6, each at the original, Level 1 and Level 2 forms)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28: Analysis of precision values (tabular representation; Precision@10 for the Hindi query (HQ), Level 1 (L1) and Level 2 (L2))

Hindi query | Level 1 | Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1


Figure 4.5: Analysis of inter-precision values (chart of Precision@10 for HQ1 to HQ6 and their L1 and L2 variants, where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista)

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns: for a Hindi query and its transformed variations, the degree of relevance of the documents is in most cases very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant. A similar pattern holds for most of the other Hindi queries in the table.

Without transformation of Hindi queries the user may miss relevant information, because the Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of retrieving relevant information can be increased.


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level could cause performance and throughput problems. This problem can, however, be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].


Rediff 4 15

Teluguone 2 0

Guruji 1 18

Raftaar 1 02

Hindi 1 NA

Webdunia 1 07

Khoj 07 2

More info: India Online 2008, India Online 2009 [65]

Table 4.14: Info Search, Local Language

4.7 Problems faced while searching in Hindi: low recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian-language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) returns 0 documents in Table 4.15; a small rephrasing of the query in Table 4.16, however, shows that documents containing these keywords do exist.

Merely stating that there is a recall problem is not sufficient. The obvious next question is how much of a problem it is; in other words, the problem needs to be quantified. For this purpose we conducted a number of experiments to determine at what levels this recall problem occurs, and by how much.


Table 4.15: Problems faced while searching in Hindi: low recall (number of documents returned)

Hindi query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16: Improved recall (number of documents returned)

Hindi query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological factors

Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards are rarely followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which results in a change of dialect not only in speech but in writing as well.

Every language uses markers that are attached to a root word to construct new words: English uses -s, -es and -ing, and Hindi uses maatraas such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
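The idea of reducing morphological variants to a root word can be illustrated with a very small suffix-stripping sketch. It is an assumption-laden illustration rather than a real Hindi stemmer: the suffix list covers only a few plural endings, the function name `light_stem` and the minimum stem length are hypothetical choices, and the example words use standard Unicode spellings of the Yojnaa example.

```python
# A few common Hindi plural endings, written as code points:
# ओं, एँ, एं, ों  (illustrative only, far from complete)
PLURAL_SUFFIXES = ("\u0913\u0902", "\u090f\u0901", "\u090f\u0902", "\u094b\u0902")


def light_stem(word, suffixes=PLURAL_SUFFIXES, min_stem=2):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Yojnaaon and Yojnaayein both reduce to the root Yojnaa.
for form in ("योजनाओं", "योजनाएँ", "योजना"):
    print(form, "->", light_stem(form))
```

A production stemmer would need a much larger suffix inventory and rules for stem changes; the sketch only shows why indexing and querying on the stripped form lets the variants of one root match each other.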

4.8.2 Phonetic nature of Hindi language and spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety in the Indian language alphabet. Together these have resulted in spelling variations of the same word; the Hindi word अ गर ज (angrējī, meaning English), for example, has several possible spellings.

There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र, सक र, सक र). When information is searched for the standard keyword school (सक र) and for a non-standard phonetic equivalent (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety of words. For example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. The user may use any variant for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keyword used to fetch Hindi web data. For each phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved, with little repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his or her use.
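One way an interface layer could treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalized key. The sketch below is a hedged illustration, not the method used in this study: the function name `spelling_key` and the three rules (drop the nukta, write chandrabindu as anusvara, write a nasal consonant plus virama as anusvara) are assumptions chosen to cover variants such as those of आंदोलन and आज़ादी. The rules deliberately over-merge, so the key is suitable for grouping variants, not for display.

```python
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
NASAL_CONSONANTS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म


def spelling_key(word):
    """Return a normalized key shared by common spelling variants of a word."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    key = unicodedata.normalize("NFD", word)
    key = key.replace(NUKTA, "")
    key = key.replace(CHANDRABINDU, ANUSVARA)
    for nasal in NASAL_CONSONANTS:
        # A half nasal consonant is often written as an anusvara instead.
        key = key.replace(nasal + VIRAMA, ANUSVARA)
    return key


variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
print({spelling_key(v) for v in variants})  # a single key if all variants merge
```

Grouping indexed terms and query terms by such a key would let a query in one spelling reach documents written in another, at the cost of occasionally merging words that are genuinely different.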

4.8.3 Word synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
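Synonym-based query expansion of the kind discussed here can be sketched as follows. The function name `expand_with_synonyms`, the synonym dictionary and the romanized tokens are hypothetical and serve only as an illustration; a real system would draw synonyms from a Hindi lexical resource and would work on Devanagari text.

```python
def expand_with_synonyms(tokens, synonyms):
    """Return the original query plus one variant per available synonym.

    tokens   -- list of query words, in order
    synonyms -- hypothetical dict mapping a word to a list of its synonyms
    """
    variants = [" ".join(tokens)]
    for position, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            replaced = tokens[:position] + [synonym] + tokens[position + 1:]
            variants.append(" ".join(replaced))
    return variants


# Romanized stand-ins for a query meaning "gold ornaments".
query = ["sone", "ke", "aabhushan"]
synonyms = {"aabhushan": ["gehne", "zevar", "alankar"]}
for variant in expand_with_synonyms(query, synonyms):
    print(variant)
```

Each variant can be sent to the search engine separately and the result lists merged, so that documents using any of the synonyms become reachable from a single user query.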

4.8.4 Ambiguous words

Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous because it has another meaning, "to sleep". In the context of the query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common"; in Hindi, however, it also means "mango", so the query can also be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कय (hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कय (to do) corresponds to कयन.

4.8.5 Influence of English on Hindi information retrieval

The English language has influenced Indian languages in many ways. It has affected the pronunciation of Hindi words, and many English words have been localized in India. Some of these words appear as if they were native Hindi words, and Indians are sometimes unable to recall the Hindi equivalent of the English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


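Table 4.18: Effect of morphological factors on Hindi queries (rows with a whole-number S.No give the root word and the keywords listed by each search engine; the numbered sub-rows give each morphological variant and the documents returned)

S.No | Root word / morphological variant | Keywords listed: Google | Keywords listed: Bing | Keywords listed: Guruji | Documents: Google | Documents: Bing | Documents: Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन | वष मवनवन | ब यतवष मवन वन | | |
1.1 | ब यत वष मवन | | | | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | | | | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ | द घमटन | द घमटन | | |
2.1 | द घमटन | | | | 133000 | 2410 | 284
2.2 | द घमटन ओ | | | | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ | ब ष | ब ष | | |
3.1 | ब ष | | | | 161000 | 8330 | 961
3.2 | ब ष ए | | | | 6200 | 935 | 188
3.3 | ब ष ओ | | | | 6090 | 441 | 356
4 | झ र | झ रझ रो | झ र | झ र | | |
4.1 | झ र | | | | 4740 | 278 | 25
4.2 | झ र | | | | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ | आऩद | आऩद | | |
5.1 | आऩद | | | | 102000 | 4030 | 410
5.2 | आऩद ए | | | | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत | ऩ परज तत | ऩ परज तत | | |
6.1 | ऩ परज तत | | | | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | | | | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | | | | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभसम | सभसम | सभसम | | |
7.1 | सभसम | | | | 584000 | 30200 | 1889
7.2 | सभसम ओ | | | | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको | कीटन शक | कीटन शक | | |
8.1 | कीटन शक | | | | 36300 | 1360 | 333
8.2 | कीटन शको | | | | 35800 | 800 | 270
9 | य ग | य गोय ग | य ग | य ग | | |
9.1 | य ग | | | | 205000 | 21600 | 1423
9.2 | य चगमो | | | | 128000 | 3280 | 239
9.3 | य ग | | | | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन | म जन | म जन | | |
10.1 | म जन | | | | 673000 | 18500 | 3343
10.2 | म जन ओ | | | | 669000 | 6020 | 990
10.3 | म जन ए | | | | 673000 | 2860 | 416
11 | क दर | क दरीमक दर | क दर | क दर | | |
11.1 | क दर | | | | 261000 | 11300 | 655
11.2 | क नदरो | | | | 29500 | 1850 | 105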


Table 4.19: Precision values of the three search engines (Precision@10)

S.No | Query | Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (chart of Precision@10, on a scale of 0 to 1, per query number; series: P Google, P Bing, P Guruji)


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji apparently do not index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form.

For search engines the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On close observation, the precision of Google is high for almost all queries. Given that Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
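The precision figures reported in this chapter are computed over the top 10 results. A minimal sketch of that calculation is given below. The function name `precision_at_k` and the representation of relevance judgments as a ranked list of booleans are assumptions for illustration; the sketch also assumes that when an engine returns fewer than k results, the missing positions count as non-relevant.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results that were judged relevant.

    relevance -- list of booleans in rank order (True = relevant)
    Positions beyond the end of a short result list count as non-relevant,
    so the denominator is always k.
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


# Nine of the first ten documents judged relevant gives a precision of 0.9.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

With this convention an engine that returns no documents at all scores 0, which matches how empty result lists are treated as a failure in the tables.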

4.10 Discussion: Phonetic nature of Hindi language and spelling variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. The user may use any variant for searching, and search engines should support all phonetically equivalent words.


The following queries are randomly selected from the set of 50 queries tested on the Google search engine. Table 4.20 shows the results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

Hindi query (standard keywords in bold in the original) | Phonetic variations of the keywords | Query variant submitted to Google | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5


Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, with one segment per query variant)

From the above table and figure it can be seen that search engines return documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keyword used to fetch Hindi web data. For each phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved, with little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is in many cases almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his or her use.

4.11 Discussion: Word synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3 below compare the number of documents returned and the precision values for the three search engines.


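Table 4.21: Effect of word synonyms on Hindi IR (rows with a whole-number S.No give the query and its standard Hindi word with synonyms; the sub-rows give each query variant with the documents returned and Precision@10, abbreviated P@10)

S.No | Query | Standard Hindi word / synonyms | Google docs | Google P@10 | Bing docs | Bing P@10 | Guruji docs | Guruji P@10
1 | स न क आब षण | आब षण गहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1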


Figure 4.3: Comparison of precision values against three search engines (chart of Precision@10, on a scale of 0 to 1, for query variants 1.1 to 6.3; series: Google, Bing, Guruji)

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Not only the quantity of documents returned but also their quality is affected by using synonyms of Hindi keywords.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented in Table 4.22. In that table the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the query in English with respect to the ambiguous keyword in each context.


Table 4.22 List of randomly selected ambiguous queries

S No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in Tables 4.23 to 4.25. In each table, the "Results found" columns count how many of the top 10 results fall in each context.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, non-required context. The results make clear that all the search engines return documents in different contexts, so search engines can be said to underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

In another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art", but the keyword also has another context (snake charmer's snake head). The results show that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It therefore becomes important for search engines to address ambiguity in keywords in order to obtain better results.
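The deviation described in the ambiguity tables can be worked through numerically. The sketch below copies the top 10 counts from the Google, Bing and Guruji ambiguity tables and computes, per engine, the average share of results that match the intended sense. It assumes that the first context listed for each query is the intended one, which the discussion confirms only for queries 4 and 5; the function names are illustrative.

```python
# Top 10 counts per query, copied from the three ambiguity tables:
# (first listed context, second listed context, other context).
# Guruji returned no results for query 4, so that entry is None.
TOP10_COUNTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_share(counts):
    """Fraction of the top 10 results in the first listed context.

    Assumption: the first listed context is the one the user intended.
    """
    if counts is None:
        return None
    intended, _second, _other = counts
    return intended / 10


def mean_intended_share(engine):
    """Average intended-context share over the queries that returned results."""
    shares = [intended_share(c) for c in TOP10_COUNTS[engine]]
    shares = [s for s in shares if s is not None]
    return sum(shares) / len(shares)


for name in TOP10_COUNTS:
    print(name, round(mean_intended_share(name), 2))
```

Under that assumption, no engine places even half of its top 10 results in the intended sense on average, which is the underperformance the discussion points to.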


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज, and it has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 presents a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Hindi keywords (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results for each keyword, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

From the above table it can be observed that the search engine returns documents for each single-keyword query, and that it also returns documents, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The first Hindi entry in each row is the keyword obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents are retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent. At the second level, more than one word is replaced by its English equivalent, again without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

The original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is thus transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
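The two-level transformation can be sketched in a few lines. This is an illustration only, not the software described in the next chapter: the lookup table `ENGLISH_IN_DEVANAGARI`, the function name `transform` and the Devanagari spellings of the English words are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical lookup: Hindi keyword -> English word written in Devanagari.
# The spellings are illustrative; no spelling standard is assumed here.
ENGLISH_IN_DEVANAGARI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform(query, lexicon=ENGLISH_IN_DEVANAGARI):
    """Return query variants grouped by level.

    Level 1 replaces exactly one Hindi keyword; level 2 replaces more
    than one. Words without a lexicon entry are left untouched.
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in lexicon]
    levels = {1: [], 2: []}
    for k in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, k):
            variant = [
                lexicon[w] if i in positions else w
                for i, w in enumerate(words)
            ]
            levels[1 if k == 1 else 2].append(" ".join(variant))
    return levels


variants = transform("विदेशी निवेश भारत में")
print(variants[1])
print(variants[2])
```

Unlike the worked example, which shows a single variant per level, this sketch enumerates every possible single replacement at level 1. Whether the actual tool does the same is not stated in this chapter.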


Table 4.27 Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation)

In English | Hindi query | Level 1 | Level 2 | Google results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses

[Figure 4.4 charts: Google result counts for Hindi Query 1 to Hindi Query 6, each shown with its Level 1 and Level 2 variants]


From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values (tabular representation)

Precision is measured over the top 10 results. Each engine column lists the values for the Hindi query, Level 1 and Level 2, in that order.

Hindi query | Level 1 | Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1


Figure 4.5 Analysis of inter-precision values

From the above tables and figures, it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. A broadly similar pattern holds for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variations that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.

[Figure 4.5 chart: Inter Precision Chart; x-axis HQ1 to HQ6, each followed by L1 and L2 (HQ = Hindi Query, L = Level); series: Google, Bing, AltaVista]


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user must make to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of web search in the Hindi language [66][67].
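The idea of an interface layer between the user and the search engines can be illustrated with a minimal sketch. It is not the tool described in the next chapter: the function `search_all_variants`, the stand-in `fake_search` backend and the toy index are all hypothetical, and a real deployment would call an actual search service in place of `fake_search`.

```python
def search_all_variants(query, make_variants, search):
    """Run the original query and each variant, merging results.

    `make_variants` maps a query to a list of variant queries and
    `search` maps a query to a list of result identifiers. Both are
    supplied by the caller. Results keep first-seen order, without repeats.
    """
    seen = set()
    merged = []
    for variant in [query, *make_variants(query)]:
        for result in search(variant):
            if result not in seen:
                seen.add(result)
                merged.append(result)
    return merged


# Toy stand-ins used only to exercise the function.
FAKE_INDEX = {
    "स्कूल": ["doc1", "doc2"],
    "इस्कूल": ["doc2", "doc3"],
}


def fake_variants(query):
    return ["इस्कूल"] if query == "स्कूल" else []


def fake_search(query):
    return FAKE_INDEX.get(query, [])


print(search_all_variants("स्कूल", fake_variants, fake_search))
```

Keeping the expansion outside the search engine mirrors the argument above: the engine is queried unchanged, and only the interface grows more complex.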


Table 4.15 Problems faced while searching in Hindi: low recall

Hindi query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16 Improved recall

Hindi query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. However, its grammatical and structural standards are seldom followed, for various reasons. One reason is the language diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi is influenced by the regional languages, which results in dialectal variation not only in speech but also in writing. Every language attaches markers to a root word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, योजनाओं (yojnaaon) and योजनाएं (yojnaayein) are morphological variants of the Hindi root word योजना (yojnaa, "planning"). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
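Word stemming as described above can be sketched with a simple suffix stripper. This is a deliberately small illustration under stated assumptions: the suffix list covers only a few plural endings, the function name `stem` is hypothetical, and a production stemmer for Hindi would need a far larger rule set plus exception handling.

```python
# A few plural endings, written as Unicode Devanagari.
# This list is an assumption made for illustration, not a complete inventory.
SUFFIXES = ("ओं", "एं", "एँ", "ों", "ें")

MIN_STEM_LENGTH = 2  # avoid stripping very short words down to nothing


def stem(word):
    """Strip one known suffix to approximate the root word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for form in ("योजना", "योजनाओं", "योजनाएं", "झीलों"):
    print(form, "->", stem(form))
```

With this rule set, the two variants of योजना named in the text reduce to the same root, so a query and a document that use different forms can still be matched.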

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together, the variety in the alphabet, the different dialects and the influence of regional and foreign languages result in spelling variations of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning English) has several possible spelling variations.

There are numerous words that are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र, सक र, सक र). When information is searched for the standard keyword स्कूल (school) and for a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by the other regional languages, which results in phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the population in different states of India. For the Hindi word इस्कूल, more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces variation in the results, i.e. a different set of documents is retrieved with little repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his or her needs.
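One way to support phonetically equivalent spellings is to reduce each keyword to a normalized key before matching. The sketch below is an assumption-laden illustration, not a method taken from this study: it folds three common sources of variation (the nukta, chandrabindu versus anusvara, and a nasal consonant with virama versus anusvara), and the name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं

# A nasal consonant (ङ ञ ण न म) plus virama, directly before a consonant.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])"
)


def phonetic_key(word):
    """Fold common spelling variants of a Devanagari word into one key."""
    text = unicodedata.normalize("NFD", word)  # split precomposed nukta letters
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = NASAL_CLUSTER.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


for spelling in ("आन्दोलन", "आंदोलन", "आँदोलन", "आज़ादी", "आजादी"):
    print(spelling, "->", phonetic_key(spelling))
```

The key is meant for matching only, not for display. It over-merges in some cases (for example, doubled nasals are also folded), which a real system would have to weigh against the gain in recall.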

4.8.3 Words Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.
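Synonym-based query expansion can be sketched as follows. The dictionary `SYNONYMS` holds only the example from the text, and both it and the function name `expand_with_synonyms` are illustrative assumptions; a real system would draw on a proper Hindi thesaurus.

```python
from itertools import product

# Hypothetical synonym dictionary containing only the example from the text.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by its synonym variants."""
    options = [[word, *synonyms.get(word, [])] for word in query.split()]
    return [" ".join(choice) for choice in product(*options)]


for variant in expand_with_synonyms("सोने के आभूषण"):
    print(variant)
```

Each variant can then be submitted as a separate query. Because the number of variants is the product of the option counts per word, a practical tool would probably cap how many keywords it expands at once.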

4.8.4 Ambiguous words

Ambiguous words reduce the relevancy of the results. The examples below show this clearly. Consider the following query:

(In English) Women like gold
(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice
(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the above query it means "common"; in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कर means "tax" in one context (synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स), "hand or arm" in another (synonyms हस्त, बाहु, आच शफय) and "to do" (करना) in yet another.

4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by people with no formal education, often without awareness of the language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Birds' species
6.2 | ऩकषमोकीपरज ततम | Birds' species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices


Table 4.18 Effect of morphological factors on Hindi queries

The first row of each group gives the root word(s) and the keywords listed by the three engines (Google, Bing, Guruji); the rows below it give the documents returned for each morphological variant.

S No | Root word(s) | Listing of keywords (Google, Bing, Guruji)
S No | Morphological variant | Documents returned: Google | Bing | Guruji

1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61

2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23

3 | ब ष | ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356

4 | झ र | झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1

5 | आऩद | आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20

6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25

7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356

8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270

9 | य ग | य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647

10 | म जन | म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416

11 | क दर | क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105


Table 4.19 Precision values of the three search engines

Precision is measured over the top 10 results.

S No | Query | Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines

[Figure 4.1 chart: precision (0 to 1) by query number (1.1 to 11.1); series: P Google, P Bing, P Guruji]


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results returned by each search engine. On closer observation, the precision of Google is high for almost all queries. Since Google appears to index keywords in their root form, the relevancy of its results is also higher than that of the other two search engines. This indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
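The precision measure used in these comparisons, computed over the top 10 results, can be written out explicitly. The sketch assumes that each result has already been judged relevant or not by a human assessor; the function name `precision_at_k` is illustrative.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k results that were judged relevant.

    `relevance` is a list of booleans in rank order. If fewer than k
    results were returned, the missing positions count as not relevant.
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


# Nine of the first ten results judged relevant.
judgements = [True] * 9 + [False]
print(precision_at_k(judgements))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns fewer than ten documents. That is one possible convention, and the chapter does not state which one was applied.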

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. Table 4.20 shows the results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

The first row of each group gives the Hindi query with its standard keywords and the phonetic variations of those keywords; the rows below it give the query variants submitted to Google.

Query no. | Hindi query (standard keywords) | Phonetic variations of the keywords
Google query having keywords | No. of results | Precision (top 10)

1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5


Figure 4.2 Precision charts for the phonetic nature of Hindi language

[Figure 4.2 charts: one chart per query (Query No 1 to Query No 5), comparing the phonetic variants of that query]

From the above table and figure, it can be seen that search engines return documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces variation in the results, i.e. a different set of documents is retrieved with little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his or her needs.

4.11 Discussion: Words Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.

Table 4.21 and Figure 4.3 compare the precision values obtained from the three search engines.


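Table 4.21 Effect of word synonyms on Hindi IR

The first row of each group gives the query and its standard Hindi word; the rows below it give the documents returned and the precision per 10 results for each synonym variant.

S No | Query | Google documents | Per 10 | Bing documents | Per 10 | Guruji documents | Per 10

1 | सोने के आभूषण (standard Hindi word: आभूषण)
1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2 | काले बादल (standard Hindi word: बादल)
2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3 | स्त्री सशक्तिकरण (standard Hindi word: स्त्री; synonym: नारी)
3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4 | सिकंदर का अहंकार (standard Hindi word: अहंकार)
4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5 | वृक्ष लगाओ (standard Hindi word: वृक्ष)
5.1 | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6 | आँख दान (standard Hindi word: आँख)
6.1 | आँख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | नेत्रदान | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1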

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 132

Figure 43 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords with their

synonyms improves the information retrieval against a query in Hindi language

Not only quantity of documents returned is affected but quality is also affected by

using synonyms of Hindi keywords

From the above table and figure it is to be observed that documents returned by

Google are more in quantity than other two search engines and least number of

documents get returned by Guruji search engine the reason behind may be

availability of less documents or poor indexing However we are interested in

quality of results than quantity As far as quality of results is concerned it can be

clearly seen that Google and Bing provide quality data than Guruji And in the

average case Google still stands first in the row that means precision values by

Google are more than that of Bing and Guruji in this case Thus it becomes clear

that by changing a keyword into its synonym equivalent results can be obtained

Therefore it is evident that synonyms of keywords play an important role in the

process of Hindi information retrieval system

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.

[Figure 4.3 chart: precision values from 0 to 1 for queries 1.1 to 6.3; series: Google, Bing, Guruji]

Table 4.22 List of randomly selected ambiguous queries

| S. No. | Query | Keyword (first context) | In English | Keyword (second context) | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, non-required context. It is clear from the results that all the search engines return documents in different contexts, so search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address ambiguity in keywords in order to obtain better results.

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following phonetic variations of this kind, a variety of popular keywords and queries from various domains were tested in the experiments.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi. The four result counts are Google results for the four Hindi forms, in the same order.

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results: transliteration | Results: standard | Results: equivalent 1 | Results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

From the above table, it can be observed that the search engine returns documents not only for the standard keyword of a single-keyword query but also, in large numbers, for every phonetic variant of the keyword. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web: documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google transliteration column shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University is म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of the documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged. Example:

The English query "Foreign investment in India" can be written in Hindi as षवद श तनव श ब यत भ, where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)

Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query षवद श तनव श ब यत भ (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
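The two-level transformation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software developed in this study: the function name `transform_query`, the `EQUIVALENTS` dictionary and the correctly spelled Devanagari forms in it are assumptions made for the example.

```python
# Hypothetical lookup: Hindi keyword -> English word written in Devanagari.
EQUIVALENTS = {
    "विदेशी": "फॉरेन",          # foreign
    "निवेश": "इन्वेस्टमेंट",     # investment
}

def transform_query(query, equivalents):
    """Return (level1, level2) variants of a whitespace-separated Hindi query.

    Level 1 replaces only the first replaceable keyword; Level 2 replaces
    every replaceable keyword and exists only if there are at least two.
    """
    words = query.split()
    positions = [i for i, w in enumerate(words) if w in equivalents]
    if not positions:
        return None, None
    level1 = list(words)
    level1[positions[0]] = equivalents[words[positions[0]]]
    level2 = None
    if len(positions) > 1:
        level2 = " ".join(equivalents.get(w, w) for w in words)
    return " ".join(level1), level2

level1, level2 = transform_query("विदेशी निवेश भारत में", EQUIVALENTS)
print(level1)
print(level2)
```

Each variant would then be submitted to the search engine as a separate query, so that documents written with either the Hindi or the English-derived keyword can be retrieved.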

Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

| In English | Hindi query | Level 1 | Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses
[Figure 4.4 charts: six panels, Hindi Query 1 to Hindi Query 6, each showing Google result counts for the Hindi query, Level 1 and Level 2]

From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values (tabular representation). Each cell gives Precision@10 for the original Hindi query (HQ), Level 1 (L1) and Level 2 (L2).

| Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google (HQ / L1 / L2) | Bing (HQ / L1 / L2) | AltaVista (HQ / L1 / L2) |
|---|---|---|---|---|---|
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 / 0.8 / 1.0 | 0.9 / 0.7 / 1.0 | 0.9 / 0.7 / 1.0 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1.0 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1.0 / 1.0 / 1.0 | 1.0 / 1.0 / 1.0 | 1.0 / 1.0 / 1.0 |

Figure 4.5 Analysis of inter precision values

From the above tables and figures, it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. Transforming the queries increased the amount of data retrieved, and the relevance of that data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; a similar pattern holds for the rest of the Hindi queries in the table above. Without transformation of the Hindi query, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that arise from the influence of English. From the above table, it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

[Figure 4.5 chart: inter precision chart; horizontal axis HQ1 to HQ6 (HQ = Hindi Query), each followed by L1 and L2 (L = Level); series: Google, Bing, AltaVista]

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and its users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. This problem can, however, be solved at the interface level. To lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


speaking but writing also. Every language uses markers that are attached to a root word to construct new words: English uses s, es and ing, and Hindi uses maatraas such as ऐ म ा ओ. For example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) are morphological variants of the Hindi root word Yojnaa (म जन, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
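Word stemming as described above can be sketched as a small suffix-stripping routine. This is a minimal sketch and not the stemmer used in this study: the suffix rules, the function name `stem` and the correctly spelled Devanagari example words are assumptions chosen for illustration, and a practical Hindi stemmer needs a much larger rule set.

```python
# Illustrative (suffix, replacement) rules, longest suffix first.
SUFFIX_RULES = [
    ("ियों", "ी"),  # e.g. रोगियों -> रोगी
    ("ओं", ""),     # e.g. योजनाओं -> योजना
    ("एं", ""),     # e.g. योजनाएं -> योजना
    ("एँ", ""),     # same plural written with chandrabindu
    ("ों", ""),     # e.g. कीटनाशकों -> कीटनाशक
]

def stem(word):
    """Strip the first matching plural/oblique suffix; otherwise return the word."""
    for suffix, replacement in SUFFIX_RULES:
        # Require some stem to remain so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

for w in ["योजना", "योजनाओं", "योजनाएं", "कीटनाशकों", "रोगियों"]:
    print(w, "->", stem(w))
```

With rules of this kind, the variants Yojnaaon and Yojnaayein are both mapped to the root Yojnaa, so a query and a document that use different variants can be matched on the same canonical form.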

4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. The variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अ गर ज (angrējī, meaning English) has several possible spelling variations.

There are numerous words that are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र, सक र, सक र). When information is searched for the single standard keyword सक र (school) and for a non-standard Hindi phonetic equivalent सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in further phonetic variety: the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population of India in different states. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. The user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in the query changes the results, i.e. a different set of documents is retrieved, with little repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
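One way to treat phonetically equivalent spellings as the same keyword is to map each word to a normalized key before matching. The sketch below is an assumption-laden illustration, not a method from this study: it handles only three common sources of variation (the nukta dot, chandrabindu versus anusvara, and a half nasal consonant versus anusvara), and the function name `phonetic_key` is hypothetical.

```python
import unicodedata

NUKTA, CHANDRABINDU, ANUSVARA, VIRAMA = "\u093c", "\u0901", "\u0902", "\u094d"
NASAL_CONSONANTS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म

def is_consonant(ch):
    """True for Devanagari consonants in the range क (U+0915) to ह (U+0939)."""
    return "\u0915" <= ch <= "\u0939"

def phonetic_key(word):
    """Map common spelling variants of a Hindi word to one comparison key."""
    # NFD splits precomposed nukta letters (e.g. ज़) into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "").replace(CHANDRABINDU, ANUSVARA)
    out, i = [], 0
    while i < len(text):
        # A half nasal consonant (nasal + virama) before a consonant is
        # commonly also written as an anusvara on the preceding syllable.
        if (text[i] in NASAL_CONSONANTS and i + 2 < len(text)
                and text[i + 1] == VIRAMA and is_consonant(text[i + 2])):
            out.append(ANUSVARA)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(phonetic_key("आन्दोलन") == phonetic_key("आंदोलन"))
```

The key is meant only for comparison, not for display; two keywords with the same key can be treated as variants of one another when a query is expanded.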

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.
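The use of synonyms at query time can be sketched as a simple expansion step that produces one query variant per synonym. This is a hedged illustration only: the `SYNONYMS` dictionary, its correctly spelled Devanagari entries and the function name `expand_query` are assumptions for the example, not a resource or interface from this study.

```python
# Hypothetical synonym list for one keyword (ornament).
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}

def expand_query(query, synonyms):
    """Return the original query followed by variants with one keyword replaced."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = " ".join(words[:i] + [alternative] + words[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants

for q in expand_query("सोने के आभूषण", SYNONYMS):
    print(q)
```

Each variant can be submitted separately and the result lists compared or merged, which is how the effect of a synonym on the number and precision of results can be measured.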

4.8.4 Ambiguous words

Ambiguous words reduce the relevancy of the results, as the following examples show clearly. Consider the query:

(In English) Women like gold

(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous, because it has another meaning, "to sleep". In the context of the query, सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice

(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In the query it means "common", but in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. A word can have more than one meaning, the meaning depends on the context of the sentence, and finding the correct sense of a word in a given context is an intricate task. For example, कर (tax) has the synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स in one context; in another context कर (hand or arm) has the synonyms हस्त, बाहु and आच शफय; and in yet another context कर (to do) corresponds to करना.
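The effect of an ambiguous keyword on a result list can be made concrete by assigning each returned snippet to a sense and counting the senses. The sketch below uses a deliberately simple cue-word overlap, which is an assumption for illustration rather than a disambiguation method from this study; the cue words, snippets and function names are all hypothetical.

```python
from collections import Counter

def classify_snippet(snippet, sense_cues):
    """Return the sense whose cue words overlap most with the snippet, else 'other'."""
    tokens = set(snippet.split())
    best_sense, best_score = "other", 0
    for sense, cues in sense_cues.items():
        score = len(tokens & cues)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

def tally_contexts(snippets, sense_cues):
    """Count how many snippets fall into each sense (or into 'other')."""
    return Counter(classify_snippet(s, sense_cues) for s in snippets)

# Hypothetical cue words for the two senses of सोना.
SONA_CUES = {
    "gold": {"आभूषण", "चांदी", "भाव"},
    "sleep": {"नींद", "रात", "बिस्तर"},
}
snippets = ["सोना और चांदी के भाव", "रात को जल्दी सोना चाहिए", "सोना फिल्म समीक्षा"]
print(tally_contexts(snippets, SONA_CUES))
```

A tally of this kind gives, for the top results of a query, the number of documents in the intended sense, in the other sense and in neither.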

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without being aware of the language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

| S. No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Bird's species |
| 6.2 | ऩकषमोकीपरज ततम | Bird's species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |

Table 4.18 Effect of morphological factors on Hindi queries. The keyword listing column keeps the Google, Bing and Guruji listings together as extracted; the last three columns give the documents returned.

| S. No. | Root word(s) | Listing of keywords (Google, Bing, Guruji) | Variant No. | Morphological variant | Google | Bing | Guruji |
|---|---|---|---|---|---|---|---|
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | 1.1 | ब यत वष मवन | 50500 | 4680 | 485 |
| | | | 1.2 | ब यत मवष मवनो | 40400 | 680 | 61 |
| 2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | 2.1 | द घमटन | 133000 | 2410 | 284 |
| | | | 2.2 | द घमटन ओ | 117000 | 420 | 23 |
| 3 | ब ष | ब ष ब ष ओ ब ष ब ष | 3.1 | ब ष | 161000 | 8330 | 961 |
| | | | 3.2 | ब ष ए | 6200 | 935 | 188 |
| | | | 3.3 | ब ष ओ | 6090 | 441 | 356 |
| 4 | झ र | झ रझ रो झ र झ र | 4.1 | झ र | 4740 | 278 | 25 |
| | | | 4.2 | झ र | 1270 | 28 | 1 |
| 5 | आऩद | आऩद आऩद ओ आऩद आऩद | 5.1 | आऩद | 102000 | 4030 | 410 |
| | | | 5.2 | आऩद ए | 1160 | 64 | 20 |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | 6.1 | ऩ परज तत | 48200 | 1670 | 98 |
| | | | 6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84 |
| | | | 6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25 |
| 7 | सभसम | सभसम ए सभसम ओ सभसम सभसम सभसम | 7.1 | सभसम | 584000 | 30200 | 1889 |
| | | | 7.2 | सभसम ओ | 584000 | 7150 | 1356 |
| 8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | 8.1 | कीटन शक | 36300 | 1360 | 333 |
| | | | 8.2 | कीटन शको | 35800 | 800 | 270 |
| 9 | य ग | य गोय ग य ग य ग | 9.1 | य ग | 205000 | 21600 | 1423 |
| | | | 9.2 | य चगमो | 128000 | 3280 | 239 |
| | | | 9.3 | य ग | 112000 | 6280 | 647 |
| 10 | म जन | म जन ओ म जन म जन म जन | 10.1 | म जन | 673000 | 18500 | 3343 |
| | | | 10.2 | म जन ओ | 669000 | 6020 | 990 |
| | | | 10.3 | म जन ए | 673000 | 2860 | 416 |
| 11 | क दर | क दरीमक दर क दर क दर | 11.1 | क दर | 261000 | 11300 | 655 |
| | | | 11.2 | क नदरो | 29500 | 1850 | 105 |

Table 4.19 Precision values of the three search engines (Precision@10)

| S. No. | Query | Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1.0 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1.0 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1.0 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1.0 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1 Precision values of the three search engines
[Figure 4.1 chart: precision values from 0 to 1 for queries 1.1 to 11.1; series: P Google, P Bing, P Guruji]

It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevancy of its results is also higher than that of the other two search engines, which indicates that morphological variations in the keywords affect not only the quantity but also the quality of results.
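The precision value over the top 10 results can be computed as the fraction of those results judged relevant. The following is a minimal sketch under the assumption that relevance judgments are available as a list of booleans in rank order; the function name `precision_at_k` is illustrative.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that are relevant.

    `relevance` is a list of booleans in rank order. If fewer than k results
    were returned, the missing positions count as not relevant.
    """
    top = relevance[:k]
    return sum(1 for is_relevant in top if is_relevant) / k

# Example: 9 of the first 10 results judged relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned means that an engine returning fewer than 10 documents is not rewarded for a short list.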

4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. The user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

| Hindi query (standard keywords in bold in the source) | Phonetic variations of the keywords | Keywords in the Google query | No. of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |

Figure 4.2 Precision charts for the phonetic nature of Hindi language
[Figure 4.2 charts: five panels, Query No. 1 to Query No. 5, one per query, each with one entry per phonetic variant of the query]

In the above table and figure, it can be seen that the search engine returns documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in the query changes the results, i.e. a different set of documents is retrieved, with little repetition. From the precision chart, it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3, presented below, compare the precision values for the three search engines.

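Table 4.21 Effect of word synonyms on Hindi IR

| S. No. | Query | Standard Hindi word | Synonyms | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | स न क आब षण | आब षण | गहन | | | | | | |
| 1.1 | स न क आब षण | | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर | | | | | | | |
| 2.1 | क र फ दर | | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर | न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | | |
| 4.1 | मसक दयक अ हक य | | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | | |
| 5.1 | व रग ओ | | | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | | |
| 6.1 | आ खद न | | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9 |
| 6.3 | च द न | | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |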

Figure 4.3 Comparison of precision values against three search engines

From the examples above, it is observed that using synonyms of Hindi keywords improves information retrieval for a query in Hindi. Both the quantity and the quality of the documents returned are affected by the choice of synonym.

From the above table and figure, Google returns more documents than the other two search engines, and Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are interested in the quality of results rather than their quantity. In terms of quality, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first: its precision values are higher than those of Bing and Guruji. It also becomes clear that replacing a keyword with one of its synonyms can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.

[Figure 4.3 chart: precision values from 0 to 1 for queries 1.1 to 6.3; series: Google, Bing, Guruji]

Table 4.22: List of randomly selected ambiguous queries

| S.No. | Query | Keyword as | In English | Keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war |

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25: Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results that are either not relevant to the query or contain the keyword in some other, unintended context. The search engines therefore underperform when supplied with ambiguous queries; the numbers in the "Other context" column indicate the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For the query सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art", yet Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address keyword ambiguity in order to obtain better results.
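The deviation from relevance described here can be summarised per search engine. The following sketch assumes that the first context column of Tables 4.23 to 4.25 is the intended sense of each query (Gold, Common, Children, Art, Aggregate); the names `RESULTS` and `intended_share` are illustrative only.

```python
# (intended sense, other sense, other context) counts among the top 10 results,
# read from Tables 4.23 to 4.25. None marks the query for which Guruji
# returned no results.
RESULTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing":   [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_share(rows):
    """Fraction of inspected results that match the intended sense."""
    answered = [row for row in rows if row is not None]
    hits = sum(row[0] for row in answered)
    inspected = sum(sum(row) for row in answered)
    return hits / inspected


if __name__ == "__main__":
    for engine, rows in RESULTS.items():
        print(f"{engine}: {intended_share(rows):.2f}")
```

On these figures, roughly a third or fewer of the inspected results match the intended sense for any engine, which is consistent with the observation that none of them resolves the ambiguity.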


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as (एकसयस इज). It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Based on such phonetic variations, a variety of popular keywords and queries from various domains were tested.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi (result counts from Google)

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results: Google transliteration | Results: standard keyword | Results: equivalent 1 | Results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, often in large numbers. People evidently have their own ways of writing these words in Hindi, and no standard is followed for storing Hindi data on the web. In the table, the Google transliteration column (bold Hindi entries in the original) shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of "University" should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. The software (whose design and working are discussed in the next chapter) transforms the Hindi query into its variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.


The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced. In both cases the meaning of the original Hindi query is unchanged.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)

Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent forms that mix English and Hindi while the meaning of the query stays the same. Some randomly selected queries from the sample set of one hundred are presented in Table 4.27.
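The two-level transformation can be sketched as follows. This is not the software described in the next chapter, only a minimal illustration: it works on romanised tokens for readability, the function name `transform_levels` and the small equivalence dictionary are hypothetical, and it enumerates every single-word replacement at Level 1, whereas the example above shows only one.

```python
from itertools import combinations


def transform_levels(tokens, english_equivalents):
    """Return (level1, level2) variants of a tokenised Hindi query.

    level1: exactly one keyword replaced by its English equivalent.
    level2: two or more keywords replaced.
    """
    replaceable = [i for i, tok in enumerate(tokens) if tok in english_equivalents]

    def substitute(positions):
        return " ".join(
            english_equivalents[tok] if i in positions else tok
            for i, tok in enumerate(tokens)
        )

    level1 = [substitute({i}) for i in replaceable]
    level2 = [
        substitute(set(combo))
        for size in range(2, len(replaceable) + 1)
        for combo in combinations(replaceable, size)
    ]
    return level1, level2


if __name__ == "__main__":
    # Romanised stand-ins for the Hindi keywords of the example query.
    query = ["videshi", "nivesh", "bharat", "mein"]
    equivalents = {"videshi": "foreign", "nivesh": "investment"}
    level1, level2 = transform_levels(query, equivalents)
    print(level1)
    print(level2)
```

In practice the dictionary values would be the English words written in Devanagari, so that each variant remains a Hindi-script query.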


Table 4.27: Queries transformed into two equivalent forms containing a mixture of English and Hindi (tabular representation; result counts from Google)

| In English | Hindi query | Level 1 | Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4: Queries transformed into two equivalent forms

[Figure 4.4 charts: one panel per query (Hindi Query 1 to 6), each showing result counts for the Hindi query, Level 1 and Level 2]


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and in considerable numbers. For search engines, however, the quality of results matters more than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

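Table 4.28: Analysis of precision values (tabular representation; precision at 10)

| Query | Form | Query text | Google | Bing | AltaVista |
|---|---|---|---|---|---|
| HQ1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9 |
| HQ1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9 |
| HQ1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8 |
| HQ2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8 |
| HQ2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7 |
| HQ2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5 |
| HQ3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9 |
| HQ3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7 |
| HQ3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1 |
| HQ4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6 |
| HQ4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7 |
| HQ4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8 |
| HQ5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9 |
| HQ5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9 |
| HQ5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4 |
| HQ6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1 |
| HQ6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1 |
| HQ6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1 |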


Figure 4.5: Analysis of inter-precision values

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming the queries into variants that include English keywords written in Hindi. The transformation increases the amount of data retrieved, and the precision column shows that the retrieved data remain relevant. For every Hindi query and its transformed variations, the degree of relevance is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant; the same pattern repeats for the rest of the Hindi queries in the table above.

Without transformation of Hindi queries, the user may miss relevant information: a Hindi user may not be aware that such information exists on the web and may be unable to formulate the English-influenced variants of the query. The table above indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.

[Figure 4.5 chart: inter-precision values for HQ1 to HQ6 with levels L1 and L2 (HQ = Hindi query, L = level); series: Google, Bing, AltaVista]


Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and its users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because implementing parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level would cause performance and throughput problems. However, this problem can be solved at the interface level. To reduce the effort a Hindi user must spend searching for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of a web search in Hindi [66][67].
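The interface-level idea can be sketched as a thin layer that expands a query into variants, sends each to a search engine and merges the results. This is an assumption-laden outline, not the tool described in the next chapter: `expand_and_search`, the variant generators and the `search` callable are all hypothetical, and `search` stands for whatever function returns a ranked list of document identifiers for a query.

```python
def expand_and_search(query, variant_generators, search, top_k=10):
    """Search the original query and its variants, merging results in order.

    variant_generators: callables mapping a query to a list of variant queries
        (for example synonym, phonetic or English-equivalent variants).
    search: callable mapping a query string to a ranked list of document ids.
    """
    queries = [query]
    for generate in variant_generators:
        for variant in generate(query):
            if variant not in queries:
                queries.append(variant)

    merged = []
    for q in queries:
        for doc in search(q)[:top_k]:
            if doc not in merged:  # keep the first occurrence only
                merged.append(doc)
    return queries, merged


if __name__ == "__main__":
    # Toy index standing in for a real search engine.
    index = {"q": ["d1", "d2"], "q-variant": ["d2", "d3"]}
    queries, docs = expand_and_search(
        "q", [lambda q: [q + "-variant"]], lambda q: index.get(q, [])
    )
    print(queries, docs)
```

Keeping the expansion outside the search engine means the engine itself is unchanged; the cost is one extra request per variant.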


Hindi word ISKOOL (इसक र) more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered: a user may search with any of these forms, and the search engine should support all of them.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved with little overlap. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.

4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

4.8.4 Ambiguous Words

Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:

(In English) Women like gold

(In Hindi) नारी को सोना पसंद है

In this query the word सोना (gold) is ambiguous because it also means "to sleep". In the context of this query सोना means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) The common people's choice

(In Hindi) आम लोगों की पसंद

Here the word आम is ambiguous. In this query it means "common", but in Hindi it also means "mango", so the query can also be interpreted as "mango is people's choice".

Many words are polysemous, and finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कर (tax) has the synonyms बम ज, श लक, स द, भहस ऱ, ट कस in one context; कर (hand or arm) has the synonyms हसत, फ ह, आच शफय in another context; and in a third context कर is a form of the verb करना (to do).

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words; speakers are sometimes unable to find a Hindi equivalent for the English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by people with no formal education, without awareness of the language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.


4.9 Discussion: Morphological Factors

A sample set of 50 queries was used to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

| S.No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Bird's species |
| 6.2 | ऩकषमोकीपरज ततम | Bird's species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |


Table 4.18: Effect of morphological factors on Hindi queries

| S.No. | Root word(s) | Keywords listed (Google, Bing, Guruji) | Variant no. | Morphological variant | Documents: Google | Documents: Bing | Documents: Guruji |
|---|---|---|---|---|---|---|---|
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | 1.1 | ब यत वष मवन | 50500 | 4680 | 485 |
| | | | 1.2 | ब यत मवष मवनो | 40400 | 680 | 61 |
| 2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | 2.1 | द घमटन | 133000 | 2410 | 284 |
| | | | 2.2 | द घमटन ओ | 117000 | 420 | 23 |
| 3 | ब ष | ब ष ब ष ओ ब ष ब ष | 3.1 | ब ष | 161000 | 8330 | 961 |
| | | | 3.2 | ब ष ए | 6200 | 935 | 188 |
| | | | 3.3 | ब ष ओ | 6090 | 441 | 356 |
| 4 | झ र | झ रझ रो झ र झ र | 4.1 | झ र | 4740 | 278 | 25 |
| | | | 4.2 | झ र | 1270 | 28 | 1 |
| 5 | आऩद | आऩद आऩद ओ आऩद आऩद | 5.1 | आऩद | 102000 | 4030 | 410 |
| | | | 5.2 | आऩद ए | 1160 | 64 | 20 |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | 6.1 | ऩ परज तत | 48200 | 1670 | 98 |
| | | | 6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84 |
| | | | 6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25 |
| 7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | 7.1 | सभसम | 584000 | 30200 | 1889 |
| | | | 7.2 | सभसम ओ | 584000 | 7150 | 1356 |
| 8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | 8.1 | कीटन शक | 36300 | 1360 | 333 |
| | | | 8.2 | कीटन शको | 35800 | 800 | 270 |
| 9 | य ग | य ग य गोय ग य ग य ग | 9.1 | य ग | 205000 | 21600 | 1423 |
| | | | 9.2 | य चगमो | 128000 | 3280 | 239 |
| | | | 9.3 | य ग | 112000 | 6280 | 647 |
| 10 | म जन | म जन ओ म जन म जन म जन | 10.1 | म जन | 673000 | 18500 | 3343 |
| | | | 10.2 | म जन ओ | 669000 | 6020 | 990 |
| | | | 10.3 | म जन ए | 673000 | 2860 | 416 |
| 11 | क दर | क दर क दरीमक दर क दर क दर | 11.1 | क दर | 261000 | 11300 | 655 |
| | | | 11.2 | क नदरो | 29500 | 1850 | 105 |


Table 4.19: Precision values of the three search engines (precision at 10)

| S.No. | Query | Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1: Precision values of the three search engines

[Figure 4.1 chart: precision values (scale 0 to 1) for queries 1.1 to 11.1; series: P Google, P Bing, P Guruji]


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
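Reducing a keyword to its root form before searching can be approximated with a light suffix-stripping stemmer. The sketch below is an illustration rather than the method used in this study: the suffix rules cover only a few common Hindi plural and oblique endings, the rule list and the name `light_stem` are assumptions, and the example words are standard spellings supplied here for illustration.

```python
# (suffix, replacement) pairs, checked in order; longer suffixes come first.
# Hypothetical rule list covering a few common plural/oblique endings only.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ओं", ""),     # भाषाओं -> भाषा
    ("एं", ""),     # भाषाएं -> भाषा
    ("एँ", ""),     # भाषाएँ -> भाषा
    ("ों", ""),     # कीटनाशकों -> कीटनाशक
    ("ें", ""),     # झीलें -> झील
]


def light_stem(word, min_stem_length=2):
    """Strip the first matching suffix, keeping at least min_stem_length code points."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for word in ["भाषाओं", "भाषाएं", "कीटनाशकों", "रोगियों", "वन"]:
        print(word, "->", light_stem(word))
```

A query could then be issued both in its original form and in its stemmed form, so that an engine that does not match morphological variants still sees the root word.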

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form.

For search engines, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines; the precision value is calculated over the top 10 results of each search engine. On close observation, the precision of Google is high for almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
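The precision measure used throughout these tables is precision at 10: the fraction of the top 10 results judged relevant. A minimal sketch follows; the function name `precision_at_k` and the list-of-booleans input format are assumptions made for illustration.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top-k results that were judged relevant.

    judgements: relevance judgements (True/False) in rank order.
    Missing results (fewer than k returned) count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for relevant in judgements[:k] if relevant) / k


if __name__ == "__main__":
    # Nine relevant documents among the first ten.
    print(precision_at_k([True] * 9 + [False]))
```

Dividing by k rather than by the number of results returned means that an engine returning no documents scores 0, in line with the zero entries in the tables.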

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered: a user may search with any of these forms, and the search engine should support all of them.


The following randomly selected queries, from the set of 50 queries tested on the Google search engine, illustrate the number of results and the precision obtained.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

| Query no. | Hindi query (standard keywords in bold in the original) | Phonetic variations of the keywords | Keywords used in the query | No. of results | Precision at 10 |
|---|---|---|---|---|---|
| 1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | | ससजिमो जहयीर | 632 | 0.9 |
| | | | सिबजमो िहयीर | 194 | 0.9 |
| 2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | | आसभ भह ग ई | 14 | 0.6 |
| | | | आसभाॊ भह ग ई | 563 | 0.7 |
| 3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | | भरशटाचायआज दी | 214 | 0.6 |
| | | | बयषटाचाय आज दी | 447 | 0.7 |
| | | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | | भरशट च य आिादी | 1040 | 0.7 |
| | | | बयषटट च यआिादी | 1190 | 0.8 |
| 4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| 5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | | फयोिगायी | 80600 | 1.0 |
| | | | फयोिगायी | 170 | 0.7 |
| | | | फयोिगायी | 30 | 0.5 |


Figure 4.2: Precision charts for the phonetic nature of Hindi language

[Figure 4.2 charts: one panel per query (Query No. 1 to 5), each showing the share of its phonetic variants]

The table and figure above show that search engines return documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved with little overlap. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
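One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalised key. The sketch below handles three common sources of variation in Devanagari spelling: the nukta (ज़ versus ज), chandrabindu versus anusvara, and a nasal consonant with halant written in place of an anusvara (आन्दोलन versus आंदोलन). The rule set and the name `phonetic_key` are assumptions for illustration; real variation is broader than these three rules.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
HALANT = "\u094d"
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म
CONSONANTS = "\u0915-\u0939"               # क to ह

# A nasal consonant + halant that is followed by another consonant.
NASAL_CLUSTER = re.compile("[" + NASALS + "]" + HALANT + "(?=[" + CONSONANTS + "])")


def phonetic_key(word):
    """Return a normalised key shared by common spelling variants of a word."""
    text = unicodedata.normalize("NFD", word)      # splits precomposed nukta letters
    text = text.replace(NUKTA, "")                 # ज़ -> ज, फ़ -> फ
    text = text.replace(CHANDRABINDU, ANUSVARA)    # आँ -> आं
    return NASAL_CLUSTER.sub(ANUSVARA, text)       # न्द -> ंद


if __name__ == "__main__":
    for spelling in ["आन्दोलन", "आंदोलन", "आँदोलन"]:
        print(spelling, "->", phonetic_key(spelling))
```

Spellings that share a key can then be grouped, and a query can be expanded to every known spelling in the group before it is sent to the search engine.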

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.


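Table 4.21: Effect of word synonyms on Hindi IR (documents returned and precision per 10 results)

| S.No. | Query | Standard Hindi word and synonyms | Google | Precision | Bing | Precision | Guruji | Precision |
|---|---|---|---|---|---|---|---|---|
| 1 | स न क आब षण | आब षण, गहन | | | | | | |
| 1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर | | | | | | |
| 2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर, न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | |
| 4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | |
| 5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | |
| 6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |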


Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000

0 537000

University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270

Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300

Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160

Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट

3240 38000 639 1120

specialist टऩशसअशरटट

सऩ मशममरसट

सऩ शमरसट सऩ श मरसट

143 1440 51300 9820

Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 136

Table 426 keywords along with their phonetic variants written in Hindi

From the above table it can be observed that search engine does return documents

for single keyword query documents for all phonetic variants of the keywords are

also returned which are huge in number It can also be seen that that people have

their own way of representing the Hindi words and no standard is followed for

storing Hindi data on web Also the documents are retrieved for every

phonetically variant English Keyword written in Hindi script In the above table

the column with bold Hindi entries shows the keywords which are obtained by

using Google transliteration tool The table 426 shows that transliteration does

not provide correct Hindi word in most of the cases For example the correct

transliteration for word University should be म तनवमसमटीwhereas Google

transliteration provides the word उतनव मसमतम which is completely wrong It is

clearly evident from the figure above that 1540 documents have been retrieved for

the wrong keyword उतनव मसमतम (University) and the same follows for other single

word queries Insurance इनस य स

ParliamentऩमरमअभटCorruptionक रम िपतओनPolicy

ऩ मरसम Specialistसऩ मसअमरसट It can be analyzed that Hindi website developers

make use of unchecked and non standard transliteration which makes the Hindi IR

process a difficult task

In the next example Multiword Hindi queries are selected to test the effect

of English influence on Hindi IR on precision and quantity of documents

retrieved The Hindi query is transformed into it variants by the software (design

and working discussed in next chapter) by replacing the Hindi keywords with

English keywords written in Hindi without changing the meaning of the query

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 137

The queries are converted into two levels In first level one Hindi word is replaced

by its English equivalent and in second level more than one words is replaced by

their English equivalent words without changing the meaning of the original

Hindi query Example

An English query ―Foreign investment in India can be written in Hindi as

―षवद श तनव श ब यत भ where Hindi keyword षवद श means ―Foreign ―प य न

and तनव श means ―investment ―इनव सटभट The query for the two levels is

transformed as

प य न तनव श ब यत भ (Foreign nivesh bharat mein)

प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query ―षवद श तनव श ब यत भ ―videshi

nivesh Bharat mein supplied by the user is transformed into two equivalent

senses containing a mixture of both English and Hindi language where meaning

of the query remains same From the sample set of one hundred queries some

randomly selected queries are presented below in Table 427

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 138

Table 427 Transformed queries into two equivalent senses containing a mixture

of both English and Hindi Tabular representation

Figure 44 Transformed Queries into two equivalent senses

1520

8360

1020

Hindi Query 1

Level 1

Level 2

95800

37200

2150

Hindi Query 2

Level 1

Level 2

209000

85800

1660

Hindi Query 3

Level 1

Level 2

112000

0

49000271

Hindi Query 4

Level 1

Level 2

658000

52766

Hindi Query 5

Level 1

Level 2

181000

19700

47000

Hindi Query 6

Level 1

Level 2

In English | Hindi query | Influenced Hindi query (Level 1) | Influenced Hindi query (Level 2) | Google results: Hindi query | Google results: Level 1 | Google results: Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28: Analysis of precision values (tabular representation)

Precision@10 is given per search engine for the Hindi query (HQ), its Level 1 variant (L1) and its Level 2 variant (L2).

Hindi query | Level 1 | Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

Figure 4.5: Analysis of inter-precision values (inter-precision chart for HQ1 to HQ6 and their L1 and L2 variants, where HQ = Hindi query and L = level; series: Google, Bing, AltaVista)

From the above tables and figures, it can be clearly seen that Hindi data of a similar nature can be mined out against Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the retrieved data. The relevance of the retrieved data can also be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above.

Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, since a Hindi user may not be aware of the presence of such information on the web and may be unable to formulate the query variation that reflects English influence. From the above table, it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.

The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

Relevant information can be mined out by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. This problem can, however, be solved at the interface level. Therefore, to lessen the effort a Hindi user must make to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
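The interface-level idea can be pictured as a thin layer that submits every variant of a query to an existing search engine and merges the ranked lists. The sketch below is illustrative only: the function name `search_all_variants`, the `search` callable and the round-robin merge are assumptions made for this example, not the design of the tool described in the next chapter.

```python
from typing import Callable, Iterable


def search_all_variants(
    variants: Iterable[str],
    search: Callable[[str], list[str]],
    limit: int = 10,
) -> list[str]:
    """Query a search backend once per variant and merge the ranked lists.

    `search` is a hypothetical callable returning document identifiers in
    rank order. Lists are interleaved rank by rank and duplicates dropped.
    """
    ranked_lists = [search(variant) for variant in variants]
    merged: list[str] = []
    seen: set[str] = set()
    deepest = max((len(results) for results in ranked_lists), default=0)
    for rank in range(deepest):
        for results in ranked_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) == limit:
                    return merged
    return merged
```

Interleaving by rank keeps the top result of every variant near the top of the merged list, so a rarely used spelling is not buried under the results of a popular one. Other merge strategies are equally plausible.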


(In Hindi) (आम लोगों की पसंद)

Here the word आम is ambiguous. In the above query it means "common". However, in Hindi it also means "mango", so the above query can also be interpreted as "mango is people's choice".

Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context, कय (hand or arm) has the synonyms हसत फ ह आच शफय in another context, and कय (to do) corresponds to कयन in yet another context.

4.8.5 Influence of English on Hindi Information Retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words; Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without being aware of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.

4.9 Discussion: Morphological Factors

We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18: Effect of morphological factors on Hindi queries

S.No | Root word(s) | Listing of keywords (Google, Bing, Guruji) | Morphological variant | Documents returned: Google | Bing | Guruji
1.1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | ब यत वष मवन | 50500 | 4680 | 485
1.2 | | | ब यत मवष मवनो | 40400 | 680 | 61
2.1 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | द घमटन | 133000 | 2410 | 284
2.2 | | | द घमटन ओ | 117000 | 420 | 23
3.1 | ब ष | ब ष ब ष ओ ब ष ब ष | ब ष | 161000 | 8330 | 961
3.2 | | | ब ष ए | 6200 | 935 | 188
3.3 | | | ब ष ओ | 6090 | 441 | 356
4.1 | झ र | झ रझ रो झ र झ र | झ र | 4740 | 278 | 25
4.2 | | | झ र | 1270 | 28 | 1
5.1 | आऩद | आऩद आऩद ओ आऩद आऩद | आऩद | 102000 | 4030 | 410
5.2 | | | आऩद ए | 1160 | 64 | 20
6.1 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत ऩ परज तत ऩ परज तत | ऩ परज तत | 48200 | 1670 | 98
6.2 | | | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | | | ऩकषमोपरज ततम | 33800 | 747 | 25
7.1 | सभसम | सभसम ए सभसम ओ सभसम सभसम सभसम | सभसम | 584000 | 30200 | 1889
7.2 | | | सभसम ओ | 584000 | 7150 | 1356
8.1 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | कीटन शक | 36300 | 1360 | 333
8.2 | | | कीटन शको | 35800 | 800 | 270
9.1 | य ग | य गोय ग य ग य ग | य ग | 205000 | 21600 | 1423
9.2 | | | य चगमो | 128000 | 3280 | 239
9.3 | | | य ग | 112000 | 6280 | 647
10.1 | म जन | म जन ओ म जन म जन म जन | म जन | 673000 | 18500 | 3343
10.2 | | | म जन ओ | 669000 | 6020 | 990
10.3 | | | म जन ए | 673000 | 2860 | 416
11.1 | क दर | क दरीमक दर क दर क दर | क दर | 261000 | 11300 | 655
11.2 | | | क नदरो | 29500 | 1850 | 105

Table 4.19: Precision values of the three search engines

S.No | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (chart of precision from 0 to 1 for queries 1.1 to 11.1; series: P Google, P Bing, P Guruji)

It has been observed that all three search engines return more documents when a query containing the root word is submitted. This justifies searching for documents with the root word because, in general, keywords in their root form give better results.
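Reducing a keyword to its root form before searching can be approximated with a light suffix stripper. The sketch below is a minimal illustration under stated assumptions: the suffix list is a small hand-picked subset of Hindi plural and oblique endings, and the names `SUFFIX_RULES`, `light_stem` and `stem_query` are hypothetical. It is not the method used by any of the search engines discussed here.

```python
# (suffix, replacement) pairs, longest first. An illustrative subset only;
# a real stemmer would need many more rules and an exception list.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -iyon -> -i   (रोगियों -> रोगी)
    ("\u093f\u092f\u093e\u0901", "\u0940"),  # -iyan -> -i   (लड़कियाँ -> लड़की)
    ("\u0913\u0902", ""),                    # -on after a vowel (समस्याओं -> समस्या)
    ("\u090f\u0901", ""),                    # -en with chandrabindu (भाषाएँ -> भाषा)
    ("\u090f\u0902", ""),                    # -en with anusvara (भाषाएं -> भाषा)
    ("\u094b\u0902", ""),                    # -on after a consonant (कीटनाशकों -> कीटनाशक)
    ("\u0947\u0902", ""),                    # -en after a consonant (झीलें -> झील)
]


def light_stem(word: str, min_stem: int = 2) -> str:
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            if len(stem) >= min_stem:
                return stem
    return word


def stem_query(query: str) -> str:
    """Apply `light_stem` to every whitespace-separated keyword of a query."""
    return " ".join(light_stem(word) for word in query.split())
```

A rule list like this over-strips some words and misses irregular forms, so it is best read as a way of generating a candidate root-form query to submit alongside the original, not as a replacement for it.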

It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the table above.

From the above results, it appears that only Google indexes document keywords in their root form; Bing and Guruji do not index in that form, which is why the number of documents they retrieve is lower than Google's. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevancy of its results is also higher than that of the other two search engines, which indicates that morphological variations in the keywords affect not only the quantity but also the quality of the results.
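The precision values reported in this chapter are computed over the top 10 results. A minimal sketch of that calculation follows; the function name `precision_at_k` and the choice to count missing results as non-relevant when fewer than `k` are returned are assumptions of this example.

```python
def precision_at_k(relevance: list[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevance` holds one manual judgement per returned document, in rank
    order. If fewer than k documents are returned, the missing positions
    count as non-relevant, so the denominator is always k.
    """
    return sum(relevance[:k]) / k
```

For instance, a query for which 9 of the first 10 documents are judged relevant receives a precision of 0.9.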

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any keyword for searching, and search engines should be able to support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of the Hindi language

Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5

Figure 4.2: Precision charts for the phonetic nature of the Hindi language (one chart per query, Query No. 1 to Query No. 5)

In the above table and figure, it can be clearly seen that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with very little repetition. From the precision charts, it is observed that the degree of relevance for queries containing phonetically equivalent keywords is the same or nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his or her use.
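One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalized matching key. The sketch below is an assumption-laden illustration, not a method taken from this study: it folds three common sources of variation (nukta, chandrabindu versus anusvara, and a nasal consonant with halant written in place of anusvara), and the name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
HALANT = "\u094d"

# A nasal consonant plus halant directly before another consonant,
# e.g. the "न्" in आन्दोलन, which is often written as anusvara (आंदोलन).
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + HALANT + "(?=[\u0915-\u0939])"
)


def phonetic_key(word: str) -> str:
    """Return a matching key that is equal for common spelling variants."""
    decomposed = unicodedata.normalize("NFD", word)  # splits precomposed nukta letters
    without_nukta = decomposed.replace(NUKTA, "")
    nasal_folded = without_nukta.replace(CHANDRABINDU, ANUSVARA)
    return NASAL_CLUSTER.sub(ANUSVARA, nasal_folded)


# Three spellings of the same word should collapse to one key.
variants = ["आंदोलन", "आँदोलन", "आन्दोलन"]
keys = {phonetic_key(v) for v in variants}
```

The key is meant only for grouping variants, not for display: it deliberately merges distinctions (such as ज and ज़) that matter in careful writing, and it can over-merge words in which a nasal cluster is the standard spelling.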

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3, presented below, compare the precision values across the three search engines.

Table 4.21: Effect of word synonyms on Hindi IR

S.No | Query | Standard Hindi word | Synonyms | Google: documents returned | Google: Precision@10 | Bing: documents returned | Bing: Precision@10 | Guruji: documents returned | Guruji: Precision@10
1 | स न क आब षण | आब षण | गहन | | | | | |
1.1 | स न क आब षण | | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | | |
2.1 | क र फ दर | | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर | न यी | | | | | |
3.1 | सतर सशिकतकयण | | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | | |
4.1 | मसक दयक अ हक य | | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | | |
5.1 | व रग ओ | | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | | |
6.1 | आ खद न | | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3: Comparison of precision values against three search engines (chart of precision from 0 to 1 for queries 1.1 to 6.3; series: Google, Bing, Guruji)

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the above table and figure, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji, and in the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that changing a keyword into its synonym can yield equivalent results. It is therefore evident that synonyms of keywords play an important role in the process of Hindi information retrieval.
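Substituting synonyms into a query can be sketched as a small expansion step. The example below is illustrative: `expand_with_synonyms` is a hypothetical helper, and the synonym dictionary contains a few hand-typed entries for "ornament" chosen for this example, not a lexicon used in the experiments.

```python
from itertools import product


def expand_with_synonyms(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the original query followed by every synonym-substituted variant."""
    options = [[word] + synonyms.get(word, []) for word in query.split()]
    return [" ".join(choice) for choice in product(*options)]


# Illustrative entries only (ornament: आभूषण, गहने, जेवर, अलंकार).
SYNONYMS = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}
variants = expand_with_synonyms("सोने के आभूषण", SYNONYMS)
```

The number of variants grows multiplicatively with the number of keywords that have synonyms, so in practice the expansion would need a cap or a ranking of the most common synonyms.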

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Figure 3.5, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the queries in English with respect to the ambiguous keyword in that context.


Table 4.22: List of randomly selected ambiguous queries

S.No | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labelled "Other context", holds the number of results that are not relevant to the supplied query, or that contain the keyword in some other, unintended context. The results make it clear that all the search engines return documents in different contexts; it can therefore be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art" and not in the other context (snake charmer's snake head). The results show, however, that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario, it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
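The per-engine counts in the ambiguity tables can be checked and summarized with a few lines of code. The sketch below uses the counts reported for query 5 (intended sense "aggregate", alternate sense "family"); the class name `AmbiguityRow` and the "intended share" measure are assumptions introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass
class AmbiguityRow:
    engine: str
    intended: int   # top-10 results in the intended sense of the keyword
    alternate: int  # top-10 results in the other listed sense
    other: int      # top-10 results in neither sense ("Other context")

    def total(self) -> int:
        return self.intended + self.alternate + self.other

    def intended_share(self) -> float:
        """Fraction of inspected results that match the intended sense."""
        return self.intended / self.total()


# Counts for query 5 as reported in the Google, Bing and Guruji tables.
rows = [
    AmbiguityRow("Google", intended=2, alternate=3, other=5),
    AmbiguityRow("Bing", intended=0, alternate=2, other=8),
    AmbiguityRow("Guruji", intended=0, alternate=0, other=10),
]

for row in rows:
    print(row.engine, row.total(), row.intended_share())
```

Each row sums to 10, the number of results inspected per query, which gives a quick consistency check when transcribing the tables.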

4.13 Discussion: Influence of English on Hindi Information Retrieval

The English language influences Hindi not only in speech but also in writing, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: The English word "exercise" is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic variation shown in this example, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table presents a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | Standard keyword | Equivalent 1 | Equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

From the above table, it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table that 1540 documents have nevertheless been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing the Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

The queries are converted into two levels In first level one Hindi word is replaced

by its English equivalent and in second level more than one words is replaced by

their English equivalent words without changing the meaning of the original

Hindi query Example

An English query ―Foreign investment in India can be written in Hindi as

―षवद श तनव श ब यत भ where Hindi keyword षवद श means ―Foreign ―प य न

and तनव श means ―investment ―इनव सटभट The query for the two levels is

transformed as

प य न तनव श ब यत भ (Foreign nivesh bharat mein)

प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query ―षवद श तनव श ब यत भ ―videshi

nivesh Bharat mein supplied by the user is transformed into two equivalent

senses containing a mixture of both English and Hindi language where meaning

of the query remains same From the sample set of one hundred queries some

randomly selected queries are presented below in Table 427

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 138

Table 427 Transformed queries into two equivalent senses containing a mixture

of both English and Hindi Tabular representation

Figure 44 Transformed Queries into two equivalent senses

1520

8360

1020

Hindi Query 1

Level 1

Level 2

95800

37200

2150

Hindi Query 2

Level 1

Level 2

209000

85800

1660

Hindi Query 3

Level 1

Level 2

112000

0

49000271

Hindi Query 4

Level 1

Level 2

658000

52766

Hindi Query 5

Level 1

Level 2

181000

19700

47000

Hindi Query 6

Level 1

Level 2

Influenced Hindi Query Google

In English Hindi Query Level 1 Level 2 Search Results

Health and blood donation

सव सथ औय यकतद न

हलथ औय यकतद न

हलथ औय जरि_िोनिन

1520 8360 1020

Treatment for Blood pressure

यकतच ऩ क इर ज

जरिपरिय क इर ज

जरिपरिय क टरीटभट

95800 37200 2150

Cardiologist रदम चचककतसक हाट िॉकटय काडिमोरासिटट 241000 99100 1770

Government Employment Policy

सयक य दव य य जग य म जन

सयक य दव य य जग य टकीभ

सयक य दव य एमपतरॉमभट सकीभ

1840000 51600 93

Foreign investment in

India

षवद श तनव श ब यत भ

पायन तनव श ब यत भ

पायनइनवटटभट ब यत भ

658000 527 66

Corruption free India

भरषटट च य भ कत ब यत

कयतिन भ कत ब यत

कयतिन फरी इॊडिमा

181000 19700 47000

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 139

From the above table and figure it is evident that documents are returned for

original as well as transformed Hindi query and the quantity of the retrieved

documents is quite considerable In case of search engines the quality of results is

more important than the quantity therefore Table 428 and figure 45 are presented

below for the analysis of precision values Three popular search engines namely

Google Bing and Alta Vista are used for retrieving web results

Table 428 Analysis of precision values Tabular representation

Hindi

Query

Influenced Hindi Query Precision 10

Level 1 Level 2 Google Bing AltaVista

सव सथऔययकतद न

ह लथऔययकतद न

ह लथऔयबरड_ड न शन

09 09 09 09 08 08 09 09 08

यकतच ऩक इर ज

बरडपर शयक इर ज

बरडपर शयक टरीटभट

08 09 07 08 07 05 08 07 05

रदमचचककतसक

ह टमचचककतसक

क रड मम र िजसट 09 08 1 09 07 1 09 07 1

सयक यदव य य जग यम जन

सयक यदव य य जग यसकीभ

सयक यदव य एमपरॉमभटसकीभ 1 09 08 08 08 08 06 07 08

षवद श तनव शब यतभ

प य नतनव शब यतभ

प य नइनव सटभटब यतभ

09 08 08 09 08 04 09 09 04

भरषटट च यभ कतब यत

कयपशनभ कतब यत

कयपशनफरीइ रडम

1 1 1 1 1 1 1 1 1

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 140

Figure 45 Analysis of Inter precision values

From the above tables and figures it can be clearly seen that Hindi data of similar

nature can be mined out against Hindi queries by transforming them into their

variants by including English keywords written in Hindi The transformation of

queries resulted in an increase of retrieved data The relevance of the retrieved

data can also be seen in the precision column For every Hindi query and its

transformed variations the degree of relevance of documents is very close or

equal or improved eg for the Hindi query सव सथ औय यकतद न 9 out of first 10

documents are relevant and for transformed queries which are of similar nature

ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन 9 of the first 10 documents are

relevant and the same repeats for rest of the Hindi queries as shown in the table

above Without transformation of Hindi queries the user may miss the chance of

retrieving the relevant information as the Hindi user may not be aware of the

presence of such information on web and may be unable to formulate the

variation query based on the factor of English influence From the above table it

can be said that English influenced Hindi information is present and is increasing

day by day on web By the inclusion of the English keywords in Hindi script in

the form of query the scope of searching in Hindi and getting relevant

information can be increased

09 09 09 08 09 07 09 08 1 1 09 08 09 08 08 1 1 1

09 08 08 08 07 0509 07

1 08 08 08 09 08 041 1 1

09 09 08 08 0705

0907

106 07 08 09 09

04

1 1 1

HQ1 L1 L2 HQ2 L1 L2 HQ3 L1 L2 HQ4 L1 L2 HQ5 L1 L2 HQ6 L1 L2

Inter Precession Chart HQ (Hindi Query) L (Level)

Google Bing AltaVista

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 141

The process of Hindi IR becomes more difficult because of the structure of

Hindi Language Generally people do not follow the actual Hindi writing standard

which widens the gap between Hindi web data and users

The relevant information can be mined out by transforming the Hindi queries

Search engines neither make transformations of the query nor find keyword

equivalents Because they may have the performance and throughput problems if

parameters like Hindi Phonetics synonyms and English equivalent Hindi

keywords are implemented at root level However this problem can be solved at

interface level Therefore to lessen the efforts of a Hindi user to search such

information a software has been developed (a detailed description has been

mentioned in next Chapter) which acts like an interface between user and search

engines With the help of this tool user can widen the scope of search on web in

Hindi language [66][67]

Page 39: Chapter 4 Issues in Information Retrieval for Hindi Language

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 124

4.9 Discussion: Morphological Factors

A sample set of 50 queries was used to test the effect of the root word. Table 4.17 lists randomly selected queries from this set; they throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices

Table 4.18 Effect of morphological factors on Hindi queries

S. No. | Root words | Listing of keywords (Google, Bing, Guruji) | Morphological variants | Documents returned (Google, Bing, Guruji)

1

ब यत

वष मवन

ब यतवष मवन वष मवनवन

वष मवनवन ब यतवष मवन

वन

11 ब यत

वष मवन 50500 4680 485

12 ब यत मवष मवनो

40400 680 61

2 द घमटन

द घमटन द घमटन ओ

द घमटन द घमटन 21 द घमटन 133000 2410 284

22 द घमटन ओ 117000 420 23

3 ब ष ब ष ब ष ओ ब ष ब ष

31 ब ष 161000 8330 961

32 ब ष ए 6200 935 188

33 ब ष ओ 6090 441 356

4 झ र झ रझ रो झ र झ र

41 झ र 4740 278 25

42 झ र 1270 28 1

5 आऩद

आऩद आऩद ओ

आऩद आऩद 51 आऩद 102000 4030 410

52 आऩद ए 1160 64 20

6

ऩ परज तत

ऩ ऩकषमोपरज ततपरज ततमोपरज तत

ऩ परज तत

ऩ परज तत

61 ऩ परज तत 48200 1670 98

62 ऩकषमोपरज तत

47600 1150 84

63 ऩकषमोपरज ततम

33800 747 25

7 सभसम

सभसम ए सभसम ओ सभ

सम सभसम सभसम

71 सभसम 584000 30200 1889

72 सभसम ओ 584000 7150 1356

8

कीटन शक

कीटन शक

कीटन शको कीटन शक कीटन शक

81 कीटन शक 36300 1360 333

82 कीटन शको 35800 800 270

9 य ग य गोय ग य ग य ग

91 य ग 205000 21600 1423

92 य चगमो 128000 3280 239

93 य ग 112000 6280 647

10 म जन

म जन ओ म जन

म जन म जन

101 म जन 673000 18500 3343

102 म जन ओ 669000 6020 990

103 म जन ए 673000 2860 416

11 क दर क दरीमक दर क दर क दर

111 क दर 261000 11300 655

112 क नदरो 29500 1850 105

Table 4.19 Precision values of the three search engines

S. No. | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines (chart of precision@10 for Google, Bing and Guruji, by query number)

It was observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with root words, since keywords in their root form generally give better results.
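The idea of submitting keywords in their root form can be illustrated with a very light suffix stripper. This is only a sketch under stated assumptions: the suffix rules below are a small hypothetical list of common Hindi plural and oblique endings, not the method used by any of the search engines or by the software described in this work, and a real stemmer would need a far larger rule set and an exception list.

```python
# Hypothetical rules (suffix, replacement), longest suffix first.
# They cover only a few common plural/oblique endings.
SUFFIX_RULES = [
    ("ियों", "ी"),
    ("ियाँ", "ी"),
    ("ओं", ""),
    ("एं", ""),
    ("एँ", ""),
    ("ों", ""),
    ("ें", ""),
]


def light_stem(word, min_stem=2):
    """Strip one known ending, keeping at least min_stem code points."""
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word


def to_root_query(query):
    """Apply the light stemmer to every whitespace-separated keyword."""
    return " ".join(light_stem(word) for word in query.split())


if __name__ == "__main__":
    print(to_root_query("भारत में बोली जाने वाली भाषाओं"))
```

The minimum stem length keeps short function words such as में from being truncated; the rule list is the part most likely to need tuning.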

It was also observed that only Google lists the morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results returned by each search engine. On close observation, the precision for Google is high in almost all queries. Since Google, as mentioned above, indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This shows that morphological variation in the keywords affects not only the quantity but also the quality of results.
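The precision figure used throughout this chapter is the fraction of the top 10 results judged relevant. A minimal sketch of that calculation follows; the function name and the example judgements are illustrative assumptions, not data from the experiments.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the first k results judged relevant.

    judgements is a list of booleans in rank order (True = relevant).
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for relevant in judgements[:k] if relevant) / k


if __name__ == "__main__":
    # Hypothetical judgements: 9 of the first 10 results are relevant.
    example = [True, True, True, False, True, True, True, True, True, True]
    print(precision_at_k(example))
```

Dividing by k rather than by the number of results returned penalises an engine that returns fewer than ten documents, which matches reporting a precision of 0 when nothing relevant is found.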

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and the search engine should support all of its phonetically equivalent forms.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20 Results of the search engine on phonetic nature of Hindi language

Columns: Google query with keyword variant | No. of results | Precision@10

Query 1: ससजिमो भ िहयीर ऩद थम (phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर)
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई (phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई)
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी (phonetic variations of the keywords: भरशट च य बयषटट च य आज दी)
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन (phonetic variations of the keywords: अनन हज य आ द रनआ द रन)
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न (phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी)
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5

Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

The above table and figure show that search engines return documents for the various phonetically equivalent Hindi queries. No particular standard exists for writing a keyword to fetch Hindi web data. For every phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved with little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
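One way to see how such spelling variants could be matched is to map each keyword to a normalised key, so that variants differing only in nukta, chandrabindu versus anusvara, or a half nasal consonant versus anusvara fall into the same group. The sketch below is an assumption made for illustration: the three rules are a small hand-picked subset of Hindi spelling variation, the function names are hypothetical, and the rules deliberately over-merge (for instance a doubled nasal is also folded).

```python
import re
import unicodedata
from collections import defaultdict

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
# A half nasal consonant (nga, nya, nna, na, ma + virama) before a consonant.
HALF_NASAL = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word):
    """Return a normalised key shared by common spelling variants."""
    text = unicodedata.normalize("NFD", word)  # split precomposed nukta letters
    text = text.replace(NUKTA, "")             # drop nukta
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = HALF_NASAL.sub(ANUSVARA, text)      # half nasal -> anusvara
    return unicodedata.normalize("NFC", text)


def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = defaultdict(list)
    for word in words:
        groups[phonetic_key(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    print(group_variants(["आन्दोलन", "आंदोलन", "आँदोलन", "आज़ादी", "आजादी"]))
```

A key of this kind is only useful for grouping candidate variants; each variant would still have to be submitted as its own query, since the results in Table 4.20 differ from variant to variant.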

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values across the three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S. No. | Query | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10

1 | Query: स न क आब षण | Standard Hindi word and synonyms: आब षण गहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2 | Query: क र फ दर | Standard Hindi word and synonyms: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3 | Query: सतर सशिकतकयण | Standard Hindi word and synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4 | Query: मसक दयक अ हक य | Standard Hindi word and synonyms: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5 | Query: व रग ओ | Standard Hindi word and synonyms: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6 | Query: आ खद न | Standard Hindi word and synonyms: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

The above table and figure show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results, and synonyms of keywords play an important role in Hindi information retrieval.
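Synonym-based query variants of the kind compared in Table 4.21 can be generated mechanically from a synonym table. The following is a minimal sketch: the synonym dictionary is a hypothetical hand-made table holding only the आभूषण example from the text, the function name is an assumption, and inflection of the substituted word is not handled.

```python
def expand_with_synonyms(query, synonyms):
    """Return the query followed by variants with one keyword swapped.

    synonyms maps a keyword to a list of alternative words.
    """
    words = query.split()
    variants = [query]
    for index, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            candidate = " ".join(words[:index] + [alternative] + words[index + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


if __name__ == "__main__":
    # Hypothetical synonym table built from the example in the text.
    table = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}
    for variant in expand_with_synonyms("सोने के आभूषण", table):
        print(variant)
```

Each variant would then be submitted as a separate query and its results merged or compared, as in the table above.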

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.


Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | Keyword in first context | In English | Keyword in second context | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other Context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. All search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), which has a second context (snake charmer's snake head), the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
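The deviation from relevance discussed here is simple arithmetic over the top 10 results: each count divided by 10. A short sketch follows, using the counts reported for query 5 in Tables 4.23 to 4.25; the function and dictionary names are illustrative assumptions.

```python
TOP_K = 10

# (context 1, context 2, other context) counts among the top 10 results
# for query 5, as reported in Tables 4.23, 4.24 and 4.25.
QUERY_5 = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def context_shares(counts, k=TOP_K):
    """Convert raw counts into fractions of the top-k results."""
    intended, alternate, other = counts
    return {
        "intended": intended / k,
        "alternate": alternate / k,
        "other": other / k,
    }


if __name__ == "__main__":
    for engine, counts in QUERY_5.items():
        print(engine, context_shares(counts))
```

Here "intended" is the context the user meant (aggregate), so a high "other" share means most of the first page is unrelated to either reading of the keyword.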


4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but in writing too, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the very important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज and has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (same column order)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

Table 4.26 shows that the search engine returns documents for single-keyword queries, and that a large number of documents are also returned for all phonetic variants of the keywords. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google transliteration column shows the keywords obtained by using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of precision and quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.

Each query is transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by the English equivalents, without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as:

फॉरेन निवेश भारत में (foreign nivesh Bharat mein)

फॉरेन इन्वेस्टमेंट भारत में (foreign investment Bharat mein)

Thus the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
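The two-level transformation described above can be sketched as follows. This is not the software referred to in the text (which is described in the next chapter); it is an illustrative assumption in which a hypothetical dictionary maps Hindi keywords to English loanwords written in Devanagari, Level 1 replaces exactly one keyword and Level 2 replaces more than one.

```python
from itertools import combinations


def transform_query(query, loanwords):
    """Return {1: [...], 2: [...]} with Level 1 and Level 2 variants.

    loanwords maps a Hindi keyword to an English word in Hindi script.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in loanwords]
    levels = {1: [], 2: []}
    for size in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, size):
            variant = " ".join(
                loanwords[word] if i in chosen else word
                for i, word in enumerate(words)
            )
            levels[1 if size == 1 else 2].append(variant)
    return levels


if __name__ == "__main__":
    # Hypothetical mapping taken from the example in the text.
    mapping = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
    for level, variants in transform_query("विदेशी निवेश भारत में", mapping).items():
        print(level, variants)
```

With two replaceable keywords this yields two Level 1 variants and one Level 2 variant; the example in the text shows one variant per level.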

Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses (charts of result counts for Hindi Query 1 to Hindi Query 6 at the original query, Level 1 and Level 2)

The above table and figure show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values (tabular representation)

Query | Form | Query text | Precision@10 Google | Precision@10 Bing | Precision@10 AltaVista
HQ1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
HQ1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
HQ1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
HQ2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
HQ2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
HQ2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
HQ3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
HQ3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
HQ3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1
HQ4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
HQ4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
HQ4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
HQ5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
HQ5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
HQ5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
HQ6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1
HQ6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
HQ6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1

Figure 4.5 Analysis of inter-precision values (chart of precision for Google, Bing and AltaVista; HQ = Hindi query, L = level)

The above tables and figures show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns: for every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variants that reflect English influence. The above table indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and obtaining relevant information can be increased.

Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow a standard Hindi writing convention, which widens the gap between Hindi web data and its users.

Relevant information can be retrieved by transforming Hindi queries. Search engines neither transform the query nor look up keyword equivalents, because handling parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords at the root level may cause performance and throughput problems. The problem can, however, be addressed at the interface level. To reduce the effort a Hindi user must spend searching for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With this tool, the user can widen the scope of web search in Hindi [66][67].
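An interface-level layer of this kind can be pictured with a short sketch. It is not the software described in the next chapter. It assumes a hypothetical `search` callable that returns a ranked list of URLs for a query string; the names `search_all_variants`, `fake_search`, and `FAKE_INDEX` are illustrative only. The sketch submits every query variant and interleaves the ranked result lists while dropping duplicates.

```python
from typing import Callable, Iterable, List


def search_all_variants(variants: Iterable[str],
                        search: Callable[[str], List[str]],
                        limit: int = 10) -> List[str]:
    """Submit every query variant and interleave the ranked results.

    Results are merged rank by rank (round-robin) so that no single
    variant dominates, and each URL is kept only once.
    """
    if limit <= 0:
        return []
    result_lists = [search(variant) for variant in variants]
    merged: List[str] = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) == limit:
                    return merged
    return merged


# Stand-in for a real search engine call (hypothetical data).
FAKE_INDEX = {
    "original query": ["url-a", "url-b"],
    "level 1 variant": ["url-b", "url-c"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


print(search_all_variants(["original query", "level 1 variant"], fake_search))
# ['url-a', 'url-b', 'url-c']
```

Keeping the expansion outside the engine means the engine's index is untouched; the cost is one extra request per variant.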

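Table 4.18 Effect of morphological factors on Hindi queries

| S. No. | Root word | Listing of keywords | Variant | Morphological variant | Documents returned: Google | Bing | Guruji |
|---|---|---|---|---|---|---|---|
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | 1.1 | ब यत वष मवन | 50500 | 4680 | 485 |
| | | | 1.2 | ब यत मवष मवनो | 40400 | 680 | 61 |
| 2 | द घमटन | Google: द घमटन, द घमटन ओ; Bing: द घमटन; Guruji: द घमटन | 2.1 | द घमटन | 133000 | 2410 | 284 |
| | | | 2.2 | द घमटन ओ | 117000 | 420 | 23 |
| 3 | ब ष | Google: ब ष, ब ष ओ; Bing: ब ष; Guruji: ब ष | 3.1 | ब ष | 161000 | 8330 | 961 |
| | | | 3.2 | ब ष ए | 6200 | 935 | 188 |
| | | | 3.3 | ब ष ओ | 6090 | 441 | 356 |
| 4 | झ र | Google: झ र, झ रो; Bing: झ र; Guruji: झ र | 4.1 | झ र | 4740 | 278 | 25 |
| | | | 4.2 | झ र | 1270 | 28 | 1 |
| 5 | आऩद | Google: आऩद, आऩद ओ; Bing: आऩद; Guruji: आऩद | 5.1 | आऩद | 102000 | 4030 | 410 |
| | | | 5.2 | आऩद ए | 1160 | 64 | 20 |
| 6 | ऩ परज तत | Google: ऩ ऩकषमोपरज ततपरज ततमोपरज तत; Bing: ऩ परज तत; Guruji: ऩ परज तत | 6.1 | ऩ परज तत | 48200 | 1670 | 98 |
| | | | 6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84 |
| | | | 6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25 |
| 7 | सभसम | Google: सभसम ए, सभसम ओ, सभसम; Bing: सभसम; Guruji: सभसम | 7.1 | सभसम | 584000 | 30200 | 1889 |
| | | | 7.2 | सभसम ओ | 584000 | 7150 | 1356 |
| 8 | कीटन शक | Google: कीटन शक, कीटन शको; Bing: कीटन शक; Guruji: कीटन शक | 8.1 | कीटन शक | 36300 | 1360 | 333 |
| | | | 8.2 | कीटन शको | 35800 | 800 | 270 |
| 9 | य ग | Google: य गो, य ग; Bing: य ग; Guruji: य ग | 9.1 | य ग | 205000 | 21600 | 1423 |
| | | | 9.2 | य चगमो | 128000 | 3280 | 239 |
| | | | 9.3 | य ग | 112000 | 6280 | 647 |
| 10 | म जन | Google: म जन ओ, म जन; Bing: म जन; Guruji: म जन | 10.1 | म जन | 673000 | 18500 | 3343 |
| | | | 10.2 | म जन ओ | 669000 | 6020 | 990 |
| | | | 10.3 | म जन ए | 673000 | 2860 | 416 |
| 11 | क दर | Google: क दरीम, क दर; Bing: क दर; Guruji: क दर | 11.1 | क दर | 261000 | 11300 | 655 |
| | | | 11.2 | क नदरो | 29500 | 1850 | 105 |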

Table 4.19 Precision values of the three search engines

| S. No. | Query | Precision@10: Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1 Precision values of the three search engines (chart of precision, 0 to 1, by query number; series: P Google, P Bing, P Guruji)

All three search engines return more documents when the query is submitted with the keyword in its root form. This supports searching with root words, since keywords in their root form generally give better results.

Only Google lists morphological variants of the root word; Bing and Guruji list only the root word as supplied, in almost all of the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not, which is why they retrieve fewer documents than Google. Overall, the tables above show that the quantity of results increases when keywords are used in their root form. For search engines, however, the quality of results matters more than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines, where precision is calculated over the top 10 results of each engine. Google's precision is high for almost all queries. Since Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
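The precision measure used throughout this chapter can be written down in a few lines. The sketch below assumes that each of the top results has already been judged relevant or not by hand; the function name `precision_at_k` and the sample judgments are illustrative and are not taken from the study.

```python
from typing import Sequence


def precision_at_k(relevant_flags: Sequence[bool], k: int = 10) -> float:
    """Fraction of the first k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevant_flags)[:k]
    # Dividing by k (not len(top)) counts missing results as non-relevant.
    return sum(1 for flag in top if flag) / k


# Hypothetical judgments for one query: 9 of the first 10 results relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))  # 0.9
```

Dividing by k rather than by the number of results actually returned is a design choice: an engine that returns only three results, all relevant, scores 0.3 instead of 1.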

4.10 Discussion: Phonetic nature of Hindi language and spelling variations

Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and the search engine should support all of its phonetically equivalent forms.

The following queries were randomly selected from a set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained from Google.

Table 4.20 Results of the search engine on the phonetic nature of Hindi language

| Hindi query (standard keywords in bold in the original) | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |

Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

The table and figure above show that search engines return documents for the various phonetically equivalent Hindi queries. No single standard exists for writing a keyword used to fetch Hindi web data. For every phonetically equivalent keyword in a query, the results vary; that is, a different set of documents is retrieved, with little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
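One way to picture phonetic spelling variation is to generate the variants mechanically. The sketch below is an illustration under stated assumptions, not the method used in this study: it covers only two rules chosen for the example, namely swapping anusvara and chandrabindu, and writing a nukta letter with or without its nukta. The function name `phonetic_variants` is hypothetical.

```python
import unicodedata
from itertools import product
from typing import Set

ANUSVARA = "\u0902"      # nasal dot written above the letter
CHANDRABINDU = "\u0901"  # nasal "moon-dot" sign
NUKTA = "\u093c"         # dot below, used in letters such as za and fa


def phonetic_variants(word: str) -> Set[str]:
    """Return spelling variants of a Devanagari word (illustrative rules only)."""
    # NFD splits precomposed nukta letters into base letter + nukta sign,
    # so the nukta can be treated as an optional character.
    word = unicodedata.normalize("NFD", word)
    options = []
    for ch in word:
        if ch in (ANUSVARA, CHANDRABINDU):
            options.append((ANUSVARA, CHANDRABINDU))
        elif ch == NUKTA:
            options.append((NUKTA, ""))
        else:
            options.append((ch,))
    return {"".join(combo) for combo in product(*options)}


# Two words with one variable position each give two spellings each.
for word in ("आंदोलन", "आज़ादी"):
    print(sorted(phonetic_variants(word)))
```

The number of variants doubles with every variable position, so a practical system would need to cap or rank the candidates before submitting them.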

4.11 Discussion: Word synonyms

A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3 below compare the precision values across the three search engines.

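Table 4.21 Effect of word synonyms on Hindi IR

| S. No. | Query | Standard Hindi words / synonyms | Google documents | Precision@10 | Bing documents | Precision@10 | Guruji documents | Precision@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | स न क आब षण | आब षण, गहन | | | | | | |
| 1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर | | | | | | |
| 2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर, न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | |
| 4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | |
| 5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | |
| 6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |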

Figure 4.3 Comparison of precision values against three search engines (Google, Bing, and Guruji)

The examples above show that using synonyms of Hindi keywords improves retrieval for a Hindi query. Synonyms affect not only the number of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines, and Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. Quality, however, is of more interest here than quantity. In terms of quality, Google and Bing clearly provide better results than Guruji, and on average Google ranks first: its precision values are higher than those of Bing and Guruji. Changing a keyword into its synonym thus yields equivalent results, so synonyms of keywords play an important role in Hindi information retrieval.
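Synonym substitution of the kind tested above can be sketched as follows. This is a minimal illustration, assuming a small hand-built synonym table; the romanized words (`aabhushan`, `gahne`, `zevar`, `alankar`) and the function name `expand_with_synonyms` are placeholders for illustration, and a real system would work on Devanagari text with a proper Hindi thesaurus.

```python
from typing import Dict, List

# Hypothetical synonym table in romanized form (illustration only).
SYNONYMS: Dict[str, List[str]] = {
    "aabhushan": ["gahne", "zevar", "alankar"],
}


def expand_with_synonyms(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the query followed by variants with one keyword swapped for a synonym."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            candidate = " ".join(tokens[:i] + [alternative] + tokens[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


print(expand_with_synonyms("sone ke aabhushan", SYNONYMS))
# ['sone ke aabhushan', 'sone ke gahne', 'sone ke zevar', 'sone ke alankar']
```

Only one keyword is replaced per variant, which mirrors the numbered sub-queries in the synonym table, where each row changes a single word of the base query.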

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In the listing of these queries (Table 4.22), the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in another context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.

Table 4.22 List of randomly selected ambiguous queries

| S. No. | Query | Keyword in first context | In English | Keyword in second context | In English |
|---|---|---|---|---|---|
| 1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep |
| 2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice |
| 3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition |
| 4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head |
| 5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war |

The ambiguous queries listed above were tested against three search engines: Google, Bing, and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned (Google) | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आभ | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | पन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned (Bing) | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आभ | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | फ र | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | पन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आभ | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | फ र | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | पन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | क र | 84 | Aggregate | 0 | Family | 0 | 10 |

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in another, unwanted context. All three search engines return documents in different contexts, so they underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.

For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. Search engines therefore need to address keyword ambiguity in order to return better results.
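The deviation from relevance described for the "Other context" column can be expressed as a share of the judged results. The short sketch below uses the counts reported for query 5 in the three ambiguity tables; the function name `other_context_share` and the tuple layout are assumptions made for this illustration.

```python
from typing import Dict, Tuple

# (first-context hits, second-context hits, other-context hits) among the
# top 10 results for query 5, as reported in the three ambiguity tables.
QUERY5_TOP10: Dict[str, Tuple[int, int, int]] = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def other_context_share(counts: Tuple[int, int, int]) -> float:
    """Share of judged results that fall outside both listed contexts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no judged results")
    return counts[2] / total


for engine, counts in QUERY5_TOP10.items():
    print(engine, other_context_share(counts))
```

A share of 1.0, as for Guruji on this query, means that none of the ten judged results matched either listed sense of the keyword.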

4.13 Discussion: Influence of English on Hindi information retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: the English word exercise is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Based on phonetic variations of this kind, a variety of popular keywords and queries from various domains were tested.

The table below shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi (result counts from Google)

| English word | Google transliteration | Results | Standard Hindi keyword | Results | Phonetic equivalent 1 | Results | Phonetic equivalent 2 | Results |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | 42200 | व भन | 101000 | व भ न | 3520 | व भ न | 34800 |
| Insurance | इनसयाॊस | 1220 | इ शम यस | 246000 | इ शम य स | 1390 | इनशम य नस | 3710 |
| Cancer | क सय | 1880000 | क सय | 35700 | क नसय | 6490 | क नसय | 1820 |
| Hospital | हॉसटऩटर | 404000 | ह िसऩटर | 263200 | ह िसऩटर | 2440 | ह सऩ टर | 100 |
| Corruption | कोररसततओन | 1890 | कयपशन | 368000 | कयऩशन | 1110 | क यपशन | 58 |
| Computer | कॊ तमटय | 4450000 | कमपम टय | 1040000 | कमपम टय | 2610000 | क पम टय | 537000 |
| University | उननवशसतम | 1540 | म तनवमसमटी | 1070000 | म तनव मसमटी | 1420 | म तनवसमटी | 3270 |
| Director | डियकटय | 4600 | ड मय कटय | 735000 | ड इय कटय | 97000 | ड मय कटय | 8300 |
| Accident | एसकसिट | 26300 | एकस डट | 125000 | एकस ड नट | 2510 | ऐिकसडट | 5160 |
| Parliament | ऩशरअभट | 3240 | ऩ मरमम भट | 38000 | ऩ मरमभट | 639 | ऩ मरमम भट | 1120 |
| Specialist | टऩशसअशरटट | 143 | सऩ मशममरसट | 1440 | सऩ शमरसट | 51300 | सऩ श मरसट | 9820 |
| Expert | एकसऩट | 269000 | एकसऩटम | 8730 | ऐकसऩटम | 182 | ऐकसऩटम | 8 |
| Policy | ऩोशरटम | 609 | ऩॉमरस | 395000 | ऩ मरस | 69700 | ऩ मरस | 289000 |

The table above shows that the search engine returns documents for single-keyword queries, and that it also returns a large number of documents for every phonetic variant of a keyword. People evidently have their own ways of writing these words in Hindi, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google transliteration column (bold Hindi entries in the original table) lists the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and the number of documents retrieved in Hindi IR. The software (its design and working are discussed in the next chapter) transforms the Hindi query into its variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged. For example:

The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:

Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

The original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is thus transformed into two equivalent forms that mix English and Hindi while the meaning of the query stays the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
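The two-level transformation can be sketched in a few lines, using the romanized form of the example above. This is an illustration under assumptions, not the software discussed in the next chapter: the lookup table `ENGLISH_EQUIVALENTS` and the function `transform_levels` are hypothetical, Level 1 is taken to replace the first replaceable keyword, and Level 2 is taken to replace every replaceable keyword. A real tool would emit the English words in Hindi script.

```python
from typing import Dict, List

# Hypothetical lookup from Hindi keywords to English equivalents (romanized).
ENGLISH_EQUIVALENTS: Dict[str, str] = {
    "videshi": "foreign",
    "nivesh": "investment",
}


def transform_levels(query: str, equivalents: Dict[str, str]) -> List[str]:
    """Return the Level 1 and Level 2 variants of a query, where they exist."""
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in equivalents]
    variants: List[str] = []
    if replaceable:
        # Level 1: replace only the first keyword that has an equivalent.
        level1 = tokens.copy()
        first = replaceable[0]
        level1[first] = equivalents[level1[first]]
        variants.append(" ".join(level1))
    if len(replaceable) > 1:
        # Level 2: replace every keyword that has an equivalent.
        variants.append(" ".join(equivalents.get(token, token) for token in tokens))
    return variants


print(transform_levels("videshi nivesh bharat mein", ENGLISH_EQUIVALENTS))
# ['foreign nivesh bharat mein', 'foreign investment bharat mein']
```

A query with only one replaceable keyword yields a Level 1 variant and no Level 2 variant, since the second level requires more than one replacement.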

Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation; Google search results)

| In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Results: Hindi query | Level 1 | Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, showing result counts for the Hindi query, Level 1, and Level 2)

The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the number of documents retrieved is considerable. For search engines, the quality of results matters more than the quantity, so Table 4.28 and Figure 4.5 present the precision values for analysis. Three popular search engines, namely Google, Bing, and AltaVista, were used to retrieve web results.

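Table 4.28 Analysis of precision values (tabular representation)

| Hindi query (HQ) | Influenced Hindi query, Level 1 (L1) | Influenced Hindi query, Level 2 (L2) | Google Precision@10 (HQ / L1 / L2) | Bing Precision@10 (HQ / L1 / L2) | AltaVista Precision@10 (HQ / L1 / L2) |
|---|---|---|---|---|---|
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 / 0.8 / 1 | 0.9 / 0.7 / 1 | 0.9 / 0.7 / 1 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1 |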

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 140

Figure 45 Analysis of Inter precision values

From the above tables and figures it can be clearly seen that Hindi data of similar

nature can be mined out against Hindi queries by transforming them into their

variants by including English keywords written in Hindi The transformation of

queries resulted in an increase of retrieved data The relevance of the retrieved

data can also be seen in the precision column For every Hindi query and its

transformed variations the degree of relevance of documents is very close or

equal or improved eg for the Hindi query सव सथ औय यकतद न 9 out of first 10

documents are relevant and for transformed queries which are of similar nature

ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन 9 of the first 10 documents are

relevant and the same repeats for rest of the Hindi queries as shown in the table

above Without transformation of Hindi queries the user may miss the chance of

retrieving the relevant information as the Hindi user may not be aware of the

presence of such information on web and may be unable to formulate the

variation query based on the factor of English influence From the above table it

can be said that English influenced Hindi information is present and is increasing

day by day on web By the inclusion of the English keywords in Hindi script in

the form of query the scope of searching in Hindi and getting relevant

information can be increased

09 09 09 08 09 07 09 08 1 1 09 08 09 08 08 1 1 1

09 08 08 08 07 0509 07

1 08 08 08 09 08 041 1 1

09 09 08 08 0705

0907

106 07 08 09 09

04

1 1 1

HQ1 L1 L2 HQ2 L1 L2 HQ3 L1 L2 HQ4 L1 L2 HQ5 L1 L2 HQ6 L1 L2

Inter Precession Chart HQ (Hindi Query) L (Level)

Google Bing AltaVista

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 141

The process of Hindi IR becomes more difficult because of the structure of

Hindi Language Generally people do not follow the actual Hindi writing standard

which widens the gap between Hindi web data and users

The relevant information can be mined out by transforming the Hindi queries

Search engines neither make transformations of the query nor find keyword

equivalents Because they may have the performance and throughput problems if

parameters like Hindi Phonetics synonyms and English equivalent Hindi

keywords are implemented at root level However this problem can be solved at

interface level Therefore to lessen the efforts of a Hindi user to search such

information a software has been developed (a detailed description has been

mentioned in next Chapter) which acts like an interface between user and search

engines With the help of this tool user can widen the scope of search on web in

Hindi language [66][67]

Page 41: Chapter 4 Issues in Information Retrieval for Hindi Language

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 126

Table 419 precision values of the three search engines

Figure 41 precision values of the three search engines

0

02

04

06

08

1

11 21 31 33 42 52 62 71 81 91 93 102 111

P GoogleP BingP Guruji

S No Query Precision 10 SNO Query Precision 10

Google Bing Guruji Google Bing Guruji

11 ब यतवष मवन 05 03 01 63 ऩकषमोकीपरज ततम

09 06 01

12 ब यत मवष मवनो 03 01 01 71 क षषसभसम 07 03 02

22 हव ईद घमटन क क यण

05 03 01 72 क षषसभसम ओ 07 04 02

22 हव ईद घमटन ओ क क यण

03 03 01 81 कीटन शकक इसत भ र

1 08 02

31 ब यतभफ रीज न व रीब ष

09 05 05 82 कीटन शकोक इसत भ र

09 06 02

32 ब यतभफ रीज न व रीब ष ए

05 04 03 91 भ नमसकय ग 07 06 04

33 ब यतभफ रीज न व रीब ष ओ

05 02 02 92 भ नमसकय चगमो 09 06 03

41 षवर पतह न ऩयझ र 05 03 02 93 भ नमसकय ग 09 05 04

42 षवर पतह न ऩयझ र 05 03 0 101 गर भ णषवक सम जन

09 07 03

51 पर क ततकआऩद 1 04 03 102 गर भ णषवक सम जन ओ

1 06 0

52 पर क ततकआऩद ए 06 06 01 103 गर भ णषवक सम जन ए

08 06 02

61 ऩ कीपरज तत 1 07 02 111 परभ खक षषक दर 05 04 0

62 ऩकषमोकीपरज तत 09 06 01 112 परभ खक षषक नदरो 04 02 02

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 127

It has been observed that documents returned by all three search engines are more

in number when query with root word is submitted This justifies the searching of

documents in the root word because in general we get better results with the

keywords in their root form

It has also been observed that only Google shows listing of morphological

variants of root words where as Bing and Guruji show only listing of root word

supplied in almost all the sample queries listed above in the table

From the above results it is evident that only Google indexes the documents

keyword in their root form Bing and Guruji do not index in that form that is the

reason number of documents retrieved in their case is less in comparison to

Google The overall comparison of results from the three search engines in tables

above show that in general the quantity of results retrieved increased when the

keywords are used in their root form In case of search engines the quality of

results is more important than the quantity Figure 41and table 419 shows the

comparison of the precision values of the three search engines The precision

value is calculated by taking the top 10 results of the search engines On closely

observing the results we can say that precision value in case of Google is high in

almost all queries As mentioned above Google does its indexing in the root form

of keywords it can be said that that relevancy of the results is also high in Google

in comparison to other two search engines which denotes that not only quantity

but the quality of results is also affected by the morphological variations in the

keywords

410 Discussion Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving the results against

phonetically equivalent words of keywords entered to search User may use any

keyword for searching and search engines should be capable to support all

phonetically equivalent words

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 128

Following are randomly selected queries from the set of 50 queries tested on

Google search engine Tables below show the results and precision offered by

Google

Table 420 results of the search engine on Phonetic nature of Hindi language

Hindi Query

With Bold

Standard

Keywords

Phonetic variations of the Keywords Google Results

for query having

keywords

No of

Results

Precision

10

ससजिमो भ िहयीर

ऩद थम

सिबजमो सबज मो

जहयीर जहरयर ससजिमोिहयीर 97 09

ससजिमो जहयीर 632 09

सिबजमो िहयीर 194 09

आसभान छ त भहॊगाई

आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10

आसभ न भहॉगाई 1040 10

आसभ भह ग ई 14 06

आसभाॊ भह ग ई 563 07

भरषटाचाय स आिादी

भरशट च य

बयषटट च य आज दी भरषटाचायआज दी 211000 08

भरशटाचायआज दी 214 06

बयषटाचाय आज दी 447 07

भरषटट च यआिादी 1090000 09

भरशट च य आिादी 1040 07

बयषटट च यआिादी 1190 08

अननाहिाय क आनदोरन

अनन हज य आ द रनआ द रन अनन हज य आनदोरन

84700 03

अनन हज य आॊदोरन

85100 08

अनन हज य क आॉदोरन

78 06

अनना हिाय क आनद रन

399 05

अनन हिाय क आनद रन

3260000 10

फयोिगायी सभसम सभ ध न

फ य जग यी फय जग यी फ य जा यी

फयोिगायी 9650 09

फयोिगायी 80600 10

फयोिगायी 170 07

फयोिगायी 30 05

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 129

Figure 42 Precision Charts for Phonetic nature of Hindi language

In the above table and figure it can be clearly seen that search engines return a

handful of documents on various Hindi phonetically equivalent queries It is

observed that no particular standard exists for writing the keyword to fetch Hindi

34

33

33

Query No 1

1 11 12

31

30

18

21

Query No 2

2 21 22 23

18

13

1520

16

18

Query No 3

3 31 32

33 34 35

5823

109

Query No 4

1st Qtr 2nd Qtr

3rd Qtr 4th Qtr

29

32

23

16

Query No 5

5 51 52 53

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 130

web data For every phonetically equivalent keywords in the query variation in

the results exist Ie a different set of documents are retrieved with least

repetition From the precision chart it is clearly observed that the degree of

relevance for queries containing phonetically equivalent keywords is almost same

or nearly equal The native Hindi user may not be aware of the Phonetic issues in

Hindi IR and may miss the relevant information of hisher use

411 Discussion Words Synonyms

A word can express a myriad of implications connotations and attitudes

in addition to its basic ―dictionary meaning And a word often has near

synonyms that differ from it solely in these nuances of meaning Choosing the

right word can be difficult for people as well as for the information retrieval

system For example the word (आब षण) in Hindi (Ornament) in English has

following commonly spoken synonyms गहन ज वयअर क य

Table 421 and Figure 43 have been presented below which shows the

comparison of precision values against three search engines

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 131

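Table 4.21: Effect of word synonyms on Hindi IR (P@10 = precision over the top 10 results)

S. No. | Query | Standard Hindi word | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10
1.1 | सोने के आभूषण | आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | सोने के गहने | आभूषण | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | सोने के जेवर | आभूषण | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | सोने के अलंकार | आभूषण | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | सोने के आभरण | आभूषण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2.1 | काले बादल | बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | काले मेघ | बादल | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | काले जलधर | बादल | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3.1 | स्त्री सशक्तिकरण | स्त्री | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | नारी सशक्तिकरण | स्त्री | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | महिला सशक्तिकरण | स्त्री | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औरत सशक्तिकरण | स्त्री | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4.1 | सिकंदर का अहंकार | अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | सिकंदर का अभिमान | अहंकार | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | सिकंदर का घमंड | अहंकार | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5.1 | वृक्ष लगाओ | वृक्ष | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | पेड़ लगाओ | वृक्ष | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दरख्त लगाओ | वृक्ष | 481 | 0.5 | 19 | 0.5 | 0 | 0
6.1 | आँख दान | आँख | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | नेत्रदान | आँख | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | चक्षु दान | आँख | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1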

Figure 4.3: Comparison of precision values across three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the table and figure above, Google returns more documents than the other two search engines, and Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji. On average Google ranks first, i.e., its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
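The synonym substitution discussed above can be sketched as a small query expander. This is only an illustration: the `SYNONYMS` table and the function name `expand_with_synonyms` are hypothetical, the Devanagari spellings are best-effort renderings of the ornament example, and a real system would draw on a proper Hindi lexicon.

```python
# Hypothetical synonym table built from the ornament example in the text.
# The entries are illustrative assumptions, not a real lexicon.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms):
    """Return the original query followed by one variant per synonym substitution."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            # Replace only the word at position i and keep the rest unchanged.
            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


if __name__ == "__main__":
    for variant in expand_with_synonyms("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Each variant would then be submitted as a separate query, which is the setting in which the document counts and precision values above were compared.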

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five Hindi queries, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query under each context of the ambiguous keyword.


Table 4.22: List of randomly selected ambiguous queries

S. No. | Query | Keyword as (context 1) | In English | Keyword as (context 2) | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war

The ambiguous queries listed in the table above were tested against three search engines: Google, Bing, and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied or that contain the keyword in some other, unintended context. All three search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers), whose other context is "snake charmer's snake head", the retrieved documents are expected to be in the "art" context. However, Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. Search engines therefore need to address ambiguity in keywords in order to return better results.
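The ambiguity tables can be summarised as a single figure per search engine. The sketch below tallies the share of top-10 results that fall in the first listed context, under the assumption (not stated for every query in the text) that the first context is the intended one. The counts are copied from the three ambiguity tables; the names `COUNTS` and `intended_context_precision` are illustrative only.

```python
TOP_K = 10

# (intended context, second context, other context) per query, out of the top 10.
# None marks a query for which the engine returned no results.
COUNTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_context_precision(rows, k=TOP_K):
    """Mean fraction of top-k results in the intended context, skipping unanswered queries."""
    answered = [row for row in rows if row is not None]
    if not answered:
        return 0.0
    return sum(row[0] for row in answered) / (k * len(answered))


if __name__ == "__main__":
    for engine, rows in COUNTS.items():
        print(engine, round(intended_context_precision(rows), 2))
```

Under this assumption no engine places even half of its top-10 results in the intended context on average, which is consistent with the observation that the engines do not distinguish between the contexts of an ambiguous keyword.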


4.13 Discussion: Influence of English on Hindi Information Retrieval

The English language influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on the Hindi language has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi (Google result counts are listed in the same order as the four Hindi forms)

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results: transliteration | Results: standard | Results: equivalent 1 | Results: equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

From the table above, it can be observed that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, and in large numbers. People evidently have their own ways of writing Hindi words, and no standard is followed for storing Hindi data on the web. Documents are likewise retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the "Google transliteration" column shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

Each query is converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged.

Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इनवेस्टमेंट). The query is transformed at the two levels as:

फॉरेन निवेश भारत में (foreign nivesh Bharat mein)
फॉरेन इनवेस्टमेंट भारत में (foreign investment Bharat mein)

The original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From a sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
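The two-level transformation described above can be sketched as follows. This is not the software described in the next chapter; it is a minimal illustration that assumes a hand-made table of English equivalents written in Devanagari (`EQUIVALENTS`, a hypothetical name), replaces the first replaceable keyword at Level 1, and replaces every replaceable keyword at Level 2. The Devanagari spellings are best-effort renderings of the example query.

```python
# Hypothetical mapping from Hindi keywords to English words written in Devanagari.
EQUIVALENTS = {
    "विदेशी": "फॉरेन",         # foreign
    "निवेश": "इनवेस्टमेंट",     # investment
}


def transform(query, equivalents):
    """Return the Level 1 and Level 2 variants of a Hindi query, where available."""
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in equivalents]
    variants = {}
    if replaceable:
        # Level 1: replace exactly one keyword (here, the first replaceable one).
        level1 = list(words)
        first = replaceable[0]
        level1[first] = equivalents[words[first]]
        variants["level1"] = " ".join(level1)
    if len(replaceable) > 1:
        # Level 2: replace more than one keyword (here, all replaceable ones).
        variants["level2"] = " ".join(equivalents.get(word, word) for word in words)
    return variants


if __name__ == "__main__":
    for level, variant in transform("विदेशी निवेश भारत में", EQUIVALENTS).items():
        print(level, variant)
```

Choosing which single keyword to replace at Level 1 is a design decision; the sketch simply takes the first one, which matches the worked example.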

Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation; result counts from Google)

In English | Hindi query | Level 1 | Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इनवेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000

Figure 4.4: Transformed queries in two equivalent senses (Google result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants)

From the table and figure above, it is evident that documents are returned for the original as well as the transformed Hindi queries, and the number of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 below present an analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, were used to retrieve web results.

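Table 4.28: Analysis of precision values (tabular representation; precision over the top 10 results)

Query | Form | Query text | Google | Bing | AltaVista
HQ1 | Hindi query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
HQ1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
HQ1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
HQ2 | Hindi query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
HQ2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
HQ2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
HQ3 | Hindi query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
HQ3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
HQ3 | Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1
HQ4 | Hindi query | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6
HQ4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
HQ4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
HQ5 | Hindi query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
HQ5 | Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
HQ5 | Level 2 | फॉरेन इनवेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
HQ6 | Hindi query | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1
HQ6 | Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1
HQ6 | Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1

Figure 4.5: Analysis of inter-precision values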

From the tables and figures above, it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. Transforming the queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns: for every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are likewise relevant. The same pattern repeats for the rest of the Hindi queries shown in the table above.

Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate a query variant based on English influence. From the table above it can be said that English-influenced Hindi information is present on the web and is increasing day by day. Including English keywords written in Hindi script in the query can therefore widen the scope of searching in Hindi and of retrieving relevant information.


Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor look up keyword equivalents, possibly because handling parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords at the root level could cause performance and throughput problems. However, this problem can be solved at the interface level. Therefore, to reduce the effort required of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of web search in the Hindi language [66][67].
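The idea of handling query variants at the interface level, rather than inside the search engine, can be sketched as a thin layer that submits each variant and merges the results. This is not the tool described in the next chapter: `search_with_variants` is an illustrative name, and `fake_search` is a stand-in with made-up data that takes the place of a real search engine call.

```python
from typing import Callable, Iterable, List


def search_with_variants(variants: Iterable[str],
                         search: Callable[[str], List[str]],
                         per_query: int = 10) -> List[str]:
    """Submit every query variant and merge result URLs, keeping first-seen order."""
    merged: List[str] = []
    seen = set()
    for query in variants:
        for url in search(query)[:per_query]:
            if url not in seen:  # drop documents already returned by an earlier variant
                seen.add(url)
                merged.append(url)
    return merged


def fake_search(query: str) -> List[str]:
    """Stand-in for a search engine; the index below is hypothetical test data."""
    index = {"q1": ["a", "b"], "q2": ["b", "c"]}
    return index.get(query, [])


if __name__ == "__main__":
    print(search_with_variants(["q1", "q2", "q3"], fake_search))
```

Keeping the merge outside the engine means the engine's index and ranking are untouched; the cost is one request per variant.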


It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general keywords in their root form give better results.

It has also been observed that only Google lists morphological variants of the root word, whereas Bing and Guruji list only the root word as supplied, in almost all the sample queries listed in the table above.

From these results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On close observation, the precision value for Google is high for almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variations in keywords affect not only the quantity but also the quality of results.
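The precision measure used throughout this chapter, computed over the top 10 results, can be written out as a short function. This is a minimal sketch: it assumes that relevance judgments are supplied by hand as booleans, and that a result list shorter than 10 is still divided by 10, which the text does not specify.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    relevance: list of booleans for the ranked results, best result first.
    Assumption: lists shorter than k are still divided by k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


if __name__ == "__main__":
    # Nine of the first ten results judged relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```

A value of 0.9 therefore means that 9 of the first 10 documents returned for a query were judged relevant.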

4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any spelling of a keyword for searching, and search engines should support all phonetically equivalent words.
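One way to support phonetically equivalent spellings is to generate the variants before searching. The sketch below covers only two well-known sources of Hindi spelling variation, chosen here as assumptions for illustration: the optional nukta (as in आज़ादी and आजादी) and the interchange of anusvara and chandrabindu (as in आंदोलन and आँदोलन). Other variations, such as writing a half nasal consonant instead of the anusvara, are not handled, and `phonetic_variants` is an illustrative name.

```python
import unicodedata

NUKTA = "\u093c"         # combining dot below, as in ज़
ANUSVARA = "\u0902"      # nasal dot above
CHANDRABINDU = "\u0901"  # nasal moon-dot above


def phonetic_variants(word):
    """Return spelling variants of a Devanagari word under two simple rules."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    base = unicodedata.normalize("NFD", word)
    variants = {base, base.replace(NUKTA, "")}
    for form in list(variants):
        variants.add(form.replace(ANUSVARA, CHANDRABINDU))
        variants.add(form.replace(CHANDRABINDU, ANUSVARA))
    return sorted(variants)


if __name__ == "__main__":
    for variant in phonetic_variants("\u0906\u095b\u093e\u0926\u0940"):  # azadi
        print(variant)
```

Each generated spelling can then be submitted as its own query, in the same way as the phonetic variants tested in this section.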

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and the precision offered by Google.

Table 4.20: Results of the search engine on the phonetic nature of the Hindi language (precision over the top 10 results)

Query No. | Hindi query (standard keywords) | Phonetic variations of the keywords | Keywords used in the Google query | No. of results | Precision
1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
1 | | | ससजिमो जहयीर | 632 | 0.9
1 | | | सिबजमो िहयीर | 194 | 0.9
2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
2 | | | आसभ न भहॉगाई | 1040 | 1.0
2 | | | आसभ भह ग ई | 14 | 0.6
2 | | | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
3 | | | भरशटाचायआज दी | 214 | 0.6
3 | | | बयषटाचाय आज दी | 447 | 0.7
3 | | | भरषटट च यआिादी | 1090000 | 0.9
3 | | | भरशट च य आिादी | 1040 | 0.7
3 | | | बयषटट च यआिादी | 1190 | 0.8
4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
4 | | | अनन हज य आॊदोरन | 85100 | 0.8
4 | | | अनन हज य क आॉदोरन | 78 | 0.6
4 | | | अनना हिाय क आनद रन | 399 | 0.5
4 | | | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
5 | | | फयोिगायी | 80600 | 1.0
5 | | | फयोिगायी | 170 | 0.7
5 | | | फयोिगायी | 30 | 0.5

Figure 4.2: Precision charts for the phonetic nature of the Hindi language

In the table and figure above, it can be seen that search engines return a handful of documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e., a different set of documents is retrieved with minimal repetition. From the precision chart it is clear that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.

411 Discussion Words Synonyms

A word can express a myriad of implications connotations and attitudes

in addition to its basic ―dictionary meaning And a word often has near

synonyms that differ from it solely in these nuances of meaning Choosing the

right word can be difficult for people as well as for the information retrieval

system For example the word (आब षण) in Hindi (Ornament) in English has

following commonly spoken synonyms गहन ज वयअर क य

Table 421 and Figure 43 have been presented below which shows the

comparison of precision values against three search engines

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 131

Table 421 Effect of word synonyms on Hindi IR

S NO Query Standard

Hindi

Words

Synonyms Documents Returned

Google Per

10

Bing Per

10

Guruj

i

Per

10

1 स न क आब

षण

आब षणगहन

11 स न क आब षण 217000 08 3250 07 381 05

12 स न क गहन 188000 08 2590 06 389 05

13 स न क ज वय 78900 08 1670 07 311 06

14 स न क अर क य 9490 05 633 04 70 0

15 स न क आबयण 493 05 38 03 1 0

2 क र फ दर फ दर

21 क र फ दर 233000 07 7510 07 733 03

22 क र भ घ 40700 09 1500 08 99 06

23 क र जरधय 1570 06 54 06 2 02

3 सतर सशिकतकयण

सतर न यी

31 सतर सशिकतकयण

9950 09 1570 07 760 06

32 न यीसशिकतकयण

29300 09 1910 09 736 04

33 भहहर सशिकतकयण

96300 09 5160 08 1091 03

34 औयतसशिकतकयण

7670 08 680 07 510 02

4 मसक दयक अ हक य

अ हक य

41 मसक दयक अ हक य

1990 04 18 03 60 06

42 मसक दयक अमबभ न

2400 06 304 02 16 01

43 मसक दयक घभ ड

495 05 54 06 9 01

5 व रग ओ व

51 व रग ओ 6960 1 698 08 29 05

52 ऩ डरग ओ 13400 1 1080 09 143 09

53 दयखतरग ओ 481 05 19 05 0 0

6 आ खद न आ ख

61 आ खद न 34000 07 3690 05 312 03

62 न तरद न 77500 1 3240 1 159 09

63 च द न 2450 08 427 01 36 01

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 132

Figure 43 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords with their

synonyms improves the information retrieval against a query in Hindi language

Not only quantity of documents returned is affected but quality is also affected by

using synonyms of Hindi keywords

From the above table and figure it is to be observed that documents returned by

Google are more in quantity than other two search engines and least number of

documents get returned by Guruji search engine the reason behind may be

availability of less documents or poor indexing However we are interested in

quality of results than quantity As far as quality of results is concerned it can be

clearly seen that Google and Bing provide quality data than Guruji And in the

average case Google still stands first in the row that means precision values by

Google are more than that of Bing and Guruji in this case Thus it becomes clear

that by changing a keyword into its synonym equivalent results can be obtained

Therefore it is evident that synonyms of keywords play an important role in the

process of Hindi information retrieval system

412 Discussion Ambiguity

In a sample set of 50 ambiguous queries below we present five randomly

selected ambiguous queries In figure 35 second column contains five queries in

Hindi third column holds the ambiguous keyword in one context and fifth

column holds the same ambiguous keyword in other context Fourth and sixth

columns hold the meaning of queries in English with respect to the ambiguous

keyword in context

0

02

04

06

08

1

11 12 13 14 15 21 22 23 31 32 33 34 41 42 43 51 52 53 61 62 63

Google

Bing

Guruji

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 133

Table 422 List of randomly selected ambiguous queries

Ambiguous queries mentioned above in the figure are tested for results against

three search engines Google Bing and Guruji Results are shown below in tables

Table 423 Ambiguity test for Google

Table 424 Ambiguity test for Bing

SNo Query For keyword

as

In English For keyword as In English

1 न यी क स न ऩस द ह स न (Gold) Women like Gold स न (To sleep)

Women like to

sleep

2 आभ र गो की ऩस द आभ

(common)

Common manlsquos

choice आभ (Mango) Mango is

peoplelsquos choice

3 फ र षवक सऔय ऩ षण फ र (Children) Child Development

and Nutrition फ र(Hair) Hair

Development and

Nutrition

4 सऩ यो क पन पन (Art) Art of snake

charmers पन(Snake head) Snake charmerlsquos

snake head

5 म दध भ क र षवन श क र

(Aggregate)

Aggregate

destruction in wars क र(family) Destruction of

families in war

Query Ambiguous

keyword Documents

returned

Google Other

Context Results Found

Context Context

1 स न 50800 Gold 5 To sleep 2 3

2 आभ 488000 Common 3 Mango 3 4

3 फ र 2900000 Children 7 Hair 3 0

4 पन 184 Art 0 Snake head 10 0

5 क र 17800 Aggregate 2 Family 3 5

Query Ambiguous

keyword Documents

returned

Bing Other

Context Results Found

Context Context

1 स न 2680 Gold 2 To sleep 2 6

2 आभ 17800 Common 3 Mango 3 4

3 फ र 4030 Children 6 Hair 2 2

4 पन 25 Art 0 Snake head 9 1

5 क र 1900 Aggregate 0 Family 2 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 134

Table 425 Ambiguity test for Guruji

From the above results obtained in tables it is observed that all three search

engines return documents without differentiating between the contexts of

keyword in the query In the above table the last column labeled as ―other

Context holds the number of results which are not relevant to the query supplied

or those documents which contains the keywords in other non required context

From the results it is clear that all search engines return documents in different

contexts Therefore it can be said that search engines underperform when supplied

with ambiguous queries Numbers in column labeled as ―other Context signifies

the deviation from relevance For example for query म दध भ क र षवन श

(aggregate destruction in wars) the column ―Other Context for Google contains 5

documents for Bing contains 8 documents and for Guruji contains all 10

documents

In another query सऩ यो क पन (art of snake charmers) another context (Snake

charmerlsquos snake head) retrieved documents are expected to be in context (art) but

from the above results obtained it can be seen that google returns all 10

documents in non required context (snake head) and Bing returns 9 documents

where as Guruji fails to retrieve even a single document In the above scenario it

becomes important for the search engines to address to the issue of ambiguity in

keywords to obtain better results

Query Ambiguous

keyword Documents Returned

Guruji Other

Context Results Found

Context Context

1 स न 109 Gold 0 To sleep 0 10

2 आभ 6756 Common 3 Mango 0 7

3 फ र 635 Children 5 Hair 2 3

4 पन No Results Found

Art na Snake head na na

5 क र 84 Aggregate 0 Family 0 10

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 135

413 Discussion Influence of English on Hindi Information retrieval

English language has its influence over Hindi not only in speaking but in

writing too When we talk about especially Hindi literature on web it becomes

more evident Influence of English on Hindi language has been observed as one of

the very important parameters for Hindi Information retrieval which is more

clearly explained in the example as

Example In English the word exercise is written in Hindi as (एकसयस इज) The

word exercise (एकसयस इज) has following phonetic variations एकसयस इज

एकसयस इज एकस यस इज एकसयस इज in Hindi As per various phonetic

variations mentioned above in the example a variety of popular keywords and

queries have been tested for experiments from various domains

In the following figure a sample set of common and popular English keywords

along with their phonetic variants written in Hindi can be seen

English Words

Google Transliteration

Standard Hindi Keywords

Phonetic Equivalents Search Engine Google

Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800

Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710

Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820

Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100

Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000

0 537000

University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270

Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300

Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160

Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट

3240 38000 639 1120

specialist टऩशसअशरटट

सऩ मशममरसट

सऩ शमरसट सऩ श मरसट

143 1440 51300 9820

Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 136

Table 426 keywords along with their phonetic variants written in Hindi

From the above table it can be observed that search engine does return documents

for single keyword query documents for all phonetic variants of the keywords are

also returned which are huge in number It can also be seen that that people have

their own way of representing the Hindi words and no standard is followed for

storing Hindi data on web Also the documents are retrieved for every

phonetically variant English Keyword written in Hindi script In the above table

the column with bold Hindi entries shows the keywords which are obtained by

using Google transliteration tool The table 426 shows that transliteration does

not provide correct Hindi word in most of the cases For example the correct

transliteration for word University should be म तनवमसमटीwhereas Google

transliteration provides the word उतनव मसमतम which is completely wrong It is

clearly evident from the figure above that 1540 documents have been retrieved for

the wrong keyword उतनव मसमतम (University) and the same follows for other single

word queries Insurance इनस य स

ParliamentऩमरमअभटCorruptionक रम िपतओनPolicy

ऩ मरसम Specialistसऩ मसअमरसट It can be analyzed that Hindi website developers

make use of unchecked and non standard transliteration which makes the Hindi IR

process a difficult task

In the next example Multiword Hindi queries are selected to test the effect

of English influence on Hindi IR on precision and quantity of documents

retrieved The Hindi query is transformed into it variants by the software (design

and working discussed in next chapter) by replacing the Hindi keywords with

English keywords written in Hindi without changing the meaning of the query

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 137

The queries are converted into two levels In first level one Hindi word is replaced

by its English equivalent and in second level more than one words is replaced by

their English equivalent words without changing the meaning of the original

Hindi query Example

The English query “Foreign investment in India” can be written in Hindi as “विदेशी निवेश भारत में”, where the Hindi keyword विदेशी means “Foreign” (फॉरेन) and निवेश means “investment” (इनवेस्टमेंट). The query is transformed at the two levels as:

Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इनवेस्टमेंट भारत में (Foreign investment Bharat mein)

Thus the original Hindi query “विदेशी निवेश भारत में” (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms that mix English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
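The two-level transformation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software described in the next chapter: the function name `transform_query`, the `EQUIVALENTS` dictionary, and the rule of replacing the first replaceable word at Level 1 are all assumptions made for the example.

```python
# Illustrative sketch only; the dictionary and the function name are hypothetical.
# Maps a Hindi keyword to its English equivalent written in Devanagari script.
EQUIVALENTS = {
    "विदेशी": "फॉरेन",          # videshi -> Foreign
    "निवेश": "इनवेस्टमेंट",      # nivesh  -> investment
}


def transform_query(query, equivalents=EQUIVALENTS):
    """Return the (Level 1, Level 2) variants of a whitespace-separated query."""
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in equivalents]

    # Level 1: replace a single Hindi keyword
    # (assumption: the first one that has a known equivalent).
    level1 = list(words)
    if replaceable:
        first = replaceable[0]
        level1[first] = equivalents[words[first]]

    # Level 2: replace every keyword that has a known equivalent.
    level2 = [equivalents.get(word, word) for word in words]

    return " ".join(level1), " ".join(level2)


query = "विदेशी निवेश भारत में"
level1, level2 = transform_query(query)
print(level1)
print(level2)
```

Words without an entry in the dictionary are left untouched, so the meaning of the query is preserved as long as the dictionary entries are true equivalents.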


Table 4.27 Queries transformed into two equivalent forms containing a mixture of English and Hindi (tabular representation)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query / Level 1 / Level 2)
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 / 8360 / 1020
Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 / 37200 / 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 / 99100 / 1770
Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 / 51600 / 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इनवेस्टमेंट भारत में | 658000 / 527 / 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 / 19700 / 47000

Figure 4.4 Transformed queries into two equivalent senses (chart of Google result counts for Hindi Queries 1 to 6, each at the original form, Level 1 and Level 2)


From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of documents retrieved is considerable. For search engines, however, the quality of results matters more than the quantity; Table 4.28 and Figure 4.5 therefore present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

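Table 4.28 Analysis of precision values (tabular representation; all values are Precision@10)

Query | Form | Query text | Google | Bing | AltaVista
HQ1 | Hindi query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
HQ1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
HQ1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
HQ2 | Hindi query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
HQ2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
HQ2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
HQ3 | Hindi query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
HQ3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
HQ3 | Level 2 | कार्डियोलॉजिस्ट | 1.0 | 1.0 | 1.0
HQ4 | Hindi query | सरकार द्वारा रोजगार योजना | 1.0 | 0.8 | 0.6
HQ4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
HQ4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
HQ5 | Hindi query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
HQ5 | Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
HQ5 | Level 2 | फॉरेन इनवेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
HQ6 | Hindi query | भ्रष्टाचार मुक्त भारत | 1.0 | 1.0 | 1.0
HQ6 | Level 1 | करप्शन मुक्त भारत | 1.0 | 1.0 | 1.0
HQ6 | Level 2 | करप्शन फ्री इंडिया | 1.0 | 1.0 | 1.0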


Figure 4.5 Analysis of inter-precision values

From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming the queries into variants that include English keywords written in Hindi script. The transformation increases the amount of data retrieved, and the precision columns show the relevance of that data. For most Hindi queries and their transformed variants, the degree of relevance is close to, equal to, or better than that of the original query. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are again relevant; a similar pattern holds for the rest of the Hindi queries shown in the table above. Without such transformation, the user may miss relevant information, because a Hindi user may not be aware that it exists on the web and may be unable to formulate the English-influenced variant of the query. The table also indicates that English-influenced Hindi information is present on the web and is growing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.
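The precision figures quoted above are Precision@10 values: the fraction of the first ten results that are judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the sample judgments are assumptions made for illustration and are not taken from the experiments.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that were judged relevant.

    `relevance` holds one boolean (or 0/1) judgment per ranked result.
    The divisor is always k, so a result list shorter than k is penalised.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


# Hypothetical judgments: 9 of the first 10 results are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))  # 0.9
```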

[Chart for Figure 4.5: Inter Precision Chart. Horizontal axis: HQ1 to HQ6 (HQ = Hindi Query), each followed by its Level 1 (L1) and Level 2 (L2) variants; series: Google, Bing, AltaVista.]


Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and its users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor look up keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. This problem can, however, be solved at the interface level. Therefore, to reduce the effort a Hindi user needs to find such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
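The idea of handling query variants at the interface level, rather than inside the search engine, can be illustrated with a short sketch. It is not the tool described in the next chapter: the function `expanded_search`, the round-robin merging rule and the stand-in `fake_index` are hypothetical, and `search` stands for any caller-supplied function that returns a ranked list of document identifiers for a query.

```python
def expanded_search(variants, search, limit=10):
    """Run every query variant and merge the ranked results without duplicates.

    `search` is any callable that returns a ranked list of document
    identifiers for a query string (hypothetical; supplied by the caller).
    """
    if limit <= 0:
        return []
    result_lists = [search(query) for query in variants]
    longest = max((len(results) for results in result_lists), default=0)

    merged, seen = [], set()
    # Round-robin over ranks so that each variant contributes its best hits first.
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) == limit:
                    return merged
    return merged


# Stand-in for a real search engine: a fixed table of ranked results.
fake_index = {
    "original": ["d1", "d2", "d3"],
    "level 1": ["d2", "d4"],
    "level 2": ["d5", "d1", "d6"],
}
merged = expanded_search(["original", "level 1", "level 2"], fake_index.get, limit=5)
print(merged)
```

Because the merging happens outside the search engine, the engine itself needs no changes; the cost is one extra request per query variant.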


The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The tables below show the results and the precision offered by Google.

Table 4.20 Results of the search engine on the phonetic nature of the Hindi language

No. | Hindi query (standard keywords) | Phonetic variations of the keywords | Query variant submitted to Google | No. of results | Precision@10
1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
1 | | | ससजिमो जहयीर | 632 | 0.9
1 | | | सिबजमो िहयीर | 194 | 0.9
2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
2 | | | आसभ न भहॉगाई | 1040 | 1.0
2 | | | आसभ भह ग ई | 14 | 0.6
2 | | | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
3 | | | भरशटाचायआज दी | 214 | 0.6
3 | | | बयषटाचाय आज दी | 447 | 0.7
3 | | | भरषटट च यआिादी | 1090000 | 0.9
3 | | | भरशट च य आिादी | 1040 | 0.7
3 | | | बयषटट च यआिादी | 1190 | 0.8
4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
4 | | | अनन हज य आॊदोरन | 85100 | 0.8
4 | | | अनन हज य क आॉदोरन | 78 | 0.6
4 | | | अनना हिाय क आनद रन | 399 | 0.5
4 | | | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
5 | | | फयोिगायी | 80600 | 1.0
5 | | | फयोिगायी | 170 | 0.7
5 | | | फयोिगायी | 30 | 0.5


Figure 4.2 Precision charts for the phonetic nature of the Hindi language (one chart per query, Query No. 1 to Query No. 5, comparing each query with its phonetic variants)

In the above table and figure it can be seen that search engines return documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keyword used to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary; that is, a different set of documents is retrieved, with little repetition. The precision charts show that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information of use to him or her.
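One familiar source of such phonetic variants in Devanagari is the way a nasal sound is written: with an anusvara, with a chandrabindu, or with a half consonant (for example हिंदी and हिन्दी). The sketch below generates these spellings for a keyword. It is a simplified illustration under stated assumptions, not the method used in the experiments: the function name `nasal_variants` is hypothetical, only the half consonant न is considered, and every nasal mark in the word is swapped at once.

```python
# Unicode code points for the three ways of writing the nasal sound.
ANUSVARA = "\u0902"          # the dot above the line
CHANDRABINDU = "\u0901"      # the crescent-and-dot sign
NA_HALANT = "\u0928\u094D"   # half consonant "na" (na + virama)


def nasal_variants(word):
    """Return the spellings obtained by swapping one nasal notation for another.

    Simplification: only the dental nasal is handled, and all occurrences
    of a notation are replaced together.
    """
    forms = (ANUSVARA, CHANDRABINDU, NA_HALANT)
    return {word.replace(src, dst) for src in forms for dst in forms}


# "Hindi" written with an anusvara: ha, i-matra, anusvara, da, ii-matra.
hindi_with_anusvara = "\u0939\u093F\u0902\u0926\u0940"
variants = nasal_variants(hindi_with_anusvara)
for spelling in sorted(variants):
    print(spelling)
```

Each spelling produced this way can be submitted as a separate query, in the same way as the phonetic variants in the table above.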

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic “dictionary” meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (Ornament in English) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values obtained with three search engines.


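Table 4.21 Effect of word synonyms on Hindi IR

S. No. | Query | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10
Query 1: स न क आब षण (standard Hindi word: आब षण)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0.0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0.0
Query 2: क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
Query 3: सतर सशिकतकयण (standard Hindi word: सतर)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
Query 4: मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
Query 5: व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0.0
Query 6: आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1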


Figure 4.3 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Synonyms of Hindi keywords affect not only the quantity of documents returned but also their quality.

The above table and figure show that Google returns more documents than the other two search engines, and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google ranks first; that is, its precision values are higher than those of Bing and Guruji. It thus becomes clear that changing a keyword into its synonym yields equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
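The statement that Google ranks first on average can be checked by averaging the Precision@10 values of Table 4.21 for each search engine. The short sketch below does this; the values are transcribed from the table (rows 1.1 to 6.3, in order), while the variable names are chosen only for this example.

```python
from statistics import mean

# Precision@10 values transcribed from Table 4.21, rows 1.1 to 6.3.
P_AT_10 = {
    "Google": [0.8, 0.8, 0.8, 0.5, 0.5, 0.7, 0.9, 0.6, 0.9, 0.9, 0.9,
               0.8, 0.4, 0.6, 0.5, 1.0, 1.0, 0.5, 0.7, 1.0, 0.8],
    "Bing": [0.7, 0.6, 0.7, 0.4, 0.3, 0.7, 0.8, 0.6, 0.7, 0.9, 0.8,
             0.7, 0.3, 0.2, 0.6, 0.8, 0.9, 0.5, 0.5, 1.0, 0.1],
    "Guruji": [0.5, 0.5, 0.6, 0.0, 0.0, 0.3, 0.6, 0.2, 0.6, 0.4, 0.3,
               0.2, 0.6, 0.1, 0.1, 0.5, 0.9, 0.0, 0.3, 0.9, 0.1],
}

# Average precision per engine, rounded to two decimal places.
averages = {engine: round(mean(values), 2) for engine, values in P_AT_10.items()}

# Print the engines from the highest to the lowest average.
for engine, value in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(engine, value)
```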

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in another context. The fourth and sixth columns give the meaning of the query in English for the corresponding context of the keyword.

[Chart for Figure 4.3: precision values (vertical axis, 0 to 1) for queries 1.1 to 6.3 (horizontal axis); series: Google, Bing, Guruji.]


Table 4.22 List of randomly selected ambiguous queries

S. No. | Query | Keyword (first context) | In English | Keyword (second context) | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed above are tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below; the context columns count how many of the first 10 results fall in each context.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8


Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables, the last column, labelled “Other context”, holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, unintended context. The results make it clear that all the search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the “Other context” column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the “Other context” column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context of art. The results show, however, that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
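The “deviation from relevance” discussed here can be summarised per search engine by averaging the “Other context” counts over the five queries. The sketch below uses the counts from the three ambiguity tables; the data structure and the function name `mean_other_context` are assumptions made for the example, and the Guruji query with no results is simply skipped.

```python
# (context 1, context 2, other context) counts among the first 10 results,
# transcribed from Tables 4.23 to 4.25; None marks "No results found".
TOP10_COUNTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def mean_other_context(rows):
    """Average number of top-10 results that fall in neither listed context."""
    others = [row[2] for row in rows if row is not None]
    return sum(others) / len(others)


for engine, rows in TOP10_COUNTS.items():
    print(engine, mean_other_context(rows))
```

A higher average means that more of the first ten results fall outside both listed senses of the keyword.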


4.13 Discussion: Influence of English on Hindi Information Retrieval

The English language influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: the English word exercise is written in Hindi as (एकसयस इज). It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains have been tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi. Each cell gives the Hindi form followed, in parentheses, by the number of results returned by Google.

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2
Woman | वोभन (42200) | व भन (101000) | व भ न (3520) | व भ न (34800)
Insurance | इनसयाॊस (1220) | इ शम यस (246000) | इ शम य स (1390) | इनशम य नस (3710)
Cancer | क सय (1880000) | क सय (35700) | क नसय (6490) | क नसय (1820)
Hospital | हॉसटऩटर (404000) | ह िसऩटर (263200) | ह िसऩटर (2440) | ह सऩ टर (100)
Corruption | कोररसततओन (1890) | कयपशन (368000) | कयऩशन (1110) | क यपशन (58)
Computer | कॊ तमटय (4450000) | कमपम टय (1040000) | कमपम टय (2610000) | क पम टय (537000)
University | उननवशसतम (1540) | म तनवमसमटी (1070000) | म तनव मसमटी (1420) | म तनवसमटी (3270)
Director | डियकटय (4600) | ड मय कटय (735000) | ड इय कटय (97000) | ड मय कटय (8300)
Accident | एसकसिट (26300) | एकस डट (125000) | एकस ड नट (2510) | ऐिकसडट (5160)
Parliament | ऩशरअभट (3240) | ऩ मरमम भट (38000) | ऩ मरमभट (639) | ऩ मरमम भट (1120)
Specialist | टऩशसअशरटट (143) | सऩ मशममरसट (1440) | सऩ शमरसट (51300) | सऩ श मरसट (9820)
Expert | एकसऩट (269000) | एकसऩटम (8730) | ऐकसऩटम (182) | ऐकसऩटम (8)

Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords.

Page 44: Chapter 4 Issues in Information Retrieval for Hindi Language

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 129

Figure 42 Precision Charts for Phonetic nature of Hindi language

In the above table and figure it can be clearly seen that search engines return a

handful of documents on various Hindi phonetically equivalent queries It is

observed that no particular standard exists for writing the keyword to fetch Hindi

34

33

33

Query No 1

1 11 12

31

30

18

21

Query No 2

2 21 22 23

18

13

1520

16

18

Query No 3

3 31 32

33 34 35

5823

109

Query No 4

1st Qtr 2nd Qtr

3rd Qtr 4th Qtr

29

32

23

16

Query No 5

5 51 52 53

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 130

web data For every phonetically equivalent keywords in the query variation in

the results exist Ie a different set of documents are retrieved with least

repetition From the precision chart it is clearly observed that the degree of

relevance for queries containing phonetically equivalent keywords is almost same

or nearly equal The native Hindi user may not be aware of the Phonetic issues in

Hindi IR and may miss the relevant information of hisher use

411 Discussion Words Synonyms

A word can express a myriad of implications connotations and attitudes

in addition to its basic ―dictionary meaning And a word often has near

synonyms that differ from it solely in these nuances of meaning Choosing the

right word can be difficult for people as well as for the information retrieval

system For example the word (आब षण) in Hindi (Ornament) in English has

following commonly spoken synonyms गहन ज वयअर क य

Table 421 and Figure 43 have been presented below which shows the

comparison of precision values against three search engines

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 131

Table 421 Effect of word synonyms on Hindi IR

S NO Query Standard

Hindi

Words

Synonyms Documents Returned

Google Per

10

Bing Per

10

Guruj

i

Per

10

1 स न क आब

षण

आब षणगहन

11 स न क आब षण 217000 08 3250 07 381 05

12 स न क गहन 188000 08 2590 06 389 05

13 स न क ज वय 78900 08 1670 07 311 06

14 स न क अर क य 9490 05 633 04 70 0

15 स न क आबयण 493 05 38 03 1 0

2 क र फ दर फ दर

21 क र फ दर 233000 07 7510 07 733 03

22 क र भ घ 40700 09 1500 08 99 06

23 क र जरधय 1570 06 54 06 2 02

3 सतर सशिकतकयण

सतर न यी

31 सतर सशिकतकयण

9950 09 1570 07 760 06

32 न यीसशिकतकयण

29300 09 1910 09 736 04

33 भहहर सशिकतकयण

96300 09 5160 08 1091 03

34 औयतसशिकतकयण

7670 08 680 07 510 02

4 मसक दयक अ हक य

अ हक य

41 मसक दयक अ हक य

1990 04 18 03 60 06

42 मसक दयक अमबभ न

2400 06 304 02 16 01

43 मसक दयक घभ ड

495 05 54 06 9 01

5 व रग ओ व

51 व रग ओ 6960 1 698 08 29 05

52 ऩ डरग ओ 13400 1 1080 09 143 09

53 दयखतरग ओ 481 05 19 05 0 0

6 आ खद न आ ख

61 आ खद न 34000 07 3690 05 312 03

62 न तरद न 77500 1 3240 1 159 09

63 च द न 2450 08 427 01 36 01

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 132

Figure 43 Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords with their

synonyms improves the information retrieval against a query in Hindi language

Not only quantity of documents returned is affected but quality is also affected by

using synonyms of Hindi keywords

From the above table and figure it is to be observed that documents returned by

Google are more in quantity than other two search engines and least number of

documents get returned by Guruji search engine the reason behind may be

availability of less documents or poor indexing However we are interested in

quality of results than quantity As far as quality of results is concerned it can be

clearly seen that Google and Bing provide quality data than Guruji And in the

average case Google still stands first in the row that means precision values by

Google are more than that of Bing and Guruji in this case Thus it becomes clear

that by changing a keyword into its synonym equivalent results can be obtained

Therefore it is evident that synonyms of keywords play an important role in the

process of Hindi information retrieval system

412 Discussion Ambiguity

In a sample set of 50 ambiguous queries below we present five randomly

selected ambiguous queries In figure 35 second column contains five queries in

Hindi third column holds the ambiguous keyword in one context and fifth

column holds the same ambiguous keyword in other context Fourth and sixth

columns hold the meaning of queries in English with respect to the ambiguous

keyword in context

0

02

04

06

08

1

11 12 13 14 15 21 22 23 31 32 33 34 41 42 43 51 52 53 61 62 63

Google

Bing

Guruji

Table 4.22 List of randomly selected ambiguous queries

S.No. | Query | Keyword (first context) | In English | Keyword (second context) | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8

Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

The results in Tables 4.23 to 4.25 show that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other context", holds the number of results that are not relevant to the supplied query, or that contain the keyword in some other, unintended context. All three search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.

For the query सऩ यो क पन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address ambiguity in keywords in order to obtain better results.
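The "deviation from relevance" discussed for Tables 4.23 to 4.25 can be summarised per search engine. The sketch below is illustrative only: the tuples are copied from the three tables, while the function name `other_context_share` and the choice to pool all inspected results per engine are assumptions, not part of the original study.

```python
from typing import Dict, List, Optional, Tuple

# (context 1 hits, context 2 hits, other-context hits) among the first 10 results,
# copied from Tables 4.23 to 4.25; None marks a query with no results.
Row = Optional[Tuple[int, int, int]]

AMBIGUITY_RESULTS: Dict[str, List[Row]] = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def other_context_share(rows: List[Row]) -> float:
    """Fraction of inspected results that fall in the 'Other context' column."""
    answered = [row for row in rows if row is not None]
    inspected = sum(sum(row) for row in answered)
    if inspected == 0:
        return 0.0
    return sum(row[2] for row in answered) / inspected


shares = {engine: other_context_share(rows) for engine, rows in AMBIGUITY_RESULTS.items()}
for engine, share in shares.items():
    print(f"{engine}: {share:.2f}")
```

Pooling the counts treats every inspected result equally; averaging per query instead would weight the query for which Guruji returned nothing differently.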

4.13 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (transliteration) | Google results (standard) | Google results (equivalent 1) | Google results (equivalent 2)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8

The table shows that the search engine returns documents not only for the standard single-keyword query but also, in large numbers, for every phonetic variant of the keyword. It also shows that people represent these words in Hindi script in their own ways and that no standard is followed for storing Hindi data on the web. The Google transliteration column holds the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of "University" should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
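Because documents exist under every spelling variant, one way to reach them is to expand a query into all combinations of known variants. The following is a minimal sketch under stated assumptions: the variant table `VARIANTS`, its romanized placeholder spellings, and the function `expand_query` are hypothetical and do not come from the original study or its software.

```python
from itertools import product
from typing import Dict, List

# Hypothetical variant table: each keyword maps to the spellings under which it
# may appear on the web. Romanized placeholders stand in for Devanagari forms.
VARIANTS: Dict[str, List[str]] = {
    "yunivarsiti": ["yunivarsiti", "yunivarsity", "univarsiti"],
    "pravesh": ["pravesh"],
}


def expand_query(query: str, variants: Dict[str, List[str]]) -> List[str]:
    """Return every query obtained by substituting spelling variants per token."""
    # Tokens without an entry in the table are kept unchanged.
    options = [variants.get(token, [token]) for token in query.split()]
    return [" ".join(choice) for choice in product(*options)]


expanded = expand_query("yunivarsiti pravesh", VARIANTS)
for candidate in expanded:
    print(candidate)
```

The number of expanded queries is the product of the variant counts per token, so a practical tool would likely cap the number of variants tried per keyword.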

Table 4.26 (continued)

English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (transliteration) | Google results (standard) | Google results (equivalent 1) | Google results (equivalent 2)
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.

The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:

प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Thus the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries that mix English and Hindi while the meaning remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
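The two-level transformation described above can be sketched as a left-to-right keyword substitution. This is an illustration, not the software described in the next chapter: the lookup table `ENGLISH_EQUIVALENTS`, the function `transform`, and the romanized tokens (taken from the romanization given in the example) are assumptions made for readability.

```python
from typing import Dict, List

# Hypothetical lookup: Hindi keyword (romanized) -> English keyword as it would
# be written in Hindi script (also romanized here for readability).
ENGLISH_EQUIVALENTS: Dict[str, str] = {
    "videshi": "foreign",
    "nivesh": "investment",
}


def transform(query: str, equivalents: Dict[str, str], level: int) -> str:
    """Replace at most `level` replaceable keywords, scanning left to right."""
    out: List[str] = []
    replaced = 0
    for token in query.split():
        if replaced < level and token in equivalents:
            out.append(equivalents[token])
            replaced += 1
        else:
            out.append(token)
    return " ".join(out)


original = "videshi nivesh bharat mein"
level_1 = transform(original, ENGLISH_EQUIVALENTS, level=1)
level_2 = transform(original, ENGLISH_EQUIVALENTS, level=2)
print(level_1)  # foreign nivesh bharat mein
print(level_2)  # foreign investment bharat mein
```

Here level 2 is modelled as "up to two replacements"; the text only requires that more than one word be replaced, so other definitions are equally possible.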

Table 4.27 Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation)

In English | Hindi query | Level 1 | Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses
[chart: Google result counts for Hindi Queries 1 to 6, each shown with its Level 1 and Level 2 variants]

The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and in considerable quantity. For search engines, however, the quality of results matters more than the quantity; Table 4.28 and Figure 4.5 therefore present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve the web results.

Table 4.28 Analysis of precision values (tabular representation)

Values are precision at 10. HQ = original Hindi query, L1 = Level 1 query, L2 = Level 2 query.

Hindi query | Level 1 query | Level 2 query | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

Figure 4.5 Analysis of inter-precision values

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the data retrieved, and the relevance of that data can be seen in the precision columns. For most Hindi queries and their transformed variants, the degree of relevance is close to, equal to, or better than that of the original query. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant. A similar pattern holds for most of the other Hindi queries in the table above.

Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the variant query that reflects English influence. The table suggests that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
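The precision values discussed above are the share of relevant documents among the first 10 results. A minimal sketch of that measure follows; the function name `precision_at_k` and the sample judgments are hypothetical and are used only to show the arithmetic.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Share of the first k results judged relevant (divides by k, not by len)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


# Hypothetical judgments for the first 10 results of one query: 9 relevant, 1 not.
judgments = [True, True, True, False, True, True, True, True, True, True]
p10 = precision_at_k(judgments)
print(p10)
```

Dividing by k rather than by the number of returned results means a query that returns fewer than 10 documents is penalised, which is one of several possible conventions.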

[Figure 4.5, chart: inter-precision values for HQ1 to HQ6 with their L1 and L2 variants (HQ = Hindi Query, L = Level); series: Google, Bing, AltaVista]

Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level could cause performance and throughput problems. The problem can, however, be addressed at the interface level. To reduce the effort a Hindi user must make to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With this tool, the user can widen the scope of a web search in Hindi [66][67].
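An interface-level approach of this kind can be pictured as a thin layer that sends each query variant to a search backend and merges the ranked results. The sketch below is not the tool described in the next chapter: `search_all_variants`, the round-robin merge, and the stand-in backend `fake_search` with its `FAKE_INDEX` data are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List

SearchFn = Callable[[str], List[str]]  # query -> ranked list of result identifiers


def search_all_variants(variants: Iterable[str], search: SearchFn, limit: int = 10) -> List[str]:
    """Run every query variant and interleave results round-robin, dropping repeats."""
    if limit <= 0:
        return []
    ranked = [search(query) for query in variants]
    merged: List[str] = []
    seen = set()
    depth = max((len(results) for results in ranked), default=0)
    for position in range(depth):
        for results in ranked:
            if position < len(results) and results[position] not in seen:
                seen.add(results[position])
                merged.append(results[position])
                if len(merged) == limit:
                    return merged
    return merged


# Stand-in backend for the example; a real interface would call a search engine.
FAKE_INDEX = {
    "videshi nivesh bharat mein": ["doc-a", "doc-b"],
    "foreign nivesh bharat mein": ["doc-b", "doc-c"],
    "foreign investment bharat mein": ["doc-d"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


merged = search_all_variants(FAKE_INDEX.keys(), fake_search)
print(merged)
```

Round-robin interleaving gives each variant an equal chance at the top positions; a real tool might instead weight the original query more heavily.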


web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with minimal repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss information relevant to his or her needs.

4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3 below compare the precision values across three search engines.

Table 4.21 Effect of word synonyms on Hindi IR

S.No. | Query | Standard Hindi word (synonyms) | Google documents | Google per 10 | Bing documents | Bing per 10 | Guruji documents | Guruji per 10
1 | स न क आब षण | आब षण, गहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर, न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1

Figure 4.3 Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Synonyms of Hindi keywords affect not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are interested in the quality of results rather than the quantity. In terms of quality, Google and Bing clearly provide better results than Guruji, and on average Google ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Therefore, synonyms of keywords play an important role in Hindi information retrieval.
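The claim that Google ranks first on average can be checked directly from the "per 10" columns of Table 4.21. The sketch below assumes that the extracted values such as "08" stand for 0.8; the variable names are illustrative only.

```python
from statistics import mean
from typing import Dict, List

# "Per 10" precision values for sub-queries 1.1 to 6.3, read from Table 4.21
# with the decimal point restored (e.g. "08" is taken to mean 0.8).
PRECISION: Dict[str, List[float]] = {
    "Google": [0.8, 0.8, 0.8, 0.5, 0.5, 0.7, 0.9, 0.6, 0.9, 0.9, 0.9, 0.8,
               0.4, 0.6, 0.5, 1.0, 1.0, 0.5, 0.7, 1.0, 0.8],
    "Bing": [0.7, 0.6, 0.7, 0.4, 0.3, 0.7, 0.8, 0.6, 0.7, 0.9, 0.8, 0.7,
             0.3, 0.2, 0.6, 0.8, 0.9, 0.5, 0.5, 1.0, 0.1],
    "Guruji": [0.5, 0.5, 0.6, 0.0, 0.0, 0.3, 0.6, 0.2, 0.6, 0.4, 0.3, 0.2,
               0.6, 0.1, 0.1, 0.5, 0.9, 0.0, 0.3, 0.9, 0.1],
}

# Mean precision per engine, then engines ordered from best to worst.
averages = {engine: mean(values) for engine, values in PRECISION.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for engine in ranking:
    print(f"{engine}: {averages[engine]:.2f}")
```

Under that reading of the table, the ordering is Google, then Bing, then Guruji, which agrees with the discussion above.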

412 Discussion Ambiguity

In a sample set of 50 ambiguous queries below we present five randomly

selected ambiguous queries In figure 35 second column contains five queries in

Hindi third column holds the ambiguous keyword in one context and fifth

column holds the same ambiguous keyword in other context Fourth and sixth

columns hold the meaning of queries in English with respect to the ambiguous

keyword in context

0

02

04

06

08

1

11 12 13 14 15 21 22 23 31 32 33 34 41 42 43 51 52 53 61 62 63

Google

Bing

Guruji

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 133

Table 422 List of randomly selected ambiguous queries

Ambiguous queries mentioned above in the figure are tested for results against

three search engines Google Bing and Guruji Results are shown below in tables

Table 423 Ambiguity test for Google

Table 424 Ambiguity test for Bing

SNo Query For keyword

as

In English For keyword as In English

1 न यी क स न ऩस द ह स न (Gold) Women like Gold स न (To sleep)

Women like to

sleep

2 आभ र गो की ऩस द आभ

(common)

Common manlsquos

choice आभ (Mango) Mango is

peoplelsquos choice

3 फ र षवक सऔय ऩ षण फ र (Children) Child Development

and Nutrition फ र(Hair) Hair

Development and

Nutrition

4 सऩ यो क पन पन (Art) Art of snake

charmers पन(Snake head) Snake charmerlsquos

snake head

5 म दध भ क र षवन श क र

(Aggregate)

Aggregate

destruction in wars क र(family) Destruction of

families in war

Query Ambiguous

keyword Documents

returned

Google Other

Context Results Found

Context Context

1 स न 50800 Gold 5 To sleep 2 3

2 आभ 488000 Common 3 Mango 3 4

3 फ र 2900000 Children 7 Hair 3 0

4 पन 184 Art 0 Snake head 10 0

5 क र 17800 Aggregate 2 Family 3 5

Query Ambiguous

keyword Documents

returned

Bing Other

Context Results Found

Context Context

1 स न 2680 Gold 2 To sleep 2 6

2 आभ 17800 Common 3 Mango 3 4

3 फ र 4030 Children 6 Hair 2 2

4 पन 25 Art 0 Snake head 9 1

5 क र 1900 Aggregate 0 Family 2 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 134

Table 425 Ambiguity test for Guruji

From the above results obtained in tables it is observed that all three search

engines return documents without differentiating between the contexts of

keyword in the query In the above table the last column labeled as ―other

Context holds the number of results which are not relevant to the query supplied

or those documents which contains the keywords in other non required context

From the results it is clear that all search engines return documents in different

contexts Therefore it can be said that search engines underperform when supplied

with ambiguous queries Numbers in column labeled as ―other Context signifies

the deviation from relevance For example for query म दध भ क र षवन श

(aggregate destruction in wars) the column ―Other Context for Google contains 5

documents for Bing contains 8 documents and for Guruji contains all 10

documents

In another query सऩ यो क पन (art of snake charmers) another context (Snake

charmerlsquos snake head) retrieved documents are expected to be in context (art) but

from the above results obtained it can be seen that google returns all 10

documents in non required context (snake head) and Bing returns 9 documents

where as Guruji fails to retrieve even a single document In the above scenario it

becomes important for the search engines to address to the issue of ambiguity in

keywords to obtain better results

Query Ambiguous

keyword Documents Returned

Guruji Other

Context Results Found

Context Context

1 स न 109 Gold 0 To sleep 0 10

2 आभ 6756 Common 3 Mango 0 7

3 फ र 635 Children 5 Hair 2 3

4 पन No Results Found

Art na Snake head na na

5 क र 84 Aggregate 0 Family 0 10

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 135

413 Discussion Influence of English on Hindi Information retrieval

English language has its influence over Hindi not only in speaking but in

writing too When we talk about especially Hindi literature on web it becomes

more evident Influence of English on Hindi language has been observed as one of

the very important parameters for Hindi Information retrieval which is more

clearly explained in the example as

Example In English the word exercise is written in Hindi as (एकसयस इज) The

word exercise (एकसयस इज) has following phonetic variations एकसयस इज

एकसयस इज एकस यस इज एकसयस इज in Hindi As per various phonetic

variations mentioned above in the example a variety of popular keywords and

queries have been tested for experiments from various domains

In the following figure a sample set of common and popular English keywords

along with their phonetic variants written in Hindi can be seen

English Words

Google Transliteration

Standard Hindi Keywords

Phonetic Equivalents Search Engine Google

Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800

Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710

Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820

Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100

Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000

0 537000

University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270

Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300

Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160

Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट

3240 38000 639 1120

specialist टऩशसअशरटट

सऩ मशममरसट

सऩ शमरसट सऩ श मरसट

143 1440 51300 9820

Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 136

Table 426 keywords along with their phonetic variants written in Hindi

From the above table it can be observed that search engine does return documents

for single keyword query documents for all phonetic variants of the keywords are

also returned which are huge in number It can also be seen that that people have

their own way of representing the Hindi words and no standard is followed for

storing Hindi data on web Also the documents are retrieved for every

phonetically variant English Keyword written in Hindi script In the above table

the column with bold Hindi entries shows the keywords which are obtained by

using Google transliteration tool The table 426 shows that transliteration does

not provide correct Hindi word in most of the cases For example the correct

transliteration for word University should be म तनवमसमटीwhereas Google

transliteration provides the word उतनव मसमतम which is completely wrong It is

clearly evident from the figure above that 1540 documents have been retrieved for

the wrong keyword उतनव मसमतम (University) and the same follows for other single

word queries Insurance इनस य स

ParliamentऩमरमअभटCorruptionक रम िपतओनPolicy

ऩ मरसम Specialistसऩ मसअमरसट It can be analyzed that Hindi website developers

make use of unchecked and non standard transliteration which makes the Hindi IR

process a difficult task

In the next example Multiword Hindi queries are selected to test the effect

of English influence on Hindi IR on precision and quantity of documents

retrieved The Hindi query is transformed into it variants by the software (design

and working discussed in next chapter) by replacing the Hindi keywords with

English keywords written in Hindi without changing the meaning of the query

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 137

The queries are converted into two levels In first level one Hindi word is replaced

by its English equivalent and in second level more than one words is replaced by

their English equivalent words without changing the meaning of the original

Hindi query Example

An English query ―Foreign investment in India can be written in Hindi as

―षवद श तनव श ब यत भ where Hindi keyword षवद श means ―Foreign ―प य न

and तनव श means ―investment ―इनव सटभट The query for the two levels is

transformed as

प य न तनव श ब यत भ (Foreign nivesh bharat mein)

प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query ―षवद श तनव श ब यत भ ―videshi

nivesh Bharat mein supplied by the user is transformed into two equivalent

senses containing a mixture of both English and Hindi language where meaning

of the query remains same From the sample set of one hundred queries some

randomly selected queries are presented below in Table 427

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 138

Table 427 Transformed queries into two equivalent senses containing a mixture

of both English and Hindi Tabular representation

Figure 44 Transformed Queries into two equivalent senses

1520

8360

1020

Hindi Query 1

Level 1

Level 2

95800

37200

2150

Hindi Query 2

Level 1

Level 2

209000

85800

1660

Hindi Query 3

Level 1

Level 2

112000

0

49000271

Hindi Query 4

Level 1

Level 2

658000

52766

Hindi Query 5

Level 1

Level 2

181000

19700

47000

Hindi Query 6

Level 1

Level 2

Influenced Hindi Query Google

In English Hindi Query Level 1 Level 2 Search Results

Health and blood donation

सव सथ औय यकतद न

हलथ औय यकतद न

हलथ औय जरि_िोनिन

1520 8360 1020

Treatment for Blood pressure

यकतच ऩ क इर ज

जरिपरिय क इर ज

जरिपरिय क टरीटभट

95800 37200 2150

Cardiologist रदम चचककतसक हाट िॉकटय काडिमोरासिटट 241000 99100 1770

Government Employment Policy

सयक य दव य य जग य म जन

सयक य दव य य जग य टकीभ

सयक य दव य एमपतरॉमभट सकीभ

1840000 51600 93

Foreign investment in

India

षवद श तनव श ब यत भ

पायन तनव श ब यत भ

पायनइनवटटभट ब यत भ

658000 527 66

Corruption free India

भरषटट च य भ कत ब यत

कयतिन भ कत ब यत

कयतिन फरी इॊडिमा

181000 19700 47000

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 139

From the above table and figure it is evident that documents are returned for

original as well as transformed Hindi query and the quantity of the retrieved

documents is quite considerable In case of search engines the quality of results is

more important than the quantity therefore Table 428 and figure 45 are presented

below for the analysis of precision values Three popular search engines namely

Google Bing and Alta Vista are used for retrieving web results

Table 428 Analysis of precision values Tabular representation

Hindi

Query

Influenced Hindi Query Precision 10

Level 1 Level 2 Google Bing AltaVista

सव सथऔययकतद न

ह लथऔययकतद न

ह लथऔयबरड_ड न शन

09 09 09 09 08 08 09 09 08

यकतच ऩक इर ज

बरडपर शयक इर ज

बरडपर शयक टरीटभट

08 09 07 08 07 05 08 07 05

रदमचचककतसक

ह टमचचककतसक

क रड मम र िजसट 09 08 1 09 07 1 09 07 1

सयक यदव य य जग यम जन

सयक यदव य य जग यसकीभ

सयक यदव य एमपरॉमभटसकीभ 1 09 08 08 08 08 06 07 08

षवद श तनव शब यतभ

प य नतनव शब यतभ

प य नइनव सटभटब यतभ

09 08 08 09 08 04 09 09 04

भरषटट च यभ कतब यत

कयपशनभ कतब यत

कयपशनफरीइ रडम

1 1 1 1 1 1 1 1 1

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 140

Figure 45 Analysis of Inter precision values

From the above tables and figures it can be clearly seen that Hindi data of similar

nature can be mined out against Hindi queries by transforming them into their

variants by including English keywords written in Hindi The transformation of

queries resulted in an increase of retrieved data The relevance of the retrieved

data can also be seen in the precision column For every Hindi query and its

transformed variations the degree of relevance of documents is very close or

equal or improved eg for the Hindi query सव सथ औय यकतद न 9 out of first 10

documents are relevant and for transformed queries which are of similar nature

ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन 9 of the first 10 documents are

relevant and the same repeats for rest of the Hindi queries as shown in the table

above Without transformation of Hindi queries the user may miss the chance of

retrieving the relevant information as the Hindi user may not be aware of the

presence of such information on web and may be unable to formulate the

variation query based on the factor of English influence From the above table it

can be said that English influenced Hindi information is present and is increasing

day by day on web By the inclusion of the English keywords in Hindi script in

the form of query the scope of searching in Hindi and getting relevant

information can be increased

09 09 09 08 09 07 09 08 1 1 09 08 09 08 08 1 1 1

09 08 08 08 07 0509 07

1 08 08 08 09 08 041 1 1

09 09 08 08 0705

0907

106 07 08 09 09

04

1 1 1

HQ1 L1 L2 HQ2 L1 L2 HQ3 L1 L2 HQ4 L1 L2 HQ5 L1 L2 HQ6 L1 L2

Inter Precession Chart HQ (Hindi Query) L (Level)

Google Bing AltaVista

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 141

The process of Hindi IR becomes more difficult because of the structure of

Hindi Language Generally people do not follow the actual Hindi writing standard

which widens the gap between Hindi web data and users

The relevant information can be mined out by transforming the Hindi queries

Search engines neither make transformations of the query nor find keyword

equivalents Because they may have the performance and throughput problems if

parameters like Hindi Phonetics synonyms and English equivalent Hindi

keywords are implemented at root level However this problem can be solved at

interface level Therefore to lessen the efforts of a Hindi user to search such

information a software has been developed (a detailed description has been

mentioned in next Chapter) which acts like an interface between user and search

engines With the help of this tool user can widen the scope of search on web in

Hindi language [66][67]

Page 46: Chapter 4 Issues in Information Retrieval for Hindi Language

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 131

Table 421 Effect of word synonyms on Hindi IR

S NO Query Standard

Hindi

Words

Synonyms Documents Returned

Google Per

10

Bing Per

10

Guruj

i

Per

10

1 स न क आब

षण

आब षणगहन

11 स न क आब षण 217000 08 3250 07 381 05

12 स न क गहन 188000 08 2590 06 389 05

13 स न क ज वय 78900 08 1670 07 311 06

14 स न क अर क य 9490 05 633 04 70 0

15 स न क आबयण 493 05 38 03 1 0

2 क र फ दर फ दर

21 क र फ दर 233000 07 7510 07 733 03

22 क र भ घ 40700 09 1500 08 99 06

23 क र जरधय 1570 06 54 06 2 02

3 सतर सशिकतकयण

सतर न यी

31 सतर सशिकतकयण

9950 09 1570 07 760 06

32 न यीसशिकतकयण

29300 09 1910 09 736 04

33 भहहर सशिकतकयण

96300 09 5160 08 1091 03

34 औयतसशिकतकयण

7670 08 680 07 510 02

4 मसक दयक अ हक य

अ हक य

41 मसक दयक अ हक य

1990 04 18 03 60 06

42 मसक दयक अमबभ न

2400 06 304 02 16 01

43 मसक दयक घभ ड

495 05 54 06 9 01

5 व रग ओ व

51 व रग ओ 6960 1 698 08 29 05

52 ऩ डरग ओ 13400 1 1080 09 143 09

53 दयखतरग ओ 481 05 19 05 0 0

6 आ खद न आ ख

61 आ खद न 34000 07 3690 05 312 03

62 न तरद न 77500 1 3240 1 159 09

63 च द न 2450 08 427 01 36 01

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 132

Figure 43 Comparison of precision values against three search engines

The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Synonyms of Hindi keywords affect not only the quantity of documents returned but also their quality.

The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better-quality data than Guruji. In the average case Google still ranks first, that is, the precision values for Google are higher than those for Bing and Guruji. It thus becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in the Hindi information retrieval process.
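The synonym substitution discussed above can be sketched in a few lines. This is a minimal illustration, not the software used in the study: the function name `synonym_variants` and the idea of passing the synonym list in by hand are assumptions, and the Hindi strings are illustrative readings of the first query group in the synonym table.

```python
def synonym_variants(query: str, keyword: str, synonyms: list[str]) -> list[str]:
    """Return the original query followed by one variant per synonym."""
    if keyword not in query:
        # Nothing to substitute; only the original query can be issued.
        return [query]
    return [query] + [query.replace(keyword, synonym) for synonym in synonyms]


# Illustrative values only; the synonym list stands in for a hypothetical lookup.
variants = synonym_variants("सोने के आभूषण", "आभूषण", ["गहने", "जेवर"])
for variant in variants:
    print(variant)
```

Each variant would then be submitted to the search engine separately, which is how the per-synonym rows of the table can be compared.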

412 Discussion Ambiguity

From a sample set of 50 ambiguous queries, five randomly selected ambiguous queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in another context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.


Table 4.22: List of randomly selected ambiguous queries

| S. No. | Query | Keyword as | In English | Keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed in the table above were tested against three search engines: Google, Bing, and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |


Table 4.25: Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

The results in the three ambiguity tables show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, non-required context. All three search engines return documents in differing contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
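The "deviation from relevance" read off the "Other context" column can be expressed as a simple fraction of the inspected results. The sketch below assumes, as the row totals suggest, that ten results were inspected per query; the class name `TopTenBreakdown` and its field names are hypothetical, and the numbers are those reported for query 5 in the three ambiguity tables.

```python
from dataclasses import dataclass


@dataclass
class TopTenBreakdown:
    engine: str
    context_1: int      # results in the first listed sense
    context_2: int      # results in the second listed sense
    other_context: int  # results in neither listed sense

    def deviation(self) -> float:
        """Fraction of inspected results that fall in the 'Other context' column."""
        total = self.context_1 + self.context_2 + self.other_context
        return self.other_context / total if total else 0.0


# Query 5 (aggregate destruction in wars), values from the ambiguity tables.
query_5 = [
    TopTenBreakdown("Google", 2, 3, 5),
    TopTenBreakdown("Bing", 0, 2, 8),
    TopTenBreakdown("Guruji", 0, 0, 10),
]
deviations = {row.engine: row.deviation() for row in query_5}
print(deviations)
```

Expressing the column as a fraction makes the engines directly comparable even if a different number of results were inspected per query.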


413 Discussion Influence of English on Hindi Information retrieval

The English language influences Hindi not only in speech but also in writing. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word "exercise" is written in Hindi as एकसयस इज. The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table presents a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: Google transliteration | Google results: standard keyword | Google results: equivalent 1 | Google results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |

The table above shows that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries (Google Transliteration) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked, non-standard transliteration, which makes Hindi IR a difficult task.
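One way to read the counts in Table 4.26 is to ask what share of the reported results a user reaches when searching only the standard spelling. The sketch below uses the counts reported for "Woman"; the function name `standard_share` and the dictionary keys are hypothetical, and the calculation assumes the four result sets do not overlap, which the table does not establish.

```python
def standard_share(counts: dict[str, int], standard_key: str = "standard") -> float:
    """Share of all reported results that belong to the standard spelling."""
    total = sum(counts.values())
    return counts[standard_key] / total if total else 0.0


# Google result counts reported for "Woman" (assumed disjoint for illustration).
woman_counts = {
    "google_transliteration": 42200,
    "standard": 101000,
    "variant_1": 3520,
    "variant_2": 34800,
}
share = standard_share(woman_counts)
print(f"{share:.3f}")
```

Under that assumption, a little over half of the reported documents carry the standard spelling, so the remaining spellings are far from negligible.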

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. The software (whose design and working are discussed in the next chapter) transforms the Hindi query into its variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

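Table 4.26 (continued; English word, Google transliteration, standard Hindi keyword, two phonetic equivalents, then the four Google result counts in the same order):

| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |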


The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, in both cases without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)

फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Thus the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
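The two-level transformation described above can be sketched as follows. This is only an illustration under stated assumptions, not the software described in the next chapter: the function name `two_level_variants` and the hand-written replacement dictionary are hypothetical, and the sketch picks the first replaceable word for level 1, a choice the text does not specify.

```python
def two_level_variants(query: str, english_in_hindi: dict[str, str]) -> dict[str, str]:
    """Build level 1 (one word replaced) and level 2 (all replaceable words replaced)."""
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in english_in_hindi]

    def substitute(positions: set[int]) -> str:
        # Swap in the English-in-Hindi-script form only at the chosen positions.
        return " ".join(
            english_in_hindi[word] if i in positions else word
            for i, word in enumerate(words)
        )

    return {
        "original": query,
        "level_1": substitute(set(replaceable[:1])),
        "level_2": substitute(set(replaceable)),
    }


# Hypothetical lookup of Hindi keywords to English words written in Hindi script.
lookup = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
result = two_level_variants("विदेशी निवेश भारत में", lookup)
for level, text in result.items():
    print(level, text)
```

If a query contains only one replaceable keyword, level 2 coincides with level 1 in this sketch.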

Table 4.27: Transformed queries in two equivalent senses containing a mixture of English and Hindi (tabular representation)

| In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4: Transformed queries into two equivalent senses (chart of Google result counts for Hindi Queries 1 to 6 at the original query, Level 1, and Level 2)


The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, are used for retrieving web results.

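Table 4.28: Analysis of precision values (tabular representation)

| Query | Level | Hindi query / influenced Hindi query | Precision@10: Google | Precision@10: Bing | Precision@10: AltaVista |
|---|---|---|---|---|---|
| HQ1 | Original | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9 |
| HQ1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9 |
| HQ1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8 |
| HQ2 | Original | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8 |
| HQ2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7 |
| HQ2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5 |
| HQ3 | Original | हृदय चिकित्सक | 0.9 | 0.9 | 0.9 |
| HQ3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7 |
| HQ3 | Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1 |
| HQ4 | Original | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6 |
| HQ4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7 |
| HQ4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8 |
| HQ5 | Original | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9 |
| HQ5 | Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9 |
| HQ5 | Level 2 | फॉरेन इन्वेस्टमेंट भारत में | 0.8 | 0.4 | 0.4 |
| HQ6 | Original | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1 |
| HQ6 | Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1 |
| HQ6 | Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1 |

Figure 4.5: Analysis of inter-precision values (chart of precision for HQ1 to HQ6 at the original query, L1, and L2; HQ = Hindi Query, L = Level; series: Google, Bing, AltaVista)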

The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming the queries into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate a query variant that reflects the influence of English. The table also suggests that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of retrieving relevant information can be increased.
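To compare the levels at a glance, the precision values of Table 4.28 can be averaged per search engine and per level. The sketch below is an illustrative summary, not part of the reported analysis: the variable names are hypothetical, and each tuple holds the precision for the original query, Level 1, and Level 2 of one Hindi query, as read from the table.

```python
from statistics import mean

# (original, level 1, level 2) precision per Hindi query HQ1..HQ6, from Table 4.28.
precision = {
    "Google": [(0.9, 0.9, 0.9), (0.8, 0.9, 0.7), (0.9, 0.8, 1.0),
               (1.0, 0.9, 0.8), (0.9, 0.8, 0.8), (1.0, 1.0, 1.0)],
    "Bing": [(0.9, 0.8, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
             (0.8, 0.8, 0.8), (0.9, 0.8, 0.4), (1.0, 1.0, 1.0)],
    "AltaVista": [(0.9, 0.9, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
                  (0.6, 0.7, 0.8), (0.9, 0.9, 0.4), (1.0, 1.0, 1.0)],
}
levels = ("original", "level_1", "level_2")

# zip(*rows) regroups the per-query tuples into one column of values per level.
mean_precision = {
    engine: {level: round(mean(values), 3) for level, values in zip(levels, zip(*rows))}
    for engine, rows in precision.items()
}
for engine, summary in mean_precision.items():
    print(engine, summary)
```

With these values, the Google averages come out at roughly 0.92, 0.88, and 0.87 for the original, Level 1, and Level 2 queries, so the transformed queries stay close to the original in precision on average even where individual queries differ.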


Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].


Page 48: Chapter 4 Issues in Information Retrieval for Hindi Language

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 133

Table 422 List of randomly selected ambiguous queries

Ambiguous queries mentioned above in the figure are tested for results against

three search engines Google Bing and Guruji Results are shown below in tables

Table 423 Ambiguity test for Google

Table 424 Ambiguity test for Bing

SNo Query For keyword

as

In English For keyword as In English

1 न यी क स न ऩस द ह स न (Gold) Women like Gold स न (To sleep)

Women like to

sleep

2 आभ र गो की ऩस द आभ

(common)

Common manlsquos

choice आभ (Mango) Mango is

peoplelsquos choice

3 फ र षवक सऔय ऩ षण फ र (Children) Child Development

and Nutrition फ र(Hair) Hair

Development and

Nutrition

4 सऩ यो क पन पन (Art) Art of snake

charmers पन(Snake head) Snake charmerlsquos

snake head

5 म दध भ क र षवन श क र

(Aggregate)

Aggregate

destruction in wars क र(family) Destruction of

families in war

Query Ambiguous

keyword Documents

returned

Google Other

Context Results Found

Context Context

1 स न 50800 Gold 5 To sleep 2 3

2 आभ 488000 Common 3 Mango 3 4

3 फ र 2900000 Children 7 Hair 3 0

4 पन 184 Art 0 Snake head 10 0

5 क र 17800 Aggregate 2 Family 3 5

Query Ambiguous

keyword Documents

returned

Bing Other

Context Results Found

Context Context

1 स न 2680 Gold 2 To sleep 2 6

2 आभ 17800 Common 3 Mango 3 4

3 फ र 4030 Children 6 Hair 2 2

4 पन 25 Art 0 Snake head 9 1

5 क र 1900 Aggregate 0 Family 2 8

Table 4.25: Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in Tables 4.23 to 4.25, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, unintended context. The results make it clear that every search engine returns documents in several different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The number in the "Other Context" column signifies the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers), which has a second sense (the snake charmer's snake head), the retrieved documents are expected to be in the context of art. The results show, however, that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
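The comparison drawn from Tables 4.23 to 4.25 can be summarised numerically. The following is a minimal sketch, not part of the original study: it copies the per-query tallies from the three tables and computes, for each engine, the mean share of inspected results that fall in the first listed sense. Treating the first listed sense as the intended one is an assumption made for illustration (the text states it explicitly only for queries 4 and 5), and the names `TALLIES`, `intended_share` and `mean_intended_share` are hypothetical.

```python
# Tallies copied from Tables 4.23-4.25, one tuple per query:
# (first sense, second sense, other context). None = no results returned.
# Assumption: the first listed sense is the intended one.
TALLIES = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing":   [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_share(tally):
    """Fraction of inspected results that match the first (intended) sense."""
    if tally is None:
        return None
    total = sum(tally)
    return tally[0] / total if total else None


def mean_intended_share(rows):
    """Average the share over the queries that returned any results."""
    shares = [s for s in map(intended_share, rows) if s is not None]
    return sum(shares) / len(shares)


if __name__ == "__main__":
    for engine, rows in TALLIES.items():
        print(engine, round(mean_intended_share(rows), 2))
```

Queries with no results are skipped rather than counted as zero, so the Guruji average covers four queries only; counting the missing query as zero would be an equally defensible choice.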

4.1.3 Discussion: Influence of English on Hindi Information Retrieval

English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: the English word "exercise" is written in Hindi as (एकसयस इज). The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested.

Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.

Table 4.26: Keywords along with their phonetic variants written in Hindi (document counts from Google)

English word | Hindi forms, in order: Google transliteration, standard Hindi keyword, two phonetic equivalents | Documents: Google transliteration | Documents: standard Hindi keyword | Documents: phonetic equivalent 1 | Documents: phonetic equivalent 2
Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000

From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, often in large numbers. It can also be seen that people have their own ways of writing such words in Hindi, and that no standard is followed for storing Hindi data on the web. In the table, the first Hindi form in each row (set in bold in the original) is the keyword obtained from the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is wrong. Even so, 1540 documents were retrieved for this wrong keyword, and the same holds for other single-word queries such as Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and the quantity of documents retrieved in Hindi IR. The software (its design and working are discussed in the next chapter) transforms each Hindi query into variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.

The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged.

Example: the English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Thus the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants that mix English and Hindi vocabulary while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
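The two-level transformation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the software described in the next chapter: the lookup table `LOANWORDS` and the function `transform` are hypothetical, the lexicon holds just the two pairs from the worked example, and Level 1 is assumed to replace the first replaceable keyword (the thesis does not state which keyword is chosen).

```python
# Hypothetical lexicon: Hindi keyword -> English loanword in Devanagari.
# Only the two pairs from the worked example are included.
LOANWORDS = {
    "विदेशी": "फॉरेन",          # foreign
    "निवेश": "इन्वेस्टमेंट",    # investment
}


def transform(query, lexicon=LOANWORDS):
    """Return (level1, level2) variants of a whitespace-tokenised query.

    Level 1 replaces only the first replaceable keyword (an assumption);
    Level 2 replaces every replaceable keyword. A level is None when it
    cannot be formed or would not differ from the previous one.
    """
    tokens = query.split()
    hits = [i for i, tok in enumerate(tokens) if tok in lexicon]
    if not hits:
        return None, None
    level1 = list(tokens)
    level1[hits[0]] = lexicon[tokens[hits[0]]]
    if len(hits) < 2:
        return " ".join(level1), None
    level2 = [lexicon.get(tok, tok) for tok in tokens]
    return " ".join(level1), " ".join(level2)


if __name__ == "__main__":
    for variant in transform("विदेशी निवेश भारत में"):
        print(variant)
```

A dictionary lookup on whole tokens keeps the sketch simple; handling inflected forms or multiword keywords would need a morphological step that is outside the scope of this example.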

Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation; Google search results)

In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (chart of result counts for Hindi Queries 1 to 6, each shown for the original query, Level 1 and Level 2)

From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity; Table 4.28 and Figure 4.5 therefore present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve web results.

Table 4.28: Analysis of precision values (tabular representation; precision at 10)

Query | Form | Query text | Google | Bing | AltaVista
HQ1 | Hindi query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
HQ1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
HQ1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
HQ2 | Hindi query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
HQ2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
HQ2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
HQ3 | Hindi query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
HQ3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
HQ3 | Level 2 | कार्डियोलॉजिस्ट | 1.0 | 1.0 | 1.0
HQ4 | Hindi query | सरकार द्वारा रोजगार योजना | 1.0 | 0.8 | 0.6
HQ4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
HQ4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
HQ5 | Hindi query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
HQ5 | Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
HQ5 | Level 2 | फॉरेन इन्वेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
HQ6 | Hindi query | भ्रष्टाचार मुक्त भारत | 1.0 | 1.0 | 1.0
HQ6 | Level 1 | करप्शन मुक्त भारत | 1.0 | 1.0 | 1.0
HQ6 | Level 2 | करप्शन फ्री इंडिया | 1.0 | 1.0 | 1.0

Figure 4.5: Analysis of inter-precision values (inter-precision chart for Google, Bing and AltaVista; HQ = Hindi Query, L = Level; categories HQ1, L1, L2 through HQ6, L1, L2)

From the above tables and figures, it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the amount of retrieved data. The relevance of the retrieved data can be seen in the precision columns. For most Hindi queries and their transformed variants, the degree of relevance of the documents is close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant. The pattern broadly repeats for the remaining Hindi queries in the table, although precision falls for some Level 2 variants (for example, to 0.4 on Bing and AltaVista for the fifth query).

Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the variant query that reflects English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of retrieving relevant information can be increased.
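The precision values discussed above are precision at 10: the fraction of the first ten results judged relevant. A minimal sketch of that computation follows; the function name `precision_at_k` and the list of relevance judgments are illustrative assumptions, not the evaluation code used in the study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that were judged relevant.

    relevance: sequence of booleans in rank order (True = relevant).
    If fewer than k results exist, the missing ranks count as not relevant.
    """
    top = list(relevance)[:k]
    if not top:
        return 0.0
    return sum(1 for judged in top if judged) / k


if __name__ == "__main__":
    # Hypothetical judgments: 9 of the first 10 results are relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned penalises short result lists, which matches the usual definition of precision at a fixed cutoff.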

Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard conventions of written Hindi, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor look up keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, a user can widen the scope of a web search in the Hindi language [66][67].
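To make the idea of solving the problem at the interface level concrete, the sketch below shows one way such a layer could sit between the user and a search engine: it issues the original query together with its variants and merges the result lists without duplicates. This is an assumption-laden illustration, not the tool described in the next chapter. The function `search_with_variants` is hypothetical, and both `make_variants` and `search` are caller-supplied stand-ins for a variant generator and a search engine call.

```python
from typing import Callable, Iterable, List


def search_with_variants(
    query: str,
    make_variants: Callable[[str], Iterable[str]],
    search: Callable[[str], List[str]],
    limit: int = 10,
) -> List[str]:
    """Run the query and its variants, then merge results without duplicates.

    Result lists are interleaved rank by rank so that every variant can
    contribute to the top of the merged list.
    """
    queries = [query] + [v for v in make_variants(query) if v and v != query]
    result_lists = [search(q) for q in queries]
    merged: List[str] = []
    seen = set()
    deepest = max((len(results) for results in result_lists), default=0)
    for rank in range(deepest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) == limit:
                    return merged
    return merged


if __name__ == "__main__":
    # Hypothetical in-memory index standing in for a real search engine.
    fake_index = {"q": ["a", "b"], "v1": ["b", "c"], "v2": []}
    print(search_with_variants("q", lambda q: ["v1", "v2"],
                               lambda q: fake_index.get(q, [])))
```

Keeping the variant generation and the search call as parameters leaves the search engine untouched, which is the point of doing the work at the interface rather than at the root level.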


Page 50: Chapter 4 Issues in Information Retrieval for Hindi Language

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 135

413 Discussion Influence of English on Hindi Information retrieval

English language has its influence over Hindi not only in speaking but in

writing too When we talk about especially Hindi literature on web it becomes

more evident Influence of English on Hindi language has been observed as one of

the very important parameters for Hindi Information retrieval which is more

clearly explained in the example as

Example In English the word exercise is written in Hindi as (एकसयस इज) The

word exercise (एकसयस इज) has following phonetic variations एकसयस इज

एकसयस इज एकस यस इज एकसयस इज in Hindi As per various phonetic

variations mentioned above in the example a variety of popular keywords and

queries have been tested for experiments from various domains

In the following figure a sample set of common and popular English keywords

along with their phonetic variants written in Hindi can be seen

English Words

Google Transliteration

Standard Hindi Keywords

Phonetic Equivalents Search Engine Google

Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800

Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710

Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820

Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100

Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000

0 537000

University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270

Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300

Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160

Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट

3240 38000 639 1120

specialist टऩशसअशरटट

सऩ मशममरसट

सऩ शमरसट सऩ श मरसट

143 1440 51300 9820

Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 136

Table 426 keywords along with their phonetic variants written in Hindi

From the above table it can be observed that search engine does return documents

for single keyword query documents for all phonetic variants of the keywords are

also returned which are huge in number It can also be seen that that people have

their own way of representing the Hindi words and no standard is followed for

storing Hindi data on web Also the documents are retrieved for every

phonetically variant English Keyword written in Hindi script In the above table

the column with bold Hindi entries shows the keywords which are obtained by

using Google transliteration tool The table 426 shows that transliteration does

not provide correct Hindi word in most of the cases For example the correct

transliteration for word University should be म तनवमसमटीwhereas Google

transliteration provides the word उतनव मसमतम which is completely wrong It is

clearly evident from the figure above that 1540 documents have been retrieved for

the wrong keyword उतनव मसमतम (University) and the same follows for other single

word queries Insurance इनस य स

ParliamentऩमरमअभटCorruptionक रम िपतओनPolicy

ऩ मरसम Specialistसऩ मसअमरसट It can be analyzed that Hindi website developers

make use of unchecked and non standard transliteration which makes the Hindi IR

process a difficult task

In the next example Multiword Hindi queries are selected to test the effect

of English influence on Hindi IR on precision and quantity of documents

retrieved The Hindi query is transformed into it variants by the software (design

and working discussed in next chapter) by replacing the Hindi keywords with

English keywords written in Hindi without changing the meaning of the query

Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 137

The queries are converted into two levels In first level one Hindi word is replaced

by its English equivalent and in second level more than one words is replaced by

their English equivalent words without changing the meaning of the original

Hindi query Example

An English query ―Foreign investment in India can be written in Hindi as

―षवद श तनव श ब यत भ where Hindi keyword षवद श means ―Foreign ―प य न

and तनव श means ―investment ―इनव सटभट The query for the two levels is

transformed as

प य न तनव श ब यत भ (Foreign nivesh bharat mein)

प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore the original Hindi query ―षवद श तनव श ब यत भ ―videshi

nivesh Bharat mein supplied by the user is transformed into two equivalent

senses containing a mixture of both English and Hindi language where meaning

of the query remains same From the sample set of one hundred queries some

randomly selected queries are presented below in Table 427

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 138

Table 427 Transformed queries into two equivalent senses containing a mixture

of both English and Hindi Tabular representation

Figure 44 Transformed Queries into two equivalent senses

1520

8360

1020

Hindi Query 1

Level 1

Level 2

95800

37200

2150

Hindi Query 2

Level 1

Level 2

209000

85800

1660

Hindi Query 3

Level 1

Level 2

112000

0

49000271

Hindi Query 4

Level 1

Level 2

658000

52766

Hindi Query 5

Level 1

Level 2

181000

19700

47000

Hindi Query 6

Level 1

Level 2

Influenced Hindi Query Google

In English Hindi Query Level 1 Level 2 Search Results

Health and blood donation

सव सथ औय यकतद न

हलथ औय यकतद न

हलथ औय जरि_िोनिन

1520 8360 1020

Treatment for Blood pressure

यकतच ऩ क इर ज

जरिपरिय क इर ज

जरिपरिय क टरीटभट

95800 37200 2150

Cardiologist रदम चचककतसक हाट िॉकटय काडिमोरासिटट 241000 99100 1770

Government Employment Policy

सयक य दव य य जग य म जन

सयक य दव य य जग य टकीभ

सयक य दव य एमपतरॉमभट सकीभ

1840000 51600 93

Foreign investment in

India

षवद श तनव श ब यत भ

पायन तनव श ब यत भ

पायनइनवटटभट ब यत भ

658000 527 66

Corruption free India

भरषटट च य भ कत ब यत

कयतिन भ कत ब यत

कयतिन फरी इॊडिमा

181000 19700 47000

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 139

From the above table and figure it is evident that documents are returned for

original as well as transformed Hindi query and the quantity of the retrieved

documents is quite considerable In case of search engines the quality of results is

more important than the quantity therefore Table 428 and figure 45 are presented

below for the analysis of precision values Three popular search engines namely

Google Bing and Alta Vista are used for retrieving web results

Table 428 Analysis of precision values Tabular representation

Hindi

Query

Influenced Hindi Query Precision 10

Level 1 Level 2 Google Bing AltaVista

सव सथऔययकतद न

ह लथऔययकतद न

ह लथऔयबरड_ड न शन

09 09 09 09 08 08 09 09 08

यकतच ऩक इर ज

बरडपर शयक इर ज

बरडपर शयक टरीटभट

08 09 07 08 07 05 08 07 05

रदमचचककतसक

ह टमचचककतसक

क रड मम र िजसट 09 08 1 09 07 1 09 07 1

सयक यदव य य जग यम जन

सयक यदव य य जग यसकीभ

सयक यदव य एमपरॉमभटसकीभ 1 09 08 08 08 08 06 07 08

षवद श तनव शब यतभ

प य नतनव शब यतभ

प य नइनव सटभटब यतभ

09 08 08 09 08 04 09 09 04

भरषटट च यभ कतब यत

कयपशनभ कतब यत

कयपशनफरीइ रडम

1 1 1 1 1 1 1 1 1

Chapter 4 Issues in Information Retrieval for Hindi Language

A Study of Web Mining Tools for Query Optimization Page 140

Figure 45 Analysis of Inter precision values

From the above tables and figures, it can be seen that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the same repeats for the rest of the Hindi queries, as shown in the table above.

Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the variant queries that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
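The query transformation discussed above can be sketched as a dictionary-based substitution. This is only an illustration under stated assumptions: the `LEXICON` entries, their Devanagari spellings, and the function name `transform_query` are hypothetical, and the rule that Level 1 replaces one keyword while Level 2 replaces more than one is applied in its simplest form (first replaceable keyword, then all replaceable keywords). It is not a description of the software developed in the study.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon: Hindi keyword -> English keyword written in Devanagari.
# The spellings are illustrative assumptions, not the author's dictionary.
LEXICON: Dict[str, str] = {
    "विदेशी": "फॉरेन",        # foreign
    "निवेश": "इन्वेस्टमेंट",    # investment
}


def transform_query(
    query: str, lexicon: Dict[str, str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (level1, level2) variants of a whitespace-tokenised Hindi query.

    Level 1 replaces only the first replaceable keyword; Level 2 replaces
    every replaceable keyword. A level is None when it cannot be formed.
    """
    tokens: List[str] = query.split()
    hits = [i for i, tok in enumerate(tokens) if tok in lexicon]
    if not hits:
        return None, None

    level1 = list(tokens)
    level1[hits[0]] = lexicon[tokens[hits[0]]]

    level2 = None
    if len(hits) > 1:  # Level 2 needs more than one replacement
        level2 = " ".join(lexicon.get(tok, tok) for tok in tokens)

    return " ".join(level1), level2


level1, level2 = transform_query("विदेशी निवेश भारत में", LEXICON)
print(level1)  # फॉरेन निवेश भारत में
print(level2)  # फॉरेन इन्वेस्टमेंट भारत में
```

Exact string matching is the weakest point of this sketch: spelling and phonetic variants of the same keyword would need normalisation before lookup.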

[Figure 4.5 chart: Inter-precision chart. Horizontal axis: HQ1 to HQ6, each followed by L1 and L2 (HQ = Hindi Query, L = Level). Series: Google, Bing, AltaVista. The plotted values are those listed in Table 4.28.]


The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
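The idea of solving the problem at the interface level can be illustrated with a thin layer that submits every query variant to each search engine and merges the results. The sketch below is an assumption-laden outline, not the tool described in the next chapter: `search_with_variants`, the `SearchFn` type, and `fake_engine` are hypothetical names, and each engine is represented by a caller-supplied function rather than any real search API.

```python
from typing import Callable, Dict, Iterable, List

# A search backend is modelled as a function: query -> ranked list of URLs.
SearchFn = Callable[[str], List[str]]


def search_with_variants(
    variants: Iterable[str], engines: Dict[str, SearchFn], top_k: int = 10
) -> List[str]:
    """Send every query variant to every engine and merge the top_k results.

    URLs are kept in first-seen order and duplicates are dropped, so results
    for the original query come before results added by its variants.
    """
    seen = set()
    merged: List[str] = []
    for query in variants:
        for search in engines.values():
            for url in search(query)[:top_k]:
                if url not in seen:
                    seen.add(url)
                    merged.append(url)
    return merged


def fake_engine(query: str) -> List[str]:
    """Stand-in backend with a tiny fixed index (purely illustrative)."""
    index = {"q-original": ["u1", "u2"], "q-level1": ["u2", "u3"]}
    return index.get(query, [])


results = search_with_variants(
    ["q-original", "q-level1", "q-level2"], {"fake": fake_engine}
)
print(results)  # ['u1', 'u2', 'u3']
```

Because the expansion happens before the engines are called, the engines themselves need no change, which is the point of working at the interface level.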







