Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 86
Chapter 4
Issues in Information Retrieval for Hindi Language
4.1 Background
Hindi is often described as the third most widely spoken language in the world
(after English and Mandarin); an estimated 500-600 million people speak the
language. A direct descendant of Sanskrit through Prakrit and Apabhramsha, Hindi
belongs to the Indo-Aryan group of languages, a subset of the Indo-European
family. It has been influenced and enriched by Persian (Farsi), Turkish, Arabic,
Portuguese and English. Hindi is broadly identical with Urdu, the official
language of Pakistan, and is closely related to Bengali, Punjabi and Gujarati. A
good knowledge of Hindi is therefore likely to be useful to anyone having an
interest in the countries of South Asia or in the numerous South Asian
communities of the world.
There are no particular difficulties in the study of the language. Hindi
inherited its writing system from Sanskrit. The script, Devanagari, is extremely
logical and therefore straightforward and easy to learn. Pronunciation is easy
because, unlike in English, letters are almost always pronounced the same way.
The language can be used for exact, rational reasoning as well as for the
expressive forms suited to poetry and song.
The general appearance of the Devanagari script is that of letters hanging
from a line. This line, also found in many other South Asian scripts, is actually
a part of most of the letters and is drawn as the writing proceeds. The script
has no capital letters.
Hindi is an official language of the Republic of India and a common second
language in Mauritius, Fiji, Trinidad, Guyana and Surinam.
The Hindi alphabet consists of 11 vowels and 33 consonants.
The Devanagari script used for Hindi is derived from the ancient Brahmi script
and is closely related to other Indian scripts such as Gujarati and Bengali.
Hindi was originally a variety of Hindustani spoken in the area of New Delhi.
There are hundreds of Hindi dialects.
Today Hindi is widely spoken in South Asia (India, Pakistan, Nepal and
Bhutan), South Africa, Mauritius, the USA, Trinidad, Fiji, Surinam, Guyana,
Yemen, Uganda, New Zealand, Malaysia and Singapore [58].
4.2 Characteristics of Hindi Language and Devanagari Script
Hindi is written using the Devanagari script. Devanagari is also used to
write other languages, such as Nepali and Marathi, and is the most common script
used to write Sanskrit. Several other languages, such as Bengali, Punjabi and
Gujarati, have scripts which are related to Devanagari.
The Devanagari script represents the sounds of the Hindi language with
remarkable consistency. Whereas many letters of the English alphabet can be
pronounced in many different ways, the letters of the Devanagari script are
pronounced consistently (with a few minor exceptions). Thus Devanagari is
relatively easy to learn.
Devanagari consists of 11 vowels and 33 consonants and is written from left to
right.
4.2.1 Basic Genius
Devanagari is not actually an alphabet but a so-called alphasyllabary. An
alphasyllabary is a writing system which is primarily based on consonants and in
which vowel symbols are requisite yet secondary. As such, the fundamental
genius of Devanagari is that every letter represents a consonant which is followed
by an inherent schwa vowel, अ. For example, the letter स is read "sa". In order to
suppress the inherent vowel, one of two methods is required: a diacritical mark
called a halant, or a ligature called a conjunct. In order to indicate a vowel other
than the inherent vowel, diacritical marks called maatraas are used. For vowels
independent of consonants, there exist full letters to transcribe vowels.
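The interplay of consonant, inherent vowel, halant and maatraa can be made concrete by looking at how such text is stored as Unicode characters. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe and the constant names are illustrative assumptions, not part of any cited system.

```python
import unicodedata

def describe(text):
    """Return the Unicode character names that make up a string."""
    return [unicodedata.name(ch) for ch in text]

SA = "\u0938"         # the letter "sa": a consonant with its inherent vowel
HALANT = "\u094D"     # suppresses the inherent vowel (Unicode calls it virama)
MAATRAA_I = "\u093F"  # dependent form of the short "i" vowel

for text in (SA, SA + HALANT, SA + MAATRAA_I):
    print(text, describe(text))
```

Note that the inherent vowel itself is never stored: a bare consonant letter already implies it, and only its suppression or replacement needs an extra character.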
4.2.2 Vowels
Hindi has 11 vowels. Ten of them are transcribed in two distinct forms: the
independent form and the dependent (maatraa) form. The independent form is
used when the vowel letter appears alone, at the beginning of a word, or
immediately following another vowel letter. The dependent form is used when the
vowel follows a consonant.
Vowels in Independent Form
अ आ इ ई उ ऊ ऋ ए ऐ ओ औ
The following table lists each vowel in its independent form with a description.
The best way to learn the pronunciation is to learn from a native speaker.
Vowel   Description
अ       as in "but", "again"
आ       as in "father", "far"
इ       as in "fit", "hit"
ई       as in "feet", "heat"
उ       as in "put", "pull"
ऊ       as in "pool", "shoot"
ऋ       as in "rip", "rib"
ए       as in "ate", "day"
ऐ       as in "man", "bat"
ओ       as in "go", "boat"
औ       as in "saw", "taught"
Table 4.1 Vowels in independent form and their descriptions
4.2.3 Vowels in Dependent (maatraa) Form
When a vowel follows a consonant, it is written in its respective maatraa
form, which is appended to the consonant. Maatraa forms never appear at the
beginning of a word or after another vowel. The first vowel, अ, has no particular
maatraa form. Instead, it is the default vowel: it is assumed to be present unless
the maatraa form of another vowel is explicitly appended to a consonant. In
Sanskrit the vowel अ is pronounced at the end of a word. In Hindi, however, it is
not pronounced, except at the end of single-letter words. The following table lists
each vowel in its independent form, its corresponding dependent form, and how it
appears with the consonant क (k).
Independent   Dependent   With क
अ             (none)      क
आ             ा           का
इ             ि           कि
ई             ी           की
उ             ु           कु
ऊ             ू           कू
ऋ             ृ           कृ
ए             े           के
ऐ             ै           कै
ओ             ो           को
औ             ौ           कौ
Table 4.2 Maatraa Forms of Vowels
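The correspondence between independent and dependent vowel forms can be written down as a lookup table. The sketch below is illustrative only: the names MAATRAA and attach are assumptions, and the code points are those of the Unicode Devanagari block.

```python
# Independent vowel -> dependent (maatraa) sign.
# The first vowel has no sign of its own: it is inherent in the consonant.
MAATRAA = {
    "\u0905": "",        # a  (inherent vowel, nothing is appended)
    "\u0906": "\u093E",  # aa
    "\u0907": "\u093F",  # i
    "\u0908": "\u0940",  # ii
    "\u0909": "\u0941",  # u
    "\u090A": "\u0942",  # uu
    "\u090B": "\u0943",  # vocalic r
    "\u090F": "\u0947",  # e
    "\u0910": "\u0948",  # ai
    "\u0913": "\u094B",  # o
    "\u0914": "\u094C",  # au
}

def attach(consonant, vowel):
    """Append the dependent form of an independent vowel to a consonant."""
    return consonant + MAATRAA[vowel]

KA = "\u0915"
print(" ".join(attach(KA, vowel) for vowel in MAATRAA))
```

Printing the joined result reproduces the third column of the table for the consonant "ka".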
4.2.4 Allophones
As mentioned earlier, the distinction between the vowels इ and ई is the
duration of the pronunciation of the vowel: the former is shorter and the latter
longer. However, in practice the vowel इ is pronounced more like the English "i"
as in the word "it", as described in the table of vowels. The same is so for the
vowels उ and ऊ.
4.2.5 Final Schwa
The schwa अ is normally not pronounced at the end of a word. Thus
कान is pronounced "kaan", not "kaana". An exception occurs when a word ends in
a conjunct. In this case the word may be pronounced with a slight final schwa, as
in मित्र, literally "mitr" but often pronounced like "mitr(a)" with a soft final
schwa.
4.2.6 Monophthongs versus Diphthongs
Native English speakers should be careful not to pronounce the Hindi
vowels that are monophthongs as diphthongs. For instance, ओ is a pure sound, not
a glide like the English "o" in the word "low". Many vowel letters in English
can represent diphthongs. Thus, whereas English may represent a diphthong with
the letter "i", as in the word "site", in Devanagari this diphthong would be more
precisely transcribed as two monophthongs, आ and ई: साईट.
4.2.7 Schwa Syncope
Sometimes the inherent vowel is not pronounced, despite its implicit
presence and the lack of any modifying diacritic. This phenomenon is called
schwa syncope or, alternatively, schwa deletion. For instance, consider the word
नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if
the word were written नम्कीन ("namkeen"). There is no rule which can predict
this phenomenon with absolute accuracy, yet one generally useful heuristic is that
the inherent vowel is deleted after a consonant which stands between two
consonants that carry vowels. Thus the word देवनागरी itself is pronounced with
the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is
still transliterated as "Devanagari".
Occasionally the schwa will not be totally deleted but will be very slightly
pronounced.
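The heuristic described above can be sketched in a few lines. This is a deliberately simplified illustration, not a complete schwa-deletion algorithm: it assumes the word has already been split into (consonant, vowel) pairs in a romanized form where "a" stands for the inherent schwa, and the function name delete_schwas is hypothetical.

```python
def delete_schwas(syllables):
    """Apply a simple schwa-deletion heuristic to (consonant, vowel) pairs.

    The vowel "a" stands for the inherent schwa. Two rules are applied:
    the word-final schwa is dropped, and a schwa is dropped when both the
    preceding and the following consonant carry a pronounced vowel.
    """
    out = [[consonant, vowel] for consonant, vowel in syllables]
    # Rule 1: the final schwa is normally silent (single letters keep it).
    if len(out) > 1 and out[-1][1] == "a":
        out[-1][1] = ""
    # Rule 2: delete a medial schwa that sits between two vocalic consonants.
    for i in range(1, len(out) - 1):
        previous_vowel, vowel, next_vowel = out[i - 1][1], out[i][1], out[i + 1][1]
        if vowel == "a" and previous_vowel and next_vowel:
            out[i][1] = ""
    return "".join(consonant + vowel for consonant, vowel in out)

print(delete_schwas([("n", "a"), ("m", "a"), ("k", "ee"), ("n", "a")]))
```

Because the rule is only a heuristic, a retrieval system that relies on it for transliteration or phonetic matching should expect some wrong outputs and, where possible, fall back on a pronunciation lexicon.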
4.2.8 Schwa Pronunciation in Context
The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel which is
similar to the English "e" in the word "bed", but only in certain contexts,
namely when अ vowels appear on both sides of the consonant ह, as in the
verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such
circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is
often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when
only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In
this case both vowels adjacent to ह are converted to [ɛ], and thus, although the
word is literally "bahin", it is pronounced "behen".
4.2.9 Nasalization of Vowels
All vowels in Hindi can be nasalized except for ऋ. Nasalization is
indicated either by the symbol ं or by the symbol ँ. The former symbol is
called bindu (dot) and the latter symbol is called chandrabindu (moon and
dot). The bindu is used when part or all of the vowel symbol extends above the
horizontal line. The chandrabindu is used when no part of the vowel symbol
extends above the horizontal line. The bindu is more common in modern written
Hindi and may even be used exclusively.
The following examples summarize the use of the bindu and chandrabindu:
अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं
A special diacritic (ॉ) is sometimes used with the vowel आ to transcribe the
English "o" vowel sound, as in "college": कॉलेज.
4.2.10 Consonants: Velar Consonants
Letter   Description
क        unaspirated "k"
ख        aspirated "k"
ग        unaspirated "g"
घ        aspirated "g"
ङ        "n" as in "sing"
Table 4.3 Velar Consonants
Note that the velar nasal consonant does not appear as the first letter of any word.
4.2.11 Palatal Consonants
Letter   Description
च        unaspirated "ch" as in "cheese"
छ        aspirated "ch"
ज        unaspirated "j"
झ        aspirated "j"
ञ        "n" as in "punch"
Table 4.4 Palatal Consonants
4.2.12 Retroflex Consonants
Letter   Description
ट        like "t" but retroflex and unaspirated
ठ        like "t" but retroflex and aspirated
ड        like "d" but retroflex and unaspirated
ढ        like "d" but retroflex and aspirated
ण        like "n" but retroflex
Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these
consonants are formed by placing a diacritical mark called a nuqta, which is a
subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is
pronounced by flapping the tongue from the retroflex position forward toward the
alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does
have an alveolar flap consonant, as the "t" in the word "better" or the "d" in
"bedding" in American English. The Hindi flaps are retroflex, however.
4.2.13 Dental Consonants
Letter   Description
त        like "t" but dental and unaspirated
थ        like "t" but dental and aspirated
द        like "d" but dental and unaspirated
ध        like "d" but dental and aspirated
न        like "n" in "name" but dental
Table 4.6 Dental Consonants
4.2.14 Labial Consonants
Letter   Description
प        like "p" but unaspirated
फ        like "p" but aspirated
ब        like "b" but unaspirated
भ        like "b" but aspirated
म        "m"
Table 4.7 Labial Consonants
4.2.15 Semivowels
Letter   Description
य        "y" as in "young"
र        like "r" but often rolled
ल        "l" as in "lip"
व        either "w" or "v"
Table 4.8 Semivowels
The Hindi "r" sound is typically a flap. However, some speakers may trill the "r"
sound occasionally, or may even occasionally pronounce it closer to an unflapped
approximant sound, like the English "r" in "red".
4.2.16 Sibilants
Letter   Description
श        "sh" as in "shave"
ष        like "sh" but retroflex
स        "s" as in "save"
Table 4.9 Sibilants
4.2.17 Glottal
Letter   Description
ह        like "h" but voiced
Table 4.10 Glottal
4.2.18 Allophony of "w" and "v" in Hindi
A phone is simply a distinct sound. A phoneme is an equivalence class of
phones: substituting one phoneme for another can produce a difference in
meaning, whereas substituting one member of the class for another member cannot.
For instance, in English the "p" in the word "spit" and the "p" in the word "pit"
are pronounced distinctly: the former is unaspirated, the latter is aspirated.
Thus they are two distinct phones. However, they are both members of the same
phoneme, since substituting one for the other can never produce a difference in
meaning, even though the substitution may be perceived as slightly awkward by
native speakers. Two distinct phones which are both members of the same phoneme
are called allophones (from Greek, "different sounds").
In Hindi, the sounds associated with the English letters "w" and "v" are
allophones. Both are transcribed with one letter, व. Analogously to the English
example above, these sounds are typically pronounced consistently in words, but
they do not constitute meaningful differences in utterances. For example, the
word वो is typically pronounced as "vo", whereas the suffix -वाला is typically
pronounced "wala". Hindi speakers are not generally aware of this distinction,
even though they pronounce the distinction fairly consistently, just as English
speakers are not aware of the differences of aspiration in certain letters yet
pronounce aspiration consistently.
Thus व may be pronounced as "w" or "v". Some speakers may even
pronounce an intermediate sound.
Semi-Allophones "j" and "z" in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction
between the English "j" and "z" sounds either, but will typically pronounce words
consistently. This situation is not quite the same as "w" and "v", since
technically the "z" sound can be represented distinctly from the "j" sound by
placing a dot (nuqta) underneath the letter, and some speakers are aware of this
distinction. For instance, the word जो is pronounced as "jo". There is some
variation, however, in some words such as ज़्यादा: some speakers pronounce this
as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English "t" or "d" in Hindi. These English
sounds are pronounced with the tip of the tongue on the alveolar ridge behind the
top teeth. This place of articulation is between the Devanagari retroflex and
dental positions, although the English pronunciation will sound much closer to
the retroflex pronunciation to Hindi speakers. English loanwords containing "t"
or "d" are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Maatraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding
symbols in Devanagari. These sounds are represented by placing the nuqta
underneath a symbol which is phonetically similar. These symbols represent
sounds from other languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter   Approximation
क़        like "k" but pronounced in the back of the mouth
ख़        velar fricative, like "ch" in German "Bach"
ग़        velar sound similar to ख़ but voiced
ज़        just as English "z" in "zoo"
झ़        similar to the "s" in English "vision"
फ़        just as English "f"
Table 4.11 Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the
non-nuqta forms, though: ज़ and फ़.
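For retrieval, the nuqta matters because the same word may be typed with or without it, and because a nuqta letter can be stored either as one precomposed character or as a base letter followed by the combining nuqta (U+093C). The sketch below, using Python's standard unicodedata module, shows one possible way to fold these variants together; the function name strip_nuqta is an assumption, and dropping the nuqta is a lossy design choice that trades precision for recall.

```python
import unicodedata

NUQTA = "\u093C"  # combining nuqta (subscript dot)

def strip_nuqta(text):
    """Fold nuqta letters onto their base consonants for loose matching."""
    decomposed = unicodedata.normalize("NFD", text)  # split base + nuqta
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))

ZA_PRECOMPOSED = "\u095B"       # single code point for the "z" letter
ZA_SEQUENCE = "\u091C" + NUQTA  # "j" letter followed by combining nuqta

# Both spellings normalize to the same sequence, and both fold to plain "j".
print(unicodedata.normalize("NFC", ZA_PRECOMPOSED) == ZA_SEQUENCE)
print(strip_nuqta(ZA_PRECOMPOSED) == strip_nuqta(ZA_SEQUENCE) == "\u091C")
```

Since only two of the borrowed sounds are usually pronounced distinctly, an index might apply this folding at query time only, so that the original spelling is still available for display.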
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ, Devanagari provides two means of
suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This
method is much more common. The halant is typically only used when
typographical difficulties make it difficult to use conjuncts.
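In Unicode text both methods are stored with the same halant character (called virama in the standard, U+094D); whether a ligature or a visible halant is drawn is normally decided by the font. A zero width non-joiner (U+200C) after the virama requests the visible halant form. The helper names below are illustrative assumptions.

```python
VIRAMA = "\u094D"  # halant: suppresses the inherent vowel
ZWNJ = "\u200C"    # zero width non-joiner: asks for a visible halant

def conjunct(first, second):
    """Store two consonants so that a renderer may draw them as a conjunct."""
    return first + VIRAMA + second

def explicit_halant(first, second):
    """Store two consonants so that the halant stays visible on the first."""
    return first + VIRAMA + ZWNJ + second

NA, DA = "\u0928", "\u0926"
print(conjunct(NA, DA), [hex(ord(ch)) for ch in conjunct(NA, DA)])
print(explicit_halant(NA, DA), [hex(ord(ch)) for ch in explicit_halant(NA, DA)])
```

The two strings look alike to a reader but differ as character sequences, so a search system that wants them to match would need to remove U+200C before indexing.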
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct
contains a vertical line. The vertical line is deleted and then the modified
consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द    हिन्दी
च + छ = च्छ    अच्छा
स + त = स्त    नमस्ते
ल + ल = ल्ल    बिल्ली
म + ब = म्ब    लम्बा
फ़ + त = फ़्त    मुफ़्त
क + य = क्य    क्यों
Note that in the last two examples, although neither क nor फ ends in a vertical
line, they can still be the first letter of a horizontal conjunct. The curve on
the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical
conjuncts with the following consonant. The first consonant is written on top of
the second consonant. For example:
ट + ट = ट्ट    छुट्टी
ट + ठ = ट्ठ    चिट्ठी
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be observed. If a nasal consonant
is the first member of a conjunct, it may be written either using a regular
conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above
the horizontal line to the right side of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled
अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant:
in other words, it is articulated in the same location in the mouth as the
following consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant, ड, is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar
represents a consonant which is the first letter of a conjunct, whereas the bindu
and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot
be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not
considered a bindu, since it represents a consonant that is the first member of a
conjunct.
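Because the same word can be spelled with an anusvar or with an explicit nasal conjunct, a retrieval system may want to map both spellings to one form. The homorganic rule above suggests one way to do this, sketched below. The names VARGAS, HOMORGANIC and expand_anusvar are assumptions for illustration; the sketch handles only the five stop classes and leaves every other dot (for example a bindu marking a nasalized vowel at the end of a word) untouched.

```python
ANUSVAR = "\u0902"
VIRAMA = "\u094D"

# Stops of each place of articulation mapped to the nasal of the same class.
VARGAS = {
    "कखगघ": "ङ",  # velar
    "चछजझ": "ञ",  # palatal
    "टठडढ": "ण",  # retroflex
    "तथदध": "न",  # dental
    "पफबभ": "म",  # labial
}
HOMORGANIC = {stop: nasal for stops, nasal in VARGAS.items() for stop in stops}

def expand_anusvar(word):
    """Rewrite an anusvar before a stop as nasal consonant plus halant."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVAR and following in HOMORGANIC:
            out.append(HOMORGANIC[following] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)

print(expand_anusvar("हिंदी"))
print(expand_anusvar("अंडा"))
```

Applying the same function to both documents and queries would let the two spellings of a word share one index entry; the opposite direction (collapsing conjuncts to an anusvar) is equally possible and is a matter of design preference.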
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above
and to the right of the following consonant:
र + म = र्म    शर्मा
र + ट + ई = र्टी    पार्टी
As the second member of a conjunct, र is indicated by a diagonal line adjoined to
the vertical line of the preceding consonant:
क + र = क्र    शुक्रिया
म + र = म्र    उम्र
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted "v", as follows:
ट + र = ट्र    राष्ट्र
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are
treated as a single unit, and a maatraa that is written to the left, such as ि,
is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), which is a vertical
line that terminates a sentence. Other punctuation, such as commas and question
marks, is borrowed from English. In modern typography, periods are also used in
place of the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, which is a
7-bit encoding of the characters of the English language (it can represent at
most 128 characters). With the passage of time, the need was felt for a single
encoding that could contain enough characters to accommodate all the languages
in the world. To enable sharing of information, this encoding would need to be a
standard accepted universally. That standard is Unicode. Unicode assigns a
unique number (a code point in the range U+0000 to U+10FFFF) to each character,
which leaves room for the characters of all languages known to man.
Actually, there is another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the Universal
Character Set (UCS). Fortunately, the participants of both projects (ISO and
Unicode) realized in around 1991 that two different unified character sets are
not exactly what the world needs. They joined their efforts and worked together
on creating a single encoding. Both projects still exist and publish their
respective standards independently, but have agreed to keep the encoding of the
Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value
will be represented in bits when stored in a computer file or transmitted over
the Internet. The Unicode Standard defines three encoding forms that define how a
particular character will be represented in bits while being transmitted. The
three encoding forms allow the same data to be transmitted in a byte, word or
double word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three
encoding forms encode the same common character repertoire and can be efficiently
transformed into one another without loss of data. The three encoding forms as
defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable length encoding of bytes. It
has the advantages that the Unicode characters corresponding to the familiar
ASCII set have the same byte values as ASCII, and that Unicode characters
transformed into UTF-8 can be used with much existing software without
extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact, and all the
heavily used characters fit into a single 16-bit code unit, while all other
characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed width, single
code unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
By the way, UTF stands for UCS Transformation Format.
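The difference between the three encoding forms can be observed directly by encoding one character and counting the resulting bytes. The following is a small sketch using Python's built-in str.encode; the function name encoded_sizes is an assumption, and the little-endian codec names are used only to avoid a byte order mark in the count.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding form."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

KA = "\u0915"  # Devanagari letter "ka"

print("U+%04X" % ord(KA), encoded_sizes(KA))
print("U+%04X" % ord("A"), encoded_sizes("A"))
print(KA.encode("utf-8").hex(" "))
```

A Devanagari letter takes three bytes in UTF-8 but two in UTF-16, which is one reason storage estimates for Hindi collections depend on the encoding form chosen.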
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable length scheme and
range from 2 to 4 bytes in size (the original design allowed sequences of up to
6 bytes, but the current definition stops at 4); the most commonly used
characters, including Devanagari, need at most three bytes. Non-ASCII characters
are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII characters
stored in bytes have their most significant bit not set). This allows the
application to differentiate between ASCII and non-ASCII characters. Bytes
representing non-ASCII characters will never be mistaken for ASCII characters.
The leading bits of the first byte of a multibyte sequence indicate how
many bytes the sequence contains. The remaining bits of the first byte, together
with the low six bits of each further byte in the sequence, encode the actual
character [61].
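The role of the first byte can be illustrated with a few bit tests. This is a minimal sketch of the lead-byte rule only (it does not validate continuation bytes or reject overlong forms), and the function name sequence_length is an assumption.

```python
def sequence_length(lead):
    """Return the length of a UTF-8 sequence from its first byte."""
    if lead < 0x80:              # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead >> 5 == 0b110:       # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:      # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:     # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid lead byte (continuation bytes are 10xxxxxx)")

data = "\u0915".encode("utf-8")  # Devanagari letter "ka"
print([format(byte, "08b") for byte in data])
print(sequence_length(data[0]))
print(all(byte & 0x80 for byte in data))  # every byte has the top bit set
```

A byte-level tokenizer can use the same test to skip forward one character at a time without decoding the whole document.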
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from left
to right. They are all abugidas, in which most symbols stand for a consonant plus
an inherent vowel (usually the sound "a"). Word-initial vowels in many of these
scripts have distinct symbols, and word-internal vowels are usually written by
juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of
the inherent vowel, when that occurs, is frequently marked with a special sign.
In the Unicode Standard this sign is denoted by the Sanskrit word virāma. In
some languages another designation is preferred. In Hindi, for example, the word
hal refers to the character itself and halant refers to the consonant that has
its inherent vowel suppressed; in Tamil the word puḷḷi is used. The virama sign
nominally serves to suppress the inherent vowel of the consonant to which it is
applied; it is a combining character, with its shape varying from script to
script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in
the south, and from Pakistan in the west to the easternmost islands of Indonesia,
are derived from the ancient Brahmi script. The oldest lengthy inscriptions of
India, the edicts of Ashoka from the third century BCE, were written in two
scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin,
probably deriving from Aramaic, which was an important administrative language
of the Middle East at that time. Kharoshthi, written from right to left, was
supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with
myriad changes throughout the subcontinent and outlying islands. There are said
to be some 200 different scripts deriving from it. By the eleventh century the
modern script known as Devanagari was in ascendancy in India proper as the major
script of Sanskrit literature. The northern branch of these scripts includes
such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch
includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are
all encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit encoding
of syllable-final consonants. Many of the character names in this group of
scripts represent the same sounds, and naming conventions are similar across the
range.
4.3.4 Devanagari: U+0900-U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. The principles of the
Indic scripts are covered in some detail in this introduction to the Devanagari
script. The remaining introductions to the Indic scripts are abbreviated but
highlight any differences from Devanagari where appropriate.
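Because Devanagari occupies one contiguous block, a simple range test on code points is often enough to detect Devanagari text, for example to route a query to a Hindi-specific pipeline. The sketch below is illustrative; the function name devanagari_ratio is an assumption, and the test cannot tell Hindi apart from the other languages written in the same script.

```python
DEVANAGARI_FIRST = 0x0900
DEVANAGARI_LAST = 0x097F

def devanagari_ratio(text):
    """Return the share of non-space characters inside the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    inside = sum(DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST for ch in chars)
    return inside / len(chars)

print(devanagari_ratio("हिन्दी"))
print(devanagari_ratio("Hindi हिन्दी"))
```

A threshold on this ratio (for instance, more than half of the characters) is one simple way to classify mixed-script queries; the right threshold would need to be tuned on real data.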
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and
1986. The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0-F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic
scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada and Malayalam. This parallel code layout emphasizes the
structural similarities of the Brahmi scripts and follows the stated intention
of the Indian coding standards to enable one-to-one mappings between analogous
coding positions in different scripts in the family. Sinhala, Tibetan, Thai,
Lao, Khmer, Myanmar and other scripts depart to a greater extent from the
Devanagari structural pattern, so the Unicode Standard does not attempt to
provide any direct mappings for these scripts to the Devanagari order.
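The parallel layout means that a character in one of these nine blocks can be mapped to the analogous position in another block by adding a constant offset. The sketch below illustrates the idea under simplifying assumptions: the block start values are those of the Unicode code charts, the names BLOCK_START and shift_script are hypothetical, and no check is made that the target position is actually assigned (not every position is filled in every script, and shared punctuation such as the viraam would need separate handling).

```python
# Start of each 128-position block that follows the common ISCII-based layout.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def shift_script(text, source, target):
    """Move characters of one block to the same relative position in another."""
    start = BLOCK_START[source]
    delta = BLOCK_START[target] - start
    return "".join(
        chr(ord(ch) + delta) if start <= ord(ch) < start + 0x80 else ch
        for ch in text
    )

KA = "\u0915"  # Devanagari letter "ka"
print(shift_script(KA, "devanagari", "bengali"))
print(shift_script(KA, "devanagari", "gujarati"))
```

This kind of offset mapping is only a rough transliteration aid; a production system would add a table of exceptions for positions that differ between scripts.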
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built
on phonological principles and tends to correspond quite closely to
pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of
the Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
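The orthographic syllable structure described above can be approximated with a regular expression over the stored character sequence. The following is a rough sketch rather than a full implementation of the Unicode rules: the character ranges are simplified, any number of preceding consonants is accepted instead of at most three, and the names SYLLABLE and syllables are assumptions.

```python
import re

CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nuqta
VIRAMA = "\u094D"                                   # halant
VOWEL_SIGN = "[\u093E-\u094C]"                      # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"                        # chandrabindu, anusvar, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

# (consonant + virama)* consonant [virama | vowel sign] [modifier]
# or an independent vowel with an optional modifier.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{VIRAMA}|{VOWEL_SIGN})?{MODIFIER}?"
    f"|{INDEPENDENT_VOWEL}{MODIFIER}?"
)

def syllables(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    return SYLLABLE.findall(word)

print(syllables("हिन्दी"))
print(syllables("नमकीन"))
```

The first example also shows logical order at work: the short "i" sign is displayed to the left of its consonant but is stored after it, so the pattern can treat all vowel signs uniformly.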
44 Indian Languages on internet
Rise of Hindi Urdu and other Indian languages on the Web has lead
millions of non-English speaking Indians to discover uses of the Internet in their
daily lives They are sending and receiving e-mails searching for information
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 107
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India,
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then, Hindi
Unicode support has been a part of all subsequent upgrades of Microsoft's
operating systems. Also, providing Input Method Editor facilities gives users the
option to use different types of keyboards," says Meghashyam Karanam, product
manager, vision and localization at Microsoft India. The earlier system could
incorporate only 127 characters, which is not enough for the Hindi
Devanagari script. The Unicode system can incorporate up to 65,000 characters. As
most computers in India use Microsoft's operating system, it ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004 the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the California-based
company has launched various products in the past two years. With the Google
Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages
available on the Internet, including those that are not in a Unicode font.
"Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and
Telugu to name a few; Gmail in five languages; and Google transliteration in
Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil,
Telugu and Urdu. Urdu is the most recent language that Google has added to its
offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use
the search function, "users can type Hindi words in Roman script and a drop-down
menu suggests several Hindi phrases. By selecting the appropriate query, users
can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in English, hit the space bar and get the same word in a different
language. Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site and the language is
soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles. What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, out of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach one has to look at Hindi, because in a country of over a
billion people English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life. But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
US, it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language corpora.
It has coordinated with various Indian agencies and universities to develop
corpora of more than 45 million words in the Scheduled Languages of India, which
is also a part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides the corpus architecture and tools for Asian
languages. It has a monolingual corpus, which contains approximately 96,157,000
words, and a parallel corpus consisting of 200,000 words of text in English, which
supports translation involving Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus which is a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' for a repository of a Hindi word database and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Languages Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Efforts in this field have been noticed since 1980,
involving prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new
projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre, Chennai,
and Jadavpur University, Kolkata. Since April 2008, TDIL has run a
consortium-mode project for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
for translating from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is
a rule-based system and has approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. The
Anglabharti architecture has been adopted by various Indian institutes, for
example IIT Guwahati, to develop automated translation systems for regional
languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation system (FGH-MT). It has software which can
translate text from one Indian language into another, based on
Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was initiated by the Indian
Government in 1996 for the translation of government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network and web-based forms. It is based on the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar. Initially it was domain specific, covering Personnel
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present
it also covers domains like Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha
Secretariat of the Rajya Sabha (the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership of
Prof. Pushpak Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence generated by the
system, the user can feed new words, phrases and sentences back to the system
and obtain a newly interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with a statistical approach. It is used for
translation from English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text for text manipulation. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine: Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms. For
any query, it goes deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development of corpus creation and collection and content syndication. Moreover, it
provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government in e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gurjrati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has been aware of this need. Since 1970, the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). Also,
Indian Languages Transliteration (ITRANS) was developed by Avinash Chopde;
ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et
al. 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also making efforts for the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council for Technical Education and
UGC (University Grants Commission), are also involved directly and indirectly in
research and development of language technology. All these agencies help develop
important areas of research and provide funds for research to development
agencies. As a result, IndoWordNet was developed for the Indian languages on the
pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, more than 10-15 Indian local search engines must have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to put forth a brief overview of the current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP
(Unique Selling Proposition) or other. It is too early to pass a judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult (2008-2009 Report)
4.6.1.1 Most Used Websites
Website     2008 Stats   2009 Stats
Google      37           35
Yahoo       32           25
Rediff      7            4
Orkut       6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website     2008 Stats   2009 Stats
Google      81           76
Yahoo       7            7
Wikipedia   3            6
English     3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website     2008 Stats   2009 Stats
Google      65           34
Yahoo       12           29
Rediff      4            15
Teluguone   2            0
Guruji      1            18
Raftaar     1            0.2
Hindi       1            NA
Webdunia    1            0.7
Khoj        0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem when dealing with Indian languages because of multiple spellings,
morphological variants of keywords and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, the Hindi query for "world trade center
aatank-waadi hamlaa", "वलडम टर ड स नटय आत कव दी हरभ", is shown to return
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that these keywords do exist in the second set of search results. But just
saying that we have a recall problem may not be sufficient. The next obvious
question is 'how much of a problem is it?' In other words, we need to
quantify the problem. For this purpose, we conducted many experiments to
determine at what levels this recall problem occurs, and by how much.
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
Table 4.16 Improved recall
4.8 Factors Affecting the Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are often not followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, there are about 28
languages spoken in India, and Hindi, as the official language of India, is
influenced by the regional languages, which results in dialectal changes not only
in speaking but in writing also. Every language attaches markers to a root word
to construct new words: English uses s, es and ing, and Hindi uses maatraas such
as एं, यां and ओं. For example, योजनाओं (Yojnaaon) and योजनाएं (Yojnaayein)
are morphological variants of the root word योजना (Yojnaa, planning in
English). It is desirable to combine all the morphological variants of a word
into a single canonical form. This process is called word stemming, and the
canonical form is called the root word or base word.
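A minimal sketch of such stemming is given below. It is a light suffix-stripping
stemmer written for illustration only and is not the procedure used in this
study: the suffix list, the minimum stem length and the name light_stem are
assumptions, and a practical Hindi stemmer would need a much larger rule set.

```python
# (suffix, replacement) pairs, longest suffix first. Illustrative list only.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # ियों -> ी  (e.g. पक्षियों -> पक्षी)
    ("\u0913\u0902", ""),                    # ओं         (e.g. योजनाओं -> योजना)
    ("\u090f\u0902", ""),                    # एं         (e.g. योजनाएं -> योजना)
    ("\u090f\u0901", ""),                    # एँ
    ("\u094b\u0902", ""),                    # ों         (e.g. कीटनाशकों -> कीटनाशक)
]
MIN_STEM_LENGTH = 2  # assumed guard against stripping very short words

def light_stem(word):
    """Strip the first matching plural suffix; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word

YOJNAA = "\u092f\u094b\u091c\u0928\u093e"   # योजना
YOJNAAON = YOJNAA + "\u0913\u0902"          # योजनाओं
YOJNAAYEIN = YOJNAA + "\u090f\u0902"        # योजनाएं

# All three surface forms collapse to one canonical form.
print({light_stem(w) for w in (YOJNAA, YOJNAAON, YOJNAAYEIN)})
```

Indexing documents and queries under the stemmed form lets a query in any one
variant retrieve documents containing the others.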
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages and their multiple dialects, the
transliteration of proper names, words borrowed from regional and foreign
languages, and the phonetic variety of the Indian language alphabet. The variety
in the alphabet, the different dialects and the influence of regional and foreign
languages have resulted in spelling variations of the same word. For example, the
Hindi word अंग्रेजी (angrējī, meaning English) has several possible spelling
variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word school in Hindi can be written in different ways (सक र सक र सक र).
When information is searched for the standard keyword school सक र and the
non-standard phonetically equivalent keyword सक र, Google shows 69 million
results for the former and 14 million for the latter. Hindi is influenced by
other regional languages, which results in phonetic variety of words; for
example, the English word school (सक र in Hindi) is pronounced and written as
ISKOOL इसक र by the majority of the population of India in different states. For the
Hindi word ISKOOL इसक र, more than two thousand results are found. Search
engines should be capable of retrieving results for phonetically equivalent
forms of the keywords entered. A user may use any keyword for searching,
and search engines should be able to support all phonetically equivalent
words.
Also, no particular standard exists for writing the keywords used to fetch Hindi
web data. For each phonetically equivalent keyword in the query, the
results vary, i.e. a different set of documents is retrieved with little overlap.
The native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
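One way a search system could reduce the effect of such variation is to map each
keyword to a coarse spelling key before matching. The sketch below is an
assumption made for illustration, not a method evaluated in this study: the
function name spelling_key and the three rules (drop the nukta, treat
candrabindu as anusvara, write a nasal consonant before a stop as anusvara) are
chosen as examples, and the rules are deliberately coarse, so they may conflate
some genuinely different words.

```python
import re
import unicodedata

NUKTA = "\u093c"
CANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
# A nasal consonant plus virama standing before a stop consonant (ka..bha).
NASAL_CLUSTER = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u092d])")

def spelling_key(word):
    """Return a coarse key under which common spelling variants coincide."""
    text = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # ज़ and ज are treated alike
    text = text.replace(CANDRABINDU, ANUSVARA)   # ँ and ं are treated alike
    text = NASAL_CLUSTER.sub(ANUSVARA, text)     # न्द and ंद are treated alike
    return unicodedata.normalize("NFC", text)

# Two spellings of "Hindi": हिन्दी and हिंदी
hindi_conjunct = "\u0939\u093f\u0928\u094d\u0926\u0940"
hindi_anusvara = "\u0939\u093f\u0902\u0926\u0940"
print(spelling_key(hindi_conjunct) == spelling_key(hindi_anusvara))
```

Both the index terms and the query terms would have to be passed through the
same function for the keys to match.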
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
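A retrieval system can compensate for synonymy by expanding each query term with
its synonyms before matching. The following is a minimal sketch under stated
assumptions: the one-entry dictionary SYNONYMS is sample data for the ornament
example, and the names expand_query and matches are hypothetical. A real system
would draw its synonym sets from a lexical resource such as a Hindi WordNet.

```python
# Sample synonym dictionary; entries are illustrative, not a complete resource.
SYNONYMS = {
    "आभूषण": {"गहना", "जेवर", "अलंकार"},
}

def expand_query(terms, synonyms=SYNONYMS):
    """Return one set of acceptable words per query term."""
    return [{term} | synonyms.get(term, set()) for term in terms]

def matches(document_tokens, expanded_query):
    """True if every query term, or one of its synonyms, occurs in the document."""
    tokens = set(document_tokens)
    return all(group & tokens for group in expanded_query)

query = expand_query(["आभूषण"])
print(matches(["सोने", "का", "गहना"], query))    # document uses a synonym
print(matches(["सोने", "का", "सिक्का"], query))  # document uses neither form
```

Expansion of this kind raises recall at some cost in precision, since near
synonyms do not always carry the sense the user intended.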
4.8.4 Ambiguous Words
Ambiguous words reduce the relevancy of the results. The examples
below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to
sleep. In the context of the above query the word सोना means gold, but the query
can also be interpreted as "women like to sleep".
Another query: (In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common;
however, in Hindi it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in a
given context is an intricate task. One word can have more than one meaning, and
the meaning depends on the context of the sentence. For example, कर means tax,
with synonyms बम ज श लक स द भहस ऱ ट कस, in one context; hand or arm, with
synonyms हसत फ ह आच शफय, in another context; and to do (करना) in yet another.
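The idea of choosing a sense from context can be sketched with a very small
overlap-based disambiguator, in the spirit of the simplified Lesk approach. It
is an illustration only: the sense inventory SENSES, its cue words and the
function name pick_sense are assumptions invented for this example and are not
taken from this study.

```python
# Hypothetical sense inventory: each sense lists cue words that tend to co-occur.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "कीमत", "खरीदना"},
        "to sleep": {"रात", "नींद", "बिस्तर", "थकान"},
    },
}

def pick_sense(word, context_tokens, senses=SENSES):
    """Return the sense whose cue words overlap the context most, or None."""
    context = set(context_tokens)
    scores = {
        sense: len(cues & context)
        for sense, cues in senses.get(word, {}).items()
    }
    if not scores or max(scores.values()) == 0:
        return None  # unknown word, or no cue word present in the context
    return max(scores, key=scores.get)

# The bare query gives no cue; adding a context word can settle the sense.
print(pick_sense("सोना", ["नारी", "को", "सोना", "पसंद", "है"]))
print(pick_sense("सोना", ["नारी", "को", "सोना", "गहना", "पसंद", "है"]))
```

A short query often contains no cue word at all, which is one reason ambiguous
keywords lower the relevancy of search results.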
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways,
including the pronunciation of Hindi words. Many English words have been
localized in India, and some of them appear as if they were native Hindi words.
Indians are sometimes unable to find the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
uneducated Indians without being aware of the language those words come from.
Most Indians use these words in English rather than in their native language.
English has influenced Hindi not only in speech but in writing too, and this
becomes especially evident in Hindi literature on the web.
The influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from this set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental Patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.17 List of Hindi queries
Table 4.18 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 4.19 Precision values of the three search engines (Precision@10)
S No | Query | Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (chart: precision from 0 to 1 per query, with series P Google, P Bing, and P Guruji)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.
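The observation above suggests reducing query keywords to their root form before they are submitted. The following is a minimal sketch of that idea, assuming a small hand-picked list of plural and oblique noun suffixes. The suffix list, the function names `light_stem` and `stem_query`, and the correctly spelled Hindi example words are illustrative assumptions; this is not the method used by any of the three search engines, and a rule list this short will over-stem some words.

```python
# Illustrative suffix rules (an assumption, not a complete grammar).
# Each pair is (suffix to strip, replacement); longer suffixes come first.
SUFFIX_RULES = [
    ("ाओं", "ा"),   # समस्याओं -> समस्या (oblique plural)
    ("ाएँ", "ा"),   # भाषाएँ -> भाषा (direct plural)
    ("ाएं", "ा"),   # the same plural written with a bindu
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ियाँ", "ी"),  # नदियाँ -> नदी
    ("ों", ""),     # वनों -> वन
]

MIN_STEM_LENGTH = 2  # do not strip very short words down to nothing


def light_stem(word: str) -> str:
    """Return an approximate root by stripping at most one known suffix."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


def stem_query(query: str) -> str:
    """Stem every whitespace-separated keyword of a query."""
    return " ".join(light_stem(word) for word in query.split())


print(stem_query("दुर्घटनाओं के कारण"))
```

Keeping the rules in a plain list makes the trade-off visible: every added suffix raises recall for inflected forms but also raises the risk of conflating unrelated words.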
It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all of the sample queries in the table above.
From the above results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
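The precision measure used here can be stated compactly: it is the fraction of the top 10 results judged relevant. The sketch below shows one way to compute it. The function name `precision_at_k`, the treatment of result lists shorter than `k`, and the sample judgments are assumptions made for illustration, since the source does not describe how the judgments were recorded.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevance` holds one manual judgment per returned result, in rank order.
    If fewer than k results are returned, the missing ranks count as
    non-relevant (an assumption; the source does not cover this case).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(1 for judged in top_k if judged) / k


# Hypothetical judgments for one query: 5 of the top 10 results are relevant.
judgments = [True, False, True, True, False, False, True, False, True, False]
print(precision_at_k(judgments))
```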
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may write a keyword in any of its spellings, and search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision offered by Google for each query.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No 1 to Query No 5, showing the share of each phonetic variant of the query)
In the above table and figure it can be clearly seen that search engines return documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. For every phonetically equivalent keyword in a query the results vary, i.e. a different set of documents is retrieved, with very little repetition. From the precision charts it is clear that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
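One way to reduce the effect of such spelling variation is to map the variants of a keyword to a common lookup key before matching. The sketch below is a heuristic illustration only: the function name `spelling_key`, the three rules it applies (dropping the nuqta, writing chandrabindu as bindu, and writing a half nasal consonant as bindu), and the correctly spelled example words are assumptions, and the key is meant for grouping variants rather than for display.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below a letter, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं (bindu)
VIRAMA = "\u094d"        # halant

# A nasal consonant plus halant directly before another consonant,
# for example न् in आन्दोलन.
_HALF_NASAL = re.compile("[ङञणनम]" + VIRAMA + "(?=[क-ह])")


def spelling_key(text: str) -> str:
    """Map common spelling variants of a Hindi string to one lookup key."""
    # NFD splits precomposed nukta letters (such as U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", text)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = _HALF_NASAL.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


for variant in ["आन्दोलन", "आंदोलन", "आँदोलन"]:
    print(variant, "->", spelling_key(variant))
```

Because the rules deliberately over-merge, two unrelated words can occasionally share a key, so a sketch like this is better suited to widening a search than to exact matching.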
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament) has the following commonly spoken synonyms: गहन, ज वय, अर क य.
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
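Table 4.21 Effect of word synonyms on Hindi IR
S No | Query | Google documents | Google Precision@10 | Bing documents | Bing Precision@10 | Guruji documents | Guruji Precision@10
1 | स न क आब षण | standard Hindi word: आब षण | synonyms: गहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | standard Hindi word: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | standard Hindi word: सतर | synonyms: न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | standard Hindi word: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | standard Hindi word: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | standard Hindi word: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1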
Figure 4.3 Comparison of precision values against three search engines
From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.
From the above table and figure, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that Guruji has fewer documents available or indexes them poorly. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji. In the average case Google still ranks first, meaning that its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
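The role of synonyms described above can be illustrated with a small query-expansion sketch. The synonym table `SYNONYMS`, the function name `expand_query`, and the correctly spelled Hindi strings are hypothetical; a working system would draw its synonyms from a thesaurus rather than a hard-coded dictionary.

```python
from itertools import product

# Hypothetical synonym table keyed by the standard Hindi word.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the query followed by every variant that swaps in synonyms."""
    choices = [[word] + synonyms.get(word, []) for word in query.split()]
    return [" ".join(combination) for combination in product(*choices)]


variants = expand_query("सोने के आभूषण", SYNONYMS)
for variant in variants:
    print(variant)
```

Each variant can then be submitted separately and the result lists merged, which mirrors the way rows 1.1 to 1.4 of the table were obtained by hand.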
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in another context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.
Table 4.22 List of randomly selected ambiguous queries
S No | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war
The ambiguous queries listed in the table above were tested against three search engines: Google, Bing, and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10
From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied or that contain the keyword in some other, unwanted context. The results make clear that all of the search engines return documents from different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing, and all 10 documents for Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unwanted context (snake head), Bing returns 9 documents in that context, and Guruji fails to retrieve even a single document. In such scenarios it becomes important for search engines to address ambiguity in keywords in order to obtain better results.
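The counts in the three ambiguity tables can be summarized per engine as the average share of the top 10 results that fall in the intended context. The sketch below copies the counts from the tables and assumes that the first context column is the intended one for every query; the names `RESULTS` and `intended_context_precision` are illustrative, and a query with no results is skipped rather than scored as zero, which is also an assumption.

```python
# (first context, second context, other context) counts among the top 10
# results, copied from the ambiguity tables; None marks "no results found".
RESULTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_context_precision(rows, k=10):
    """Mean share of top-k results in the first (assumed intended) context."""
    scored = [row[0] / k for row in rows if row is not None]
    return sum(scored) / len(scored)


for engine, rows in RESULTS.items():
    print(engine, round(intended_context_precision(rows), 2))
```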
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.
Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the pattern of such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
English Words
Google Transliteration
Standard Hindi Keywords
Phonetic Equivalents Search Engine Google
Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800
Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710
Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820
Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100
Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000
0 537000
University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270
Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300
Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160
Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
3240 38000 639 1120
specialist टऩशसअशरटट
सऩ मशममरसट
सऩ शमरसट सऩ श मरसट
143 1440 51300 9820
Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. It is evident from the table above that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved in Hindi IR. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged. Example:
The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
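The two-level transformation can be sketched as a simple dictionary substitution. The mapping `ENGLISH_IN_HINDI`, the function name `transform_levels`, and the Devanagari spellings of the English words are hypothetical and chosen only for illustration; the software described in the next chapter may work differently. In this sketch Level 1 replaces the first replaceable keyword and Level 2 replaces every replaceable keyword.

```python
# Hypothetical mapping from Hindi keywords to English words in Hindi script.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query: str, mapping: dict[str, str]) -> tuple[str, str]:
    """Return the (Level 1, Level 2) variants of a Hindi query."""
    words = query.split()
    level1 = []
    replaced = False
    for word in words:
        if not replaced and word in mapping:
            level1.append(mapping[word])  # Level 1: swap one keyword only
            replaced = True
        else:
            level1.append(word)
    # Level 2: swap every keyword that has an English equivalent.
    level2 = [mapping.get(word, word) for word in words]
    return " ".join(level1), " ".join(level2)


first, second = transform_levels("विदेशी निवेश भारत में", ENGLISH_IN_HINDI)
print(first)
print(second)
```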
Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi Query | Influenced Hindi Query, Level 1 | Influenced Hindi Query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4 Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, comparing result counts for the original query, Level 1, and Level 2)
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, were used to retrieve web results.
Table 4.28 Analysis of precision values (tabular representation)
Hindi Query (HQ) | Level 1 (L1) | Level 2 (L2) | Google Precision@10 (HQ, L1, L2) | Bing Precision@10 (HQ, L1, L2) | AltaVista Precision@10 (HQ, L1, L2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter-precision values (chart: precision of Google, Bing, and AltaVista for each Hindi query, HQ1 to HQ6, and its Level 1 (L1) and Level 2 (L2) variants)
From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because the user may not be aware that such information exists on the web and may be unable to formulate the query variations that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of retrieving relevant information can be increased.
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of web search in the Hindi language [66][67].
The Hindi language has been enriched by Persian, Turkish, Farsi, Arabic, Portuguese, and English.
Today Hindi is widely spoken in South Asia (India, Pakistan, Nepal, and Bhutan), South Africa, Mauritius, the USA, Trinidad, Fiji, Surinam, Guyana, Yemen, Uganda, New Zealand, Malaysia, and Singapore [58].
4.2 Characteristics of Hindi Language and Devanagari Script
Hindi is written using the Devanagari script. Devanagari is also used to write other languages, such as Nepali and Marathi, and is the most common script used to write Sanskrit. Several other languages, such as Bengali, Punjabi, and Gujarati, have scripts that are related to Devanagari.
The Devanagari script represents the sounds of the Hindi language with remarkable consistency. Whereas many letters of the English alphabet can be pronounced in many different ways, the letters of the Devanagari script are pronounced consistently (with a few minor exceptions). Thus Devanagari is relatively easy to learn.
Devanagari consists of 11 vowels and 33 consonants and is written from left to right.
4.2.1 Basic Genius
Devanagari is not actually an alphabet but a so-called alphasyllabary. An alphasyllabary is a writing system which is primarily based on consonants and in which vowel symbols are requisite yet secondary. As such, the fundamental genius of Devanagari is that every consonant letter represents a consonant followed by an inherent schwa vowel, अ. For example, the letter स is read "sa". To suppress the inherent vowel, one of two methods is required: a diacritical mark called a halant, or a ligature called a conjunct. To indicate a vowel other than the inherent vowel, diacritical marks called maatraas are used. For vowels independent of consonants, there exist full letters to transcribe vowels.
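In Unicode text, the behaviour described above shows up as code-point sequences: a consonant letter on its own carries the inherent vowel, a following halant (named VIRAMA in Unicode) suppresses it, and a following maatraa replaces it. The short sketch below illustrates this with the letter स; the dictionary `syllables` and its romanized keys are illustrative names only.

```python
import unicodedata

SA = "\u0938"        # स, read "sa" (consonant plus inherent vowel)
HALANT = "\u094d"    # suppresses the inherent vowel
MATRA_II = "\u0940"  # dependent (maatraa) form of ई

syllables = {
    "sa": SA,              # स
    "s": SA + HALANT,      # स् (inherent vowel suppressed)
    "sii": SA + MATRA_II,  # सी (inherent vowel replaced by ई)
}

for reading, text in syllables.items():
    names = [unicodedata.name(ch) for ch in text]
    print(reading, text, names)
```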
4.2.2 Vowels
Hindi has 11 vowels. Ten of them are transcribed in two distinct forms: the independent form and the dependent (maatraa) form. The independent form is used when the vowel letter appears alone, at the beginning of a word, or immediately following another vowel letter. The dependent form is used when the vowel follows a consonant.
Vowels in Independent Form
अ आ इ ई उ ऊ ऋ ए ऐ ओ औ
The following table lists each vowel in its independent form with a description of its sound. The best way to learn the pronunciation is to learn from a native speaker.
Vowel Description
अ as in "but", "again"
आ as in "father", "far"
इ as in "fit", "hit"
ई as in "feet", "heat"
उ as in "put", "pull"
ऊ as in "pool", "shoot"
ऋ as in "rip", "rib"
ए as in "ate", "day"
ऐ as in "man", "bat"
ओ as in "go", "boat"
औ as in "saw", "taught"
Table 4.1 Vowels in independent form and their descriptions
4.2.3 Vowels in Dependent (maatraa) Form
When a vowel follows a consonant, it is written in its respective maatraa form, which is appended to the consonant. Maatraa forms never appear at the beginning of a word or after another vowel. The first vowel, अ, has no particular maatraa form. Instead, it is the default vowel: it is assumed to be present unless the maatraa form of another vowel is explicitly appended to a consonant. In Sanskrit the vowel अ is pronounced at the end of a word. In Hindi, however, it is not pronounced except at the end of single-letter words. The following table lists each vowel in its independent form, its corresponding dependent form, and how it appears with the consonant क (k).
Independent | Dependent | With क
अ | (none) | क
आ | ा | का
इ | ि | कि
ई | ी | की
उ | ु | कु
ऊ | ू | कू
ऋ | ृ | कृ
ए | े | के
ऐ | ै | कै
ओ | ो | को
औ | ौ | कौ
Table 4.2 Maatraa Forms of Vowels
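The correspondence between independent and dependent vowel forms is a fixed lookup, which the following sketch expresses as a dictionary. The names `MAATRAA` and `attach_vowel` are illustrative; the mapping itself follows the standard Devanagari vowel signs.

```python
# Independent vowel -> dependent (maatraa) sign; अ has no sign of its own.
MAATRAA = {
    "अ": "",
    "आ": "\u093e",
    "इ": "\u093f",
    "ई": "\u0940",
    "उ": "\u0941",
    "ऊ": "\u0942",
    "ऋ": "\u0943",
    "ए": "\u0947",
    "ऐ": "\u0948",
    "ओ": "\u094b",
    "औ": "\u094c",
}


def attach_vowel(consonant: str, vowel: str) -> str:
    """Write `vowel` after `consonant` using the vowel's dependent form."""
    return consonant + MAATRAA[vowel]


row = [attach_vowel("क", vowel) for vowel in MAATRAA]
print(" ".join(row))
```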
4.2.4 Allophones
As mentioned earlier, the distinction between the vowels इ and ई is the duration of the pronunciation of the vowel: the former is shorter and the latter longer. In practice, however, the vowel इ is pronounced more like the English "i" in the word "it", as described in the corresponding text. The same is so for the vowels उ and ऊ.
4.2.5 Final Schwa
The schwa अ is normally not pronounced at the end of a word. Thus कान is pronounced "kaan", not "kaana". An exception occurs when a word ends in a conjunct. In this case the word may be pronounced with a slight final schwa, as in मित्र, literally "mitr" but often pronounced like "mitr(a)", with a soft final schwa.
4.2.6 Monophthongs versus Diphthongs
Native English speakers should be careful not to pronounce the Hindi vowels that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a glide like the English "o" in the word "low". Many vowel letters in English can represent diphthongs. Thus, whereas English may represent a diphthong with the letter "i", as in the word "site", in Devanagari this diphthong would be more precisely transcribed as two monophthongs, आ and ई: साईट.
4.2.7 Schwa Syncope
Sometimes the inherent vowel is not pronounced despite its implicit presence and the lack of any modifying diacritic. This phenomenon is called schwa syncope or, alternatively, schwa deletion. For instance, consider the word नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if the word were written नम्कीन ("namkeen"). There is no rule which can predict this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants. Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".
Occasionally the schwa will not be totally deleted but will be very slightly pronounced.
4.2.8 Schwa Pronunciation in Context
The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the English "e" in the word "bed", but only in certain contexts, namely when अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".
4.2.9 Nasalization of Vowels
All vowels in Hindi can be nasalized except for ऋ. Nasalization is
indicated either by the symbol ं or by the symbol ँ. The former symbol is
called bindu (dot) and the latter symbol is called chandrabindu (moon and
dot). The bindu is used when part or all of the vowel symbol extends above the
horizontal line. The chandrabindu is used when no part of the vowel symbol
extends above the horizontal line. The bindu is more common in modern written
Hindi and may even be used exclusively.
The following examples summarize the use of the bindu and chandrabindu:
अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं
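Because the bindu may replace the chandrabindu entirely in modern writing, the same word can appear in a collection with either mark. A minimal sketch of one possible normalization step for indexing follows; the function name `fold_nasalization` is hypothetical, and folding both marks to the bindu is a design assumption of this sketch rather than something the text prescribes.

```python
CHANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU
BINDU = "\u0902"         # DEVANAGARI SIGN ANUSVARA, also used for the bindu


def fold_nasalization(text):
    """Replace every chandrabindu with a bindu so both spellings match."""
    return text.replace(CHANDRABINDU, BINDU)


# The two spellings of "hoon" become identical after folding.
with_chandrabindu = "\u0939\u0942\u0901"
with_bindu = "\u0939\u0942\u0902"
print(fold_nasalization(with_chandrabindu) == with_bindu)  # True
```

The same function would be applied to both documents and queries, so that a search typed with one mark still finds text written with the other.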
A special diacritic (ॉ) is sometimes used with the vowel आ to transcribe the English
o vowel sound, as in "college": कॉलेज.
4.2.10 Consonants: Velar Consonants
Letter Description
क unaspirated k
ख aspirated k
ग unaspirated g
घ aspirated g
ङ n as in "sing"
Table 4.3 Consonants: Velar Consonants
Note that the velar nasal consonant does not appear as the first letter of any word.
4.2.11 Palatal Consonants
Letter Description
च unaspirated ch as in "cheese"
छ aspirated ch
ज unaspirated j
झ aspirated j
ञ n as in "punch"
Table 4.4 Palatal Consonants
4.2.12 Retroflex Consonants
Letter Description
ट like t but retroflex and unaspirated
ठ like t but retroflex and aspirated
ड like d but retroflex and unaspirated
ढ like d but retroflex and aspirated
ण like n but retroflex
Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these
consonants are formed by placing a diacritical mark called a nuqta, which is a
subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is
pronounced by flapping the tongue from the retroflex position forward toward the
alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does
have an alveolar flap consonant, as in the t in "better" or the d in
"bedding" in American English. The Hindi flaps are retroflex, however.
4.2.13 Dental Consonants
Letter Description
त like t but dental and unaspirated
थ like t but dental and aspirated
द like d but dental and unaspirated
ध like d but dental and aspirated
न like n in "name" but dental
Table 4.6 Dental Consonants
4.2.14 Labial Consonants
Letter Description
प like p but unaspirated
फ like p but aspirated
ब like b but unaspirated
भ like b but aspirated
म m
Table 4.7 Labial Consonants
4.2.15 Semivowels
Letter Description
य y as in "young"
र like r but often rolled
ल l as in "lip"
व either w or v
Table 4.8 Semivowels
The Hindi r sound is typically a flap. However, some speakers may trill the r
sound occasionally, or may even occasionally pronounce it closer to an unflapped
approximant sound, like the English r in "red".
4.2.16 Sibilants
Letter Description
श sh as in "shave"
ष like sh but retroflex
स s as in "save"
Table 4.9 Sibilants
4.2.17 Glottal
Letter Description
ह like h but voiced
Table 4.10 Glottal
4.2.18 Allophony of w and v in Hindi
A phoneme is an equivalence class of atomic, discrete sounds: sounds from
different phonemes can produce a difference in meaning when spoken, yet sounds
within one phoneme cannot produce a difference in meaning when substituted for
one another. A phone is simply a distinct sound.
For instance, in English the p in the word "spit" and the p in the word "pit" are
pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they
are two distinct phones. However, they are both members of the same phoneme,
since substituting one for the other can never produce a difference in meaning,
even though the substitution may be perceived as slightly awkward by native
speakers. Two distinct phones which are both members of the same phoneme are
called allophones (from the Greek for "other sounds").
In Hindi, the sounds associated with the English letters w and v are
allophones. Both are transcribed with one letter, व. Analogously to the English
example above, these sounds are typically pronounced consistently in words, but
they do not constitute meaningful differences in utterances. For example, the
word वो is typically pronounced as "vo", whereas the suffix -वाला is typically
pronounced "wala". Hindi speakers are not generally aware of this distinction,
even though they pronounce the distinction fairly consistently, just as English
speakers are not aware of the differences of aspiration in certain letters yet
pronounce aspiration consistently.
Thus व may be pronounced as w or v. Some speakers may even
pronounce an intermediate sound.
Semi-Allophones j and z in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction
between the English j and z sounds either, but will typically pronounce words
consistently. This situation is not quite the same as w and v, since technically
the z sound can be represented distinctly from the j sound by placing a dot
(nuqta) underneath the letter, and some speakers are aware of this distinction. For
instance, the word जो is pronounced as "jo". There is some variation, however, in
some words such as ज्यादा: some speakers pronounce this as "zyada" and some
as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English t or d in Hindi. These English
sounds are pronounced with the tip of the tongue on the alveolar ridge behind the
top teeth. This place of articulation is between the Devanagari retroflex and dental
positions, although the English pronunciation will sound much closer to the
retroflex pronunciation to Hindi speakers. English loanwords containing t or d
are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding
symbols in Devanagari. These sounds are represented by placing the nuqta
underneath a symbol which is phonetically similar. These symbols represent
sounds from other languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter Approximation
क़ like k but pronounced in the back of the mouth
ख़ velar fricative, like the ch in German "Bach"
ग़ velar sound similar to ख़ but voiced
ज़ just as English z as in "zoo"
झ़ similar to the s in English "vision"
फ़ just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the
non-nuqta forms, though: ज़ and फ़.
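For retrieval, the nuqta raises two practical issues: the same letter can be stored either as one precomposed code point or as a base letter followed by the nuqta, and writers often omit the nuqta altogether. The sketch below uses Python's standard `unicodedata` module; the helper name `fold_nuqta` is hypothetical, and dropping the nuqta for matching is an assumption of this sketch, motivated by the observation that most nuqta forms are not pronounced distinctly.

```python
import unicodedata

NUQTA = "\u093C"  # DEVANAGARI SIGN NUKTA


def fold_nuqta(text):
    """Decompose nuqta letters and drop the nuqta, for loose matching."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


precomposed = "\u095B"     # za stored as a single code point
sequence = "\u091C\u093C"  # ja followed by the nuqta

# Normalization makes the two stored forms compare equal.
nfc = unicodedata.normalize
print(nfc("NFC", precomposed) == nfc("NFC", sequence))  # True

# After folding, the nuqta form matches the plain letter ja.
print(fold_nuqta(precomposed) == "\u091C")  # True
```

Whether to fold ज़ and फ़ as well is a judgment call, since those two are the forms that speakers do tend to distinguish.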
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ, Devanagari provides two means of
suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This
method is much more common. The halant is typically used only when
typographical constraints make it difficult to use conjuncts.
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct
contains a vertical line. The vertical line is deleted, and then the modified
consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द हिन्दी
च + छ = च्छ अच्छा
स + त = स्त नमस्ते
ल + ल = ल्ल बिल्ली
म + ब = म्ब लम्बा
फ़ + त = फ़्त मुफ़्त
क + य = क्य क्यों
Note that in the last two examples, although neither क nor फ़ ends in a vertical line,
they can still be the first letter of a horizontal conjunct. The curve on the right side
is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical
conjuncts with the following consonant. The first consonant is written on top of
the second consonant. For example:
ट + ट = ट्ट छुट्टी
ट + ठ = ट्ठ चिट्ठी
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant
is the first member of a conjunct, it may be written either using a regular
conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above
the horizontal line to the right side of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant;
in other words, it is articulated in the same location in the mouth as the following
consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant, ड, is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar
represents a consonant which is the first letter of a conjunct, whereas the bindu
and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be
considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not
considered a bindu, since it represents a consonant that is the first member of a
conjunct.
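The homorganic rule above is mechanical enough to sketch in code, which matters for retrieval because हिंदी and हिन्दी should match the same query. The following is a minimal sketch, assuming that only the five stop classes are handled; the table `VARGAS` and the function `expand_anusvara` are names invented for this illustration. An anusvar that is not followed by one of the listed consonants is left untouched, which also leaves a vowel-nasalizing bindu alone.

```python
ANUSVARA = "\u0902"
VIRAMA = "\u094D"  # the halant

# Each nasal consonant is paired with the stops of its place of articulation.
VARGAS = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # velar nasal: ka kha ga gha
    "\u091E": "\u091A\u091B\u091C\u091D",  # palatal nasal: ca cha ja jha
    "\u0923": "\u091F\u0920\u0921\u0922",  # retroflex nasal: tta ttha dda ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # dental nasal: ta tha da dha
    "\u092E": "\u092A\u092B\u092C\u092D",  # labial nasal: pa pha ba bha
}
NASAL_FOR = {stop: nasal for nasal, stops in VARGAS.items() for stop in stops}


def expand_anusvara(word):
    """Rewrite an anusvar as the explicit nasal + halant before a stop."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and following in NASAL_FOR:
            out.append(NASAL_FOR[following] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)


hindi_with_anusvar = "\u0939\u093F\u0902\u0926\u0940"
hindi_with_conjunct = "\u0939\u093F\u0928\u094D\u0926\u0940"
print(expand_anusvara(hindi_with_anusvar) == hindi_with_conjunct)  # True
```

An index could equally normalize in the opposite direction (conjunct to anusvar); what matters is that documents and queries are normalized the same way.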
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above
and to the right of the following consonant:
र + म = र्म शर्मा
र + ट + ई = र्टी पार्टी
As the second member of a conjunct, र is indicated by a diagonal line adjoined to
the vertical line of the preceding consonant:
क + र = क्र शुक्रिया
म + र = म्र उम्र
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted v, as follows:
ट + र = ट्र राष्ट्र
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are
treated as a single unit, and a maatraa that is written before its consonant is placed
before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line which
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
4.3 Unicode and Fonts
Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit
encoding of the characters used in the English language (it can represent at most 128
characters). With the passage of time, the need was felt for a single encoding that
could contain enough characters to accommodate all the languages of the world.
To enable sharing of information, this encoding would need to be a standard
accepted universally. That standard is Unicode. Unicode assigns each character a
unique number, called a code point, in the range U+0000 to U+10FFFF. Any code
point fits in a single 32-bit unit, and the range is large enough to give a unique
number to each character in all known languages.
Actually, there is another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the Universal
Character Set (UCS). Fortunately, the participants of both projects (ISO and
Unicode) realized around 1991 that two different unified character sets are not
exactly what the world needs. They joined their efforts and worked together on
creating a single encoding. Both projects still exist and publish their respective
standards independently, but have agreed to keep the encodings of the Unicode and
ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value will
be represented in bits when stored in a computer file or transmitted over the
Internet. The Unicode Standard defines three encoding forms that specify how a
particular character is represented in bits. The three
encoding forms allow the same data to be transmitted in a byte, word or double
word oriented format (i.e. in 8-, 16- or 32-bit code units). All three encoding forms encode
the same common character repertoire and can be efficiently transformed into one
another without loss of data. The three encoding forms as defined by the Unicode
Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable-length encoding of bytes. It
has the advantages that the Unicode characters corresponding to the familiar
ASCII set have the same byte values as ASCII, and that Unicode characters
transformed into UTF-8 can be used with much existing software without
extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact, and all the
heavily used characters fit into a single 16-bit code unit, while all other characters
are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single
code unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
Incidentally, UTF stands for UCS Transformation Format (the Unicode Standard
expands it as Unicode Transformation Format).
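To make the difference between the three encoding forms concrete, the short sketch below encodes the single Devanagari letter क (U+0915) with Python's built-in codecs and reports the size of each result. The helper name `encoded_sizes` is invented for this illustration, and the big-endian codec variants are chosen only so that no byte order mark is added.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each encoding form."""
    forms = ("utf-8", "utf-16-be", "utf-32-be")
    return {form: len(text.encode(form)) for form in forms}


ka = "\u0915"  # DEVANAGARI LETTER KA
for form in ("utf-8", "utf-16-be", "utf-32-be"):
    print(form, ka.encode(form).hex())
# utf-8 e0a495
# utf-16-be 0915
# utf-32-be 00000915

print(encoded_sizes(ka))   # {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_sizes("A"))  # {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
```

The comparison shows the trade-off described above: Devanagari text takes three bytes per character in UTF-8 but only two in UTF-16, whereas ASCII text is smallest in UTF-8.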
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other software
that rely on US-ASCII values but are transparent to other values. Any document
created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and
take from 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but
the current definition in RFC 3629 stops at 4). The most commonly used characters,
including all Devanagari characters, are at most three bytes long. Non-ASCII
characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing non-ASCII
characters are invalid under ASCII encoding (since all ASCII characters
stored in bytes have their most significant bit not set). This allows an application
to differentiate between ASCII and non-ASCII characters. Bytes representing
non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII
character indicates how many bytes follow for this character. All further bytes in
the multibyte sequence are continuation bytes that carry the remaining bits of the
character. [61]
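The byte layout just described can be written out by hand. The sketch below is a teaching aid, not a replacement for the built-in codec: `utf8_encode` and `sequence_length` are hypothetical names, and the encoder does not reject surrogate code points or values above U+10FFFF as a production encoder must.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 bytes (no validation of surrogates)."""
    if code_point < 0x80:          # 0xxxxxxx: plain ASCII
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 11110xxx plus three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def sequence_length(lead_byte):
    """Read the total sequence length from the first byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


encoded = utf8_encode(0x0915)  # DEVANAGARI LETTER KA
print(encoded.hex(), sequence_length(encoded[0]))  # e0a495 3
```

Every byte of the Devanagari sequence has its top bit set, which is exactly why ASCII-oriented software can pass such text through without mistaking any byte for an ASCII character.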
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from left to
right. They are all abugidas, in which most symbols stand for a consonant plus an
inherent vowel (usually the sound /a/). Word-initial vowels in many of these
scripts have distinct symbols, and word-internal vowels are usually written by
juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the
inherent vowel, when that occurs, is frequently marked with a special sign. In the
Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some
languages another designation is preferred. In Hindi, for example, the word hal
refers to the character itself, and halant refers to the consonant that has its inherent
vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally
serves to suppress the inherent vowel of the consonant to which it is applied; it is
a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the
south, and from Pakistan in the west to the easternmost islands of Indonesia, are
derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the
edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and
Brahmi. These are both ultimately of Semitic origin, probably deriving from
Aramaic, which was an important administrative language of the Middle East at
that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its
derivatives. The descendants of Brahmi spread with myriad changes throughout
the subcontinent and outlying islands. There are said to be some 200 different
scripts deriving from it. By the eleventh century, the modern script known as
Devanagari was in ascendancy in India proper as the major script of Sanskrit
literature. This northern branch includes such modern scripts as Bengali,
Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam
and Tamil.
The major official scripts of India proper, including Devanagari, are
all encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit encoding
of syllable-final consonants. Many of the character names in this group of scripts
represent the same sounds, and naming conventions are similar across the range.
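The virama model means that a conjunct is not stored as a separate character: it is stored as its consonants joined by the virama, and the font draws the ligature. A minimal sketch, in which the helper name `conjunct` is invented for this illustration:

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (the halant)


def conjunct(*consonants):
    """Join consonant letters with the virama to request a conjunct."""
    return VIRAMA.join(consonants)


ksha = conjunct("\u0915", "\u0937")  # ka + virama + ssa
print([f"U+{ord(ch):04X}" for ch in ksha])  # ['U+0915', 'U+094D', 'U+0937']
print(len(ksha))  # 3
```

A consequence for retrieval is that string length and character offsets count code points, not the visual units a reader sees, so a single displayed conjunct may span three or more positions in the stored text.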
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write
other related languages of India (such as Marathi) and of Nepal (Nepali). In
addition, the Devanagari script is used to write the following languages: Awadhi,
Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi
(Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. The principles of the Indic
scripts are covered in some detail in this introduction to the Devanagari script.
The remaining introductions to the Indic scripts are abbreviated but highlight any
differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from and is an update of earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The
same character code layout is followed for eight other Indic scripts in the Unicode
Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and
Malayalam. This parallel code layout emphasizes the structural similarities of the
Brahmi scripts and follows the stated intention of the Indian coding standards to
enable one-to-one mappings between analogous coding positions in different
scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other
scripts depart to a greater extent from the Devanagari structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these
scripts to the Devanagari order.
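Because the ISCII-derived blocks share one layout, a rough script-to-script mapping can be obtained by shifting code points from one block to another. The sketch below is only an approximation under stated assumptions: the function name `shift_script` is hypothetical, the block start values are those of the Devanagari, Bengali and Gujarati blocks, and no check is made that the shifted position is actually assigned in the target script, which is not always the case.

```python
DEVANAGARI_BASE = 0x0900
DEVANAGARI_END = 0x097F
BENGALI_BASE = 0x0980
GUJARATI_BASE = 0x0A80


def shift_script(text, target_base):
    """Move Devanagari code points to the same offset in another block."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if DEVANAGARI_BASE <= code_point <= DEVANAGARI_END:
            out.append(chr(code_point - DEVANAGARI_BASE + target_base))
        else:
            out.append(ch)  # leave spaces, digits, Latin text unchanged
    return "".join(out)


ka = "\u0915"  # DEVANAGARI LETTER KA
print(f"U+{ord(shift_script(ka, BENGALI_BASE)):04X}")   # U+0995
print(f"U+{ord(shift_script(ka, GUJARATI_BASE)):04X}")  # U+0A95
```

This is the sense in which the parallel layout "facilitates transliteration to some degree": the shift is trivial for the shared core letters, but script-specific letters and spelling conventions still need separate handling.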
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order. [62]
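The (((C)C)C)V structure can be approximated with a regular expression over Devanagari code points. The sketch below is a simplification for illustration, not the segmentation rules of the Unicode Standard: the character ranges and the names `SYLLABLE` and `orthographic_syllables` are assumptions of this sketch, and signs outside the listed ranges are simply skipped.

```python
import re

CONSONANT = "[\u0915-\u0939\u0958-\u095F]"
NUQTA = "\u093C?"
VIRAMA = "\u094D"
VOWEL_SIGN = "[\u093E-\u094C]"         # dependent vowel signs (maatraas)
MODIFIER = "[\u0901-\u0903]"           # chandrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

# Zero or more "consonant + virama" pairs, a final consonant, then either a
# virama or an optional vowel sign and modifier; or an independent vowel.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}{NUQTA}{VIRAMA})*{CONSONANT}{NUQTA}"
    f"(?:{VIRAMA}|{VOWEL_SIGN}?{MODIFIER}?)"
    f"|{INDEPENDENT_VOWEL}{MODIFIER}?"
)


def orthographic_syllables(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    return SYLLABLE.findall(word)


hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"  # ha, sign i, na, virama, da, sign ii
for piece in orthographic_syllables(hindi):
    print([f"U+{ord(ch):04X}" for ch in piece])
# ['U+0939', 'U+093F']
# ['U+0928', 'U+094D', 'U+0926', 'U+0940']
```

The first syllable also illustrates logical order: the vowel sign U+093F is stored after the consonant U+0939 even though it is drawn to its left.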
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The concept of a new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India,
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then, the Hindi
Unicode support has been a part of all subsequent upgrades of Microsoft's
operating systems. Also, providing Input Method Editor facilities gives users the
option to use different types of keyboards," says Meghashyam Karanam, product
manager, vision and localization, at Microsoft India. The earlier system, ASCII, could
represent only 128 characters, which is not enough for the Hindi
Devanagari script. Unicode's original 16-bit design allowed about 65,000 characters,
and its code space has since been extended to more than a million code points. As
most computers in India use Microsoft's operating system, this ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office, as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.
If Microsoft built the base for
Hindi, Google was ready to put up the superstructure. Realizing the potential of
Indian languages, the California-based company has launched various products in
the past two years. With the Google Hindi and Urdu search engines, one can
search all the Hindi and Urdu Web pages available on the Internet, including
those that are not in a Unicode font. "Google offers searching in 13 languages
(Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five
languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most
recent language that Google has added to its offerings," says Rahul
Roy-Chowdhury, product manager at Google India. To use the search function, "users
can type Hindi words in Roman script and a drop-down menu suggests several
Hindi phrases. By selecting the appropriate query, users can search for Hindi
content without even typing in Hindi," says Roy-Chowdhury. Google has more
useful tools for non-English users. Google News is available in Hindi. With the
Google translation engine, one can type English words and get a list of suggested
synonyms in Hindi. A transliteration tool allows users to type any word in
English, hit the space bar and get the same word in a different language.
Roy-Chowdhury explains the process of adding a new language.
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds.
The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles.
What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, out of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
US, it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership, as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus
of Indian English. It was developed under the leadership of Prof.
S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language corpora.
It has coordinated with various Indian agencies and universities in
developing corpora of more than 45 million words in the Scheduled Languages of India, which
is also a part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides corpora, architecture and tools for Asian
languages. It has a monolingual corpus which contains approximately 96,157,000
words and a parallel corpus which consists of 200,000 words of text in English and
helps in the translation of Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus which is a repository of "one million pages" of
knowledge-based text. Mahatma Gandhi International University has started the
project "Hindi Samghraha" for a repository of a Hindi words database and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively
young. Early efforts in this field date from the 1980s and involved prominent
institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and
CDAC Pune. During the late 1990s many new projects were undertaken by IIT
Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University
Kolkata. In April 2008 TDIL started a consortium-mode project for building
computational tools and Sanskrit-Hindi MT under the leadership of Amba
Kulkarni (University of Hyderabad). The goal of this project is to build children's
stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
for translating from English to Indian languages under the leadership of Prof.
R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and
54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua
named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language.
The architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available at http://anglahindi.iitk.ac.in. The Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with new
approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It aims at a fully automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can translate text
from one Indian language into another, based on Panini's Ashtadhyayi (grammar
rules). It is being developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was conceived by the Indian
Government in 1996 for translating government orders, notifications, circulars
and legal documents from English to Hindi. The main goal was to provide
translation tools to government agencies. Mantra software is available in desktop,
network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar
(LTAG) formalism to represent both English and Hindi grammar. Initially it was
domain specific, covering Personnel Administration and specifically Gazette
Notifications, Office Orders, Office Memorandums and Circulars. The domains
were gradually expanded, and at present it also covers domains such as Banking,
Transportation and Agriculture. Earlier, Mantra technology supported only
English to Hindi translation, but it is now also available for English to other
Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a
system for translating parliamentary proceedings such as Papers to be Laid on the
Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and
Synopsis. The Secretariat of the Rajya Sabha (the upper house of the Parliament of
India) provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based
machine translation system for English to Hindi. UNL is a United Nations project
for developing an interlingua for the world's languages. UNL-based machine
translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). The project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user
is not satisfied with the translated sentence generated by the system, the user can
supply new words, phrases and sentences as feedback and obtain a newly
interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based
approach with statistical methods. It is used for translating English into Indian
languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at
http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The system is
based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access the
system at http://www.aukbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn Treebank
annotation system for part-of-speech tagging. It translates an English sentence
into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007 Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to
Hindi machine translation system. The system is built on the paradigm of foreign
machine translation systems such as RUSLAN and CESILKO. The system
architecture consists of three processing modules: pre-processing, translation
engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two IIT
Delhi graduates, Anurag Dod and Gaurav Mishra, and backed by Sequoia Capital.
Guruji.com uses crawling technology based on proprietary algorithms. For any
query it searches deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to Indian
languages. The Google Transliteration Input Method Editor is currently available
for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software localization
and website localization. It is also involved in research and development of corpus
creation and collection and of content syndication. Moreover, it provides language
consultancy. It has developed various applications in Indian languages, such as
My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail,
Greetings, Classifieds, Quiz, Quest and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the
development of Indian language software. It provides Indian language enablement
technology to many state governments and to the central government for
e-governance programs. It has developed software for multilingual content
creation for newspaper publishing and high-quality Unicode-based fonts for major
Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for
the Gujarati language, which is useful in the DTP sector, corporate offices and the
e-governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970 the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently ISCII (Indian Script Code
for Information Interchange) was developed for Indian languages on the pattern
of ASCII (American Standard Code for Information Interchange). Indian
languages Transliteration (ITRANS) was developed by Avinash Chopde; ITRANS
represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005).
The Department of Information Technology under the Ministry of Communication
and Information Technology is also working towards the proliferation of language
technology in India. Other Indian government ministries, departments and
agencies, such as the Ministry of Human Resource, DRDO (Defence Research and
Development Organisation), the Department of Atomic Energy, the All India
Council of Technical Education and the UGC (University Grants Commission), are
also involved directly and indirectly in research and development of language
technology. All these agencies help develop important areas of research and
provide research funds to development agencies. As an end result, IndoWordNet
was developed for the Indian languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for
Indian Languages) program. TDIL sets the major and minor goals for Indian
language technology and provides standards for language technology. The TDIL
journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term
goals for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very
fast. In the last year alone more than 10 to 15 Indian local search engines must
have been launched, some smaller and some bigger, some with huge funding and
some with none. This space is so crowded right now that it is difficult to know who
is really winning. However, we attempt a brief overview of the current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefindin, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP
(Unique Selling Proposition). It is too early to pass judgment on any of them.
These are testing stages, and every start-up is adding new features and improving
its services.
4.6.1 Most Used Search Tools in India for web activity: a Survey by
JuxtConsult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google  | 37         | 35
Yahoo   | 32         | 25
Rediff  | 7          | 4
Orkut   | 6          | 7
More info: India Online 2008, India Online 2009 [65]
Table 4.12: Most Used Websites
4.6.1.2 Info Search English
Website   | 2008 Stats | 2009 Stats
Google    | 81         | 76
Yahoo     | 7          | 7
Wikipedia | 3          | 6
English   | 3          | 4
More info: India Online 2008, India Online 2009 [65]
Table 4.13: Info Search English
4.6.1.3 Info Search Local Language
Website   | 2008 Stats | 2009 Stats
Google    | 65         | 34
Yahoo     | 12         | 29
Rediff    | 4          | 15
Teluguone | 2          | 0
Guruji    | 1          | 18
Raftaar   | 1          | 0.2
Hindi     | 1          | NA
Webdunia  | 1          | 0.7
Khoj      | 0.7        | 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14: Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies,
applying popular present-day techniques, shows a severe problem of low recall
when information is accessed using Indian language queries. For instance, popular
web search engines such as Google, Yahoo and Guruji often return 0 search results
for Indian language queries, giving the impression that no documents containing
this information exist. In reality these search engines face a recall problem when
dealing with Indian languages because of multiple spellings, morphological
variants of keywords and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, the Hindi query for "world trade center
aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that documents with these keywords exist. But just saying that we have a
recall problem may not be sufficient. The next obvious question is how much of a
problem it is. In other words, we need to quantify the problem. For this purpose we
conducted many experiments to determine at what levels this recall problem
occurs and by how much.
Table 4.15: Problems faced while searching in Hindi (low recall)
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16: Improved Recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are often not followed, for various reasons. One reason is the
language diversity of India. Including Hindi, about 28 languages are spoken in
India, and Hindi, being the national language of India, is influenced by the
regional languages, which changes its dialects not only in speech but also in
writing. Every language attaches markers to a root word to construct new words
(English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For
example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological
variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to
combine all the morphological variants of a word into a single canonical form.
This process is called word stemming, and the canonical form is called the root
word or base word.
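The stemming step described above can be sketched as a small suffix-stripping
routine. This is only an illustration under stated assumptions, not the method of
any search engine discussed in this chapter: the rule list SUFFIX_RULES and the
function stem are hypothetical names introduced here, and the rules cover just a
few plural and oblique endings such as those in Yojnaaon and Yojnaayein.

```python
# Hypothetical suffix rules (suffix -> replacement), longest suffix first.
# They cover only a few plural/oblique endings and are purely illustrative.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -iyon -> -i (rogiyon -> rogi)
    ("\u0913\u0902", ""),                    # -on as a full vowel (yojnaaon)
    ("\u090f\u0902", ""),                    # -en with anusvara (yojnaayein)
    ("\u090f\u0901", ""),                    # -en with chandrabindu
    ("\u094b\u0902", ""),                    # -on as a vowel sign
    ("\u0947\u0902", ""),                    # -en as a vowel sign
]

MIN_STEM_LENGTH = 2  # do not strip very short words down to nothing


def stem(word):
    """Return a crude root form by stripping the first matching suffix."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएं"]:
        print(variant, "->", stem(variant))
```

A real Hindi stemmer would need a much larger rule set and exception handling;
the point here is only that variants sharing a root can be mapped to one index key.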
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, transliteration of proper names,
words borrowed from regional and foreign languages, and the phonetic variety in
the Indian language alphabet. Together these have resulted in spelling variations
of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning "English")
has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word "school" can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword for school (स्कूल) and for
a non-standard phonetically equivalent keyword (सक र), Google shows 69 million
results for the former and 14 million for the latter. Hindi is influenced by other
regional languages, which results in phonetic variety of words. For example, the
English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल)
by a large part of the population in different states of India. For the Hindi word
ISKOOL (इस्कूल) more than two thousand results are found. Search engines should
be capable of retrieving results for the phonetic equivalents of the keywords
entered. A user may search with any of these keywords, and search engines should
support all phonetically equivalent words.
Also, no particular standard exists for writing the keyword used to fetch Hindi web
data. For every phonetically equivalent keyword in the query the results vary,
i.e. a different set of documents is retrieved with little repetition. The native Hindi
user may not be aware of the phonetic issues in Hindi IR and may miss information
relevant to his or her need.
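One way a retrieval system could treat phonetically equivalent spellings as the
same keyword is to reduce each word to a normalized key before indexing and
querying. The sketch below is an assumption-laden illustration, not a description
of how Google, Yahoo or Guruji work: the function phonetic_key and the four
folding rules (drop the nukta, fold chandrabindu into anusvara, fold a half nasal
consonant into anusvara, shorten long i and u vowel signs) are choices made here
for demonstration.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"

# A nasal consonant plus virama directly before another consonant,
# as in the "n" of aandolan written with a half letter.
HALF_NASAL = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")

# Long vowel signs ii and uu folded to their short counterparts.
LONG_TO_SHORT = str.maketrans({"\u0940": "\u093f", "\u0942": "\u0941"})


def phonetic_key(word):
    """Map spelling variants of a Devanagari word to one comparison key."""
    text = unicodedata.normalize("NFD", word)      # split precomposed nukta letters
    text = text.replace(NUKTA, "")                 # ignore nukta differences
    text = text.replace(CHANDRABINDU, ANUSVARA)    # one nasalization mark
    text = HALF_NASAL.sub(ANUSVARA, text)          # half nasal -> anusvara
    return text.translate(LONG_TO_SHORT)           # ignore vowel length in i/u


if __name__ == "__main__":
    variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
    print({phonetic_key(v) for v in variants})
```

The key is deliberately lossy: it is meant for matching variants, not for display,
and it will occasionally merge words that are genuinely different.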
4.8.3 Words Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
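The role of synonyms can be illustrated with a small query-expansion sketch: each
keyword that has known synonyms is replaced in turn, producing one query per
combination. The function expand_query and the tiny synonym table are
hypothetical and introduced only for illustration; a real system would draw
synonyms from a lexical resource rather than a hand-written dictionary.

```python
from itertools import product


def expand_query(query, synonyms):
    """Return the query plus every variant obtained by swapping in synonyms."""
    # For each token, the original word comes first, then its synonyms.
    options = [[token] + list(synonyms.get(token, [])) for token in query.split()]
    return [" ".join(choice) for choice in product(*options)]


if __name__ == "__main__":
    # Illustrative synonym table (an assumption, not a complete resource).
    synonym_table = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}
    for variant in expand_query("सोने के आभूषण", synonym_table):
        print(variant)
```

Submitting all variants and merging the result lists is one simple design; the
number of variants grows multiplicatively with the number of expandable
keywords, so a practical system would cap or rank them.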
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples below
show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it has another meaning,
"to sleep". In the context of the above query the word सोना means gold, but the
query can also be interpreted as "women like to sleep".
Another query: (In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means "common".
However, in Hindi it also means "mango", so the query can also be interpreted as
"mango is people's choice".
Many words are polysemous in nature. Finding the correct sense of a word in a
given context is an intricate task. One word can have more than one meaning, and
its meaning depends on the context of the sentence. For example, कर (tax) has the
synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कर
(hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कर
(to do) corresponds to करना.
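A minimal sketch of why such queries are hard: a simple sense picker that counts
context cue words can only decide when the query contains a cue. Everything
below is hypothetical and for illustration only, including the function
guess_sense and the cue words in SENSE_CUES; it is not the disambiguation method
of any system discussed in this chapter.

```python
# Hypothetical sense inventory: word -> {sense label: set of cue words}.
SENSE_CUES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "कीमत"},
        "to sleep": {"रात", "नींद", "बिस्तर"},
    },
}


def guess_sense(word, context_tokens, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the context.

    Returns None when the word is unknown or no cue word is present,
    i.e. when the query stays ambiguous.
    """
    candidates = sense_cues.get(word)
    if not candidates:
        return None
    context = set(context_tokens)
    scores = {sense: len(cues & context) for sense, cues in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    query = "नारी को सोना पसंद है".split()
    print(guess_sense("सोना", query))
```

For the example query no cue word is present, so the sketch cannot choose
between the two readings, which mirrors the ambiguity described above.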
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways; it
has even affected the pronunciation of Hindi words. Many English words have
been localized in India, and some of them appear as if they were native Hindi
words. Indians are sometimes unable to find the Hindi equivalent of an English
word. For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
uneducated Indians without awareness of which language the words come from.
Most Indians use these words in English rather than in their native language.
English influences Hindi not only in speech but in writing too, and this becomes
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from that set, which throw light on the
effect of the root word on the performance of Hindi language search engines.
Table 4.18 shows examples of the effect of morphological factors on Hindi
queries.
Table 4.17: List of Hindi queries
S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18: Effect of morphological factors on Hindi queries
Columns: S. No. | Root words | Listing of keywords (Google, Bing, Guruji) |
Morphological variants | Documents returned (Google, Bing, Guruji)
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 4.19: Precision values of the three search engines (precision at 10)
S. No. | Query | Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1: Precision values of the three search engines
[Chart: precision from 0 to 1 on the vertical axis, query numbers 1.1 to 11.1 on
the horizontal axis, series P Google, P Bing and P Guruji]
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching for documents
with the root word, because in general we get better results with keywords in
their root form.
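The size of this effect can be expressed as the share of documents that a variant
query returns relative to the root-form query. The helper below is a hypothetical
illustration; the two numbers are the Bing counts from Table 4.18, rows 3.1 (root
form) and 3.2 (variant).

```python
def variant_share(root_hits, variant_hits):
    """Return the variant's hit count as a fraction of the root form's count."""
    if root_hits <= 0:
        raise ValueError("root_hits must be positive")
    return variant_hits / root_hits


if __name__ == "__main__":
    # Bing: 8330 documents for the root form, 935 for the variant.
    print(round(variant_share(8330, 935), 3))
```

A share well below 1 means the variant query reaches only a small part of the
documents found with the root form.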
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied in almost all
the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root
form. Bing and Guruji do not appear to index in that form, which would explain
why they retrieve fewer documents than Google. The overall comparison of results
from the three search engines in the tables above shows that, in general, the
quantity of results retrieved increases when keywords are used in their root form.
For search engines the quality of results is more important than the quantity.
Figure 4.1 and Table 4.19 compare the precision values of the three search
engines. The precision value is calculated from the top 10 results of each search
engine. On closely observing the results we can say that the precision value for
Google is high in almost all queries. Since, as mentioned above, Google appears to
index keywords in their root form, the relevance of its results is also higher than
that of the other two search engines. This indicates that not only the quantity but
also the quality of results is affected by morphological variations in the
keywords.
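The precision measure used in these tables can be written down in a few lines.
The sketch assumes that each of the top 10 results has been judged relevant (1) or
not relevant (0); how the judgments were made is not stated in the text, and the
function name precision_at_k is introduced here only for illustration.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results that were judged relevant.

    relevance_flags: 1 for a relevant result, 0 otherwise, in ranked order.
    Missing results (fewer than k returned) count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance_flags[:k]
    return sum(top_k) / k


if __name__ == "__main__":
    judged = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]  # hypothetical judgments
    print(precision_at_k(judged))
```

Dividing by k rather than by the number of results returned penalizes engines
that return fewer than 10 documents; this is one possible convention, chosen here
as an assumption.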
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for the phonetic
equivalents of the keywords entered. A user may search with any of these
keywords, and search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. The table below shows the results and precision offered
by the search engine.
Table 4.20: Results of the search engine on phonetic nature of Hindi language
Columns: Google results for query having keywords | No. of results | Precision at 10
Query 1 (standard keywords): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9
Query 2 (standard keywords): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7
Query 3 (standard keywords): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8
Query 4 (standard keywords): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0
Query 5 (standard keywords): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5
Figure 4.2: Precision charts for phonetic nature of Hindi language
[Five pie charts, Query No. 1 to Query No. 5, one slice per phonetic variant of
each query]
From the above table and figure it can be clearly seen that search engines return
only a handful of documents for various phonetically equivalent Hindi queries. No
particular standard exists for writing the keyword used to fetch Hindi web data.
For every phonetically equivalent keyword in the query the results vary, i.e. a
different set of documents is retrieved with little repetition. From the precision
charts it is clear that the degree of relevance for queries containing phonetically
equivalent keywords is almost the same. The native Hindi user may not be aware
of the phonetic issues in Hindi IR and may miss information relevant to his or her
need.
4.11 Discussion: Words Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
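Table 4.21: Effect of word synonyms on Hindi IR
Columns: S. No. | Query | Documents returned and precision at 10 for Google |
Bing | Guruji (documents, precision)
Query 1: स न क आब षण (standard Hindi words and synonyms: आब षणगहन)
1.1 | स न क आब षण | 217000, 0.8 | 3250, 0.7 | 381, 0.5
1.2 | स न क गहन | 188000, 0.8 | 2590, 0.6 | 389, 0.5
1.3 | स न क ज वय | 78900, 0.8 | 1670, 0.7 | 311, 0.6
1.4 | स न क अर क य | 9490, 0.5 | 633, 0.4 | 70, 0
1.5 | स न क आबयण | 493, 0.5 | 38, 0.3 | 1, 0
Query 2: क र फ दर (standard Hindi words and synonyms: फ दर)
2.1 | क र फ दर | 233000, 0.7 | 7510, 0.7 | 733, 0.3
2.2 | क र भ घ | 40700, 0.9 | 1500, 0.8 | 99, 0.6
2.3 | क र जरधय | 1570, 0.6 | 54, 0.6 | 2, 0.2
Query 3: सतर सशिकतकयण (standard Hindi words and synonyms: सतर न यी)
3.1 | सतर सशिकतकयण | 9950, 0.9 | 1570, 0.7 | 760, 0.6
3.2 | न यीसशिकतकयण | 29300, 0.9 | 1910, 0.9 | 736, 0.4
3.3 | भहहर सशिकतकयण | 96300, 0.9 | 5160, 0.8 | 1091, 0.3
3.4 | औयतसशिकतकयण | 7670, 0.8 | 680, 0.7 | 510, 0.2
Query 4: मसक दयक अ हक य (standard Hindi words and synonyms: अ हक य)
4.1 | मसक दयक अ हक य | 1990, 0.4 | 18, 0.3 | 60, 0.6
4.2 | मसक दयक अमबभ न | 2400, 0.6 | 304, 0.2 | 16, 0.1
4.3 | मसक दयक घभ ड | 495, 0.5 | 54, 0.6 | 9, 0.1
Query 5: व रग ओ (standard Hindi words and synonyms: व)
5.1 | व रग ओ | 6960, 1 | 698, 0.8 | 29, 0.5
5.2 | ऩ डरग ओ | 13400, 1 | 1080, 0.9 | 143, 0.9
5.3 | दयखतरग ओ | 481, 0.5 | 19, 0.5 | 0, 0
Query 6: आ खद न (standard Hindi words and synonyms: आ ख)
6.1 | आ खद न | 34000, 0.7 | 3690, 0.5 | 312, 0.3
6.2 | न तरद न | 77500, 1 | 3240, 1 | 159, 0.9
6.3 | च द न | 2450, 0.8 | 427, 0.1 | 36, 0.1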
Figure 4.3 Comparison of precision values against three search engines
The examples above show that using Hindi keywords together with their synonyms
improves retrieval for a query in Hindi. Synonyms affect not only the quantity
of documents returned but also their quality.
The table and figure also show that Google returns more documents than the
other two search engines and that Guruji returns the fewest, possibly because
fewer documents are available to it or because its indexing is poor. However,
we are interested in the quality of results rather than the quantity. In terms
of quality, Google and Bing clearly provide better results than Guruji, and in
the average case Google ranks first: its precision values are higher than those
of Bing and Guruji. Replacing a keyword with its synonym therefore yields
equivalent results, and synonyms of keywords play an important role in Hindi
information retrieval.
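The synonym experiment can be sketched in a few lines of Python. This is only an illustration, not the software described in the next chapter: the names SYNONYMS, expand_query and precision_at_k are hypothetical, the synonym list is an assumed Unicode reading of the ornament example, and relevance judgements are assumed to be supplied by a human assessor.

```python
from typing import Dict, List, Sequence

# Hypothetical synonym table (assumed spellings of the ornament example).
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    variants = [query]
    for word in words:
        for alt in synonyms.get(word, []):
            variants.append(" ".join(alt if w == word else w for w in words))
    return variants


def precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the first k results that were judged relevant."""
    if k <= 0:
        return 0.0
    return sum(list(relevant)[:k]) / k
```

Each variant returned by expand_query would be submitted to the search engine separately, and precision_at_k would be computed from the manual relevance judgements of its first ten results.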
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are
presented in Table 4.22. The second column contains the query in Hindi, the
third column holds the ambiguous keyword in one context and the fifth column
holds the same keyword in another context. The fourth and sixth columns give
the English meaning of the query for the respective context of the ambiguous
keyword.
Table 4.22 List of randomly selected ambiguous queries
S.No. | Query | Keyword as | In English | Keyword as | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war
The ambiguous queries listed above were tested against three search engines:
Google, Bing and Guruji. The results are shown in the tables below. In each
table the two "Results found" columns count, among the first 10 results, the
documents in each context, and "Other context" counts the remaining documents.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
The results in the tables show that all three search engines return documents
without differentiating between the contexts of the keyword in the query. The
last column, labelled "Other context", holds the number of results that are not
relevant to the query supplied, i.e. documents that contain the keyword in some
other, unwanted context. All three search engines return documents in different
contexts, so it can be said that search engines underperform when supplied with
ambiguous queries. The numbers in the "Other context" column signify the
deviation from relevance. For example, for the query युद्ध में कुल विनाश
(aggregate destruction in wars) the "Other context" column contains 5 documents
for Google, 8 documents for Bing and all 10 documents for Guruji.
For the query सपेरों का फन (art of snake charmers), which can also be read as
"snake charmer's snake head", the retrieved documents are expected to be in the
context "art". However, Google returns all 10 documents in the unwanted context
(snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single
document. It is therefore important for search engines to address the issue of
ambiguity in keywords in order to obtain better results.
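The deviation from relevance discussed for the ambiguity tests can be expressed as a simple share of the first ten results. The sketch below is illustrative only: the dictionary RESULTS and the function context_shares are hypothetical names, and the counts are copied from the rows for the query about aggregate destruction in wars.

```python
from typing import Dict, Tuple

# (first context, second context, other context) counts among the first ten
# results for query 5, as reported for each search engine.
RESULTS: Dict[str, Tuple[int, int, int]] = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def context_shares(counts: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Convert raw counts into shares of the inspected results."""
    total = sum(counts)
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(c / total for c in counts)


for engine, counts in RESULTS.items():
    first, second, other = context_shares(counts)
    print(f"{engine}: other-context share = {other:.1f}")
```

A share of 1.0 in the last position means that none of the inspected results matched either intended reading of the keyword.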
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, and this is
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval, as the following example illustrates.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). The
word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज,
एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety
of popular keywords and queries from various domains were tested in the
experiments.
Table 4.26 presents a sample set of common and popular English keywords along
with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
(Google results are listed in the order of the four Hindi forms)
English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results
Woman | वोभन | व भन | व भ न | व भ न | 42200; 101000; 3520; 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220; 246000; 1390; 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000; 35700; 6490; 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000; 263200; 2440; 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890; 368000; 1110; 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000; 1040000; 2610000; 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540; 1070000; 1420; 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600; 735000; 97000; 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300; 125000; 2510; 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240; 38000; 639; 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143; 1440; 51300; 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000; 8730; 182; 8
Table 4.26 shows that the search engine returns documents not only for the
single-keyword query but also for every phonetic variant of the keyword, often
in very large numbers. It also shows that people have their own ways of
writing Hindi words and that no standard is followed for storing Hindi data on
the web: documents are retrieved for every phonetic variant of an English
keyword written in Hindi script. The first Hindi column of the table (set in
bold in the original) holds the keywords obtained with the Google
transliteration tool. The table shows that transliteration does not provide
the correct Hindi word in most cases. For example, the correct transliteration
of the word University should be म तनवमसमटी, whereas Google transliteration
provides the word उतनव मसमतम, which is completely wrong. Nevertheless, 1540
documents are retrieved for the wrong keyword उतनव मसमतम (University), and the
same holds for other single-word queries: Insurance इनस य स, Parliament
ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. This
suggests that Hindi website developers use unchecked and non-standard
transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on the precision and quantity of the documents retrieved.
Each Hindi query is transformed into its variants by the software (its design
and working are discussed in the next chapter), which replaces Hindi keywords
with English keywords written in Hindi without changing the meaning of the
query.
Table 4.26 (continued)
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609; 395000; 69700; 289000
The queries are converted at two levels. At the first level one Hindi word is
replaced by its English equivalent, and at the second level more than one word
is replaced by its English equivalent, in both cases without changing the
meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन)
and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two
levels as follows:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
The original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein)
supplied by the user is thus transformed into two equivalent forms containing
a mixture of English and Hindi, while the meaning of the query remains the
same. From the sample set of one hundred queries, some randomly selected
queries are presented below in Table 4.27.
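The two-level transformation can be sketched as follows. This is a minimal illustration and not the software described in the next chapter: the mapping LOANWORDS, the constant QUERY and the function transform are hypothetical, and the Devanagari spellings of the English words are assumptions. Unlike the example above, the sketch enumerates every possible single-word replacement at level 1.

```python
from itertools import combinations
from typing import Dict, List

# Hypothetical mapping from Hindi keywords to English words in Hindi script.
LOANWORDS: Dict[str, str] = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
QUERY = "विदेशी निवेश भारत में"


def transform(query: str, loanwords: Dict[str, str]) -> Dict[int, List[str]]:
    """Level 1 replaces exactly one keyword, level 2 replaces more than one."""
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in loanwords]
    levels: Dict[int, List[str]] = {1: [], 2: []}
    for n in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, n):
            variant = [loanwords[w] if i in chosen else w
                       for i, w in enumerate(words)]
            levels[1 if n == 1 else 2].append(" ".join(variant))
    return levels


for level, variants in transform(QUERY, LOANWORDS).items():
    for variant in variants:
        print(level, variant)
```

Words without an entry in the mapping (here भारत and में) are left unchanged, so the meaning of the query is preserved as long as the mapping itself is correct.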
Table 4.27 Queries transformed into two equivalent senses containing a mixture
of both English and Hindi (tabular representation; Google search results)
In English | Hindi query | Level 1 | Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000
Figure 4.4 Queries transformed into two equivalent senses (chart of Google
result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants)
The table and figure above show that documents are returned for the original
as well as for the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines, however, the quality
of results is more important than the quantity, so Table 4.28 and Figure 4.5
are presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, were used to retrieve the web
results.
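Table 4.28 Analysis of precision values (tabular representation; precision at
10 for the Hindi query and its Level 1 and Level 2 influenced variants)
Query | Form | Hindi text | Google | Bing | AltaVista
1 | Hindi query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
2 | Hindi query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
3 | Hindi query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
3 | Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1
4 | Hindi query | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6
4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
5 | Hindi query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
5 | Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
5 | Level 2 | फॉरेन इन्वेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
6 | Hindi query | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1
6 | Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1
6 | Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1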
Figure 4.5 Analysis of inter-precision values
The tables and figures above show that Hindi data of a similar nature can be
mined for Hindi queries by transforming them into variants that include
English keywords written in Hindi. Transforming the queries increased the
amount of data retrieved. The relevance of the retrieved data can be seen in
the precision columns: for every Hindi query and its transformed variations,
the degree of relevance of the documents is very close, equal or improved. For
example, for the Hindi query स्वास्थ्य और रक्तदान 9 of the first 10 documents
are relevant, and for the transformed queries of similar nature, हेल्थ और
रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also
relevant. The same pattern repeats for the rest of the Hindi queries, as shown
in the table above. Without transformation of Hindi queries the user may miss
relevant information, because a Hindi user may not be aware that such
information exists on the web and may be unable to formulate the query
variations that reflect the influence of English. The table also indicates
that English-influenced Hindi information is present on the web and is
increasing day by day. Including English keywords in Hindi script in the query
therefore widens the scope of searching in Hindi and of obtaining relevant
information.
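The closeness of the precision values across the original and transformed queries can be checked with a short calculation. The sketch below assumes that the nine values reported for each query in the precision table are grouped by search engine, in the order Hindi query, Level 1, Level 2; the names PRECISION and mean_by_level are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Precision at 10 as (Hindi query, Level 1, Level 2) for queries 1 to 6.
PRECISION: Dict[str, List[Tuple[float, float, float]]] = {
    "Google": [(0.9, 0.9, 0.9), (0.8, 0.9, 0.7), (0.9, 0.8, 1.0),
               (1.0, 0.9, 0.8), (0.9, 0.8, 0.8), (1.0, 1.0, 1.0)],
    "Bing": [(0.9, 0.8, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
             (0.8, 0.8, 0.8), (0.9, 0.8, 0.4), (1.0, 1.0, 1.0)],
    "AltaVista": [(0.9, 0.9, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
                  (0.6, 0.7, 0.8), (0.9, 0.9, 0.4), (1.0, 1.0, 1.0)],
}


def mean_by_level(rows: List[Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Average precision per column: Hindi query, Level 1, Level 2."""
    return tuple(round(mean(column), 3) for column in zip(*rows))


for engine, rows in PRECISION.items():
    print(engine, mean_by_level(rows))
```

Averaging per level makes it easier to see, for each engine, how much precision changes when Hindi keywords are replaced by English words in Hindi script.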
[Figure 4.5: inter-precision chart; x-axis: HQ1, L1, L2 through HQ6, L1, L2,
where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista]
Hindi IR is made more difficult by the structure of the Hindi language.
People generally do not follow the standard Hindi writing conventions, which
widens the gap between Hindi web data and its users.
Relevant information can be mined by transforming Hindi queries. Search
engines neither transform the query nor find keyword equivalents, possibly
because implementing parameters such as Hindi phonetics, synonyms and
English-equivalent Hindi keywords at the root level would cause performance
and throughput problems. However, this problem can be solved at the interface
level. To reduce the effort a Hindi user needs to find such information, a
software tool has been developed (described in detail in the next chapter)
that acts as an interface between the user and the search engines. With the
help of this tool the user can widen the scope of a web search in Hindi
[66][67].
4.2.2 Vowels
Hindi has 11 vowels. Ten of them are written in two distinct forms: the
independent form and the dependent (maatraa) form. The independent form is
used when the vowel letter appears alone, at the beginning of a word, or
immediately after another vowel letter. The dependent form is used when the
vowel follows a consonant.
Vowels in Independent Form
अ आ इ ई उ ऊ ऋ ए ऐ ओ औ
The following table lists each vowel in its independent form with a
description. The best way to learn the pronunciation is from a native speaker.
Vowel | Description
अ | as in but, again
आ | as in father, far
इ | as in fit, hit
ई | as in feet, heat
उ | as in put, pull
ऊ | as in pool, shoot
ऋ | as in rip, rib
ए | as in ate, day
ऐ | as in man, bat
ओ | as in go, boat
औ | as in saw, taught
Table 4.1 Vowels in their independent form with descriptions
4.2.3 Vowels in Dependent (maatraa) Form
When a vowel follows a consonant it is written in its maatraa form, which is
appended to the consonant. Maatraa forms never appear at the beginning of a
word or after another vowel. The first vowel, अ, has no maatraa form. Instead
it is the default (inherent) vowel: it is assumed to be present unless the
maatraa form of another vowel is explicitly appended to the consonant. In
Sanskrit the vowel अ is pronounced at the end of a word. In Hindi, however, it
is not pronounced except at the end of single-letter words. The following
table lists each vowel in its independent form, its corresponding dependent
form, and how it appears with the consonant क (k).
Independent | Dependent | With क
अ | (none) | क
आ | ◌ा | का
इ | ◌ि | कि
ई | ◌ी | की
उ | ◌ु | कु
ऊ | ◌ू | कू
ऋ | ◌ृ | कृ
ए | ◌े | के
ऐ | ◌ै | कै
ओ | ◌ो | को
औ | ◌ौ | कौ
Table 4.2 Maatraa Forms of Vowels
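In Unicode text the dependent forms are separate combining characters that follow the consonant. The sketch below shows the correspondence between independent vowels and maatraa forms; the names MATRA and attach_vowel are hypothetical and the snippet is an illustration only.

```python
from typing import Dict

# Independent vowel -> dependent (maatraa) sign; अ is the inherent vowel
# and therefore has no sign of its own.
MATRA: Dict[str, str] = {
    "अ": "",
    "आ": "\u093e",
    "इ": "\u093f",
    "ई": "\u0940",
    "उ": "\u0941",
    "ऊ": "\u0942",
    "ऋ": "\u0943",
    "ए": "\u0947",
    "ऐ": "\u0948",
    "ओ": "\u094b",
    "औ": "\u094c",
}


def attach_vowel(consonant: str, vowel: str) -> str:
    """Return the consonant followed by the maatraa form of the vowel."""
    return consonant + MATRA[vowel]


for vowel in MATRA:
    print(vowel, attach_vowel("क", vowel))
```

Note that the sign for इ is stored after the consonant in memory even though it is displayed to its left.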
4.2.4 Allophones
As mentioned earlier, the distinction between the vowels इ and ई is the
duration of the pronunciation of the vowel: the former is shorter and the
latter longer. In practice, however, the vowel इ is pronounced more like the
English "i" as in the word "it", as described in the corresponding text. The
same is true of the vowels उ and ऊ.
4.2.5 Final Schwa
The schwa अ is normally not pronounced at the end of a word. Thus कान is
pronounced "kaan", not "kaana". An exception occurs when a word ends in a
conjunct. In this case the word may be pronounced with a slight final schwa,
as in मित्र, literally "mitr" but often pronounced like "mitr(a)" with a soft
final schwa.
4.2.6 Monophthongs versus Diphthongs
Native English speakers should be careful not to pronounce the Hindi vowels
that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a
glide like the English "o" in the word "low". Many vowel letters in English
can represent diphthongs. Thus, whereas English may represent a diphthong with
the letter "i" as in the word "site", in Devanagari this diphthong would be
more precisely transcribed as two monophthongs, आ and ई: साईट.
4.2.7 Schwa Syncope
Sometimes the inherent vowel is not pronounced despite its implicit presence
and the lack of any modifying diacritic. This phenomenon is called schwa
syncope or, alternatively, schwa deletion. For instance, consider the word
नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if
the word were written नम्कीन ("namkeen"). No rule predicts this phenomenon
with absolute accuracy, yet one generally useful heuristic is that the
inherent vowel is deleted after a consonant which is between two vocalic
consonants. Thus the word देवनागरी itself is pronounced with the first schwa
deleted, like "Devnagari" and not "Devanagari", even though it is still
transliterated as "Devanagari".
Occasionally the schwa is not totally deleted but is very slightly pronounced.
4.2.8 Schwa Pronunciation in Context
The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the
English "e" in the word "bed", but only in certain contexts, namely when two अ
vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear).
Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus,
although the phrase पहन लो is literally "pahan lo", it is often pronounced
"pehen lo". Occasionally, however, this phenomenon occurs when only one schwa
vowel is beside the consonant ह, as in the word बहिन (sister). In this case
both vowels adjacent to ह are converted to [ɛ], and thus although the word is
literally "bahin" it is pronounced "behen".
4.2.9 Nasalization of Vowels
All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated
either by the symbol ◌ं or by the symbol ◌ँ. The former symbol is called
bindu ("dot") and the latter is called chandrabindu ("moon and dot"). The
bindu is used when part or all of the vowel symbol extends above the
horizontal line. The chandrabindu is used when no part of the vowel symbol
extends above the horizontal line. The bindu is more common in modern written
Hindi and may even be used exclusively.
The following examples summarize the use of the bindu and chandrabindu:
अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं
A special diacritic is sometimes used with the vowel आ to transcribe the
English "o" vowel sound, as in "college": कॉलेज.
4.2.10 Consonants: Velar Consonants
Letter | Description
क | unaspirated k
ख | aspirated k
ग | unaspirated g
घ | aspirated g
ङ | n as in sing
Table 4.3 Velar Consonants
Note that the velar nasal consonant does not appear as the first letter of any
word.
4.2.11 Palatal Consonants
Letter | Description
च | unaspirated ch as in cheese
छ | aspirated ch
ज | unaspirated j
झ | aspirated j
ञ | n as in punch
Table 4.4 Palatal Consonants
4.2.12 Retroflex Consonants
Letter | Description
ट | like t but retroflex and unaspirated
ठ | like t but retroflex and aspirated
ड | like d but retroflex and unaspirated
ढ | like d but retroflex and aspirated
ण | like n but retroflex
Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for
these consonants are formed by placing a diacritical mark called a nuqta, a
subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is
pronounced by flapping the tongue from the retroflex position forward toward
the alveolar ridge. ढ़ is pronounced similarly, except with aspiration.
English does have an alveolar flap consonant, as the "t" in the word "better"
or the "d" in "bedding" in American English. The Hindi flaps are retroflex,
however.
4.2.13 Dental Consonants
Letter | Description
त | like t but dental and unaspirated
थ | like t but dental and aspirated
द | like d but dental and unaspirated
ध | like d but dental and aspirated
न | like n in name but dental
Table 4.6 Dental Consonants
4.2.14 Labial Consonants
Letter | Description
प | like p but unaspirated
फ | like p but aspirated
ब | like b but unaspirated
भ | like b but aspirated
म | m
Table 4.7 Labial Consonants
4.2.15 Semivowels
Letter | Description
य | y as in young
र | like r but often rolled
ल | l as in lip
व | either w or v
Table 4.8 Semivowels
The Hindi "r" sound is typically a flap. However, some speakers may trill the
"r" sound occasionally, or may even occasionally pronounce it closer to an
unflapped approximant, like the English "r" in "red".
4.2.16 Sibilants
Letter | Description
श | sh as in shave
ष | like sh but retroflex
स | s as in save
Table 4.9 Sibilants
4.2.17 Glottal
Letter | Description
ह | like h but voiced
Table 4.10 Glottal
4.2.18 Allophony of "w" and "v" in Hindi
A phoneme is an equivalence class of atomic, discrete sounds: substituting a
sound of one phoneme for a sound of another can produce a difference in
meaning, whereas substituting one member of the same phoneme for another
cannot. A phone is simply a distinct sound. For instance, in English the "p"
in the word "spit" and the "p" in the word "pit" are pronounced distinctly:
the former is unaspirated, the latter is aspirated. Thus they are two distinct
phones. However, they are both members of the same phoneme, since substituting
one for the other can never produce a difference in meaning, even though the
substitution may be perceived as slightly awkward by native speakers. Two
distinct phones which are both members of the same phoneme are called
allophones (from Greek, "different sounds").
In Hindi the sounds associated with the English letters "w" and "v" are
allophones. Both are transcribed with one letter, व. Analogously to the
English example above, these sounds are typically pronounced consistently in
words but they do not constitute meaningful differences in utterances. For
example, the word वो is typically pronounced "vo", whereas the suffix -वाला is
typically pronounced "wala". Hindi speakers are not generally aware of this
distinction even though they pronounce it fairly consistently, just as English
speakers are not aware of the differences of aspiration in certain letters yet
pronounce aspiration consistently.
Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an
intermediate sound.
Semi-Allophones "j" and "z" in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction
between the English "j" and "z" sounds, but will typically pronounce words
consistently. This situation is not quite the same as that of "w" and "v",
since technically the "z" sound can be represented distinctly from the "j"
sound by placing a dot (nuqta) underneath the letter, and some speakers are
aware of this distinction. For instance, the word जो is pronounced "jo". There
is some variation, however, in words such as ज्यादा: some speakers pronounce
this as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English "t" or "d" in Hindi. These English
sounds are pronounced with the tip of the tongue on the alveolar ridge behind
the top teeth. This place of articulation lies between the Devanagari
retroflex and dental positions, although to Hindi speakers the English
pronunciation sounds much closer to the retroflex pronunciation. English
loanwords containing "t" or "d" are therefore transcribed with retroflex
approximations.
Capital Letters
Devanagari has no capital letters.
Special Maatraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding
symbols in Devanagari. These sounds are represented by placing the nuqta
underneath a symbol which is phonetically similar. The symbols represent
sounds from other languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter | Approximation
क़ | like k but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़ but voiced
ज़ | just as English z as in zoo
झ़ | similar to the s in English vision
फ़ | just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the
non-nuqta forms, though: ज़ and फ़.
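Because most speakers and writers do not distinguish the nuqta forms consistently, a retrieval system may want to treat a letter with and without the nuqta as equivalent. The sketch below uses Python's standard unicodedata module; the function name fold_nuqta is hypothetical, and folding the nuqta away is an assumed design choice rather than something prescribed by the text.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def fold_nuqta(text: str) -> str:
    """Remove the nuqta so that, for example, ज़ and ज compare equal."""
    # NFD splits precomposed nuqta letters into base letter + nuqta sign.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


# The same visible letter can be stored precomposed or as base + nuqta.
precomposed = "\u095b"       # ज़ as a single code point
combining = "\u091c\u093c"   # ज followed by the nuqta sign
print(fold_nuqta(precomposed) == fold_nuqta(combining))
```

Applying the same folding to both the query and the indexed text would let a search for either spelling match documents written with the other.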
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ, Devanagari provides two means of
suppressing the inherent vowel:
The halant (◌्), a diacritical subscript, e.g. क्
A conjunct, a ligature formed by conjoining two consonant symbols. This
method is much more common. The halant is typically used only when
typographical difficulties make it hard to use conjuncts.
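In Unicode text both means are encoded the same way: the halant character (U+094D) is placed between the consonants, and the font decides whether to draw a ligature or a visible halant. A minimal sketch, with the hypothetical helper name conjunct:

```python
HALANT = "\u094d"  # DEVANAGARI SIGN VIRAMA, which suppresses the inherent vowel


def conjunct(*consonants: str) -> str:
    """Join consonants with the halant so that they form a conjunct."""
    return HALANT.join(consonants)


nd = conjunct("न", "द")
print(nd, [f"U+{ord(ch):04X}" for ch in nd])

# A consonant followed by a bare halant has no inherent vowel.
print("क" + HALANT)
```

The stored sequence is always consonant, halant, consonant, regardless of how the conjunct is rendered on screen.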
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a
vertical line. The vertical line is deleted and the modified consonant symbol
is then conjoined to the second consonant symbol. For example:
न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ + त = फ्त (मुफ्त)
क + य = क्य (क्यों)
Note that in the last two examples, although neither क nor फ ends in a
vertical line, they can still be the first letter of a horizontal conjunct.
The curve on the right side is shortened and adjoined to the following
consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts
with the following consonant. The first consonant is written on top of the
second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant is the
first member of a conjunct, it may be written either as a regular conjunct
(e.g. न + द = न्द, as in हिन्दी) or with an anusvar, a dot written above the
horizontal line to the right of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be
spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant;
in other words, it is articulated in the same location in the mouth as the
following consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since the following letter द represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The
anusvar represents a consonant which is the first letter of a conjunct,
whereas the bindu and chandrabindu represent the nasalization of a vowel. The
bindu in हैं cannot be considered an anusvar, since there is no conjunct. The
anusvar in हिंदी is not considered a bindu, since it represents a consonant
that is the first member of a conjunct.
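The homorganic rule lends itself to a small normalization step that maps the anusvar spelling to the full conjunct spelling, so that both spellings of a word match the same index entry. The sketch below is an illustration under stated assumptions: the names CLASS_NASAL and expand_anusvara are hypothetical, only the five consonant classes described in this chapter are covered, and Unicode uses the same code point (U+0902) for both the anusvar and the bindu.

```python
from typing import Dict

ANUSVARA = "\u0902"
HALANT = "\u094d"

# Each consonant maps to the nasal consonant of its own class.
CLASS_NASAL: Dict[str, str] = {}
for nasal, members in (("ङ", "कखगघ"), ("ञ", "चछजझ"), ("ण", "टठडढ"),
                       ("न", "तथदध"), ("म", "पफबभ")):
    for consonant in members:
        CLASS_NASAL[consonant] = nasal


def expand_anusvara(word: str) -> str:
    """Rewrite anusvar + consonant as homorganic nasal + halant + consonant."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and following in CLASS_NASAL:
            out.append(CLASS_NASAL[following] + HALANT)
        else:
            out.append(ch)
    return "".join(out)


print(expand_anusvara("हिंदी"), expand_anusvara("अंडा"))
```

A dot that is not followed by one of these consonants, such as a word-final bindu marking a nasalized vowel, is left unchanged.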
Conjuncts with र
As the first member of a conjunct, र appears as a small hook or sickle above
and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined
to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted "v", as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are
not obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are
treated as a single unit, and a maatraa written to the left (such as ◌ि) is
placed before the entire conjunct.
There are hundreds of conjuncts, but most are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line
which terminates a sentence. Other punctuation, such as commas and question
marks, is borrowed from English. In modern typography periods are also used in
place of the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is
known as encoding. Most of us are familiar with ASCII, a 7-bit encoding of the
characters used in English (it can represent at most 128 characters). Over
time the need was felt for a single encoding with enough room to accommodate
all the languages of the world. To enable sharing of information, this
encoding would need to be a universally accepted standard. That standard is
Unicode. Unicode assigns a unique number (code point) in the range U+0000 to
U+10FFFF to each character, which gives enough room for the characters of all
languages in use; a code point can be stored in 8, 16 or 32-bit units.
There is another international standard, ISO 10646 of the International
Organization for Standardization (ISO), which defines the Universal Character
Set (UCS). Fortunately, the participants of both projects (ISO and Unicode)
realized around 1991 that two different unified character sets were not what
the world needed. They joined their efforts and worked together on creating a
single encoding. Both projects still exist and publish their respective
standards independently, but they have agreed to keep the encodings of the
Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value will
be represented in bits when stored in a computer file or transmitted over the
Internet. The Unicode Standard defines three encoding forms for this purpose.
They allow the same data to be stored or transmitted in a byte-, word- or
double-word-oriented format (i.e. in 8-, 16- or 32-bit code units). All three
encoding forms encode the same common character repertoire and can be
efficiently transformed into one another without loss of data. The three encoding
forms defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. It transforms every
Unicode character into a variable-length sequence of bytes. It has the
advantages that the Unicode characters corresponding to the familiar ASCII set
have the same byte values as in ASCII, and that Unicode characters transformed
into UTF-8 can be used with much existing software without extensive
rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact: all the
heavily used characters fit into a single 16-bit code unit, while all other characters
are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single
code unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format (it is also expanded as Unicode
Transformation Format).
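The difference between the three encoding forms is easiest to see by encoding
the same string with each of them. This is a minimal sketch using Python's
built-in codecs. The helper name encoded_sizes is illustrative, and the
little-endian codec variants are chosen only so that no byte order mark is
counted.

```python
def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding form."""
    forms = ("utf-8", "utf-16-le", "utf-32-le")
    return {form: len(text.encode(form)) for form in forms}

hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी, six code points

print(encoded_sizes("A"))    # one ASCII character
print(encoded_sizes(hindi))  # six Devanagari code points

# All three forms carry the same characters, so the round trip is lossless.
for form in ("utf-8", "utf-16-le", "utf-32-le"):
    assert hindi.encode(form).decode(form) == hindi
```

For Devanagari text UTF-16 is the most compact of the three (two bytes per code
point against three in UTF-8), whereas for ASCII text UTF-8 is the most compact.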
4.3.2 UTF-8
UTF-8 has the benefit that ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other software
that rely on US-ASCII values but are transparent to other values. Any document
created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and
range from 2 to 4 bytes in size. (The original design allowed sequences of up to
6 bytes, but the current definition, RFC 3629, stops at 4 because the Unicode
code space ends at U+10FFFF.) The most commonly used characters, including
all of Devanagari, need at most three bytes. Non-ASCII characters are encoded
as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII characters
stored in bytes have their most significant bit clear). This allows an application
to differentiate between ASCII and non-ASCII characters: bytes representing
non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence indicates, through its leading bits,
how many bytes the sequence contains (110xxxxx for two, 1110xxxx for three,
11110xxx for four). All further bytes have the form 10xxxxxx. The bits marked x,
taken together, hold the code point of the character [61].
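The role of the first byte can be illustrated by reading its leading bits. The
sketch below assumes the current four-byte limit of UTF-8 (RFC 3629). The
function names are illustrative, and the check is simplified: it does not reject
every byte value that the standard forbids as a first byte.

```python
def utf8_sequence_length(first_byte):
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:            # 0xxxxxxx: a plain ASCII character
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not the first byte of a UTF-8 sequence")

def is_continuation(byte):
    """True for bytes of the form 10xxxxxx, which follow the first byte."""
    return byte >> 6 == 0b10

data = "\u0915".encode("utf-8")      # Devanagari letter KA (क)
print([hex(b) for b in data])        # the three bytes of the sequence
print(utf8_sequence_length(data[0]))
print([is_continuation(b) for b in data[1:]])
```

Because every byte of a multibyte sequence has its top bit set, a parser that
only understands ASCII can pass these bytes through without confusing them with
ASCII characters.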
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a
side-by-side comparison of a few will often reveal structural similarities even in
the modern letterforms. With minor historical exceptions, they are written from
left to right. They are all abugidas, in which most symbols stand for a consonant
plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of
these scripts have distinct symbols, and word-internal vowels are usually written
by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of
the inherent vowel, when that occurs, is frequently marked with a special sign. In
the Unicode Standard, this sign is denoted by the Sanskrit word virama. In some
languages another designation is preferred. In Hindi, for example, the word hal
refers to the character itself, and halant refers to the consonant that has its
inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign
nominally serves to suppress the inherent vowel of the consonant to which it is
applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in
the south, and from Pakistan in the west to the easternmost islands of Indonesia,
are derived from the ancient Brahmi script. The oldest lengthy inscriptions of
India, the edicts of Ashoka from the third century BCE, were written in two
scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin,
probably deriving from Aramaic, which was an important administrative language
of the Middle East at that time. Kharoshthi, written from right to left, was
supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with
myriad changes throughout the subcontinent and outlying islands. There are said
to be some 200 different scripts deriving from it. By the eleventh century, the
modern script known as Devanagari was in ascendancy in India proper as the
major script of Sanskrit literature. The northern branch of Brahmi-derived
scripts includes such modern scripts as Bengali, Gurmukhi and Tibetan; the
southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded
according to a common plan, so that comparable characters are in the same
order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit encoding
of syllable-final consonants. Many of the character names in this group of scripts
represent the same sounds, and naming conventions are similar across the range.
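The effect of the virama on the stored text can be inspected with Python's
standard unicodedata module. In this small sketch the conjunct क्ष is built from
its three code points; the variable names are illustrative.

```python
import unicodedata

KA = "\u0915"      # क
VIRAMA = "\u094d"  # ्  suppresses the inherent vowel of the preceding consonant
SSA = "\u0937"     # ष

conjunct = KA + VIRAMA + SSA  # displayed as the single conjunct क्ष

for ch in conjunct:
    print("U+{:04X}".format(ord(ch)), unicodedata.name(ch))

# The virama is a combining character: it has a non-zero combining class.
print(unicodedata.combining(VIRAMA))
```

Although the conjunct is drawn as one shape, it is stored as three code points.
A retrieval system therefore has to compare such sequences rather than rendered
glyphs.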
4.3.4 Devanagari U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write
other related languages of India (such as Marathi) and of Nepal (Nepali). In
addition, the Devanagari script is used to write the following languages: Awadhi,
Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi
(Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. For this reason the
Unicode Standard covers the principles of the Indic scripts in some detail in its
introduction to the Devanagari script, while its introductions to the remaining
Indic scripts are abbreviated and only highlight differences from Devanagari
where appropriate.
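Because the script occupies one contiguous block, a retrieval system can
recognise Devanagari text with a simple range test on code points. The sketch
below assumes only the block range U+0900 to U+097F named in the heading; the
function names is_devanagari and devanagari_ratio are illustrative.

```python
def is_devanagari(ch):
    """True if the character lies in the Devanagari block U+0900..U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F

def devanagari_ratio(text):
    """Fraction of non-space characters that belong to the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(c) for c in chars) / len(chars)

print(devanagari_ratio("\u0939\u093f\u0928\u094d\u0926\u0940"))  # हिन्दी
print(devanagari_ratio("hindi"))
```

A ratio of this kind only identifies the script. It cannot tell Hindi from
Marathi, Nepali or the other languages listed above, which share the same block.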
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from and is an update of earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0 to F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic scripts
in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada and Malayalam. This parallel code layout emphasizes the structural
similarities of the Brahmi scripts and follows the stated intention of the Indian
coding standards to enable one-to-one mappings between analogous coding
positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer,
Myanmar and other scripts depart to a greater extent from the Devanagari
structural pattern, so the Unicode Standard does not attempt to provide any
direct mappings for these scripts to the Devanagari order.
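The parallel code layout means that a rough script conversion can be done by
adding a constant offset to each code point. The following is a minimal sketch
under that assumption. The block start values are those of the Unicode code
charts, while the function name shift_script is illustrative. It is not a
linguistically correct transliterator: positions that are unassigned in the
target script are simply left unchanged.

```python
import unicodedata

# First code point of each 128-position block that follows the common layout.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def shift_script(text, target, source="devanagari"):
    """Map each character to the analogous position in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < 0x80:
            candidate = chr(dst + offset)
            # Keep the original character if the target position is unassigned.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)

word = "\u0915\u092e\u0932"  # कमल
print(shift_script(word, "bengali"))
print(shift_script(word, "gujarati"))
```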
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
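The orthographic syllable can be approximated with a regular expression over
the three character types. This is a simplified sketch, not the full syllable
grammar of the Unicode Standard: the character ranges cover only the common
Devanagari consonants, vowels, vowel signs and modifier signs, and the names
SYLLABLE and syllables are illustrative.

```python
import re

C = "[\u0915-\u0939\u0958-\u095f]\u093c?"  # consonant, optionally with nukta
H = "\u094d"                                # virama (halant)
M = "[\u093e-\u094c]"                       # dependent vowel signs (maatraas)
IV = "[\u0904-\u0914]"                      # independent vowels
MOD = "[\u0900-\u0903]?"                    # candrabindu, anusvara, visarga

# Either (consonant + virama)* consonant [vowel sign or final virama],
# or a single independent vowel; each may carry one modifier sign.
SYLLABLE = re.compile(
    "(?:" + C + H + ")*" + C + "(?:" + M + "|" + H + ")?" + MOD
    + "|" + IV + MOD
)

def syllables(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    return SYLLABLE.findall(word)

print(syllables("\u0939\u093f\u0928\u094d\u0926\u0940"))  # हिन्दी
# Logical order: in हि the vowel sign is drawn to the left of the consonant,
# but it is stored after it.
print(["U+{:04X}".format(ord(c)) for c in "\u0939\u093f"])
```

Segmenting on such units rather than on single code points keeps conjuncts and
their vowel signs together, which helps when text has to be truncated, matched
or indexed.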
4.4 Indian Languages on internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago, there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode Consortium, a
nonprofit in California whose members include Google, IBM, Oracle, Microsoft,
Sun Microsystems, Yahoo and the Government of India, proved to be a boon for
Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its
operating system in 2001. "Since then, Hindi Unicode support has been a part of
all subsequent upgrades of Microsoft's operating systems. Also, providing Input
Method Editor facilities gives users the option to use different types of
keyboards," says Meghashyam Karanam, product manager, vision and localization,
at Microsoft India. The earlier ASCII-based system could accommodate only 128
characters, which is not enough for the Devanagari script used for Hindi.
Unicode's Basic Multilingual Plane alone can accommodate more than 65,000
characters. As most computers in India use Microsoft's operating system, this
ensured that the Hindi font became available to most of them as they upgraded
the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support for 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office, as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the California-based
company has launched various products in the past two years. With the Google
Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages
available on the Internet, including those that are not in a Unicode font. "Google
offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu
to name a few, Gmail in five languages, and Google transliteration in Bengali,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu
and Urdu. Urdu is the most recent language that Google has added to its
offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use
the search function, "users can type Hindi words in Roman script and a drop-down
menu suggests several Hindi phrases. By selecting the appropriate query, users
can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in Roman letters, hit the space bar and get the same word in the script
of a different language. Roy-Chowdhury explains the process of adding a new
language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds.
The efforts of Microsoft, Google and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of the
popular Hindi newspapers have a Web presence now. "In the last two years, page
views of navbharattimes.com have increased significantly, and half of them come
through Google, as Net users generally search for a specific news item or query,"
says Nagar. Yahoo, with headquarters in California, formed a partnership with
Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering
there are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and Urdu
on the Internet? "The major challenge is Internet penetration and PC prices. The
moment we have better Internet penetration, especially in smaller towns, and PC
prices go down, Hindi and Indian languages can flourish on the Net," says Karnik
of webdunia.com. India had more than 49 million Internet users in June 2008, out
of which about 9 million used the Internet regularly, according to a study by
Juxtconsult India, a research company. "There is a big opportunity in Indian
languages. Studies showed that only 28 percent of Indian Web surfers preferred
English on the Web, but as good quality content in Indian languages was not
easily available, they did not visit many local language sites," says Mrutyunjay
Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to
success in countries like India. In order to get the widest audience reach, one has
to look at Hindi, because in a country of over a billion people, English is spoken
by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury
agrees with him. "The Web is the democratization of access to information," he
says, adding that the Internet is not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we look
at how the Internet developed in the U.S., it may provide a useful analogy. First
came content, which was mostly produced by people who had a passion for
putting up content they cared about. Traffic and monetization was not the motive.
Second came growing readership, as people started discovering content. This set
off a virtuous cycle in which content eventually became a viable, monetizable
business. Third were the application developers, who could now focus on moving
the online experience beyond passive consumption of information to interactivity,
community building, service delivery and a host of other innovations,"
Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And
I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S.V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in the year 1978.
It was collected for a comparative study of American, British and Indian English
(Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the
development of Indian language corpora. It has coordinated with various Indian
agencies and universities to develop corpora of more than 45 million words in the
Scheduled Languages of India, an effort that is also part of the TDIL program.
The Enabling Minority Language Engineering (EMILLE) program provides
corpus architecture and tools for Asian languages. It has a monolingual corpus
that contains approximately 96,157,000 words and a parallel corpus consisting of
200,000 words of English text, which helps in translation into Bengali, Hindi,
Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha', a repository of Hindi word databases and dialect
mapping of Hindi. The Department of Information Technology of the Government
of India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Efforts in this field have been under way since 1980,
involving prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. In the late 1990s, many new
projects were undertaken by IIT Mumbai, IIIT Hyderabad, the AU-KBC Centre,
Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a
consortium-mode project for building computational tools and a Sanskrit-Hindi
MT system under the leadership of Amba Kulkarni (University of Hyderabad).
The goal of this project is to build children's stories using multimedia and
e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology,
which translates from English to Indian languages, under the leadership of Prof.
R.M.K. Sinha. It is a rule-based system with approximately 1,750 rules and
54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua
named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language.
The architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available at http://anglahindi.iitk.ac.in. To develop automated
translation systems for regional languages, the Anglabharti architecture has been
adopted by various Indian institutes, for example IIT Guwahati.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation (FGH-MT) system. Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H), and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was initiated by the
Government of India in 1996 for the translation of government orders,
notifications, circulars and legal documents from English to Hindi. The main goal
was to provide translation tools to government agencies. Mantra software is
available in desktop, network and web-based forms. It is based on the Lexicalized
Tree Adjoining Grammar (LTAG) formalism, which represents the English as well
as the Hindi grammar. Initially it was specific to the domain of Personnel
Administration, namely Gazette Notifications, Office Orders, Office Memorandums
and Circulars. Gradually the domains were expanded, and at present it also
covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra
technology was only for English to Hindi translation, but currently it is also
available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership of
Prof. Pushpak Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government, and it is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with a system-generated translation, the user can
supply new words, phrases and sentences to the system and obtain a newly
interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical techniques. It is used for
the translation of English to Indian languages (Hindi, Marathi and Telugu). Users
can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has small sets of transfer rules. Users can
access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports the
processing of Unicode-encoded text. The system uses the Penn Treebank
annotation system for part-of-speech tagging. It translates English sentences
into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query, it goes deep into Indian-language content and tries to return the
appropriate results. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and also provides an automated translation facility from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and Web site localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and
Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the
development of Indian language software. It provides Indian language enablement
technology to many state governments and to the central government for
e-governance programs. It has developed software for multilingual content
creation for newspaper publishing and has also developed high-quality
Unicode-based fonts for major Indian languages. It has specifically developed the
Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP
sector, corporate offices and the e-governance program of the Government of
Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has long been aware of the need for language
technology. Since 1970, the Department of Electronics and the Department of
Official Language have been involved in developing Indian language technology.
Consequently, ISCII (Indian Script Code for Information Interchange) was
developed for Indian languages on the pattern of ASCII (American Standard Code
for Information Interchange). In addition, the Indian languages Transliteration
scheme (ITRANS), developed by Avinash Chopde, represents Indian language
alphabets in terms of ASCII (Madhavi et al., 2005). The Department of
Information Technology under the Ministry of Communication and Information
Technology is also working towards the proliferation of language technology in
India. Other Indian government ministries, departments and agencies, such as the
Ministry of Human Resource Development, DRDO (Defence Research and
Development Organisation), the Department of Atomic Energy, the All India
Council for Technical Education and the UGC (University Grants Commission),
are also involved, directly and indirectly, in research and development of
language technology. All these agencies help develop important areas of research
and provide research funds to development agencies. As a result, IndoWordNet
was developed for the Indian languages on the pattern of the English WordNet.
4.5.1.3.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for it. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10-15 Indian local search engines must have been launched, some small and some big, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario.
Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search: English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search: English
4.6.1.3 Info Search: Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search: Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return '0' search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return '0' documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords do exist in the results of the second search. Merely saying that we have a recall problem may not be sufficient. The next obvious question is 'how much of a problem is it?' In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.
Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                        0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम            0        0       0
Table 4.15 Problems faced while searching in Hindi: Low recall

Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                             8820     92      12
वलडम टर ड सटय आतॊकीअटक                              331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                    708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                      37100    7400    93
Table 4.16 Improved Recall

4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, grammatical and structural standards are rarely followed, for various reasons. One of the reasons is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes its dialects not only in speech but also in writing. Every language uses markers that attach to a root word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
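The idea of reducing morphological variants to one root can be sketched with a small suffix-stripping function. This is only an illustration under stated assumptions: the suffix list `PLURAL_SUFFIXES`, the function name `stem` and the `min_stem_len` guard are hypothetical and are not the stemmer used in this study; a practical Hindi stemmer needs many more rules.

```python
# Illustrative suffix list only (an assumption, not this study's stemmer):
# a few common Hindi plural endings, tried in the order given.
PLURAL_SUFFIXES = ("ओं", "एं", "एँ", "ों")


def stem(word: str, min_stem_len: int = 2) -> str:
    """Strip one known plural suffix, keeping at least min_stem_len characters."""
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# The root word followed by its two variants from the example above.
variants = ["योजना", "योजनाओं", "योजनाएं"]
roots = {stem(w) for w in variants}
print(roots)  # the three forms collapse to a single root
```

Indexing and querying on `stem(word)` instead of the surface form would let a query in any one variant match documents containing the others.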
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabets. Together, the variety in the alphabet, the different dialects and the influence of regional and foreign languages have resulted in spelling variations of the same word. For example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible spelling variations.
There are numerous words that are phonetically equivalent but vary in writing. The word "school", for instance, can be written in Hindi in several ways. When the standard keyword स्कूल (school) and a non-standard phonetic equivalent of it are searched separately, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety: for example, the English word "school" (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by a large part of the population in different states of India. For the Hindi word इस्कूल, more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.
Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved, with little repetition. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.
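One way a search engine could treat phonetically equivalent spellings alike is to fold each word to a comparison key before matching. The sketch below is a minimal illustration, assuming three folding rules chosen for this example (dropping the nukta, treating chandrabindu as anusvara, and writing a nasal consonant plus virama before another consonant as anusvara). The function name `phonetic_key` and the rule set are hypothetical and are not taken from this study; the key is meant for matching only and is not a spelling corrector.

```python
import re
import unicodedata

NUKTA = "\u093c"          # dot that distinguishes, for example, ज़ from ज
CHANDRABINDU = "\u0901"   # ँ
ANUSVARA = "\u0902"       # ं
# A nasal consonant followed by virama and then another consonant,
# for example न् in आन्दोलन.
NASAL_CLUSTER = re.compile("[ङञणनम]\u094d(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Fold common Hindi spelling variants to one comparison key."""
    word = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    word = word.replace(NUKTA, "")               # ज़ is treated like ज
    word = word.replace(CHANDRABINDU, ANUSVARA)  # ँ is treated like ं
    return NASAL_CLUSTER.sub(ANUSVARA, word)     # न् + consonant becomes ं


spellings = ["आंदोलन", "आन्दोलन", "आँदोलन"]
print({phonetic_key(s) for s in spellings})  # one key for all three spellings
```

Because the folding is lossy, unrelated words can occasionally share a key, so such keys are better used to widen recall than to decide relevance.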
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.
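A simple way to make use of synonyms at query time is to issue one query variant per synonym. The following is a sketch only: the dictionary `SYNONYMS` is a hand-made, hypothetical table holding the single example above, and `expand_query` is an illustrative helper, not part of any search engine's API.

```python
from typing import Dict, List

# Hypothetical, hand-made synonym table; a real system would load a
# lexical resource rather than hard-code entries.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query plus one variant per synonym of each keyword."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alt in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return variants


for q in expand_query("सोने के आभूषण", SYNONYMS):
    print(q)
```

The results of the variants can then be merged, which trades some extra queries for the chance of reaching documents that use a different word for the same thing.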
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it also means "to sleep". In the context of this query सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In this query it means "common"; however, in Hindi it also means "mango", so the query can also be interpreted as "mango is people's choice".
Many words are polysemous in nature. One word can have more than one meaning, the meaning depends on the context of the sentence, and finding the correct sense of a word in a given context is an intricate task. For example, कर means "tax" in one context (with synonyms such as शुल्क, सूद, महसूल and टैक्स), "hand" or "arm" in another (हस्त, बाहु) and "to do" (करना) in a third.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by people with no formal education, often without awareness of which language the words come from. Most Indians use these words in English rather than in their native language. English thus influences Hindi not only in speech but in writing too, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S. No.   Query in Hindi                     Meaning in English
1        ब यतवष मवन                          Indian rain forest
1.1      ब यत मवष मवनो                       Indian rain forests
2        हव ईद घमटन क क यण                   Reason for air crash
2.1      हव ईद घमटन ओ क क यण                 Reason for air crashes
3        ब यतभफ रीज न व रीब ष                Language spoken in India
3.1      ब यतभफ रीज न व रीब ष ए              Languages spoken in India
3.2      ब यतभफ रीज न व रीब ष ओ              Languages spoken in India
4        षवर पतह न ऩयझ र                     Lake on the verge of extinction
4.1      षवर पतह न ऩयझ र                     Lakes on the verge of extinction
5        पर क ततकआऩद                         Natural calamity
5.1      पर क ततकआऩद ए                       Natural calamities
6        ऩ कीपरज तत                          Bird species
6.1      ऩकषमोकीपरज तत                       Bird's species
6.2      ऩकषमोकीपरज ततम                      Bird's species
7        क षषसभसम                            Agriculture problem
7.1      क षषसभसम ओ                          Agriculture problems
8        कीटन शकक इसत भ र                    Use of pesticide
8.1      कीटन शकोक इसत भ र                   Use of pesticides
9        भ नमसकय ग                           Mental illness
9.1      भ नमसकय चगमो                        Mental patients
9.2      भ नमसकय ग                           Mental patient
10       गर भ णषवक सम जन                     Policy for village
10.1     गर भ णषवक सम जन ओ                   Policies for village
10.2     गर भ णषवक सम जन ए                   Policies for village
11       परभ खक षषक दर                       Major agricultural office
11.1     परभ खक षषक नदरो                     Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
For each root word, the line "Keywords listed" gives the listing of keywords shown by Google, Bing and Guruji; the numbered lines give the documents returned for each morphological variant.
S. No.   Root word / morphological variant   Google    Bing     Guruji
1        ब यत वष मवन
         Keywords listed: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1      ब यत वष मवन                          50500     4680     485
1.2      ब यत मवष मवनो                        40400     680      61
2        द घमटन
         Keywords listed: द घमटन द घमटन ओ द घमटन द घमटन
2.1      द घमटन                               133000    2410     284
2.2      द घमटन ओ                             117000    420      23
3        ब ष
         Keywords listed: ब ष ब ष ओ ब ष ब ष
3.1      ब ष                                  161000    8330     961
3.2      ब ष ए                                6200      935      188
3.3      ब ष ओ                                6090      441      356
4        झ र
         Keywords listed: झ रझ रो झ र झ र
4.1      झ र                                  4740      278      25
4.2      झ र                                  1270      28       1
5        आऩद
         Keywords listed: आऩद आऩद ओ आऩद आऩद
5.1      आऩद                                  102000    4030     410
5.2      आऩद ए                                1160      64       20
6        ऩ परज तत
         Keywords listed: ऩ ऩकषमोपरज ततपरज ततमोपरज ततम ऩ परज तत ऩ परज तत
6.1      ऩ परज तत                             48200     1670     98
6.2      ऩकषमोपरज तत                          47600     1150     84
6.3      ऩकषमोपरज ततम                         33800     747      25
7        सभसम
         Keywords listed: सभसम ए सभसम ओ सभसम सभसम सभसम
7.1      सभसम                                 584000    30200    1889
7.2      सभसम ओ                               584000    7150     1356
8        कीटन शक
         Keywords listed: कीटन शक कीटन शको कीटन शक कीटन शक
8.1      कीटन शक                              36300     1360     333
8.2      कीटन शको                             35800     800      270
9        य ग
         Keywords listed: य गोय ग य ग य ग
9.1      य ग                                  205000    21600    1423
9.2      य चगमो                               128000    3280     239
9.3      य ग                                  112000    6280     647
10       म जन
         Keywords listed: म जन ओ म जन म जन म जन
10.1     म जन                                 673000    18500    3343
10.2     म जन ओ                               669000    6020     990
10.3     म जन ए                               673000    2860     416
11       क दर
         Keywords listed: क दरीमक दर क दर क दर
11.1     क दर                                 261000    11300    655
11.2     क नदरो                               29500     1850     105
Table 4.19 Precision values of the three search engines
S. No.   Query                               Precision@10
                                             Google   Bing   Guruji
1.1      ब यतवष मवन                           0.5      0.3    0.1
1.2      ब यत मवष मवनो                        0.3      0.1    0.1
2.1      हव ईद घमटन क क यण                    0.5      0.3    0.1
2.2      हव ईद घमटन ओ क क यण                  0.3      0.3    0.1
3.1      ब यतभफ रीज न व रीब ष                 0.9      0.5    0.5
3.2      ब यतभफ रीज न व रीब ष ए               0.5      0.4    0.3
3.3      ब यतभफ रीज न व रीब ष ओ               0.5      0.2    0.2
4.1      षवर पतह न ऩयझ र                      0.5      0.3    0.2
4.2      षवर पतह न ऩयझ र                      0.5      0.3    0
5.1      पर क ततकआऩद                          1        0.4    0.3
5.2      पर क ततकआऩद ए                        0.6      0.6    0.1
6.1      ऩ कीपरज तत                           1        0.7    0.2
6.2      ऩकषमोकीपरज तत                        0.9      0.6    0.1
6.3      ऩकषमोकीपरज ततम                       0.9      0.6    0.1
7.1      क षषसभसम                             0.7      0.3    0.2
7.2      क षषसभसम ओ                           0.7      0.4    0.2
8.1      कीटन शकक इसत भ र                     1        0.8    0.2
8.2      कीटन शकोक इसत भ र                    0.9      0.6    0.2
9.1      भ नमसकय ग                            0.7      0.6    0.4
9.2      भ नमसकय चगमो                         0.9      0.6    0.3
9.3      भ नमसकय ग                            0.9      0.5    0.4
10.1     गर भ णषवक सम जन                      0.9      0.7    0.3
10.2     गर भ णषवक सम जन ओ                    1        0.6    0
10.3     गर भ णषवक सम जन ए                    0.8      0.6    0.2
11.1     परभ खक षषक दर                        0.5      0.4    0
11.2     परभ खक षषक नदरो                      0.4      0.2    0.2
Figure 4.1 Precision values of the three search engines (series: Google, Bing, Guruji; horizontal axis: query number; vertical axis: precision from 0 to 1)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.
It has also been observed that only Google shows a listing of morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high in almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variations in the keywords affect not only the quantity but also the quality of results.
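The precision value over the top 10 results can be written as a short calculation. The sketch below assumes that each of the top results has already been judged relevant or not by a person; the function name `precision_at_k` and the sample judgments are hypothetical and serve only to show the arithmetic.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    If fewer than k results were returned, the missing ones count as
    non-relevant, so the divisor is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevant[:k] if r) / k


# Hypothetical judgments for one query: 5 of the top 10 results are relevant.
judgments = [True, True, False, True, False, True, False, False, True, False]
print(precision_at_k(judgments))  # 0.5
```

A value of 0.5 in Table 4.19 therefore corresponds to 5 relevant documents among the first 10 returned.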
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
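For each Hindi query (standard keywords shown in bold in the original), the phonetic variations of the keywords are listed, followed by the keywords submitted to Google, the number of results and the precision over the top 10 results.

1. Hindi query: ससजिमो भ िहयीर ऩद थम
   Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
   Keywords in Google query        No. of results   Precision@10
   ससजिमोिहयीर                     97               0.9
   ससजिमो जहयीर                    632              0.9
   सिबजमो िहयीर                    194              0.9

2. Hindi query: आसभान छ त भहॊगाई
   Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
   Keywords in Google query        No. of results   Precision@10
   आसभ न भहॊगाई                    35300            1.0
   आसभ न भहॉगाई                    1040             1.0
   आसभ भह ग ई                      14               0.6
   आसभाॊ भह ग ई                    563              0.7

3. Hindi query: भरषटाचाय स आिादी
   Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
   Keywords in Google query        No. of results   Precision@10
   भरषटाचायआज दी                   211000           0.8
   भरशटाचायआज दी                   214              0.6
   बयषटाचाय आज दी                  447              0.7
   भरषटट च यआिादी                  1090000          0.9
   भरशट च य आिादी                  1040             0.7
   बयषटट च यआिादी                  1190             0.8

4. Hindi query: अननाहिाय क आनदोरन
   Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
   Keywords in Google query        No. of results   Precision@10
   अनन हज य आनदोरन                 84700            0.3
   अनन हज य आॊदोरन                 85100            0.8
   अनन हज य क आॉदोरन               78               0.6
   अनना हिाय क आनद रन              399              0.5
   अनन हिाय क आनद रन               3260000          1.0

5. Hindi query: फयोिगायी सभसम सभ ध न
   Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
   Keywords in Google query        No. of results   Precision@10
   फयोिगायी                        9650             0.9
   फयोिगायी                        80600            1.0
   फयोिगायी                        170              0.7
   फयोिगायी                        30               0.5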
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)
From the above table and figure it can be seen that search engines return documents for the various phonetically equivalent forms of a Hindi query. No particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved, with little repetition. From the precision charts it can be observed that the degree of relevance for queries containing phonetically equivalent keywords is the same or nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
For each query group, the first line gives the query with its standard Hindi words and synonyms as listed in the original; the numbered lines give the documents returned and the precision over the top 10 results for each search engine.
S. No.   Query                     Google     P@10   Bing    P@10   Guruji   P@10
1        स न क आब षण (standard Hindi words and synonyms: आब षणगहन)
1.1      स न क आब षण               217000     0.8    3250    0.7    381      0.5
1.2      स न क गहन                 188000     0.8    2590    0.6    389      0.5
1.3      स न क ज वय                78900      0.8    1670    0.7    311      0.6
1.4      स न क अर क य              9490       0.5    633     0.4    70       0
1.5      स न क आबयण                493        0.5    38      0.3    1        0
2        क र फ दर (standard Hindi words and synonyms: फ दर)
2.1      क र फ दर                  233000     0.7    7510    0.7    733      0.3
2.2      क र भ घ                   40700      0.9    1500    0.8    99       0.6
2.3      क र जरधय                  1570       0.6    54      0.6    2        0.2
3        सतर सशिकतकयण (standard Hindi words and synonyms: सतर न यी)
3.1      सतर सशिकतकयण              9950       0.9    1570    0.7    760      0.6
3.2      न यीसशिकतकयण              29300      0.9    1910    0.9    736      0.4
3.3      भहहर सशिकतकयण             96300      0.9    5160    0.8    1091     0.3
3.4      औयतसशिकतकयण               7670       0.8    680     0.7    510      0.2
4        मसक दयक अ हक य (standard Hindi words and synonyms: अ हक य)
4.1      मसक दयक अ हक य            1990       0.4    18      0.3    60       0.6
4.2      मसक दयक अमबभ न            2400       0.6    304     0.2    16       0.1
4.3      मसक दयक घभ ड              495        0.5    54      0.6    9        0.1
5        व रग ओ (standard Hindi words and synonyms: व)
5.1      व रग ओ                    6960       1      698     0.8    29       0.5
5.2      ऩ डरग ओ                   13400      1      1080    0.9    143      0.9
5.3      दयखतरग ओ                  481        0.5    19      0.5    0        0
6        आ खद न (standard Hindi words and synonyms: आ ख)
6.1      आ खद न                    34000      0.7    3690    0.5    312      0.3
6.2      न तरद न                   77500      1      3240    1      159      0.9
6.3      च द न                     2450       0.8    427     0.1    36       0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using synonyms of Hindi keywords improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns more documents than the other two search engines and Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S. No.   Query                     For keyword as      In English                        For keyword as      In English
1        नारी को सोना पसंद है       सोना (Gold)         Women like gold                   सोना (To sleep)     Women like to sleep
2        आम लोगों की पसंद           आम (Common)         Common man's choice               आम (Mango)          Mango is people's choice
3        बाल विकास और पोषण          बाल (Children)      Child development and nutrition   बाल (Hair)          Hair development and nutrition
4        सपेरों का फन               फन (Art)            Art of snake charmers             फन (Snake head)     Snake charmer's snake head
5        युद्ध में कुल विनाश         कुल (Aggregate)     Aggregate destruction in wars     कुल (Family)        Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                50800                Gold        5               To sleep     2               3
2       आम                  488000               Common      3               Mango        3               4
3       बाल                 2900000              Children    7               Hair         3               0
4       फन                  184                  Art         0               Snake head   10              0
5       कुल                 17800                Aggregate   2               Family       3               5

Table 4.24 Ambiguity test for Bing
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                2680                 Gold        2               To sleep     2               6
2       आम                  17800                Common      3               Mango        3               4
3       बाल                 4030                 Children    6               Hair         2               2
4       फन                  25                   Art         0               Snake head   9               1
5       कुल                 1900                 Aggregate   0               Family       2               8
Table 4.25 Ambiguity test for Guruji
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                109                  Gold        0               To sleep     0               10
2       आम                  6756                 Common      3               Mango        0               7
3       बाल                 635                  Children    5               Hair         2               3
4       फन                  No results found     Art         NA              Snake head   NA              NA
5       कुल                 84                   Aggregate   0               Family       0               10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables the last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, non-required context. The results make it clear that all the search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.
For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art" rather than the other context (snake charmer's snake head). From the results, however, it can be seen that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but in writing too, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as एक्सरसाइज, and this word has several phonetic variations in Hindi. In line with such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
For each English word, the Hindi forms are given in the order Google transliteration, standard Hindi keyword, phonetic equivalents; the Google result counts follow in the same order.
English word   Hindi forms                                         Google results
Woman          वोभन व भन व भ न व भ न                               42200      101000     3520      34800
Insurance      इनसयाॊस इ शम यस इ शम य स इनशम य नस                  1220       246000     1390      3710
Cancer         क सय क सय क नसय क नसय                               1880000    35700      6490      1820
Hospital       हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर                     404000     263200     2440      100
Corruption     कोररसततओन कयपशन कयऩशन क यपशन                        1890       368000     1110      58
Computer       कॊ तमटय कमपम टय कमपम टय क पम टय                     4450000    1040000    2610000   537000
University     उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी           1540       1070000    1420      3270
Director       डियकटय ड मय कटय ड इय कटय ड मय कटय                   4600       735000     97000     8300
Accident       एसकसिट एकस डट एकस ड नट ऐिकसडट                       26300      125000     2510      5160
Parliament     ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट                  3240       38000      639       1120
Specialist     टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट             143        1440       51300     9820
Expert         एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम                          269000     8730       182       8
Policy         ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस                            609        395000     69700     289000

From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query but also for all phonetic variants of the keyword, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are also retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries (the Google transliteration) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were nevertheless retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of the English influence on Hindi IR, in terms of precision and of the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged. Example:
An English query "Foreign investment in India" can be written in Hindi as विदेशी निवेश भारत में, where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इनवेस्टमेंट). The query is transformed at the two levels as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इनवेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query विदेशी निवेश भारत में (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
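The two-level transformation can be sketched as follows. This is not the software described in the next chapter; it is a minimal illustration that assumes whitespace tokenisation and a small hypothetical mapping, `LOANWORDS`, from Hindi keywords to English loanwords written in Devanagari. It enumerates every single-word replacement as level 1 and every replacement of two or more words as level 2.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical mapping from Hindi keywords to English loanwords in Devanagari.
LOANWORDS: Dict[str, str] = {"विदेशी": "फॉरेन", "निवेश": "इनवेस्टमेंट"}


def transform(query: str, loanwords: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (level 1, level 2) variants of a whitespace-tokenised query.

    Level 1 replaces exactly one keyword; level 2 replaces two or more.
    """
    tokens = query.split()
    slots = [i for i, t in enumerate(tokens) if t in loanwords]
    level1: List[str] = []
    level2: List[str] = []
    for n in range(1, len(slots) + 1):
        for chosen in combinations(slots, n):
            variant = [loanwords[t] if i in chosen else t
                       for i, t in enumerate(tokens)]
            (level1 if n == 1 else level2).append(" ".join(variant))
    return level1, level2


level1, level2 = transform("विदेशी निवेश भारत में", LOANWORDS)
print(level1)
print(level2)
```

For the example query this yields two level 1 variants (one per replaceable keyword) and one level 2 variant; the text above shows only the first level 1 variant.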
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation
For each query, the Hindi query and its two influenced forms (Level 1 and Level 2) are listed with the number of Google search results.

1. Health and blood donation
   Hindi query:   सव सथ औय यकतद न                  1520
   Level 1:       हलथ औय यकतद न                    8360
   Level 2:       हलथ औय जरि_िोनिन                 1020

2. Treatment for Blood pressure
   Hindi query:   यकतच ऩ क इर ज                    95800
   Level 1:       जरिपरिय क इर ज                   37200
   Level 2:       जरिपरिय क टरीटभट                 2150

3. Cardiologist
   Hindi query:   रदम चचककतसक                      241000
   Level 1:       हाट िॉकटय                        99100
   Level 2:       काडिमोरासिटट                     1770

4. Government Employment Policy
   Hindi query:   सयक य दव य य जग य म जन           1840000
   Level 1:       सयक य दव य य जग य टकीभ           51600
   Level 2:       सयक य दव य एमपतरॉमभट सकीभ        93

5. Foreign investment in India
   Hindi query:   षवद श तनव श ब यत भ               658000
   Level 1:       पायन तनव श ब यत भ                527
   Level 2:       पायनइनवटटभट ब यत भ               66

6. Corruption free India
   Hindi query:   भरषटट च य भ कत ब यत              181000
   Level 1:       कयतिन भ कत ब यत                  19700
   Level 2:       कयतिन फरी इॊडिमा                 47000

Figure 4.4 Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each showing result counts for the Hindi query, Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the number of
retrieved documents is considerable. For search engines, however, the quality
of results is more important than the quantity. Table 4.28 and Figure 4.5 are
therefore presented below for the analysis of precision values. Three popular
search engines, namely Google, Bing and AltaVista, were used for retrieving web
results.
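Table 4.28 Analysis of precision values (tabular representation). Precision is
measured over the first 10 results returned by each search engine.

Query type | Query | Google | Bing | AltaVista
Hindi Query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
Hindi Query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
Hindi Query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1
Hindi Query | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6
Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
Hindi Query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
Level 2 | फॉरेन इन्वेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
Hindi Query | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1
Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1
Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1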
Figure 4.5 Analysis of Inter precision values
From the above tables and figures it can be seen that Hindi data of a similar
nature can be retrieved for Hindi queries by transforming them into variants
that include English keywords written in Hindi script. The transformation of
queries resulted in an increase in retrieved data. The relevance of the
retrieved data can be seen in the precision columns. For every Hindi query and
its transformed variations, the degree of relevance of the documents is very
close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान,
9 of the first 10 documents are relevant, and for the transformed queries of a
similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10
documents are also relevant. The same pattern repeats for the rest of the Hindi
queries, as shown in the table above. Without transformation of Hindi queries,
the user may miss relevant information, because the Hindi user may not be aware
that such information is present on the web and may be unable to formulate a
query variant that reflects English influence. From the above table it can be
said that English-influenced Hindi information is present on the web and is
increasing day by day. By including English keywords written in Hindi script in
the query, the scope of searching in Hindi and of getting relevant information
can be increased.
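The precision values discussed above are the fraction of the first 10 results judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the representation of relevance judgements as a list of booleans are assumptions made for illustration.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that were judged relevant.

    `relevance` is a list of booleans in ranked order (True = relevant).
    Results beyond position k are ignored; missing results count as
    non-relevant because the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


if __name__ == "__main__":
    # Nine relevant documents among the first ten, as in the first row
    # of the precision table.
    judgements = [True] * 9 + [False]
    print(precision_at_k(judgements))
```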
[Figure 4.5: Inter precision chart. Horizontal axis: HQ1 to HQ6, each followed
by its L1 and L2 variants (HQ = Hindi Query, L = Level). Series: Google, Bing,
AltaVista.]
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
Relevant information can be retrieved by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at
the root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the
next chapter) which acts as an interface between the user and search engines.
With the help of this tool the user can widen the scope of search on the web in
the Hindi language [66][67].
4.2.3 Vowels in Dependent (maatraa) Form
When a vowel follows a consonant, it is written in its respective maatraa
form, which is appended to the consonant. Maatraa forms never appear at the
beginning of a word or after another vowel. The first vowel, अ, has no
particular maatraa form. Instead, it is the default vowel: it is assumed to be
present unless the maatraa form of another vowel is explicitly appended to a
consonant. In Sanskrit the vowel अ is pronounced at the end of a word. In
Hindi, however, it is not pronounced except at the end of single-letter words.
The following table lists each vowel in its independent form, its corresponding
dependent form, and how it would appear with the consonant क (k).
Independent | Dependent | With क
अ | (none) | क
आ | ◌ा | का
इ | ◌ि | कि
ई | ◌ी | की
उ | ◌ु | कु
ऊ | ◌ू | कू
ऋ | ◌ृ | कृ
ए | ◌े | के
ऐ | ◌ै | कै
ओ | ◌ो | को
औ | ◌ौ | कौ
Table 4.2 Maatraa Forms of Vowels
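The table above can also be expressed as a small mapping between Unicode code points. The sketch below is illustrative only; the names `MAATRAA` and `attach_vowel` are made up for this example, while the code points are those of the Unicode Devanagari block.

```python
import unicodedata

# Independent vowel -> dependent vowel sign (maatraa). अ has no maatraa:
# it is the inherent vowel, so nothing is appended.
MAATRAA = {
    "\u0905": "",        # अ
    "\u0906": "\u093E",  # आ -> ा
    "\u0907": "\u093F",  # इ -> ि
    "\u0908": "\u0940",  # ई -> ी
    "\u0909": "\u0941",  # उ -> ु
    "\u090A": "\u0942",  # ऊ -> ू
    "\u090B": "\u0943",  # ऋ -> ृ
    "\u090F": "\u0947",  # ए -> े
    "\u0910": "\u0948",  # ऐ -> ै
    "\u0913": "\u094B",  # ओ -> ो
    "\u0914": "\u094C",  # औ -> ौ
}


def attach_vowel(consonant, vowel):
    """Append the dependent form of `vowel` to `consonant`."""
    return consonant + MAATRAA[vowel]


if __name__ == "__main__":
    ka = "\u0915"  # क
    for vowel, sign in MAATRAA.items():
        name = unicodedata.name(sign) if sign else "(inherent vowel)"
        print(vowel, attach_vowel(ka, vowel), name)
```

Note that the sign for इ is stored after the consonant even though it is drawn to its left; the rendering engine handles the visual reordering.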
4.2.4 Allophones
As mentioned earlier, the distinction between the vowels इ and ई is the
duration of the pronunciation of the vowel: the former is shorter and the
latter longer. However, in practice the vowel इ is pronounced more like the
English "i" in the word "it", as described in the corresponding text. The same
is so for the vowels उ and ऊ.
4.2.5 Final Schwa
The schwa अ is normally not pronounced at the end of a word. Thus कान is
pronounced "kaan", not "kaana". An exception occurs when a word ends in a
conjunct. In this case the word may be pronounced with a slight final schwa, as
in मित्र, literally "mitr" but often pronounced like "mitr(a)" with a soft
final schwa.
4.2.6 Monophthongs versus Diphthongs
Native English speakers should be careful not to pronounce the Hindi vowels
that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a
glide like the English "o" in the word "low". Many vowel letters in English can
represent diphthongs. Thus, whereas English may represent a diphthong with the
letter "i", as in the word "site", in Devanagari this diphthong would be more
precisely transcribed as two monophthongs, आ and ई: साईट.
4.2.7 Schwa Syncope
Sometimes the inherent vowel is not pronounced, despite its implicit presence
and the lack of any modifying diacritic. This phenomenon is called schwa
syncope or, alternatively, schwa deletion. For instance, consider the word
नमकीन, literally "namakeen". The second inherent vowel is not pronounced, as if
the word were written नम्कीन ("namkeen"). There is no rule which can predict
this phenomenon with absolute accuracy, yet one generally useful heuristic is
that the inherent vowel is deleted after a consonant which is between two
vocalic consonants. Thus the word देवनागरी itself is pronounced with the first
schwa deleted, like "Devnagari" and not "Devanagari", even though it is still
transliterated as "Devanagari".
Occasionally the schwa will not be totally deleted, but will be very slightly
pronounced.
4.2.8 Schwa Pronunciation in Context
The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel which is similar
to the English "e" in the word "bed", but only in certain contexts, namely when
two अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to
wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances.
Thus, although the phrase पहन लो is literally "pahan lo", it is often
pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only
one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In
this case both vowels adjacent to ह are converted to [ɛ], and thus although the
word is literally "bahin", it is pronounced "behen".
4.2.9 Nasalization of Vowels
All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated
either by the symbol ◌ं or by the symbol ◌ँ. The former symbol is called bindu
(dot) and the latter symbol is called chandrabindu (moon and dot). The bindu is
used when part or all of the vowel symbol extends above the horizontal line.
The chandrabindu is used when no part of the vowel symbol extends above the
horizontal line. The bindu is more common in modern written Hindi and may even
be used exclusively.
The following examples summarize the use of the bindu and chandrabindu:
अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं
A special diacritic is sometimes used with the vowel आ to transcribe the
English "o" vowel sound, as in "college": कॉलेज.
4.2.10 Consonants: Velar Consonants
Letter | Description
क | unaspirated k
ख | aspirated k
ग | unaspirated g
घ | aspirated g
ङ | n as in sing
Table 4.3 Consonants: Velar Consonants
Note that the velar nasal consonant does not appear as the first letter of any
word.
4.2.11 Palatal Consonants
Letter | Description
च | unaspirated ch as in cheese
छ | aspirated ch
ज | unaspirated j
झ | aspirated j
ञ | n as in punch
Table 4.4 Palatal Consonants
4.2.12 Retroflex Consonants
Letter | Description
ट | like t but retroflex and unaspirated
ठ | like t but retroflex and aspirated
ड | like d but retroflex and unaspirated
ढ | like d but retroflex and aspirated
ण | like n but retroflex
Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these
consonants are formed by placing a diacritical mark called a nuqta, which is a
subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is
pronounced by flapping the tongue from the retroflex position forward toward
the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English
does have an alveolar flap consonant, as the "t" in the word "better" or the
"d" in "bedding" in American English. The Hindi flaps are retroflex, however.
4.2.13 Dental Consonants
Letter | Description
त | like t but dental and unaspirated
थ | like t but dental and aspirated
द | like d but dental and unaspirated
ध | like d but dental and aspirated
न | like n in name but dental
Table 4.6 Dental Consonants
4.2.14 Labial Consonants
Letter | Description
प | like p but unaspirated
फ | like p but aspirated
ब | like b but unaspirated
भ | like b but aspirated
म | m
Table 4.7 Labial Consonants
4.2.15 Semivowels
Letter | Description
य | y as in young
र | like r but often rolled
ल | l as in lip
व | either w or v
Table 4.8 Semivowels
The Hindi "r" sound is typically a flap. However, some speakers may trill the
"r" sound occasionally, or may even occasionally pronounce it closer to an
unflapped approximant sound, as in the English "r" in "red".
4.2.16 Sibilants
Letter | Description
श | sh as in shave
ष | like sh but retroflex
स | s as in save
Table 4.9 Sibilants
4.2.17 Glottal
Letter | Description
ह | like h but voiced
Table 4.10 Glottal
4.2.18 Allophony of "w" and "v" in Hindi
A phoneme is an equivalence class of atomic, discrete sounds which can produce
a difference in meaning when spoken, yet cannot produce a difference in meaning
when substituted for one another. A phone is simply a distinct sound. For
instance, in English the "p" in the word "spit" and in the word "pit" are
pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus
they are two distinct phones. However, they are both members of the same
phoneme, since substituting one for the other can never produce a difference in
meaning, even though the substitution may be perceived as slightly awkward by
native speakers. Two distinct phones which are both members of the same phoneme
are called allophones (from Greek, "different sounds").
In Hindi, the sounds associated with the English letters "w" and "v" are
allophones. Both are transcribed with one letter, व. Analogously to the English
example above, these sounds are typically pronounced consistently in words, but
they do not constitute meaningful differences in utterances. For example, the
word वो is typically pronounced as "vo", whereas the suffix -वाला is typically
pronounced "wala". Hindi speakers are not generally aware of this distinction,
even though they pronounce the distinction fairly consistently, just as English
speakers are not aware of the differences of aspiration in certain letters yet
pronounce aspiration consistently.
Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an
intermediate sound.
Semi-Allophones "j" and "z" in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction
between the English "j" and "z" sounds either, but will typically pronounce
words consistently. This situation is not quite the same as "w" and "v", since
technically the "z" sound can be represented distinctly from the "j" sound by
placing a dot (nuqta) underneath the letter, and some speakers are aware of
this distinction. For instance, the word जो is pronounced as "jo". There is
some variation, however, in some words such as ज्यादा: some speakers pronounce
this as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English "t" or "d" in Hindi. These English sounds
are pronounced with the tip of the tongue on the alveolar ridge behind the top
teeth. This place of articulation is between the Devanagari retroflex and
dental positions, although the English pronunciation will sound much closer to
the retroflex pronunciation to Hindi speakers. English loanwords containing "t"
or "d" are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Maatraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding symbols
in Devanagari. These sounds are represented by placing the nuqta underneath a
symbol which is phonetically similar. These symbols represent sounds from other
languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter | Approximation
क़ | like k but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़ but voiced
ज़ | just as English z as in zoo
झ़ | similar to the s in English vision
फ़ | just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the
non-nuqta forms, though: ज़ and फ़.
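Because nuqta forms can be typed either as a single precomposed character or as a base letter followed by the nuqta sign, and because writers often omit the nuqta altogether, a retrieval system may want to normalize them before matching. The following is a minimal sketch using Python's standard `unicodedata` module; the function names are hypothetical, and stripping the nuqta is an assumed matching strategy rather than something prescribed in this chapter.

```python
import unicodedata

NUQTA = "\u093C"  # DEVANAGARI SIGN NUKTA


def normalize_nuqta(text):
    """Bring both spellings of a nuqta letter to one canonical form.

    Under NFC the precomposed letters U+0958 to U+095F are composition
    exclusions, so they come out as base letter + nuqta.
    """
    return unicodedata.normalize("NFC", text)


def strip_nuqta(text):
    """Fold nuqta letters onto their plain counterparts (e.g. ज़ -> ज)."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(NUQTA, "")


if __name__ == "__main__":
    precomposed_za = "\u095B"        # ज़ as one code point
    sequence_za = "\u091C" + NUQTA   # ज followed by nuqta
    print(normalize_nuqta(precomposed_za) == normalize_nuqta(sequence_za))
    print(strip_nuqta(precomposed_za) == "\u091C")
```

Folding with `strip_nuqta` trades precision for recall: it also merges the few word pairs that the nuqta genuinely distinguishes.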
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ, Devanagari provides two means of
suppressing the inherent vowel:
The halant (◌्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This
method is much more common. The halant is typically only used when
typographical difficulties make it difficult to use conjuncts.
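In Unicode text the two means listed above share one underlying representation: a conjunct is stored as consonant + virama (halant) + consonant, and the font decides whether to draw a ligature or a visible halant. A short sketch, with the helper name `conjunct` chosen for illustration:

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (the halant)


def conjunct(*consonants):
    """Join consonants with the virama so that only the last one keeps
    its inherent vowel."""
    return VIRAMA.join(consonants)


if __name__ == "__main__":
    na, da = "\u0928", "\u0926"  # न, द
    cluster = conjunct(na, da)   # stored as न + ◌् + द
    print(cluster, [unicodedata.name(ch) for ch in cluster])
```

The cluster occupies three code points regardless of how it is rendered, which matters when counting characters or comparing strings.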
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a
vertical line. The vertical line is deleted, and then the modified consonant
symbol is conjoined to the second consonant symbol. For example:
न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ + त = फ्त (मुफ्त)
क + य = क्य (क्यों)
Note that in the last two examples, although neither क nor फ ends in a vertical
line, they can still be the first letter of a horizontal conjunct. The curve on
the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts
with the following consonant. The first consonant is written on top of the
second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be observed. If a nasal consonant is
the first member of a conjunct, it may be written either using a regular
conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above
the horizontal line to the right side of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled
अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant:
in other words, it is articulated in the same location in the mouth as the
following consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The
anusvar represents a consonant which is the first letter of a conjunct, whereas
the bindu and chandrabindu represent the nasalization of a vowel. The bindu in
हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in
हिंदी is not considered a bindu, since it represents a consonant that is the
first member of a conjunct.
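The homorganic rule described above is regular enough to automate, which lets both spellings of a word (हिंदी and हिन्दी) be matched by the same query. The sketch below is one possible implementation under stated assumptions: the names `HOMORGANIC_NASAL` and `expand_anusvara` are hypothetical, only the five stop classes from the consonant tables are covered, and an anusvar before any other letter is left untouched.

```python
ANUSVARA = "\u0902"  # ◌ं
VIRAMA = "\u094D"    # ◌्

# Stops of each articulation class mapped to the nasal of the same class.
HOMORGANIC_NASAL = {}
for stops, nasal in [
    ("\u0915\u0916\u0917\u0918", "\u0919"),  # velar   क ख ग घ -> ङ
    ("\u091A\u091B\u091C\u091D", "\u091E"),  # palatal च छ ज झ -> ञ
    ("\u091F\u0920\u0921\u0922", "\u0923"),  # retroflex ट ठ ड ढ -> ण
    ("\u0924\u0925\u0926\u0927", "\u0928"),  # dental  त थ द ध -> न
    ("\u092A\u092B\u092C\u092D", "\u092E"),  # labial  प फ ब भ -> म
]:
    for stop in stops:
        HOMORGANIC_NASAL[stop] = nasal


def expand_anusvara(word):
    """Rewrite anusvar + stop as nasal + virama + stop."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and following in HOMORGANIC_NASAL:
            out.append(HOMORGANIC_NASAL[following] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(expand_anusvara("\u0939\u093F\u0902\u0926\u0940"))  # हिंदी
    print(expand_anusvara("\u0905\u0902\u0921\u093E"))        # अंडा
```

Applying the same expansion to both indexed text and queries gives the two spellings a common form without changing how either is displayed.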
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above
and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined
to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted "v", as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are
treated as a single unit, so a maatraa that is written to the left, such as ि,
is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign, the viraam (।), which is a vertical line that
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is
known as encoding. Most of us are familiar with ASCII, which is a 7-bit
encoding of the characters in the English language (it can store at most 128
characters). With the passage of time, the need was felt for a single encoding
that could contain enough characters to accommodate all the languages in the
world. To enable sharing of information, this encoding would need to be a
universally accepted standard. That standard is Unicode. Unicode assigns every
character a code point in the range U+0000 to U+10FFFF (stored in 32 bits in
the UTF-32 encoding form), which can potentially give a unique number to each
character in all languages known to man.
There is also another international standard, ISO 10646 of the International
Organization for Standardization (ISO), which defines the Universal Character
Set (UCS). Fortunately, the participants of both projects (ISO and Unicode)
realized around 1991 that two different unified character sets are not what the
world needs. They joined their efforts and worked together on creating a single
encoding. Both projects still exist and publish their respective standards
independently, but have agreed to keep the encodings of the Unicode and ISO
10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular
character, but that is not all. They must also define how this value will be
represented in bits when stored in a computer file or transmitted over the
Internet. The Unicode Standard defines three encoding forms that specify how a
particular character is represented in bits. The three encoding forms allow the
same data to be transmitted in a byte, word or double-word oriented format
(i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the
same common character repertoire and can be efficiently transformed into one
another without loss of data. The three encoding forms defined by the Unicode
Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming
all Unicode characters into a variable-length encoding of bytes. It has the
advantages that the Unicode characters corresponding to the familiar ASCII set
have the same byte values as ASCII, and that Unicode characters transformed
into UTF-8 can be used with much existing software without extensive software
rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact, and all
the heavily used characters fit into a single 16-bit code unit, while all other
characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern, but fixed-width, single
code unit access to characters is desired. Each Unicode character is encoded in
a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
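To make the three encoding forms concrete, the sketch below encodes the same Devanagari letter in each of them using Python's built-in codecs. The helper name `encoded_forms` is made up for this example; the big-endian codec variants are chosen only so that no byte order mark is prepended.

```python
def encoded_forms(text):
    """Return the bytes of `text` in the three Unicode encoding forms."""
    return {
        "UTF-8": text.encode("utf-8"),
        "UTF-16": text.encode("utf-16-be"),
        "UTF-32": text.encode("utf-32-be"),
    }


if __name__ == "__main__":
    ka = "\u0915"  # क, code point U+0915
    for form, data in encoded_forms(ka).items():
        print(form, data.hex(" "), "(%d bytes)" % len(data))
```

A single Devanagari letter takes three bytes in UTF-8, two in UTF-16 and four in UTF-32, while an ASCII letter takes one, two and four bytes respectively.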
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and range from
2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes,
but the current definition limits them to 4); the most commonly used characters
are at most three bytes long. Non-ASCII characters are encoded as follows:
Non-ASCII characters are encoded as a sequence of several bytes, each of which
has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII
characters stored in bytes have their most significant bit not set). This
allows the application to differentiate between ASCII and non-ASCII characters.
Bytes representing non-ASCII characters will never be mistaken for ASCII
characters.
The first byte of a multibyte sequence that represents a non-ASCII character
indicates how many bytes follow for this character. All further bytes in the
multibyte sequence are used to encode the actual character [61].
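The rule that the first byte announces the length of the sequence can be illustrated by inspecting its leading bits. This is a simplified sketch: the function name `utf8_sequence_length` is hypothetical, and it checks only the bit pattern of the lead byte, not the additional validity rules a real decoder enforces.

```python
def utf8_sequence_length(first_byte):
    """Number of bytes in the UTF-8 sequence that starts with `first_byte`.

    0xxxxxxx -> 1 byte (ASCII)
    110xxxxx -> 2 bytes
    1110xxxx -> 3 bytes
    11110xxx -> 4 bytes
    10xxxxxx is a continuation byte and cannot start a sequence.
    """
    if first_byte < 0x80:
        return 1
    if 0xC0 <= first_byte < 0xE0:
        return 2
    if 0xE0 <= first_byte < 0xF0:
        return 3
    if 0xF0 <= first_byte < 0xF8:
        return 4
    raise ValueError("not a valid lead byte: 0x%02X" % first_byte)


if __name__ == "__main__":
    data = "\u0915".encode("utf-8")  # क
    print(utf8_sequence_length(data[0]), len(data))
    # Every byte of a non-ASCII character has its most significant bit set.
    print(all(byte & 0x80 for byte in data))
```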
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from
left to right. They are all abugidas, in which most symbols stand for a
consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels
in many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected
consonant. Absence of the inherent vowel, when that occurs, is frequently
marked with a special sign. In the Unicode Standard this sign is denoted by the
Sanskrit word virāma. In some languages another designation is preferred. In
Hindi, for example, the word hal refers to the character itself and halant
refers to the consonant that has its inherent vowel suppressed; in Tamil, the
word puḷḷi is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character,
with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in
the south, and from Pakistan in the west to the easternmost islands of
Indonesia, are derived from the ancient Brahmi script. The oldest lengthy
inscriptions of India, the edicts of Ashoka from the third century BCE, were
written in two scripts, Kharoshthi and Brahmi. These are both ultimately of
Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written
from right to left, was supplanted by Brahmi and its derivatives. The
descendants of Brahmi spread with myriad changes throughout the subcontinent
and outlying islands. There are said to be some 200 different scripts deriving
from it. By the eleventh century, the modern script known as Devanagari was in
ascendancy in India proper as the major script of Sanskrit literature. This
northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan;
the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all
encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which
facilitates transliteration to some degree, is based on the Indian national
standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala
has a virama-based model but is not structurally mapped to ISCII. Tibetan
stands apart, using a subjoined consonant model for conjoined consonants,
reflecting its somewhat different structure and usage. The Limbu script makes
use of an explicit encoding of syllable-final consonants. Many of the character
names in this group of scripts represent the same sounds, and naming
conventions are similar across the range.
4.3.4 Devanagari: U+0900 to U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the
Tibetan script and the Southeast Asian scripts, are historically connected with
the Devanagari script as descendants of the ancient Brahmi script. The entire
family of scripts shares a large number of structural features. The principles
of the Indic scripts are covered in some detail in this introduction to the
Devanagari script. The remaining introductions to the Indic scripts are
abbreviated but highlight any differences from Devanagari where appropriate.
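Since the Devanagari block occupies the contiguous range U+0900 to U+097F, a program can recognise Devanagari text with a simple range check, for example to decide whether a query was typed in Hindi script. The helper names below are hypothetical, and the sketch deliberately ignores the later Devanagari Extended blocks.

```python
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def is_devanagari(ch):
    """True if the character lies in the Devanagari block."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


def devanagari_ratio(text):
    """Share of non-space characters that are Devanagari (0.0 if none)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if is_devanagari(ch)) / len(chars)


if __name__ == "__main__":
    print(devanagari_ratio("\u0939\u093F\u0902\u0926\u0940 search"))
```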
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian
Script Code for Information Interchange). The ISCII standard of 1988 differs
from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The
Unicode Standard encodes Devanagari characters in the same relative positions
as those coded in positions A0 to F4 (hexadecimal) in the ISCII-1988 standard.
The same character code layout is followed for eight other Indic scripts in the
Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada
and Malayalam. This parallel code layout emphasizes the structural similarities
of the Brahmi scripts and follows the stated intention of the Indian coding
standards to enable one-to-one mappings between analogous coding positions in
different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar
and other scripts depart to a greater extent from the Devanagari structural
pattern, so the Unicode Standard does not attempt to provide any direct
mappings for these scripts to the Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was published,
the Bureau of Indian Standards published a new version of ISCII in Indian
Standard (IS) 13194:1991. This new version partially modified the layout and
repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts
encoded with ISCII-1991 may be automatically converted to Unicode code points
and back to their original encoding without loss of information.
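The parallel code layout means that a letter sits at the same offset within each script's block, so a rough script conversion needs nothing more than adding a constant. The sketch below shifts Devanagari text into the Bengali block (U+0980 onward). It is a simplification under stated assumptions: the function name is hypothetical, and not every offset is assigned in every script, so real transliteration needs exception handling that is omitted here.

```python
import unicodedata

DEVANAGARI_BASE = 0x0900
BENGALI_BASE = 0x0980
BLOCK_SIZE = 0x80


def shift_script(text, src_base=DEVANAGARI_BASE, dst_base=BENGALI_BASE):
    """Move each character of the source block to the same offset in the
    destination block; characters outside the source block are kept."""
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(dst_base + offset))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    ka = "\u0915"  # DEVANAGARI LETTER KA
    shifted = shift_script(ka)
    print(unicodedata.name(ka), "->", unicodedata.name(shifted))
```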
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts constitute
abugidas, a cross between syllabic writing systems and alphabetic writing
systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is
built on phonological principles and tends to correspond quite closely to
pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters
of the Devanagari script. These pieces consist of three distinct character
types: consonant letters, independent vowels and dependent vowel signs. In a
text sequence, these characters are stored in logical (phonetic) order [62].
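The canonical structure (((C)C)C)V can be approximated with a regular expression over the stored, logical-order code points. The pattern below is a rough sketch under stated assumptions: the name `SYLLABLE` and the chosen character ranges are illustrative, and the pattern does not attempt to cover every sign in the Devanagari block.

```python
import re

CONSONANT = "[\u0915-\u0939]\u093C?"   # consonant, optionally with nuqta
VIRAMA = "\u094D"
VOWEL_SIGN = "[\u093E-\u094C]"         # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"           # chandrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

# Zero or more (consonant + virama) pairs, a final consonant, an optional
# vowel sign and modifier; or an independent vowel with optional modifier.
SYLLABLE = re.compile(
    "(?:%s%s)*%s%s?%s?|%s%s?"
    % (CONSONANT, VIRAMA, CONSONANT, VOWEL_SIGN, MODIFIER,
       INDEPENDENT_VOWEL, MODIFIER)
)


def orthographic_syllables(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    return SYLLABLE.findall(word)


if __name__ == "__main__":
    hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"  # हिन्दी
    print(orthographic_syllables(hindi))
```

The first syllable comes out as ह followed by the sign for इ, which confirms the logical order: the vowel sign is stored after the consonant even though it is drawn before it.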
44 Indian Languages on internet
Rise of Hindi Urdu and other Indian languages on the Web has lead
millions of non-English speaking Indians to discover uses of the Internet in their
daily lives They are sending and receiving e-mails searching for information
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 107
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The concept of a new character encoding from the Unicode
Consortium (a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India)
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then the Hindi
Unicode support has been a part of all subsequent upgradations of Microsoft's
operating systems. Also providing Input Method Editor facilities give users the
option to use different types of keyboards," says Meghashyam Karanam, product
manager, vision and localization, at Microsoft India. The earlier system could
incorporate only 127 characters, which is not enough for the Hindi
Devanagari script. The Unicode system can incorporate up to 65,000 characters. As
most computers in India use Microsoft's operating system, this ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products. If Microsoft built the base for
Hindi, Google was ready to put up the superstructure. Realizing the potential of
Indian languages, the California-based company has launched various products in
the past two years. With the Google Hindi and Urdu search engines, one can
search all the Hindi and Urdu Web pages available on the Internet, including
those that are not in a Unicode font. "Google offers searching in 13 languages
(Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five
languages and Google transliteration in Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most
recent language that Google has added to its offerings," says Rahul
Roy-Chowdhury, product manager at Google India. To use the search function, "users
can type Hindi words in Roman script and a drop-down menu suggests several
Hindi phrases. By selecting the appropriate query, users can search for Hindi
content without even typing in Hindi," says Roy-Chowdhury. Google has more
useful tools for non-English users. Google News is available in Hindi. With the
Google translation engine, one can type English words and get a list of suggested
synonyms in Hindi. A transliteration tool allows users to type any word in
English, hit the space bar and get the same word in a different language.
Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles. What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, out of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life. But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
US, it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri
at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language corpora.
It has coordinated with various Indian agencies and universities to
develop corpora of more than 45 million words in the Scheduled Languages of India, which
is also a part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides the corpus architecture and tools for Asian
languages. It has a monolingual corpus, which contains approximately 96,157,000
words, and a parallel corpus, which consists of 200,000 words of text in English and
helps in the translation of Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' for a repository of a Hindi words database and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field have been noticed since 1980,
involving prominent institutions such as IIT Kanpur, University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new
projects were undertaken by IIT Bombay, IIIT Hyderabad, AU-KBC Centre, Chennai,
and Jadavpur University, Kolkata. TDIL started a
consortium-mode project in April 2008 for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is
a rule-based system and has approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti during 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully-automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can
translate text from one Indian language into another, based
on Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for translation of government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in all forms: desktop, network and web-based. It is based on the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar. Initially it was domain-specific, covering Personnel
Administration (specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars); gradually the domains were expanded. At present
it also covers domains like Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of
the Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The UNL-based
machine translation system is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the system-generated
translation, the user can give the system feedback in the form of new words, phrases and
sentences and obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti is a rule-based system that also uses a statistical approach. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports processing
of Unicode-encoded text for text manipulation. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl
4.5.11 Punjabi to Hindi Machine Translation System
During 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre-Processing,
Translation Engine and Post-Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT: Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the assistance of Sequoia
Capital. guruji.com uses crawling technology based on proprietary algorithms. For
any query, it goes deep into Indian language content and tries to return the
appropriate output. The guruji search engine covers a range of specific content: news,
entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and also provides an automated translation facility from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasha is Microsoft's multilingual content creation tool for
translating Wikipedia pages into multilingual pages. The source language in
WikiBhasha is English and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development of corpus creation/collection and content syndication. Moreover, it
provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government for e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gurjrati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of the need for language technology. Since 1970, the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). Also,
Indian languages Transliteration (ITRANS) was developed by Avinash Chopde;
ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et
al., 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also making efforts for the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource
Development, DRDO (Defense Research and Development Organization), the Department of
Atomic Energy, the All India Council of Technical Education and UGC (University Grants
Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide funds for research to development agencies. As an
end result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides the standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, more than 10 to 15 Indian local search engines must have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to put forth a brief overview of the current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP
(Unique Selling Proposition) or other. It is too early to pass a judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Websites 2008 Stats 2009 Stats
Google 37 35
Yahoo 32 25
Rediff 7 4
Orkut 6 7
More Info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website 2008 Stats 2009 Stats
Google 81 76
Yahoo 7 7
Wikipedia 3 6
English 3 4
More Info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website 2008 Stats 2009 Stats
Google 65 34
Yahoo 12 29
Rediff 4 15
Teluguone 2 0
Guruji 1 18
Raftaar 1 0.2
Hindi 1 NA
Webdunia 1 0.7
Khoj 0.7 2
More Info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced while Searching in Hindi: Low Recall
A preliminary investigation into typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji many times return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem while dealing with Indian languages due to multiple spellings,
morphological variants of keywords and English keywords in Hindi. Table 4.15
illustrates a few such cases. For example, a Hindi query for "world trade center
aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to result in
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that these keywords exist in the second search result. But just saying that we have a
recall problem may not be sufficient. The next obvious question
is 'how much of a problem is it?' In other words, we need to somehow
quantify the problem. For this purpose, we conducted many experiments to
determine at what levels this recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved Recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but grammatical and
structural standards are rarely followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, there are about 28
languages spoken in India, and Hindi, being the national language of India, is
influenced by the regional languages, which results in dialectal changes not only in
speaking but in writing also. Every language attaches markers to a root word to
construct new words (English uses s, es and ing; Hindi uses MAATRAAS
such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं)
in Hindi are morphological variants of the root word Yojnaa (योजना, "planning"
in English). It is desirable to combine all the
morphological variants of a word into a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
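The idea of conflating morphological variants can be sketched with a very small suffix-stripping stemmer. This is only an illustration under stated assumptions: the rule list SUFFIX_RULES and the function light_stem are hypothetical names introduced here, the rules cover just a handful of plural and oblique endings, and this is not the procedure used in the experiments reported in this chapter.

```python
# Hypothetical rule list: (suffix to remove, replacement), longest suffix first.
# Escapes are used so the exact code points are unambiguous.
SUFFIX_RULES = [
    ("\u093F\u092F\u094B\u0902", "\u0940"),  # -iyon  -> -ii
    ("\u093E\u0913\u0902", "\u093E"),        # -aaon  -> -aa  (Yojnaaon   -> Yojnaa)
    ("\u093E\u090F\u0902", "\u093E"),        # -aaen  -> -aa  (Yojnaayein -> Yojnaa)
    ("\u093E\u090F\u0901", "\u093E"),        # same ending written with chandrabindu
    ("\u094B\u0902", ""),                    # -on    -> (nothing)
    ("\u0947\u0902", ""),                    # -en    -> (nothing)
]


def light_stem(word):
    """Strip the first matching suffix, keeping a stem of at least two characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


# The three forms discussed above all reduce to the same root word.
for form in ["योजना", "योजनाओं", "योजनाएं"]:
    print(form, "->", light_stem(form))
```

A search engine could index and query on the output of such a function so that singular and plural forms retrieve the same documents. A real stemmer would need a far larger rule set and a check against a lexicon to avoid over-stripping.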
4.8.2 Phonetic Nature of the Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages and their multiple dialects, the transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety in the Indian language alphabet. The variety in the alphabet,
different dialects and the influence of regional and foreign languages have resulted in
spelling variations of the same word. For example, the following are the possible
spelling variations for the Hindi word अंग्रेजी (angrējī, meaning English).
There are numerous words which are phonetically equivalent but vary in writing.
The word school in Hindi can be written in different ways (सक र सक र सक र).
When information is searched for the single standard keyword school (स्कूल) and for a
non-standard Hindi phonetic equivalent keyword (सक र), Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by
the other regional languages, which results in phonetic variety of words. For
example, the English word school (स्कूल in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by the majority of the population of India in different states. For the
Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search
engines should be capable of retrieving results for phonetically equivalent
forms of the keywords entered. A user may use any keyword for searching,
and search engines should be capable of supporting all phonetically equivalent
words.
Also, no particular standard exists for writing the keywords used to fetch Hindi web
data. For each phonetically equivalent keyword in the query, the results vary,
i.e., a different set of documents is retrieved with little repetition.
The native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
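One way a search system might treat some phonetically equivalent spellings as the same keyword is to fold them to a common key before matching. The sketch below is an assumption-laden illustration, not a method taken from this study: the names normalize_spelling and group_variants are hypothetical, and only two classes of variation are folded (the nukta dot, and chandrabindu versus anusvara). Variants such as the ISKOOL form of school would need additional, language-specific rules.

```python
import unicodedata

NUKTA = "\u093C"         # dot below a consonant, as in the z sound of angrezi
CHANDRABINDU = "\u0901"  # nasalisation mark
ANUSVARA = "\u0902"      # nasal mark often typed in place of chandrabindu


def normalize_spelling(word):
    """Fold a few common Devanagari spelling variants to a single key."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    return unicodedata.normalize("NFC", text)


def group_variants(words):
    """Group words that share the same normalized key."""
    groups = {}
    for word in words:
        groups.setdefault(normalize_spelling(word), []).append(word)
    return groups


# Two spellings of the word for "English", with and without the nukta,
# end up under one key.
print(group_variants(["अंग्रेज़ी", "अंग्रेजी"]))
```

Folding is lossy by design: it trades a little precision (distinct words can collide) for recall, which is the problem this section is concerned with.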
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning. And a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for the information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
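A retrieval system can compensate for synonym choice by expanding the query with alternatives. The following is a minimal sketch, assuming a small hand-built synonym table; the names SYNONYMS and expand_query are hypothetical, and the single entry simply reuses the example above rather than any real lexical resource.

```python
# Hypothetical synonym table containing only the example from the text.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    terms = query.split()
    expanded = [query]
    for position, term in enumerate(terms):
        for alternative in synonyms.get(term, []):
            variant = terms[:position] + [alternative] + terms[position + 1:]
            expanded.append(" ".join(variant))
    return expanded


# Each expanded query can be sent to the search engine and the results merged.
for q in expand_query("सोने का आभूषण"):
    print(q)
```

In practice the table would come from a resource such as a WordNet for Hindi, and the expanded queries would usually be weighted below the original so that near synonyms do not swamp exact matches.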
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples
below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query, the word सोना (gold) is ambiguous, as it has another meaning, i.e., to
sleep. In the context of the above query the word सोना means gold, but the query can also be
interpreted as "women like to sleep".
Another query: (In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common;
however, in Hindi it also means mango. So the above query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature. Finding the correct sense of a word in a
given context is an intricate task: one word has more than one meaning, and the
meaning of the word depends on the context of the sentence. For example, कर (tax) has the
synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स in one context; in another context कर
(hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कर (to do) corresponds to करना.
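A very simple way to pick a sense from context is to count how many cue words for each sense appear in the rest of the query. The sketch below is a toy illustration of that idea under loud assumptions: the table SENSE_CUES, its cue words and the function guess_sense are all hypothetical and are not part of the experiments in this chapter.

```python
# Hypothetical cue words for each sense of one ambiguous word.
SENSE_CUES = {
    "आम": {
        "common": {"लोगों", "आदमी", "जनता"},
        "mango": {"फल", "मीठा", "पेड़"},
    },
}


def guess_sense(word, query, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the query, or None."""
    senses = cues.get(word)
    if not senses:
        return None
    context = set(query.split()) - {word}
    best = max(senses, key=lambda sense: len(senses[sense] & context))
    # If no cue word occurs at all, the context gives no evidence either way.
    if not senses[best] & context:
        return None
    return best


print(guess_sense("आम", "आम लोगों की पसंद"))
```

Returning None when no cue matches keeps the sketch honest about what it cannot decide; a fuller approach would draw the cue sets from dictionary glosses or a sense-tagged corpus.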
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways,
including the pronunciation of Hindi words. Many English words have been
localized in India, and some of them appear as if they were native Hindi words.
Indians are sometimes unable to find the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
uneducated Indians without being aware of the language those words come from. Most
Indians use these words in English rather than in their native language. English
has influenced Hindi not only in speech but in writing too.
This becomes more evident when we look at Hindi literature on the web.
The influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root
word. Table 4.17 gives a randomly selected subset of these queries,
which throws light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental Patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 418 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Figure 4.1 Precision values of the three search engines (series: P Google,
P Bing, P Guruji; x-axis: query numbers 1.1 to 11.1; y-axis: precision, 0 to 1)
Table 4.19 Precision values of the three search engines
S No Query Precision 10 SNO Query Precision 10
Google Bing Guruji Google Bing Guruji
11 ब यतवष मवन 05 03 01 63 ऩकषमोकीपरज ततम
09 06 01
12 ब यत मवष मवनो 03 01 01 71 क षषसभसम 07 03 02
22 हव ईद घमटन क क यण
05 03 01 72 क षषसभसम ओ 07 04 02
22 हव ईद घमटन ओ क क यण
03 03 01 81 कीटन शकक इसत भ र
1 08 02
31 ब यतभफ रीज न व रीब ष
09 05 05 82 कीटन शकोक इसत भ र
09 06 02
32 ब यतभफ रीज न व रीब ष ए
05 04 03 91 भ नमसकय ग 07 06 04
33 ब यतभफ रीज न व रीब ष ओ
05 02 02 92 भ नमसकय चगमो 09 06 03
41 षवर पतह न ऩयझ र 05 03 02 93 भ नमसकय ग 09 05 04
42 षवर पतह न ऩयझ र 05 03 0 101 गर भ णषवक सम जन
09 07 03
51 पर क ततकआऩद 1 04 03 102 गर भ णषवक सम जन ओ
1 06 0
52 पर क ततकआऩद ए 06 06 01 103 गर भ णषवक सम जन ए
08 06 02
61 ऩ कीपरज तत 1 07 02 111 परभ खक षषक दर 05 04 0
62 ऩकषमोकीपरज तत 09 06 01 112 परभ खक षषक नदरो 04 02 02
It was observed that all three search engines return more documents when the
query is submitted with the keyword in its root form. This justifies searching
with the root word: in general, we get better results with keywords in their
root form.
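The idea of submitting a keyword in its root form can be sketched as a very light suffix stripper. This is only an illustration under stated assumptions: the suffix rules below are a small hypothetical list covering plural endings such as those in Table 4.17, and they are not the method used by this study or by any of the search engines.

```python
# Hypothetical suffix rules: (plural or oblique ending, replacement).
# Longer endings are listed first so that they are tried before shorter ones.
SUFFIX_RULES = [
    ("ियों", "ी"),  # पक्षियों -> पक्षी
    ("ियाँ", "ी"),
    ("ाओं", "ा"),   # समस्याओं -> समस्या
    ("ाएँ", "ा"),   # भाषाएँ -> भाषा
    ("ों", ""),     # झीलों -> झील
]


def to_root(word: str) -> str:
    """Strip one known ending from a word; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters of stem so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


def root_query(query: str) -> str:
    """Apply to_root to every whitespace-separated keyword of a query."""
    return " ".join(to_root(word) for word in query.split())
```

For instance, `root_query` would turn the plural query for "agriculture problems" into the singular form used in query 7. A real stemmer would need a much larger rule set and a check against a dictionary.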
It was also observed that only Google lists morphological variants of the root
word, whereas Bing and Guruji list only the root word supplied, in almost all
the sample queries listed in the table above.
From the above results it is evident that only Google indexes document
keywords in their root form. Bing and Guruji do not index in that form, which
is why they retrieve fewer documents than Google. The overall comparison of
results from the three search engines in the tables above shows that, in
general, the quantity of results retrieved increases when keywords are used in
their root form. For search engines, however, the quality of results is more
important than the quantity. Figure 4.1 and Table 4.19 compare the precision
values of the three search engines. The precision value is calculated from the
top 10 results returned by each search engine. On closely observing the
results, we can say that the precision value for Google is high in almost all
queries. Since Google indexes keywords in their root form, as mentioned above,
the relevance of its results is also higher than that of the other two search
engines. This indicates that not only the quantity but also the quality of
results is affected by morphological variation in the keywords.
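The precision value over the top 10 results can be written as a short function. This is a minimal sketch: the function name and the convention of dividing by k even when fewer than k results are returned are assumptions, and the relevance judgments themselves are supplied by a human assessor.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevant` holds one judgment per returned result, in rank order.
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevant[:k]
    return sum(1 for judged in top if judged) / k
```

With five relevant documents among the first ten, `precision_at_k` gives 0.5, which is how an entry such as 05 in Table 4.19 is to be read.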
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any spelling
of a keyword for searching, and search engines should support all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision obtained for each variant.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart
per query, Query No 1 to Query No 5)
In the above table and figure it can be clearly seen that search engines
return documents for each of the phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing the keywords used to
fetch Hindi web data. For every phonetically equivalent keyword in the query
the results vary, i.e. a different set of documents is retrieved with little
repetition. From the precision charts it is observed that the degree of
relevance for queries containing phonetically equivalent keywords is nearly
equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR
and may miss relevant information of his/her use.
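One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a common key before comparison. The sketch below is an assumption-laden illustration, not the approach of any search engine tested here: it folds three common sources of variation (nuqta, chandrabindu versus bindu, and a nasal consonant with halant versus bindu), and the function name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUQTA = "\u093c"         # dot placed under a consonant
CHANDRABINDU = "\u0901"
BINDU = "\u0902"
HALANT = "\u094d"

# A nasal consonant + halant before a stop consonant is often written as a bindu.
_NASAL_CONJUNCT = re.compile(
    "[ङञणनम]" + HALANT + "(?=[कखगघचछजझटठडढतथदधपफबभ])"
)


def phonetic_key(word: str) -> str:
    """Return a spelling-insensitive key for a Devanagari word."""
    # NFD splits precomposed nuqta letters into base letter + nuqta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUQTA, "")
    text = text.replace(CHANDRABINDU, BINDU)
    return _NASAL_CONJUNCT.sub(BINDU, text)
```

Two spellings can then be treated as variants of one keyword when their keys are equal. The key deliberately over-merges (it ignores the nuqta everywhere), so it suits recall-oriented matching rather than display.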
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning. A word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right
word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
Table 421 Effect of word synonyms on Hindi IR
S NO Query Standard
Hindi
Words
Synonyms Documents Returned
Google Per
10
Bing Per
10
Guruj
i
Per
10
1 स न क आब
षण
आब षणगहन
11 स न क आब षण 217000 08 3250 07 381 05
12 स न क गहन 188000 08 2590 06 389 05
13 स न क ज वय 78900 08 1670 07 311 06
14 स न क अर क य 9490 05 633 04 70 0
15 स न क आबयण 493 05 38 03 1 0
2 क र फ दर फ दर
21 क र फ दर 233000 07 7510 07 733 03
22 क र भ घ 40700 09 1500 08 99 06
23 क र जरधय 1570 06 54 06 2 02
3 सतर सशिकतकयण
सतर न यी
31 सतर सशिकतकयण
9950 09 1570 07 760 06
32 न यीसशिकतकयण
29300 09 1910 09 736 04
33 भहहर सशिकतकयण
96300 09 5160 08 1091 03
34 औयतसशिकतकयण
7670 08 680 07 510 02
4 मसक दयक अ हक य
अ हक य
41 मसक दयक अ हक य
1990 04 18 03 60 06
42 मसक दयक अमबभ न
2400 06 304 02 16 01
43 मसक दयक घभ ड
495 05 54 06 9 01
5 व रग ओ व
51 व रग ओ 6960 1 698 08 29 05
52 ऩ डरग ओ 13400 1 1080 09 143 09
53 दयखतरग ओ 481 05 19 05 0 0
6 आ खद न आ ख
61 आ खद न 34000 07 3690 05 312 03
62 न तरद न 77500 1 3240 1 159 09
63 च द न 2450 08 427 01 36 01
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together
with their synonyms improves information retrieval for a query in the Hindi
language. Not only the quantity of documents returned but also their quality
is affected by using synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and Guruji returns the fewest; the
reason may be the availability of fewer documents or poor indexing. However,
we are more interested in the quality of results than in the quantity. As far
as quality is concerned, it can be clearly seen that Google and Bing provide
better results than Guruji. In the average case Google still stands first,
that is, the precision values of Google are higher than those of Bing and
Guruji. Thus it becomes clear that by changing a keyword into its synonym,
equivalent results can be obtained. Therefore it is evident that synonyms of
keywords play an important role in Hindi information retrieval.
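Replacing a keyword by its synonyms amounts to generating one query variant per synonym. The following is a minimal sketch under stated assumptions: the synonym dictionary is a hand-made, hypothetical example, and the substitution is purely mechanical, so the inflected forms must be supplied in the dictionary.

```python
from typing import Dict, List


def expand_with_synonyms(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = " ".join(words[:i] + [alternative] + words[i + 1:])
            if variant not in variants:  # avoid duplicate queries
                variants.append(variant)
    return variants


# Hypothetical synonym list for the keyword meaning "ornament".
SYNONYMS = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}
```

Calling `expand_with_synonyms("सोने के आभूषण", SYNONYMS)` yields the original query and three variants, which could each be submitted to a search engine and their results merged.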
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected
queries below. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns hold the meaning of the query in English with respect to the
ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
SNo | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war
The ambiguous queries listed above were tested against three search engines:
Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In the above tables the last column, labeled "Other
Context", holds the number of results that are not relevant to the query
supplied, that is, documents that contain the keyword in some other,
non-required context. From the results it is clear that all search engines
return documents in different contexts. Therefore it can be said that search
engines underperform when supplied with ambiguous queries. The numbers in the
"Other Context" column signify the deviation from relevance. For example, for
the query युद्ध में कुल विनाश (aggregate destruction in wars) the "Other
Context" column contains 5 documents for Google, 8 for Bing and all 10 for
Guruji.
For another query, सपेरों का फन (art of snake charmers; other context: snake
charmer's snake head), the retrieved documents are expected to be in the
context "art". However, from the above results it can be seen that Google
returns all 10 documents in the non-required context (snake head) and Bing
returns 9, whereas Guruji fails to retrieve even a single document. In such a
scenario it becomes important for search engines to address the issue of
ambiguity in keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language has its influence over Hindi, not only in speaking but in
writing too. This becomes more evident when we look at Hindi literature on the
web. The influence of English on the Hindi language has been observed to be
one of the most important parameters for Hindi information retrieval, as the
following example shows.
Example: the English word exercise is written in Hindi as (एकसयस इज). The
word exercise (एकसयस इज) has the following phonetic variations in Hindi:
एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic
variation shown in this example, a variety of popular keywords and queries
from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
English Words
Google Transliteration
Standard Hindi Keywords
Phonetic Equivalents Search Engine Google
Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800
Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710
Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820
Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100
Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58
Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 2610000 537000
University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270
Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300
Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160
Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
3240 38000 639 1120
specialist टऩशसअशरटट
सऩ मशममरसट
सऩ शमरसट सऩ श मरसट
143 1440 51300 9820
Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that the search engine returns
documents for single-keyword queries, and that documents are also returned, in
large numbers, for all phonetic variants of the keywords. It can also be seen
that people have their own ways of writing Hindi words and that no standard is
followed for storing Hindi data on the web. Documents are retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above
table, the column with bold Hindi entries shows the keywords obtained with the
Google transliteration tool. Table 4.26 shows that transliteration does not
provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be म तनवमसमटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. It is
evident from the table above that 1540 documents were retrieved for the wrong
keyword उतनव मसमतम (University), and the same holds for other single-word
queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन,
Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be inferred that Hindi website
developers make use of unchecked and non-standard transliteration, which makes
Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on Hindi IR, in terms of the precision and quantity of
documents retrieved. The Hindi query is transformed into its variants by the
software (its design and working are discussed in the next chapter), which
replaces Hindi keywords with English keywords written in Hindi script without
changing the meaning of the query.
Table 4.26 (continued): Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent. At the second level, more than one word is
replaced by its English equivalent, without changing the meaning of the
original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign"
(फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed
at the two levels as
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh
Bharat mein) supplied by the user is transformed into two equivalent senses
containing a mixture of English and Hindi, while the meaning of the query
remains the same. Some randomly selected queries from the sample set of one
hundred queries are presented below in Table 4.27.
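The two-level transformation described above can be sketched as follows. This is not the software discussed in the next chapter: the function name, the return structure and the small dictionary of English equivalents written in Devanagari are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List


def transform_query(query: str, english_equivalents: Dict[str, str]) -> Dict[int, List[str]]:
    """Build Level 1 (one word replaced) and Level 2 (several words replaced) variants."""
    words = query.split()
    # Positions of the words for which an English equivalent is known.
    positions = [i for i, word in enumerate(words) if word in english_equivalents]
    levels: Dict[int, List[str]] = {1: [], 2: []}
    for size in range(1, len(positions) + 1):
        for chosen in combinations(positions, size):
            variant = [
                english_equivalents[word] if i in chosen else word
                for i, word in enumerate(words)
            ]
            levels[1 if size == 1 else 2].append(" ".join(variant))
    return levels


# Hypothetical dictionary: Hindi keyword -> English keyword in Hindi script.
EQUIVALENTS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
```

For the query "विदेशी निवेश भारत में" this produces two Level 1 variants (one per replaceable keyword) and one Level 2 variant in which both keywords are replaced. Words without a dictionary entry are left untouched, so the meaning of the query is preserved as long as the dictionary is correct.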
Figure 4.4 Transformed queries into two equivalent senses (one chart per
query, Hindi Query 1 to Hindi Query 6, showing the number of Google results
for the original query, Level 1 and Level 2)
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi: tabular representation
Influenced Hindi Query Google
In English Hindi Query Level 1 Level 2 Search Results
Health and blood donation
सव सथ औय यकतद न
हलथ औय यकतद न
हलथ औय जरि_िोनिन
1520 8360 1020
Treatment for Blood pressure
यकतच ऩ क इर ज
जरिपरिय क इर ज
जरिपरिय क टरीटभट
95800 37200 2150
Cardiologist रदम चचककतसक हाट िॉकटय काडिमोरासिटट 241000 99100 1770
Government Employment Policy
सयक य दव य य जग य म जन
सयक य दव य य जग य टकीभ
सयक य दव य एमपतरॉमभट सकीभ
1840000 51600 93
Foreign investment in
India
षवद श तनव श ब यत भ
पायन तनव श ब यत भ
पायनइनवटटभट ब यत भ
658000 527 66
Corruption free India
भरषटट च य भ कत ब यत
कयतिन भ कत ब यत
कयतिन फरी इॊडिमा
181000 19700 47000
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity
of retrieved documents is considerable. For search engines the quality of
results is more important than the quantity; therefore Table 4.28 and
Figure 4.5 are presented below for the analysis of precision values. Three
popular search engines, namely Google, Bing and AltaVista, were used for
retrieving web results.
Table 4.28 Analysis of precision values: tabular representation
Hindi
Query
Influenced Hindi Query Precision 10
Level 1 Level 2 Google Bing AltaVista
सव सथऔययकतद न
ह लथऔययकतद न
ह लथऔयबरड_ड न शन
09 09 09 09 08 08 09 09 08
यकतच ऩक इर ज
बरडपर शयक इर ज
बरडपर शयक टरीटभट
08 09 07 08 07 05 08 07 05
रदमचचककतसक
ह टमचचककतसक
क रड मम र िजसट 09 08 1 09 07 1 09 07 1
सयक यदव य य जग यम जन
सयक यदव य य जग यसकीभ
सयक यदव य एमपरॉमभटसकीभ 1 09 08 08 08 08 06 07 08
षवद श तनव शब यतभ
प य नतनव शब यतभ
प य नइनव सटभटब यतभ
09 08 08 09 08 04 09 09 04
भरषटट च यभ कतब यत
कयपशनभ कतब यत
कयपशनफरीइ रडम
1 1 1 1 1 1 1 1 1
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be clearly seen that Hindi data of a
similar nature can be mined for Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in retrieved data. The relevance of the
retrieved data can be seen in the precision columns. For every Hindi query and
its transformed variations, the degree of relevance of the documents is very
close, equal or improved. For example, for the Hindi query स्वास्थ्य और
रक्तदान, 9 of the first 10 documents are relevant, and for the transformed
queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of
the first 10 documents are also relevant. The same pattern repeats for the
rest of the Hindi queries, as shown in the table above. Without transformation
of Hindi queries the user may miss relevant information, because a Hindi user
may not be aware that such information is present on the web and may be unable
to formulate the query variations that arise from the influence of English.
From the above table it can be said that English-influenced Hindi information
is present on the web and is increasing day by day. By including English
keywords in Hindi script in the query, the scope of searching in Hindi and of
getting relevant information can be increased.
Inter Precision Chart: HQ (Hindi Query), L (Level); series Google, Bing, AltaVista
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. Generally people do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at
the root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort required of a Hindi user searching for such
information, a software tool has been developed (described in detail in the
next chapter) which acts as an interface between the user and search engines.
With the help of this tool the user can widen the scope of search on the web
in the Hindi language [66][67].
4.2.4 Allophones
As mentioned earlier, the distinction between the vowels इ and ई is the
duration of the pronunciation of the vowel: the former is shorter and the
latter longer. In practice, however, the vowel इ is pronounced more like the
English i in the word it, as described in the corresponding text. The same is
so for the vowels उ and ऊ.
4.2.5 Final Schwa
The schwa अ is normally not pronounced at the end of a word. Thus कान is
pronounced kaan, not kaana. An exception occurs when a word ends in a
conjunct. In this case the word may be pronounced with a slight final schwa,
as in मित्र, literally mitr but often pronounced like mitr(a), with a soft
final schwa.
4.2.6 Monophthongs versus Diphthongs
Native English speakers should be careful not to pronounce the Hindi vowels
that are monophthongs as diphthongs. For instance, ओ is a pure sound, not a
glide like the English o in the word low. Many vowel letters in English can
represent diphthongs. Thus, whereas English may represent a diphthong with the
letter i, as in the word site, in Devanagari this diphthong would be more
precisely transcribed as two monophthongs, आ and ई: साईट.
4.2.7 Schwa Syncope
Sometimes the inherent vowel is not pronounced, despite its implicit presence
and the lack of any modifying diacritic. This phenomenon is called schwa
syncope or, alternatively, schwa deletion. For instance, consider the word
नमकीन, literally namakeen. The second inherent vowel is not pronounced, as if
the word were written नम्कीन (namkeen). There is no rule which can predict
this phenomenon with absolute accuracy, yet one generally useful heuristic is
that the inherent vowel is deleted after a consonant which is between two
vocalic consonants. Thus the word देवनागरी itself is pronounced with the first
schwa deleted, like Devnagari and not Devanagari, even though it is still
transliterated as Devanagari.
Occasionally the schwa is not totally deleted but is very slightly
pronounced.
4.2.8 Schwa Pronunciation in Context
The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the
English e in the word bed, but only in certain contexts, namely when अ vowels
appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both
schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although
the phrase पहन लो is literally pahan lo, it is often pronounced pehen lo.
Occasionally, however, this phenomenon occurs when only one schwa vowel is
beside the consonant ह, as in the word बहिन (sister). In this case both vowels
adjacent to ह are converted to [ɛ], and thus although the word is literally
bahin, it is pronounced behen.
4.2.9 Nasalization of Vowels
All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated
either by the symbol ं or by the symbol ँ. The former symbol is called bindu
(dot) and the latter symbol is called chandrabindu (moon and dot). The bindu
is used when part or all of the vowel symbol extends above the horizontal
line. The chandrabindu is used when no part of the vowel symbol extends above
the horizontal line. The bindu is more common in modern written Hindi and may
even be used exclusively.
The following examples summarize the use of the bindu and chandrabindu:
अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं
A special diacritic is sometimes used with the vowel आ to transcribe the
English o vowel sound, as in college: कॉलेज.
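The rule for choosing between bindu and chandrabindu can be expressed as a small lookup. This is a sketch only: the set of vowel signs treated as extending above the horizontal line is an assumption based on their usual printed shapes, and the function name is hypothetical.

```python
BINDU = "\u0902"         # nasalization dot
CHANDRABINDU = "\u0901"  # moon and dot

# Vowel signs (matras) assumed to have a stroke above the horizontal line:
# i, ii, e, ai, o, au.
ABOVE_LINE_MATRAS = {"\u093f", "\u0940", "\u0947", "\u0948", "\u094b", "\u094c"}


def nasalize(consonant: str, matra: str = "") -> str:
    """Attach the nasalization mark that the rule above selects.

    `matra` is the dependent vowel sign, or "" for the inherent vowel.
    """
    mark = BINDU if matra in ABOVE_LINE_MATRAS else CHANDRABINDU
    return consonant + matra + mark
```

Since the bindu may be used exclusively in modern written Hindi, a search tool would in practice still need to treat both marks as equivalent when matching keywords.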
4.2.10 Consonants: Velar Consonants
Letter Description
क unaspirated k
ख aspirated k
ग unaspirated g
घ aspirated g
ङ n as in sing
Table 4.3 Consonants: Velar Consonants
Note that the velar nasal consonant does not appear as the first letter of any
word.
4.2.11 Palatal Consonants
Letter Description
च unaspirated ch as in cheese
छ aspirated ch
ज unaspirated j
झ aspirated j
ञ n as in punch
Table 4.4 Palatal Consonants
4.2.12 Retroflex Consonants
Letter Description
ट like t but retroflex and unaspirated
ठ like t but retroflex and aspirated
ड like d but retroflex and unaspirated
ढ like d but retroflex and aspirated
ण like n but retroflex
Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for
these consonants are formed by placing a diacritical mark called a nuqta,
which is a subscript dot, underneath the consonant symbols ड and ढ
respectively. ड़ is pronounced by flapping the tongue from the retroflex
position forward toward the alveolar ridge. ढ़ is pronounced similarly, except
with aspiration. American English does have an alveolar flap consonant, as the
t in the word better or the d in bedding. The Hindi flaps, however, are
retroflex.
4.2.13 Dental Consonants
Letter Description
त like t but dental and unaspirated
थ like t but dental and aspirated
द like d but dental and unaspirated
ध like d but dental and aspirated
न like n in name but dental
Table 4.6 Dental Consonants
4.2.14 Labial Consonants
Letter Description
प like p but unaspirated
फ like p but aspirated
ब like b but unaspirated
भ like b but aspirated
म m
Table 4.7 Labial Consonants
4.2.15 Semivowels
Letter Description
य y as in young
र like r but often rolled
ल l as in lip
व either w or v
Table 4.8 Semivowels
The Hindi r sound is typically a flap. However, some speakers may trill the r
sound occasionally, or may even occasionally pronounce it closer to an
unflapped approximant sound, like the English r in red.
4.2.16 Sibilants
Letter Description
श sh as in shave
ष like sh but retroflex
स s as in save
Table 4.9 Sibilants
4.2.17 Glottal
Letter Description
ह like h but voiced
Table 4.10 Glottal
4.2.18 Allophony of w and v in Hindi
A phoneme is an equivalence class of atomic, discrete sounds: its members
cannot produce a difference in meaning when substituted for one another,
whereas substituting a sound from a different phoneme can. A phone is simply a
distinct sound. For instance, in English the p in the word spit and the p in
the word pit are pronounced distinctly: the former is unaspirated, the latter
is aspirated. Thus they are two distinct phones. However, they are both
members of the same phoneme, since substituting one for the other can never
produce a difference in meaning, even though the substitution may be perceived
as slightly awkward by native speakers. Two distinct phones which are both
members of the same phoneme are called allophones (from the Greek for
"different sounds").
In Hindi the sounds associated with the English letters w and v are
allophones. Both are transcribed with one letter, व. Analogously to the
English example above, these sounds are typically pronounced consistently in
words, but they do not constitute meaningful differences in utterances. For
example, the word वो is typically pronounced as vo, whereas the suffix -वाला
is typically pronounced wala. Hindi speakers are not generally aware of this
distinction, even though they pronounce it fairly consistently, just as
English speakers are not aware of the differences of aspiration in certain
letters yet pronounce aspiration consistently.
Thus व may be pronounced as w or v. Some speakers may even pronounce an
intermediate sound.
Semi-Allophones j and z in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction
between the English j and z sounds either, but will typically pronounce words
consistently. This situation is not quite the same as that of w and v, since
technically the z sound can be represented distinctly from the j sound by
placing a dot (nuqta) underneath the letter, and some speakers are aware of
this distinction. For instance, the word जो is pronounced as jo. There is some
variation, however, in some words such as ज्यादा: some speakers pronounce this
as zyada and some as jyada.
4.2.19 English Alveolar Consonants
There is no equivalent of the English "t" or "d" in Hindi. These English
sounds are pronounced with the tip of the tongue on the alveolar ridge behind the
top teeth. This place of articulation is between the Devanagari retroflex and dental
positions, although the English pronunciation will sound much closer to the
retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d"
are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding
symbols in Devanagari. These sounds are represented by placing the nuqta
underneath a symbol which is phonetically similar. These symbols represent
sounds from other languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter   Approximation
क़        like "k" but pronounced in the back of the mouth
ख़        velar fricative, like "Bach" in German
ग़        velar sound similar to ख़ but voiced
ज़        just as English "z", as in "zoo"
झ़        similar to the "s" in English "vision"
फ़        just as English "f"
Table 4.11 Foreign Sounds
Only two of the borrowed sounds, though, are typically pronounced distinctly
from the non-nuqta forms: ज़ and फ़.
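Because most speakers do not distinguish the nuqta forms, a retrieval system may want a query typed without the nuqta to match text typed with it. The following is a minimal Python sketch of one such folding step, using only the standard unicodedata module. The helper name strip_nuqta is hypothetical, and folding every nuqta letter onto its base letter is an assumed matching policy, not something the sources above prescribe.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA, the combining dot


def strip_nuqta(text: str) -> str:
    """Fold nuqta letters onto their base letters (hypothetical helper)."""
    # NFD splits precomposed letters such as U+095B into base letter + nuqta,
    # so both spellings are handled by a single replace.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


precomposed = "\u095b"       # one code point for the "z" letter
sequence = "\u091c\u093c"    # JA followed by the combining nuqta

# Unicode normalization already treats the two spellings as equivalent.
print(unicodedata.normalize("NFC", precomposed) == sequence)

# "zyada" typed with a nuqta folds to the same string as "jyada".
print(strip_nuqta("\u095b\u094d\u092f\u093e\u0926\u093e"))
```

Note that the same step also folds the few other Devanagari letters whose canonical decomposition contains a nuqta, which may or may not be wanted for languages other than Hindi.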
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ, Devanagari provides two means of
suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्
A conjunct, a ligature synthesized by conjoining two consonant symbols. This
method is much more common. The halant is typically only used when
typographical difficulties make it difficult to use conjuncts.
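In Unicode text the two means are stored the same way: a conjunct is the first consonant, the halant (called virama in the standard), and the second consonant, and the font decides whether to draw a ligature or a visible halant. A minimal Python sketch of this, in which the helper name conjunct is hypothetical:

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, i.e. the halant


def conjunct(*consonants: str) -> str:
    """Join consonants with a virama between each pair (hypothetical helper)."""
    return VIRAMA.join(consonants)


# KA + virama + SSA: three stored code points, usually drawn as one ligature.
ksha = conjunct("\u0915", "\u0937")
for ch in ksha:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

For indexing, this means that the length of a string in code points says little about the number of letters a reader sees.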
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct
contains a vertical line. The vertical line is deleted, and then the modified
consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द   हिन्दी
च + छ = च्छ   अच्छा
स + त = स्त   नमस्ते
ल + ल = ल्ल   बिल्ली
म + ब = म्ब   लम्बा
फ + त = फ्त   मुफ्त
क + य = क्य   क्यों
Note that in the last two examples, although neither फ nor क ends in a vertical
line, they can still be the first letter of a horizontal conjunct. The curve on the
right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical
conjuncts with the following consonant. The first consonant is written on top of
the second consonant. For example:
ट + ट = ट्ट   छुट्टी
ट + ठ = ट्ठ   चिट्ठी
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant
is the first member of a conjunct, it may be written either using a regular
conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above
the horizontal line to the right side of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant;
in other words, it is articulated in the same location in the mouth as the following
consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar
represents a consonant which is the first letter of a conjunct, whereas the bindu
and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be
considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not
considered a bindu, since it represents a consonant that is the first member of a
conjunct.
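The homorganic rule makes the two spellings mechanically interconvertible, which matters for retrieval because a query in one spelling should find documents in the other. The sketch below expands an anusvar into the explicit nasal conjunct. It is a minimal illustration: the function name expand_anusvar and the table layout are assumptions, only the five stop series are covered, and an anusvar before any other letter is left unchanged.

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094d"    # DEVANAGARI SIGN VIRAMA (halant)

# Each nasal is listed with the stops that share its place of articulation.
HOMORGANIC = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # velar: NGA before KA KHA GA GHA
    "\u091e": "\u091a\u091b\u091c\u091d",  # palatal: NYA before CA CHA JA JHA
    "\u0923": "\u091f\u0920\u0921\u0922",  # retroflex: NNA before TTA TTHA DDA DDHA
    "\u0928": "\u0924\u0925\u0926\u0927",  # dental: NA before TA THA DA DHA
    "\u092e": "\u092a\u092b\u092c\u092d",  # labial: MA before PA PHA BA BHA
}
NASAL_FOR = {stop: nasal for nasal, stops in HOMORGANIC.items() for stop in stops}


def expand_anusvar(word: str) -> str:
    """Rewrite anusvar + stop as homorganic nasal + virama + stop."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and nxt in NASAL_FOR:
            out.append(NASAL_FOR[nxt] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)


# The anusvar spelling of "Hindi" becomes the conjunct spelling.
print(expand_anusvar("\u0939\u093f\u0902\u0926\u0940"))
```

Applying the same function to both the indexed text and the query would make the two spellings compare equal. Going the other way (conjunct to anusvar) would serve equally well as a canonical form.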
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above
and to the right of the following consonant:
र + म = र्म   शर्मा
र + ट + ई = र्टी   पार्टी
As the second member of a conjunct, र is indicated by a diagonal line adjoined to
the vertical line of the preceding consonant:
क + र = क्र   शुक्रिया
म + र = म्र   उम्र
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted "v", as follows:
ट + र = ट्र   राष्ट्र
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are
treated as a single unit, and the maatraa ि is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line which
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
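Since a Hindi sentence may end with either the viraam or a period, a sentence splitter for Hindi text has to accept both. A minimal Python sketch follows, with the function name split_sentences chosen only for illustration. It deliberately ignores harder cases such as abbreviations and decimal numbers, which also contain periods.

```python
import re

DANDA = "\u0964"  # DEVANAGARI DANDA, the viraam

# Split on the viraam as well as the punctuation borrowed from English.
_TERMINATORS = re.compile(f"[{DANDA}.?!]")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentences of text, without their terminators."""
    return [part.strip() for part in _TERMINATORS.split(text) if part.strip()]


# Two short sentences: the first ends with a viraam, the second with a period.
sample = "\u0930\u093e\u092e \u0918\u0930 \u0917\u092f\u093e\u0964 \u0935\u0939 \u0938\u094b \u0917\u092f\u093e."
for sentence in split_sentences(sample):
    print(sentence)
```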
4.3 Unicode and Fonts
Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit
encoding of the characters used in English (it can represent at most 128
characters). With the passage of time, the need was felt for a single encoding that
could contain enough characters to accommodate all the languages in the world.
To enable sharing of information, this encoding would need to be a standard
accepted universally. That standard is Unicode. Unicode assigns every character
a code point in the range U+0000 to U+10FFFF, which is enough to give a unique
number to each character in all languages known to man.
Actually, there is another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the Universal
Character Set (UCS). Fortunately, the participants of both projects (ISO and
Unicode) realized around 1991 that two different unified character sets are not
exactly what the world needs. They joined their efforts and worked together on
creating a single encoding. Both projects still exist and publish their respective
standards independently, but have agreed to keep the encodings of the Unicode and
ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value will
be represented in bits when stored in a computer file or transmitted over the
Internet. The Unicode Standard defines three encoding forms for this purpose.
They allow the same data to be transmitted in a byte, word or double word
oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms
encode the same common character repertoire and can be efficiently transformed
into one another without loss of data. The three encoding forms, as defined by the
Unicode Consortium, are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable-length encoding of bytes. It
has the advantages that the Unicode characters corresponding to the familiar
ASCII set have the same byte values as ASCII, and that Unicode characters
transformed into UTF-8 can be used with much existing software without
extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact, and all the
heavily used characters fit into a single 16-bit code unit, while all other characters
are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single
code unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
By the way, UTF stands for UCS Transformation Format (the Unicode Standard
itself expands it as Unicode Transformation Format).
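The difference between the three forms is easy to see by encoding the same characters in each. The following Python sketch compares an ASCII letter, the Devanagari letter KA and one character outside the 16-bit range. The helper name encode_all is hypothetical, and the big-endian codec variants are chosen here only so that no byte order mark is added to the output.

```python
def encode_all(ch: str) -> dict[str, bytes]:
    """Encode one character in each of the three Unicode encoding forms."""
    forms = ("utf-8", "utf-16-be", "utf-32-be")
    return {form: ch.encode(form) for form in forms}


# "A" (ASCII), KA (Devanagari), and an emoji beyond U+FFFF.
for ch in ("A", "\u0915", "\U0001F600"):
    for form, data in encode_all(ch).items():
        print(f"U+{ord(ch):04X} {form:9} {len(data)} bytes: {data.hex(' ')}")
```

Devanagari text therefore costs three bytes per character in UTF-8 but two in UTF-16, which is one reason storage estimates for a Hindi collection depend on the chosen form.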
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other software
that rely on US-ASCII values but are transparent to other values. Any document
created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and
range from 2 to 4 bytes in size (the original design allowed sequences of up to
6 bytes, but RFC 3629 limits UTF-8 to 4); the most commonly used characters
are at most three bytes long. Non-ASCII characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII characters
stored in bytes have their most significant bit clear). This allows an application
to differentiate between ASCII and non-ASCII characters: bytes representing
non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII
character indicates how many bytes follow for this character. All further bytes in
the multibyte sequence begin with the bits 10 and carry the remaining bits of the
character [61].
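The role of the first byte can be illustrated by reading its leading bits. The sketch below is a simplified Python illustration, assuming the four-byte limit of RFC 3629; the function names are hypothetical, and the check does not reject every invalid lead byte (for example 0xC0 and 0xC1, which a strict decoder refuses).

```python
def utf8_sequence_length(lead: int) -> int:
    """Total number of bytes in the sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: a plain ASCII character
    if 0xC0 <= lead < 0xE0:
        return 2  # 110xxxxx: one continuation byte follows
    if 0xE0 <= lead < 0xF0:
        return 3  # 1110xxxx: two continuation bytes follow
    if 0xF0 <= lead < 0xF8:
        return 4  # 11110xxx: three continuation bytes follow
    raise ValueError(f"0x{lead:02X} is not a valid lead byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


encoded = "\u0915".encode("utf-8")  # DEVANAGARI LETTER KA
print(utf8_sequence_length(encoded[0]))
print([is_continuation(b) for b in encoded[1:]])
```

Because lead bytes and continuation bytes occupy disjoint ranges, a program that lands in the middle of a UTF-8 stream can skip continuation bytes until it finds the start of the next character.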
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from left to
right. They are all abugidas, in which most symbols stand for a consonant plus an
inherent vowel (usually the sound "a"). Word-initial vowels in many of these
scripts have distinct symbols, and word-internal vowels are usually written by
juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the
inherent vowel, when that occurs, is frequently marked with a special sign. In the
Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some
languages another designation is preferred. In Hindi, for example, the word hal
refers to the character itself and halant refers to the consonant that has its inherent
vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally
serves to suppress the inherent vowel of the consonant to which it is applied; it is
a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the
south, and from Pakistan in the west to the easternmost islands of Indonesia, are
derived from the ancient Brahmi script. The oldest lengthy inscriptions of India,
the edicts of Ashoka from the third century BCE, were written in two scripts,
Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably
deriving from Aramaic, which was an important administrative language of the
Middle East at that time. Kharoshthi, written from right to left, was supplanted by
Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes
throughout the subcontinent and outlying islands. There are said to be some 200
different scripts deriving from it. By the eleventh century, the modern script known
as Devanagari was in ascendancy in India proper as the major script of Sanskrit
literature. The northern branch of these scripts includes such modern scripts as
Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as
Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are
all encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit encoding
of syllable-final consonants. Many of the character names in this group of scripts
represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write
other related languages of India (such as Marathi) and of Nepal (Nepali). In
addition, the Devanagari script is used to write the following languages: Awadhi,
Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi
(Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. The principles of the Indic
scripts are covered in some detail in this introduction to the Devanagari script.
The remaining introductions to the Indic scripts are abbreviated but highlight any
differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic scripts
in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada and Malayalam. This parallel code layout emphasizes the structural
similarities of the Brahmi scripts and follows the stated intention of the Indian
coding standards to enable one-to-one mappings between analogous coding
positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer,
Myanmar and other scripts depart to a greater extent from the Devanagari
structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these
scripts to the Devanagari order.
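The parallel layout means that a rough script-to-script mapping needs no lookup table: a Devanagari code point can be moved into another block by adding the distance between the two block starts. The Python sketch below illustrates the idea for a few target blocks. The function name shift_script is hypothetical, and the result is a character-level mapping under the stated layout assumption, not a linguistically correct transliteration; positions that are unassigned in the target block are left unchanged.

```python
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F

# Start of each target block; the blocks follow the same ISCII-based layout.
BLOCK_START = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}


def shift_script(text: str, target: str) -> str:
    """Map Devanagari characters to the same relative position in target."""
    offset = BLOCK_START[target] - DEVANAGARI_START
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            shifted = chr(cp + offset)
            # Keep the original when the target position has no character.
            out.append(shifted if unicodedata.name(shifted, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


# The word "Hindi" in Devanagari, mapped into the Bengali block.
print(shift_script("\u0939\u093f\u0928\u094d\u0926\u0940", "bengali"))
```

Such a mapping can help a cross-script search find candidate spellings of a name, though the candidates still need checking against real orthography.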
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
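Logical order can be observed directly by listing the stored code points of a word. In the word for "Hindi", the short "i" sign is drawn to the left of its consonant but stored after it. The Python sketch below labels each stored character with one of the character types named above. The function name char_type and the code point ranges are simplifying assumptions that cover only the core of the Devanagari block.

```python
VIRAMA = 0x094D


def char_type(ch: str) -> str:
    """Classify a Devanagari character using simplified code point ranges."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant letter"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if cp == VIRAMA:
        return "virama"
    return "other"


word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # "Hindi", in stored order
for ch in word:
    print(f"U+{ord(ch):04X} {char_type(ch)}")
```

The stored sequence is consonant, vowel sign, consonant, virama, consonant, vowel sign: two orthographic syllables, the second of which has the structure CCV.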
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents. But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India,
proved to be a boon for Indian languages.
Microsoft incorporated the Hindi Unicode font Mangal in its operating system
in 2001. "Since then, Hindi Unicode support has been a part of all subsequent
upgrades of Microsoft's operating systems. Also, providing Input Method Editor
facilities gives users the option to use different types of keyboards," says
Meghashyam Karanam, product manager, vision and localization, at Microsoft
India. The earlier system could incorporate only 128 characters, which is not
enough for the Hindi Devanagari script. The Unicode system has room for more
than a million code points, about 65,000 of them in its basic 16-bit plane. As
most computers in India use Microsoft's operating system, this ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004 the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the California-based
company has launched various products in the past two years. With the Google
Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages
available on the Internet, including those that are not in a Unicode font. "Google
offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu
to name a few, Gmail in five languages, and Google transliteration in Bengali,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu
and Urdu. Urdu is the most recent language that Google has added to its
offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use
the search function, "users can type Hindi words in Roman script and a drop-down
menu suggests several Hindi phrases. By selecting the appropriate query, users
can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available
in Hindi. With the Google translation engine, one can type English words and get
a list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in English, hit the space bar and get the same word in a different
language. Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds.
The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the
over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering
there are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and Urdu
on the Internet? "The major challenge is Internet penetration and PC prices. The
moment we have better Internet penetration, especially in smaller towns, and PC
prices go down, Hindi and Indian languages can flourish on the Net," says Karnik
of webdunia.com. India had more than 49 million Internet users in June 2008, out
of which about 9 million used the Internet regularly, according to a study by
Juxtconsult India, a research company. "There is a big opportunity in Indian
languages. Studies showed that only 28 percent of Indian Web surfers preferred
English on the Web, but as good quality content in Indian languages was not
easily available, they did not visit many local language sites," says Mrutyunjay
Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web?
"That's a tough question. Right now it is not much," says Mishra. But
Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical
volume. "If we look at how the Internet developed in the US, it may provide a
useful analogy. First came content, which was mostly produced by people who
had a passion for putting up content they cared about. Traffic and monetization
were not the motive. Second came growing readership, as people started
discovering content. This set off a virtuous cycle in which content eventually
became a viable, monetizable business. Third were the application developers,
who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian
English. It was developed under the leadership of Prof. S.V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language
corpora. It has coordinated with various Indian agencies and universities to
develop corpora of more than 45 million words in the Scheduled Languages of
India, work which is also a part of the TDIL program. The Enabling Minority
Language Engineering
(EMILLE) program provides corpus architecture and tools for Asian
languages. It has a monolingual corpus which contains approximately 96,157,000
words, and a parallel corpus consisting of 200,000 words of text in English, which
supports translation into Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus intended as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' as a repository for a database of Hindi words and for
dialect mapping of Hindi. The Department of Information Technology of the
Government of India has started a project for developing Indian language
corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium
project for building parallel annotated corpora under the leadership of Dr. Girish
Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Efforts in this field have been under way since 1980,
involving prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new
projects were initiated by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University Kolkata. TDIL started a
consortium-mode project in April 2008 for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is
a rule-based system with approximately 1750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application which is available for use at httpanglahindiiitkacin To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti during 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It aims to be a fully-automatic,
general-purpose, high-quality machine translation (FGH-MT) system. Its software
can translate text from one Indian language into another, based on the grammar
rules of Panini's Ashtadhyayi. It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, begun in 1996, for translation of Government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide the translation tools to government agencies Mantra software is available
in all forms such as desktop network and web based It is based on Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar Initially it was domain specific such as Personal
Administration specifically Gazette Notifications Office Orders Office
Memorandums and Circulars gradually the domains were expanded At present
it also covers domains like Banking Transportation and Agriculture etc Earlier
Mantra technology was only for English to Hindi translation but currently it is
also available for English to other Indian Languages such as Gujarati Bengali and
Telugu MANTRA-Rajyasabha is a system for translating the parliament
proceedings such as papers to be laid on the Table [PLOT] Bulletin Part-I
Bulletin Part- II List of Business [LOB] and Synopsis Rajya Sabha Secretariat of
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRARajyasabha system
456 UNL-based MT System between English Hindi and Marathi
IIT Bombay has developed a machine translation system for English to Hindi
based on the Universal Networking Language (UNL). UNL is a United Nations
project for developing an interlingua for the world's languages. UNL-based
machine translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
457 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). The project is
funded by the Karnataka Government and is applicable in the domain of
government circulars.
458 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user
is not satisfied with the translated sentence generated by the system, the user
can supply new words, phrases and sentences as feedback and obtain a newly
interpreted translation. The Shiva MT system is available at
(httpebmtserciiscernetinmtloginhtml).
Shakti is a rule-based system that also uses a statistical approach. It is used
for translation from English to Indian languages (Hindi, Marathi and Telugu).
Users can access the Shakti MT system at (httpshaktiiiitnet) [24].
459 Tamil-Hindi MAT System
The K B Chandrasekhar Research Centre of Anna University, Chennai has developed
a machine-aided Tamil to Hindi translation system. The system is based on the
Anusaaraka machine translation system and follows a lexicon translation
approach. It also has a small set of transfer rules. Users can access the
system at httpwwwaukbcorgresearch_areasnlpdemomat
4510 Anubadok
Anubadok is a software system for machine translation from English to Bengali.
It is developed in the Perl programming language, which supports processing of
Unicode-encoded text. The system uses the Penn Treebank annotation system for
part-of-speech tagging. It translates an English sentence into Unicode-based
Bengali text. Users can access the system at
httpbengalinuxsourceforgenetcgibinanubadokindexpl
4511 Punjabi to Hindi Machine Translation System
In 2007 Josan and Lehal at Punjabi University, Patiala designed a Punjabi to
Hindi machine translation system. The system is built on the paradigm of
foreign machine translation systems such as RUSLAN and CESILKO. The system
architecture consists of three processing modules: Pre Processing, Translation
Engine and Post Processing.
4512 Contribution of Private Companies in Evolving the ILT: Indian Language
Search Engine
45121 Guruji
Guruji.com is the first Indian language search engine, founded by two IIT Delhi
graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital.
Guruji.com uses crawl technology based on proprietary algorithms. For any
query it goes deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business,
education and more.
45122 Google
The Internet search giant Google also supports major Indian languages such as
Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides automated translation from English to Indian languages.
The Google Transliteration Input Method Editor is currently available for
Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil,
Telugu and Urdu.
45123 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer
applications. The tool supports major Indian languages such as Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool
for translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local language.
45124 Webdunia
Webdunia is an important private player that assists the development of Indian
language technology in areas such as text translation, software localization
and website localization. It is also involved in research and development on
corpus creation/collection and content syndication, and it provides language
consultancy. It has developed various applications in Indian languages, such
as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail,
Greetings, Classifieds, Quiz, Quest, Calendar etc.
45125 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement technology
to many state governments and to the central government for e-governance
programs. It has developed software for multilingual content creation for
newspaper publishing, as well as high-quality Unicode-based fonts for major
Indian languages. In particular it has developed the Shree-Lipi Gujarati
package for the Gujarati language, which is useful in the DTP sector, corporate
offices and the e-Governance program of the Government of Gujarat.
4513 Government Effort for Evolving Language Technology
The Indian government has long been aware of the need for Indian language
technology. Since 1970 the Department of Electronics and the Department of
Official Language have been involved in developing it. Consequently, ISCII
(Indian Script Code for Information Interchange) was developed for Indian
languages on the pattern of ASCII (American Standard Code for Information
Interchange). In addition, Indian Languages Transliteration (ITRANS),
developed by Avinash Chopde, represents Indian language alphabets in terms of
ASCII (Madhavi et al 2005). The Department of Information Technology under the
Ministry of Communication and Information Technology is also working towards
the proliferation of language technology in India. Other government
ministries, departments and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council for Technical Education and
the UGC (University Grants Commission), are also involved directly and
indirectly in research and development of language technology. All these
agencies help develop important areas of research and provide funds to
development agencies. As an end result, IndoWordNet was developed for the
Indian languages on the pattern of the English WordNet.
45131 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for it. The TDIL journal Vishvabharata (Jan
2010) outlined short-term, intermediate and long-term goals for developing
language technology in India [64].
46 Search Engines available in Hindi Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, more than 10-15 Indian local search engines must have been
launched: some smaller, some bigger, some with huge funding and some with none.
This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt a brief overview of the current scenario.
Search engines that fall in the localized Indian category include Guruji,
Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp,
Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask
Laila, which launched a couple of days back. There are also localized versions
of the big players Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition). It is too early to pass judgment on any of them. These
are testing stages, and every start-up is adding new features and making its
services better.
461 Most Used Search Tools in India for web activity a Survey by Juxt
Consult 2008-2009 Report
4611 Most Used Websites
Websites 2008 Stats 2009 Stats
Google 37 35
Yahoo 32 25
Rediff 7 4
Orkut 6 7
More Info India online 2008 India online 2009 [65]
Table 412 Most Used Websites
4612 Info Search English
Website 2008 Stats 2009 Stats
Google 81 76
Yahoo 7 7
Wikipedia 3 6
English 3 4
More Info India online 2008 India online 2009 [65]
Table 413 Info Search English
4613 Info Search Local Language
Website    2008 Stats  2009 Stats
Google     65          34
Yahoo      12          29
Rediff     4           15
Teluguone  2           0
Guruji     1           18
Raftaar    1           0.2
Hindi      1           NA
Webdunia   1           0.7
Khoj       0.7         2
More Info India online 2008 India online 2009 [65]
Table 414 Info Search Local Language
47 Problems faced while search in Hindi Low recall
A preliminary investigation of typical information access technologies, using
present-day popular techniques, shows a severe problem of low recall when
information is accessed with Indian language queries. For instance, popular web
search engines such as Google, Yahoo and Guruji often return 0 search results
for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a
recall problem when dealing with Indian languages because of multiple
spellings, morphological variants of keywords, and English keywords written in
Hindi. Table 415 illustrates a few such cases. For example, the Hindi query for
"world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेंटर आतंकवादी हमला) returns 0
documents in Table 415; however, a small rephrasing of the query in Table 416
shows that these keywords do exist in the second set of search results. But
merely saying that we have a recall problem is not sufficient. The obvious next
question is how much of a problem it is. In other words, we need to quantify
the problem. For this purpose we conducted many experiments to determine at
what levels this recall problem occurs, and by how much.
Table 415 Problems faced while search in Hindi Low recall
Hindi Query  Google  Yahoo  Guruji
वलडम टर ड स नटय आत कव दी हरभ   0  0  0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम  0  0  0
Table 416 Improved Recall
Hindi Query  Google  Yahoo  Guruji
वलडम टर ड सटय आतॊकी हभर  8820  92  12
वलडम टर ड सटय आतॊकीअटक  331  10  1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम  708  50  1
बायतीम सॊटथान टवाटम शिा औय िोध  37100  7400  93
48 Factors affecting performance of Hindi search
481 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined morphological
structure and a well-defined grammar, but its grammatical and structural
standards are often not followed, for various reasons. One reason is the
language diversity of India. Including Hindi, about 28 languages are spoken in
India, and Hindi, being the official language of India, is influenced by the
regional languages, which changes its dialects not only in speaking but also in
writing. Every language attaches markers to a root word to construct new
words: English uses s, es and ing, and Hindi uses MAATRAAS such as ऐ म ा ओ.
For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) in Hindi are
morphological variants of the root word Yojnaa (योजना, planning in English). It
is desirable to reduce all the morphological variants of a word to a single
canonical form. This process is called word stemming, and the canonical form
is called the root word or base word.
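The stemming step described above can be illustrated with a minimal suffix-stripping sketch. The suffix list and the function name `light_stem` are assumptions made for this illustration and are not taken from the study; a practical Hindi stemmer would need a much larger, linguistically validated suffix table.

```python
# Illustrative suffix list (an assumption, not from the chapter).
# Escapes are used so the exact code points are unambiguous.
SUFFIXES = [
    "\u0913\u0902",  # ओं, as in योजनाओं (Yojnaaon)
    "\u090f\u0901",  # एँ, as in योजनाएँ (Yojnaayein)
    "\u090f\u0902",  # एं, alternative spelling of the same ending
    "\u094b\u0902",  # ों, oblique plural after a consonant
    "\u0947\u0902",  # ें, direct plural after a consonant
]


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len code points."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


root = "\u092f\u094b\u091c\u0928\u093e"  # योजना (Yojnaa)
for form in (root, root + "\u0913\u0902", root + "\u090f\u0901"):
    print(form, "->", light_stem(form))
```

All three forms map to the same root, so a query and a document that use different variants can be matched on the stemmed form.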
482 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages and their multiple dialects, the transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabet. Together these have resulted
in multiple spellings of the same word. For example, the Hindi word अंग्रेजी
(angrējī, meaning English) has several possible spelling variations.
Numerous words are phonetically equivalent but vary in writing. The word school
can be written in Hindi in different ways (सक र सक र सक र). When information is
searched for with the standard keyword स्कूल (school) and with a non-standard
phonetic equivalent (सक र), Google shows 69 million results for the former and
14 million for the latter. Hindi is influenced by other regional languages,
which produces phonetic variety in words. For example, the English word school
(स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by much of the
population in different states of India. For the Hindi word ISKOOL (इस्कूल),
more than two thousand results are found. Search engines should be capable of
retrieving results for words that are phonetically equivalent to the keywords
entered: a user may use any variant for searching, and the search engine
should support all phonetically equivalent words.
Also, no particular standard exists for writing keywords to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces different
results, i.e. a different set of documents is retrieved, with little overlap.
The native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss information relevant to his/her needs.
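One way a retrieval system could treat phonetically equivalent spellings alike is to reduce each word to a normalized key before matching. The sketch below is a crude illustration under stated assumptions: the function name `phonetic_key` and the three rules (drop the nukta, treat chandrabindu as anusvara, rewrite a nasal consonant plus virama before another consonant as anusvara) are choices made here, not rules taken from the study, and they will over-merge some genuinely different words.

```python
import re
import unicodedata

# A nasal consonant (ङ ञ ण न म) followed by virama and another consonant.
NASAL_CLUSTER = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Return a rough matching key for a Devanagari word (not for display)."""
    key = unicodedata.normalize("NFD", word)   # split precomposed nukta letters
    key = key.replace("\u093c", "")            # drop nukta (ज़ -> ज)
    key = key.replace("\u200c", "").replace("\u200d", "")  # drop ZWNJ / ZWJ
    key = key.replace("\u0901", "\u0902")      # chandrabindu -> anusvara
    key = NASAL_CLUSTER.sub("\u0902", key)     # न्द -> ंद, म्ब -> ंब, ...
    return key


pairs = [
    ("\u0906\u0928\u094d\u0926\u094b\u0932\u0928",   # आन्दोलन
     "\u0906\u0902\u0926\u094b\u0932\u0928"),        # आंदोलन
    ("\u092e\u0939\u0901\u0917\u093e\u0908",         # महँगाई
     "\u092e\u0939\u0902\u0917\u093e\u0908"),        # महंगाई
]
for first, second in pairs:
    print(first, second, phonetic_key(first) == phonetic_key(second))
```

Indexing documents and queries by such a key would let a single query reach documents written with any of the merged spellings, at the cost of some false matches.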
483 Words Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर, अलंकार.
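A simple way to act on synonyms is to expand a query into one variant per synonym and search with each. The sketch below assumes a small hand-built synonym table (`SYNONYMS`) and a hypothetical helper `expand_query`; it substitutes whole tokens only and does not adjust inflection, so some generated variants may be grammatically rough.

```python
# Hypothetical synonym table built from the example in the text.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: dict = SYNONYMS) -> list:
    """Return the original query followed by one variant per synonym substitution."""
    tokens = query.split()
    variants = [query]
    for word, alternatives in synonyms.items():
        if word in tokens:
            for alt in alternatives:
                variants.append(" ".join(alt if tok == word else tok for tok in tokens))
    return variants


for variant in expand_query("आभूषण की दुकान"):
    print(variant)
```

The results of the variants can then be merged, which is one way to recover documents that use a different but equivalent word.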
484 Ambiguous words
Ambiguous words reduce the relevance of the results, as the examples below
show. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e.
to sleep. In the context of the above query the word सोना means gold, but the
query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means common; however, in
Hindi it also means mango, so the query can also be interpreted as "mango is
people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in
a given context is an intricate task. One word can have more than one meaning,
and its meaning depends on the context of the sentence. For example, कर (tax)
has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कर
(hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कर
(to do) corresponds to करना.
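The dependence of meaning on context can be sketched with a very small cue-word heuristic: each sense of an ambiguous keyword is given a set of words that tend to accompany it, and the sense whose cues overlap most with the rest of the query is chosen. The cue sets, the table name `SENSE_CUES` and the function `guess_sense` are all hypothetical and hand-made for this illustration; they are not part of the study.

```python
# Hypothetical cue words for the two senses of सोना discussed in the text.
SENSE_CUES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "चांदी", "कीमत"},
        "to sleep": {"रात", "नींद", "बिस्तर", "जल्दी"},
    },
}


def guess_sense(keyword: str, query: str, cues: dict = SENSE_CUES):
    """Return the sense with the most cue words in the query, or None if undecided."""
    context = set(query.split()) - {keyword}
    scores = {sense: len(words & context) for sense, words in cues.get(keyword, {}).items()}
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    # No cue found, or a tie between senses: the query stays ambiguous.
    return winners[0] if best > 0 and len(winners) == 1 else None


print(guess_sense("सोना", "रात को जल्दी सोना"))      # cue words for sleeping are present
print(guess_sense("सोना", "नारी को सोना पसंद है"))   # no cue words: stays ambiguous
```

The second call shows why the query from the text is hard: nothing in it points to either sense, so a system needs outside knowledge or user feedback to decide.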
485 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Indians are
sometimes unable to recall the Hindi equivalent of an English word. For
instance, words such as road, bus, pen, television, radio, please, rail, email,
password, insurance, internet, director and department are used even by
uneducated Indians without being aware of which language the words come from.
Most Indians use these words in English rather than in their native language.
English influences Hindi not only in speaking but in writing too, and this
becomes especially evident in Hindi literature on the web. The influence of
English on Hindi has been observed to be one of the most important parameters
for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
Table 417 List of Hindi queries
49 Discussion Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table
417 lists randomly selected queries from this set, which throw light on the
effect of the root word on the performance of Hindi language search engines.
Table 418 shows examples of the effect of morphological factors on Hindi
queries.
S No Query in Hindi Meaning In English SNO Query in Hindi Meaning In English
1 ब यतवष मवन Indian rain forest 62 ऩकषमोकीपरज ततम Birdlsquos species
11 ब यत मवष मवनो Indian rain forests 7 क षषसभसम Agriculture problem
2 हव ईद घमटन क क यण Reason for air crash 71 क षषसभसम ओ Agriculture problems
21 हव ईद घमटन ओ क क यण Reason for air crashes 8 कीटन शकक इसत भ र Use of pesticide
3 ब यतभफ रीज न व रीब ष Language spoken in India 81 कीटन शकोक इसत भ र Use of pesticides
31 ब यतभफ रीज न व रीब ष ए Languages spoken in India 9 भ नमसकय ग Mental illness
32 ब यतभफ रीज न व रीब ष ओ Languages spoken in India 91 भ नमसकय चगमो Mental patients
4 षवर पतह न ऩयझ र Lake on the verge of
extinction 92 भ नमसकय ग Mental Patient
41 षवर पतह न ऩयझ र Lakes on the verge of
extinction 10 गर भ णषवक सम जन Policy for village
5 पर क ततकआऩद Natural calamity 101 गर भ णषवक सम जन ओ Policies for village
51 पर क ततकआऩद ए Natural calamities 102 गर भ णषवक सम जन ए Policies for village
6 ऩ कीपरज तत Bird species 11 परभ खक षषक दर Major agricultural
office
61 ऩकषमोकीपरज तत Birdlsquos species 111 परभ खक षषक नदरो Major agricultural
offices
Table 418 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 419 precision values of the three search engines
S No  Query  Precision 10 (Google  Bing  Guruji)
11   ब यतवष मवन  0.5  0.3  0.1
12   ब यत मवष मवनो  0.3  0.1  0.1
21   हव ईद घमटन क क यण  0.5  0.3  0.1
22   हव ईद घमटन ओ क क यण  0.3  0.3  0.1
31   ब यतभफ रीज न व रीब ष  0.9  0.5  0.5
32   ब यतभफ रीज न व रीब ष ए  0.5  0.4  0.3
33   ब यतभफ रीज न व रीब ष ओ  0.5  0.2  0.2
41   षवर पतह न ऩयझ र  0.5  0.3  0.2
42   षवर पतह न ऩयझ र  0.5  0.3  0
51   पर क ततकआऩद  1  0.4  0.3
52   पर क ततकआऩद ए  0.6  0.6  0.1
61   ऩ कीपरज तत  1  0.7  0.2
62   ऩकषमोकीपरज तत  0.9  0.6  0.1
63   ऩकषमोकीपरज ततम  0.9  0.6  0.1
71   क षषसभसम  0.7  0.3  0.2
72   क षषसभसम ओ  0.7  0.4  0.2
81   कीटन शकक इसत भ र  1  0.8  0.2
82   कीटन शकोक इसत भ र  0.9  0.6  0.2
91   भ नमसकय ग  0.7  0.6  0.4
92   भ नमसकय चगमो  0.9  0.6  0.3
93   भ नमसकय ग  0.9  0.5  0.4
101  गर भ णषवक सम जन  0.9  0.7  0.3
102  गर भ णषवक सम जन ओ  1  0.6  0
103  गर भ णषवक सम जन ए  0.8  0.6  0.2
111  परभ खक षषक दर  0.5  0.4  0
112  परभ खक षषक नदरो  0.4  0.2  0.2
Figure 41 precision values of the three search engines
(chart: precision from 0 to 1 on the vertical axis, query numbers 11 to 111 on
the horizontal axis, one series each for Google, Bing and Guruji)
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching for
documents with the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root
form. Bing and Guruji do not appear to index in that form, which is why they
retrieve fewer documents than Google. The overall comparison of results from
the three search engines in the tables above shows that, in general, the
quantity of results retrieved increases when keywords are used in their root
form. For search engines, however, the quality of results is more important
than the quantity. Figure 41 and Table 419 compare the precision values of the
three search engines. The precision value is calculated from the top 10 results
of each search engine. On closely observing the results, we can say that the
precision of Google is high in almost all queries. Since, as mentioned above,
Google appears to index keywords in their root form, the relevance of its
results is also higher than that of the other two search engines. This
indicates that not only the quantity but also the quality of results is
affected by morphological variations in the keywords.
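The precision measure over the top 10 results can be written down as a short sketch. The function name `precision_at_k` and the convention of dividing by k even when fewer than k results are returned are assumptions made here; the text does not state how short result lists were handled.

```python
def precision_at_k(relevance_flags, k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    relevance_flags is an ordered list of booleans, one per returned result,
    True where the result was judged relevant to the query.
    """
    top = list(relevance_flags)[:k]
    if not top:
        return 0.0
    # Divide by k (not by len(top)) so missing results count as non-relevant.
    return sum(1 for flag in top if flag) / k


# Five relevant documents among the first ten returned.
judgments = [True, True, False, True, False, False, True, False, True, False]
print(precision_at_k(judgments))
```

Under this convention an engine that returns a single relevant document scores 0.1 rather than 1.0, which keeps engines with very few results from looking artificially precise.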
410 Discussion Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any variant
for searching, and search engines should support all phonetically equivalent
words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision obtained for each.
Table 420 results of the search engine on Phonetic nature of Hindi language
For each Hindi query (standard keywords in bold in the original): the phonetic
variations of the keywords, then one row per keyword combination searched on
Google with No of Results and Precision 10
Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर  97  0.9
ससजिमो जहयीर  632  0.9
सिबजमो िहयीर  194  0.9
Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई  35300  1.0
आसभ न भहॉगाई  1040  1.0
आसभ भह ग ई  14  0.6
आसभाॊ भह ग ई  563  0.7
Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी  211000  0.8
भरशटाचायआज दी  214  0.6
बयषटाचाय आज दी  447  0.7
भरषटट च यआिादी  1090000  0.9
भरशट च य आिादी  1040  0.7
बयषटट च यआिादी  1190  0.8
Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन  84700  0.3
अनन हज य आॊदोरन  85100  0.8
अनन हज य क आॉदोरन  78  0.6
अनना हिाय क आनद रन  399  0.5
अनन हिाय क आनद रन  3260000  1.0
Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी  9650  0.9
फयोिगायी  80600  1.0
फयोिगायी  170  0.7
फयोिगायी  30  0.5
Figure 42 Precision Charts for Phonetic nature of Hindi language
(five pie charts, Query No 1 to Query No 5, each showing the share of precision
across the phonetic variants of that query)
The above table and figure clearly show that search engines return only a
handful of documents for some phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing keywords to fetch
Hindi web data. Each phonetically equivalent keyword in a query produces
different results, i.e. a different set of documents is retrieved, with little
overlap. The precision chart shows that the degree of relevance for queries
containing phonetically equivalent keywords is nearly equal. The native Hindi
user may not be aware of the phonetic issues in Hindi IR and may miss
information relevant to his/her needs.
411 Discussion Words Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर, अलंकार.
Table 421 and Figure 43 below compare the precision values of the three search
engines.
Table 421 Effect of word synonyms on Hindi IR
S NO  Query  Documents returned and precision per 10 (Google  Per 10  Bing  Per 10  Guruji  Per 10)
1  Query: स न क आब षण  Standard Hindi words and synonyms: आब षणगहन
11  स न क आब षण  217000  0.8  3250  0.7  381  0.5
12  स न क गहन  188000  0.8  2590  0.6  389  0.5
13  स न क ज वय  78900  0.8  1670  0.7  311  0.6
14  स न क अर क य  9490  0.5  633  0.4  70  0
15  स न क आबयण  493  0.5  38  0.3  1  0
2  Query: क र फ दर  Standard Hindi words and synonyms: फ दर फ दर
21  क र फ दर  233000  0.7  7510  0.7  733  0.3
22  क र भ घ  40700  0.9  1500  0.8  99  0.6
23  क र जरधय  1570  0.6  54  0.6  2  0.2
3  Query: सतर सशिकतकयण  Standard Hindi words and synonyms: सतर न यी
31  सतर सशिकतकयण  9950  0.9  1570  0.7  760  0.6
32  न यीसशिकतकयण  29300  0.9  1910  0.9  736  0.4
33  भहहर सशिकतकयण  96300  0.9  5160  0.8  1091  0.3
34  औयतसशिकतकयण  7670  0.8  680  0.7  510  0.2
4  Query: मसक दयक अ हक य  Standard Hindi words and synonyms: अ हक य
41  मसक दयक अ हक य  1990  0.4  18  0.3  60  0.6
42  मसक दयक अमबभ न  2400  0.6  304  0.2  16  0.1
43  मसक दयक घभ ड  495  0.5  54  0.6  9  0.1
5  Query: व रग ओ  Standard Hindi words and synonyms: व
51  व रग ओ  6960  1  698  0.8  29  0.5
52  ऩ डरग ओ  13400  1  1080  0.9  143  0.9
53  दयखतरग ओ  481  0.5  19  0.5  0  0
6  Query: आ खद न  Standard Hindi words and synonyms: आ ख
61  आ खद न  34000  0.7  3690  0.5  312  0.3
62  न तरद न  77500  1  3240  1  159  0.9
63  च द न  2450  0.8  427  0.1  36  0.1
Figure 43 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a Hindi query. Not only the
quantity of documents returned but also their quality is affected by using
synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and Guruji returns the fewest; the
reason may be the availability of fewer documents or poor indexing. However, we
are more interested in the quality of results than the quantity. As far as
quality is concerned, it can clearly be seen that Google and Bing provide
better results than Guruji, and in the average case Google still stands first,
i.e. its precision values are higher than those of Bing and Guruji. Thus it
becomes clear that by changing a keyword into its synonym, equivalent results
can be obtained. Therefore synonyms of keywords play an important role in the
process of Hindi information retrieval.
412 Discussion Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected ambiguous queries. In Table 422 the second column contains the five
queries in Hindi, the third column holds the ambiguous keyword in one context
and the fifth column holds the same ambiguous keyword in the other context. The
fourth and sixth columns hold the meaning of the query in English with respect
to the ambiguous keyword in each context.
Table 422 List of randomly selected ambiguous queries
SNo  Query  For keyword as  In English  For keyword as  In English
1  नारी को सोना पसंद है  सोना (Gold)  Women like gold  सोना (To sleep)  Women like to sleep
2  आम लोगों की पसंद  आम (Common)  Common man's choice  आम (Mango)  Mango is people's choice
3  बाल विकास और पोषण  बाल (Children)  Child Development and Nutrition  बाल (Hair)  Hair Development and Nutrition
4  सपेरों का फन  फन (Art)  Art of snake charmers  फन (Snake head)  Snake charmer's snake head
5  युद्ध में कुल विनाश  कुल (Aggregate)  Aggregate destruction in wars  कुल (Family)  Destruction of families in war
The ambiguous queries listed in Table 422 were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 423 Ambiguity test for Google
Query  Ambiguous keyword  Documents returned  Context  Results found  Context  Results found  Other Context
1  सोना  50800  Gold  5  To sleep  2  3
2  आम  488000  Common  3  Mango  3  4
3  बाल  2900000  Children  7  Hair  3  0
4  फन  184  Art  0  Snake head  10  0
5  कुल  17800  Aggregate  2  Family  3  5
Table 424 Ambiguity test for Bing
Query  Ambiguous keyword  Documents returned  Context  Results found  Context  Results found  Other Context
1  सोना  2680  Gold  2  To sleep  2  6
2  आम  17800  Common  3  Mango  3  4
3  बाल  4030  Children  6  Hair  2  2
4  फन  25  Art  0  Snake head  9  1
5  कुल  1900  Aggregate  0  Family  2  8
Table 425 Ambiguity test for Guruji
From the above results obtained in tables it is observed that all three search
engines return documents without differentiating between the contexts of
keyword in the query In the above table the last column labeled as ―other
Context holds the number of results which are not relevant to the query supplied
or those documents which contains the keywords in other non required context
From the results it is clear that all search engines return documents in different
contexts Therefore it can be said that search engines underperform when supplied
with ambiguous queries Numbers in column labeled as ―other Context signifies
the deviation from relevance For example for query म दध भ क र षवन श
(aggregate destruction in wars) the column ―Other Context for Google contains 5
documents for Bing contains 8 documents and for Guruji contains all 10
documents
In another query, सपेरों का फन (art of snake charmers), the keyword has a second context (the snake charmer's snake head). The retrieved documents are expected to be in the context "art", but the results above show that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
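| Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |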
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but in writing too. This becomes more evident in Hindi literature on the web. The influence of English on the Hindi language has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains have been tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
| English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google result counts (in the same order) |
|---|---|---|
| Woman | वोभन व भन व भ न व भ न | 42200 101000 3520 34800 |
| Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 246000 1390 3710 |
| Cancer | क सय क सय क नसय क नसय | 1880000 35700 6490 1820 |
| Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 263200 2440 100 |
| Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 368000 1110 58 |
| Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 1040000 261000 0 537000 |
| University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 1070000 1420 3270 |
| Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 735000 97000 8300 |
| Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 125000 2510 5160 |
| Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 38000 639 1120 |
| Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 1440 51300 9820 |
| Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 8730 182 8 |
Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that, for a single-keyword query, the search engine returns documents for every phonetic variant of the keyword, and in large numbers. It can also be seen that people have their own ways of writing such words in Hindi and that no standard is followed for storing Hindi data on the web: documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be यूनिवर्सिटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents are retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000 (final row of Table 4.26)

In the next example, multiword Hindi queries are selected to test the effect of the English influence on Hindi IR, in terms of both the precision and the quantity of the documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query. The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by the English equivalents, again without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, with the meaning of the query unchanged. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
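The two-level transformation described above can be sketched in a few lines. This is only an illustration, not the software described in the next chapter: the function name `transform_query`, the `EQUIVALENTS` dictionary and the exact Hindi spellings are assumptions made for the example.

```python
from itertools import combinations


def transform_query(query, equivalents):
    """Return (level1, level2) variants of a whitespace-tokenised Hindi query.

    level1: exactly one keyword replaced by its English equivalent in Hindi script.
    level2: two or more keywords replaced.
    `equivalents` is a hypothetical lookup table, not a real resource.
    """
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in equivalents]

    def substitute(positions):
        # Swap only the tokens at the chosen positions; keep the rest unchanged.
        return " ".join(
            equivalents[tok] if i in positions else tok
            for i, tok in enumerate(tokens)
        )

    level1 = [substitute({i}) for i in replaceable]
    level2 = [
        substitute(set(combo))
        for size in range(2, len(replaceable) + 1)
        for combo in combinations(replaceable, size)
    ]
    return level1, level2


# Assumed spellings of the English-influenced keywords.
EQUIVALENTS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}

level1, level2 = transform_query("विदेशी निवेश भारत में", EQUIVALENTS)
print(level1)
print(level2)
```

Note that the sketch produces every single-keyword replacement at level 1, whereas the example in the text lists only one of them.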
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation

| In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2) |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (chart of Google result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants)
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity. Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
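Table 4.28 Analysis of precision values: tabular representation

| Query | Variant | Precision@10 (Google) | Precision@10 (Bing) | Precision@10 (AltaVista) |
|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | Hindi query | 0.9 | 0.9 | 0.9 |
| हेल्थ और रक्तदान | Level 1 | 0.9 | 0.8 | 0.9 |
| हेल्थ और ब्लड_डोनेशन | Level 2 | 0.9 | 0.8 | 0.8 |
| रक्तचाप का इलाज | Hindi query | 0.8 | 0.8 | 0.8 |
| ब्लडप्रेशर का इलाज | Level 1 | 0.9 | 0.7 | 0.7 |
| ब्लडप्रेशर का ट्रीटमेंट | Level 2 | 0.7 | 0.5 | 0.5 |
| हृदय चिकित्सक | Hindi query | 0.9 | 0.9 | 0.9 |
| हार्ट चिकित्सक | Level 1 | 0.8 | 0.7 | 0.7 |
| कार्डियोलॉजिस्ट | Level 2 | 1.0 | 1.0 | 1.0 |
| सरकार द्वारा रोजगार योजना | Hindi query | 1.0 | 0.8 | 0.6 |
| सरकार द्वारा रोजगार स्कीम | Level 1 | 0.9 | 0.8 | 0.7 |
| सरकार द्वारा एम्प्लॉयमेंट स्कीम | Level 2 | 0.8 | 0.8 | 0.8 |
| विदेशी निवेश भारत में | Hindi query | 0.9 | 0.9 | 0.9 |
| फॉरेन निवेश भारत में | Level 1 | 0.8 | 0.8 | 0.9 |
| फॉरेन इन्वेस्टमेंट भारत में | Level 2 | 0.8 | 0.4 | 0.4 |
| भ्रष्टाचार मुक्त भारत | Hindi query | 1.0 | 1.0 | 1.0 |
| करप्शन मुक्त भारत | Level 1 | 1.0 | 1.0 | 1.0 |
| करप्शन फ्री इंडिया | Level 2 | 1.0 | 1.0 | 1.0 |

Figure 4.5 Analysis of inter-precision values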
From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For most Hindi queries and their transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are again relevant (Google columns of the table above). Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, as the Hindi user may not be aware of the presence of such information on the web and may be unable to formulate the variant query that reflects the English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
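The precision values discussed above are precision at 10, the fraction of the first 10 results judged relevant. A minimal sketch, assuming the relevance judgements are available as a list of 0/1 flags in rank order (the function name and the sample judgements are illustrative only, not the thesis data):

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that were judged relevant.

    `relevance` is a list of 0/1 flags in rank order; results missing
    from a short list count as non-relevant.
    """
    return sum(relevance[:k]) / k


# Hypothetical judgements: 9 of the first 10 documents are relevant.
judgements = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(precision_at_k(judgements))
```

Dividing by k rather than by the number of results returned means that an engine returning fewer than 10 documents is penalised, which matches counting "relevant documents out of the first 10".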
[Chart for Figure 4.5: inter-precision chart for HQ1 to HQ6 (HQ = Hindi query) with their L1 and L2 variants (L = level), one series each for Google, Bing and AltaVista.]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi orthography, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
this phenomenon with absolute accuracy, yet one generally useful heuristic is that the inherent vowel is deleted after a consonant which is between two vocalic consonants. Thus the word देवनागरी itself is pronounced with the first schwa deleted, like "Devnagari" and not "Devanagari", even though it is still transliterated as "Devanagari".
Occasionally the schwa will not be totally deleted but will be very slightly pronounced.
4.2.8 Schwa Pronunciation in Context
The Hindi inherent vowel अ may be pronounced as [ɛ], a vowel similar to the English "e" in the word "bed", but only in certain contexts, namely when अ vowels appear on both sides of the consonant ह, as in the verb पहनना (to wear). Both schwa vowels are often pronounced as [ɛ] in such circumstances. Thus, although the phrase पहन लो is literally "pahan lo", it is often pronounced "pehen lo". Occasionally, however, this phenomenon occurs when only one schwa vowel is beside the consonant ह, as in the word बहिन (sister). In this case both vowels adjacent to ह are converted to [ɛ], and thus, although the word is literally "bahin", it is pronounced "behen".
4.2.9 Nasalization of Vowels
All vowels in Hindi can be nasalized except for ऋ. Nasalization is indicated either by the symbol ं or by the symbol ँ. The former symbol is called bindu ("dot") and the latter chandrabindu ("moon and dot"). The bindu is used when part or all of the vowel symbol extends above the horizontal line. The chandrabindu is used when no part of the vowel symbol extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.
The following examples summarize the use of the bindu and chandrabindu:
अँ आँ इँ ईं उँ ऊँ एँ ऐं ओं औं
कँ काँ किं कीं कुँ कूँ कें कैं कों कौं
A special diacritic is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in "college": कॉलेज.
4.2.10 Consonants: Velar Consonants

| Letter | Description |
|---|---|
| क | unaspirated k |
| ख | aspirated k |
| ग | unaspirated g |
| घ | aspirated g |
| ङ | n as in sing |

Table 4.3 Consonants: Velar Consonants
Note that the velar nasal consonant does not appear as the first letter of any word.
4.2.11 Palatal Consonants

| Letter | Description |
|---|---|
| च | unaspirated ch, as in cheese |
| छ | aspirated ch |
| ज | unaspirated j |
| झ | aspirated j |
| ञ | n as in punch |

Table 4.4 Palatal Consonants

4.2.12 Retroflex Consonants

| Letter | Description |
|---|---|
| ट | like t, but retroflex and unaspirated |
| ठ | like t, but retroflex and aspirated |
| ड | like d, but retroflex and unaspirated |
| ढ | like d, but retroflex and aspirated |
| ण | like n, but retroflex |

Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as the "t" in the word "better" or the "d" in "bedding" in American English. The Hindi flaps are retroflex, however.

4.2.13 Dental Consonants

| Letter | Description |
|---|---|
| त | like t, but dental and unaspirated |
| थ | like t, but dental and aspirated |
| द | like d, but dental and unaspirated |
| ध | like d, but dental and aspirated |
| न | like n in name, but dental |

Table 4.6 Dental Consonants
4.2.14 Labial Consonants

| Letter | Description |
|---|---|
| प | like p, but unaspirated |
| फ | like p, but aspirated |
| ब | like b, but unaspirated |
| भ | like b, but aspirated |
| म | m |

Table 4.7 Labial Consonants
4.2.15 Semivowels

| Letter | Description |
|---|---|
| य | y as in young |
| र | like r, but often rolled |
| ल | l as in lip |
| व | either w or v |

Table 4.8 Semivowels
The Hindi "r" sound is typically a flap. However, some speakers may occasionally trill the "r" sound, or may even occasionally pronounce it closer to an unflapped approximant, like the English "r" in "red".
4.2.16 Sibilants

| Letter | Description |
|---|---|
| श | sh as in shave |
| ष | like sh, but retroflex |
| स | s as in save |

Table 4.9 Sibilants

4.2.17 Glottal

| Letter | Description |
|---|---|
| ह | like h, but voiced |

Table 4.10 Glottal
4.2.18 Allophony of w and v in Hindi
A phoneme is an equivalence class of atomic, discrete sounds: members of different phonemes can produce a difference in meaning when spoken, whereas members of the same phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the "p" in the word "spit" and the "p" in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").
In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.
Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.
Semi-Allophones j and z in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in words such as ज़्यादा: some speakers pronounce this as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation sounds much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds

| Letter | Approximation |
|---|---|
| क़ | like k, but pronounced in the back of the mouth |
| ख़ | velar fricative, like "Bach" in German |
| ग़ | velar sound similar to ख़, but voiced |
| ज़ | just as English z, as in zoo |
| झ़ | similar to the s in English vision |
| फ़ | just as English f |

Table 4.11 Foreign Sounds
Only two of the borrowed sounds, ज़ and फ़, are typically pronounced distinctly from the non-nuqta forms, though.
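For retrieval, the nuqta matters because the same letter can be stored in two ways in Unicode: as a single precomposed code point (for example U+095B for the z letter) or as the base consonant followed by the combining nuqta U+093C. The sketch below, using Python's standard `unicodedata` module, shows one way an indexer might fold both spellings together; the helper names and the decision to optionally drop the nuqta are assumptions for illustration, not part of the tool described in this thesis.

```python
import unicodedata

NUQTA = "\u093C"                 # DEVANAGARI SIGN NUKTA
ZA_PRECOMPOSED = "\u095B"        # DEVANAGARI LETTER ZA as one code point
ZA_DECOMPOSED = "\u091C" + NUQTA # JA followed by the combining nuqta


def normalize_keyword(text):
    """Bring both spellings of a nuqta letter to one canonical form (NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_nuqta(text):
    """Optionally ignore the nuqta, so that z and j spellings match each other."""
    return unicodedata.normalize("NFD", text).replace(NUQTA, "")


print(ZA_PRECOMPOSED == ZA_DECOMPOSED)  # raw strings differ
print(normalize_keyword(ZA_PRECOMPOSED) == normalize_keyword(ZA_DECOMPOSED))
print(strip_nuqta(ZA_PRECOMPOSED) == "\u091C")
```

Stripping the nuqta trades precision for recall: it matches the "zyada"/"jyada" spellings discussed above, but it also conflates words that differ only in the dotted letter.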
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
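In Unicode text, a conjunct is not stored as a separate ligature character: it is the sequence consonant, virama (halant, U+094D), consonant, and the font decides whether to draw a ligature or a visible halant. A small sketch, in which the helper names `conjunct` and `code_points` are illustrative only:

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def conjunct(*consonants):
    """Join consonants with the virama, suppressing each inherent vowel but the last."""
    return VIRAMA.join(consonants)


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


nda = conjunct("\u0928", "\u0926")  # NA + virama + DA
print(nda, code_points(nda))

hindi = "\u0939\u093F" + nda + "\u0940"  # the word "Hindi" spelled with the conjunct
print(hindi, code_points(hindi))
```

A keyword matcher therefore sees the same code point sequence whether the page renders the conjunct as a ligature or with an explicit halant.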
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted and then the modified consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ़ + त = फ़्त (मुफ़्त)
क + य = क्य (क्यों)
Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or using an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant. In other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
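The homorganic rule above is mechanical enough to sketch in code: look at the consonant that follows the anusvar, find its row (velar, palatal, retroflex, dental or labial) and substitute that row's nasal plus a halant. The sketch below relies only on the layout of the Devanagari Unicode block, in which each row of four stops is immediately followed by its nasal; the function name `expand_anusvara` is an assumption for illustration, and a dot that nasalizes a vowel (a bindu) or precedes a semivowel or sibilant is deliberately left untouched.

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA (halant)

# (first stop of the row, nasal of the row) as code points:
# velar, palatal, retroflex, dental, labial.
ROWS = [(0x0915, 0x0919), (0x091A, 0x091E), (0x091F, 0x0923),
        (0x0924, 0x0928), (0x092A, 0x092E)]

# Map each of the four stops in a row to that row's nasal consonant.
HOMORGANIC = {chr(cp): chr(nasal) for first, nasal in ROWS for cp in range(first, nasal)}


def expand_anusvara(word):
    """Rewrite an anusvar as the explicit nasal conjunct it stands for."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and following in HOMORGANIC:
            out.append(HOMORGANIC[following] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)


print(expand_anusvara("\u0939\u093F\u0902\u0926\u0940"))  # anusvar spelling of "Hindi"
print(expand_anusvara("\u0905\u0902\u0921\u093E"))        # anusvar spelling of "anda" (egg)
```

Applying the same rewrite to both queries and indexed text would make the two spellings of a word such as "Hindi" retrieve the same documents.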
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign, the viraam (।), which is a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns a unique number (a code point in the range U+0000 to U+10FFFF) to each character, and can potentially cover every character of every language known to man.
There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in code units of 8, 16 or 32 bits). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms as defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
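The difference between the three encoding forms can be seen by encoding the same characters with Python's built-in codecs. This is only an illustration; the helper name `encoded_sizes` and the choice of sample characters are assumptions made for the example.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character occupies in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        # The "-be" variants fix the byte order, so no byte order mark is added.
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


# An ASCII letter, the Devanagari letter KA, and a character outside the
# 16-bit range (which needs a pair of 16-bit code units in UTF-16).
for ch in ("A", "\u0915", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For Hindi text, every Devanagari letter takes three bytes in UTF-8 but two in UTF-16, which is the storage trade-off the descriptions above refer to.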
UTF stands for UCS Transformation Format.
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and take from 2 to 4 bytes (the original UTF-8 design allowed sequences of up to 6 bytes, but the current definition limits them to 4); the most commonly used characters, however, need at most three bytes. Non-ASCII characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under the ASCII encoding (since all ASCII characters stored in bytes have their most significant bit cleared). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
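The byte structure described above can be inspected directly. In the sketch below (the helper names `utf8_bits` and `sequence_length` are illustrative), the lead byte of the Devanagari letter KA starts with the bits 1110, announcing a three-byte sequence, and both following bytes start with 10, marking them as continuation bytes.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of a character as 8-bit binary strings."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]


def sequence_length(lead_byte):
    """Number of bytes in the sequence announced by a UTF-8 lead byte."""
    if lead_byte < 0x80:          # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte >> 5 == 0b110:   # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:  # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110: # 11110xxx
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


ka = "\u0915"  # DEVANAGARI LETTER KA
data = ka.encode("utf-8")
print(utf8_bits(ka))
print(sequence_length(data[0]))
# Every byte after the lead byte has the form 10xxxxxx.
print(all(byte & 0xC0 == 0x80 for byte in data[1:]))
```

Because no byte of a multibyte sequence is below 0x80, ASCII-only tools such as a space-based tokenizer never split a Devanagari character in the middle.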
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound "a"). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred: in Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from and is an update of earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic scripts
in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada, and Malayalam. This parallel code layout emphasizes the structural
similarities of the Brahmi scripts and follows the stated intention of the Indian
coding standards to enable one-to-one mappings between analogous coding
positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer,
Myanmar, and other scripts depart to a greater extent from the Devanagari
structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these
scripts to the Devanagari order.
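The parallel code layout described above can be illustrated with a small sketch. It assumes the block start values of the Unicode code charts and naively shifts every character by the distance between two blocks; the function name `shift_script` is introduced here only for illustration. Not every position is assigned in every script, so this is a demonstration of the layout rather than a usable transliterator.

```python
# Start of each parallel Indic block in Unicode; each block spans 128 code points.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text, source, target):
    """Move each character of the source block to the same offset in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            out.append(chr(cp - src + dst))  # same relative position, other block
        else:
            out.append(ch)  # characters outside the source block are left alone
    return "".join(out)


# Devanagari KA (U+0915) lands on Bengali KA (U+0995).
print(shift_script("\u0915", "devanagari", "bengali"))
```

Because the offsets line up, the Devanagari word कमल maps letter for letter onto Gujarati કમલ. Positions that a target script leaves unassigned would come out as undefined code points, which is one reason real converters use explicit tables.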
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels, and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
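The three character types and the logical storage order can be inspected with Python's standard `unicodedata` module. The sketch below is a minimal illustration, assuming only the core ranges of the Devanagari block; the helper names `char_type` and `describe` are hypothetical, and extended letters simply fall into "other".

```python
import unicodedata


def char_type(ch):
    """Classify a character using the core ranges of the Devanagari block."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if cp == 0x094D:
        return "virama"
    return "other"


def describe(word):
    """Return (Unicode name, character type) pairs in stored order."""
    return [(unicodedata.name(ch), char_type(ch)) for ch in word]


# The word "Hindi" (हिन्दी), written with escapes to make the stored order explicit.
for name, kind in describe("\u0939\u093f\u0928\u094d\u0926\u0940"):
    print(f"{name:32} {kind}")
```

In the listing, DEVANAGARI VOWEL SIGN I comes after DEVANAGARI LETTER HA even though the sign is drawn to the left of the consonant on screen. This is what storage in logical (phonetic) order means in practice.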
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu, and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging, and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode Consortium (a
nonprofit in California whose members include Google, IBM, Oracle, Microsoft,
Sun Microsystems, Yahoo, and the Government of India) proved to be a boon for
Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its
operating system in 2001. "Since then, the Hindi Unicode support has been a part
of all subsequent upgrades of Microsoft's operating systems. Also, providing
Input Method Editor facilities gives users the option to use different types of
keyboards," says Meghashyam Karanam, product manager, vision and
localization, at Microsoft India. The earlier system could incorporate only 127
characters, which is not enough for the Hindi Devanagari script. The Unicode
system, in its original 16-bit form, can incorporate up to about 65,000 characters.
As most computers in India use Microsoft's operating system, this ensured that
the Hindi font was available to most of them as they upgraded the operating
software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint, and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani, and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the California-based
company has launched various products in the past two years. With the Google
Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages
available on the Internet, including those that are not in a Unicode font. "Google
offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam, and Telugu,
to name a few), Gmail in five languages, and Google transliteration in Bengali,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu,
and Urdu. Urdu is the most recent language that Google has added to its
offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use
the search function, "users can type Hindi words in Roman script and a drop-down
menu suggests several Hindi phrases. By selecting the appropriate query, users
can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in English, hit the space bar, and get the same word in a different
language. Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate, and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google, and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering
there are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and Urdu
on the Internet? "The major challenge is Internet penetration and PC prices. The
moment we have better Internet penetration, especially in smaller towns, and PC
prices go down, Hindi and Indian languages can flourish on the Net," says Karnik
of webdunia.com. India had more than 49 million Internet users in June 2008, out
of which about 9 million used the Internet regularly, according to a study by
Juxtconsult India, a research company. "There is a big opportunity in Indian
languages. Studies showed that only 28 percent of Indian Web surfers preferred
English on the Web, but as good-quality content in Indian languages was not
easily available, they did not visit many local language sites," says Mrutyunjay
Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we look
at how the Internet developed in the US, it may provide a useful analogy. First
came content, which was mostly produced by people who had a passion for
putting up content they cared about. Traffic and monetization was not the motive.
Second came growing readership, as people started discovering content. This set
off a virtuous cycle in which content eventually became a viable, monetizable
business. Third were the application developers, who could now focus on moving
the online experience beyond passive consumption of information to
interactivity, community building, service delivery, and a host of other
innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a
long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in 1978. It was
collected for a comparative study of American, British, and Indian English
(Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for
the development of Indian language corpora. It has coordinated with various
Indian agencies and universities to develop corpora of more than 45 million
words in the Scheduled Languages of India, which is also a part of the TDIL
program. The Enabling Minority Language Engineering
(EMILLE) program provides corpora, architecture, and tools for Asian
languages. It has a monolingual corpus of approximately 96,157,000 words and a
parallel corpus of 200,000 words of English text with accompanying translations
in Bengali, Hindi, Punjabi, and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of "One Million Pages" of
knowledge-based text. Mahatma Gandhi International University has started the
project "Hindi Samghraha" to build a repository of Hindi words and a dialect
mapping of Hindi. The Department of Information Technology of the Government
of India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Efforts in this field have been under way since 1980,
involving prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai, and CDAC Pune. During the late 1990s, many new
projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre
Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a
consortium-mode project for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
for translating from English to Indian languages, under the leadership of Prof.
R. M. K. Sinha. It is
a rule-based system with approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators, and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available at http://anglahindi.iitk.ac.in. The Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is described as a fully automatic,
general-purpose, high-quality machine translation (FGH-MT) system. Its software
can translate text from one Indian language into another, based on Panini's
Ashtadhyayi (grammar rules). It is developed at the International Institute of
Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit
Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was conceived by the Indian
government in 1996 for translating government orders, notifications,
circulars, and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network, and web-based forms. It uses the Lexicalized Tree
Adjoining Grammar (LTAG) formalism to represent both English and Hindi
grammar. Initially it was domain specific, covering personnel administration
(specifically gazette notifications, office orders, office memorandums, and
circulars); gradually the domains were expanded. At present it also covers
domains such as banking, transportation, and agriculture. Earlier, Mantra
technology handled only English-to-Hindi translation, but it is now also
available for English to other Indian languages such as Gujarati, Bengali, and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB], and Synopsis. The Rajya Sabha
Secretariat (the Rajya Sabha is the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership
of Prof. Pushpak Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). The project
is funded by the Karnataka government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the
user is not satisfied with a system-generated translation, the user can supply
new words, phrases, and sentences to the system and obtain a newly interpreted
translation. The Shiva MT system is available at
(http://ebmt.serc.iisc.ernet.in/mt/login.html).
Shakti combines a rule-based approach with statistical methods. It is used for
translating English into Indian languages (Hindi, Marathi, and Telugu). Users can
access the Shakti MT system at (http://shakti.iiit.net) [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil-to-Hindi translation system. The system is
based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has a small set of transfer rules. Users can access
the system at http://www.au-kbc.org/research_areas/nlp/demo/mat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn Treebank
annotation system for part-of-speech tagging. It translates English sentences
into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi-to-Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine, and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT: Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query, it searches deep into Indian language content and tries to return
appropriate results. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education, and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam,
and Punjabi, and provides automated translation from English to Indian
languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil, and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool
for translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization, and website localization. It is also involved in research and
development on corpus creation and collection and on content syndication, and
it offers language consultancy. It has developed various applications in Indian
languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games,
Dosti, Mail, Greetings, Classifieds, Quiz, Quest, and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the
development of Indian language software. It provides Indian language enablement
technology to many state governments and the central government for
e-governance programs. It has developed software for multilingual content
creation for newspaper publishing, as well as high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gujarati package for the Gujarati language, which is useful in the DTP sector,
corporate offices, and the e-governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has also been aware of the need for Indian language
technology. Since 1970, the Department of Electronics and the Department of
Official Language have been involved in developing Indian language technology.
Consequently, ISCII (Indian Script Code for Information Interchange) was
developed for Indian languages on the pattern of ASCII (American Standard Code
for Information Interchange). In addition, Indian Languages Transliteration
(ITRANS), developed by Avinash Chopde, represents Indian language alphabets
in terms of ASCII (Madhavi et al. 2005). The Department of Information
Technology, under the Ministry of Communication and Information Technology,
is also making efforts toward the
proliferation of language technology in India. Other Indian government
ministries, departments, and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council for Technical Education,
and UGC (University Grants Commission), are also involved directly and
indirectly in research and development of language technology. All these
agencies help develop important areas of research and provide research funds to
development agencies. One result is IndoWordNet, developed for Indian
languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for
Indian Languages) program. TDIL sets the major and minor goals for Indian
language technology and provides standards for it. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate, and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, probably more than 10 to 15 Indian local search engines
were launched, some small and some big, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to give a brief overview of the current scenario.
The following fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo, and
lemmefind.in, along with Ask Laila, which launched a couple of days back. There
are also localized versions of the big giants Google, Yahoo, and MSN.
Each of these Indian search engines has come forward with some USP
(Unique Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
improving its services.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation into typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance, popular
web search engines such as Google, Yahoo, and Guruji often return 0 search
results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem when dealing with Indian languages because of multiple spellings,
morphological variants of keywords, and English keywords written in Hindi.
Table 4.15 illustrates a few such cases. For example, the Hindi query for "world
trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns 0
documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that these keywords are matched by the second search. But just saying we
have a recall problem may not be sufficient. The next obvious question is how
much of a problem it is. In other words, we need to quantify the problem. For
this purpose, we conducted several experiments to determine at what levels this
recall problem occurs and by how much.
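Quantifying the problem starts from the standard definition of recall: the share of relevant documents that a search actually returns. The sketch below assumes a known set of relevant documents for a query; the document ids are hypothetical and serve only to show the arithmetic.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical document ids, used only for illustration.
relevant_docs = {"d1", "d2", "d3", "d4"}
print(recall([], relevant_docs))                  # 0.0
print(recall(["d2", "d4", "d9"], relevant_docs))  # 0.5
```

A query that returns nothing has recall 0 even though relevant documents exist, which is the situation described above for Indian language queries.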
Hindi Query                                        Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                  0        0       0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च       0        0       0
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query                                        Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                      8820     92      12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                      331      10      1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च      708      50      1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध             37100    7400    93
Table 4.16 Improved Recall
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are often not followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, there are about 28
languages spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which changes its dialects not only in
speech but in writing also. Every language attaches markers to a root word to
construct new words: English uses suffixes such as s, es, and ing, and Hindi
uses suffixes and maatraas (vowel signs) such as ओं and एं. For example,
योजनाओं (yojnaaon) and योजनाएं (yojnaayein) are morphological variants of the
root word योजना (yojnaa, "plan" in English). It is desirable to reduce all the
morphological variants of a word to a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
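Word stemming of this kind can be sketched as simple suffix stripping. The example below is a minimal illustration, not the stemmer used in this study: the suffix list is a hypothetical and deliberately tiny one, and the names `SUFFIXES` and `light_stem` are introduced here only for the sketch.

```python
import unicodedata

# Hypothetical, deliberately tiny suffix list (plural and oblique markers).
SUFFIXES = (
    "\u0913\u0902",  # ओं
    "\u090f\u0902",  # एं
    "\u090f\u0901",  # एँ
    "\u094b\u0902",  # ों
    "\u0947\u0902",  # ें
)


def light_stem(word, min_stem_len=2):
    """Strip one known suffix, keeping at least min_stem_len code points."""
    word = unicodedata.normalize("NFC", word)
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for variant in ("योजनाओं", "योजनाएं", "योजना"):
    print(variant, "->", light_stem(variant))
```

With this list, both variants from the text reduce to the root योजना, so a query for any one form could be matched against documents that contain the others. A practical stemmer would need a much larger suffix inventory and safeguards against over-stemming.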
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to the
phonetic nature of Indian languages, multiple dialects, the transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety in the Indian language alphabet. Together these produce
spelling variations of the same word. For example, the Hindi word अंग्रेजी
(angrējī, meaning "English") has several possible spelling variations.
There are numerous words that are phonetically equivalent but vary in writing.
The word "school" can be written in Hindi in several ways: the standard spelling
स्कूल and non-standard phonetic equivalents. When the standard keyword and a
non-standard phonetic equivalent are each searched, Google shows 69 million
results for the former and 14 million for the latter. Hindi is also influenced by
other regional languages, which results in phonetic variety; for example, the
English word "school" (स्कूल in Hindi) is pronounced and written as इस्कूल
(iskool) by the majority of the population in different states of India. For the
Hindi word इस्कूल (iskool), more than two thousand results are found. Search
engines should be capable of retrieving results for phonetically equivalent
forms of the keywords entered: a user may type any variant, and the search
engine should support all of them.
Also, no particular standard exists for writing keywords to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces different
results; that is, a different set of documents is retrieved, with little overlap.
A native Hindi user may not be aware of these phonetic issues in Hindi IR and
may miss relevant information.
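One way a search system could support such variants is to index and query on a normalized spelling key. The sketch below is an illustration under stated assumptions, not the method of this study: the folding rules (drop the nukta, treat candrabindu as anusvara, write a nasal consonant plus virama before a consonant as anusvara, shorten long i and u) are a hypothetical minimal set, and the key is meant for matching only, not for display.

```python
import re
import unicodedata

NUKTA = "\u093c"
CANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
# A nasal consonant followed by virama, directly before another consonant.
NASAL_CLUSTER = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")
# Fold long vowel signs to short ones: II -> I, UU -> U.
LONG_TO_SHORT = str.maketrans({"\u0940": "\u093f", "\u0942": "\u0941"})


def spelling_key(word):
    """Return a matching key that is the same for common spelling variants."""
    word = unicodedata.normalize("NFD", word)  # split precomposed nukta letters
    word = word.replace(NUKTA, "")
    word = word.replace(CANDRABINDU, ANUSVARA)
    word = NASAL_CLUSTER.sub(ANUSVARA, word)
    return word.translate(LONG_TO_SHORT)


print(spelling_key("हिन्दी") == spelling_key("हिंदी"))
print(spelling_key("अंग्रेज़ी") == spelling_key("अंग्रेजी"))
```

These rules deliberately over-fold (for instance, every nasal cluster becomes an anusvara), so distinct words can occasionally share a key. They also do not cover variants that add a vowel at the start of a word, such as इस्कूल for स्कूल, which would need additional rules.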
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations, and attitudes
in addition to its basic "dictionary" meaning. A word also often has
near-synonyms that differ from it solely in these nuances of meaning. Choosing
the right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण ("ornament" in English) has the
following commonly spoken synonyms: गहना, जेवर, and अलंकार.
4.8.4 Ambiguous words
Ambiguous words reduce the relevancy of the results, as the following examples show. Consider the query
(In English) (Women like gold)
(In Hindi) (न यी क स न ऩस द ह )
In this query the word स न (gold) is ambiguous because it has another meaning, i.e. to sleep. In the context of the above query स न means gold, but the query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आभ र गो की ऩस द)
Here the word आभ is ambiguous. In the above query it means common. However, in Hindi it also means mango, so the query can also be interpreted as "mango is people's choice".
Many words are polysemous in nature: one word has more than one meaning, and the meaning depends on the context of the sentence. Finding the correct sense of a word in a given context is an intricate task. For example, कय means tax in one context (synonyms बम ज श लक स द भहस ऱ ट कस), hand or arm in another context (synonyms हसत फ ह आच शफय), and to do (कयन) in a third context.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Speakers are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director, and department are used even by uneducated Indians, without awareness of which language the words come from. Most Indians use these English words rather than their native-language equivalents. English influences Hindi not only in speech but also in writing, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from that set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries

| S No | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Bird's species |
| 6.2 | ऩकषमोकीपरज ततम | Bird's species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |
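Table 4.18 Effect of morphological factors on Hindi queries

| S No | Root word / morphological variant | Keywords listed: Google | Keywords listed: Bing | Keywords listed: Guruji | Documents: Google | Documents: Bing | Documents: Guruji |
|---|---|---|---|---|---|---|---|
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन | वष मवनवन ब यतवष मवन | वन | | | |
| 1.1 | ब यत वष मवन | | | | 50500 | 4680 | 485 |
| 1.2 | ब यत मवष मवनो | | | | 40400 | 680 | 61 |
| 2 | द घमटन | द घमटन द घमटन ओ | द घमटन | द घमटन | | | |
| 2.1 | द घमटन | | | | 133000 | 2410 | 284 |
| 2.2 | द घमटन ओ | | | | 117000 | 420 | 23 |
| 3 | ब ष | ब ष ब ष ओ | ब ष | ब ष | | | |
| 3.1 | ब ष | | | | 161000 | 8330 | 961 |
| 3.2 | ब ष ए | | | | 6200 | 935 | 188 |
| 3.3 | ब ष ओ | | | | 6090 | 441 | 356 |
| 4 | झ र | झ रझ रो | झ र | झ र | | | |
| 4.1 | झ र | | | | 4740 | 278 | 25 |
| 4.2 | झ र | | | | 1270 | 28 | 1 |
| 5 | आऩद | आऩद आऩद ओ | आऩद | आऩद | | | |
| 5.1 | आऩद | | | | 102000 | 4030 | 410 |
| 5.2 | आऩद ए | | | | 1160 | 64 | 20 |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत म | ऩ परज तत | ऩ परज तत | | | |
| 6.1 | ऩ परज तत | | | | 48200 | 1670 | 98 |
| 6.2 | ऩकषमोपरज तत | | | | 47600 | 1150 | 84 |
| 6.3 | ऩकषमोपरज ततम | | | | 33800 | 747 | 25 |
| 7 | सभसम | सभसम ए सभसम ओ सभसम | सभसम | सभसम | | | |
| 7.1 | सभसम | | | | 584000 | 30200 | 1889 |
| 7.2 | सभसम ओ | | | | 584000 | 7150 | 1356 |
| 8 | कीटन शक | कीटन शक कीटन शको | कीटन शक | कीटन शक | | | |
| 8.1 | कीटन शक | | | | 36300 | 1360 | 333 |
| 8.2 | कीटन शको | | | | 35800 | 800 | 270 |
| 9 | य ग | य गोय ग | य ग | य ग | | | |
| 9.1 | य ग | | | | 205000 | 21600 | 1423 |
| 9.2 | य चगमो | | | | 128000 | 3280 | 239 |
| 9.3 | य ग | | | | 112000 | 6280 | 647 |
| 10 | म जन | म जन ओ म जन | म जन | म जन | | | |
| 10.1 | म जन | | | | 673000 | 18500 | 3343 |
| 10.2 | म जन ओ | | | | 669000 | 6020 | 990 |
| 10.3 | म जन ए | | | | 673000 | 2860 | 416 |
| 11 | क दर | क दरीमक दर | क दर | क दर | | | |
| 11.1 | क दर | | | | 261000 | 11300 | 655 |
| 11.2 | क नदरो | | | | 29500 | 1850 | 105 |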
Figure 4.1 Precision values of the three search engines (chart of P Google, P Bing, and P Guruji on a 0 to 1 axis against query numbers 1.1 to 11.1)
Table 4.19 Precision values of the three search engines

| S No | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general we get better results with keywords in their root form.
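Reducing a keyword to its root form can be approximated by stripping common plural and oblique endings. The following is a minimal sketch under stated assumptions, not the procedure used in this study: the rule list `SUFFIX_RULES` and the function `light_stem` are hypothetical, the rules cover only a few noun endings, and a real stemmer would need a much larger rule set and exception handling.

```python
# (ending, replacement) pairs, longest endings first so that "ियों" is
# tried before "ों". Illustrative only; not an exhaustive rule set.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # ियों -> ी  (पक्षियों -> पक्षी)
    ("\u093f\u092f\u093e\u0901", "\u0940"),  # ियाँ -> ी
    ("\u0913\u0902", ""),                    # ओं    (भाषाओं -> भाषा)
    ("\u090f\u0901", ""),                    # एँ    (भाषाएँ -> भाषा)
    ("\u090f\u0902", ""),                    # एं
    ("\u094b\u0902", ""),                    # ों    (झीलों -> झील)
    ("\u0947\u0902", ""),                    # ें
]


def light_stem(word: str, min_stem: int = 2) -> str:
    """Strip the first matching ending, keeping at least min_stem code points."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            if len(stem) >= min_stem:
                return stem
    return word  # no rule applied: treat the word as already in root form
```

Applying `light_stem` to each keyword before submission would let an interface issue the root-form query alongside the form the user typed.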
It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
From these results it appears that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. Overall, the results from the three search engines in the tables above show that the quantity of results generally increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines, calculated over the top 10 results of each. On close observation, the precision of Google is the highest in almost all queries. Since, as noted above, Google does its indexing on the root form of keywords, the relevancy of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
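The precision values above are computed over the top 10 results of each engine. As a small illustration of that measure (the function name `precision_at_k` and the boolean relevance judgements are assumptions made for this sketch), precision at k is the number of relevant documents among the first k results divided by k:

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a sequence of truthy/falsy judgements in rank order.
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the divisor stays k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k
```

For example, a result list in which 5 of the first 10 documents are relevant gives a precision of 0.5, the value reported for Google on query 1.1.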
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may type any variant, and the search engine should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. Table 4.20 shows the number of results and the precision offered by Google for each phonetic variant.
Table 4.20 Results of the search engine on phonetic nature of Hindi language

| Hindi query (standard keywords) | Phonetic variations of the keywords | Query variant submitted to Google | No of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No 1 to Query No 5)
From the above table and figure it can be seen that the search engine returns documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing a keyword to fetch Hindi web data. Each phonetically equivalent keyword in a query produces a different result set, i.e. a different set of documents is retrieved, with very little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
4.11 Discussion: Words Synonyms
A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.
Table 4.21 and Figure 4.3 compare the precision values of the three search engines.
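Table 4.21 Effect of word synonyms on Hindi IR

| S No | Query | Standard Hindi word and synonyms | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | स न क आब षण | आब षणगहन | | | | | | |
| 1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर फ दर | | | | | | |
| 2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | |
| 4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | |
| 5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | |
| 6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |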
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Using synonyms affects not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest. The reason may be that Guruji has fewer documents available or indexes them poorly. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results, and synonyms of keywords play an important role in Hindi information retrieval.
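The synonym substitution discussed above can be sketched as a simple query expansion step. This is only an illustration under stated assumptions: the function `expand_with_synonyms` is hypothetical, and the small `SYNONYMS` table stands in for whatever synonym resource is actually available; its single entry mirrors the ornament example from this section.

```python
from itertools import product

# Illustrative synonym table (an assumption for this sketch).
# Key: standard Hindi word, value: commonly used synonyms.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by every synonym-substituted variant."""
    # For each word, the choices are the word itself plus its synonyms.
    options = [[word] + list(synonyms.get(word, [])) for word in query.split()]
    return [" ".join(choice) for choice in product(*options)]
```

Each variant can then be submitted separately, which corresponds to rows such as 1.1 to 1.4 of Table 4.21. Note that the number of variants grows multiplicatively when several words in a query have synonyms.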
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns hold the meaning of the query in English with respect to the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries

| S No | Query | Keyword as | In English | Keyword as | In English |
|---|---|---|---|---|---|
| 1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep |
| 2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice |
| 3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition |
| 4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head |
| 5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing, and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आभ | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | पन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आभ | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | फ र | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | पन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8 |
Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आभ | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | फ र | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | पन | No results found | Art | na | Snake head | na | na |
| 5 | क र | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied or that contain the keyword in some other, non-required context. The results make clear that all search engines return documents in different contexts, so search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context of art. However, Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
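The deviation from relevance described for the ambiguity tests can be expressed as the share of the top results that fall in each context. The sketch below is illustrative only: the function `context_shares` and the dictionary `QUERY_5_COUNTS` are names introduced here, and the counts are the query 5 rows copied from Tables 4.23 to 4.25.

```python
# Top-result counts for query 5, as (first context, second context, other context).
QUERY_5_COUNTS = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def context_shares(first, second, other):
    """Return the fraction of judged results falling in each context."""
    total = first + second + other
    if total == 0:
        raise ValueError("no judged results")
    return {
        "first": first / total,
        "second": second / total,
        "other": other / total,
    }


for engine, counts in QUERY_5_COUNTS.items():
    shares = context_shares(*counts)
    print(engine, shares["other"])
```

A higher "other" share means more of the inspected results are unrelated to either sense of the ambiguous keyword.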
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but also in writing, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: the English word exercise is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following the kinds of phonetic variation shown in this example, a variety of popular keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi (the last four columns give the number of Google results for each of the four Hindi keywords, in order)

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results: transliteration | Results: standard | Results: equivalent 1 | Results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

From the above table it can be observed that, for a single-keyword query, the search engine returns a large number of documents not only for the standard keyword but also for each of its phonetic variants. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google Transliteration column of the table (printed in bold in the original) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were nevertheless retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent. At the second level, more than one word is replaced by its English equivalent, without changing the meaning of the original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query for the two levels is transformed as
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, where the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
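The two-level transformation just described can be sketched as follows. This is not the software discussed in the next chapter: the function `transform_levels` and the `LOANWORDS` table are hypothetical, the transliterated spellings in the table are assumed for illustration, and Level 1 here simply replaces the first replaceable word while Level 2 replaces all of them.

```python
# Illustrative mapping from Hindi keywords to English words written in
# Devanagari (assumed spellings, for this sketch only).
LOANWORDS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query, loanwords=LOANWORDS):
    """Return (level1, level2) variants of a query; None where a level does not apply."""
    words = query.split()
    positions = [i for i, word in enumerate(words) if word in loanwords]

    def replace_first(n):
        out = list(words)
        for i in positions[:n]:
            out[i] = loanwords[out[i]]
        return " ".join(out)

    # Level 1: exactly one Hindi word replaced by its English equivalent.
    level1 = replace_first(1) if positions else None
    # Level 2: more than one word replaced, so it needs at least two matches.
    level2 = replace_first(len(positions)) if len(positions) > 1 else None
    return level1, level2
```

For a query with two replaceable keywords, the function returns one variant with the first keyword replaced and one with both replaced, matching the two example transformations above.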
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation (Google search results)

| In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, showing result counts for the original query, Level 1, and Level 2)
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity. Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, are used for retrieving web results.
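Table 4.28 Analysis of precision values: tabular representation (Precision@10)

| Query | Form | Query text | Google | Bing | AltaVista |
|---|---|---|---|---|---|
| 1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9 |
| 1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9 |
| 1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8 |
| 2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8 |
| 2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7 |
| 2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5 |
| 3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9 |
| 3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7 |
| 3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1 |
| 4 | Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6 |
| 4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7 |
| 4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8 |
| 5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9 |
| 5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9 |
| 5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4 |
| 6 | Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1 |
| 6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1 |
| 6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1 |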
Figure 4.5 Analysis of Inter precision values
From the above tables and figures it can be seen that Hindi data of a similar nature can be mined out for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal, or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. A similar pattern holds for the rest of the Hindi queries in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variation based on English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
[Chart for Figure 4.5: Inter Precision Chart, HQ (Hindi Query) and L (Level), plotting HQ1 to HQ6 with their L1 and L2 variants for Google, Bing, and AltaVista; the plotted values are those of Table 4.28]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.
The relevant information can be mined out by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at root level. However, this problem can be solved at interface level. Therefore, to lessen the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
extends above the horizontal line. The bindu is more common in modern written Hindi and may even be used exclusively.
The following examples summarize the use of the bindu and chandrabindu:
अ आ इ ईउ ऊ ए ऐ ओ औ
क क कक की क क क क कोकौ
A special diacritic is sometimes used with the vowel आ to transcribe the English "o" vowel sound, as in college (कॉर ज).
4.2.10 Consonants: Velar Consonants

| Letter | Description |
|---|---|
| क | unaspirated k |
| ख | aspirated k |
| ग | unaspirated g |
| घ | aspirated g |
| ङ | n as in sing |

Table 4.3 Consonants: Velar Consonants
Note that the velar nasal consonant does not appear as the first letter of any word.
4.2.11 Palatal Consonants

| Letter | Description |
|---|---|
| च | unaspirated ch as in cheese |
| छ | aspirated ch |
| ज | unaspirated j |
| झ | aspirated j |
| ञ | n as in punch |

Table 4.4 Palatal Consonants
4.2.12 Retroflex Consonants

| Letter | Description |
|---|---|
| ट | like t but retroflex and unaspirated |
| ठ | like t but retroflex and aspirated |
| ड | like d but retroflex and unaspirated |
| ढ | like d but retroflex and aspirated |
| ण | like n but retroflex |

Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as the t in the word better or the d in bedding in American English. The Hindi flaps are retroflex, however.
4.2.13 Dental Consonants

| Letter | Description |
|---|---|
| त | like t but dental and unaspirated |
| थ | like t but dental and aspirated |
| द | like d but dental and unaspirated |
| ध | like d but dental and aspirated |
| न | like n in name but dental |

Table 4.6 Dental Consonants
4.2.14 Labial Consonants

| Letter | Description |
|---|---|
| प | like p but unaspirated |
| फ | like p but aspirated |
| ब | like b but unaspirated |
| भ | like b but aspirated |
| म | m |

Table 4.7 Labial Consonants
4.2.15 Semivowels

| Letter | Description |
|---|---|
| य | y as in young |
| र | like r but often rolled |
| ल | l as in lip |
| व | either w or v |

Table 4.8 Semivowels
The Hindi r sound is typically a flap. However, some speakers may occasionally trill the r sound, or may even occasionally pronounce it closer to an unflapped approximant, as in the English r in red.
4.2.16 Sibilants

| Letter | Description |
|---|---|
| श | sh as in shave |
| ष | like sh but retroflex |
| स | s as in save |

Table 4.9 Sibilants
4.2.17 Glottal

| Letter | Description |
|---|---|
| ह | like h but voiced |

Table 4.10 Glottal
4.2.18 Allophony of w and v in Hindi
A phoneme is an equivalence class of discrete sounds: sounds from different phonemes can produce a difference in meaning when spoken, whereas sounds within one phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the p in the word spit and the p in the word pit are pronounced distinctly: the former is unaspirated and the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from Greek, "different sounds").
In Hindi, the sounds associated with the English letters w and v are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as vo, whereas the suffix -वाला is typically pronounced wala. Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.
Thus व may be pronounced as w or v. Some speakers may even pronounce an intermediate sound.
Semi-Allophones j and z in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction between the English j and z sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of w and v, since technically the z sound can be represented distinctly from the j sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as jo. There is some variation, however, in some words such as ज्यादा: some speakers pronounce this as zyada and some as jyada.
4.2.19 English Alveolar Consonants
There is no equivalent of the English t or d in Hindi. These English
sounds are pronounced with the tip of the tongue on the alveolar ridge behind the
top teeth. This place of articulation is between the Devanagari retroflex and
dental positions, although the English pronunciation will sound much closer to
the retroflex pronunciation to Hindi speakers. English loanwords containing t or
d are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding
symbols in Devanagari. These sounds are represented by placing the nuqta
underneath a symbol which is phonetically similar. These symbols represent
sounds from other languages, such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter Approximation
क़ like k but pronounced in the back of the mouth
ख़ velar fricative, like the ch in German Bach
ग़ velar sound similar to ख़ but voiced
ज़ just as English z, as in zoo
झ़ similar to the s in English vision
फ़ just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the
non-nuqta forms, though: ज़ and फ़.
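For retrieval, the nuqta matters because the same borrowed letter can be stored in Unicode in two ways: as a single precomposed code point, or as the base consonant followed by the combining nuqta sign (U+093C). The following is a minimal sketch, using Python's standard unicodedata module, of bringing both spellings to one form and of optionally folding the nuqta away so that ज़ and ज match. The helper name fold_nuqta is hypothetical, and whether to fold at all is an assumption about what a search system might want, not something this chapter prescribes.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA

def fold_nuqta(text: str) -> str:
    """Hypothetical helper: decompose the text, then drop every nuqta sign."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(NUQTA, "")

precomposed = "\u095b"      # ज़ stored as one code point (DEVANAGARI LETTER ZA)
combining = "\u091c\u093c"  # ज followed by the combining nuqta sign

# After normalization both spellings are the same code point sequence.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", combining)
)
print(same_after_nfc)
print(fold_nuqta(precomposed) == "\u091c")
```

Folding loses the j/z and ph/f distinctions, so it suits matching loosely spelled queries better than it suits display.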
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ, Devanagari provides two means of
suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This
method is much more common. The halant is typically used only when
typographical difficulties make it hard to use conjuncts.
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct
contains a vertical line. The vertical line is deleted, and then the modified
consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द हिन्दी
च + छ = च्छ अच्छा
स + त = स्त नमस्ते
ल + ल = ल्ल बिल्ली
म + ब = म्ब लम्बा
फ़ + त = फ़्त मुफ़्त
क + य = क्य क्यों
Note that in the last two examples, although neither फ़ nor क ends in a vertical
line, they can still be the first letter of a horizontal conjunct. The curve on
the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical
conjuncts with the following consonant. The first consonant is written on top of
the second consonant. For example:
ट + ट = ट्ट छुट्टी
ट + ठ = ट्ठ चिट्ठी
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant
is the first member of a conjunct, it may be written either using a regular
conjunct (e.g. न + द = न्द, हिन्दी) or using an anusvar, which is a dot written
above the horizontal line, to the right side of the preceding consonant or vowel.
For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be
spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant:
in other words, it is articulated in the same location in the mouth as the
following consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar
represents a consonant which is the first letter of a conjunct, whereas the bindu
and chandrabindu represent the nasalization of a vowel. The bindu in a word such
as हैं cannot be considered an anusvar, since there is no conjunct. The anusvar
in हिंदी is not considered a bindu, since it represents a consonant that is the
first member of a conjunct.
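Because हिंदी and हिन्दी are two spellings of one word, an index that treats them as different strings will miss matches. The homorganic rule above is regular enough to sketch in a few lines: replace the anusvar by the nasal of the same articulation class as the following stop, plus a halant. The function name expand_anusvara and the decision to rewrite towards the conjunct spelling are assumptions made for illustration. Note also that Unicode uses the single code point U+0902 for both the anusvar and the bindu, so the sketch leaves the sign alone unless a stop consonant follows.

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA (also used for the bindu)
VIRAMA = "\u094d"    # DEVANAGARI SIGN VIRAMA (halant)

# Stop consonants of each articulation class, and the nasal of that class.
HOMORGANIC = {
    "कखगघ": "ङ",  # velar
    "चछजझ": "ञ",  # palatal
    "टठडढ": "ण",  # retroflex
    "तथदध": "न",  # dental
    "पफबभ": "म",  # labial
}
NASAL_FOR = {stop: nasal for stops, nasal in HOMORGANIC.items() for stop in stops}

def expand_anusvara(word: str) -> str:
    """Hypothetical helper: spell out an anusvar as nasal consonant + halant."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and nxt in NASAL_FOR:
            out.append(NASAL_FOR[nxt] + VIRAMA)
        else:
            out.append(ch)  # vowel nasalization or any other character
    return "".join(out)

hindi_short = "\u0939\u093f\u0902\u0926\u0940"       # हिंदी
hindi_full = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी
print(expand_anusvara(hindi_short) == hindi_full)
```

Applying the same function to both documents and queries would make the two spellings retrieve each other; a word-final sign, as in हैं, passes through unchanged.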
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above
and to the right of the following consonant:
र + म = र्म शर्मा
र + ट + ई = र्टी पार्टी
As the second member of a conjunct, र is indicated by a diagonal line adjoined to
the vertical line of the preceding consonant:
क + र = क्र शुक्रिया
म + र = म्र उम्र
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted v, as follows:
ट + र = ट्र राष्ट्र
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are
treated as a single unit, and a maatraa that is written to the left, such as ि,
is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line which
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, which is a
7-bit encoding of the characters used in English (it can represent at most 128
characters). With the passage of time, the need was felt for a single encoding
that could contain enough characters to accommodate all the languages in the
world. To enable sharing of information, this encoding would need to be a
universally accepted standard. That standard is Unicode. Unicode defines a code
space of 1,114,112 code points (U+0000 to U+10FFFF), which can be stored in
8-, 16- or 32-bit code units and which can potentially give a unique number to
each character in all known languages.
There is also another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the Universal
Character Set (UCS). Fortunately, the participants of both projects (ISO and
Unicode) realized around 1991 that two different unified character sets are not
exactly what the world needs. They joined their efforts and worked together on
creating a single encoding. Both projects still exist and publish their
respective standards independently, but have agreed to keep the encodings of the
Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value
will be represented in bits when stored in a computer file or transmitted over
the Internet. The Unicode Standard defines three encoding forms that specify how
a particular character is represented in bits. The three encoding forms allow
the same data to be transmitted in a byte, word or double-word oriented format
(i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the
same common character repertoire and can be efficiently transformed into one
another without loss of data. The three encoding forms, as defined by the Unicode
Consortium, are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable-length encoding of bytes. It
has the advantages that the Unicode characters corresponding to the familiar
ASCII set have the same byte values as ASCII, and that Unicode characters
transformed into UTF-8 can be used with much existing software without
extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact: all the
heavily used characters fit into a single 16-bit code unit, while all other
characters are accessible via pairs of 16-bit code units (surrogate pairs).
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single
code unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
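The trade-off between the three encoding forms is easy to see on a Hindi word. The sketch below, which uses only Python's built-in str.encode and bytes.decode, encodes हिन्दी in each form, checks that the round trip is lossless, and reports the size. The little-endian codec names are chosen here only so that no byte order mark is added to the counts.

```python
word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी, six code points

sizes = {}
for form in ("utf-8", "utf-16-le", "utf-32-le"):
    data = word.encode(form)
    assert data.decode(form) == word  # every form round-trips without loss
    sizes[form] = len(data)
    print(form, len(data), "bytes")
```

Each Devanagari code point takes three bytes in UTF-8, two in UTF-16 and four in UTF-32, so the word occupies 18, 12 and 24 bytes respectively. For Hindi text UTF-16 is therefore the most compact of the three, while UTF-8 remains the smallest for text that is mostly ASCII.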
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and
range from 2 to 4 bytes in size (the original design allowed sequences of up to
6 bytes, but the current definition, RFC 3629, limits UTF-8 to 4); the most
commonly used characters need at most three bytes. Non-ASCII characters are
encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII characters
stored in bytes have their most significant bit clear). This allows an
application to differentiate between ASCII and non-ASCII characters: bytes
representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII
character indicates how many bytes follow for this character. All further bytes
in the multibyte sequence carry the remaining bits of the character [61].
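The byte layout described above can be checked on a single Devanagari letter. In this sketch the function sequence_length is a hypothetical helper written for illustration; it reads the sequence length from the high bits of the first byte, following the bit patterns of the current (4-byte maximum) UTF-8 definition.

```python
def sequence_length(lead: int) -> int:
    """Hypothetical helper: length of a UTF-8 sequence, given its first byte."""
    if lead < 0x80:
        return 1                # 0xxxxxxx: plain ASCII
    if lead >> 5 == 0b110:
        return 2                # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")

encoded = "\u0915".encode("utf-8")  # क, DEVANAGARI LETTER KA
bits = [f"{byte:08b}" for byte in encoded]
print(bits)                          # one lead byte, two continuation bytes
print(sequence_length(encoded[0]))   # bytes in the whole sequence
```

For क (U+0915) the bytes are E0 A4 95: the lead byte begins with 1110, announcing a three-byte sequence, and both following bytes begin with 10. Every byte has its top bit set, which is why none of them can be confused with an ASCII character.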
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from left
to right. They are all abugidas, in which most symbols stand for a consonant plus
an inherent vowel (usually the sound a). Word-initial vowels in many of these
scripts have distinct symbols, and word-internal vowels are usually written by
juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of
the inherent vowel, when that occurs, is frequently marked with a special sign.
In the Unicode Standard this sign is denoted by the Sanskrit word virama. In some
languages another designation is preferred: in Hindi, for example, the word hal
refers to the character itself and halant refers to the consonant that has its
inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign
nominally serves to suppress the inherent vowel of the consonant to which it is
applied; it is a combining character, with its shape varying from script to
script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in
the south, and from Pakistan in the west to the easternmost islands of Indonesia,
are derived from the ancient Brahmi script. The oldest lengthy inscriptions of
India, the edicts of Ashoka from the third century BCE, were written in two
scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin,
probably deriving from Aramaic, which was an important administrative language of
the Middle East at that time. Kharoshthi, written from right to left, was
supplanted by Brahmi and its derivatives. The descendants of Brahmi spread, with
myriad changes, throughout the subcontinent and outlying islands. There are said
to be some 200 different scripts deriving from it. By the eleventh century, the
modern script known as Devanagari was in ascendancy in India proper as the major
script of Sanskrit literature. The northern branch of Brahmi's descendants, to
which Devanagari belongs, includes such modern scripts as Bengali, Gurmukhi and
Tibetan; the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all
encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit encoding
of syllable-final consonants. Many of the character names in this group of
scripts represent the same sounds, and naming conventions are similar across the
range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. The principles of the
Indic scripts are covered in some detail in this introduction to the Devanagari
script. The remaining introductions to the Indic scripts are abbreviated but
highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and
1986. The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic
scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada and Malayalam. This parallel code layout emphasizes the
structural similarities of the Brahmi scripts and follows the stated intention of
the Indian coding standards to enable one-to-one mappings between analogous
coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao,
Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari
structural pattern, so the Unicode Standard does not attempt to provide any
direct mappings for these scripts to the Devanagari order.
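The parallel code layout means that a letter and its counterpart in another ISCII-based script sit at the same offset within their 128-position blocks. A rough sketch of what that makes possible is given below; the function name shift_script and the small table of block starting points are illustrative choices, and the result is only a first approximation of transliteration, since some positions are unassigned in some scripts and spelling conventions differ between languages.

```python
# Starting code points of some of the parallel Indic blocks in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}

def shift_script(text: str, source: str, target: str) -> str:
    """Hypothetical helper: move each letter to the same offset in another block."""
    low = BLOCK_START[source]
    delta = BLOCK_START[target] - low
    return "".join(
        chr(ord(ch) + delta) if low <= ord(ch) < low + 0x80 else ch
        for ch in text
    )

kamal = "\u0915\u092e\u0932"  # कमल in Devanagari
print(shift_script(kamal, "devanagari", "bengali"))  # কমল
```

Characters outside the source block, such as spaces and Latin letters, are passed through untouched.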
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991, Annex G (Extended Character Set for Vedic). Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built
on phonological principles and tends to correspond quite closely to
pronunciation. The orthographic syllable is built up of alphabetic pieces, the
actual letters of the Devanagari script. These pieces consist of three distinct
character types: consonant letters, independent vowels and dependent vowel signs.
In a text sequence, these characters are stored in logical (phonetic) order [62].
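Logical order and the (((C)C)C)V structure can both be seen by inspecting the stored code points of a word. The sketch below uses Python's standard unicodedata and re modules; the regular expression is a deliberately simplified assumption (consonants U+0915 to U+0939, an optional vowel sign, an optional nasalization sign) and ignores the nuqta and other marks that a complete segmenter would have to handle.

```python
import re
import unicodedata

VIRAMA = "\u094d"
CONSONANT = "[\u0915-\u0939]"
# Simplified orthographic syllable: (C + virama)* C [vowel sign] [nasal sign],
# or an independent vowel with an optional nasal sign.
SYLLABLE = re.compile(
    "(?:" + CONSONANT + VIRAMA + ")*" + CONSONANT
    + "[\u093e-\u094c]?[\u0901\u0902]?"
    + "|[\u0904-\u0914][\u0901\u0902]?"
)

def syllables(word: str) -> list:
    """Hypothetical helper: split a word into orthographic syllables."""
    return SYLLABLE.findall(word)

word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}", name)
print(syllables(word))
```

The vowel sign ि is stored after ह even though it is drawn to its left, which is what logical order means in practice. The conjunct न्द is stored as न, virama, द, so the word splits into the two orthographic syllables हि and न्दी.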
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in
their daily lives. They are sending and receiving e-mails, searching for
information, reading e-papers, blogging and launching Web sites in their own
languages. Two American IT companies, Microsoft and Google, have played a big
role in making this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents. But the advent of Unicode support for Hindi and Urdu
changed all that." The concept of a new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India,
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then, the Hindi
Unicode support has been a part of all subsequent upgradations of Microsoft's
operating systems. Also, providing Input Method Editor facilities give users the
option to use different types of keyboards," says Meghashyam Karanam, product
manager, vision and localization, at Microsoft India. The earlier system could
incorporate only 127 characters, which is not enough for the Hindi
Devanagari script. The Unicode system can incorporate up to 65,000 characters. As
most computers in India use Microsoft's operating system, it ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya.
In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with
one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal.
"Webdunia also provided support for the Hindi version of Microsoft Office as
well as for language interface packs," says Jaideep Karnik, general manager for
content and localization at webdunia.com. The Indore, Madhya Pradesh-based
company has an office in the United States and helps major software developers
localize their products.
If Microsoft built the base for
Hindi, Google was ready to put up the superstructure. Realizing the potential of
Indian languages, the California-based company has launched various products in
the past two years. With the Google Hindi and Urdu search engines, one can
search all the Hindi and Urdu Web pages available on the Internet, including
those that are not in a Unicode font. "Google offers searching in 13 languages
(Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five
languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most
recent language that Google has added to its offerings," says Rahul
Roy-Chowdhury, product manager at Google India. To use the search function,
"users can type Hindi words in Roman script and a drop-down menu suggests
several Hindi phrases. By selecting the appropriate query, users can search for
Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in English, hit the space bar and get the same word in a different
language. Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language
is soon to be introduced in various other products," he adds.
The efforts of Microsoft, Google and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of
the popular Hindi newspapers have a Web presence now. "In the last two years,
page views of navbharattimes.com have increased significantly, and half of them
come through Google, as Net users generally search for a specific news item or
query," says Nagar. Yahoo, with headquarters in California, formed a partnership
with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The
Jagran relationship helps us gain significant traction among Indian Internet
users. From all the audience measures for this product, this has been a
resounding success," says Gopal Krishna, head of Yahoo's work for emerging market
audiences. Since Yahoo and Jagran started working together, page views have
"grown to about 14 million from one million a year and a half earlier," says
Upendra Swami, who heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and
Urdu on the Internet? "The major challenge is Internet penetration and PC
prices. The moment we have better Internet penetration, especially in smaller
towns, and PC prices go down, Hindi and Indian languages can flourish on the
Net," says Karnik of webdunia.com. India had more than 49 million Internet users
in June 2008, out of which about 9 million used the Internet regularly,
according to a study by Juxtconsult India, a research company. "There is a big
opportunity in Indian languages. Studies showed that only 28 percent of Indian
Web surfers preferred English on the Web, but as good quality content in Indian
languages was not easily available, they did not visit many local language
sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to
get the widest audience reach, one has to look at Hindi, because in a country of
over a billion people, English is spoken by less than 80 million people," says
Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is
not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we
look at how the Internet developed in the US, it may provide a useful analogy.
First came content, which was mostly produced by people who had a passion for
putting up content they cared about. Traffic and monetization was not the
motive. Second came growing readership, as people started discovering content.
This set off a virtuous cycle in which content eventually became a viable,
monetizable business. Third were the application developers, who could now focus
on moving the online experience beyond passive consumption of information to
interactivity, community building, service delivery and a host of other
innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a
long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri
at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language
corpora. It has coordinated with various Indian agencies and universities to
develop corpora of more than 45 million words in the Scheduled Languages of
India, which is also a part of the TDIL program. The Enabling Minority Language
Engineering (EMILLE) program provides corpora, architecture and tools for Asian
languages. It has a monolingual corpus which contains approximately 96,157,000
words, and a parallel corpus consisting of 200,000 words of text in English,
which helps in translation for Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus which is a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' as a repository of a Hindi words database and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Efforts in this field date from the 1980s,
involving prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s many new
projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University, Kolkata. TDIL started a consortium-mode project in
April 2008 for building computational tools and Sanskrit-Hindi MT under the
leadership of Amba Kulkarni (University of Hyderabad). The goal of this project
is to build children's stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R. M. K. Sinha.
It is a rule-based system and has approximately 1,750 rules and 54,000 lexical
words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application which is available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti during 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English, undertaken by CIF
(Chinmaya International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation (FGH-MT) system. It has software which can
translate text from one Indian language into another Indian language, based
on Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for the translation of Government orders,
notifications, circulars and legal documents from English to Hindi. The main goal
was to provide translation tools to government agencies. Mantra software is
available in desktop, network and web-based forms. It is based on the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar. Initially it was domain-specific, covering Personnel
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present
it also covers domains like Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali
and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha
Secretariat of the Rajya Sabha (the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applied in the domain of government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system that provides a feedback facility. If the user is not satisfied with the system-generated translation, the user can feed new words, phrases and sentences back to the system and obtain a newly interpreted translation. The Shiva MT system is available at (httpebmtserciiscernetinmtloginhtml).
Shakti is a rule-based system that also uses a statistical approach. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at (httpshaktiiiitnet) [24].
4.5.9 Tamil-Hindi MAT System
The K B Chandrasekhar Research Centre of Anna University, Chennai has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has a small set of transfer rules. Users can access the system at httpwwwaukbcorgresearch_areasnlpdemomat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at
httpbengalinuxsourceforgenetcgibinanubadokindexpl
4.5.11 Punjabi to Hindi Machine Translation System
In 2007 Josan and Lehal at Punjabi University, Patiala designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre-Processing, Translation Engine and Post-Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the backing of Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it searches deep into Indian language content and tries to return appropriate results. The Guruji search engine covers a range of content: news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language in WikiBhasa is English and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development of corpus creation/collection and content syndication, and it provides language consultancy. It has developed various applications in Indian languages such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing and has also developed quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970 the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defense Research and Development Organization), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved directly or indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian Language) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In the last year alone, probably more than 10-15 Indian local search engines have been launched, some smaller and some bigger, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. However, we attempt to give a brief overview of the current scenario.
Search engines that fall in the localized Indian category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them. They are in testing stages, and every start-up is adding new features and improving its services.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Websites | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More Info: India online 2008, India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More Info: India online 2008, India online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More Info: India online 2008, India online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that documents containing these keywords do exist. Merely stating that there is a recall problem is not sufficient. The next obvious question is how much of a problem it is; in other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels this recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved Recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but grammatical and structural standards are often not followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi there are about 28 languages spoken in India, and Hindi, being the National Language of India, is influenced by the regional languages, which results in dialectal changes not only in speech but also in writing. Every language attaches markers to a root word to construct new words (English uses s, es, ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi (Planning in English) are morphological variants of the root word Yojnaa (योजना). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
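The suffix stripping described above can be sketched as follows. This is a minimal illustration only: the rule list `SUFFIX_RULES`, the threshold `MIN_STEM_LENGTH` and the function name `stem` are hypothetical and are not taken from any of the search engines studied in this chapter.

```python
# Minimal suffix-stripping stemmer for Hindi nouns (illustrative sketch).
# SUFFIX_RULES is a hypothetical, deliberately tiny rule list; a usable
# stemmer would need many more rules and exception handling.
SUFFIX_RULES = [
    ("ियों", "ी"),  # पक्षियों -> पक्षी
    ("ाओं", "ा"),   # योजनाओं -> योजना
    ("ाएं", "ा"),   # योजनाएं -> योजना
    ("ाएँ", "ा"),   # योजनाएँ -> योजना
    ("ों", ""),     # झीलों -> झील
]

MIN_STEM_LENGTH = 2  # assumption: never strip down to fewer than 2 code points


def stem(word):
    """Return a canonical (root) form by applying the first matching rule."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            base = word[: len(word) - len(suffix)]
            if len(base) >= MIN_STEM_LENGTH:
                return base + replacement
    return word


variants = ["योजना", "योजनाओं", "योजनाएं"]
roots = {stem(w) for w in variants}
```

With such a function, all three variants of Yojnaa collapse to one index key, so a query in any of the forms could be matched against documents containing the others.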
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabets. Together these factors produce spelling variations of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the single standard keyword school (स्कूल) and for a non-standard phonetically equivalent Hindi keyword (सक र), Google shows 69 million results for the former and 14 million for the latter. Hindi is influenced by other regional languages, which results in phonetic variety of words; for example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by a large part of the population in different states of India. For the Hindi word ISKOOL (इस्कूल) more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered, since a user may search with any of these forms.
Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with little repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
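One way a search engine could support phonetically equivalent spellings is to map each keyword to a normalized matching key before indexing and querying. The sketch below is an assumption-laden illustration, not a description of any engine discussed here: the function name `normalize_spelling` is hypothetical, and it handles only three variation types chosen for the example (nukta dots, chandrabindu versus anusvara, and a nasal consonant conjunct versus anusvara).

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्

# Nasal consonant + virama before a non-nasal stop is commonly also
# written as an anusvara (आन्दोलन / आंदोलन).
NASAL_CONJUNCT = re.compile(
    "[ङञणनम]" + VIRAMA + "(?=[कखगघचछजझटठडढतथदधपफबभ])"
)


def normalize_spelling(word):
    """Return a matching key; the key is for comparison, not for display."""
    text = unicodedata.normalize("NFD", word)      # split precomposed nukta letters
    text = text.replace(NUKTA, "")                 # आज़ादी -> आजादी
    text = text.replace(CHANDRABINDU, ANUSVARA)    # महँगाई -> महंगाई
    text = NASAL_CONJUNCT.sub(ANUSVARA, text)      # आन्दोलन -> आंदोलन
    return unicodedata.normalize("NFC", text)
```

Because the same function is applied to both the indexed terms and the query terms, variants that differ only in these respects end up with the same key. Variations such as a prefixed vowel (ISKOOL for school) are not covered by this sketch.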
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning. A word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
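Synonym handling of this kind is often implemented as query expansion: the query is reissued once per known synonym of each keyword. The following is a minimal sketch under stated assumptions; the dictionary `SYNONYMS` and the function `expand_query` are hypothetical, and the sample query is invented for the illustration.

```python
# Hypothetical synonym dictionary; a real system might draw on a lexical
# resource such as a WordNet for Hindi instead of a hand-written table.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query):
    """Return the original query followed by one variant per synonym."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for synonym in SYNONYMS.get(token, []):
            variant = " ".join(tokens[:i] + [synonym] + tokens[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


expanded = expand_query("स्वर्ण आभूषण")
```

The results of the expanded queries can then be merged, so that documents using any of the synonyms are retrieved. Note that simple token substitution ignores inflection and agreement, which a fuller implementation would have to handle.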
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of results. The examples below show this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous because it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can also be interpreted as "mango is people's choice".
Many words are polysemous in nature. Finding the correct sense of a word in a given context is an intricate task: one word can have more than one meaning, and the meaning depends on the context of the sentence. For example, कर (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context, कर (hand or arm) has the synonyms हसत फ ह आच शफय in another context, and कर (to do) corresponds to करना in yet another context.
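A very simple way to pick a sense is to count how many context words of the query overlap with cue words listed for each sense (a simplified Lesk-style heuristic, named here plainly as a swapped-in technique and not the method of any engine in this study). The sense inventory `SENSES`, its cue words and the function `guess_sense` are all hypothetical.

```python
# Hypothetical sense inventory: for each ambiguous keyword, every sense
# is described by a few cue words that tend to occur with that sense.
SENSES = {
    "सोना": {
        "gold": {"चांदी", "आभूषण", "गहना", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर"},
    },
    "आम": {
        "mango": {"फल", "मीठा", "रस"},
        "common": {"लोग", "लोगों", "जनता", "आदमी"},
    },
}


def guess_sense(keyword, query):
    """Return the sense whose cue words overlap most with the query context.

    Returns None when no cue word occurs, i.e. the query stays ambiguous.
    """
    context = set(query.split()) - {keyword}
    best_sense, best_overlap = None, 0
    for sense, cues in SENSES.get(keyword, {}).items():
        overlap = len(context & cues)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

Under these assumed cue lists, the query आम लोगों की पसंद resolves to the sense "common", while नारी को सोना पसंद है contains no cue word and remains ambiguous, which mirrors the difficulty described above.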
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of the language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S.No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental Patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
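Table 4.18 Effect of morphological factors on Hindi queries
S.No. | Keyword | Documents returned: Google | Bing | Guruji
Root word 1: ब यत वष मवन. Listing of keywords, morphological variants (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
Root word 2: द घमटन. Listing of keywords, morphological variants (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
Root word 3: ब ष. Listing of keywords, morphological variants (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
Root word 4: झ र. Listing of keywords, morphological variants (Google, Bing, Guruji): झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
Root word 5: आऩद. Listing of keywords, morphological variants (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
Root word 6: ऩ परज तत. Listing of keywords, morphological variants (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
Root word 7: सभसम. Listing of keywords, morphological variants (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
Root word 8: कीटन शक. Listing of keywords, morphological variants (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
Root word 9: य ग. Listing of keywords, morphological variants (Google, Bing, Guruji): य ग य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
Root word 10: म जन. Listing of keywords, morphological variants (Google, Bing, Guruji): म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
Root word 11: क दर. Listing of keywords, morphological variants (Google, Bing, Guruji): क दर क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105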
Table 4.19 Precision values of the three search engines
S.No. | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (P Google, P Bing and P Guruji per query number; vertical axis from 0 to 1)
It has been observed that all three search engines return more documents when the query with the root word is submitted. This supports searching for documents with the root word, because in general better results are obtained with keywords in their root form.
It has also been observed that only Google lists morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
From the above results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On close observation, the precision of Google is high for almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
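The precision value over the top 10 results, as used in the comparison above, can be computed as in the following minimal sketch. The function name `precision_at_k` and the relevance judgments in the example are hypothetical; in the study the judgments would come from manual inspection of each result.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of booleans in rank order (True = relevant).
    If fewer than k results were returned, the missing ranks count as
    non-relevant, because the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


# Hypothetical judgments for one query: 5 of the top 10 results are relevant.
judgments = [True, True, False, True, False, True, False, False, True, False]
score = precision_at_k(judgments)
```

A value of 0.5 in Table 4.19 therefore corresponds to 5 relevant documents among the first 10 results.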
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered, since a user may search with any of these forms.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained.
Table 4.20 Results of the search engine on Phonetic nature of Hindi language
Google results for query having keywords | No. of Results | Precision@10
Hindi query 1 (standard keywords): ससजिमो भ िहयीर ऩद थम. Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9
Hindi query 2 (standard keywords): आसभान छ त भहॊगाई. Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7
Hindi query 3 (standard keywords): भरषटाचाय स आिादी. Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8
Hindi query 4 (standard keywords): अननाहिाय क आनदोरन. Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0
Hindi query 5 (standard keywords): फयोिगायी सभसम सभ ध न. Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5
Figure 4.2 Precision Charts for Phonetic nature of Hindi language (one pie chart per query, Query No 1 to Query No 5, with one slice per phonetic variation)
In the above table and figure it can be clearly seen that search engines return only a handful of documents for various phonetically equivalent Hindi queries. No particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with little repetition. From the precision chart it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
4.11 Discussion: Word Synonyms
As discussed in Section 4.8.3, a word often has near-synonyms that differ from it only in nuances of meaning, and choosing the right word can be difficult for people as well as for an information retrieval system. Table 4.21 and Figure 4.3 below compare the precision values of the three search engines for queries that use synonyms.
Table 4.21 Effect of word synonyms on Hindi IR
S.No. | Query | Documents returned by Google | Precision@10 | Documents returned by Bing | Precision@10 | Documents returned by Guruji | Precision@10
Query 1: स न क आब षण. Standard Hindi words, synonyms: आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
Query 2: क र फ दर. Standard Hindi words, synonyms: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
Query 3: सतर सशिकतकयण. Standard Hindi words, synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
Query 4: मसक दयक अ हक य. Standard Hindi words, synonyms: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
Query 5: व रग ओ. Standard Hindi words, synonyms: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
Query 6: आ खद न. Standard Hindi words, synonyms: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, it can clearly be seen that Google and Bing provide better results than Guruji. In the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that changing a keyword into its synonym can yield equivalent results. Therefore synonyms of keywords play an important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the query in English with respect to the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S.No. | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like Gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child Development and Nutrition | बाल (Hair) | Hair Development and Nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war
The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables the last column, labelled "Other context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, non-required context. All three search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.
For another query, सपेरों का फन (art of snake charmers), which also has the context "snake charmer's snake head", the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
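The deviation from relevance described here can be expressed as a simple share of the inspected results. The sketch below is only an illustration: the counts are transcribed from the three ambiguity tables, while the dictionary layout and the function name `shares` are assumptions introduced for this example.

```python
# Counts among the first 10 results, transcribed from the ambiguity tables:
# (context 1, context 2, other context) per query number.
# None marks the query for which Guruji returned no results.
RESULTS = {
    "Google": {1: (5, 2, 3), 2: (3, 3, 4), 3: (7, 3, 0), 4: (0, 10, 0), 5: (2, 3, 5)},
    "Bing":   {1: (2, 2, 6), 2: (3, 3, 4), 3: (6, 2, 2), 4: (0, 9, 1), 5: (0, 2, 8)},
    "Guruji": {1: (0, 0, 10), 2: (3, 0, 7), 3: (5, 2, 3), 4: None, 5: (0, 0, 10)},
}


def shares(engine, query):
    """Return the fraction of inspected results in each context, or None."""
    row = RESULTS[engine][query]
    if row is None:
        return None
    total = sum(row)
    return tuple(count / total for count in row)


for engine in RESULTS:
    print(engine, shares(engine, 5))
```

For query 5 the last element of each tuple is the "other context" share, which grows from Google to Bing to Guruji, matching the counts quoted in the text.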
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains have been tested in the experiments.
The following table gives a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
English word | Hindi forms, in order: Google transliteration, standard Hindi keyword, phonetic equivalents | Google result counts (same order)
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
Table 4.26 shows that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, often in very large numbers. It can also be seen that people have their own ways of writing such words in Hindi and that no standard is followed for storing Hindi data on the web: documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries (the first Hindi form listed for each word) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both the precision and the quantity of the documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.
Table 4.26 (continued): Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, again without changing the meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:
Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Thus the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented in Table 4.27.
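The two-level transformation can be sketched as follows. This is a minimal illustration and not the software described in the next chapter: the lexicon, its Devanagari spellings of the English words, and the function name `transform` are all assumptions made for the example. Level 1 replaces the first keyword that has an English equivalent; Level 2 replaces every keyword that has one.

```python
# Hypothetical lexicon: Hindi keyword -> English equivalent written in Devanagari.
# The spellings are illustrative assumptions, not data from the thesis.
LEXICON = {
    "विदेशी": "फॉरेन",        # foreign
    "निवेश": "इन्वेस्टमेंट",  # investment
}


def transform(query, lexicon=LEXICON):
    """Return the (level 1, level 2) variants of a whitespace-tokenised query."""
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in lexicon]

    # Level 1: replace only the first keyword that has an English equivalent.
    level1 = list(tokens)
    if replaceable:
        first = replaceable[0]
        level1[first] = lexicon[tokens[first]]

    # Level 2: replace every keyword that has an English equivalent.
    level2 = [lexicon.get(tok, tok) for tok in tokens]
    return " ".join(level1), " ".join(level2)


QUERY = "विदेशी निवेश भारत में"
level1, level2 = transform(QUERY)
print(level1)
print(level2)
```

If a query contains fewer than two replaceable keywords, the two levels coincide in this sketch; a fuller implementation would need a policy for that case.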
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4 Transformed queries into two equivalent senses (bar chart of Google result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants)
From Table 4.27 and Figure 4.4 it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28 Analysis of precision values: tabular representation
Precision at 10 is given for each search engine in the order: Hindi query, Level 1, Level 2.
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter-precision values
The tables and figures above show that Hindi data of a similar nature can be mined against Hindi queries by transforming the queries into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns: for every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are relevant; the same pattern repeats for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that arise from English influence. It can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
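The precision figures discussed here are precision at 10: the fraction of the first 10 results that are judged relevant. A minimal sketch follows; the function name and the list of relevance judgments are hypothetical and serve only to show the arithmetic.

```python
def precision_at_k(judgments, k=10):
    """Fraction of the first k results judged relevant.

    `judgments` is a list of 1 (relevant) or 0 (not relevant) in rank order.
    Missing results (fewer than k returned) count as not relevant.
    """
    return sum(judgments[:k]) / k


# Hypothetical judgments for the first 10 results of one query.
example = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(precision_at_k(example))  # 0.9
```

A value of 0.9 corresponds to the statement that 9 of the first 10 documents are relevant.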
[Figure 4.5: Inter-precision chart of precision at 10 for HQ1 to HQ6 and their L1 and L2 variants (HQ = Hindi Query, L = Level); series: Google, Bing, AltaVista]
The process of Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. This problem can, however, be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
ञ n as in punch
Table 4.4 Palatal Consonants
4.2.12 Retroflex Consonants
Letter Description
ट like t, but retroflex and unaspirated
ठ like t, but retroflex and aspirated
ड like d, but retroflex and unaspirated
ढ like d, but retroflex and aspirated
ण like n, but retroflex
Table 4.5 Retroflex Consonants
Hindi additionally employs two flap consonants, ड़ and ढ़. The symbols for these consonants are formed by placing a diacritical mark called a nuqta, which is a subscript dot, underneath the consonant symbols ड and ढ respectively. ड़ is pronounced by flapping the tongue from the retroflex position forward toward the alveolar ridge. ढ़ is pronounced similarly, except with aspiration. English does have an alveolar flap consonant, as in the t in the word "better" or the d in "bedding" in American English. The Hindi flaps are retroflex, however.
4.2.13 Dental Consonants
Letter Description
त like t, but dental and unaspirated
थ like t, but dental and aspirated
द like d, but dental and unaspirated
ध like d, but dental and aspirated
न like n in name, but dental
Table 4.6 Dental Consonants
4.2.14 Labial Consonants
Letter Description
प like p, but unaspirated
फ like p, but aspirated
ब like b, but unaspirated
भ like b, but aspirated
म m
Table 4.7 Labial Consonants
4.2.15 Semivowels
Letter Description
य y as in young
र like r, but often rolled
ल l as in lip
व either w or v
Table 4.8 Semivowels
The Hindi r sound is typically a flap. However, some speakers may occasionally trill the r sound, or may even occasionally pronounce it closer to an unflapped approximant, like the English r in "red".
4.2.16 Sibilants
Letter Description
श sh as in shave
ष like sh, but retroflex
स s as in save
Table 4.9 Sibilants
4.2.17 Glottal
Letter Description
ह like h, but voiced
Table 4.10 Glottal
4.2.18 Allophony of w and v in Hindi
A phoneme is an equivalence class of atomic, discrete sounds: substituting a sound of one phoneme for a sound of another can produce a difference in meaning, whereas sounds within the same phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the p in the word "spit" and the p in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").
In Hindi, the sounds associated with the English letters w and v are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce the distinction fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.
Thus व may be pronounced as w or v. Some speakers may even pronounce an intermediate sound.
Semi-Allophones j and z in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction between the English j and z sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of w and v, since technically the z sound can be represented distinctly from the j sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज़्यादा: some speakers pronounce this as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English t or d in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge, behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing t or d are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages, such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter Approximation
क़ like k, but pronounced in the back of the mouth
ख़ velar fricative, like the ch in "Bach" in German
ग़ velar sound similar to ख़, but voiced
ज़ just as English z, as in zoo
झ़ similar to the s in English "vision"
फ़ just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds, ज़ and फ़, are typically pronounced distinctly from the non-nuqta forms, though.
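Because nuqta forms are often written, and pronounced, without the dot, a retrieval system may wish to treat the two spellings as equivalent. The sketch below uses Python's standard `unicodedata` module to show that a nuqta letter can be stored either as one precomposed code point or as base letter plus nuqta, and then folds both to the bare letter. The helper name `strip_nuqta` and the decision to drop the nuqta entirely are assumptions made for illustration, not a method taken from this chapter.

```python
import unicodedata

NUQTA = "\u093C"                   # DEVANAGARI SIGN NUKTA

precomposed = "\u095B"             # DEVANAGARI LETTER ZA as a single code point
decomposed = "\u091C" + NUQTA      # DEVANAGARI LETTER JA followed by the nuqta


def strip_nuqta(text):
    """Decompose nuqta letters, then drop the nuqta (a lossy folding)."""
    return unicodedata.normalize("NFD", text).replace(NUQTA, "")


# The two spellings differ as raw code point sequences ...
print(precomposed == decomposed)                                   # False
# ... but canonical decomposition maps one onto the other,
print(unicodedata.normalize("NFD", precomposed) == decomposed)     # True
# and the folding makes both compare equal to the plain letter.
print(strip_nuqta(precomposed) == strip_nuqta(decomposed) == "\u091C")  # True
```

Dropping the nuqta loses the j/z and ph/f distinctions, so it suits recall-oriented matching rather than display.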
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ़ + त = फ़्त (मुफ़्त)
क + य = क्य (क्यों)
Note that in the last two examples, although neither क nor फ़ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
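In Unicode text a conjunct is not a separate character: it is stored as the first consonant, the halant (called virama in the Unicode Standard, U+094D), and the second consonant, and the font draws the ligature. The short sketch below illustrates this; the helper name `conjunct` is an assumption introduced for the example.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (the halant)


def conjunct(*consonants):
    """Join consonants with the virama, which suppresses each inherent vowel."""
    return VIRAMA.join(consonants)


nda = conjunct("\u0928", "\u0926")   # NA + DA, rendered as one ligature
print(nda)
print([f"U+{ord(ch):04X}" for ch in nda])   # ['U+0928', 'U+094D', 'U+0926']
```

A search component that compares strings therefore sees three code points for what the reader perceives as one written unit.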
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant, ड, is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
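The homorganic rule lends itself to a small normalisation step that maps an anusvar spelling onto the explicit conjunct spelling, so that both spellings of a word match the same query. The following is a sketch under stated assumptions: the table of consonant rows and the function name `expand_anusvara` are introduced for illustration, and only the five stop rows are handled (an anusvar before any other letter is left unchanged).

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA (halant)

# Each row of stop consonants paired with the nasal of the same articulation.
VARGAS = [
    ("कखगघ", "ङ"),  # velar
    ("चछजझ", "ञ"),  # palatal
    ("टठडढ", "ण"),  # retroflex
    ("तथदध", "न"),  # dental
    ("पफबभ", "म"),  # labial
]


def expand_anusvara(word):
    """Replace an anusvar before a stop by the homorganic nasal plus halant."""
    out = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and following:
            nasal = next((n for stops, n in VARGAS if following in stops), None)
            if nasal is not None:
                out.append(nasal + VIRAMA)
                continue
        out.append(ch)
    return "".join(out)


print(expand_anusvara("\u0939\u093F\u0902\u0926\u0940"))  # anusvar spelling of "Hindi"
```

Applying the same function to both the indexed text and the query would make the two spellings of such words compare equal.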
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted v, as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and the short i maatraa (ि) is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign, the viraam (।), which is a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam. [59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in the English language (it can represent at most 128 characters). With the passage of time the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a standard accepted universally. That standard is Unicode. Unicode assigns a unique number, called a code point, in the range U+0000 to U+10FFFF to each character, which is enough to cover the characters of all languages known to man. A code point fits in 21 bits and can be stored in a single 32-bit unit.
Actually there is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized in around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms as defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
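The trade-off between the three encoding forms can be seen by encoding the same Devanagari word in each of them. The sketch below uses only Python's built-in codecs; the choice of sample word and the use of the little-endian variants (which avoid a byte order mark in the output) are assumptions made for the illustration.

```python
# The word "Hindi" in Devanagari: six code points, all in U+0900..U+097F.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"

sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = word.encode(codec)
    sizes[codec] = len(data)
    # Each form decodes back to exactly the same text.
    assert data.decode(codec) == word

print(len(word), sizes)
```

Each Devanagari code point takes three bytes in UTF-8, two in UTF-16 and four in UTF-32, so the six-character word occupies 18, 12 and 24 bytes respectively.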
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and may range from 2 to 4 bytes in size (the original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4); however, the most commonly used characters are at most three bytes long. Non-ASCII characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows the application to differentiate between ASCII and non-ASCII characters. Bytes representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence are used to encode the actual character [61].
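The role of the first byte can be illustrated by reading its leading bits. The sketch below is a simplified reader, not a validating decoder: the function names are assumptions, and overlong or otherwise ill-formed sequences are not rejected.

```python
def sequence_length(lead):
    """Number of bytes in the UTF-8 sequence that starts with byte `lead`."""
    if lead >> 7 == 0b0:
        return 1          # 0xxxxxxx: ASCII, most significant bit not set
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")


def is_continuation(byte):
    """Continuation bytes have the form 10xxxxxx."""
    return byte >> 6 == 0b10


encoded = "\u0915".encode("utf-8")      # DEVANAGARI LETTER KA
print([hex(b) for b in encoded])        # ['0xe0', '0xa4', '0x95']
print(sequence_length(encoded[0]))      # 3
print(all(is_continuation(b) for b in encoded[1:]))  # True
```

Every byte of the Devanagari letter has its most significant bit set, so none of them can be mistaken for an ASCII character.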
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound a). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts, and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated, but highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.
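The parallel code layout means that an analogous letter in another Indic block sits at the same offset from the start of its block. The sketch below exploits this with a naive shift between blocks. It is an illustration only: the block table and the function name `shift_script` are assumptions, and because not every position is assigned in every block, a real transliterator needs per-script exception handling.

```python
import unicodedata

# Start of each 128-code-point block in the Unicode Standard.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
}


def shift_script(text, src, dst):
    """Move every character of block `src` to the same offset in block `dst`."""
    lo = BLOCK_START[src]
    delta = BLOCK_START[dst] - lo
    return "".join(
        chr(ord(ch) + delta) if lo <= ord(ch) < lo + 0x80 else ch
        for ch in text
    )


ka_bengali = shift_script("\u0915", "devanagari", "bengali")
print(unicodedata.name("\u0915"))     # DEVANAGARI LETTER KA
print(unicodedata.name(ka_bengali))   # BENGALI LETTER KA
```

Characters outside the source block, such as spaces or Latin letters, pass through unchanged.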
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
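The three character types and the logical storage order can be inspected directly. The sketch below is only an illustration: the helper names `char_type` and `describe` are hypothetical, and the code point ranges are taken from the Unicode Devanagari code chart (independent vowels U+0904 to U+0914, consonants U+0915 to U+0939, dependent vowel signs U+093E to U+094C).

```python
import unicodedata


def char_type(ch):
    """Classify a Devanagari character by its code point range."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant letter"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    return "other sign"


def describe(word):
    """Return (code point, Unicode name, type) for each stored character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), char_type(ch))
        for ch in word
    ]


# In the word for "Hindi" the vowel sign I is drawn to the left of HA,
# but it is stored after HA, following the logical (phonetic) order.
for row in describe("हिंदी"):
    print(*row)
```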
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode
Consortium (a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India)
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then the Hindi
Unicode support has been a part of all subsequent upgrades of Microsoft's
operating systems. Also, providing Input Method Editor facilities gives users the
option to use different types of keyboards," says Meghashyam Karanam, product
manager, vision and localization, at Microsoft India. The earlier system could
incorporate only 127 characters, which is not enough for the Hindi
Devanagari script. A 16-bit Unicode encoding can represent more than 65,000
characters, and the full Unicode code space is larger still. As
most computers in India use Microsoft's operating system, this ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004 the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.
If Microsoft built the base for
Hindi, Google was ready to put up the superstructure. Realizing the potential of
Indian languages, the California-based company has launched various products in
the past two years. With the Google Hindi and Urdu search engines one can
search all the Hindi and Urdu Web pages available on the Internet, including
those that are not in a Unicode font. "Google offers searching in 13 languages,
Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five
languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most
recent language that Google has added to its offerings," says Rahul
Roy-Chowdhury, product manager at Google India. To use the search function, "users
can type Hindi words in Roman script and a drop-down menu suggests several
Hindi phrases. By selecting the appropriate query, users can search for Hindi
content without even typing in Hindi," says Roy-Chowdhury. Google has more
useful tools for non-English users. Google News is available in Hindi. With the
Google translation engine one can type English words and get a list of suggested
synonyms in Hindi. A transliteration tool allows users to type any word in
English, hit the space bar and get the same word in a different language.
Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles.
What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, out of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach one has to look at Hindi, because in a country of over a
billion people English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life. But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
US, it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership, as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof.
S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language corpora.
It has coordinated with various Indian agencies and universities to
develop corpora of more than 45 million words in the Scheduled Languages of India, which
is also a part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides corpus architecture and tools for Asian
languages. It has a monolingual corpus, which contains approximately 96,157,000
words, and a parallel corpus consisting of 200,000 words of text in English, which
helps in the translation of Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' as a repository for a database of Hindi words and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date back to 1980
and involved prominent institutions such as IIT Kanpur, University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s many new
projects were initiated by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University Kolkata. TDIL started a
consortium-mode project in April 2008 for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is
a rule-based system and has approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti during 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is described as a fully-automatic,
general-purpose, high-quality machine translation (FGH-MT) system. Its software can
translate text from one Indian language into another, based
on Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for translation of Government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in all forms, such as desktop, network and web-based. It is based on the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar. Initially it was domain-specific, covering Personal
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present
it also covers domains like Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliament
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. UNL-based
machine translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence generated by the
system, the user can supply new words, phrases and
sentences to the system and obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with a statistical approach. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports processing
of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
During 2007 Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre Processing,
Translation Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms. For
any query, it searches deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content: news,
entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and also provides an automated translation facility from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and Web site localization. It is also involved in research and
development of corpus creation and collection and in content syndication. Moreover, it
provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneer private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government in e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gujarati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has been aware of this need. Since 1970 the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In addition,
Indian Languages Transliteration (ITRANS), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et
al. 2005). The Department of Information Technology, under the Ministry of
Communication and Information Technology, is also making efforts toward the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defence Research and Development Organisation), the Department of
Atomic Energy, the All India Council of Technical Education and UGC (University Grants
Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide funds for research to development agencies. As an
end result, IndoWordNet was developed for the Indian languages on the pattern of
English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very fast. In
the last year alone, more than 10 to 15 Indian local search engines were
launched: some small and some big, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to give a brief overview of the current scenario.
Search engines that fall in the localized Indian search engine category include
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefindin, along with Ask Laila, which launched a couple of days back. There
are also localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition) or other. It is too early to pass a judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More Info | India online 2008 | India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More Info | India online 2008 | India online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More Info | India online 2008 | India online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance, popular
web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem when dealing with Indian languages because of multiple spellings,
morphological variants of keywords and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, a Hindi query for "world trade center
aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेंटर आतंकवादी हमला) is shown to result in
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that these keywords do exist in search results. But just saying we have a
recall problem may not be sufficient. The obvious next question
is how much of a problem it is. In other words, we need to
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs, and by how much.
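To make "quantifying the problem" concrete, recall is conventionally defined as the fraction of all relevant documents that a search actually returns. The sketch below uses that standard definition; the function name `recall` and the document identifiers are hypothetical and are not taken from the experiments reported here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear in the retrieved set."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical judgments: four documents are relevant to the query.
relevant_docs = ["d1", "d2", "d3", "d4"]

# A query spelling that matches nothing yields zero recall, even though
# relevant documents exist.
print(recall([], relevant_docs))

# A rephrased query that finds one of the four relevant documents.
print(recall(["d1", "d7"], relevant_docs))
```

On the Web the full set of relevant documents is unknown, so in practice recall can only be estimated, for example by pooling the results of several query variants.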
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar. However, grammatical and
structural standards are often not followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, there are about 28
languages spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which results in changes of dialect not only in
speaking but in writing also. Every language attaches markers to a root word to
construct new words: English uses suffixes such as -s, -es and -ing, and Hindi uses
maatraas and suffixes such as एँ and ओं. For example, Yojnaaon (योजनाओं) and
Yojnaayein (योजनाएँ) in Hindi are morphological
variants of the root word Yojnaa (योजना, planning in English). It is desirable to combine all the
morphological variants of a word into a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
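Stemming of this kind can be sketched as suffix stripping. The example below is a deliberately small illustration, not the stemmer used in this study: the suffix list and the minimum stem length are assumptions chosen to cover the Yojnaa example, and a usable Hindi stemmer would need a much larger, linguistically motivated rule set.

```python
# Illustrative plural and oblique suffixes (an assumed, incomplete list).
SUFFIXES = sorted(["ियों", "ओं", "एं", "एँ", "ों", "ें"], key=len, reverse=True)

# Assumed lower bound on stem length, to avoid stripping very short words.
MIN_STEM_LENGTH = 2


def stem(word):
    """Strip the longest matching suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Both variants reduce to the same root word, Yojnaa.
for variant in ["योजनाओं", "योजनाएँ", "योजना"]:
    print(variant, "->", stem(variant))
```

Indexing documents and queries under the stemmed form lets a query for one variant match documents that contain another.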
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages and their multiple dialects, the transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabets. The variety in the alphabet,
the different dialects and the influence of regional and foreign languages have resulted in
spelling variations of the same word. For example, the Hindi word
अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word school can be written in several different ways in Hindi, of which स्कूल
is the standard spelling. When Google is searched with the standard keyword स्कूल
and with a non-standard, phonetically equivalent spelling, it shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by
the other regional languages, which results in phonetic variety of words. For
example, the English word school (स्कूल in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by much of the population in different states of India. For the
Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search
engines should be capable of retrieving results for words that are phonetically
equivalent to the keywords entered. A user may use any spelling of a keyword for searching,
and search engines should be able to support all phonetically equivalent
words.
Also, no particular standard exists for writing the keywords used to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces different
results, i.e. a different set of documents is retrieved, with little overlap.
A native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
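One common way to make phonetically equivalent spellings match is to index every word under a normalized key. The sketch below handles only three kinds of variation (nukta, chandrabindu versus anusvara, and a nasal consonant plus virama versus anusvara); the rules and the name `spelling_key` are assumptions for illustration, and the key is deliberately lossy, so it is meant for matching, not for display. It does not address variants such as ISKOOL.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"

# A nasal consonant (NGA, NYA, NNA, NA, MA) followed by virama and another
# consonant is often written as an anusvara instead.
NASAL_CLUSTER = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")


def spelling_key(word):
    """Return a crude matching key that merges common spelling variants."""
    word = unicodedata.normalize("NFD", word)  # split precomposed nukta letters
    word = word.replace(NUKTA, "")
    word = word.replace(CHANDRABINDU, ANUSVARA)
    return NASAL_CLUSTER.sub(ANUSVARA, word)


# Two spellings of "Hindi" and two spellings of "angreji" share a key.
print(spelling_key("हिन्दी") == spelling_key("हिंदी"))
print(spelling_key("अँग्रेज़ी") == spelling_key("अंग्रेजी"))
```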
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning. A word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
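A retrieval system can compensate for synonym choice by expanding each query term with its known synonyms before matching. The sketch below assumes a hand-built synonym table containing only the ornament example; the table and the function name `expand_query` are hypothetical, and a real system would draw on a lexical resource such as a WordNet for Hindi.

```python
# Hypothetical synonym table, seeded with the ornament example only.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query):
    """Return one list of alternatives per query term.

    Each inner list holds the original term followed by its synonyms, so a
    search engine can treat the alternatives as an OR group.
    """
    return [[term] + SYNONYMS.get(term, []) for term in query.split()]


for alternatives in expand_query("आभूषण"):
    print(" OR ".join(alternatives))
```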
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples
below show this clearly. Consider the following query:
(in English) Women like gold
(in Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to
sleep. In the context of the above query the word सोना means gold, but the query can also be
interpreted as "women like to sleep".
Another query (in English): The common people's choice
(in Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common;
however, in Hindi it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in a
given context is an intricate task. One word can have more than one meaning, and
its meaning depends on the context of the sentence. For example, कर means tax, with
synonyms such as ब्याज, शुल्क, सूद, महसूल and टैक्स, in one context; in another context कर
means hand or arm, with synonyms such as हस्त and बाहु; and in yet another context it
means to do (करना).
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been
localized in India, and some of them appear as if they were native Hindi words.
Indians are sometimes unable to find the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
Indians with no formal education, without being aware of the language those words come from. Most
Indians use these words in English rather than in their native language. English
influences Hindi not only in speaking but in writing too,
and this becomes more evident in Hindi literature on the Web.
The influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S.No | Root words | Listing of keywords (morphological variants): Google, Bing, Guruji | Documents returned: Google, Bing, Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 4.19 Precision values of the three search engines
S.No | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (series: Google, Bing, Guruji; x-axis: query numbers 1.1 to 11.1; y-axis: precision from 0 to 1)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents using the root word, because in general we get better results with keywords in their root form.
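As an illustration of reducing a query to root forms, here is a minimal sketch of a plural-suffix stripper. The suffix list and the function names `to_root` and `root_query` are assumptions made for this example; they are not the method used in the study, and a rule list this small will mis-handle many words.

```python
# Illustrative plural-suffix stripper for Hindi nouns (not a full stemmer).
# Each rule is (suffix to strip, text to put back); longer suffixes come first.
SUFFIX_RULES = [
    ("\u093e\u0913\u0902", "\u093e"),        # -ाओं -> -ा   (समस्याओं -> समस्या)
    ("\u093e\u090f\u0901", "\u093e"),        # -ाएँ -> -ा   (योजनाएँ -> योजना)
    ("\u093e\u090f\u0902", "\u093e"),        # -ाएं -> -ा   (same ending written with anusvara)
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -ियों -> -ी
    ("\u093f\u092f\u093e\u0901", "\u0940"),  # -ियाँ -> -ी
    ("\u094b\u0902", ""),                    # -ों -> ""    (झीलों -> झील)
    ("\u0947\u0902", ""),                    # -ें -> ""
]


def to_root(word: str) -> str:
    """Strip the first matching plural suffix; leave very short words untouched."""
    for suffix, replacement in SUFFIX_RULES:
        # The length guard keeps short function words such as में unchanged.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


def root_query(query: str) -> str:
    """Apply to_root to every whitespace-separated token of a query."""
    return " ".join(to_root(token) for token in query.split())
```

A query rewritten this way could be submitted alongside the original, so that engines that do not normalize inflected forms still receive the root-form keywords.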
It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
These results suggest that only Google indexes documents by the root form of their keywords. Bing and Guruji do not, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high for almost all queries. Since Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
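The precision figure used throughout these tables is the fraction of the top 10 results judged relevant. A minimal sketch of that calculation follows; the function name and the convention of counting missing results as non-relevant are assumptions for illustration.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results that were judged relevant.

    relevance: sequence of booleans in rank order (True = relevant).
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged in top if judged) / k
```

For example, nine relevant documents among the first ten give `precision_at_k([True] * 9 + [False])`, which evaluates to 0.9.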
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any such form for searching, and search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the results and precision offered by the search engine.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Google results for query having keywords | No. of results | Precision@10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2 Precision charts for the phonetic nature of Hindi language
The above table and figure show that the search engine returns documents for every phonetically equivalent form of a Hindi query. It is observed that no particular standard exists for writing the keywords used to fetch Hindi
web data. For each phonetically equivalent keyword in a query the results vary, i.e. a different set of documents is retrieved, with little overlap. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss information relevant to his or her needs.
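One way to treat such spelling variants as equivalent is to map each word to a folded key before comparison. The sketch below is an illustration under stated assumptions, not the approach evaluated above: it folds only two kinds of variation (nuqta versus no nuqta, and chandrabindu versus anusvara), and the names `spelling_key` and `group_variants` are hypothetical.

```python
import unicodedata
from collections import defaultdict

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं


def spelling_key(word: str) -> str:
    """Fold nuqta and chandrabindu/anusvara differences into one key."""
    # NFD splits precomposed nuqta letters (e.g. U+095B) into base + nuqta.
    decomposed = unicodedata.normalize("NFD", word)
    folded = decomposed.replace(NUKTA, "").replace(CHANDRABINDU, ANUSVARA)
    return unicodedata.normalize("NFC", folded)


def group_variants(words):
    """Group words that share the same folded spelling key."""
    groups = defaultdict(list)
    for word in words:
        groups[spelling_key(word)].append(word)
    return dict(groups)
```

With this key, the two spellings महँगाई and महंगाई fall into the same group, as do ज़हरीले and जहरीले. Vowel-length and consonant-cluster variants are not covered and would need further rules.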
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word (आब षण), "ornament" in English, has the following commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
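Table 4.21 Effect of word synonyms on Hindi IR
S.No | Query | Google documents returned | Google precision@10 | Bing documents returned | Bing precision@10 | Guruji documents returned | Guruji precision@10
1 | स न क आब षण (standard Hindi word: आब षण; synonyms: गहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi word: सतर; synonyms: न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1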
Figure 4.3 Comparison of precision values of the three search engines
The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a Hindi-language query. The use of synonyms affects not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that Guruji has fewer documents available or indexes them poorly. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that changing a keyword into its synonym yields equivalent results. It is therefore evident that synonyms of keywords play an important role in Hindi information retrieval.
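The synonym substitution described here can be sketched as a small query-expansion routine. The function name, the one-keyword-at-a-time strategy and the synonym dictionary in the usage note are assumptions for illustration; the study does not specify how its variants were generated.

```python
def expand_with_synonyms(query, synonyms):
    """Return the original query followed by variants in which one
    keyword at a time is replaced by each of its listed synonyms.

    synonyms: dict mapping a keyword to a list of alternative words
              (a hypothetical hand-built dictionary in this sketch).
    """
    tokens = query.split()
    variants = [query]
    for position, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            candidate = tokens[:position] + [alternative] + tokens[position + 1:]
            variant = " ".join(candidate)
            if variant not in variants:  # avoid duplicate queries
                variants.append(variant)
    return variants
```

For instance, calling it with the query "सोने के आभूषण" and the illustrative dictionary `{"आभूषण": ["गहने", "जेवर"]}` would yield three queries, one per word choice, each of which could then be submitted to the search engine.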
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected ones below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context.
Table 4.22 List of randomly selected ambiguous queries
S.No | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war
The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | na | Snake head | na | na
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10
From the results in Tables 4.23 to 4.25 it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other Context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, unwanted context. The results make clear that all the search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art", but the results show that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address ambiguity in keywords in order to obtain better results.
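The "Other Context" counts can be turned into shares of the inspected results, which makes the deviation from relevance directly comparable across engines. The sketch below uses the counts reported above for query 5; the function name and the returned dictionary layout are assumptions for illustration.

```python
def context_shares(intended, alternate, other):
    """Share of inspected results falling in each context.

    intended:  results in the context the user meant
    alternate: results in the keyword's second meaning
    other:     results in neither context
    Returns None when no results were inspected (e.g. "No results found").
    """
    total = intended + alternate + other
    if total == 0:
        return None
    return {
        "intended": intended / total,
        "alternate": alternate / total,
        "other": other / total,
    }


# Counts for query 5 (aggregate / family / other), taken from Tables 4.23 to 4.25.
QUERY_5 = {"Google": (2, 3, 5), "Bing": (0, 2, 8), "Guruji": (0, 0, 10)}
SHARES_5 = {engine: context_shares(*counts) for engine, counts in QUERY_5.items()}
```

Under these counts, only 2 of the 10 Google results for query 5 are in the intended context, and none of the Bing or Guruji results are.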
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but in writing too, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). It has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
English word: Google transliteration, standard Hindi keyword, phonetic equivalents | Google results for each of the four forms, in the same order
Woman: वोभन व भन व भ न व भ न | 42200 101000 3520 34800
Insurance: इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 246000 1390 3710
Cancer: क सय क सय क नसय क नसय | 1880000 35700 6490 1820
Hospital: हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 263200 2440 100
Corruption: कोररसततओन कयपशन कयऩशन क यपशन | 1890 368000 1110 58
Computer: कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 1040000 2610000 537000
University: उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 1070000 1420 3270
Director: डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 735000 97000 8300
Accident: एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 125000 2510 5160
Parliament: ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 38000 639 1120
Specialist: टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 1440 51300 9820
Expert: एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
The above table shows that the search engine returns documents for a single-keyword query and also for every phonetic variant of the keyword, in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are likewise retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be inferred that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.
Table 4.26 (continued)
Policy: ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 395000 69700 289000
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
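The two-level transformation can be sketched as follows. This is only an illustration of the idea, not the software described in the next chapter: the function name and the mapping of Hindi keywords to English equivalents are hypothetical, and the example uses romanized words in place of Devanagari.

```python
from itertools import combinations


def transform_query(query, english_equivalents):
    """Build level-1 and level-2 variants of a Hindi query.

    english_equivalents: dict mapping a Hindi keyword to its English
    equivalent as it would be written in Hindi script (hypothetical data).
    Level 1 replaces exactly one keyword; level 2 replaces two or more.
    """
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in english_equivalents]

    def substitute(positions):
        # Replace only the tokens at the chosen positions.
        return " ".join(
            english_equivalents[token] if i in positions else token
            for i, token in enumerate(tokens)
        )

    level1 = [substitute({i}) for i in replaceable]
    level2 = [
        substitute(set(chosen))
        for size in range(2, len(replaceable) + 1)
        for chosen in combinations(replaceable, size)
    ]
    return level1, level2


# Romanized stand-ins for the example above.
LEVEL1, LEVEL2 = transform_query(
    "videshi nivesh bharat mein",
    {"videshi": "foreign", "nivesh": "investment"},
)
```

Here `LEVEL1` contains the two single-substitution queries and `LEVEL2` contains "foreign investment bharat mein", matching the second-level form given in the example.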
Table 4.27 Queries transformed into two equivalent variants containing a mixture of English and Hindi (tabular representation)
In English | Hindi query | Level 1 | Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4 Queries transformed into two equivalent variants (one chart per Hindi query, 1 to 6, showing results for the Hindi query, Level 1 and Level 2)
The above table and figure show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 therefore present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28 Analysis of precision values (tabular representation)
Precision@10 is given for the Hindi query, Level 1 and Level 2, in that order.
Hindi query | Level 1 | Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter-precision values
The above tables and figures show that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi script. Transforming the queries increased the amount of retrieved data. The relevance of the retrieved data can be seen in the precision columns. For a Hindi query and its transformed variations, the degree of relevance of the documents is in most cases very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; a similar pattern repeats for the rest of the Hindi queries in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variations that reflect English influence. The above table suggests that English-influenced Hindi information is present on the web and is increasing day by day. Including English keywords in Hindi script in the query widens the scope of searching in Hindi and of getting relevant information.
[Figure 4.5: inter-precision chart; x-axis: HQ1 to HQ6 (HQ = Hindi query), each followed by levels L1 and L2; series: Google, Bing, AltaVista]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
थ like t, but dental and aspirated
द like d, but dental and unaspirated
ध like d, but dental and aspirated
न like n in name, but dental
Table 4.6 Dental Consonants
4.2.14 Labial Consonants
Letter Description
प like p, but unaspirated
फ like p, but aspirated
ब like b, but unaspirated
भ like b, but aspirated
म m
Table 4.7 Labial Consonants
4.2.15 Semivowels
Letter Description
य y as in young
र like r, but often rolled
ल l as in lip
व either w or v
Table 4.8 Semivowels
The Hindi r sound is typically a flap. However, some speakers may occasionally trill it, or may even pronounce it closer to an unflapped approximant, like the English r in "red".
4.2.16 Sibilants
Letter Description
श sh as in shave
ष like sh, but retroflex
स s as in save
Table 4.9 Sibilants
4.2.17 Glottal
Letter Description
ह like h, but voiced
Table 4.10 Glottal
4.2.18 Allophony of w and v in Hindi
A phoneme is an equivalence class of atomic, discrete sounds: sounds from different phonemes can produce a difference in meaning, while sounds within the same phoneme cannot produce a difference in meaning when substituted for one another. A phone is simply a distinct sound. For instance, in English the p in the word "pit" and the p in the word "spit" are pronounced distinctly: the former is aspirated, the latter is unaspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native
speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "different sounds").
In Hindi, the sounds associated with the English letters w and v are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.
Thus व may be pronounced as w or v. Some speakers may even pronounce an intermediate sound.
Semi-Allophones j and z in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction between the English j and z sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of w and v, since technically the z sound can be represented distinctly from the j sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced "jo". There is some variation, however, in some words such as ज़्यादा: some speakers pronounce this as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English t or d in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing t or d are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages, such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter Approximation
क़ like k, but pronounced in the back of the mouth
ख़ velar fricative, like "Bach" in German
ग़ velar sound similar to ख़, but voiced
ज़ just as English z, as in zoo
झ़ similar to the s in English "vision"
फ़ just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द हिन्दी
च + छ = च्छ अच्छा
स + त = स्त नमस्ते
ल + ल = ल्ल बिल्ली
म + ब = म्ब लम्बा
फ + त = फ्त मुफ्त
क + य = क्य क्यों
Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical
conjuncts with the following consonant. The first consonant is written on top of
the second consonant. For example:
ट + ट = ट्ट   छुट्टी
ट + ठ = ट्ठ   चिट्ठी
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant
is the first member of a conjunct, it may be written either using a regular
conjunct (e.g. न + द = न्द, हिन्दी) or using an anusvar, which is a dot written above
the horizontal line, to the right side of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant;
in other words, it is articulated in the same location in the mouth as the following
consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar
represents a consonant which is the first letter of a conjunct, whereas the bindu
and chandrabindu represent the nasalization of a vowel. The bindu in a word such as हैं cannot be
considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not
considered a bindu, since it represents a consonant that is the first member of a
conjunct.
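Because the anusvar spelling and the conjunct spelling of the same word are stored as different character sequences, a query for one will not match documents using the other unless they are normalized. The homorganic rule above makes the mapping mechanical. The following is a minimal sketch under the assumption that only the five stop classes (velar, palatal, retroflex, dental, labial) need handling; the function and table names are illustrative.

```python
ANUSVAR = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA

# Members of each consonant class mapped to the nasal of that class.
_CLASS_NASAL = {
    "\u0915\u0916\u0917\u0918": "\u0919",  # velar     -> nga
    "\u091a\u091b\u091c\u091d": "\u091e",  # palatal   -> nya
    "\u091f\u0920\u0921\u0922": "\u0923",  # retroflex -> nna
    "\u0924\u0925\u0926\u0927": "\u0928",  # dental    -> na
    "\u092a\u092b\u092c\u092d": "\u092e",  # labial    -> ma
}
HOMORGANIC = {c: nasal for members, nasal in _CLASS_NASAL.items() for c in members}


def expand_anusvar(word: str) -> str:
    """Rewrite an anusvar before a stop consonant as nasal + virama.

    A dot that is not followed by one of the listed consonants (for
    example a word-final bindu) is left unchanged.
    """
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVAR and nxt in HOMORGANIC:
            out.append(HOMORGANIC[nxt] + VIRAMA)
        else:
            out.append(ch)
    return "".join(out)


hindi_with_anusvar = "\u0939\u093f\u0902\u0926\u0940"
print(expand_anusvar(hindi_with_anusvar))
```

The reverse direction (conjunct to anusvar) works equally well as a canonical form; what matters for retrieval is that documents and queries are normalized the same way.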
Conjuncts with र
As the first member of a conjunct, र appears as a small hook or sickle above
and to the right of the following consonant:
र + म = र्म   शर्म
र + ट + ई = र्टी   पार्टी
As the second member of a conjunct, र is indicated by a diagonal line adjoined to
the vertical line of the preceding consonant:
क + र = क्र   शुक्रिया
म + र = म्र   उम्र
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted v, as follows:
ट + र = ट्र   राष्ट्र
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are
treated as a single unit, so the maatraa ि is placed before the entire conjunct.
There are hundreds of conjuncts, but most of them are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line that
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
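Since both the viraam and the period can end a Hindi sentence, a sentence splitter for Hindi text has to accept either. The following is a minimal sketch, assuming the viraam is stored as the Devanagari danda (U+0964) and that abbreviations containing periods are not a concern; the function name is illustrative.

```python
import re

DANDA = "\u0964"  # DEVANAGARI DANDA, the viraam

# One or more sentence-final marks: danda, period, question or exclamation mark.
_SENTENCE_END = re.compile(f"[{DANDA}.?!]+")


def split_sentences(text: str) -> list:
    """Split text on sentence-final punctuation and drop empty pieces."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


sample = "\u092f\u0939 \u0918\u0930 \u0939\u0948\u0964 \u0915\u094d\u092f\u093e? \u0939\u093e\u0901."
for sentence in split_sentences(sample):
    print(sentence)
```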
4.3 Unicode and Fonts
Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit
encoding of the characters used in English (it can store at most 128
characters). With the passage of time, the need was felt for a single encoding that
could contain enough characters to accommodate all the languages in the world.
To enable sharing of information, this encoding would need to be a standard
accepted universally. That standard is Unicode. Unicode assigns every character a
code point in the range U+0000 to U+10FFFF, which leaves room for more than a
million characters and can potentially give a unique number to each character in
all languages known to man.
There is another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the Universal
Character Set (UCS). Fortunately, the participants of both projects (ISO and
Unicode) realized around 1991 that two different unified character sets were not
what the world needed. They joined their efforts and worked together on
creating a single encoding. Both projects still exist and publish their respective
standards independently, but they have agreed to keep the encodings of the Unicode and
ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value will
be represented in bits when stored in a computer file or transmitted over the
Internet. The Unicode Standard defines three encoding forms for this purpose. The three
encoding forms allow the same data to be transmitted in a byte-oriented, word-oriented or
double-word-oriented format (i.e. in 8-bit, 16-bit or 32-bit units). All three encoding forms encode
the same common character repertoire and can be efficiently transformed into one
another without loss of data. The three encoding forms as defined by the Unicode
Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. It is a way of
transforming all Unicode characters into a variable-length encoding of bytes. It
has the advantages that the Unicode characters corresponding to the familiar
ASCII set have the same byte values as ASCII, and that Unicode characters
transformed into UTF-8 can be used with much existing software without
extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact, and all the
heavily used characters fit into a single 16-bit code unit, while all other characters
are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single
code unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
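The difference between the three encoding forms is easy to observe for a single Devanagari letter. The sketch below uses Python's built-in codecs (the big-endian variants are chosen so that no byte order mark is added) to report how many bytes one character occupies in each form; the helper name is illustrative.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character takes in each encoding form."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


ka = "\u0915"  # DEVANAGARI LETTER KA
print(encoded_sizes(ka))
print(encoded_sizes("A"))
print(ka.encode("utf-8").hex())
```

A Devanagari letter takes three bytes in UTF-8 but two in UTF-16, which is one reason storage estimates for Hindi corpora depend on the encoding form chosen.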
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other software
that rely on US-ASCII values but are transparent to other values. Any document
created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and
take from 2 to 4 bytes each under the current definition of UTF-8 (the original
design allowed sequences of up to 6 bytes); the most commonly used characters
are at most three bytes long. Non-ASCII characters are encoded
as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under the ASCII encoding (since all ASCII characters
stored in bytes have their most significant bit clear). This allows an application
to differentiate between ASCII and non-ASCII characters: bytes representing
non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII
character indicates how many bytes follow for this character and carries its leading bits. All further bytes in
the multibyte sequence carry the remaining bits of the character's code point [61].
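The rule that the first byte announces the length of the sequence can be checked directly on encoded bytes. The following is a minimal sketch that inspects only the high bits of each byte, following the 4-byte limit of current UTF-8; it does not validate overlong or otherwise ill-formed sequences, and the function names are illustrative.

```python
def sequence_length(lead: int) -> int:
    """Return the total length of the UTF-8 sequence started by a lead byte."""
    if lead < 0x80:            # 0xxxxxxx: ASCII, a single byte
        return 1
    if 0xC0 <= lead < 0xE0:    # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead < 0xF0:    # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead < 0xF8:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80


data = "\u0915".encode("utf-8")  # DEVANAGARI LETTER KA
print(sequence_length(data[0]), [is_continuation(b) for b in data[1:]])
```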
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from left to
right. They are all abugidas, in which most symbols stand for a consonant plus an
inherent vowel (usually the sound /a/). Word-initial vowels in many of these
scripts have distinct symbols, and word-internal vowels are usually written by
juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the
inherent vowel, when that occurs, is frequently marked with a special sign. In the
Unicode Standard this sign is denoted by the Sanskrit word virama. In some
languages another designation is preferred. In Hindi, for example, the word hal
refers to the character itself, and halant refers to the consonant that has its inherent
vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally
serves to suppress the inherent vowel of the consonant to which it is applied; it is
a combining character, with its shape varying from script to script. Most of the
scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from
Pakistan in the west to the easternmost islands of Indonesia, are derived from the
ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of
Ashoka from the third century BCE, were written in two scripts, Kharoshthi and
Brahmi. These are both ultimately of Semitic origin, probably deriving from
Aramaic, which was an important administrative language of the Middle East at
that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its
derivatives. The descendants of Brahmi spread with myriad changes throughout
the subcontinent and outlying islands. There are said to be some 200 different
scripts deriving from it. By the eleventh century, the modern script known as
Devanagari was in ascendancy in India proper as the major script of Sanskrit
literature. This northern branch includes such modern scripts as Bengali,
Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam
and Tamil. The major official scripts of India proper, including Devanagari, are
all encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit encoding
of syllable-final consonants. Many of the character names in this group of scripts
represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write
other related languages of India (such as Marathi) and of Nepal (Nepali). In
addition, the Devanagari script is used to write the following languages: Awadhi,
Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi
(Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. The principles of the Indic
scripts are covered in some detail in this introduction to the Devanagari script.
The remaining introductions to the Indic scripts are abbreviated but highlight any
differences from Devanagari where appropriate.
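Because the block occupies one contiguous range, a retrieval system can recognise Devanagari text with a simple range test, for example to route a query to a Hindi index. The following is a minimal sketch; the function names and the idea of using a character ratio are illustrative assumptions, and a range test alone cannot tell Hindi from Marathi, Nepali or the other languages listed above.

```python
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari block U+0900 to U+097F."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters that belong to the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(ch) for ch in chars) / len(chars)


print(devanagari_ratio("\u0939\u093f\u0928\u094d\u0926\u0940 search"))
```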
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The
same character code layout is followed for eight other Indic scripts in the Unicode
Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and
Malayalam. This parallel code layout emphasizes the structural similarities of the
Brahmi scripts and follows the stated intention of the Indian coding standards to
enable one-to-one mappings between analogous coding positions in different
scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other
scripts depart to a greater extent from the Devanagari structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these
scripts to the Devanagari order.
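The parallel layout of the nine ISCII-based blocks means that a character can be moved to the analogous position in another script by adding a fixed offset. The sketch below illustrates this with the block start values from the Unicode code charts. It is only a rough transliteration aid under a stated assumption: not every position is assigned in every script, so a real system would need a check for unassigned code points and script-specific exceptions. The table and function names are illustrative.

```python
# First code point of each 128-position block that follows the ISCII layout.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text: str, source: str, target: str) -> str:
    """Map each character of the source block to the same offset in the target block.

    Characters outside the source block are copied unchanged.
    """
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        out.append(chr(dst + offset) if 0 <= offset < 0x80 else ch)
    return "".join(out)


kamal = "\u0915\u092e\u0932"  # ka, ma, la in Devanagari
print(shift_script(kamal, "devanagari", "gujarati"))
```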
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991, Annex G (Extended Character Set for Vedic). Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
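Because the characters are stored in logical order, orthographic syllables can be recovered from a stored word with a regular expression over the three character types plus the virama. The following is a simplified sketch of the (((C)C)C)V structure, not the full syllable grammar of the Unicode Standard: it ignores word-final viramas, joiner characters and Vedic signs, and the pattern names are illustrative.

```python
import re

CONSONANT = "[\u0915-\u0939\u0958-\u095f]\u093c?"  # consonant letter, optional nuqta
VIRAMA = "\u094d"
VOWEL_SIGN = "[\u093e-\u094c]"                     # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"                       # chandrabindu, anusvar, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

SYLLABLE = re.compile(
    # zero or more "consonant + virama" pairs, a final consonant, optional vowel sign
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}{VOWEL_SIGN}?{MODIFIER}?"
    # or an independent vowel standing on its own
    f"|{INDEPENDENT_VOWEL}{MODIFIER}?"
)


def syllables(word: str) -> list:
    """Split a Devanagari word into orthographic syllables."""
    return SYLLABLE.findall(word)


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(syllables(hindi))
```

Syllable units of this kind are a common choice for tokenization below the word level, since splitting between a consonant and its virama or vowel sign would produce fragments that cannot be displayed or matched sensibly.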
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents. But the advent of Unicode support for Hindi and Urdu
changed all that." The concept of a new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India,
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then, Hindi
Unicode support has been a part of all subsequent upgrades of Microsoft's
operating systems. Also, providing Input Method Editor facilities gives users the
option to use different types of keyboards," says Meghashyam Karanam, product
manager, vision and localization, at Microsoft India. The earlier ASCII-based system could
incorporate only 128 characters, which is not enough for the Hindi
Devanagari script. Unicode, by contrast, has room for more than a million characters. As
most computers in India use Microsoft's operating system, this ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004 the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products. If Microsoft built the base for
Hindi, Google was ready to put up the superstructure. Realizing the potential of
Indian languages, the California-based company has launched various products in
the past two years. With the Google Hindi and Urdu search engines, one can
search all the Hindi and Urdu Web pages available on the Internet, including
those that are not in a Unicode font. "Google offers searching in 13 languages,
Hindi, Tamil, Kannada, Malayalam and Telugu to name a few, Gmail in five
languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most
recent language that Google has added to its offerings," says Rahul Roy-Chowdhury,
product manager at Google India. To use the search function, "users
can type Hindi words in Roman script and a drop-down menu suggests several
Hindi phrases. By selecting the appropriate query, users can search for Hindi
content without even typing in Hindi," says Roy-Chowdhury. Google has more
useful tools for non-English users. Google News is available in Hindi. With the
Google translation engine one can type English words and get a list of suggested
synonyms in Hindi. A transliteration tool allows users to type any word in
English, hit the space bar and get the same word in a different language.
Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles. What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, out of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach one has to look at Hindi, because in a country of over a
billion people English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life. But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
US, it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus
of Indian English. It was developed under the leadership of Prof.
S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language corpora.
It has coordinated with various Indian agencies and universities in
developing corpora of more than 45 million words in the Scheduled Languages of India, which
is also a part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides corpus architecture and tools for Asian
languages. It has a monolingual corpus, which contains approximately 96,157,000
words, and a parallel corpus consisting of 200,000 words of text in English, which
helps in translation into Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus and a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' as a repository for a database of Hindi words and for dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date back to 1980,
involving prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new
projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University, Kolkata. Since April 2008 TDIL has run a
consortium-mode project for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
for translating from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is
a rule-based system and has approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application which is available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It aims at a fully automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can
translate text from one Indian language into another, based
on Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for the translation of government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in all forms, such as desktop, network and web-based. It uses the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar. Initially it was domain-specific, covering Personnel
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present
it also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The UNL-based
machine translation system is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated
sentence generated by the system, the user can supply new words, phrases and
sentences to the system and obtain a newly translated sentence.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical methods. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has a small set of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports processing
of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007 Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 115
system architecture consists of three processing modules Pre Processing
Translation Engine and Post Processing
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
language Search
4.5.12.1 Engine Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query it goes deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and also provides an automated translation facility from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool
for translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language(s).
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development on corpus creation/collection and content syndication. Moreover, it
provides language consultancy. It has developed various applications in Indian
languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games,
Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and to the central government for
e-governance programs. It has developed software for multilingual content
creation for publishing newspapers and has also developed high-quality
Unicode-based fonts for major Indian languages. It has specifically developed
the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in
the DTP sector, corporate offices and the e-Governance program of the
Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970 the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). Also,
Indian Languages Transliteration (ITRANS) was developed by Avinash Chopde;
ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et
al 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also making efforts for the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council for Technical Education and
the UGC (University Grants Commission), are also involved, directly and
indirectly, in research and development of language technology. All these
agencies help develop important areas of research and provide research funds to
development agencies. As an end result, IndoWordNet was developed for the
Indian languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, more than 10-15 Indian local search engines must have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is
really winning. However, we attempt to put forth a brief overview of the current
scenario. Here are some of those that fall in the localized Indian search engine
category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefindin, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Websites     2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More Info: India online 2008, India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More Info: India online 2008, India online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More Info: India online 2008, India online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance, popular
web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no
documents containing this information exist. In reality, these search engines
face a recall problem when dealing with Indian languages, due to multiple
spellings, morphological variants of keywords and English keywords written in
Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for
"world trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") is
shown to return 0 documents in Table 4.15; however, a small rephrasing of the
query in Table 4.16 shows that these keywords do appear in the results of the
second search. But merely saying that there is a recall problem is not
sufficient. The next obvious question is "how much of a problem is it?" In other
words, the problem needs to be quantified. For this purpose we conducted many
experiments to determine at what levels this recall problem occurs, and by how
much.
Hindi Query   Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ   0   0   0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम   0   0   0
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query   Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर   8820   92   12
वलडम टर ड सटय आतॊकीअटक   331   10   1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम   708   50   1
बायतीम सॊटथान टवाटम शिा औय िोध   37100   7400   93
Table 4.16 Improved Recall
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are often not followed, for various reasons. One of the
reasons is the language diversity in India. Including Hindi, about 28
languages are spoken in India, and Hindi, being the official language of India,
is influenced by the regional languages, which results in changes of dialect not
only in speaking but in writing also. Every language attaches markers to a root
word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS
such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in
Hindi are morphological variants of the root word Yojnaa (योजना; Planning in
English). It is desirable to combine all the morphological variants of a word
into a single canonical form. This process is called word stemming, and the
canonical form is called the root word or base word.
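The idea of reducing morphological variants to one canonical form can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix list and the function name `light_stem` are hypothetical, cover just the plural endings seen in the Yojnaa example and a few similar ones, and are not the stemmer used in this study.

```python
# Minimal suffix-stripping sketch (hypothetical suffix list, for illustration).
# A practical Hindi stemmer needs a much larger, linguistically checked list.
SUFFIXES = (
    "\u0913\u0902",  # ओं  as in योजनाओं
    "\u090f\u0902",  # एं  as in योजनाएं
    "\u090f\u0901",  # एँ  as in योजनाएँ
    "\u094b\u0902",  # ों  as in झीलों
    "\u0947\u0902",  # ें  as in झीलें
)
MIN_STEM_LEN = 2  # assumed guard so very short words (e.g. में) are left alone


def light_stem(word: str) -> str:
    """Strip one known plural suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["योजना", "योजनाओं", "योजनाएं"]:
        print(w, "->", light_stem(w))
```

With such a function, a query term and an indexed term can be compared by their stems, so that all variants of a root word match the same documents.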
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, transliteration of proper names,
words borrowed from regional and foreign languages, and the phonetic variety of
the Indian language alphabet. Together these have resulted in spelling
variations of the same word. For example, the Hindi word अंग्रेजी (angrējī,
meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word school can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword for school (स्कूल) and
for a non-standard Hindi phonetic equivalent (सक र), Google shows 69 million
results for the former and 14 million for the latter. Hindi is influenced by
the other regional languages, which results in phonetic variety of words; for
example, the English word school (स्कूल in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by the majority of the population in different states of India.
For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found.
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any keyword for
searching, and search engines should be able to support all phonetically
equivalent words.
Also, no particular standard exists for writing the keywords used to fetch
Hindi web data. For each phonetically equivalent keyword in the query, the
results vary, i.e. a different set of documents is retrieved, with little
repetition. The native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information.
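One common way to support phonetically equivalent spellings is to map every spelling to a normalized key before matching. The following Python sketch shows the idea. The three rules used here (dropping the nukta, writing chandrabindu as anusvara, and writing a half nasal consonant before another consonant as anusvara) are assumptions chosen for illustration, and the names `phonetic_key` and `group_variants` are hypothetical; this is not the method evaluated in this chapter.

```python
import re
import unicodedata
from collections import defaultdict

CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
NUKTA = "\u093c"         # dot below, as in ज़
VIRAMA = "\u094d"        # ् (halant)
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म

# A nasal consonant + virama that is followed by another consonant.
_HALF_NASAL = re.compile("[" + NASALS + "]" + VIRAMA + "(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Return a normalized spelling used only for comparing variants."""
    w = unicodedata.normalize("NFD", word)  # splits precomposed nukta letters
    w = w.replace(NUKTA, "")
    w = w.replace(CHANDRABINDU, ANUSVARA)
    return _HALF_NASAL.sub(ANUSVARA, w)


def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = defaultdict(list)
    for w in words:
        groups[phonetic_key(w)].append(w)
    return dict(groups)


if __name__ == "__main__":
    print(group_variants(["आन्दोलन", "आंदोलन", "महँगाई", "महंगाई"]))
```

A search front end could expand a query keyword to every spelling in its group, or index documents by the key, so that the choice of spelling no longer decides which documents are found.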
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning. A word also often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the
following commonly spoken synonyms: गहना, जेवर, अलंकार.
4.8.4 Ambiguous words
Ambiguous words reduce the relevancy of the results. The examples
below show this very clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e.
to sleep. In the context of the above query the word सोना means gold, but the
query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query आम means common; however, in
Hindi it also means mango. So the above query can also be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in
a given context is an intricate task. One word can have more than one meaning,
and the meaning depends on the context of the sentence. Example: कर (tax) has
the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कर
(hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कर
means to do (करना).
4.8.5 Influence of English on Hindi Information retrieval
English has influenced Indian languages in many ways, including the
pronunciation of Hindi words. Many English words have been localized in India,
and some of them appear as if they were native Hindi words. Indians are
sometimes unable to recall the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
uneducated Indians, without being aware of the language those words come from.
Most Indians use these words in English rather than in their native language.
English has influenced Hindi not only in speaking but in writing too. This
becomes more evident when we consider Hindi literature on the web. The
influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from this set, which throw
light on the effect of the root word on the performance of Hindi language
search engines. Table 4.18 shows examples of the effect of morphological
factors on Hindi queries.
Table 4.17 List of Hindi queries
S No   Query in Hindi   Meaning in English
1   ब यतवष मवन   Indian rain forest
1.1   ब यत मवष मवनो   Indian rain forests
2   हव ईद घमटन क क यण   Reason for air crash
2.1   हव ईद घमटन ओ क क यण   Reason for air crashes
3   ब यतभफ रीज न व रीब ष   Language spoken in India
3.1   ब यतभफ रीज न व रीब ष ए   Languages spoken in India
3.2   ब यतभफ रीज न व रीब ष ओ   Languages spoken in India
4   षवर पतह न ऩयझ र   Lake on the verge of extinction
4.1   षवर पतह न ऩयझ र   Lakes on the verge of extinction
5   पर क ततकआऩद   Natural calamity
5.1   पर क ततकआऩद ए   Natural calamities
6   ऩ कीपरज तत   Bird species
6.1   ऩकषमोकीपरज तत   Bird's species
6.2   ऩकषमोकीपरज ततम   Bird's species
7   क षषसभसम   Agriculture problem
7.1   क षषसभसम ओ   Agriculture problems
8   कीटन शकक इसत भ र   Use of pesticide
8.1   कीटन शकोक इसत भ र   Use of pesticides
9   भ नमसकय ग   Mental illness
9.1   भ नमसकय चगमो   Mental patients
9.2   भ नमसकय ग   Mental Patient
10   गर भ णषवक सम जन   Policy for village
10.1   गर भ णषवक सम जन ओ   Policies for village
10.2   गर भ णषवक सम जन ए   Policies for village
11   परभ खक षषक दर   Major agricultural office
11.1   परभ खक षषक नदरो   Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 4.19 Precision values of the three search engines
S No   Query   Precision (top 10): Google   Bing   Guruji
1.1   ब यतवष मवन   0.5   0.3   0.1
1.2   ब यत मवष मवनो   0.3   0.1   0.1
2.1   हव ईद घमटन क क यण   0.5   0.3   0.1
2.2   हव ईद घमटन ओ क क यण   0.3   0.3   0.1
3.1   ब यतभफ रीज न व रीब ष   0.9   0.5   0.5
3.2   ब यतभफ रीज न व रीब ष ए   0.5   0.4   0.3
3.3   ब यतभफ रीज न व रीब ष ओ   0.5   0.2   0.2
4.1   षवर पतह न ऩयझ र   0.5   0.3   0.2
4.2   षवर पतह न ऩयझ र   0.5   0.3   0
5.1   पर क ततकआऩद   1   0.4   0.3
5.2   पर क ततकआऩद ए   0.6   0.6   0.1
6.1   ऩ कीपरज तत   1   0.7   0.2
6.2   ऩकषमोकीपरज तत   0.9   0.6   0.1
6.3   ऩकषमोकीपरज ततम   0.9   0.6   0.1
7.1   क षषसभसम   0.7   0.3   0.2
7.2   क षषसभसम ओ   0.7   0.4   0.2
8.1   कीटन शकक इसत भ र   1   0.8   0.2
8.2   कीटन शकोक इसत भ र   0.9   0.6   0.2
9.1   भ नमसकय ग   0.7   0.6   0.4
9.2   भ नमसकय चगमो   0.9   0.6   0.3
9.3   भ नमसकय ग   0.9   0.5   0.4
10.1   गर भ णषवक सम जन   0.9   0.7   0.3
10.2   गर भ णषवक सम जन ओ   1   0.6   0
10.3   गर भ णषवक सम जन ए   0.8   0.6   0.2
11.1   परभ खक षषक दर   0.5   0.4   0
11.2   परभ खक षषक नदरो   0.4   0.2   0.2
Figure 4.1 Precision values of the three search engines (chart of precision,
from 0 to 1, per query number for Google, Bing and Guruji)
It has been observed that all three search engines return more documents
when the query is submitted with the root word. This justifies searching for
documents using the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries listed in the table above.
From the above results it appears that only Google indexes document keywords
in their root form. Bing and Guruji do not index in that form, which is the
reason why the number of documents retrieved in their case is smaller than for
Google. The overall comparison of results from the three search engines in the
tables above shows that, in general, the quantity of results retrieved
increases when keywords are used in their root form. For search engines,
however, the quality of results is more important than the quantity. Figure 4.1
and Table 4.19 compare the precision values of the three search engines. The
precision value is calculated from the top 10 results of each search engine. On
closely observing the results, we can say that the precision value for Google
is high in almost all queries. Since, as mentioned above, Google indexes
keywords in their root form, the relevancy of its results is also higher than
that of the other two search engines, which indicates that not only the
quantity but also the quality of results is affected by morphological
variations in the keywords.
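The precision value over the top 10 results described above can be written as a short function. This is a minimal sketch, assuming each result has been judged relevant (1) or not relevant (0) by hand and that the count of relevant results is always divided by 10; the function name `precision_at_k` is our own.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top k results judged relevant.

    relevance: list of 0/1 judgments in ranked order (1 = relevant).
    If fewer than k results are returned, the missing ones count as
    not relevant (an assumption of this sketch).
    """
    top = relevance[:k]
    return sum(top) / k


if __name__ == "__main__":
    judgments = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # hypothetical judgments
    print(precision_at_k(judgments))
```

Under this definition, a precision of 0.5 in Table 4.19 means that five of the first ten results were judged relevant to the query.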
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any keyword
for searching, and search engines should be able to support all phonetically
equivalent words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision obtained.
Table 4.20 Results of the search engine on Phonetic nature of Hindi language
Columns: Hindi query (with bold standard keywords); phonetic variations of the
keywords; then, for each tested combination, the keywords in the query, the
number of Google results and the precision (top 10).
Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर   97   0.9
ससजिमो जहयीर   632   0.9
सिबजमो िहयीर   194   0.9
Query 2: आसभान छ त भहॊगाई
Phonetic variations: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई   35300   1.0
आसभ न भहॉगाई   1040   1.0
आसभ भह ग ई   14   0.6
आसभाॊ भह ग ई   563   0.7
Query 3: भरषटाचाय स आिादी
Phonetic variations: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी   211000   0.8
भरशटाचायआज दी   214   0.6
बयषटाचाय आज दी   447   0.7
भरषटट च यआिादी   1090000   0.9
भरशट च य आिादी   1040   0.7
बयषटट च यआिादी   1190   0.8
Query 4: अननाहिाय क आनदोरन
Phonetic variations: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन   84700   0.3
अनन हज य आॊदोरन   85100   0.8
अनन हज य क आॉदोरन   78   0.6
अनना हिाय क आनद रन   399   0.5
अनन हिाय क आनद रन   3260000   1.0
Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations: फ य जग यी फय जग यी फ य जा यी
फयोिगायी   9650   0.9
फयोिगायी   80600   1.0
फयोिगायी   170   0.7
फयोिगायी   30   0.5
Figure 4.2 Precision Charts for Phonetic nature of Hindi language (one chart
per query, Query No 1 to Query No 5)
In the above table and figure it can be clearly seen that search engines return
a handful of documents for the various phonetically equivalent Hindi queries.
It is observed that no particular standard exists for writing the keywords used
to fetch Hindi web data. For each phonetically equivalent keyword in the query,
the results vary, i.e. a different set of documents is retrieved, with little
repetition. From the precision chart it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost the
same. The native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning. A word also often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the
following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values obtained from the
three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
Columns for each numbered variant: S No; query; documents returned by Google;
Google precision per 10; documents returned by Bing; Bing precision per 10;
documents returned by Guruji; Guruji precision per 10.
1   Query: स न क आब षण   Standard Hindi words / synonyms: आब षणगहन
1.1   स न क आब षण   217000   0.8   3250   0.7   381   0.5
1.2   स न क गहन   188000   0.8   2590   0.6   389   0.5
1.3   स न क ज वय   78900   0.8   1670   0.7   311   0.6
1.4   स न क अर क य   9490   0.5   633   0.4   70   0
1.5   स न क आबयण   493   0.5   38   0.3   1   0
2   Query: क र फ दर   Standard Hindi words / synonyms: फ दर फ दर
2.1   क र फ दर   233000   0.7   7510   0.7   733   0.3
2.2   क र भ घ   40700   0.9   1500   0.8   99   0.6
2.3   क र जरधय   1570   0.6   54   0.6   2   0.2
3   Query: सतर सशिकतकयण   Standard Hindi words / synonyms: सतर न यी
3.1   सतर सशिकतकयण   9950   0.9   1570   0.7   760   0.6
3.2   न यीसशिकतकयण   29300   0.9   1910   0.9   736   0.4
3.3   भहहर सशिकतकयण   96300   0.9   5160   0.8   1091   0.3
3.4   औयतसशिकतकयण   7670   0.8   680   0.7   510   0.2
4   Query: मसक दयक अ हक य   Standard Hindi words / synonyms: अ हक य
4.1   मसक दयक अ हक य   1990   0.4   18   0.3   60   0.6
4.2   मसक दयक अमबभ न   2400   0.6   304   0.2   16   0.1
4.3   मसक दयक घभ ड   495   0.5   54   0.6   9   0.1
5   Query: व रग ओ   Standard Hindi words / synonyms: व
5.1   व रग ओ   6960   1   698   0.8   29   0.5
5.2   ऩ डरग ओ   13400   1   1080   0.9   143   0.9
5.3   दयखतरग ओ   481   0.5   19   0.5   0   0
6   Query: आ खद न   Standard Hindi words / synonyms: आ ख
6.1   आ खद न   34000   0.7   3690   0.5   312   0.3
6.2   न तरद न   77500   1   3240   1   159   0.9
6.3   च द न   2450   0.8   427   0.1   36   0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a query in Hindi. Not only
the quantity of documents returned but also their quality is affected by
using synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the fewest;
the reason may be the availability of fewer documents or poor indexing.
However, we are more interested in the quality of results than in the quantity.
As far as quality is concerned, it can be clearly seen that Google and Bing
provide better-quality results than Guruji. In the average case Google still
stands first, i.e. the precision values for Google are higher than those for
Bing and Guruji. Thus it becomes clear that by changing a keyword into its
synonym, equivalent results can be obtained. Therefore synonyms of keywords
play an important role in Hindi information retrieval.
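The use of synonyms discussed above amounts to query expansion: each keyword that has known synonyms is replaced in turn to produce additional queries. A minimal Python sketch follows. The synonym table `SYNONYMS` is a hypothetical, hand-made example modelled on the first query of Table 4.21, and `expand_query` is our own helper name; a real system would draw synonyms from a resource such as a WordNet for Hindi and would also handle inflection.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार", "आभरण"],
}


def expand_query(query, synonyms):
    """Return the original query followed by one variant per synonym.

    Only whole, whitespace-separated tokens are replaced; inflected
    forms of a keyword are not recognised in this sketch.
    """
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for syn in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [syn] + tokens[i + 1:]))
    return variants


if __name__ == "__main__":
    for q in expand_query("सोने के आभूषण", SYNONYMS):
        print(q)
```

The results of all variants can then be merged, which is one way a search engine could return the documents that Table 4.21 shows are otherwise found only under a particular synonym.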
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected queries. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns hold the meaning of the query in English with respect to the
ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S No   Query   For keyword as   In English   For keyword as   In English
1   नारी को सोना पसंद है   सोना (Gold)   Women like Gold   सोना (To sleep)   Women like to sleep
2   आम लोगों की पसंद   आम (common)   Common man's choice   आम (Mango)   Mango is people's choice
3   बाल विकास और पोषण   बाल (Children)   Child Development and Nutrition   बाल (Hair)   Hair Development and Nutrition
4   सपेरों का फन   फन (Art)   Art of snake charmers   फन (Snake head)   Snake charmer's snake head
5   युद्ध में कुल विनाश   कुल (Aggregate)   Aggregate destruction in wars   कुल (family)   Destruction of families in war
The ambiguous queries listed above are tested against three search engines:
Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query   Ambiguous keyword   Documents returned   Context   Results found   Context   Results found   Other context
1   सोना   50800   Gold   5   To sleep   2   3
2   आम   488000   Common   3   Mango   3   4
3   बाल   2900000   Children   7   Hair   3   0
4   फन   184   Art   0   Snake head   10   0
5   कुल   17800   Aggregate   2   Family   3   5
Table 4.24 Ambiguity test for Bing
Query   Ambiguous keyword   Documents returned   Context   Results found   Context   Results found   Other context
1   सोना   2680   Gold   2   To sleep   2   6
2   आम   17800   Common   3   Mango   3   4
3   बाल   4030   Children   6   Hair   2   2
4   फन   25   Art   0   Snake head   9   1
5   कुल   1900   Aggregate   0   Family   2   8
Table 4.25 Ambiguity test for Guruji
Query   Ambiguous keyword   Documents returned   Context   Results found   Context   Results found   Other context
1   सोना   109   Gold   0   To sleep   0   10
2   आम   6756   Common   3   Mango   0   7
3   बाल   635   Children   5   Hair   2   3
4   फन   No Results Found   Art   na   Snake head   na   na
5   कुल   84   Aggregate   0   Family   0   10
From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In the tables, the last column, labeled "Other
Context", holds the number of results which are not relevant to the query
supplied, or which contain the keyword in some other, non-required context.
From the results it is clear that all the search engines return documents in
different contexts. Therefore it can be said that search engines underperform
when supplied with ambiguous queries. The numbers in the "Other Context" column
signify the deviation from relevance. For example, for the query
युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column
contains 5 documents for Google, 8 documents for Bing and all 10 documents for
Guruji.
For another query, सपेरों का फन (art of snake charmers; other context: snake
charmer's snake head), the retrieved documents are expected to be in the
context "art", but from the above results it can be seen that Google returns
all 10 documents in the non-required context (snake head) and Bing returns 9,
whereas Guruji fails to retrieve even a single document. In this scenario it
becomes important for search engines to address the issue of ambiguity in
keywords in order to obtain better results.
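The counts in Tables 4.23 to 4.25 can be summarised as the share of top-10 results that fall in the intended context. The sketch below does this in Python, using the values from those tables. It assumes that the first context listed for each query in Table 4.22 is the intended one and that the Guruji query with no results is left out of the average; the names `COUNTS` and `intended_share` are ours.

```python
# (intended context, other meaning, other context) per query, taken from
# Tables 4.23-4.25; None marks the query for which Guruji returned no results.
COUNTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_share(rows):
    """Average fraction of judged results that are in the intended context."""
    judged = [r for r in rows if r is not None]
    intended = sum(r[0] for r in judged)
    total = sum(sum(r) for r in judged)
    return intended / total


if __name__ == "__main__":
    for engine, rows in COUNTS.items():
        print(engine, round(intended_share(rows), 2))
```

On these five queries the share of results in the intended context stays at or below about one third for every engine, which is the underperformance on ambiguous queries described above.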
4.13 Discussion: Influence of English on Hindi Information retrieval
English has influenced Hindi not only in speaking but in writing too. This
becomes more evident when we consider Hindi literature on the web. The
influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval, as the following example
illustrates.
Example: the English word exercise is written in Hindi as (एकसयस इज). The
word exercise (एकसयस इज) has the following phonetic variations in Hindi:
एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following the kind of phonetic
variation shown in this example, a variety of popular keywords and queries
from various domains have been tested in the experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
English Words   Google Transliteration, Standard Hindi Keywords, Phonetic Equivalents   Search Engine Google (results for each of the four keywords)
Woman   वोभन व भन व भ न व भ न   42200   101000   3520   34800
Insurance   इनसयाॊस इ शम यस इ शम य स इनशम य नस   1220   246000   1390   3710
Cancer   क सय क सय क नसय क नसय   1880000   35700   6490   1820
Hospital   हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर   404000   263200   2440   100
Corruption   कोररसततओन कयपशन कयऩशन क यपशन   1890   368000   1110   58
Computer   कॊ तमटय कमपम टय कमपम टय क पम टय   4450000   1040000   2610000   537000
University   उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी   1540   1070000   1420   3270
Director   डियकटय ड मय कटय ड इय कटय ड मय कटय   4600   735000   97000   8300
Accident   एसकसिट एकस डट एकस ड नट ऐिकसडट   26300   125000   2510   5160
Parliament   ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट   3240   38000   639   1120
specialist   टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट   143   1440   51300   9820
Expert   एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम   269000   8730   182   8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that the search engine returns
documents for single-keyword queries, and that documents are also returned, in
large numbers, for all phonetic variants of the keywords. It can also be seen
that people have their own ways of writing Hindi words and that no standard is
followed for storing Hindi data on the web. Documents are retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above
table, the column with bold Hindi entries shows the keywords obtained using the
Google transliteration tool. Table 4.26 shows that transliteration does not
provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be म तनवमसमटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. It is
evident from the table above that 1540 documents have been retrieved for
the wrong keyword उतनव मसमतम (University), and the same holds for other
single-word queries: Insurance इनस य स, Parliament ऩमरमअभट,
Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be
inferred that Hindi website developers make use of unchecked and non-standard
transliteration, which makes the Hindi IR process a difficult task.
In the next example Multiword Hindi queries are selected to test the effect
of English influence on Hindi IR on precision and quantity of documents
retrieved The Hindi query is transformed into it variants by the software (design
and working discussed in next chapter) by replacing the Hindi keywords with
English keywords written in Hindi without changing the meaning of the query
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged.
Example: The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as follows:
Level 1: फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
Level 2: फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent queries that mix English and Hindi while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
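The two-level transformation described above can be sketched as follows. This is only an illustration under stated assumptions, not the software described in the next chapter: the lookup table `ENGLISH_IN_HINDI`, the function name `transform_query` and the Devanagari spellings used in it are hypothetical.

```python
from itertools import combinations

# Hypothetical lookup: Hindi keyword -> English keyword written in Devanagari.
# The spellings are illustrative; real data would hold several variants per word.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_query(query, lookup=ENGLISH_IN_HINDI):
    """Return (level1, level2) lists of query variants.

    Level 1 replaces exactly one Hindi keyword with its English equivalent;
    level 2 replaces two or more keywords. Word order is left unchanged.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in lookup]

    def substitute(positions):
        return " ".join(
            lookup[word] if i in positions else word
            for i, word in enumerate(words)
        )

    level1 = [substitute({i}) for i in replaceable]
    level2 = [
        substitute(set(combo))
        for size in range(2, len(replaceable) + 1)
        for combo in combinations(replaceable, size)
    ]
    return level1, level2


if __name__ == "__main__":
    first, second = transform_query("विदेशी निवेश भारत में")
    for variant in first + second:
        print(variant)
```

Each variant would then be submitted to the search engine as a separate query, which is how the result counts at the two levels can be compared with those of the original query.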
Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
Query in English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520, 8360, 1020
Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800, 37200, 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000, 99100, 1770
Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000, 51600, 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000, 527, 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000, 19700, 47000
Figure 4.4 Transformed queries into two equivalent senses [chart of Google result counts for Hindi Query 1 to Hindi Query 6, each at the original query, Level 1 and Level 2]
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 below present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.
Table 4.28 Analysis of precision values (tabular representation)
Precision at 10 is listed for each search engine in the order Hindi query (HQ), Level 1 (L1), Level 2 (L2).
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google (HQ, L1, L2) | Bing (HQ, L1, L2) | AltaVista (HQ, L1, L2)
स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be seen that Hindi data of a similar nature can be mined against Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, users may miss relevant information, because they may not be aware that such information is present on the web and may be unable to formulate the variant queries that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
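The precision values discussed above are precision at 10: the fraction of the first ten retrieved documents that are judged relevant. A minimal sketch of the calculation is given below; the function name `precision_at_k` and the list of relevance judgements are illustrative assumptions, not part of the study's software.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of booleans in ranked order (True = relevant).
    If fewer than k results were returned, the missing ones count as
    non-relevant, so the denominator stays k.
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


if __name__ == "__main__":
    # Hypothetical judgements: 9 of the first 10 documents are relevant.
    judgements = [True] * 9 + [False]
    print(precision_at_k(judgements))
```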
[Figure 4.5 chart: Inter Precision Chart. Horizontal axis: HQ1, L1, L2 through HQ6, L1, L2 (HQ = Hindi Query, L = Level). Series: Google, Bing, AltaVista.]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
The Hindi r sound is typically a flap. However, some speakers may trill the r sound occasionally, or may even occasionally pronounce it closer to an unflapped approximant, as in the English r in "red".
4.2.16 Sibilants
Letter Description
श sh as in "shave"
ष like sh but retroflex
स s as in "save"
Table 4.9 Sibilants
4.2.17 Glottal
Letter Description
ह like h but voiced
Table 4.10 Glottal
4.2.18 Allophony of w and v in Hindi
A phoneme is an equivalence class of sounds: substituting one phoneme for another can produce a difference in meaning, whereas substituting one member of the class for another cannot. A phone is simply a distinct sound. For instance, in English the p in the word "spit" and the p in the word "pit" are pronounced distinctly: the former is unaspirated, the latter is aspirated. Thus they are two distinct phones. However, they are both members of the same phoneme, since substituting one for the other can never produce a difference in meaning, even though the substitution may be perceived as slightly awkward by native speakers. Two distinct phones which are both members of the same phoneme are called allophones (from the Greek for "other sound").
In Hindi, the sounds associated with the English letters w and v are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.
Thus व may be pronounced as w or v. Some speakers may even pronounce an intermediate sound.
Semi-Allophones j and z in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction between the English j and z sounds either, but will typically pronounce words consistently. This situation is not quite the same as that of w and v, since technically the z sound can be represented distinctly from the j sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in some words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English t or d in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation is between the Devanagari retroflex and dental positions, although the English pronunciation will sound much closer to the retroflex pronunciation to Hindi speakers. English loanwords containing t or d are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter Approximation
क़ like k but pronounced in the back of the mouth
ख़ velar fricative, like the ch in "Bach" in German
ग़ velar sound similar to ख़ but voiced
ज़ just as English z, as in "zoo"
झ़ similar to the s in English "vision"
फ़ just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
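Because writers often omit the nuqta, the same word can appear on the web with and without it (for example ज़्यादा and ज्यादा). One way to make such spellings comparable, sketched here purely as an illustration, is to remove the nuqta before matching. The helper name `strip_nuqta` is hypothetical; the sketch relies on Unicode canonical decomposition, which splits the precomposed nuqta letters (such as U+095B) into a base letter followed by U+093C.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nuqta(text):
    """Return `text` with every nuqta removed.

    NFD first splits precomposed letters such as U+095B (za) into the
    base consonant plus U+093C, so both spellings of a nuqta letter are
    handled. Note that this also affects the few other Devanagari
    letters whose decomposition contains a nuqta.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(NUQTA, ""))


if __name__ == "__main__":
    with_nuqta = "\u095b\u094d\u092f\u093e\u0926\u093e"     # zyada, precomposed za
    without_nuqta = "\u091c\u094d\u092f\u093e\u0926\u093e"  # jyada
    print(strip_nuqta(with_nuqta) == without_nuqta)
```

Folding the two forms together trades a little precision for recall, which may be acceptable given that, as noted above, most of the nuqta forms are not pronounced distinctly.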
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
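In Unicode text both methods are stored the same way: a consonant, then the virama (halant) code point U+094D, then the next consonant. Whether the result is displayed as a ligature or with a visible halant is decided by the font and rendering engine. The short sketch below illustrates this; the helper name `conjunct` is only illustrative.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (the halant)


def conjunct(*consonants):
    """Join consonants with the virama so they can render as one conjunct."""
    return VIRAMA.join(consonants)


if __name__ == "__main__":
    k_ssa = conjunct("\u0915", "\u0937")  # KA + SSA, usually shown as one ligature
    print(k_ssa)
    for ch in k_ssa:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

For retrieval this means that a conjunct is never a single code point: matching and tokenization have to treat the consonant, virama, consonant sequence as a unit.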
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted, and then the modified consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ़ + त = फ़्त (मुफ़्त)
क + य = क्य (क्यों)
Note that in the last two examples, although neither क nor फ़ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be observed. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above the horizontal line to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
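Since the anusvar and the explicit nasal conjunct are two spellings of the same word, a retrieval system may want to map one onto the other before comparing strings. The sketch below expands an anusvar into the homorganic nasal plus virama, following the rule described above. The function name `expand_anusvara` and the decision to leave the anusvar untouched before consonants outside the five stop classes are assumptions made for illustration.

```python
import re

ANUSVARA = "\u0902"
VIRAMA = "\u094d"

# First stop of each class (velar, palatal, retroflex, dental, labial)
# paired with the nasal consonant of the same class.
_CLASSES = [
    (0x0915, 0x0919),  # ka..gha  -> nga
    (0x091A, 0x091E),  # ca..jha  -> nya
    (0x091F, 0x0923),  # tta..ddha -> nna
    (0x0924, 0x0928),  # ta..dha  -> na
    (0x092A, 0x092E),  # pa..bha  -> ma
]
HOMORGANIC = {
    chr(cp): chr(nasal)
    for first, nasal in _CLASSES
    for cp in range(first, first + 4)
}
_PATTERN = re.compile(ANUSVARA + "([" + "".join(HOMORGANIC) + "])")


def expand_anusvara(text):
    """Rewrite anusvar + stop consonant as nasal + virama + stop consonant."""
    return _PATTERN.sub(
        lambda m: HOMORGANIC[m.group(1)] + VIRAMA + m.group(1), text
    )


if __name__ == "__main__":
    hindi_with_anusvar = "\u0939\u093f\u0902\u0926\u0940"  # Hindi, anusvar spelling
    print(expand_anusvara(hindi_with_anusvar))
```

A bindu that nasalizes a vowel is not followed by one of these stop consonants within the word, so it is left unchanged by this rule.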
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted v, as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct. There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign, the viraam (।), which is a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in the English language (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a code point in the range U+0000 to U+10FFFF, which fits within 32 bits and can potentially give a unique number to each character in all known languages.
There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit code units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms, as defined by the Unicode Consortium, are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
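The difference between the three encoding forms can be seen by encoding the same character in each of them. The sketch below uses Python's built-in codecs; the function name `encoded_sizes` is illustrative, and the big-endian variants are chosen only so that no byte order mark is added to the output.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each encoding form."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }


if __name__ == "__main__":
    ka = "\u0915"  # DEVANAGARI LETTER KA
    for name in ("utf-8", "utf-16-be", "utf-32-be"):
        print(name, ka.encode(name).hex())
    print(encoded_sizes(ka))
    print(encoded_sizes("A"))
```

A Devanagari letter therefore costs three bytes in UTF-8 but only two in UTF-16, while an ASCII letter costs one byte in UTF-8 and two in UTF-16.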
4.3.2 UTF-8
UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and take from 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current definition of UTF-8 limits them to 4); the most commonly used characters are at most three bytes long. Non-ASCII characters are encoded as follows.
A non-ASCII character is encoded as a sequence of several bytes, each of which has its most significant bit set. This means that all bytes representing non-ASCII characters are invalid under the ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. The remaining bytes in the multibyte sequence carry the rest of the bits of the character's code point [61].
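The role of the first byte can be illustrated with a small function that reads the leading bits and reports the length of the sequence. This is a sketch of the bit patterns only (it does not validate the continuation bytes), and the name `utf8_sequence_length` is illustrative.

```python
def utf8_sequence_length(first_byte):
    """Return the total length in bytes of the UTF-8 sequence that starts
    with `first_byte` (an int from 0 to 255)."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: ASCII
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte; other patterns are not valid lead bytes.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


if __name__ == "__main__":
    data = "\u0915".encode("utf-8")  # DEVANAGARI LETTER KA
    print([f"{b:08b}" for b in data])
    print(utf8_sequence_length(data[0]))
```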
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound a). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch of the family includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
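The parallel layout can be illustrated by shifting a character from the Devanagari block (starting at U+0900) to the same relative position in the Bengali block (starting at U+0980). The sketch below is a rough illustration only: the function name `shift_block` is hypothetical, and because not every position is assigned in both blocks, a real transliterator would need exception tables rather than a bare offset.

```python
import unicodedata

DEVANAGARI_START = 0x0900
BENGALI_START = 0x0980
BLOCK_SIZE = 0x80


def shift_block(text, src=DEVANAGARI_START, dst=BENGALI_START):
    """Move every character of the source block to the same relative
    position in the destination block; leave other characters unchanged."""
    shifted = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            shifted.append(chr(cp - src + dst))
        else:
            shifted.append(ch)
    return "".join(shifted)


if __name__ == "__main__":
    word = "\u0915\u092e\u0932"  # KA, MA, LA in Devanagari
    for ch in shift_block(word):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```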
In November 1991, at the time The Unicode Standard, Version 1.0 was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
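Logical order means that a dependent vowel sign is stored after its consonant even when it is drawn to the left of it, as with the short i sign. The sketch below lists the stored code points of the word for "Hindi" and labels each with a character type. The `classify` helper and its code point ranges are a simplification for illustration; they cover only the main letters and signs of the block.

```python
def classify(ch):
    """Roughly classify a Devanagari character by its code point range."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "vowel sign"
    if cp == 0x094D:
        return "virama"
    return "other"


if __name__ == "__main__":
    # HA, vowel sign I, NA, virama, DA, vowel sign II: the word "Hindi".
    word = "\u0939\u093f\u0928\u094d\u0926\u0940"
    for ch in word:
        print(f"U+{ord(ch):04X}", classify(ch))
```

The short i sign (U+093F) is stored second, after the consonant it belongs to, although it is displayed before that consonant.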
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.
A decade ago there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents. But the advent of Unicode support for Hindi and Urdu changed all that." The concept of a new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 128 characters, which is not enough for the Hindi Devanagari script. The Unicode system, as originally designed with 16 bits, can incorporate about 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.
In 2004 the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language: "Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles. What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, out of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life. But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
U.S., it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian
English. It was developed under the leadership of Prof. S.V. Shastri at Shivaji
University, Kolhapur, India in 1988. KCIE contains approximately one million
words of Indian English drawn from materials published in the year 1978. It was
collected for a comparative study of American, British and Indian English
(Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for
the development of Indian language corpora. It has coordinated with various
Indian agencies and universities to develop corpora of more than 45 million
words in the Scheduled Languages of India, which is also a part of the TDIL
program. The Enabling Minority Language Engineering (EMILLE) program
provides corpus architecture and tools for Asian languages. It has a monolingual
corpus which contains approximately 96,157,000 words and a parallel corpus of
200,000 words of text in English, which supports translation into Bengali, Hindi,
Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus which serves as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' as a repository for a Hindi word database and dialect
mapping of Hindi. The Department of Information Technology of the Government
of India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Efforts in this field date from the 1980s and
involved prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. In the late 1990s many new
projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University, Kolkata. In April 2008 TDIL started a
consortium-mode project for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
for English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is
a rule-based system with approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. The Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It aims at fully-automatic, general-purpose,
high-quality machine translation (FGH-MT). Its software can
translate text from one Indian language into another, based on
Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) was initiated by the Indian
Government in 1996 for the translation of government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network and web-based forms. It uses the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent both the English and
the Hindi grammar. Initially it was domain specific, covering Personnel
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present
it also covers domains such as banking, transportation and agriculture. Earlier,
Mantra technology supported only English to Hindi translation, but it is now
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership of
Prof. Pushpak Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility:
if the user is not satisfied with the translated sentence generated by the
system, the user can supply new words, phrases and
sentences to the system and obtain a revised translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical methods. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net. [24]
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has small sets of transfer rules. Users can
access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms. For
any query it searches deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government for e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gurjrati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has long been aware of the need for language
technology. Since 1970 the Department of Electronics and the Department of
Official Language have been involved in developing Indian language technology.
Consequently, ISCII (Indian Script Code for Information Interchange) was
developed for Indian languages on the pattern of ASCII (American Standard Code
for Information Interchange). In addition, Indian languages Transliteration
(ITRANS), developed by Avinash Chopde, represents Indian language alphabets in
terms of ASCII (Madhavi et al 2005). The Department of Information Technology
under the Ministry of Communication and Information Technology is also working
on the proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defense Research and Development Organization), the Department of
Atomic Energy, the All India Council for Technical Education and UGC (University
Grants Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide funds for research to development agencies. As an
end result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India. [64]
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, some 10-15 Indian local search engines appear to have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to give a brief overview of the current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. There
are also localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced while Searching in Hindi: Low Recall
The preliminary investigation into typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance, popular
web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality these search engines face a recall
problem when dealing with Indian languages, due to multiple spellings,
morphological variants of keywords and English keywords written in Hindi. Table
4.15 illustrates a few such cases. For example, a Hindi query for "world trade
center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) is shown to return
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that documents containing these keywords do exist. But just saying we have a
recall problem may not be sufficient. The next obvious question is how much of a
problem it is. In other words, we need to quantify the problem. For this purpose
we conducted many experiments to determine at what levels this recall problem
occurs, and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query                                        Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                  0        0       0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च      0        0       0
Table 4.16 Improved Recall
Hindi Query                                        Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                      8820     92      12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                      331      10      1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च      708      50      1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध             37100    7400    93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well defined
morphological structure and a well defined grammar, but the grammatical and
structural standards of the language are often not followed, for various reasons.
One of the reasons is the language diversity of India. Including Hindi there are
about 28 languages spoken in India, and Hindi, being the official language of
India, is influenced by the regional languages, which results in dialectal changes
not only in speaking but in writing also. Every language attaches markers to a
root word to construct new words (English uses s, es and ing; Hindi uses
MAATRAAS such as ए, या, ा and ओ). For example, Yojnaaon (योजनाओं) and
Yojnaayein (योजनाएं) are morphological variants of the Hindi root word Yojnaa
(योजना, "planning" in English). It is desirable to combine all the
morphological variants of a word into a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
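The stemming step described above can be sketched as a small suffix-stripping
routine. This is only an illustration under stated assumptions: the rule list
SUFFIX_RULES, the MIN_STEM_LENGTH guard and the function names are hypothetical
and are not the rule set used in this study. A real Hindi stemmer needs many
more rules and an exception list.

```python
# Hypothetical (suffix, replacement) rules covering a few Hindi plural and
# oblique endings. They are assumptions for illustration only.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ियां", "ी"),
    ("ियाँ", "ी"),
    ("ओं", ""),     # योजनाओं -> योजना
    ("एं", ""),     # योजनाएं -> योजना
    ("एँ", ""),
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),     # झीलें -> झील
]

MIN_STEM_LENGTH = 2  # do not strip very short words such as में


def stem(word: str) -> str:
    """Strip the longest matching suffix and return the canonical form."""
    rules = sorted(SUFFIX_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem_query(query: str) -> str:
    """Stem every whitespace-separated token of a query."""
    return " ".join(stem(token) for token in query.split())


if __name__ == "__main__":
    print(stem_query("ग्रामीण विकास योजनाओं"))
```

With such a routine, the variants योजनाओं and योजनाएं are both reduced to योजना
before the query is matched against an index built from stemmed terms.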
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages, multiple dialects, transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety in the Indian language alphabet. Together these have resulted in
spelling variations of the same word. The Hindi word अंग्रेजी (angrējī, meaning
English), for example, has several possible spellings.
There are numerous words which are phonetically equivalent but vary in writing.
The word school can be written in several different ways in Hindi.
When information is searched for the standard keyword स्कूल (school) and for a
non-standard phonetic equivalent of it, Google shows 69 million results
for the former and 14 million for the latter. Hindi is also influenced by
other regional languages, which results in phonetic variety of words. For
example, the English word school (स्कूल in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by the majority of the population in different states of India.
For the Hindi word ISKOOL (इस्कूल) more than two thousand results are found.
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any of these
keywords for searching, and search engines should support all phonetically
equivalent words.
Also, no particular standard exists for writing keywords to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces different
results, i.e. a different set of documents is retrieved with very little overlap.
Native Hindi users may not be aware of these phonetic issues in Hindi IR and
may miss information relevant to their needs.
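One way to make phonetically equivalent spellings match is to map each keyword
to a normalized key before indexing and searching. The sketch below is a
minimal illustration, not the method evaluated in this chapter. Its three rules
(dropping the nukta, treating chandrabindu as anusvara, and writing a half
nasal consonant before another consonant as anusvara) and the name
phonetic_key are assumptions, and rules this coarse can also merge words that
are genuinely different.

```python
import re
import unicodedata

NUKTA = "\u093c"         # the dot in ज़, फ़ and similar letters
CHANDRABINDU = "\u0901"  # nasalization sign
ANUSVARA = "\u0902"      # nasalization sign
# A nasal consonant (ङ ञ ण न म) followed by virama and another consonant.
HALF_NASAL = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Return a normalized key shared by common spelling variants."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = HALF_NASAL.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for variant in ["हिन्दी", "हिंदी", "अंग्रेज़ी", "अँग्रेजी"]:
        print(variant, "->", phonetic_key(variant))
```

Under these assumptions हिन्दी and हिंदी share one key, as do अंग्रेज़ी, अंग्रेजी
and अँग्रेजी, so a query in any one spelling can be matched against documents
indexed under the others.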
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
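A retrieval system can reduce the effect of synonym choice by expanding each
query term with its known synonyms. The sketch below assumes a small
hand-built table (SYNONYMS, hypothetical) and an "OR" operator in the target
engine's query syntax, which is also an assumption. In practice the table
could be drawn from a lexical resource such as a WordNet for Hindi.

```python
# Hypothetical synonym table; only the example from the text is included.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms=SYNONYMS) -> list:
    """Return one list of alternatives per query token, original term first."""
    groups = []
    for token in query.split():
        alternatives = [token]
        for synonym in synonyms.get(token, []):
            if synonym not in alternatives:
                alternatives.append(synonym)
        groups.append(alternatives)
    return groups


def to_boolean_query(groups: list) -> str:
    """Render the groups as a query string using an assumed OR operator."""
    parts = []
    for group in groups:
        if len(group) == 1:
            parts.append(group[0])
        else:
            parts.append("(" + " OR ".join(group) + ")")
    return " ".join(parts)


if __name__ == "__main__":
    print(to_boolean_query(expand_query("आभूषण दुकान")))
```

Expansion trades precision for recall: more documents match, but near
synonyms with different nuances may bring in less relevant results.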
4.8.4 Ambiguous Words
Ambiguous words reduce the relevancy of the results. The examples
below show this aspect clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to
sleep. In the context of the above query the word सोना means gold, but the query
can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common;
however, in Hindi it also means mango, so the above query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in a
given context is an intricate task. One word can have more than one meaning, and
the meaning depends on the context of the sentence. For example, कर means tax in
one context (with synonyms such as शुल्क, सूद, महसूल and टैक्स), hand or arm in
another context (हस्त, बाहु), and to do (करना) in yet another.
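The dependence of meaning on context can be illustrated with a very simple
overlap heuristic, in the spirit of dictionary-based word sense
disambiguation. The cue-word sets in SENSES and the function pick_sense are
hypothetical and only cover the सोना example. The sketch returns no decision
when the query gives no distinguishing context, which is exactly the situation
in the first example above.

```python
# Hypothetical cue words for each sense of an ambiguous keyword.
SENSES = {
    "सोना": {
        "gold": {"गहना", "आभूषण", "भाव", "कीमत"},
        "sleep": {"नींद", "रात", "बिस्तर"},
    },
}


def pick_sense(word: str, query: str, senses=SENSES):
    """Pick the sense whose cue words overlap most with the other query terms.

    Returns None when the word is unknown, when no cue word occurs, or when
    two senses tie, i.e. when the query stays ambiguous.
    """
    context = set(query.split()) - {word}
    scores = {
        sense: len(cues & context)
        for sense, cues in senses.get(word, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


if __name__ == "__main__":
    print(pick_sense("सोना", "नारी को सोना पसंद है"))  # no cue words present
    print(pick_sense("सोना", "आज सोना भाव"))
```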
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways. It has
affected the pronunciation of Hindi words, and many English words have been
localized in India. Some of these words appear as if they were native Hindi words,
and Indians are sometimes unable to recall the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
uneducated Indians, without being aware of the language those words come from.
Most Indians use these words in English rather than in their native language. The
English language influences Hindi not only in speaking but in writing too, and
this becomes more evident for Hindi literature on the web.
The influence of English on the Hindi language has been observed to be one of the
very important parameters for Hindi information retrieval.
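Because English loanwords may appear in a Hindi query either in Roman script
or transliterated into Devanagari, a retrieval system can detect the script of
each token and add the other written form. The sketch below is illustrative
only: the LOANWORDS table and the function names are assumptions, and a real
system would need a large lexicon or a transliteration model rather than a
lookup table.

```python
# Hypothetical loanword table mapping Roman-script English words to a common
# Devanagari spelling.
LOANWORDS = {
    "internet": "इंटरनेट",
    "password": "पासवर्ड",
    "email": "ईमेल",
}
DEVANAGARI_TO_ROMAN = {hindi: english for english, hindi in LOANWORDS.items()}


def script_of(token: str) -> str:
    """Classify a token as 'devanagari', 'latin', 'mixed' or 'other'."""
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"


def loanword_variants(token: str) -> list:
    """Return the token plus its other-script form when the table knows it."""
    script = script_of(token)
    if script == "latin":
        other = LOANWORDS.get(token.lower())
    elif script == "devanagari":
        other = DEVANAGARI_TO_ROMAN.get(token)
    else:
        other = None
    return [token] if other is None else [token, other]


if __name__ == "__main__":
    for token in "internet का पासवर्ड".split():
        print(token, script_of(token), loanword_variants(token))
```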
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from this set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S. No.   Query in Hindi                      Meaning in English
1        भारत वर्षा वन                       Indian rain forest
1.1      भारतीय वर्षा वनों                   Indian rain forests
2        हवाई दुर्घटना का कारण               Reason for air crash
2.1      हवाई दुर्घटनाओं का कारण             Reason for air crashes
3        भारत में बोली जाने वाली भाषा        Language spoken in India
3.1      भारत में बोली जाने वाली भाषाएं      Languages spoken in India
3.2      भारत में बोली जाने वाली भाषाओं      Languages spoken in India
4        विलुप्त होने पर झील                 Lake on the verge of extinction
4.1      विलुप्त होने पर झीलें               Lakes on the verge of extinction
5        प्राकृतिक आपदा                      Natural calamity
5.1      प्राकृतिक आपदाएं                    Natural calamities
6        पक्षी की प्रजाति                    Bird species
6.1      पक्षियों की प्रजाति                 Bird's species
6.2      पक्षियों की प्रजातियां              Bird's species
7        कृषि समस्या                         Agriculture problem
7.1      कृषि समस्याओं                       Agriculture problems
8        कीटनाशक का इस्तेमाल                 Use of pesticide
8.1      कीटनाशकों का इस्तेमाल               Use of pesticides
9        मानसिक रोग                          Mental illness
9.1      मानसिक रोगियों                      Mental patients
9.2      मानसिक रोगी                         Mental patient
10       ग्रामीण विकास योजना                 Policy for village
10.1     ग्रामीण विकास योजनाओं               Policies for village
10.2     ग्रामीण विकास योजनाएं               Policies for village
11       प्रमुख कृषि केंद्र                  Major agricultural office
11.1     प्रमुख कृषि केन्द्रों               Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
Listing of keywords by each search engine:
S. No.   Root word         Google                                             Bing             Guruji
1        भारत वर्षा वन     भारत वर्षा वन, वर्षा वन, वन                        भारत वर्षा वन    भारत वर्षा वन
2        दुर्घटना          दुर्घटना, दुर्घटनाओं                               दुर्घटना         दुर्घटना
3        भाषा              भाषा, भाषाओं                                       भाषा             भाषा
4        झील               झील, झीलें, झीलों                                  झील              झील
5        आपदा              आपदा, आपदाओं                                       आपदा             आपदा
6        पक्षी प्रजाति     पक्षी, पक्षियों, प्रजाति, प्रजातियों, प्रजातियां   पक्षी प्रजाति    पक्षी प्रजाति
7        समस्या            समस्याएं, समस्याओं, समस्या                         समस्या           समस्या
8        कीटनाशक           कीटनाशक, कीटनाशकों                                 कीटनाशक          कीटनाशक
9        रोग               रोग, रोगों, रोगी                                   रोग              रोग
10       योजना             योजनाओं, योजना                                     योजना            योजना
11       केंद्र            केंद्र, केंद्रीय                                   केंद्र           केंद्र
Documents returned for each morphological variant:
S. No.   Morphological variant     Google    Bing     Guruji
1.1      भारत वर्षा वन             50500     4680     485
1.2      भारतीय वर्षा वनों         40400     680      61
2.1      दुर्घटना                  133000    2410     284
2.2      दुर्घटनाओं                117000    420      23
3.1      भाषा                      161000    8330     961
3.2      भाषाएं                    6200      935      188
3.3      भाषाओं                    6090      441      356
4.1      झील                       4740      278      25
4.2      झीलें                     1270      28       1
5.1      आपदा                      102000    4030     410
5.2      आपदाएं                    1160      64       20
6.1      पक्षी प्रजाति             48200     1670     98
6.2      पक्षियों प्रजाति          47600     1150     84
6.3      पक्षियों प्रजातियां       33800     747      25
7.1      समस्या                    584000    30200    1889
7.2      समस्याओं                  584000    7150     1356
8.1      कीटनाशक                   36300     1360     333
8.2      कीटनाशकों                 35800     800      270
9.1      रोग                       205000    21600    1423
9.2      रोगियों                   128000    3280     239
9.3      रोगी                      112000    6280     647
10.1     योजना                     673000    18500    3343
10.2     योजनाओं                   669000    6020     990
10.3     योजनाएं                   673000    2860     416
11.1     केंद्र                    261000    11300    655
11.2     केन्द्रों                 29500     1850     105
Figure 4.1 Precision values of the three search engines (vertical axis: precision from 0 to 1; horizontal axis: query number; series: P Google, P Bing, P Guruji)
Table 4.19 Precision values of the three search engines
S. No.   Query                               Precision @ 10
                                             Google   Bing   Guruji
1.1      भारत वर्षा वन                       0.5      0.3    0.1
1.2      भारतीय वर्षा वनों                   0.3      0.1    0.1
2.1      हवाई दुर्घटना का कारण               0.5      0.3    0.1
2.2      हवाई दुर्घटनाओं का कारण             0.3      0.3    0.1
3.1      भारत में बोली जाने वाली भाषा        0.9      0.5    0.5
3.2      भारत में बोली जाने वाली भाषाएं      0.5      0.4    0.3
3.3      भारत में बोली जाने वाली भाषाओं      0.5      0.2    0.2
4.1      विलुप्त होने पर झील                 0.5      0.3    0.2
4.2      विलुप्त होने पर झीलें               0.5      0.3    0
5.1      प्राकृतिक आपदा                      1        0.4    0.3
5.2      प्राकृतिक आपदाएं                    0.6      0.6    0.1
6.1      पक्षी की प्रजाति                    1        0.7    0.2
6.2      पक्षियों की प्रजाति                 0.9      0.6    0.1
6.3      पक्षियों की प्रजातियां              0.9      0.6    0.1
7.1      कृषि समस्या                         0.7      0.3    0.2
7.2      कृषि समस्याओं                       0.7      0.4    0.2
8.1      कीटनाशक का इस्तेमाल                 1        0.8    0.2
8.2      कीटनाशकों का इस्तेमाल               0.9      0.6    0.2
9.1      मानसिक रोग                          0.7      0.6    0.4
9.2      मानसिक रोगियों                      0.9      0.6    0.3
9.3      मानसिक रोगी                         0.9      0.5    0.4
10.1     ग्रामीण विकास योजना                 0.9      0.7    0.3
10.2     ग्रामीण विकास योजनाओं               1        0.6    0
10.3     ग्रामीण विकास योजनाएं               0.8      0.6    0.2
11.1     प्रमुख कृषि केंद्र                  0.5      0.4    0
11.2     प्रमुख कृषि केन्द्रों               0.4      0.2    0.2
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 127
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word because, in general, we get better results with keywords in their root form.
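The observation above suggests reducing an inflected keyword to its root before it is sent to a search engine. The following is a minimal sketch of such a light suffix stripper. The suffix list, the function name `light_stem` and the `min_stem_len` parameter are assumptions made for illustration, not the method used in this study, and a usable stemmer would need a much fuller rule set.

```python
# Illustrative plural/oblique endings only; this list is an assumption
# and is far from complete.
SUFFIXES = (
    "\u0913\u0902",  # ओं  oblique plural after a vowel
    "\u090f\u0902",  # एं  direct plural, anusvara spelling
    "\u090f\u0901",  # एँ  direct plural, chandrabindu spelling
    "\u094b\u0902",  # ों  oblique plural after a consonant
)


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip one known suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for variant in ("समस्या", "समस्याएं", "समस्याओं"):
        print(variant, "->", light_stem(variant))
```

Stripping a single suffix keeps the sketch predictable; it deliberately leaves alone words whose root changes shape in the plural.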
It has also been observed that only Google lists morphological variants of the root word, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the table above.
From the above results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 show the comparison of the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, it can be said that the relevancy of its results is also higher than that of the other two search engines. This indicates that morphological variations in the keywords affect not only the quantity but also the quality of results.
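The precision figure used in these tables can be written as a short function. This is a minimal sketch that assumes each of the top 10 results has already been judged relevant or not by hand; the function name `precision_at_k` and the choice to divide by `k` even when fewer than `k` results are returned are assumptions, not details stated in the study.

```python
def precision_at_k(relevance: list[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    relevance[i] is True when the result at rank i + 1 was judged relevant.
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


if __name__ == "__main__":
    # Nine relevant results among the first ten.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```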
4.10 Discussion: Phonetic nature of Hindi language and spelling variations
Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and search engines should support all phonetically equivalent forms.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision for each query variant.
Table 4.20: Results of the search engine on phonetic nature of Hindi language
For each Hindi query (standard keywords in bold in the original) the phonetic variations of the keywords are listed, followed by one row per Google query.
Google query | No. of results | Precision at 10

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5
Figure 4.2: Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)
The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing a keyword to fetch Hindi web data. Every phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with the least repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss information relevant to his or her needs.
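One way to see which query spellings are mere orthographic variants of each other is to fold them to a common key before comparing. The sketch below is an illustration under stated assumptions, not the tool developed in this study: it handles only two kinds of variation (nuqta versus no nuqta, and chandrabindu versus anusvara), and the names `spelling_key` and `group_variants` are hypothetical.

```python
import unicodedata
from collections import defaultdict

NUQTA = "\u093c"         # dot placed under a consonant
CHANDRABINDU = "\u0901"  # vowel nasalization sign
ANUSVARA = "\u0902"      # dot written above the headline


def spelling_key(text: str) -> str:
    """Fold two common spelling variations into one comparison key."""
    # NFD splits precomposed nuqta letters into base letter + nuqta.
    decomposed = unicodedata.normalize("NFD", text)
    folded = decomposed.replace(NUQTA, "").replace(CHANDRABINDU, ANUSVARA)
    return unicodedata.normalize("NFC", folded)


def group_variants(queries: list[str]) -> dict[str, list[str]]:
    """Group queries whose folded keys are identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for query in queries:
        groups[spelling_key(query)].append(query)
    return dict(groups)


if __name__ == "__main__":
    for key, members in group_variants(["महंगाई", "महँगाई", "आज़ादी", "आजादी"]).items():
        print(key, members)
```

Folding of this kind trades precision for recall: it can merge words that differ only by a nuqta, so the grouped variants are better used as alternative queries than as replacements.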
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
Table 4.21: Effect of word synonyms on Hindi IR
S. No. | Query | Google documents | Google precision at 10 | Bing documents | Bing precision at 10 | Guruji documents | Guruji precision at 10
1 | स न क आब षण | standard Hindi words and synonyms: आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | standard Hindi words and synonyms: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | standard Hindi words and synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | standard Hindi words and synonyms: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | standard Hindi words and synonyms: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | standard Hindi words and synonyms: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1
Figure 4.3: Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Both the quantity and the quality of the documents returned are affected by using synonyms of Hindi keywords.
The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, changing a keyword into its synonym yields equivalent results. It is therefore evident that synonyms of keywords play an important role in Hindi information retrieval.
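The synonym effect discussed above can be exploited by generating one query variant per synonym. The following is a minimal sketch of that idea, assuming whitespace-separated keywords and a hand-built synonym dictionary; the dictionary contents, its spellings and the name `expand_query` are illustrative assumptions rather than the resources used in this study.

```python
# Hypothetical synonym dictionary with a single illustrative entry.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the original query followed by one variant per synonym.

    Each variant replaces exactly one keyword, so the meaning of the
    query is kept as close to the original as the dictionary allows.
    """
    tokens = query.split()
    variants = [query]
    for index, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            variants.append(" ".join(tokens[:index] + [synonym] + tokens[index + 1:]))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Replacing one keyword at a time keeps the number of variants linear in the number of synonyms, which matters when every variant costs a search-engine request.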
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the meaning of the query in English with respect to the ambiguous keyword in each context.
Table 4.22: List of randomly selected ambiguous queries
S. No. | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24: Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25: Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied or that contain the keyword in another, non-required context. The results make it clear that all search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic variation shown in this example, a variety of popular keywords and queries from various domains have been tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26: Keywords along with their phonetic variants written in Hindi
English word | Hindi forms, in order: Google transliteration (bold in the original), standard Hindi keyword, phonetic equivalents | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200 101000 3520 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 246000 1390 3710
Cancer | क सय क सय क नसय क नसय | 1880000 35700 6490 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 263200 2440 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 368000 1110 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 1040000 2610000 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 1070000 1420 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 735000 97000 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 125000 2510 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 38000 639 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 1440 51300 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 8730 182 8
Table 4.26 shows that the search engine returns documents for a single-keyword query and that documents are also returned, in large numbers, for all phonetic variants of the keyword. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents have been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
Table 4.26 (continued)
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 395000 69700 289000

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. The Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query. The queries are converted at two levels. In the first level, one Hindi word is replaced by its English equivalent; in the second level, more than one word is replaced by its English equivalent, again without changing the meaning of the original Hindi query.
Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as follows:
Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is therefore transformed into two equivalent senses containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
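The two-level transformation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the software described in the next chapter: the mapping from Hindi keywords to English words in Devanagari is hypothetical (including its spellings), Level 2 is simplified to "replace every mappable keyword", and the name `transform_levels` is invented for the example.

```python
def transform_levels(query: str, hindi_to_english: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (level 1 variants, level 2 variants) for a Hindi query.

    Level 1: exactly one keyword replaced by its English equivalent.
    Level 2: all mappable keywords replaced (only when there are 2 or more).
    """
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in hindi_to_english]

    level1 = []
    for i in replaceable:
        variant = tokens.copy()
        variant[i] = hindi_to_english[tokens[i]]
        level1.append(" ".join(variant))

    level2 = []
    if len(replaceable) > 1:
        level2.append(" ".join(hindi_to_english.get(token, token) for token in tokens))
    return level1, level2


if __name__ == "__main__":
    # Hypothetical mapping; the Devanagari spellings are illustrative only.
    mapping = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
    first, second = transform_levels("विदेशी निवेश भारत में", mapping)
    print(first)
    print(second)
```

Words without an entry in the mapping (here the last two tokens) pass through unchanged, which is what keeps the variant a Hindi query rather than a translation.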
Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4: Transformed queries into two equivalent senses (number of results for Hindi Query 1 to 6 and their Level 1 and Level 2 variants)
The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28: Analysis of precision values (tabular representation)
Each precision-at-10 cell lists three values: Hindi query, Level 1, Level 2.
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1

Figure 4.5: Analysis of inter-precision values
The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. The same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variation that reflects English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
(Chart for Figure 4.5: inter-precision chart for Google, Bing and AltaVista; x-axis: HQ1 to HQ6, each followed by L1 and L2, where HQ is the Hindi query and L is the level; y-axis: precision at 10.)
Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and its users.
The relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at root level. However, this problem can be solved at interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
speakers. Two distinct phones which are both members of the same phoneme are called allophones (from Greek, "different sounds").
In Hindi, the sounds associated with the English letters "w" and "v" are allophones. Both are transcribed with one letter, व. Analogously to the English example above, these sounds are typically pronounced consistently in words, but they do not constitute meaningful differences in utterances. For example, the word वो is typically pronounced as "vo", whereas the suffix -वाला is typically pronounced "wala". Hindi speakers are not generally aware of this distinction, even though they pronounce it fairly consistently, just as English speakers are not aware of the differences of aspiration in certain letters yet pronounce aspiration consistently.
Thus व may be pronounced as "w" or "v". Some speakers may even pronounce an intermediate sound.
Semi-Allophones "j" and "z" in Hindi
Likewise, Hindi speakers do not generally maintain any strict distinction between the English "j" and "z" sounds either, but will typically pronounce words consistently. This situation is not quite the same as "w" and "v", since technically the "z" sound can be represented distinctly from the "j" sound by placing a dot (nuqta) underneath the letter, and some speakers are aware of this distinction. For instance, the word जो is pronounced as "jo". There is some variation, however, in words such as ज्यादा: some speakers pronounce this as "zyada" and some as "jyada".
4.2.19 English Alveolar Consonants
There is no equivalent of the English "t" or "d" in Hindi. These English sounds are pronounced with the tip of the tongue on the alveolar ridge behind the top teeth. This place of articulation lies between the Devanagari retroflex and dental positions, although to Hindi speakers the English pronunciation sounds much closer to the retroflex one. English loanwords containing "t" or "d" are therefore transcribed with retroflex approximations.
Capital Letters
Devanagari has no capital letters.
Special Matraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding symbols in Devanagari. These sounds are represented by placing the nuqta underneath a symbol which is phonetically similar. These symbols represent sounds from other languages such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter | Approximation
क़ | like "k", but pronounced in the back of the mouth
ख़ | velar fricative, like "Bach" in German
ग़ | velar sound similar to ख़, but voiced
ज़ | just as English "z", as in "zoo"
झ़ | similar to the "s" in English "vision"
फ़ | just as English "f"
Table 4.11: Foreign Sounds
Only two of the borrowed sounds are typically pronounced distinctly from the non-nuqta forms, though: ज़ and फ़.
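Because a nuqta letter can be stored either as one precomposed code point or as a base letter followed by the nuqta sign, two visually identical words may fail a plain string comparison. The sketch below illustrates this with Python's standard `unicodedata` module; the helper name `same_text` is an assumption made for the example.

```python
import unicodedata

ZA_PRECOMPOSED = "\u095b"     # single code point: DEVANAGARI LETTER ZA
ZA_SEQUENCE = "\u091c\u093c"  # letter JA followed by the nuqta sign


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical (NFC) normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(ZA_PRECOMPOSED == ZA_SEQUENCE)           # raw comparison
    print(same_text(ZA_PRECOMPOSED, ZA_SEQUENCE))  # normalized comparison
```

Normalizing both the indexed text and the query in the same way removes this source of mismatch without changing how either spelling is displayed.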
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically only used when typographical difficulties make it difficult to use conjuncts.
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted and the modified consonant symbol is then conjoined to the second consonant symbol. For example:
न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ़ + त = फ़्त (मुफ़्त)
क + य = क्य (क्यों)
Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either using a regular conjunct (e.g. न + द = न्द, as in हिन्दी) or using an anusvar, which is a dot written above the horizontal line to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated in the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.
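The homorganic rule above is mechanical enough to sketch in code: an anusvar before a stop consonant can be rewritten as the nasal of the same articulation class followed by a halant. The following is a minimal sketch under stated assumptions; the function name `expand_anusvara` is hypothetical, only the five stop classes are covered, and an anusvar before any other letter or at the end of a word is left untouched.

```python
ANUSVARA = "\u0902"
VIRAMA = "\u094d"  # halant

# Nasal consonant for each class of stops (velar, palatal, retroflex,
# dental, labial), keyed by the stop that follows the anusvar.
HOMORGANIC_NASAL = {}
for nasal, stops in (
    ("\u0919", "\u0915\u0916\u0917\u0918"),  # ङ before क ख ग घ
    ("\u091e", "\u091a\u091b\u091c\u091d"),  # ञ before च छ ज झ
    ("\u0923", "\u091f\u0920\u0921\u0922"),  # ण before ट ठ ड ढ
    ("\u0928", "\u0924\u0925\u0926\u0927"),  # न before त थ द ध
    ("\u092e", "\u092a\u092b\u092c\u092d"),  # म before प फ ब भ
):
    for stop in stops:
        HOMORGANIC_NASAL[stop] = nasal


def expand_anusvara(word: str) -> str:
    """Rewrite anusvar + stop as homorganic nasal + halant + stop."""
    result = []
    for index, char in enumerate(word):
        following = word[index + 1] if index + 1 < len(word) else ""
        if char == ANUSVARA and following in HOMORGANIC_NASAL:
            result.append(HOMORGANIC_NASAL[following] + VIRAMA)
        else:
            result.append(char)
    return "".join(result)


if __name__ == "__main__":
    for word in ("हिंदी", "अंडा"):
        print(word, "->", expand_anusvara(word))
```

Applying the same rewrite to both queries and indexed text would let the two spellings of a word match, at the cost of treating the anusvar spelling and the conjunct spelling as identical everywhere.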
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant which is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted "v", as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य ("gya") in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line which terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters of the English language (it can store at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a unique number (code point) in the range U+0000 to U+10FFFF, a code space that needs 21 bits and can potentially give a unique number to each character in all languages known to man.
There is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
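The size trade-off between the three encoding forms can be checked directly with Python's built-in codecs. This is a small illustrative sketch; the function name `encoded_sizes` is an assumption, and the little-endian codec variants are used only so that no byte order mark is added to the counts.

```python
def encoded_sizes(char: str) -> dict[str, int]:
    """Number of bytes one character occupies in each encoding form."""
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-le")),  # -le: no byte order mark
        "utf-32": len(char.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for label, char in (("Latin A", "A"), ("Devanagari KA", "\u0915")):
        print(label, encoded_sizes(char))
```

For Devanagari text this means UTF-8 uses three bytes per letter and UTF-16 two, while the decoded characters are identical in every form.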
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers, and other software
that rely on US-ASCII values but are transparent to other values. Any document
created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme of 2 to 4
bytes (the original design allowed sequences of up to 6 bytes, but the current
definition in RFC 3629 stops at 4); the most commonly used characters, including
all Devanagari characters, need at most three bytes. Non-ASCII characters are
encoded as follows.
Each non-ASCII character is encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII characters
stored in bytes have their most significant bit not set). This allows the application
to differentiate between ASCII and non-ASCII characters: bytes representing
non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence indicates, through its leading bits, how
many bytes follow for this character. The remaining bits of the first byte, together
with the low six bits of each following byte, encode the actual character [61].
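The byte layout described above can be checked directly. The sketch below assumes the current four-byte limit of UTF-8 (RFC 3629); the function names `utf8_sequence_length` and `is_continuation` are illustrative, and the check is deliberately simplified (it does not reject overlong or otherwise invalid sequences).

```python
def utf8_sequence_length(lead_byte):
    """Return the total sequence length announced by a UTF-8 lead byte."""
    if lead_byte >> 7 == 0b0:       # 0xxxxxxx: a plain ASCII byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"0x{lead_byte:02X} is not a UTF-8 lead byte")

def is_continuation(byte):
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return byte >> 6 == 0b10

# DEVANAGARI LETTER KA (U+0915) occupies three bytes in UTF-8.
encoded = "\u0915".encode("utf-8")
print([f"{b:08b}" for b in encoded])
print(utf8_sequence_length(encoded[0]))
```

Because every byte of a multibyte sequence has its top bit set, a parser that only understands ASCII delimiters can pass Devanagari text through untouched.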
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from left to
right. They are all abugidas, in which most symbols stand for a consonant plus an
inherent vowel (usually the sound /a/). Word-initial vowels in many of these
scripts have distinct symbols, and word-internal vowels are usually written by
juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the
inherent vowel, when that occurs, is frequently marked with a special sign. In the
Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some
languages another designation is preferred. In Hindi, for example, the word hal
refers to the character itself, and halant refers to the consonant that has its inherent
vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally
serves to suppress the inherent vowel of the consonant to which it is applied; it is
a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the
south, and from Pakistan in the west to the easternmost islands of Indonesia, are
derived from the ancient Brahmi script. The oldest lengthy inscriptions of India,
the edicts of Ashoka from the third century BCE, were written in two scripts,
Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably
deriving from Aramaic, which was an important administrative language of the
Middle East at that time. Kharoshthi, written from right to left, was supplanted by
Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes
throughout the subcontinent and outlying islands. There are said to be some 200
different scripts deriving from it. By the eleventh century, the modern script known
as Devanagari was in ascendancy in India proper as the major script of Sanskrit
literature. The northern branch of Brahmi-derived scripts includes such modern
scripts as Bengali, Gurmukhi, and Tibetan; the southern branch includes scripts
such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded
according to a common plan, so that comparable characters are in the same order
and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit encoding
of syllable-final consonants. Many of the character names in this group of scripts
represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900-U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write
other related languages of India (such as Marathi) and of Nepal (Nepali). In
addition, the Devanagari script is used to write the following languages: Awadhi,
Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi
(Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa, and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script, and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. In the Unicode Standard,
the principles of the Indic scripts are covered in some detail in the introduction
to the Devanagari script; the introductions to the remaining Indic scripts are
abbreviated but highlight any differences from Devanagari where appropriate.
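Because Devanagari occupies the contiguous range U+0900 to U+097F, a retrieval system can recognize Devanagari input with a simple range test. The sketch below is one possible way to do this; the function names and the idea of computing a per-query ratio are assumptions made for illustration, not something described in the source.

```python
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F

def is_devanagari(ch):
    """True if the character lies in the Devanagari block U+0900..U+097F."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END

def devanagari_ratio(text):
    """Fraction of non-space characters that belong to the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(ch) for ch in chars) / len(chars)

hindi_word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari
print(devanagari_ratio(hindi_word))            # all characters are in the block
print(devanagari_ratio(hindi_word + " news"))  # mixed Devanagari/Latin query
```

A ratio like this could, for example, be used to route a mixed-script query to a Hindi-specific analysis chain, although the block test alone cannot tell Hindi apart from Marathi, Nepali, or the other languages written in Devanagari.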
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0-F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic scripts
in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada, and Malayalam. This parallel code layout emphasizes the structural
similarities of the Brahmi scripts and follows the stated intention of the Indian
coding standards to enable one-to-one mappings between analogous coding
positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer,
Myanmar, and other scripts depart to a greater extent from the Devanagari
structural pattern, so the Unicode Standard does not attempt to provide any direct
mappings for these scripts to the Devanagari order.
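The parallel code layout means that, for many letters, the analogous character in another Indic block sits at the same offset from the block start. The following sketch illustrates this with plain offset arithmetic. The block start values are the published Unicode block origins; the function name `shift_script` is hypothetical, and the approach is only a first approximation, since not every position is assigned in every script and real transliteration needs script-specific rules.

```python
import unicodedata

# Start of each 128-code-point Indic block that follows the ISCII-based layout.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def shift_script(text, source, target):
    """Map characters to the analogous position in another Indic block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            candidate = chr(cp - src + dst)
            # Keep the original character if the target position is unassigned.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)  # characters outside the source block pass through
    return "".join(out)

kamal = "\u0915\u092e\u0932"  # Devanagari letters KA, MA, LA
print(shift_script(kamal, "devanagari", "bengali"))
print(shift_script(kamal, "devanagari", "gujarati"))
```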
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels, and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
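Logical order can be observed by listing the code points of a Hindi word. In the sketch below (the helper name `describe` is illustrative), the dependent vowel sign I is stored after the consonant HA even though it is drawn to the left of it, and the virama between NA and DA marks the suppressed inherent vowel that produces the conjunct.

```python
import unicodedata

def describe(text):
    """List code point, Unicode name, and general category for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

# The word "Hindi" as stored in memory: HA, vowel sign I, NA, virama, DA, vowel sign II.
hindi_word = "\u0939\u093f\u0928\u094d\u0926\u0940"
for code_point, name, category in describe(hindi_word):
    print(code_point, category, name)
```

The rendering engine, not the stored text, is responsible for reordering the vowel sign and forming the conjunct, which is why string matching for retrieval operates on the logical sequence.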
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu, and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging, and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago, there were many problems involved in using Indian languages on
the Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos; more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The concept of a new character encoding from the Unicode
Consortium (a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo, and the Government of India)
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then, the Hindi
Unicode support has been a part of all subsequent upgradations of Microsoft's
operating systems. Also, providing Input Method Editor facilities gives users the
option to use different types of keyboards," says Meghashyam Karanam, product
manager, vision and localization, at Microsoft India. The earlier 7-bit ASCII-based
systems could represent only 128 characters, which is not enough for the
Devanagari script used for Hindi. A single 16-bit Unicode code unit can represent
about 65,000 characters, and the full Unicode codespace holds more than a million
code points. As most computers in India use Microsoft's operating system, this
ensured that the Hindi font was available to most of them as they upgraded the
operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint, and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support for 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,
Konkani, and Oriya.
In 2006, Microsoft, headquartered in Redmond in Washington State, partnered
with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi
portal. "Webdunia also provided support for the Hindi version of Microsoft Office
as well as for language interface packs," says Jaideep Karnik, general manager for
content and localization at webdunia.com. The Indore, Madhya Pradesh-based
company has an office in the United States and helps major software developers
localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the California-based
company has launched various products in the past two years. With the Google
Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages
available on the Internet, including those that are not in a Unicode font. "Google
offers searching in 13 languages (Hindi, Tamil, Kannada, Malayalam, and Telugu,
to name a few), Gmail in five languages, and Google transliteration in Bengali,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu,
and Urdu. Urdu is the most recent language that Google has added to its
offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use
the search function, "users can type Hindi words in Roman script and a drop-down
menu suggests several Hindi phrases. By selecting the appropriate query, users can
search for Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type any
word in Roman letters, hit the space bar, and get the same word in the script of a
different language. Roy-Chowdhury explains the process of adding a new
language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate, and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds.
The efforts of Microsoft, Google, and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of the
popular Hindi newspapers have a Web presence now. "In the last two years, page
views of navbharattimes.com have increased significantly, and half of them come
through Google, as Net users generally search for a specific news item or query,"
says Nagar. Yahoo, with headquarters in California, formed a partnership with
Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the
over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering
there are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and Urdu
on the Internet? "The major challenge is Internet penetration and PC prices. The
moment we have better Internet penetration, especially in smaller towns, and PC
prices go down, Hindi and Indian languages can flourish on the Net," says Karnik
of webdunia.com. India had more than 49 million Internet users in June 2008, out
of which about 9 million used the Internet regularly, according to a study by
Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web?
"That's a tough question. Right now it is not much," says Mishra. But
Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical
volume. "If we look at how the Internet developed in the US, it may provide a
useful analogy. First came content, which was mostly produced by people who
had a passion for putting up content they cared about. Traffic and monetization
was not the motive. Second came growing readership, as people started
discovering content. This set off a virtuous cycle in which content eventually
became a viable, monetizable business. Third were the application developers,
who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery,
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British, and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language
corpora. It has coordinated with various Indian agencies and universities to
develop corpora totalling more than 45 million words in the Scheduled Languages
of India, an effort that is also part of the TDIL program. The Enabling Minority
Language Engineering (EMILLE) program provides corpora, architecture, and
tools for Asian languages. It has a monolingual corpus which contains
approximately 96,157,000 words, and a parallel corpus consisting of 200,000
words of text in English, which helps in translation into Bengali, Hindi, Punjabi,
and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus intended as a repository of "one million pages" of
knowledge-based text. Mahatma Gandhi International University has started the
project "Hindi Samghraha" to build a repository of Hindi words and a dialect
mapping of Hindi. The Department of Information Technology of the Government
of India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date from the 1980s and
involved prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai, and C-DAC Pune. During the late 1990s, many new
projects were undertaken by IIT Bombay, IIIT Hyderabad, AU-KBC Centre
Chennai, and Jadavpur University, Kolkata. In April 2008, TDIL started a
consortium-mode project for building computational tools and a
Sanskrit-Hindi MT system under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharati
IIT Kanpur has developed the Anglabharati machine translation technology
for translating from English to Indian languages under the leadership of Prof.
R. M. K. Sinha. It is a rule-based system with approximately 1,750 rules and
54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua
named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language.
The architecture of Anglabharati has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators, and
post-editor. The Hindi version of Anglabharati is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharati
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharati II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by the
Chinmaya International Foundation (CIF). It is described as a fully automatic,
general-purpose, high-quality machine translation (FGH-MT) system. Its software
can translate text from one Indian language into another, based on Panini's
Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was initiated by the Indian
Government in 1996 for the translation of government orders, notifications,
circulars, and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network, and web-based forms. It is based on the Lexicalized
Tree Adjoining Grammar (LTAG) formalism, which is used to represent the
English as well as the Hindi grammar. Initially it was domain specific, covering
Personnel Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums, and Circulars; gradually the domains were expanded. At present
it also covers domains such as Banking, Transportation, and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali, and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB], and Synopsis. The Rajya Sabha
Secretariat (Rajya Sabha being the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership of
Prof. Pushpak Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government, and it is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence generated by the
system, the user can supply new words, phrases, and sentences as feedback and
obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical techniques. It is used for
the translation of English to Indian languages (Hindi, Marathi, and Telugu). Users
can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has small sets of transfer rules. Users can
access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports the
processing of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine, and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine: Guruji
Guruji.com is the first Indian language search engine. It was founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query, it searches deep into Indian language content and tries to return
the most appropriate results. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business, education,
and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam,
and Punjabi, and also provides an automated translation facility from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil, and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization, and website localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and to the central government in
e-governance programs. It has developed software for multilingual content
creation for publishing newspapers, and has also developed high-quality
Unicode-based fonts for major Indian languages. It has specifically developed the
Shree-Lipi Gujarati package for the Gujarati language, which is useful in the DTP
sector, corporate offices, and the e-governance program of the Government of
Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has long been aware of the need for language technology.
Since 1970, the Department of Electronics and the Department of Official
Language have been involved in developing Indian language technology.
Consequently, ISCII (Indian Script Code for Information Interchange) was
developed for Indian languages on the pattern of ASCII (American Standard Code
for Information Interchange). In addition, Indian Languages Transliteration
(ITRANS), developed by Avinash Chopde, represents Indian language alphabets in
terms of ASCII (Madhavi et al. 2005). The Department of Information Technology
under the Ministry of Communication and Information Technology is also working
towards the proliferation of language technology in India. Other Indian
government ministries, departments, and agencies, such as the Ministry of Human
Resource, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council of Technical Education, and
the UGC (University Grants Commission), are also involved directly and
indirectly in research and development of language technology. All these agencies
help develop important areas of research and provide research funds to
development agencies. As one result, IndoWordNet was developed for the Indian
languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate, and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, there must have been more than 10-15 Indian local search
engines launched, some smaller and some bigger, some with huge funding and
some with none. This space is so crowded right now that it is difficult to know
who is really winning. However, we attempt to put forth a brief overview of the
current scenario. Here are some of those that fall into the localized Indian search
engine category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram,
Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo, and
lemmefind.in, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo, and MSN.
Each of these Indian search engines has come forward with some USP
(Unique Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
Source: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
Source: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
Source: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when information is accessed using Indian language queries. For instance,
popular web search engines such as Google, Yahoo, and Guruji often return 0
search results for Indian language queries, giving the impression that no
documents containing the information exist. In reality, these search engines face
a recall problem with Indian languages because of multiple spellings,
morphological variants of keywords, and English keywords written in Hindi.
Table 4.15 illustrates a few such cases. For example, the Hindi query for "world
trade center aatank-waadi hamlaa" ("वलडम टर ड स नटय आत कव दी हरभ") returns 0
documents in Table 4.15; however, a small rephrasing of the query, shown in
Table 4.16, reveals that documents containing these keywords do exist. Simply
stating that there is a recall problem is not sufficient, however. The obvious
next question is how large the problem is; in other words, the problem needs to
be quantified. For this purpose we conducted a number of experiments to
determine at what levels the recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological
structure and a well-defined grammar. In practice, however, the grammatical and
structural standards are rarely followed, for various reasons. One of the reasons
is the linguistic diversity of India. Including Hindi, about 28 languages are
spoken in India, and Hindi, as the official language of India, is influenced by
the regional languages, which changes its dialects not only in speech but also in
writing. Every language attaches markers to a root word to construct new words:
English uses s, es, and ing, and Hindi uses MAATRAAS such as ऐ म ा ओ. For
example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) in Hindi are morphological
variants of the root word Yojnaa (म जन), which means planning in English. It is
desirable to combine all the morphological variants of a word into a single
canonical form. This process is called word stemming, and the canonical form is
called the root word or base word.
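The stemming idea described above can be sketched in a few lines. This is only a minimal illustration, not the method used in this study: the suffix list is a small, assumed sample of Hindi plural and oblique endings, and the names `SUFFIXES`, `MIN_STEM_LENGTH`, and `light_stem` are hypothetical.

```python
# Assumed sample of Hindi plural/oblique endings; a real stemmer needs a
# much larger, linguistically validated list.
SUFFIXES = ("ओं", "एं", "एँ", "ों", "ें")
MIN_STEM_LENGTH = 2  # never strip a word down to fewer code points than this


def light_stem(word):
    """Strip the first matching suffix to approximate the root word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


ROOT = "योजना"  # Yojnaa
VARIANTS = [ROOT, ROOT + "ओं", ROOT + "एं"]  # Yojnaa, Yojnaaon, Yojnaayein
stems = {light_stem(word) for word in VARIANTS}
print(stems)  # all three variants collapse to the single root form
```

With such a function, a query and the indexed documents can both be reduced to root forms before matching, so that a search for one variant also finds the others.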
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages and their multiple dialects, the transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabet. The variety in the alphabet,
the different dialects, and the influence of regional and foreign languages have
resulted in spelling variations of the same word. For example, the Hindi word
अ गर ज (angrējī, meaning English) has several possible spelling variations.
Numerous words are phonetically equivalent but vary in writing. The word school
can be written in Hindi in different ways (सक र सक र सक र). When information is
searched for the standard keyword school (सक र) and for a non-standard Hindi
phonetic equivalent (सक र), Google shows 69 million results for the former and
14 million for the latter. Hindi is influenced by the other regional languages,
which results in a phonetic variety of words. For example, the English word
school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the
majority of the population of India in different states. For the Hindi word
ISKOOL (इसक र), more than two thousand results are found. Search engines should
be capable of retrieving results for the phonetic equivalents of the keywords
entered. A user may use any of these keywords for searching, and search engines
should support all phonetically equivalent words.
Moreover, no particular standard exists for writing the keywords used to fetch
Hindi web data. Every phonetically equivalent keyword in a query changes the
results; that is, a different set of documents is retrieved, with little
repetition. A native Hindi user may not be aware of these phonetic issues in
Hindi IR and may therefore miss information relevant to his or her needs.
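One way to support phonetically equivalent spellings is to reduce each keyword to a normalized key before matching. The sketch below is an assumption-laden illustration rather than the approach of this study: it folds only three kinds of variation (nuqta marks, chandrabindu versus anusvara, and a half nasal consonant versus anusvara), and the function name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
NASAL_CONSONANTS = "\u0919\u091e\u0923\u0928\u092e"  # nga, nya, nna, na, ma

# A nasal consonant + virama directly before another consonant is commonly
# written as an anusvara instead.
HALF_NASAL = re.compile(
    "[" + NASAL_CONSONANTS + "]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word):
    """Fold common spelling variants of a Devanagari word into one key."""
    text = unicodedata.normalize("NFD", word)  # splits precomposed nuqta letters
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    return HALF_NASAL.sub(ANUSVARA, text)


TAIL = "\u0926\u094b\u0932\u0928"  # "dolan"
VARIANTS = [
    "\u0906" + "\u0928" + VIRAMA + TAIL,  # aandolan with half na
    "\u0906" + ANUSVARA + TAIL,           # aandolan with anusvara
    "\u0906" + CHANDRABINDU + TAIL,       # aandolan with chandrabindu
]
keys = {phonetic_key(variant) for variant in VARIANTS}
print(len(keys))  # the three spellings share a single key
```

Such a key deliberately loses information, so it suits matching and grouping of variants, not display.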
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations, and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the following
commonly spoken synonyms: गहन, ज वय, and अर क य.
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples below show
this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (न यी क स न ऩस द ह )
In this query the word स न (gold) is ambiguous, as it has another meaning, namely
to sleep. In the context of the above query the word स न means gold, but the
query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आभ र गो की ऩस द)
Here the word आभ is ambiguous. In the above query आभ means common; in Hindi,
however, it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous. Finding the correct sense of a word in a given
context is an intricate task: one word has more than one meaning, and its meaning
depends on the context of the sentence. For example, कय means tax in one context,
with the synonyms बम ज श लक स द भहस ऱ ट कस; hand or arm in another context, with
the synonyms हसत फ ह आच शफय; and to do (कयन) in yet another context.
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Indians are
sometimes unable to find the Hindi equivalent of an English word. For instance,
words such as road, bus, pen, television, radio, please, rail, email, password,
insurance, internet, director, and department are used even by Indians with no
formal education, who may be unaware of the language these words come from. Most
Indians use these words in English rather than in their native language. English
influences Hindi not only in speech but also in writing, and this is especially
evident in Hindi literature on the web. The influence of English on Hindi has
been observed to be one of the most important parameters for Hindi information
retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from that set, which throw light on
the effect of the root word on the performance of Hindi language search engines.
Table 4.18 shows examples of the effect of morphological factors on Hindi
queries.
Table 4.17 List of Hindi queries
S.No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental Patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
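Table 4.18 Effect of morphological factors on Hindi queries
S.No. | Root word(s) or morphological variant | Documents returned: Google | Bing | Guruji
1 | Root words: ब यत वष मवन
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 | Root word: द घमटन
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3 | Root word: ब ष
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4 | Root word: झ र
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5 | Root word: आऩद
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6 | Root words: ऩ परज तत
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | Root word: सभसम
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8 | Root word: कीटन शक
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9 | Root word: य ग
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10 | Root word: म जन
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11 | Root word: क दर
  | Listing of keywords (Google, Bing, Guruji columns, as extracted): क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105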
Table 4.19 Precision values of the three search engines
S.No. | Query | Precision@10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (chart; y-axis: precision from 0 to 1; x-axis: query numbers 1.1 to 11.1; series: Google, Bing, Guruji)
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching with the
root word, because in general we get better results with keywords in their root
form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
These results indicate that only Google indexes document keywords in their root
form. Bing and Guruji do not index in that form, which is why they retrieve
fewer documents than Google. The overall comparison of the three search engines
in the tables above shows that, in general, the quantity of results retrieved
increases when keywords are used in their root form. For search engines,
however, the quality of results is more important than the quantity. Figure 4.1
and Table 4.19 compare the precision values of the three search engines. The
precision value is calculated from the top 10 results returned by each search
engine. A close look at the results shows that Google's precision is the highest
for almost all queries. Since, as mentioned above, Google indexes keywords in
their root form, the relevance of its results is also higher than that of the
other two search engines. This indicates that morphological variation in
keywords affects not only the quantity but also the quality of the results.
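The precision measure used here, computed over the top 10 results, can be written down as a short function. This is a minimal sketch: the relevance judgments are assumed to be supplied by a human assessor as a list of booleans, and the name `precision_at_k` is illustrative.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results judged relevant.

    relevance_flags: booleans in ranked order, True where the assessor
    judged the result relevant. Missing results (fewer than k returned)
    count as not relevant.
    """
    top_k = list(relevance_flags)[:k]
    return sum(1 for flag in top_k if flag) / k


# Hypothetical judgments: 9 of the first 10 results are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned means that an engine returning fewer than 10 documents is penalized, which matches reading a precision of 0 for queries with no results.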
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for the phonetic
equivalents of the keywords entered. A user may use any of these keywords for
searching, and search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. Table 4.20 shows the number of results and the
precision obtained.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Query variant submitted to Google | No. of results | Precision@10
Query 1 (with standard keywords): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9
Query 2 (with standard keywords): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7
Query 3 (with standard keywords): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8
Query 4 (with standard keywords): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0
Query 5 (with standard keywords): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for the phonetic nature of Hindi language (five chart panels, Query No 1 to Query No 5, one segment per query variant)
The table and figure above show that the search engine returns documents for
each of the phonetically equivalent Hindi queries. No particular standard exists
for writing the keywords used to fetch Hindi web data. Every phonetically
equivalent keyword in a query changes the results; that is, a different set of
documents is retrieved, with little repetition. The precision chart shows that
the degree of relevance for queries containing phonetically equivalent keywords
is almost the same. A native Hindi user may not be aware of these phonetic issues
in Hindi IR and may therefore miss information relevant to his or her needs.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations, and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the
following commonly spoken synonyms: गहन, ज वय, and अर क य.
Table 4.21 and Figure 4.3 below compare the precision values for the three
search engines.
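Table 4.21 Effect of word synonyms on Hindi IR
S.No. | Query | Google: documents returned | Google: precision@10 | Bing: documents returned | Bing: precision@10 | Guruji: documents returned | Guruji: precision@10
1 | Query: स न क आब षण | Standard Hindi words and synonyms (as extracted): आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | Query: क र फ दर | Standard Hindi words and synonyms (as extracted): फ दर फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | Query: सतर सशिकतकयण | Standard Hindi words and synonyms (as extracted): सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | Query: मसक दयक अ हक य | Standard Hindi words and synonyms (as extracted): अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | Query: व रग ओ | Standard Hindi words and synonyms (as extracted): व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | Query: आ खद न | Standard Hindi words and synonyms (as extracted): आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1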
Figure 4.3 Comparison of precision values against three search engines (chart; y-axis: precision from 0 to 1; x-axis: query numbers 1.1 to 6.3)
From the examples above, it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a query in Hindi. Using
synonyms of Hindi keywords affects not only the quantity of documents returned
but also their quality.
The table and figure above show that Google returns more documents than the
other two search engines and that Guruji returns the fewest; the reason may be
that fewer documents are available to it or that its indexing is poor. However,
we are more interested in the quality of the results than in their quantity. As
far as quality is concerned, Google and Bing clearly provide better results than
Guruji, and in the average case Google still ranks first; that is, its precision
values are higher than those of Bing and Guruji. It is thus clear that changing a
keyword into its synonym yields equivalent results. Synonyms of keywords
therefore play an important role in Hindi information retrieval.
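The "average case" comparison can be made concrete with a small calculation. The sketch below averages the precision values reported in the synonym table above for the first query group (the five synonym variants of the gold ornament query); the variable names are illustrative, and averaging over a single query group is an assumption made for brevity.

```python
from statistics import mean

# Precision@10 values for the five synonym variants of the first query,
# copied from the synonym table above (one list per search engine).
PRECISION_AT_10 = {
    "Google": [0.8, 0.8, 0.8, 0.5, 0.5],
    "Bing": [0.7, 0.6, 0.7, 0.4, 0.3],
    "Guruji": [0.5, 0.5, 0.6, 0.0, 0.0],
}

mean_precision = {
    engine: mean(values) for engine, values in PRECISION_AT_10.items()
}
ranking = sorted(mean_precision, key=mean_precision.get, reverse=True)

for engine in ranking:
    print(engine, round(mean_precision[engine], 2))
```

Extending the dictionary with the remaining query groups gives the overall averages for each engine.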
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected
queries below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the fifth
column holds the same ambiguous keyword in the other context. The fourth and
sixth columns give the English meaning of the query for the ambiguous keyword in
the respective context.
Table 4.22 List of randomly selected ambiguous queries
S.No. | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like Gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child Development and Nutrition | फ र(Hair) | Hair Development and Nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन(Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र(family) | Destruction of families in war
The ambiguous queries listed in the table above were tested against three search
engines: Google, Bing, and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No Results Found | Art | na | Snake head | na | na
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10
The results in the tables above show that all three search engines return
documents without differentiating between the contexts of the keyword in the
query. In each table, the last column, labeled "Other context", holds the number
of results that are not relevant to the query supplied, that is, documents that
contain the keyword in some other, unwanted context. The results make it clear
that all the search engines return documents in different contexts; search
engines therefore underperform when supplied with ambiguous queries. The numbers
in the "Other context" column signify the deviation from relevance. For example,
for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other
context" column contains 5 documents for Google, 8 documents for Bing, and all
10 documents for Guruji.
For another query, सऩ यो क पन (art of snake charmers; in the other context, snake
charmer's snake head), the retrieved documents are expected to be in the context
of art. The results show, however, that Google returns all 10 documents in the
unwanted context (snake head) and Bing returns 9, whereas Guruji fails to
retrieve even a single document. In such a scenario it becomes important for
search engines to address the issue of ambiguity in keywords in order to obtain
better results.
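The "deviation from relevance" read off the "Other context" column can be expressed as a share of the top 10 results. The sketch below uses the counts reported above for the query meaning "aggregate destruction in wars"; the function name `other_context_share` and the data layout are illustrative assumptions.

```python
def other_context_share(context_1, context_2, other):
    """Share of the judged results that fall outside both known contexts."""
    total = context_1 + context_2 + other
    if total == 0:
        return None  # no results were returned, so no share can be computed
    return other / total


# (context 1, context 2, other context) counts for query 5 in the
# ambiguity tables above: Aggregate, Family, Other.
QUERY_5_COUNTS = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}

shares = {
    engine: other_context_share(*counts)
    for engine, counts in QUERY_5_COUNTS.items()
}
print(shares)
```

A share of 1.0 means that none of the judged results matched either intended sense of the ambiguous keyword.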
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, and this is
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval, as the following example shows.
Example: the English word exercise is written in Hindi as (एकसयस इज). The
word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज
एकसयस इज एकस यस इज एकसयस इज. Following such phonetic variations, a variety of
popular keywords and queries from various domains were tested in the
experiments.
The following table presents a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
English word | Google transliteration | Standard Hindi keyword and phonetic equivalents | Google results (transliteration first, then the three Hindi forms in order)
Woman | वोभन | व भन, व भ न, व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस | इ शम यस, इ शम य स, इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय | क सय, क नसय, क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर | ह िसऩटर, ह िसऩटर, ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन | कयपशन, कयऩशन, क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय | कमपम टय, कमपम टय, क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम | म तनवमसमटी, म तनव मसमटी, म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय | ड मय कटय, ड इय कटय, ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट | एकस डट, एकस ड नट, ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट | ऩ मरमम भट, ऩ मरमभट, ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट, सऩ शमरसट, सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट | एकसऩटम, ऐकसऩटम, ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम | ऩॉमरस, ऩ मरस, ऩ मरस | 609, 395000, 69700, 289000
Table 4.26 Keywords along with their phonetic variants written in Hindi
The table above shows that the search engine returns documents for single-keyword
queries, and that documents are also returned, in large numbers, for all phonetic
variants of the keywords. It can also be seen that people represent Hindi words
in their own ways and that no standard is followed for storing Hindi data on the
web. Documents are retrieved for every phonetic variant of an English keyword
written in Hindi script. In the table, the Google transliteration column shows
the keywords obtained with the Google transliteration tool. Table 4.26 shows that
transliteration does not produce the correct Hindi word in most cases. For
example, the correct transliteration of the word University should be
म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely
wrong. The table shows that 1540 documents were retrieved for the wrong keyword
उतनव मसमतम (University), and the same holds for other single-word queries:
Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy
(ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers
use unchecked and non-standard transliteration, which makes Hindi IR a difficult
task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on Hindi IR, in terms of precision and the quantity of
documents retrieved. The software (whose design and working are discussed in the
next chapter) transforms each Hindi query into its variants by replacing Hindi
keywords with English keywords written in Hindi, without changing the meaning of
the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by its English equivalent, without changing the meaning of the original
Hindi query. For example:
The English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" ("प य न")
and तनव श means "investment" ("इनव सटभट"). The query is transformed at the two
levels as:
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein")
supplied by the user is thus transformed into two equivalent forms containing a
mixture of English and Hindi, with the meaning of the query unchanged. From the
sample set of one hundred queries, some randomly selected queries are presented
below in Table 4.27.
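The two-level transformation can be sketched as follows. This is a simplified illustration and not the software described in the next chapter: the lexicon is a hypothetical two-entry mapping, the Devanagari spellings of the English words are assumed, and `transform_levels` is an illustrative name.

```python
HINDI_FOREIGN = "विदेशी"   # videshi
HINDI_INVESTMENT = "निवेश"  # nivesh

# Hypothetical lexicon: Hindi keyword -> English keyword written in Devanagari.
LEXICON = {
    HINDI_FOREIGN: "फॉरेन",             # "foreign" (assumed spelling)
    HINDI_INVESTMENT: "इन्वेस्टमेंट",   # "investment" (assumed spelling)
}

QUERY = f"{HINDI_FOREIGN} {HINDI_INVESTMENT} भारत में"  # videshi nivesh Bharat mein


def transform_levels(query, lexicon):
    """Return (level 1, level 2) variants of a whitespace-separated query.

    Level 1 replaces only the first keyword found in the lexicon; level 2
    replaces every keyword found in the lexicon. If the query has just one
    replaceable keyword, both levels are the same.
    """
    words = query.split()
    level_1 = list(words)
    for index, word in enumerate(words):
        if word in lexicon:
            level_1[index] = lexicon[word]
            break
    level_2 = [lexicon.get(word, word) for word in words]
    return " ".join(level_1), " ".join(level_2)


level_1, level_2 = transform_levels(QUERY, LEXICON)
print(level_1)
print(level_2)
```

Words without a lexicon entry, such as the last two words of the sample query, pass through unchanged at both levels.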
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4 Transformed queries into two equivalent senses (six chart panels, Hindi Query 1 to Hindi Query 6, each comparing the Hindi query, Level 1, and Level 2)
The table and figure above show that documents are returned for the original as
well as the transformed Hindi queries, and that the quantity of retrieved
documents is considerable. For search engines, the quality of results is more
important than the quantity; Table 4.28 and Figure 4.5 are therefore presented
below for the analysis of precision values. Three popular search engines, namely
Google, Bing, and AltaVista, were used for retrieving web results.
Table 4.28 Analysis of precision values (tabular representation)
Precision@10 is listed for each engine in the order Hindi query, Level 1, Level 2.
Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter precision values (Inter Precision Chart; x-axis: HQ1 to HQ6, each with L1 and L2, where HQ is Hindi Query and L is Level; series: Google, Bing, AltaVista)
The tables and figures above show that Hindi data of a similar nature can be
mined for Hindi queries by transforming the queries into variants that include
English keywords written in Hindi. Transforming the queries increased the amount
of data retrieved. The relevance of the retrieved data can be seen in the
precision columns. For every Hindi query and its transformed variations, the
degree of relevance of the documents is very close, equal, or improved. For
example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are
relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and
ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same
pattern repeats for the rest of the Hindi queries in the table above. Without
transformation of Hindi queries, the user may miss relevant information, because
a Hindi user may not be aware that such information is present on the web and may
be unable to formulate the query variations that arise from English influence.
The table above indicates that English-influenced Hindi information is present on
the web and is increasing day by day. Including English keywords in Hindi script
in the query widens the scope of searching in Hindi and of getting relevant
information.
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. Generally, people do not follow the actual Hindi writing
standard, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at the
root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort a Hindi user needs to search for such
information, software has been developed (described in detail in the next
chapter) that acts as an interface between the user and search engines. With the
help of this tool, the user can widen the scope of a web search in the Hindi
language [66][67].
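The interface-level transformation described above can be sketched as a small query expander. This is only an illustration under stated assumptions: the `EQUIVALENTS` table and the function `query_variants` are hypothetical, and the actual tool presented in the next chapter may work differently.

```python
from itertools import product

# Hypothetical lookup table, for illustration only: a Hindi keyword mapped to
# English-influenced equivalents written in Devanagari.
EQUIVALENTS = {
    "स्वास्थ्य": ["हेल्थ"],
    "रक्तदान": ["ब्लड डोनेशन"],
}


def query_variants(query):
    """Return the original query followed by every keyword-substituted variant."""
    # For each word, the choices are the word itself plus any known equivalents.
    options = [[word] + EQUIVALENTS.get(word, []) for word in query.split()]
    variants = []
    for combination in product(*options):
        variant = " ".join(combination)
        if variant not in variants:
            variants.append(variant)
    return variants


for variant in query_variants("स्वास्थ्य और रक्तदान"):
    print(variant)
```

Each variant would then be submitted to the search engine unchanged, so the engine itself needs no modification.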
Capital Letters
Devanagari has no capital letters.
Special Maatraa Forms of उ and ऊ with र
र + उ = रु
र + ऊ = रू
4.2.20 Borrowed Sounds
There are 6 additional sounds used in Hindi which have no corresponding
symbols in Devanagari. These sounds are represented by placing a dot, the nuqta,
underneath a symbol which is phonetically similar. These symbols represent
sounds from other languages, such as Persian, Arabic and English.
4.2.20.1 Foreign Sounds
Letter   Approximation
क़       like k, but pronounced in the back of the mouth
ख़       velar fricative, like "Bach" in German
ग़       velar sound similar to ख़, but voiced
ज़       just as English z, as in "zoo"
झ़       similar to the s in English "vision"
फ़       just as English f
Table 4.11 Foreign Sounds
Only two of the borrowed sounds, ज़ and फ़, are typically pronounced distinctly
from the non-nuqta forms.
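Because most nuqta letters are pronounced, and often typed, like their plain counterparts, a retrieval system may want to treat the two spellings as equal when matching. The sketch below shows one way to build such a matching key with Python's standard `unicodedata` module; the function name `strip_nuqta` is an assumption introduced for illustration. Note that it also folds ज़ and फ़, which are pronounced distinctly, so it suits loose matching rather than display.

```python
import unicodedata

NUQTA = "\u093C"  # DEVANAGARI SIGN NUKTA


def strip_nuqta(text):
    """Return a matching key with every nuqta removed.

    NFD first splits precomposed letters such as U+095B (ज़) into the base
    letter plus U+093C, so both spellings reduce to the same key.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(NUQTA, "")


# The precomposed and the decomposed spelling give the same key.
print(strip_nuqta("\u095B") == strip_nuqta("\u091C\u093C"))
```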
4.2.20.2 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is
implicitly followed by the inherent vowel अ, Devanagari provides two means of
suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature formed by conjoining two consonant symbols. This
method is much more common. The halant is typically used only when
typographical difficulties make it hard to use conjuncts.
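In Unicode text both means come down to the same stored sequence: the halant is the character U+094D (DEVANAGARI SIGN VIRAMA), and a conjunct is stored as consonant, virama, consonant, with the ligature shape chosen by the font. A minimal sketch, where the helper names `conjunct` and `code_points` are illustrative assumptions:

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, i.e. the halant


def conjunct(*consonants):
    """Join consonants with viramas; drawing the ligature is left to the font."""
    return VIRAMA.join(consonants)


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# न + द is stored as three code points: NA, VIRAMA, DA.
print(code_points(conjunct("\u0928", "\u0926")))
```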
4.2.20.3 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct
contains a vertical line. The vertical line is deleted, and then the modified
consonant symbol is conjoined to the second consonant symbol. For example:
न + द = न्द   हिन्दी
च + छ = च्छ   अच्छा
स + त = स्त   नमस्ते
ल + ल = ल्ल   बिल्ली
म + ब = म्ब   लम्बा
फ़ + त = फ़्त   मुफ़्त
क + य = क्य   क्यों
Note that in the last two examples, although neither क nor फ़ ends in a vertical
line, they can still be the first letter of a horizontal conjunct. The curve on
the right side is shortened and adjoined to the following consonant.
4.2.20.4 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical
conjuncts with the following consonant. The first consonant is written on top of
the second consonant. For example:
ट + ट = ट्ट   छुट्टी
ट + ठ = ट्ठ   चिट्ठी
4.2.20.5 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant
is the first member of a conjunct, it may be written either using a regular
conjunct (e.g. न + द = न्द, हिन्दी) or an anusvar, which is a dot written above
the horizontal line, to the right side of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be
spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant;
in other words, it is articulated in the same location in the mouth as the
following consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant, ड, is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The
anusvar represents a consonant which is the first letter of a conjunct, whereas
the bindu and chandrabindu represent the nasalization of a vowel. The bindu in
हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in
हिंदी is not considered a bindu, since it represents a consonant that is the
first member of a conjunct.
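Since both spellings of a nasal conjunct are in common use, a query in one spelling can miss documents written in the other. The sketch below folds the explicit conjunct form into the anusvar form using the homorganic rule just described. It is an assumption-laden illustration: the names `HOMORGANIC` and `to_anusvara` are hypothetical, and only the five nasal classes before their own stops are handled.

```python
import re

VIRAMA = "\u094D"
ANUSVARA = "\u0902"

# Each nasal paired with the range of stops sharing its place of articulation.
HOMORGANIC = {
    "\u0919": "\u0915-\u0918",  # ङ before क ख ग घ (velar)
    "\u091E": "\u091A-\u091D",  # ञ before च छ ज झ (palatal)
    "\u0923": "\u091F-\u0922",  # ण before ट ठ ड ढ (retroflex)
    "\u0928": "\u0924-\u0927",  # न before त थ द ध (dental)
    "\u092E": "\u092A-\u092D",  # म before प फ ब भ (labial)
}

# nasal + virama, only when followed by a stop of the same class.
_PATTERN = re.compile(
    "|".join(
        f"{nasal}{VIRAMA}(?=[{stops}])" for nasal, stops in HOMORGANIC.items()
    )
)


def to_anusvara(text):
    """Rewrite homorganic nasal conjuncts with an anusvar, e.g. हिन्दी to हिंदी."""
    return _PATTERN.sub(ANUSVARA, text)


print(to_anusvara("\u0939\u093F\u0928\u094D\u0926\u0940"))
```

Applying the same folding to both the query and the indexed text makes the two spellings match; a nasal before a consonant of another class, as in जन्म, is left untouched.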
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above
and to the right of the following consonant:
र + म = र्म   शर्मा
र + ट + ई = र्टी   पार्टी
As the second member of a conjunct, र is indicated by a diagonal line adjoined to
the vertical line of the preceding consonant:
क + र = क्र   शुक्रिया
म + र = म्र   उम्र
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a
following र with a symbol like an inverted v, as follows:
ट + र = ट्र   राष्ट्र
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are
treated as a single unit, so a maatraa that is written to the left, such as ि,
is placed before the entire conjunct.
There are hundreds of conjuncts, but most conjuncts are easily discernible.
Punctuation
Hindi has one native punctuation sign, the viraam (।), a vertical line that
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
4.3 Unicode and Fonts
Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, a 7-bit
encoding of the characters used in English (it can represent at most 128
characters). With the passage of time, the need was felt for a single encoding
that could contain enough characters to accommodate all the languages of the
world. To enable sharing of information, this encoding would need to be a
universally accepted standard. That standard is Unicode. Unicode assigns each
character a unique number, called a code point, in the range U+0000 to
U+10FFFF, which is enough room for the characters of all known languages.
There is also another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the
Universal Character Set (UCS). Fortunately, the participants of both projects
(ISO and Unicode) realized around 1991 that two different unified character
sets were not what the world needed. They joined their efforts and worked
together on creating a single encoding. Both projects still exist and publish
their respective standards independently, but they have agreed to keep the
encodings of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value
is represented in bits when stored in a computer file or transmitted over the
Internet. The Unicode Standard defines three encoding forms for this purpose.
They allow the same data to be transmitted in a byte-, word- or
double-word-oriented format (i.e. in 8-, 16- or 32-bit units). All three
encoding forms encode the same common character repertoire and can be
efficiently transformed into one another without loss of data. The three
encoding forms defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable-length encoding of bytes. It
has the advantages that the Unicode characters corresponding to the familiar
ASCII set have the same byte values as ASCII, and that Unicode characters
transformed into UTF-8 can be used with much existing software without
extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact: all the
heavily used characters fit into a single 16-bit code unit, while all other
characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width,
single-code-unit access to characters is desired. Each Unicode character is
encoded in a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
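To make the three encoding forms concrete, the short sketch below encodes one Devanagari letter, क (U+0915), with Python's built-in codecs. The big-endian codec names are chosen here so that no byte-order mark is added; the helper name `encoded_forms` is an assumption for illustration.

```python
def encoded_forms(text):
    """Encode text in the three Unicode encoding forms (big-endian, no BOM)."""
    return {enc: text.encode(enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for name, data in encoded_forms("\u0915").items():
    # Show the bytes in hexadecimal and how many bytes were needed.
    print(name, data.hex(" "), len(data), "bytes")
```

The same code point takes three bytes in UTF-8, one 16-bit unit in UTF-16 and one 32-bit unit in UTF-32, and each form decodes back to the same character.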
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme and
take from 2 to 4 bytes each (the original design allowed sequences of up to 6
bytes, but the current definition stops at 4). The most commonly used
characters, including all of Devanagari, need at most three bytes. Non-ASCII
characters are encoded as follows:
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII
characters stored in bytes have their most significant bit not set). This
allows the application to differentiate between ASCII and non-ASCII characters:
bytes representing non-ASCII characters will never be mistaken for ASCII
characters.
The first byte of a multibyte sequence that represents a non-ASCII
character indicates how many bytes follow for this character. All further bytes
in the multibyte sequence carry the remaining bits of the character's code
point [61].
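The role of the first byte can be sketched as a small classifier. The byte ranges used are those of the current UTF-8 definition (sequences of 1 to 4 bytes); the function name `utf8_sequence_length` is an illustrative assumption.

```python
def utf8_sequence_length(lead):
    """Return the length in bytes of the UTF-8 sequence that `lead` starts."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII, high bit not set
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx: one continuation byte follows
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx: two continuation bytes follow
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx: three continuation bytes follow
    # 10xxxxxx is a continuation byte; the remaining values are never valid leads.
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 lead byte")


first_byte = "\u0915".encode("utf-8")[0]  # first byte of the letter क
print(hex(first_byte), utf8_sequence_length(first_byte))
```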
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a
side-by-side comparison of a few will often reveal structural similarities even
in the modern letterforms. With minor historical exceptions, they are written
from left to right. They are all abugidas, in which most symbols stand for a
consonant plus an inherent vowel (usually the sound a). Word-initial vowels in
many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected
consonant. Absence of the inherent vowel, when that occurs, is frequently
marked with a special sign. In the Unicode Standard, this sign is denoted by
the Sanskrit word virama. In some languages another designation is preferred.
In Hindi, for example, the word hal refers to the character itself, and halant
refers to the consonant that has its inherent vowel suppressed; in Tamil, the
word puḷḷi is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character,
with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in
the south, and from Pakistan in the west to the easternmost islands of
Indonesia, are derived from the ancient Brahmi script. The oldest lengthy
inscriptions of India, the edicts of Ashoka from the third century BCE, were
written in two scripts, Kharoshthi and Brahmi. These are both ultimately of
Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written
from right to left, was supplanted by Brahmi and its derivatives. The
descendants of Brahmi spread, with myriad changes, throughout the subcontinent
and outlying islands. There are said to be some 200 different scripts deriving
from it. By the eleventh century, the modern script known as Devanagari was in
ascendancy in India proper as the major script of Sanskrit literature. The
northern branch of Brahmi-derived scripts includes such modern scripts as
Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as
Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all
encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which
facilitates transliteration to some degree, is based on the Indian national
standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala
has a virama-based model but is not structurally mapped to ISCII. Tibetan
stands apart, using a subjoined consonant model for conjoined consonants,
reflecting its somewhat different structure and usage. The Limbu script makes
use of an explicit encoding of syllable-final consonants. Many of the character
names in this group of scripts represent the same sounds, and naming
conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. The principles of the
Indic scripts are covered in some detail in this introduction to the Devanagari
script. The remaining introductions to the Indic scripts are abbreviated but
highlight any differences from Devanagari where appropriate.
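Because the script occupies one contiguous block, a program can recognize Devanagari text with a simple range test on code points. The sketch below is an illustration only; the function names are assumptions, and the block also contains signs such as the danda (U+0964), which this test counts as Devanagari.

```python
DEVANAGARI_FIRST = 0x0900
DEVANAGARI_LAST = 0x097F


def is_devanagari(ch):
    """True if the character lies in the Devanagari block U+0900 to U+097F."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST


def devanagari_ratio(text):
    """Share of non-space characters that belong to the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if is_devanagari(ch)) / len(chars)


# A mixed-script string: three Devanagari characters and three Latin letters.
print(devanagari_ratio("\u0915\u092E\u0932 web"))
```

A ratio like this is one simple way to separate Hindi-script pages or queries from Roman-script ones before further processing.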
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and
1986. The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic
scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada and Malayalam. This parallel code layout emphasizes the
structural similarities of the Brahmi scripts and follows the stated intention
of the Indian coding standards to enable one-to-one mappings between analogous
coding positions in different scripts in the family. Sinhala, Tibetan, Thai,
Lao, Khmer, Myanmar and other scripts depart to a greater extent from the
Devanagari structural pattern, so the Unicode Standard does not attempt to
provide any direct mappings for these scripts to the Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts
encoded with ISCII-1991 may be automatically converted to Unicode code points
and back to their original encoding without loss of information.
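The parallel code layout of the nine ISCII-based blocks means that an analogous letter in another script sits at the same offset within its own 128-position block. The sketch below uses that property for a naive script-to-script mapping. It is a rough illustration, not a transliteration system: the names `BLOCK_START` and `shift_script` are assumptions, and because not every position is assigned in every script, the result can contain unassigned code points.

```python
# First code point of each ISCII-based block in the Unicode Standard.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source, target):
    """Move characters of the source block to the same offset in the target block."""
    start = BLOCK_START[source]
    delta = BLOCK_START[target] - start
    shifted = []
    for ch in text:
        if start <= ord(ch) < start + BLOCK_SIZE:
            shifted.append(chr(ord(ch) + delta))
        else:
            shifted.append(ch)  # characters outside the source block pass through
    return "".join(shifted)


# Devanagari KA (U+0915) maps to Bengali KA (U+0995).
print(shift_script("\u0915", "devanagari", "bengali"))
```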
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built
on phonological principles and tends to correspond quite closely to
pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters
of the Devanagari script. These pieces consist of three distinct character
types: consonant letters, independent vowels and dependent vowel signs. In a
text sequence, these characters are stored in logical (phonetic) order [62].
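Logical order can be seen by listing the stored characters of a word. In हिन्दी the vowel sign ि is drawn to the left of ह, yet it is stored after it, and the conjunct न्द is stored as NA, VIRAMA, DA. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is an assumption for illustration.

```python
import unicodedata


def describe(text):
    """Pair each stored character's code point with its Unicode character name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The word हिन्दी, written with escapes so the stored order is explicit.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"
for code_point, name in describe(word):
    print(code_point, name)
```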
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in
their daily lives. They are sending and receiving e-mails, searching for
information, reading e-papers, blogging and launching Web sites in their own
languages. Two American IT companies, Microsoft and Google, have played a big
role in making this possible.
A decade ago, there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made
it impossible to read any Hindi document if the user did not have the same
fonts. There was chaos: more than 50 fonts and 20 keyboards were being used,
and if two users were following different styles, there was no way to read the
other person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode Consortium, a
nonprofit in California whose members include Google, IBM, Oracle, Microsoft,
Sun Microsystems, Yahoo and the Government of India, proved to be a boon for
Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its
operating system in 2001. "Since then, Hindi Unicode support has been a part of
all subsequent upgrades of Microsoft's operating systems. Also, providing Input
Method Editor facilities gives users the option to use different types of
keyboards," says Meghashyam Karanam, product manager, vision and localization,
at Microsoft India. The earlier system could incorporate only 128 characters,
which is not enough for the Hindi Devanagari script. The Unicode system can
incorporate more than 65,000 characters. As most computers in India use
Microsoft's operating system, this ensured that the Hindi font was available to
most of them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support for 13 Indian languages, such as Hindi, Tamil, Kannada,
Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in
Washington State, partnered with one of the early Hindi portals, webdunia.com,
to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi
version of Microsoft Office as well as for language interface packs," says
Jaideep Karnik, general manager for content and localization at webdunia.com.
The Indore, Madhya Pradesh-based company has an office in the United States and
helps major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the
California-based company has launched various products in the past two years.
With the Google Hindi and Urdu search engines, one can search all the Hindi and
Urdu Web pages available on the Internet, including those that are not in a
Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada,
Malayalam and Telugu to name a few, Gmail in five languages, and Google
transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that
Google has added to its offerings," says Rahul Roy-Chowdhury, product manager
at Google India. To use the search function, "users can type Hindi words in
Roman script, and a drop-down menu suggests several Hindi phrases. By selecting
the appropriate query, users can search for Hindi content without even typing
in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English
users. Google News is available in Hindi. With the Google translation engine,
one can type English words and get a list of suggested synonyms in Hindi. A
transliteration tool allows users to type any word in English, hit the space
bar and get the same word in a different language. Roy-Chowdhury explains the
process of adding a new language.
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language
is soon to be introduced in various other products," he adds.
The efforts of Microsoft, Google and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of
the popular Hindi newspapers have a Web presence now. "In the last two years,
page views of navbharattimes.com have increased significantly, and half of them
come through Google, as Net users generally search for a specific news item or
query," says Nagar. Yahoo, with headquarters in California, formed a
partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi
portal. "The Jagran relationship helps us gain significant traction among
Indian Internet users. From all the audience measures for this product, this
has been a resounding success," says Gopal Krishna, head of Yahoo's work for
emerging market audiences. Since Yahoo and Jagran started working together,
page views have "grown to about 14 million from one million a year and a half
earlier," says Upendra Swami, who heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering there
are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and
Urdu on the Internet? "The major challenge is Internet penetration and PC
prices. The moment we have better Internet penetration, especially in smaller
towns, and PC prices go down, Hindi and Indian languages can flourish on the
Net," says Karnik of webdunia.com. India had more than 49 million Internet
users in June 2008, out of which about 9 million used the Internet regularly,
according to a study by Juxtconsult India, a research company. "There is a big
opportunity in Indian languages. Studies showed that only 28 percent of Indian
Web surfers preferred English on the Web, but as good quality content in Indian
languages was not easily available, they did not visit many local language
sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree.
"Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of
over a billion people, English is spoken by less than 80 million people," says
Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is
not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we
look at how the Internet developed in the US, it may provide a useful analogy.
First came content, which was mostly produced by people who had a passion for
putting up content they cared about. Traffic and monetization was not the
motive. Second came growing readership, as people started discovering content.
This set off a virtuous cycle in which content eventually became a viable,
monetizable business. Third were the application developers, who could now
focus on moving the online experience beyond passive consumption of information
to interactivity, community building, service delivery and a host of other
innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a
long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in the year
1978. It was collected for a comparative study of American, British and Indian
English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal
agency for the development of Indian language corpora. It has coordinated with
various Indian agencies and universities to develop corpora of more than 45
million words in the Scheduled Languages of India, which is also a part of the
TDIL program. The Enabling Minority Language Engineering (EMILLE) program
provides the corpus architecture and tools for Asian languages. It has a
monolingual corpus of approximately 96,157,000 words and a parallel corpus
consisting of 200,000 words of text in English, which helps in translation
into Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus which is a repository of "One Million Pages" of
knowledge-based text. Mahatma Gandhi International University has started the
project "Hindi Samghraha" to build a repository of Hindi words and a database
and dialect mapping of Hindi. The Department of Information Technology of the
Government of India has started a project for developing Indian language
corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium
project for building parallel annotated corpora under the leadership of Dr.
Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also
English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date from the 1980s and
involved prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects
were undertaken by IIT Bombay, IIIT Hyderabad, AU-KBC Centre, Chennai, and
Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode
project for building computational tools and Sanskrit-Hindi MT under the
leadership of Amba Kulkarni (University of Hyderabad). The goal of this project
is to build children's stories using multimedia and e-learning content.
4.5.2 Anglabharati
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R. M. K. Sinha.
It is a rule-based system and has approximately 1,750 rules and 54,000 lexical
words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application that is available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with new
approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation (FGH-MT) system. Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for the translation of Government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra software
is available in desktop, network and web-based forms. It is based on the
Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as
well as the Hindi grammar. Initially it was domain specific, covering Personal
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present, it
also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it
is also available for English to other Indian languages such as Gujarati,
Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha
Secretariat of the Rajya Sabha (the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine
translation system for English to Hindi. UNL is a United Nations project for
developing an interlingua for the world's languages. UNL-based machine
translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad University
has developed an English-Kannada MT system. It is based on the transfer
approach and Universal Clause Structure Grammar (UCSG). This project is funded
by the Karnataka Government and is applicable in the domain of government
circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user
is not satisfied with the translated sentence generated by the system, the user
can supply new words, phrases and sentences as feedback and obtain a newly
interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti is a rule-based system that also uses statistical approaches. It is used
for the translation of English to Indian languages (Hindi, Marathi and Telugu).
Users can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai has
developed a machine-aided Tamil to Hindi translation system. The system is
based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has a small set of transfer rules. Users can
access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali.
It is developed in the Perl programming language, which supports the processing
of Unicode-encoded text. The system uses the Penn Treebank annotation system
for part-of-speech tagging. It translates an English sentence into
Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala designed a Punjabi to
Hindi machine translation system. The system is built on the paradigm of
foreign machine translation systems such as RUSLAN and CESILKO. The system
architecture consists of three processing modules: Pre Processing, Translation
Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine. It was founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query it searches deep into Indian language content and tries to return
appropriate results. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business,
education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such as
Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides automated translation from English to Indian languages.
The Google Transliteration Input Method Editor is currently available for
languages including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer
applications. The tool supports major Indian languages such as Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool
for translating Wikipedia pages into other languages. The source language in
WikiBhasa is English and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of Indian
language technology in areas such as text translation, software localization
and website localization. It is also involved in research and development in
corpus creation and collection and in content syndication. Moreover, it
provides language consultancy. It has developed various applications in Indian
languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games,
Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement technology
to many state governments and to the central government for e-governance
programs. It has developed software for multilingual content creation for
newspaper publishing, and high-quality Unicode-based fonts for major Indian
languages. For Gujarati it has developed the Shree-Lipi Gujarati package, which
is useful in the DTP sector, in corporate offices and in the e-Governance
program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has been aware of this need. Since 1970 the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script Code
for Information Interchange) was developed for Indian languages on the pattern
of ASCII (American Standard Code for Information Interchange). In addition,
Indian languages Transliteration (ITRANS), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005).
The Department of Information Technology under the Ministry of Communication
and Information Technology is also working towards the proliferation of
language technology in India. Other Indian government ministries, departments
and agencies, such as the Ministry of Human Resource Development, DRDO (Defence
Research and Development Organisation), the Department of Atomic Energy, the
All India Council for Technical Education and the UGC (University Grants
Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide research funds to development agencies. As one
result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very fast.
In the last year alone, probably more than 10 to 15 Indian local search engines
were launched, some smaller and some bigger, some with huge funding and some
with none. This space is so crowded right now that it is difficult to know who
is really winning. We nevertheless attempt a brief overview of the current
scenario. The following fall into the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol,
burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along
with Ask Laila, which launched a couple of days ago. There are also localized
versions of the big players Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition). It is too early to pass judgment on any of them. These
are testing stages, and every start-up is adding new features and improving its
services.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult (2008-2009 Report)
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation that applied popular present-day information
access techniques shows a severe problem of low recall when information is
accessed with Indian language queries. For instance, popular web search engines
such as Google, Yahoo and Guruji often return 0 search results for Indian
language queries, giving the impression that no documents containing the
information exist. In reality, these search engines face a recall problem with
Indian languages because of multiple spellings, morphological variants of
keywords, and English keywords written in Hindi. Table 4.15 illustrates a few
such cases. For example, the Hindi query for "world trade center aatank-waadi
hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return 0 documents in
Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that
documents containing these keywords are found by the second search. Simply
saying that there is a recall problem is not sufficient. The obvious next
question is how big the problem is; in other words, the problem needs to be
quantified. For this purpose we conducted many experiments to determine at what
levels this recall problem occurs, and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological
structure and a well-defined grammar. In practice, however, grammatical and
structural standards are often not followed, for various reasons. One reason is
the language diversity of India. Including Hindi, about 28 languages are spoken
in India, and Hindi, being the official language of India, is influenced by the
regional languages, which changes its dialects not only in speech but also in
writing. Every language uses markers that attach to a root word to form new
words: English uses s, es and ing, and Hindi uses matras and suffixes such as
एँ and ओं. For example, योजनाओं (Yojnaaon) and योजनाएं (Yojnaayein) are
morphological variants of the root word योजना (Yojnaa, "planning" in English).
It is desirable to combine all the morphological variants of a word into a
single canonical form. This process is called word stemming, and the canonical
form is called the root word or base word.
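The stemming idea described above can be sketched as a light suffix stripper. This is a minimal illustration, not the method used in the experiments of this chapter: the suffix list, the minimum stem length and the function name `light_stem` are all assumptions chosen for the example.

```python
import unicodedata

# Illustrative plural/oblique suffixes (an assumption, not from the chapter),
# ordered longest first so the longest matching suffix is removed.
SUFFIXES = sorted(["ियों", "ओं", "एं", "एँ", "ों", "ें"], key=len, reverse=True)

MIN_STEM_LENGTH = 2  # hypothetical guard against stripping very short words


def light_stem(word: str) -> str:
    """Strip one known suffix from a Hindi word, if the remainder is long enough."""
    word = unicodedata.normalize("NFC", word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएं"]:
        print(variant, "->", light_stem(variant))
```

A search system could apply the same function to both indexed terms and query terms so that all variants meet at one key. A plain suffix list will over-strip or under-strip some words, so a production stemmer would need a curated suffix table or a morphological analyzer.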
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, the transliteration of proper
names, words borrowed from regional and foreign languages, and the phonetic
variety of the Indian language alphabet. Together these factors produce several
spellings of the same word. For example, the Hindi word अ गर ज (angrējī, meaning
English) has several possible spelling variations.
Numerous words are phonetically equivalent but vary in writing. The word
"school" can be written in Hindi in different ways (सक र सक र सक र). When
information is searched using the standard keyword for school (सक र) and a
non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results
for the former and 14 million for the latter. Hindi is also influenced by other
regional languages, which results in phonetic variety of words: for example,
the English word school (सक र in Hindi) is pronounced and written as ISKOOL
(इसक र) by the majority of the population in different states of India. For the
Hindi word ISKOOL (इसक र), more than two thousand results are found. Search
engines should be able to retrieve results for words that are phonetically
equivalent to the keywords entered. A user may type any spelling of a keyword,
and the search engine should support all phonetically equivalent forms.
Also, no particular standard exists for writing keywords to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces a different
result set, i.e. a different set of documents is retrieved, with little
overlap. A native Hindi user may not be aware of these phonetic issues in Hindi
IR and may miss information relevant to his or her needs.
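One common way to make phonetically equivalent spellings match is to map each word to a normalized key before indexing and querying. The sketch below is an illustration under stated assumptions, not a technique evaluated in this chapter: the three rules (drop the nukta, treat chandrabindu as anusvara, write a half nasal consonant before another consonant as anusvara) and the names `phonetic_key` and `group_variants` are chosen for the example.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # halant

# A nasal consonant (ङ ञ ण न म) plus virama, directly before another consonant.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word: str) -> str:
    """Return a matching key that conflates some common Hindi spelling variants."""
    text = unicodedata.normalize("NFD", word)   # split precomposed nukta letters
    text = text.replace(NUKTA, "")              # ज़ -> ज
    text = text.replace(CHANDRABINDU, ANUSVARA)  # ँ -> ं
    return NASAL_CLUSTER.sub(ANUSVARA, text)    # न्द -> ंद


def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = {}
    for word in words:
        groups.setdefault(phonetic_key(word), []).append(word)
    return groups


if __name__ == "__main__":
    print(group_variants(["हिन्दी", "हिंदी", "आँदोलन", "आंदोलन", "आन्दोलन"]))
```

The key is meant for matching only, not for display. These rules conflate some genuinely different words as well, and they do not cover vowel-length or inserted-vowel variants such as the ISKOOL form mentioned above, which would need additional rules.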
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning. A word often has near synonyms that
differ from it solely in these nuances of meaning. Choosing the right word can
be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
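A simple way for a retrieval system to cope with synonyms is to expand the query into one variant per synonym. The sketch below uses the ornament example from this section; the Devanagari spellings are our reading of that example, and the dictionary `SYNONYMS` and the function `expand_query` are hypothetical names. A real system would draw synonyms from a resource such as a Hindi wordnet.

```python
# Tiny illustrative synonym table (an assumption; only one entry is filled in).
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str) -> list:
    """Return the original query plus one variant per synonym of each word."""
    words = query.split()
    variants = [query]
    for index, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            replaced = words[:index] + [synonym] + words[index + 1:]
            variants.append(" ".join(replaced))
    return variants


if __name__ == "__main__":
    for variant in expand_query("आभूषण"):
        print(variant)
```

Each variant can be submitted separately and the result lists merged. The sketch substitutes words as they are, so it does not adjust gender or number agreement in multi-word queries.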
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples below show
this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous because it has another meaning,
"to sleep". In the context of the above query, सोना means gold, but the query
can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means "common"; however,
in Hindi it also means "mango", so the query can be interpreted as "mango is
people's choice".
Many words are polysemous. Finding the correct sense of a word in a given
context is an intricate task, because a word can have more than one meaning and
its meaning depends on the context of the sentence. For example, कय means tax
in one context (synonyms बम ज श लक स द भहस ऱ ट कस), hand or arm in another
context (synonyms हसत फ ह आच शफय), and to do (कयन) in a third context.
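The dependence of meaning on context can be illustrated with a very small sense picker that counts how many cue words of each sense appear in the query. This is only a sketch: the cue-word sets in `SENSES` are invented for the example, and the function `pick_sense` is hypothetical. It is a simplified overlap heuristic, not a disambiguation method tested in this chapter.

```python
# Hypothetical cue words for the two senses of सोना (gold / to sleep).
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहना", "चांदी", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर", "जल्दी"},
    },
}


def pick_sense(keyword: str, query_words: list):
    """Return the sense with the most cue words in the query, or None if unclear."""
    context = set(query_words) - {keyword}
    scores = {
        sense: len(cues & context)
        for sense, cues in SENSES.get(keyword, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    # No cue words matched, or several senses tie: the query stays ambiguous.
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


if __name__ == "__main__":
    print(pick_sense("सोना", "रात को जल्दी सोना".split()))
    print(pick_sense("सोना", "नारी को सोना पसंद है".split()))
```

The second call returns None because the query contains no cue word for either sense, which mirrors the ambiguity described in this section: without extra context, neither a person nor a search engine can tell which meaning is intended.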
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Indians are
sometimes unable to find the Hindi equivalent of an English word. For instance,
words such as road, bus, pen, television, radio, please, rail, email, password,
insurance, internet, director and department are used even by Indians with
little formal education, without awareness of the language these words come
from. Most Indians use these words in English rather than in their native
language. English influences Hindi not only in speech but in writing too, and
this becomes especially evident in Hindi literature on the web. The influence
of English on Hindi has been observed to be one of the most important
parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from this set, which throw light on
the effect of the root word on the performance of Hindi language search
engines. Table 4.18 shows examples of the effect of morphological factors on
Hindi queries.
Table 4.17 List of Hindi queries
S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
For each root word, the first row gives the root word(s) followed by the
listing of keywords shown by Google, Bing and Guruji. The numbered rows give
each morphological variant with the documents returned.
S.No | Morphological variant | Google | Bing | Guruji
1 | Root words and keyword listings: ब यत वष मवन ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 | Root word and keyword listings: द घमटन द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3 | Root word and keyword listings: ब ष ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4 | Root word and keyword listings: झ र झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5 | Root word and keyword listings: आऩद आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6 | Root words and keyword listings: ऩ परज तत ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | Root word and keyword listings: सभसम सभसम ए सभसम ओ सभसम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8 | Root word and keyword listings: कीटन शक कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9 | Root word and keyword listings: य ग य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10 | Root word and keyword listings: म जन म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11 | Root word and keyword listings: क दर क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105
Table 4.19 Precision values of the three search engines
S.No | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (chart of P Google, P Bing and P Guruji by query number, vertical axis from 0 to 1)
All three search engines return more documents when the query is submitted
with the root word. This supports searching with keywords in their root form,
because in general better results are obtained that way.
It has also been observed that only Google lists morphological variants of
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root
form. Bing and Guruji do not appear to index in that form, which would explain
why they retrieve fewer documents than Google. The overall comparison of
results from the three search engines in the tables above shows that, in
general, the quantity of results retrieved increases when keywords are used in
their root form. For search engines, however, the quality of results is more
important than the quantity. Figure 4.1 and Table 4.19 compare the precision
values of the three search engines. The precision value is calculated from the
top 10 results of each search engine. On close observation, the precision value
for Google is high in almost all queries. Since Google appears to index
keywords in root form, the relevance of its results is also higher than that of
the other two search engines. This indicates that morphological variation in
keywords affects not only the quantity but also the quality of results.
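Precision over the top 10 results, as used above, is the fraction of the first 10 results that are judged relevant. The sketch below assumes that relevance judgments are available as a list of booleans in rank order; the function name `precision_at_k` and the sample judgments are illustrative and are not data from this study.

```python
def precision_at_k(judgments, k=10):
    """Fraction of the top-k results judged relevant.

    judgments: list of booleans in rank order (True = relevant).
    Results missing from a short list count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = sum(1 for judged in judgments[:k] if judged)
    return relevant / k


if __name__ == "__main__":
    # Hypothetical judgments for one query: 9 of the top 10 are relevant.
    sample = [True] * 9 + [False]
    print(precision_at_k(sample))
```

Dividing by k rather than by the number of results returned means that an engine returning only a few documents is not rewarded for a short list, which matters when comparing engines with very different index sizes.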
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be able to retrieve results for words that are
phonetically equivalent to the keywords entered. A user may type any spelling
of a keyword, and the search engine should support all phonetically equivalent
forms.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision obtained.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Each Hindi query (standard keywords in bold in the original) is followed by
the phonetic variations of its keywords and by the Google results for each
keyword combination.
Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
Keywords | No. of results | Precision@10
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9
Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
Keywords | No. of results | Precision@10
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7
Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
Keywords | No. of results | Precision@10
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8
Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
Keywords | No. of results | Precision@10
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0
Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
Keywords | No. of results | Precision@10
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)
The table and figure above show that the search engine returns only a handful
of documents for several phonetically equivalent Hindi queries. No particular
standard exists for writing keywords to fetch Hindi web data. Each phonetically
equivalent keyword in a query produces a different result set, i.e. a different
set of documents is retrieved, with little overlap. The precision charts show
that the degree of relevance for queries containing phonetically equivalent
keywords is almost the same. A native Hindi user may not be aware of these
phonetic issues in Hindi IR and may miss information relevant to his or her
needs.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning. A word often has near synonyms that
differ from it solely in these nuances of meaning. Choosing the right word can
be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values obtained from
three search engines.
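Table 4.21 Effect of word synonyms on Hindi IR
For each query, the first row gives the query and its standard Hindi word with
synonyms. The numbered rows give the documents returned and the precision over
the top 10 results (P@10) for each search engine.
S.No | Query | Google | P@10 | Bing | P@10 | Guruji | P@10
1 | Query: स न क आब षण; standard Hindi word and synonyms: आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | Query: क र फ दर; standard Hindi word and synonyms: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | Query: सतर सशिकतकयण; standard Hindi word and synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | Query: मसक दयक अ हक य; standard Hindi word and synonyms: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | Query: व रग ओ; standard Hindi word and synonyms: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | Query: आ खद न; standard Hindi word and synonyms: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1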
Figure 4.3 Comparison of precision values against three search engines
The examples above show that using Hindi keywords together with their synonyms
improves information retrieval for a query in Hindi. Using synonyms of Hindi
keywords affects not only the quantity of documents returned but also their
quality.
The table and figure above show that Google returns more documents than the
other two search engines and that Guruji returns the fewest; the reason may be
the availability of fewer documents or poor indexing. However, we are more
interested in the quality of results than in the quantity. As far as quality is
concerned, Google and Bing clearly provide better results than Guruji. In the
average case Google still ranks first, i.e. its precision values are higher
than those of Bing and Guruji. It is thus clear that changing a keyword into
its synonym yields comparable results, and that synonyms of keywords play an
important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected
queries below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns give the meaning of the query in English with respect to the
ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S.No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war
The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other Context", holds the number of results that are either not relevant to the query or contain the keyword in some other, unintended context; it therefore measures the deviation from relevance. Because every engine returns documents in several contexts, search engines can be said to underperform when supplied with ambiguous queries. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For the query सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art" rather than in the other context (snake charmer's snake head). Google, however, returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. Search engines therefore need to address the ambiguity of keywords in order to return better results.
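The counts in the three ambiguity tables can be condensed into one figure per engine. The following is a minimal sketch, assuming that the first listed sense of each keyword is the intended one and that Guruji's empty result for query 4 counts as zero matching documents; the names `RESULTS` and `intended_share` are illustrative and not part of the study.

```python
# (intended sense, second sense, other context) counts among the first 10
# results, copied from the ambiguity tables for queries 1 to 5.
RESULTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    # Query 4 returned no results on Guruji; recorded here as all zeros.
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), (0, 0, 0), (0, 0, 10)],
}


def intended_share(rows, examined=10):
    """Mean fraction of the examined results that match the intended sense."""
    return sum(intended for intended, _, _ in rows) / (examined * len(rows))


for engine, rows in RESULTS.items():
    print(f"{engine}: {intended_share(rows):.2f}")
```

Under these assumptions the share of results in the intended sense stays well below one half for every engine, which is the underperformance the discussion above describes.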
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern, a variety of popular keywords and queries from various domains have been tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
Each row lists four Hindi forms in the order Google transliteration, standard Hindi keyword, phonetic equivalent, phonetic equivalent; the four Google result counts follow the same order.
English word | Hindi forms | Google results
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
From the above table it can be observed that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, and in large numbers. It can also be seen that people have their own way of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents have nevertheless been retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.
Table 4.26 (continued)
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query. The queries are converted at two levels: at the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced, again without changing the meaning of the original Hindi query.
Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:
Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
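The two-level transformation described above can be sketched as a dictionary lookup. This is a minimal illustration and not the software described in the next chapter: the `EQUIVALENTS` mapping, the function name `transform` and the Devanagari spellings of the English words are assumptions made for this sketch.

```python
# Hypothetical mapping from Hindi keywords to English words written in
# Devanagari; the spellings are this sketch's own choice.
EQUIVALENTS = {
    "विदेशी": "फॉरेन",        # foreign
    "निवेश": "इन्वेस्टमेंट",  # investment
}


def transform(query, equivalents, max_replacements):
    """Replace up to max_replacements Hindi keywords, scanning left to right."""
    replaced = 0
    words = []
    for word in query.split():
        if word in equivalents and replaced < max_replacements:
            words.append(equivalents[word])
            replaced += 1
        else:
            words.append(word)
    return " ".join(words)


query = "विदेशी निवेश भारत में"            # videshi nivesh Bharat mein
level_1 = transform(query, EQUIVALENTS, 1)  # one keyword replaced
level_2 = transform(query, EQUIVALENTS, 2)  # more than one keyword replaced
print(level_1)
print(level_2)
```

A real tool would also need to cover the phonetic variants of each English word, since Table 4.26 indicates that several spellings of the same word coexist on the web.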
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation, Google search results)
In English | Hindi query | Level 1 | Level 2 | Results (Hindi query) | Results (Level 1) | Results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

[Figure 4.4 Transformed queries into two equivalent senses: bar chart of Google result counts for Hindi Query 1 to Hindi Query 6, each shown with its Level 1 and Level 2 variants.]
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity. Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
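Table 4.28 Analysis of precision values (tabular representation, precision at 10)
Variant | Query | Google | Bing | AltaVista
Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
Level 2 | क रड मम र िजसट | 1 | 1 | 1
Hindi query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
Hindi query | भरषटट च यभ कतब यत | 1 | 1 | 1
Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1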
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be seen that Hindi data of a similar nature can be mined against Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns: for most Hindi queries and their transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant; a similar pattern holds for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the English-influenced variant of the query. It can be said from the above table that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
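The precision values discussed above are precision at 10: the fraction of the first ten results judged relevant. A minimal sketch of that measure follows; the function name `precision_at_k` and the list of judgments are illustrative, and the judgments simply encode the "9 of the first 10 documents are relevant" case from the text.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k results judged relevant.

    `relevance` is a list of booleans in rank order. Dividing by k (not by
    the number of results returned) penalises engines that return fewer
    than k documents.
    """
    return sum(relevance[:k]) / k


# Nine relevant documents among the first ten.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```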
[Figure 4.5: inter-precision chart with one series each for Google, Bing and AltaVista; horizontal axis HQ1 to HQ6, each followed by L1 and L2, where HQ = Hindi Query and L = Level. The plotted values are those of Table 4.28.]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi orthography, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. This problem can, however, be solved at the interface level. To lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].
42202 Conjuncts
Since any consonant that is not explicitly followed by a vowel symbol is implicitly followed by the inherent vowel अ, Devanagari provides two means of suppressing the inherent vowel:
The halant (्), a diacritical subscript, e.g. क्.
A conjunct, a ligature synthesized by conjoining two consonant symbols. This method is much more common. The halant is typically used only when typographical difficulties make it hard to use conjuncts.
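In Unicode text both means are stored the same way: the first consonant is followed by the virama (halant) character U+094D, and the font decides whether to draw a ligature or a visible halant. The sketch below illustrates this with the standard library; the helper name `conjunct` is an assumption of the example.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, i.e. the halant


def conjunct(first, second):
    """Join two consonants with a virama so that they can render as a conjunct."""
    return first + VIRAMA + second


nda = conjunct("\u0928", "\u0926")  # NA + VIRAMA + DA
for ch in nda:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The conjunct therefore occupies three code points in storage even though it is displayed as one unit, which matters when queries are compared character by character.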
42203 Horizontal Conjuncts
Horizontal conjuncts are formed when the first letter of a conjunct contains a vertical line. The vertical line is deleted and the modified consonant symbol is then conjoined to the second consonant symbol. For example:
न + द = न्द (हिन्दी)
च + छ = च्छ (अच्छा)
स + त = स्त (नमस्ते)
ल + ल = ल्ल (बिल्ली)
म + ब = म्ब (लम्बा)
फ + त = फ्त (मुफ्त)
क + य = क्य (क्यों)
Note that in the last two examples, although neither क nor फ ends in a vertical line, they can still be the first letter of a horizontal conjunct. The curve on the right side is shortened and adjoined to the following consonant.
42204 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts with the following consonant. The first consonant is written on top of the second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
42205 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant is the first member of a conjunct, it may be written either as a regular conjunct (e.g. न + द = न्द, हिन्दी) or with an anusvar, which is a dot written above the horizontal line, to the right side of the preceding consonant or vowel. For instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant; in other words, it is articulated at the same location in the mouth as the following consonant. Thus the anusvar in हिंदी must represent न, which is a dental nasal consonant, since द, the following letter, represents a dental consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The anusvar represents a consonant that is the first letter of a conjunct, whereas the bindu and chandrabindu represent the nasalization of a vowel. The bindu in हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in हिंदी is not considered a bindu, since it represents a consonant that is the first member of a conjunct.
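Because the same word may be spelled with a nasal conjunct or with an anusvar, a retrieval system may wish to map both spellings to one form before matching. The following is a minimal sketch of such a normalization, assuming that only the five nasals followed by a stop of their own class should be rewritten; the names `VARGAS` and `to_anusvara` are illustrative, and this is not the normalization used by any particular search engine.

```python
import re

VIRAMA = "\u094d"
ANUSVARA = "\u0902"

# Each nasal consonant mapped to the stops articulated at the same place.
VARGAS = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # NGA before velar stops
    "\u091e": "\u091a\u091b\u091c\u091d",  # NYA before palatal stops
    "\u0923": "\u091f\u0920\u0921\u0922",  # NNA before retroflex stops
    "\u0928": "\u0924\u0925\u0926\u0927",  # NA before dental stops
    "\u092e": "\u092a\u092b\u092c\u092d",  # MA before labial stops
}


def to_anusvara(text):
    """Rewrite nasal + virama before a homorganic stop as an anusvar."""
    for nasal, stops in VARGAS.items():
        text = re.sub(f"{nasal}{VIRAMA}(?=[{stops}])", ANUSVARA, text)
    return text


hindi_conjunct = "\u0939\u093f\u0928\u094d\u0926\u0940"  # spelled with NA + VIRAMA
hindi_anusvara = "\u0939\u093f\u0902\u0926\u0940"        # spelled with anusvar
print(to_anusvara(hindi_conjunct) == hindi_anusvara)
```

Normalizing towards the anusvar is chosen here because it needs no knowledge of which nasal the dot stands for; the reverse direction would have to inspect the following consonant.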
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट ठ ड ढ, do not have any vertical line, so they indicate a following र with a symbol like an inverted v, as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit:
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, so a maatraa that is written to the left of a consonant is placed before the entire conjunct. There are hundreds of conjuncts, but most are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography, periods are also used in place of the viraam [59][60].
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters of the English language (it can store at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode defines a codespace of 1,114,112 code points (U+0000 to U+10FFFF), which can give a unique number to each character in all languages known to man.
There is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be stored or transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit code units). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
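The difference between the three encoding forms is easy to observe for a Devanagari letter. The following sketch, which uses only Python's built-in codecs, reports how many bytes one character occupies in each form; the helper name `encoded_sizes` and the choice of sample characters are assumptions of this example.

```python
def encoded_sizes(ch):
    """Number of bytes one character occupies in each Unicode encoding form."""
    # The "-be" codecs are used so that no byte order mark is added.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


samples = {
    "Latin A (U+0041)": "A",
    "Devanagari KA (U+0915)": "\u0915",
    "Brahmi block, outside the BMP (U+11013)": "\U00011013",
}
for label, ch in samples.items():
    print(label, encoded_sizes(ch))

# All three forms are lossless: decoding returns the original character.
assert "\u0915".encode("utf-8").decode("utf-8") == "\u0915"
```

For Hindi text this means three bytes per character in UTF-8 but two in UTF-16, which is one reason storage estimates for Hindi corpora differ by encoding.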
4.3.2 UTF-8
UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme of 2 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current definition of UTF-8 stops at 4); the most commonly used characters, including all Devanagari characters, need at most three bytes. Non-ASCII characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under the ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. The remaining bits of the first byte and the low six bits of each following byte carry the code point of the character [61].
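How the first byte announces the length of a sequence can be shown with a small function. This is a sketch of the UTF-8 lead-byte ranges as currently defined (sequences of at most four bytes); the function name `utf8_sequence_length` is illustrative.

```python
def utf8_sequence_length(lead_byte):
    """Total length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII, single byte
        return 1
    if 0xC2 <= lead_byte <= 0xDF:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead_byte <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead_byte <= 0xF4:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


encoded = "\u0915".encode("utf-8")  # Devanagari KA
print(utf8_sequence_length(encoded[0]), [f"{b:08b}" for b in encoded])
```

Every byte after the first has the bit pattern 10xxxxxx, so a decoder that lands in the middle of a character can recognise a continuation byte and resynchronise.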
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound a). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred: in Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. The northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
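The parallel code layout means that a character in one of these blocks can often be mapped to its analogue in another block by adding a constant offset. The sketch below maps Devanagari to Bengali (whose block starts at U+0980) in this way. It is an illustration of the layout only, under the assumption that an unassigned target position should leave the character unchanged; it is not a linguistically complete transliteration, and the function name is hypothetical.

```python
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
BENGALI_OFFSET = 0x0980 - 0x0900  # distance between the two block origins


def devanagari_to_bengali(text):
    """Shift each Devanagari character to the same position in the Bengali block."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + BENGALI_OFFSET)
            # Not every position is assigned in both blocks; keep the
            # original character when the Bengali position has no name.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


converted = devanagari_to_bengali("\u0915\u092e")  # KA, MA
print([unicodedata.name(ch) for ch in converted])
```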
In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
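Logical order can be seen in a syllable whose vowel sign is drawn to the left of the consonant. In the sketch below, the syllable "hi" is displayed with the vowel sign first but stored with the consonant first; the helper name `describe` is an assumption of this example.

```python
import unicodedata


def describe(text):
    """List each stored code point with its Unicode character name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The vowel sign I is rendered to the left of HA, yet it follows HA in storage.
syllable = "\u0939\u093f"
for code_point, name in describe(syllable):
    print(code_point, name)

# The independent vowel I, used word-initially, is a separate character.
print(describe("\u0907"))
```

For retrieval this means that string comparison, sorting and tokenization operate on the phonetic sequence and never need to account for the visual reordering done by the font.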
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.
A decade ago there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." The advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium (a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India) proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could accommodate only 128 characters, which is not enough for the Devanagari script of Hindi, whereas Unicode provides more than a million code points. As most computers in India use Microsoft's operating system, this ensured that the Hindi font became available to most of them as they upgraded the operating software.
In 2004 the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. The Hindi version of Microsoft Office 2007 is now also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support for 13 Indian languages, including Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The company, based in Indore, Madhya Pradesh, has an office in the United States and helps major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.
―Google offers products first in Google Labs and waits for feedback from users
for a couple of months Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail Search
Blogger Translate and Orkut to name a few ―Urdu is currently available in
Googlelsquos transliteration offering on the Google Labs Web site and the language is
soon to be introduced in various other products he adds The efforts of
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 109
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10800 articles. What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, out of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life. But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
US, it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership, as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri
at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in 1978. It was collected for a comparative study of
American, British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language corpora.
It has coordinated with various Indian agencies and universities to
develop corpora of more than 45 million words in the Scheduled Languages of India,
work which is also part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides corpus architecture and tools for Asian
languages. It has a monolingual corpus of approximately 96157000
words and a parallel corpus consisting of 200000 words of English text, which
helps in translation involving Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' to build a repository of Hindi words and a dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date back to 1980 and
involved prominent institutions such as IIT Kanpur, University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new
projects were undertaken by IIT Bombay, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University Kolkata. TDIL started a
consortium-mode project in April 2008 for building computational tools and a
Sanskrit-Hindi MT system under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is
a rule-based system with approximately 1750 rules and 54000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. The Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It aims to be a fully automatic,
general-purpose, high-quality machine translation system (FGH-MT). Its software
translates text from one Indian language into another, based
on Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology, Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was initiated by the Indian
Government in 1996 for translation of Government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network and web-based forms. It uses the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar. Initially it was specific to the domain of Personal
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars. The domains were gradually expanded, and at present
it also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology handled only English to Hindi translation, but it is
now also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Secretariat of the
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The UNL-based
machine translation system is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applied in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system that provides a feedback facility to the
user. If the user is not satisfied with the translated sentence generated by the
system, the user can supply new words, phrases and
sentences as feedback and obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical methods. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports processing
of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates English
sentences into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre-Processing,
Translation Engine and Post-Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine: Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the backing of Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms. For
any query, it searches deep into Indian-language content and tries to return
appropriate results. The Guruji search engine covers a range of specific content: news,
entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and also provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development on corpus creation/collection and content syndication. Moreover, it
provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government for e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers, and has also developed high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gujarati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has long been aware of the need for language technology. Since 1970 the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In addition,
Indian Languages Transliteration (ITRANS) was developed by Avinash Chopde;
ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et
al., 2005). The Department of Information Technology under the Ministry of
Communications and Information Technology is also working towards the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the Department of
Atomic Energy, the All India Council for Technical Education and UGC (University Grants
Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide research funds to development agencies. As one
result, IndoWordNet was developed for the Indian languages on the pattern of the
English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The India-centric localized search engine market is saturating fast, real fast. In
the last year alone, more than 10-15 Indian local search engines must have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to put forth a brief overview of the current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. We also
have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition). It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when information is accessed using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem when dealing with Indian languages because of multiple spellings,
morphological variants of keywords, and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, the Hindi query for "world trade center
aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) returns
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that documents containing these keywords are found by the second search. Merely
saying that there is a recall problem is not sufficient. The next obvious question is
how much of a problem it is. In other words, the problem needs to be
quantified. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs and by how much.
Hindi Query                                              Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                          0        0       0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च               0        0       0
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query                                              Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                              8820     92      12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                              331      10      1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च               708      50      1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध                     37100    7400    93
Table 4.16 Improved Recall
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards of the language are often not followed, for various reasons. One
reason is the language diversity of India. Including Hindi, about 28
languages are spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which results in dialectal variation not only in
speech but also in writing. Every language attaches markers to a root word to
construct new words: English uses s, es and ing, and Hindi uses
matras (vowel signs) and suffixes. For example, Yojnaaon (योजनाओं) and
Yojnaayein (योजनाएँ) are morphological variants of the root word
Yojnaa (योजना, "plan" in English). It is desirable to reduce all the
morphological variants of a word to a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
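The stemming step described above can be sketched as a small suffix-stripping routine. This is a minimal illustration only: the suffix rules, the minimum stem length and the function name `stem` are assumptions made for this example, not the rule set of any search engine discussed in this chapter.

```python
import unicodedata

# (suffix, replacement) pairs. This is a small illustrative subset chosen
# for the examples in the text, not a complete Hindi rule set.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ियाँ", "ी"),  # नदियाँ -> नदी
    ("ओं", ""),     # योजनाओं -> योजना
    ("एँ", ""),     # योजनाएँ -> योजना
    ("एं", ""),     # variant spelling with anusvara
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),
]
# Try the longest suffixes first so that "ियों" wins over "ों".
SUFFIX_RULES.sort(key=lambda rule: len(rule[0]), reverse=True)

# Assumed guard against over-stemming very short words such as "में".
MIN_STEM_LENGTH = 2


def stem(word):
    """Return an approximate root form by stripping one plural suffix."""
    word = unicodedata.normalize("NFC", word)
    for suffix, replacement in SUFFIX_RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
            return word[:remaining] + replacement
    return word


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएँ"]:
        print(variant, "->", stem(variant))
```

Applying `stem` to both the indexed terms and the query terms lets the variants of one root match each other, at the cost of occasionally merging unrelated words that happen to share a suffix.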
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages, multiple dialects, transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabets. Together, these factors have
resulted in spelling variations of the same word. For example, the Hindi word
अंग्रेजी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word school can be written in Hindi in at least three different ways.
When information is searched for the standard keyword स्कूल (school) and for a
non-standard Hindi phonetic equivalent, Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by
the other regional languages, which results in phonetic variety of words. For
example, the English word school (स्कूल in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by the majority of the population in different states of India. For the
Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search
engines should be capable of retrieving results for words phonetically equivalent
to the keywords entered. A user may use any keyword for searching,
and search engines should be able to support all phonetically equivalent
words.
Also, no particular standard exists for writing the keywords used to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces different
results, i.e. a different set of documents is retrieved with little overlap.
A native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
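One way to make phonetically equivalent spellings match is to map each keyword to a normalized key before indexing and searching. The sketch below is illustrative and rests on assumed rules (drop the nukta, treat chandrabindu as anusvara, and rewrite a nasal consonant plus virama before a stop consonant as anusvara); the function name `normalize_spelling` is hypothetical, and a production system would need a broader and carefully evaluated rule set.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot that distinguishes e.g. ज़ from ज
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्

# A nasal consonant + virama directly before a stop consonant is commonly
# also written as an anusvara (आन्दोलन / आंदोलन). Geminates such as न्न
# are left alone because the lookahead only accepts stop consonants.
NASAL_BEFORE_STOP = re.compile(
    "[ङञणनम]" + VIRAMA + "(?=[कखगघचछजझटठडढतथदधपफबभ])"
)


def normalize_spelling(word):
    """Map common spelling variants of a Hindi word to one key."""
    word = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    word = word.replace(NUKTA, "")               # ज़ -> ज
    word = word.replace(CHANDRABINDU, ANUSVARA)  # आँदोलन -> आंदोलन
    word = NASAL_BEFORE_STOP.sub(ANUSVARA, word) # आन्दोलन -> आंदोलन
    return unicodedata.normalize("NFC", word)


if __name__ == "__main__":
    for variant in ["आन्दोलन", "आंदोलन", "आँदोलन"]:
        print(variant, "->", normalize_spelling(variant))
```

The trade-off is that such rules also merge a few genuinely different words (for example, forms distinguished only by chandrabindu versus anusvara), so the normalized key is best used to widen recall while the original spelling is kept for ranking.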
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
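A retrieval system can compensate for synonym choice by expanding the query. The following sketch assumes a tiny hand-written synonym table; the table contents and the function name `expand_query` are hypothetical and serve only to illustrate the idea. A real system would draw synonyms from a lexical resource such as a Hindi WordNet.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    terms = query.split()
    expanded = [query]
    for position, term in enumerate(terms):
        for alternative in synonyms.get(term, []):
            variant = terms[:position] + [alternative] + terms[position + 1:]
            expanded.append(" ".join(variant))
    return expanded


if __name__ == "__main__":
    for variant in expand_query("आभूषण उद्योग"):
        print(variant)
```

Each variant can be submitted as a separate query and the result lists merged, which raises recall but may lower precision when a synonym carries a different nuance.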
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples
below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, "to
sleep". In the context of the above query the word सोना means gold, but the query
can also be interpreted as "women like to sleep".
Another query: (In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means "common".
However, in Hindi it also means "mango", so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature. Finding the correct sense of a word in a
given context is an intricate task, because one word can have more than one meaning
and its meaning depends on the context of the sentence. For example, कर in the sense
of "tax" has synonyms such as ब्याज, शुल्क, सूद, महसूल and टैक्स; in another context कर
means "hand" or "arm", with synonyms such as हस्त and बाहु; and in yet another
context it is a form of करना ("to do").
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways. It has
affected the pronunciation of Hindi words, and many English words have been
localized in India. Some of these words appear as if they were native Hindi words,
and Indians are sometimes unable to recall the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
uneducated Indians without any awareness of which language the words come from. Most
Indians use these words in English rather than in their native language. English
has its influence on Hindi not only in speech but in writing too.
This becomes more evident in Hindi literature on the web.
The influence of English on the Hindi language has been observed to be one of the very
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
Table 4.17 List of Hindi queries
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from this set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
S No Query in Hindi Meaning In English SNO Query in Hindi Meaning In English
1 ब यतवष मवन Indian rain forest 62 ऩकषमोकीपरज ततम Birdlsquos species
11 ब यत मवष मवनो Indian rain forests 7 क षषसभसम Agriculture problem
2 हव ईद घमटन क क यण Reason for air crash 71 क षषसभसम ओ Agriculture problems
21 हव ईद घमटन ओ क क यण Reason for air crashes 8 कीटन शकक इसत भ र Use of pesticide
3 ब यतभफ रीज न व रीब ष Language spoken in India 81 कीटन शकोक इसत भ र Use of pesticides
31 ब यतभफ रीज न व रीब ष ए Languages spoken in India 9 भ नमसकय ग Mental illness
32 ब यतभफ रीज न व रीब ष ओ Languages spoken in India 91 भ नमसकय चगमो Mental patients
4 षवर पतह न ऩयझ र Lake on the verge of
extinction 92 भ नमसकय ग Mental Patient
41 षवर पतह न ऩयझ र Lakes on the verge of
extinction 10 गर भ णषवक सम जन Policy for village
5 पर क ततकआऩद Natural calamity 101 गर भ णषवक सम जन ओ Policies for village
51 पर क ततकआऩद ए Natural calamities 102 गर भ णषवक सम जन ए Policies for village
6 ऩ कीपरज तत Bird species 11 परभ खक षषक दर Major agricultural
office
61 ऩकषमोकीपरज तत Birdlsquos species 111 परभ खक षषक नदरो Major agricultural
offices
Table 418 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 4.19 Precision values of the three search engines
Figure 4.1 Precision values of the three search engines (series: P Google, P Bing, P Guruji; horizontal axis: query serial number; vertical axis: precision from 0 to 1)
S No Query Precision 10 SNO Query Precision 10
Google Bing Guruji Google Bing Guruji
11 ब यतवष मवन 05 03 01 63 ऩकषमोकीपरज ततम
09 06 01
12 ब यत मवष मवनो 03 01 01 71 क षषसभसम 07 03 02
22 हव ईद घमटन क क यण
05 03 01 72 क षषसभसम ओ 07 04 02
22 हव ईद घमटन ओ क क यण
03 03 01 81 कीटन शकक इसत भ र
1 08 02
31 ब यतभफ रीज न व रीब ष
09 05 05 82 कीटन शकोक इसत भ र
09 06 02
32 ब यतभफ रीज न व रीब ष ए
05 04 03 91 भ नमसकय ग 07 06 04
33 ब यतभफ रीज न व रीब ष ओ
05 02 02 92 भ नमसकय चगमो 09 06 03
41 षवर पतह न ऩयझ र 05 03 02 93 भ नमसकय ग 09 05 04
42 षवर पतह न ऩयझ र 05 03 0 101 गर भ णषवक सम जन
09 07 03
51 पर क ततकआऩद 1 04 03 102 गर भ णषवक सम जन ओ
1 06 0
52 पर क ततकआऩद ए 06 06 01 103 गर भ णषवक सम जन ए
08 06 02
61 ऩ कीपरज तत 1 07 02 111 परभ खक षषक दर 05 04 0
62 ऩकषमोकीपरज तत 09 06 01 112 परभ खक षषक नदरो 04 02 02
It has been observed that documents returned by all three search engines are more
in number when query with root word is submitted This justifies the searching of
documents in the root word because in general we get better results with the
keywords in their root form
It has also been observed that only Google shows listing of morphological
variants of root words where as Bing and Guruji show only listing of root word
supplied in almost all the sample queries listed above in the table
From the above results it is evident that only Google indexes document
keywords in their root form. Bing and Guruji do not index in that form, which
is why they retrieve fewer documents than Google. The overall comparison of
results from the three search engines in the tables above shows that, in
general, the quantity of results retrieved increases when the keywords are
used in their root form. For search engines, the quality of results is more
important than the quantity. Figure 4.1 and Table 4.19 show the comparison of
the precision values of the three search engines. The precision value is
calculated from the top 10 results of each search engine. On closely observing
the results, we can say that the precision value for Google is high in almost
all queries. Since, as mentioned above, Google indexes keywords in their root
form, the relevancy of its results is also high in comparison to the other two
search engines. This indicates that not only the quantity but also the quality
of results is affected by morphological variations in the keywords.
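The precision measure used throughout these tables (relevant documents among the top 10 results) can be written as a small function. This is a minimal sketch: the relevance judgments are assumed to be supplied by a human assessor as a list of booleans, and the function name is illustrative.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results that were judged relevant.

    relevance_flags: booleans in ranked order, True where the result
    was judged relevant (the judgments themselves are an assumed input).
    """
    top = relevance_flags[:k]
    if not top:
        return 0.0
    # Dividing by k means a short result list is penalised for missing results.
    return sum(1 for flag in top if flag) / k


if __name__ == "__main__":
    judged = [True] * 9 + [False]  # 9 of the first 10 results relevant
    print(precision_at_k(judged))
```

A value reported as 09 in the extracted tables corresponds to 0.9 here, that is, 9 relevant documents among the first 10.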
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for phonetically
equivalent forms of the keywords entered. A user may use any of these forms
for searching, and search engines should be capable of supporting all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the results and precision
offered by the search engine.
Table 4.20: Results of the search engine on the phonetic nature of Hindi language
Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5
Figure 4.2: Precision charts for the phonetic nature of Hindi language (panels Query No. 1 to Query No. 5)
In the above table and figure it can be clearly seen that search engines return
a handful of documents for the various phonetically equivalent Hindi queries.
It is observed that no particular standard exists for writing the keywords used
to fetch Hindi web data. For each phonetically equivalent keyword in the query,
the results vary; i.e., a different set of documents is retrieved, with very
little repetition. From the precision chart it is clearly observed that the
degree of relevance for queries containing phonetically equivalent keywords is
almost the same. A native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information of his/her use.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning. A word also often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for the information retrieval system.
For example, the Hindi word (आब षण) (ornament in English) has the following
commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3, presented below, show the comparison of precision
values across the three search engines.
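Table 4.21: Effect of word synonyms on Hindi IR
S. No. | Query | Google documents | Google P@10 | Bing documents | Bing P@10 | Guruji documents | Guruji P@10
1 | स न क आब षण (standard Hindi word and synonyms: आब षणगहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi word and synonyms: सतर न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1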
Figure 4.3: Comparison of precision values across the three search engines
From the examples above it is observed that using synonyms of Hindi keywords
improves information retrieval for a query in the Hindi language. Not only the
quantity of documents returned but also their quality is affected by using
synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the fewest;
the reason may be the availability of fewer documents or poor indexing.
However, we are more interested in the quality of results than in their
quantity. As far as quality is concerned, it can be clearly seen that Google
and Bing provide better results than Guruji. In the average case Google still
ranks first; that is, its precision values are higher than those of Bing and
Guruji. Thus it becomes clear that by changing a keyword into its synonym,
equivalent results can be obtained. Therefore, synonyms of keywords play an
important role in Hindi information retrieval.
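Synonym substitution of the kind tested in Table 4.21 can be sketched as a simple query expansion step. The synonym table below is hypothetical and holds only the ornament example; the spellings आभूषण, गहने, जेवर and अलंकार are an assumed reading of the garbled forms in the text, and `expand_with_synonyms` is an illustrative name rather than part of any tool described in this chapter.

```python
# Hypothetical synonym table; spellings are an assumed reconstruction
# of the ornament example and would normally come from a thesaurus.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = " ".join(words[:i] + [alternative] + words[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    for variant in expand_with_synonyms("सोने के आभूषण"):
        print(variant)
```

Each variant would then be submitted as a separate query, and the result lists compared or merged.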
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected ones. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns hold the meaning of the queries in English with respect to
the ambiguous keyword in each context.
Table 4.22: List of randomly selected ambiguous queries
S. No. | Query | Keyword as | In English | Keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war
The ambiguous queries listed in the table above are tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23: Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24: Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25: Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10
From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In the above tables, the last column, labeled "Other
context", holds the number of results that are not relevant to the query
supplied, or that contain the keyword in another, unintended context. From the
results it is clear that all the search engines return documents in different
contexts. Therefore it can be said that search engines underperform when
supplied with ambiguous queries. The numbers in the "Other context" column
signify the deviation from relevance. For example, for the query
म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column
contains 5 documents for Google, 8 for Bing and all 10 for Guruji.
In another query, सऩ यो क पन (art of snake charmers), which has a second
context (snake charmer's snake head), the retrieved documents are expected to
be in the context "art". From the above results, however, it can be seen that
Google returns all 10 documents in the unintended context (snake head) and Bing
returns 9, whereas Guruji fails to retrieve even a single document. In such a
scenario it becomes important for search engines to address the issue of
ambiguity in keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information retrieval
The English language influences Hindi not only in speech but in writing too.
This becomes especially evident in Hindi literature on the web. The influence
of English on Hindi has been observed to be one of the most important
parameters for Hindi information retrieval, as the following example explains.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). The
word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज,
एकस यस इज, एकसयस इज. Following the phonetic variations illustrated in this
example, a variety of popular keywords and queries from various domains have
been tested in experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
Table 4.26: Keywords along with their phonetic variants written in Hindi
English word | Hindi forms (Google transliteration, standard Hindi keyword, two phonetic equivalents) | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000
From the above table it can be observed that, for a single-keyword query, the
search engine returns documents for every phonetic variant of the keyword, and
in large numbers. It can also be seen that people have their own ways of
writing Hindi words and that no standard is followed for storing Hindi data on
the web. Documents are also retrieved for every phonetic variant of an English
keyword written in Hindi script. In the above table, the first Hindi form in
each row (bold in the original) is the keyword obtained using the Google
transliteration tool. Table 4.26 shows that transliteration does not provide
the correct Hindi word in most cases. For example, the correct transliteration
of the word University should be म तनवमसमटी, whereas Google transliteration
provides उतनव मसमतम, which is completely wrong. The table shows that 1540
documents have been retrieved for the wrong keyword उतनव मसमतम (University),
and the same holds for other single-word queries: Insurance इनस य स,
Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist
सऩ मसअमरसट. It can be inferred that Hindi website developers make use of
unchecked and non-standard transliteration, which makes Hindi IR a difficult
task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on the precision and quantity of documents retrieved. The
Hindi query is transformed into its variants by the software (its design and
working are discussed in the next chapter), which replaces Hindi keywords with
English keywords written in Hindi without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by English equivalents, without changing the meaning of the original
Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" ("प य न")
and तनव श means "investment" ("इनव सटभट"). The query is transformed for the
two levels as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh
Bharat mein") supplied by the user is transformed into two equivalent forms
containing a mixture of English and Hindi, where the meaning of the query
remains the same. From the sample set of one hundred queries, some randomly
selected queries are presented below in Table 4.27.
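The two-level transformation described above can be sketched as follows. This is not the software developed in the thesis (its design is covered in the next chapter); it is a minimal illustration in which the mapping table, the Devanagari spellings फॉरेन and इन्वेस्टमेंट, and the function name `transform_levels` are all assumptions.

```python
# Hypothetical mapping from Hindi keywords to English loanwords written
# in Devanagari; a real tool would need a far larger, curated table.
ENGLISH_EQUIVALENTS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query, mapping=ENGLISH_EQUIVALENTS):
    """Return (level1, level2) variants of a Hindi query.

    Level 1 replaces only the first replaceable keyword; level 2 replaces
    every replaceable keyword. A level that does not apply is None.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in mapping]

    level1 = None
    if replaceable:
        first = replaceable[0]
        level1 = " ".join(
            mapping[word] if i == first else word
            for i, word in enumerate(words)
        )

    level2 = None
    if len(replaceable) > 1:
        level2 = " ".join(mapping.get(word, word) for word in words)

    return level1, level2


if __name__ == "__main__":
    print(transform_levels("विदेशी निवेश भारत में"))
```

For the foreign-investment example this yields one variant with only the first keyword replaced and a second with both keywords replaced, matching the two levels in the text.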
Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4: Transformed queries into two equivalent senses (panels Hindi Query 1 to Hindi Query 6, each with Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines, the quality of results
is more important than the quantity; therefore Table 4.28 and Figure 4.5 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, are used for retrieving web
results.
Table 4.28: Analysis of precision values (tabular representation)
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Precision@10 Google (query, L1, L2) | Bing (query, L1, L2) | AltaVista (query, L1, L2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5: Analysis of inter-precision values
From the above tables and figures it can be clearly seen that Hindi data of a
similar nature can be mined for Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in retrieved data. The relevance of the
retrieved data can be seen in the precision columns. For every Hindi query and
its transformed variations, the degree of relevance of the documents is very
close, equal or improved; e.g., for the Hindi query सव सथ औय यकतद न, 9 of the
first 10 documents are relevant, and for the transformed queries of a similar
nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents
are also relevant. The same pattern repeats for the rest of the Hindi queries,
as shown in the table above. Without transformation of Hindi queries, the user
may miss the chance of retrieving relevant information, as the Hindi user may
not be aware of the presence of such information on the web and may be unable
to formulate the query variations that arise from English influence. From the
above table it can be said that English-influenced Hindi information is
present on the web and is increasing day by day. By including English keywords
in Hindi script in the query, the scope of searching in Hindi and getting
relevant information can be increased.
[Figure 4.5 chart: inter-precision values for HQ1 to HQ6 (HQ = Hindi query), each with levels L1 and L2; series: Google, Bing, AltaVista]
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the actual Hindi writing
standard, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at
the root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort of a Hindi user searching for such information,
a software tool has been developed (described in detail in the next chapter)
that acts as an interface between the user and search engines. With the help of
this tool, a user can widen the scope of search on the web in the Hindi
language [66][67].
42204 Vertical Conjuncts
Consonants that do not end with a vertical line often form vertical conjuncts
with the following consonant. The first consonant is written on top of the
second consonant. For example:
ट + ट = ट्ट (छुट्टी)
ट + ठ = ट्ठ (चिट्ठी)
42205 Other Conjuncts
Certain conjuncts are special and should be noted. If a nasal consonant is the
first member of a conjunct, it may be written either as a regular conjunct
(e.g., न + द = न्द, हिन्दी) or with an anusvar, which is a dot written above
the horizontal line, to the right side of the preceding consonant or vowel. For
instance, हिन्दी could be spelled हिंदी, and अण्डा could alternatively be
spelled अंडा.
Note that the anusvar always indicates a so-called homorganic nasal consonant;
in other words, it is articulated in the same location in the mouth as the
following consonant. Thus the anusvar in हिंदी must represent न, which is a
dental nasal consonant, since द, the following letter, represents a dental
consonant. Likewise, the anusvar in अंडा must represent the retroflex nasal
consonant ण, since the following consonant ड is a retroflex consonant.
Note that the anusvar is not the same as the bindu (or chandrabindu). The
anusvar represents a consonant that is the first letter of a conjunct, whereas
the bindu and chandrabindu represent the nasalization of a vowel. The bindu in
हैं cannot be considered an anusvar, since there is no conjunct. The anusvar in
हिंदी is not considered a bindu, since it represents a consonant that is the
first member of a conjunct.
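Because a nasal conjunct and an anusvar spell the same word, a retrieval system can map both spellings to one form before matching. The sketch below rewrites a nasal consonant plus virama as an anusvar only when the next consonant is homorganic. The table of nasals and stops follows the standard Devanagari consonant classes; the function name and the choice of the anusvar spelling as the target form are assumptions made for illustration.

```python
import re

ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA

# Each nasal consonant with the stops articulated at the same place.
HOMORGANIC = {
    "ङ": "कखगघ",  # velar
    "ञ": "चछजझ",  # palatal
    "ण": "टठडढ",  # retroflex
    "न": "तथदध",  # dental
    "म": "पफबभ",  # labial
}


def to_anusvara(word):
    """Rewrite nasal + virama before a homorganic stop as an anusvar."""
    result = word
    for nasal, stops in HOMORGANIC.items():
        # The lookahead keeps the following stop in place.
        pattern = re.compile(f"{nasal}{VIRAMA}(?=[{stops}])")
        result = pattern.sub(ANUSVARA, result)
    return result


if __name__ == "__main__":
    print(to_anusvara("हिन्दी"), to_anusvara("अण्डा"))
```

Applying the same function to both the query and the indexed text would let the two spellings of a word such as हिन्दी match.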
Conjuncts with र
As the first member of a conjunct, र appears like a small hook or sickle above
and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined
to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट, ठ, ड and ढ, do not have any vertical line, so they indicate
a following र with a symbol like an inverted v, as follows:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not
obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are
treated as a single unit, and a maatraa that is written before its consonant,
such as ि, is placed before the entire conjunct.
There are hundreds of conjuncts, but most are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam (।), a vertical line that
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is
known as encoding. Most of us are familiar with ASCII, which is a 7-bit
encoding of the characters in the English language (it can store at most 128
characters). With the passage of time, the need was felt for a single encoding
that could contain enough characters to accommodate all the languages in the
world. To enable sharing of information, this encoding would need to be a
universally accepted standard. That standard is Unicode. Unicode defines a code
space of 1,114,112 code points (U+0000 to U+10FFFF), which can potentially give
a unique number to each character in all languages known to man; its UTF-32
form stores each code point in 32 bits.
Actually, there is another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the
Universal Character Set (UCS). Fortunately, the participants of both projects
(ISO and Unicode) realized around 1991 that two different unified character
sets are not exactly what the world needs. They joined their efforts and worked
together on creating a single encoding. Both projects still exist and publish
their respective standards independently, but have agreed to keep the encoding
of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular
character, but that is not all. They must also define how this value will be
represented in bits when stored in a computer file or transmitted over the
Internet. The Unicode Standard defines three encoding forms that specify how a
particular character will be represented in bits while being transmitted. The
three encoding forms allow the same data to be transmitted in a byte, word or
double-word oriented format (i.e., in 8, 16 or 32 bits). All three encoding
forms encode the same common character repertoire and can be efficiently
transformed into one another without loss of data. The three encoding forms, as
defined by the Unicode Consortium, are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming
all Unicode characters into a variable-length encoding of bytes. It has the
advantages that the Unicode characters corresponding to the familiar ASCII set
have the same byte values as ASCII, and that Unicode characters transformed
into UTF-8 can be used with much existing software without extensive software
rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact: all the
heavily used characters fit into a single 16-bit code unit, while all other
characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single code
unit access to characters is desired. Each Unicode character is encoded in a
single 32-bit code unit when using UTF-32.
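The difference between the three encoding forms can be checked directly for a Devanagari letter. The sketch below uses Python's built-in codecs; the little-endian codec names are chosen only so that no byte order mark is added to the byte counts, and the helper name `encoded_sizes` is illustrative.

```python
def encoded_sizes(ch):
    """Report the code point of one character and its size in bytes
    under each of the three Unicode encoding forms."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants avoid a byte order mark in the count.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for ch in ("A", "क"):
        print(ch, encoded_sizes(ch))
```

An ASCII letter takes one byte in UTF-8, while the Devanagari letter क takes three bytes in UTF-8, two in UTF-16 and four in UTF-32; decoding any of the three byte sequences gives back the same character.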
By the way, UTF stands for UCS Transformation Format.
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme. The original
UTF-8 design allowed sequences of 2 to 6 bytes; the current definition (RFC
3629) limits them to 2 to 4 bytes, and the most commonly used characters need
at most three bytes. Non-ASCII characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of which
has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII
characters stored in bytes have their most significant bit not set). This
allows the application to differentiate between ASCII and non-ASCII characters.
Bytes representing non-ASCII characters will never be mistaken for ASCII
characters.
The first byte of a multibyte sequence that represents a non-ASCII character
indicates how many bytes follow for this character. All further bytes in the
multibyte sequence are used to encode the actual character [61].
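The rule that the first byte announces the length of the sequence can be sketched as a small lookup. The byte ranges below follow the current UTF-8 definition, in which sequences are at most four bytes long; the function name is illustrative.

```python
def utf8_sequence_length(first_byte):
    """Return the total number of bytes in a UTF-8 sequence,
    judged from its first byte alone."""
    if first_byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3  # 1110xxxx: covers the Devanagari block
    if 0xF0 <= first_byte <= 0xF4:
        return 4  # 11110xxx
    # 0x80-0xBF are continuation bytes; the rest are never valid lead bytes.
    raise ValueError(f"not a valid UTF-8 lead byte: {first_byte:#04x}")


if __name__ == "__main__":
    encoded = "क".encode("utf-8")
    print(encoded.hex(), utf8_sequence_length(encoded[0]))
    # Every byte of a non-ASCII character has its most significant bit set.
    print(all(byte & 0x80 for byte in encoded))
```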
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side
comparison of a few will often reveal structural similarities even in the
modern letterforms. With minor historical exceptions, they are written from
left to right. They are all abugidas, in which most symbols stand for a
consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels
in many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected
consonant. Absence of the inherent vowel, when that occurs, is frequently
marked with a special sign. In the Unicode Standard this sign is denoted by the
Sanskrit word virama. In some languages another designation is preferred. In
Hindi, for example, the word hal refers to the character itself and halant
refers to the consonant that has its inherent vowel suppressed; in Tamil, the
word puḷḷi is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character,
with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in
the south, and from Pakistan in the west to the easternmost islands of
Indonesia, are derived from the ancient Brahmi script. The oldest lengthy
inscriptions of India, the edicts of Ashoka from the third century BCE, were
written in two scripts, Kharoshthi and Brahmi. These are both ultimately of
Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written
from right to left, was supplanted by Brahmi and its derivatives. The
descendants of Brahmi spread with myriad changes throughout the subcontinent
and outlying islands. There are said to be some 200 different scripts deriving
from it. By the eleventh century, the modern script known as Devanagari was in
ascendancy in India proper as the major script of Sanskrit literature. This
northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan;
the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all
encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which
facilitates transliteration to some degree, is based on the Indian national
standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala
has a virama-based model but is not structurally mapped to ISCII. Tibetan
stands apart, using a subjoined consonant model for conjoined consonants,
reflecting its somewhat different structure and usage. The Limbu script makes
use of an explicit encoding of syllable-final consonants. Many of the character
names in this group of scripts represent the same sounds, and naming
conventions are similar across the range.
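The role of the virama can be seen by building a conjunct from its code points. In the sketch below, a conjunct is represented as consonants joined by U+094D, and the standard library's `unicodedata` module is used to show the character names; the helper names `make_conjunct` and `describe` are illustrative.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, suppresses the inherent vowel


def make_conjunct(*consonants):
    """Join consonants with a virama so that a font can render them
    as a single conjunct."""
    return VIRAMA.join(consonants)


def describe(text):
    """List the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    ksha = make_conjunct("क", "ष")
    print(ksha, len(ksha))
    for name in describe(ksha):
        print(name)
```

Although the conjunct is displayed as one shape, it is stored as three code points, which is why different typed spellings of a word can look alike on screen and still fail to match byte for byte.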
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the
Tibetan script and the Southeast Asian scripts, are historically connected with
the Devanagari script as descendants of the ancient Brahmi script. The entire
family of scripts shares a large number of structural features. The principles
of the Indic scripts are covered in some detail in this introduction to the
Devanagari script. The remaining introductions to the Indic scripts are
abbreviated but highlight any differences from Devanagari where appropriate.
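Since the block occupies a fixed code point range, a program can test whether a character, or most of a query, is written in Devanagari. This is a minimal sketch; the function names are illustrative, and only the basic block U+0900 to U+097F named in the heading is considered.

```python
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def is_devanagari(ch):
    """True if the character lies in the basic Devanagari block."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


def devanagari_share(text):
    """Fraction of the non-space characters that are Devanagari."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(is_devanagari(ch) for ch in visible) / len(visible)


if __name__ == "__main__":
    print(is_devanagari("ह"), is_devanagari("a"))
    print(devanagari_share("हिन्दी IR"))
```

A check of this kind could be used, for example, to decide whether a query should go through Hindi-specific processing at all.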
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic scripts
in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada and Malayalam. This parallel code layout emphasizes the structural
similarities of the Brahmi scripts and follows the stated intention of the Indian
coding standards to enable one-to-one mappings between analogous coding
positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer,
Myanmar and other scripts depart to a greater extent from the Devanagari
structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these
scripts to the Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
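The parallel code layout described above can be illustrated with a short sketch.
It assumes only the published block start positions of the nine scripts in
Unicode; the function name shift_script and the choice to leave unassigned
target positions untouched are illustrative assumptions, and the result is a
code-point alignment, not a linguistically complete transliteration.

```python
import unicodedata

# Start of each script's block in Unicode; the blocks are 0x80 code points apart.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text, source, target):
    """Move each character of `source` to the same relative position in `target`."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            candidate = chr(cp - src + dst)
            # Not every position is assigned in every script; keep the
            # original character when the target slot has no name.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


# Devanagari KA, MA, LA shifted to the Bengali block.
print(shift_script("\u0915\u092e\u0932", "devanagari", "bengali"))
```

Because the mapping is a fixed offset, the same function works between any two
of the nine scripts, which is the property the Indian coding standards set out
to provide.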
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built on
phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the
Devanagari script. These pieces consist of three distinct character types:
consonant letters, independent vowels and dependent vowel signs. In a text
sequence, these characters are stored in logical (phonetic) order [62].
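As a rough illustration of logical order and of the orthographic syllable, the
sketch below lists the stored code points of a word and splits it with a
simplified regular expression. The pattern is an assumption made for this
example: it covers consonants, virama, dependent vowel signs, independent vowels
and the three common modifiers, ignores nukta and other marks, and does not
limit the number of leading consonants to three.

```python
import re
import unicodedata

CONSONANT = "[\u0915-\u0939]"
VIRAMA = "\u094D"
VOWEL_SIGN = "[\u093E-\u094C]"         # dependent vowel signs
INDEPENDENT_VOWEL = "[\u0904-\u0914]"
MODIFIER = "[\u0901-\u0903]"           # chandrabindu, anusvara, visarga

# (consonant + virama)* consonant [vowel sign] [modifier], or a bare vowel.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}{VOWEL_SIGN}?{MODIFIER}?"
    f"|{INDEPENDENT_VOWEL}{MODIFIER}?"
)


def orthographic_syllables(word):
    """Split a Devanagari word into simplified orthographic syllables."""
    return SYLLABLE.findall(word)


word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari
for ch in word:
    # The vowel sign I is displayed to the left of HA but stored after it.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(orthographic_syllables(word))
```

The first syllable is HA plus VOWEL SIGN I, stored consonant first; the second
is the cluster NA + VIRAMA + DA followed by VOWEL SIGN II.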
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in their
daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode Consortium, a
nonprofit in California whose members include Google, IBM, Oracle, Microsoft,
Sun Microsystems, Yahoo and the Government of India, proved to be a boon for
Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its
operating system in 2001. "Since then, Hindi Unicode support has been a part of
all subsequent upgrades of Microsoft's operating systems. Also, providing Input
Method Editor facilities gives users the option to use different types of
keyboards," says Meghashyam Karanam, product manager, vision and localization,
at Microsoft India. The earlier system could incorporate only 127 characters,
which is not enough for the Hindi Devanagari script. The Unicode system can
incorporate up to 65,000 characters. As most computers in India use Microsoft's
operating system, this ensured that the Hindi font was available to most of them
as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products. If Microsoft built the base for
Hindi, Google was ready to put up the superstructure. Realizing the potential of
Indian languages, the California-based company has launched various products in
the past two years. With the Google Hindi and Urdu search engines, one can
search all the Hindi and Urdu Web pages available on the Internet, including
those that are not in a Unicode font. "Google offers searching in 13 languages
(Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few), Gmail in five
languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most
recent language that Google has added to its offerings," says Rahul
Roy-Chowdhury, product manager at Google India. To use the search function, "users
can type Hindi words in Roman script and a drop-down menu suggests several
Hindi phrases. By selecting the appropriate query, users can search for Hindi
content without even typing in Hindi," says Roy-Chowdhury. Google has more
useful tools for non-English users. Google News is available in Hindi. With the
Google translation engine, one can type English words and get a list of suggested
synonyms in Hindi. A transliteration tool allows users to type any word in
English, hit the space bar and get the same word in a different language.
Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language is
soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular Hindi
newspapers now have a Web presence. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit
Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi
Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd
largest Wikipedia in size, compared to the over 260 individual language
Wikipedias," says Jay Walsh, head of communications at the California-based
Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is
certainly an important part of the Wikimedia Foundation's mission to support the
growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has
more than 10,800 articles. What are the challenges that still remain in the
popularization of Hindi and Urdu on the Internet? "The major challenge is
Internet penetration and PC prices. The moment we have better Internet
penetration, especially in smaller towns, and PC prices go down, Hindi and Indian
languages can flourish on the Net," says Karnik of webdunia.com. India had more
than 49 million Internet users in June 2008, of which about 9 million used the
Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many local
language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of over a
billion people, English is spoken by less than 80 million people," says Krishna of
Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is not
a luxury but a powerful tool to improve life. But is Hindi earning enough revenue
to be dubbed successful on the Web? "That's a tough question. Right now it is not
much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once
Hindi reaches a critical volume. "If we look at how the Internet developed in the
U.S., it may provide a useful analogy. First came content, which was mostly
produced by people who had a passion for putting up content they cared about.
Traffic and monetization was not the motive. Second came growing readership, as
people started discovering content. This set off a virtuous cycle in which content
eventually became a viable, monetizable business. Third were the application
developers, who could now focus on moving the online experience beyond passive
consumption of information to interactivity, community building, service delivery
and a host of other innovations," Roy-Chowdhury says. "India's market was
stuck in phase one for a long time. And I believe it has recently entered phase
two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri
at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in 1978. It was collected for a comparative study of American,
British and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language
corpora. It has coordinated with various Indian agencies and universities to
develop corpora of more than 45 million words in the Scheduled Languages of
India, which is also a part of the TDIL program. The Enabling Minority Language
Engineering
(EMILLE) program provides corpus architecture and tools for Asian
languages. It has a monolingual corpus containing approximately 96,157,000
words and a parallel corpus consisting of 200,000 words of text in English, which
helps in the translation of Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' as a repository for a Hindi word database and dialect
mapping of Hindi. The Department of Information Technology of the Government
of India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date back to 1980 and
involved prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new
projects were initiated by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University, Kolkata. Since April 2008, TDIL has run a
consortium-mode project for building computational tools and
Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is
a rule-based system with approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application that is available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It aims at fully automatic, general-purpose,
high-quality machine translation (FGH-MT). Its software can translate text from
one Indian language into another, based on Panini's Ashtadhyayi (grammar
rules). It is developed at the International Institute of Information Technology,
Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of
Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was initiated by the Indian
Government in 1996 for the translation of government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network and web-based forms. It uses the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent English as well
as Hindi grammar. Initially it was domain specific, covering Personnel
Administration (specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars); gradually the domains were expanded. At present
it also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but it is now
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat
(Rajya Sabha is the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. UNL-based
machine translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence generated by the
system, the user can supply new words, phrases and sentences as feedback
and obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines rule-based and statistical approaches. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has small sets of transfer rules. Users can
access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates English
sentences into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT – Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms. For
any query, it searches deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and also provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language(s).
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and Web site localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest
and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government in e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gurjrati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has been aware of this need. Since 1970, the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In
addition, Indian languages Transliteration (ITRANS), developed by Avinash
Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et
al. 2005). The Department of Information Technology, under the Ministry of
Communication and Information Technology, is also making efforts toward the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defense Research and Development Organization), the Department of
Atomic Energy, the All India Council of Technical Education and UGC (University
Grants Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide research funds to development agencies. As one
result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, some 10 to 15 Indian local search engines were
launched: some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to give a brief overview of the current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefindin, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition). It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by
JuxtConsult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12: Most Used Websites
4.6.1.2 Info Search: English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13: Info Search: English
4.6.1.3 Info Search: Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14: Info Search: Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation into typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem while dealing with Indian languages, due to multiple spellings,
morphological variants of keywords, and English keywords written in Hindi.
Table 4.15 illustrates a few such cases. For example, a Hindi query for "world
trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला) is shown to
return 0 documents in Table 4.15; however, a small rephrasing of the query in
Table 4.16 shows that these keywords exist in the second search result. But just
saying we have a recall problem may not be sufficient. The next obvious question
is how much of a problem it is. In other words, we need to
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs, and by how much.
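One standard way to quantify the problem is the recall measure: the fraction of
relevant documents that a system actually returns. The sketch below is a minimal
illustration, assuming that the set of relevant documents for a query is known
in advance; the document identifiers are hypothetical and do not come from the
experiments described here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear among the retrieved ones."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical example: 4 relevant documents exist, the engine returns 2 of them.
print(recall(["d1", "d3"], ["d1", "d2", "d3", "d4"]))  # 0.5

# A query that returns nothing has recall 0 even though relevant documents exist.
print(recall([], ["d1", "d2"]))  # 0.0
```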
Hindi Query                                       Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                  0        0       0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च       0        0       0
Table 4.15: Problems faced while searching in Hindi: low recall
Hindi Query                                       Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                      8820     92      12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                      331      10      1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च       708      50      1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध             37100    7400    93
Table 4.16: Improved recall
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar. But the grammatical and
structural standards of the language are often not followed, for various reasons.
One of the reasons is the language diversity in India. Including Hindi, there are
about 28 languages spoken in India, and Hindi, being the official language of
India, is influenced by the regional languages, which results in dialect changes
not only in
speaking but in writing also. Every language uses some markers (English
uses -s, -es and -ing; Hindi uses MAATRAAS and suffixes such as ओं and एं) that
are attached to a root word to construct new words. For example, Yojnaaon
(योजनाओं) and Yojnaayein (योजनाएं) in Hindi (plans in English) are
morphological variants of the root word Yojnaa (योजना). It is desirable to
combine all the morphological variants of a word into a single canonical form.
This process is called word stemming, and the canonical form is called the root
word or base word.
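Stemming of this kind can be sketched as simple suffix stripping. The suffix
list below is a small illustrative assumption, not a complete inventory of Hindi
inflections, and the minimum stem length is an arbitrary safeguard against
stripping very short words; a production stemmer would need a much richer rule
set.

```python
# Illustrative plural and oblique suffixes (an assumption, not exhaustive).
SUFFIXES = [
    "\u0913\u0902",  # O + anusvara, as in Yojnaaon
    "\u090f\u0902",  # E + anusvara, as in Yojnaayein
    "\u090f\u0901",  # E + chandrabindu, an alternative spelling of the same ending
    "\u094b\u0902",  # vowel sign O + anusvara, after a consonant
    "\u0947\u0902",  # vowel sign E + anusvara, after a consonant
]
MIN_STEM_LENGTH = 2  # in code points


def stem(word):
    """Strip one known suffix, if any, and return the remaining stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


YOJNAA = "\u092f\u094b\u091c\u0928\u093e"   # Yojnaa
YOJNAAON = YOJNAA + "\u0913\u0902"          # Yojnaaon
YOJNAAYEIN = YOJNAA + "\u090f\u0902"        # Yojnaayein

# All three forms reduce to the same canonical form.
print(stem(YOJNAAON) == stem(YOJNAAYEIN) == stem(YOJNAA))
```

Indexing documents and queries under the stem rather than the surface form lets
a query for one variant match documents containing the others.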
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages and multiple dialects, transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety in the Indian language alphabet. The variety in the alphabet,
different dialects and the influence of regional and foreign languages have
resulted in spelling variations of the same word. For example, the Hindi word
अंग्रेजी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word school can be written in Hindi in several different ways.
When information is searched for the standard keyword स्कूल (school) and for a
non-standard Hindi phonetic equivalent of it, Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by
the other regional languages, which results in phonetic variety of words. For
example, the English word school (स्कूल in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by much of the population of India in different states. For the
Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search
engines should be capable of retrieving results for words that are phonetically
equivalent to the keywords entered. A user may use any keyword for searching,
and search engines should be able to support all phonetically equivalent
words.
Also, no particular standard exists for writing the keywords used to fetch Hindi
web data. For each phonetically equivalent keyword in the query, the results
vary, i.e. a different set of documents is retrieved with little overlap.
The native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
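One way to make a search tolerant of such variants is to index and query on a
normalized spelling key. The sketch below is a rough illustration under stated
assumptions: it drops the nukta, treats chandrabindu as anusvara, and rewrites a
nasal consonant plus virama before another consonant as anusvara. These three
rules are chosen for illustration only, they can conflate distinct words, and
they do not handle cases such as the added initial vowel in ISKOOL.

```python
import re
import unicodedata

NUKTA = "\u093C"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
# NGA, NYA, NNA, NA or MA followed by virama, when another consonant follows.
NASAL_CONJUNCT = re.compile("[\u0919\u091E\u0923\u0928\u092E]\u094D(?=[\u0915-\u0939])")


def normalize_spelling(word):
    """Return a lossy key under which common spelling variants coincide."""
    word = unicodedata.normalize("NFD", word)   # split precomposed nukta letters
    word = word.replace(NUKTA, "")
    word = word.replace(CHANDRABINDU, ANUSVARA)
    word = NASAL_CONJUNCT.sub(ANUSVARA, word)
    return unicodedata.normalize("NFC", word)


# Two common spellings of the word "Hindi": with NA + virama and with anusvara.
variant_a = "\u0939\u093f\u0928\u094d\u0926\u0940"
variant_b = "\u0939\u093f\u0902\u0926\u0940"
print(normalize_spelling(variant_a) == normalize_spelling(variant_b))
```

The key is meant only for matching; the original spelling should still be
stored and displayed.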
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the commonly spoken synonyms गहना, जेवर and अलंकार.
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it also means "to sleep". In the context of the query सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query: (In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the query above it means "common", but in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".
Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task, because one word can have more than one meaning and its meaning depends on the context of the sentence. For example, कर means "tax" in one context (with synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स), "hand or arm" in another (with synonyms such as हस्त and बाहु), and "to do" (करना) in a third.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, who may be unaware of which language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but in writing too, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from that set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S. No. | Query in Hindi | Meaning in English
1 | भारत वर्षावन | Indian rain forest
1.1 | भारतीय वर्षावनों | Indian rain forests
2 | हवाई दुर्घटना का कारण | Reason for air crash
2.1 | हवाई दुर्घटनाओं का कारण | Reason for air crashes
3 | भारत में बोली जाने वाली भाषा | Language spoken in India
3.1 | भारत में बोली जाने वाली भाषाएँ | Languages spoken in India
3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India
4 | विलुप्त होने पर झील | Lake on the verge of extinction
4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction
5 | प्राकृतिक आपदा | Natural calamity
5.1 | प्राकृतिक आपदाएँ | Natural calamities
6 | पक्षी की प्रजाति | Bird species
6.1 | पक्षियों की प्रजाति | Birds' species
6.2 | पक्षियों की प्रजातियाँ | Birds' species
7 | कृषि समस्या | Agriculture problem
7.1 | कृषि समस्याओं | Agriculture problems
8 | कीटनाशक का इस्तेमाल | Use of pesticide
8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides
9 | मानसिक रोग | Mental illness
9.1 | मानसिक रोगियों | Mental patients
9.2 | मानसिक रोगी | Mental patient
10 | ग्रामीण विकास योजना | Policy for village
10.1 | ग्रामीण विकास योजनाओं | Policies for village
10.2 | ग्रामीण विकास योजनाएँ | Policies for village
11 | प्रमुख कृषि केंद्र | Major agricultural office
11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Figure 4.1 Precision values of the three search engines (chart of precision at 10 for Google, Bing and Guruji, plotted per query number)
Table 4.19 Precision values of the three search engines
S. No. | Query | Google precision at 10 | Bing precision at 10 | Guruji precision at 10
1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1
1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1
2.1 | हवाई दुर्घटना का कारण | 0.5 | 0.3 | 0.1
2.2 | हवाई दुर्घटनाओं का कारण | 0.3 | 0.3 | 0.1
3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5
3.2 | भारत में बोली जाने वाली भाषाएँ | 0.5 | 0.4 | 0.3
3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2
4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2
4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0.0
5.1 | प्राकृतिक आपदा | 1.0 | 0.4 | 0.3
5.2 | प्राकृतिक आपदाएँ | 0.6 | 0.6 | 0.1
6.1 | पक्षी की प्रजाति | 1.0 | 0.7 | 0.2
6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1
6.3 | पक्षियों की प्रजातियाँ | 0.9 | 0.6 | 0.1
7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2
7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2
8.1 | कीटनाशक का इस्तेमाल | 1.0 | 0.8 | 0.2
8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2
9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4
9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3
9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4
10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3
10.2 | ग्रामीण विकास योजनाओं | 1.0 | 0.6 | 0.0
10.3 | ग्रामीण विकास योजनाएँ | 0.8 | 0.6 | 0.2
11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0.0
11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2
It was observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general keywords in their root form give better results.
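As a rough illustration of reducing a keyword to its root form before searching, the sketch below strips a few plural and oblique endings seen in the sample queries. The rule list and the function name `light_stem` are assumptions made for this example; this is not the method of any of the search engines discussed, and the rules will mis-stem some words (for instance, words whose root ends in a short vowel).

```python
# Hypothetical light stemmer for Hindi keywords (illustrative rule list only).
# Each rule is (suffix to remove, text to put back), checked in the order listed.
RULES = [
    ("ियों", "ी"),   # रोगियों -> रोगी
    ("ियाँ", "ि"),   # प्रजातियाँ -> प्रजाति
    ("ओं", ""),      # भाषाओं -> भाषा
    ("एँ", ""),      # भाषाएँ -> भाषा
    ("ों", ""),      # झीलों -> झील
    ("ें", ""),      # झीलें -> झील
]

MIN_STEM_LENGTH = 2  # assumption: never cut a word down to fewer code points


def light_stem(word: str) -> str:
    """Return the word with the first matching ending replaced."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["भाषाओं", "भाषाएँ", "रोगियों", "झीलों", "भाषा"]:
        print(w, "->", light_stem(w))
```

A query interface could apply such a function to each keyword and submit both the original and the stemmed form, which keeps the user's wording while also covering the root word.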
It was also observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines, where the precision value is calculated over the top 10 results of each search engine. On close observation, the precision of Google is high in almost all queries. Since Google, as mentioned above, indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variations in the keywords affect not only the quantity but also the quality of the results.
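The precision value used throughout this chapter is computed over the top 10 results. A minimal sketch of that calculation follows; the function name `precision_at_k` and the list-of-judgments input format are assumptions for illustration, and the relevance judgments themselves still have to come from a human assessor.

```python
def precision_at_k(judgments, k=10):
    """Fraction of the top-k results judged relevant.

    judgments: list of booleans (or 0/1), one per result, in ranked order.
    Results missing from a short list count as non-relevant, so the
    denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for j in judgments[:k] if j) / k


if __name__ == "__main__":
    # Hypothetical judgments: nine of the first ten results are relevant.
    example = [True] * 9 + [False]
    print(precision_at_k(example))
```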
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any of these keywords, and the search engine should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. Table 4.20 shows the number of results and the precision for each query.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2 Precision charts for the phonetic nature of Hindi language (five pie charts, Query No. 1 to Query No. 5, each showing the share of precision among the query and its phonetic variants)
In the above table and figure it can be clearly seen that the search engine returns documents for each of the phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. Every phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with very little overlap. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
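One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalized key. The sketch below is an assumption-laden illustration, not the approach of any search engine discussed here: it removes the nukta, writes chandrabindu as anusvara, and writes a half nasal consonant before another consonant as anusvara. The key is lossy and is meant only for grouping variants, not for display.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म

# Half nasal consonant followed by another consonant (U+0915..U+0939).
HALF_NASAL = re.compile(f"[{NASALS}]{VIRAMA}(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Map a Devanagari word to a hypothetical spelling-insensitive key."""
    text = unicodedata.normalize("NFD", word)  # splits ज़ into ज + nukta
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = HALF_NASAL.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


def group_variants(words):
    """Group spellings that share the same key."""
    groups = {}
    for w in words:
        groups.setdefault(phonetic_key(w), []).append(w)
    return groups


if __name__ == "__main__":
    sample = ["आंदोलन", "आन्दोलन", "आँदोलन", "आज़ादी", "आजादी"]
    for key, variants in group_variants(sample).items():
        print(key, variants)
```

An interface layer could expand a user's keyword to every known spelling in its group and merge the results, so the user does not need to know which spelling a given page uses.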
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the commonly spoken synonyms गहना, जेवर and अलंकार.
Table 4.21 and Figure 4.3 compare the precision values of the three search engines.
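Table 4.21 Effect of word synonyms on Hindi IR
S. No. | Query | Google documents | Google precision at 10 | Bing documents | Bing precision at 10 | Guruji documents | Guruji precision at 10
Query 1: सोने के आभूषण (standard Hindi word: आभूषण)
1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0.0
1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0.0
Query 2: काले बादल (standard Hindi word: बादल)
2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
Query 3: स्त्री सशक्तिकरण (standard Hindi word: स्त्री)
3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
Query 4: सिकंदर का अहंकार (standard Hindi word: अहंकार)
4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
Query 5: वृक्ष लगाओ (standard Hindi word: वृक्ष)
5.1 | वृक्ष लगाओ | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5
5.2 | पेड़ लगाओ | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9
5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0.0
Query 6: आँख दान (standard Hindi word: आँख)
6.1 | आँख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | नेत्र दान | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9
6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1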
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using the synonyms of Hindi keywords improves information retrieval for a Hindi query. Synonyms affect not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in Hindi information retrieval.
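The idea of replacing a keyword by its synonyms can be sketched as a simple query expansion. The dictionary `SYNONYMS` below is a hypothetical, hand-made lookup that mirrors the ornament example in the text; a real system would need a proper Hindi thesaurus, and the function name `expand_query` is likewise an assumption.

```python
# Hypothetical synonym dictionary, limited to the ornament example.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms):
    """Return the original query followed by one variant per synonym.

    Only one keyword is replaced at a time, so every variant stays close
    to the meaning of the original query.
    """
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is how the per-synonym rows of the table were obtained in spirit, although the experiments in this chapter were run by hand on the search engines.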
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context.
Table 4.22 List of randomly selected ambiguous queries
S. No. | Query | Keyword (first context) | In English | Keyword (second context) | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war
The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column of each table, labelled "Other context", holds the number of results that are not relevant to the query or that contain the keyword in some other, non-required context. The results make it clear that all the search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.
For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art", but Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
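The counts in the three ambiguity tables can be summarized as the share of judged top results that fall in the intended sense of the keyword. The sketch below assumes that the first context listed for each query (gold, common, children, art, aggregate) is the intended one; the names `TOP10_COUNTS`, `intended_share` and `mean_intended_share` are made up for this illustration.

```python
# Counts copied from the ambiguity tables, per query 1..5:
# (results in intended sense, results in the other sense, other context).
# None marks the Guruji query that returned no results.
TOP10_COUNTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_share(counts):
    """Fraction of judged results that match the intended sense."""
    if counts is None:
        return None
    total = sum(counts)
    return counts[0] / total if total else None


def mean_intended_share(engine):
    """Average intended-sense share over the queries that returned results."""
    shares = [intended_share(c) for c in TOP10_COUNTS[engine]]
    shares = [s for s in shares if s is not None]
    return sum(shares) / len(shares)


if __name__ == "__main__":
    for engine in TOP10_COUNTS:
        print(engine, round(mean_intended_share(engine), 2))
```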
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but in writing too, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as एक्सरसाइज, and it has four phonetic variations in Hindi script. Following this pattern, a variety of popular keywords and queries from various domains were tested in the experiments.
Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
English Words
Google Transliteration
Standard Hindi Keywords
Phonetic Equivalents Search Engine Google
Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800
Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710
Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820
Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100
Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000
0 537000
University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270
Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300
Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160
Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
3240 38000 639 1120
specialist टऩशसअशरटट
सऩ मशममरसट
सऩ शमरसट सऩ श मरसट
143 1440 51300 9820
Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that the search engine returns documents for a single-keyword query and also for every phonetic variant of that keyword, often in large numbers. It can also be seen that people have their own ways of writing such words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of "University" is यूनिवर्सिटी, whereas Google transliteration produces उनिवेर्सित्य, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उनिवेर्सित्य, and the same holds for other single-word queries: Insurance (इन्सुरांस), Parliament (पर्लिअमेंट), Corruption (कोर्रुप्तिओन), Policy (पोलिस्य) and Specialist (स्पेसिअलिस्ट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved in Hindi IR. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
The queries are transformed at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as
फॉरेन निवेश भारत में (foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
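The two-level transformation can be sketched as a lookup-and-replace over the query words. This is only an illustration of the idea and not the software described in the next chapter; the lexicon `LEXICON` is a hypothetical two-entry dictionary built from the example above, and `transform_levels` is an assumed name.

```python
# Hypothetical lexicon: Hindi keyword -> English equivalent in Devanagari.
LEXICON = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query, lexicon):
    """Return (level 1 variants, level 2 variants) for a Hindi query.

    Level 1: exactly one replaceable keyword is swapped per variant.
    Level 2: all replaceable keywords are swapped (needs at least two).
    """
    words = query.split()
    slots = [i for i, w in enumerate(words) if w in lexicon]

    def swap(indices):
        return " ".join(
            lexicon[w] if i in indices else w for i, w in enumerate(words)
        )

    level1 = [swap({i}) for i in slots]
    level2 = [swap(set(slots))] if len(slots) > 1 else []
    return level1, level2


if __name__ == "__main__":
    l1, l2 = transform_levels("विदेशी निवेश भारत में", LEXICON)
    print("Level 1:", l1)
    print("Level 2:", l2)
```

The sketch produces every single-word replacement at level 1, whereas the example in the text shows only one of them; which variants are actually submitted is a design choice left to the interface.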
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi query | Level 1 | Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लड प्रेशर का इलाज | ब्लड प्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000
Figure 4.4 Transformed queries into two equivalent senses (six charts, Hindi Query 1 to Hindi Query 6, each showing the number of results for the original query, Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and the number of retrieved documents is considerable. For search engines the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 therefore present the precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve the web results.
Table 4.28 Analysis of precision values (tabular representation)
Precision at 10 is listed for the Hindi query (HQ), Level 1 (L1) and Level 2 (L2) in that order.
Hindi query | Level 1 | Level 2 | Google (HQ / L1 / L2) | Bing (HQ / L1 / L2) | AltaVista (HQ / L1 / L2)
स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8
रक्तचाप का इलाज | ब्लड प्रेशर का इलाज | ब्लड प्रेशर का ट्रीटमेंट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5
हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9 / 0.8 / 1.0 | 0.9 / 0.7 / 1.0 | 0.9 / 0.7 / 1.0
सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1.0 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8
विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4
भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1.0 / 1.0 / 1.0 | 1.0 / 1.0 / 1.0 | 1.0 / 1.0 / 1.0
Figure 4.5 Analysis of inter-precision values (inter-precision chart of precision at 10 for Google, Bing and AltaVista over each Hindi query, HQ1 to HQ6, and its Level 1 (L1) and Level 2 (L2) variants)
From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान 9 of the first 10 documents are relevant, and for the transformed queries हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन 9 of the first 10 documents are also relevant; the same pattern holds for the rest of the Hindi queries in the table above. Without transformation of Hindi queries the user may miss relevant information, because the Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variants that reflect the influence of English. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because implementing parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level would cause performance and throughput problems. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in Hindi [66][67].
As the first member of a conjunct, र appears as a small hook or sickle above and to the right of the following consonant:
र + म = र्म (शर्मा)
र + ट + ई = र्टी (पार्टी)
As the second member of a conjunct, र is indicated by a diagonal line adjoined to the vertical line of the preceding consonant:
क + र = क्र (शुक्रिया)
म + र = म्र (उम्र)
Four consonants, ट ठ ड ढ, do not have a vertical line, so they indicate a following र with a symbol like an inverted v:
ट + र = ट्र (राष्ट्र)
Special Conjuncts
Some conjuncts look quite different from their component consonants and are not obvious. Most of these occur in words borrowed from Sanskrit.
क + ष = क्ष
त + त = त्त
त + र = त्र
ज + ञ = ज्ञ
द + द = द्द
द + ध = द्ध
द + य = द्य
द + व = द्व
श + र = श्र
ह + म = ह्म
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are treated as a single unit, and a maatraa is placed before the entire conjunct.
There are hundreds of conjuncts, but most of them are easily discernible.
Punctuation
Hindi has one punctuation sign of its own, the viraam, a vertical line that terminates a sentence. Other punctuation, such as commas and question marks, is borrowed from English. In modern typography periods are also used in place of the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, a 7-bit encoding of the characters used in English (it can represent at most 128 characters). With the passage of time the need was felt for a single encoding with enough characters to accommodate all the languages of the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a unique number (code point) in the range U+0000 to U+10FFFF, so any code point fits in 32 bits, and it can potentially cover every character in all languages known to man.
There is also another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value is represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8-, 16- or 32-bit units). All three encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. It transforms all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format (the Unicode Standard expands it as
Unicode Transformation Format).
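As a rough illustration of the three encoding forms, the following Python
sketch encodes a single Devanagari letter, KA (U+0915), in each form and prints
the resulting bytes. The helper name code_units is hypothetical, and the
big-endian codecs are chosen only so that no byte-order mark is added; this is
an illustration under those assumptions, not part of the Unicode Standard.

```python
# Encode one Devanagari letter in the three Unicode encoding forms.
ch = "\u0915"  # DEVANAGARI LETTER KA


def code_units(text: str, encoding: str) -> str:
    """Return the encoded bytes as space-separated hex (hypothetical helper)."""
    return text.encode(encoding).hex(" ")


# Big-endian codecs are used so that Python does not prepend a byte-order mark.
utf8 = code_units(ch, "utf-8")       # three 8-bit code units
utf16 = code_units(ch, "utf-16-be")  # one 16-bit code unit
utf32 = code_units(ch, "utf-32-be")  # one 32-bit code unit

for label, value in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(f"{label:7}{value}")

# Every form decodes back to the same character, so no data is lost.
assert bytes.fromhex(utf8).decode("utf-8") == ch
assert bytes.fromhex(utf16).decode("utf-16-be") == ch
assert bytes.fromhex(utf32).decode("utf-32-be") == ch
```

The same letter occupies three bytes in UTF-8, two in UTF-16 and four in
UTF-32, which is the storage trade-off described above.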
4.3.2 UTF-8
UTF-8 has the benefit that ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme. The
original design allowed sequences of 2 to 6 bytes, but the current definition of
UTF-8 limits a sequence to at most 4 bytes; the most commonly used characters,
including those of Devanagari, need no more than three bytes. Non-ASCII
characters are encoded as follows:
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII
characters stored in bytes have their most significant bit clear). This allows
an application to differentiate between ASCII and non-ASCII characters: bytes
representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII
character indicates, through its leading bits, how many bytes make up the
sequence. All further bytes in the multibyte sequence are continuation bytes
that carry the remaining bits of the character [61].
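The byte layout described above can be sketched in Python as follows. The
function names sequence_length and is_continuation are hypothetical helpers
introduced for illustration; the bit patterns they test are those of the
current four-byte definition of UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, read from its first byte."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a lead byte (continuation bytes look like 10xxxxxx)")


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return byte & 0xC0 == 0x80


encoded = "\u0939".encode("utf-8")  # DEVANAGARI LETTER HA
print([f"{b:08b}" for b in encoded])   # bit pattern of each byte
print(sequence_length(encoded[0]))     # length announced by the lead byte
```

Because every byte of a multibyte sequence has its top bit set, a parser that
only understands ASCII can skip over Devanagari text without misreading any of
its bytes as an ASCII character.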
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a
side-by-side comparison of a few will often reveal structural similarities even
in the modern letterforms. With minor historical exceptions, they are written
from left to right. They are all abugidas, in which most symbols stand for a
consonant plus an inherent vowel (usually the sound a). Word-initial vowels in
many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected
consonant. Absence of the inherent vowel, when that occurs, is frequently marked
with a special sign. In the Unicode Standard, this sign is denoted by the
Sanskrit word virāma. In some languages, another designation is preferred. In
Hindi, for example, the word hal refers to the character itself, and halant
refers to the consonant that has its inherent vowel suppressed; in Tamil, the
word puḷḷi is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character, with
its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri
Lanka in the south, from Pakistan in the west to the easternmost islands of
Indonesia, are derived from the ancient Brahmi script. The oldest lengthy
inscriptions of India, the edicts of Ashoka from the third century BCE, were
written in two scripts, Kharoshthi and Brahmi. These are both ultimately of
Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written
from right to left, was supplanted by Brahmi and its derivatives. The
descendants of Brahmi spread with myriad changes throughout the subcontinent and
outlying islands. There are said to be some 200 different scripts deriving from
it. By the eleventh century, the modern script known as Devanagari was in
ascendancy in India proper as the major script of Sanskrit literature. This
northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan;
the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are
all encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit
encoding of syllable-final consonants. Many of the character names in this group
of scripts represent the same sounds, and naming conventions are similar across
the range.
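The role of the virama can be seen by inspecting the code points of a
consonant cluster. The sketch below uses Python's standard unicodedata module;
the choice of KA and SSA as the example consonants is an assumption made for
illustration only.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA (consonant with inherent vowel a)
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
SSA = "\u0937"     # DEVANAGARI LETTER SSA

# KA + VIRAMA suppresses the inherent vowel of KA; when SSA follows, a font
# would normally draw the three code points as a single conjunct.
cluster = KA + VIRAMA + SSA

for ch in cluster:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The virama is a nonspacing combining mark (general category Mn).
virama_is_combining = unicodedata.category(VIRAMA) == "Mn"
```

The cluster is stored as three separate code points even though it is usually
displayed as one shape, which matters when counting or matching characters in
Hindi text.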
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the
Tibetan script and the Southeast Asian scripts, are historically connected with
the Devanagari script as descendants of the ancient Brahmi script. The entire
family of scripts shares a large number of structural features. In the Unicode
Standard, the principles of the Indic scripts are covered in some detail in the
introduction to the Devanagari script; the introductions to the remaining Indic
scripts are abbreviated but highlight any differences from Devanagari where
appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from and is an update of earlier ISCII standards issued in 1983 and
1986. The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic
scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada and Malayalam. This parallel code layout emphasizes the
structural similarities of the Brahmi scripts and follows the stated intention
of the Indian coding standards to enable one-to-one mappings between analogous
coding positions in different scripts in the family. Sinhala, Tibetan, Thai,
Lao, Khmer, Myanmar and other scripts depart to a greater extent from the
Devanagari structural pattern, so the Unicode Standard does not attempt to
provide any direct mappings for these scripts to the Devanagari order.
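The parallel code layout means that a character can often be moved from one
Indic block to another by adding a fixed offset. The following Python sketch
illustrates the idea; the function shift_script and the table BLOCK_START are
hypothetical names, and the approach is a simplification, since not every
position is assigned in every block and a real transliterator needs
script-specific rules.

```python
import unicodedata

# First code point of each 128-position block that shares the ISCII layout.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}


def shift_script(text: str, source: str, target: str) -> str:
    """Map characters to the same relative position in another block."""
    lo = BLOCK_START[source]
    delta = BLOCK_START[target] - lo
    out = []
    for ch in text:
        cp = ord(ch)
        if lo <= cp < lo + 0x80:
            candidate = chr(cp + delta)
            # Keep the shifted character only if that position is assigned;
            # otherwise leave the original character untouched.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


kamal = "\u0915\u092E\u0932"  # Devanagari letters KA, MA, LA
bengali = shift_script(kamal, "devanagari", "bengali")
print([unicodedata.name(c) for c in bengali])
```

Here each Devanagari letter lands on the Bengali letter with the same name
suffix (KA, MA, LA), which is the one-to-one correspondence the layout was
designed to allow.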
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G (Extended Character Set for Vedic). Modern, non-Vedic texts
encoded with ISCII-1991 may be automatically converted to Unicode code points
and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built
on phonological principles and tends to correspond quite closely to
pronunciation. The orthographic syllable is built up of alphabetic pieces, the
actual letters of the Devanagari script. These pieces consist of three distinct
character types: consonant letters, independent vowels and dependent vowel
signs. In a text sequence, these characters are stored in logical (phonetic)
order [62].
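Logical order can be observed by listing the code points of a Hindi word in
the order in which they are stored. The sketch below does this for the word
"Hindi" itself. The function char_type is a hypothetical helper, and its code
point ranges are a simplified assumption that covers only the core letters and
vowel signs of the Devanagari block.

```python
import unicodedata

word = "\u0939\u093F\u0902\u0926\u0940"  # the word "Hindi" in Devanagari


def char_type(ch: str) -> str:
    """Rough classification by code point range (simplified assumption)."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant letter"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    return "other sign"


stored = [(f"U+{ord(ch):04X}", unicodedata.name(ch), char_type(ch)) for ch in word]
for row in stored:
    print(*row, sep="  ")
```

The vowel sign I (U+093F) is drawn to the left of the consonant HA on screen,
yet it is stored after HA, following the order in which the sounds are spoken.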
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in
their daily lives. They are sending and receiving e-mails, searching for
information, reading e-papers, blogging and launching Web sites in their own
languages. Two American IT companies, Microsoft and Google, have played a big
role in making this possible.
A decade ago, there were many problems involved in using Indian languages on
the Internet. There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents. But the advent of Unicode support for Hindi and Urdu
changed all that. The concept of a new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India,
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then, the Hindi
Unicode support has been a part of all subsequent upgrades of Microsoft's
operating systems. Also, providing Input Method Editor facilities gives users
the option to use different types of keyboards," says Meghashyam Karanam,
product manager, vision and localization, at Microsoft India. The earlier system
could incorporate only 127 characters, which is not enough for the Hindi
Devanagari script. A single 16-bit Unicode plane can hold about 65,000
characters, and the full Unicode code space has room for more than a million. As
most computers in India use Microsoft's operating system, it ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch
its MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the
California-based company has launched various products in the past two years.
With the Google Hindi and Urdu search engines, one can search all the Hindi and
Urdu Web pages available on the Internet, including those that are not in a
Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada,
Malayalam and Telugu to name a few, Gmail in five languages and Google
transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that
Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at
Google India. To use the search function, "users can type Hindi words in Roman
script and a drop-down menu suggests several Hindi phrases. By selecting the
appropriate query, users can search for Hindi content without even typing in
Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users.
Google News is available in Hindi. With the Google translation engine, one can
type English words and get a list of suggested synonyms in Hindi. A
transliteration tool allows users to type any word in English, hit the space
bar and get the same word in a different language. Roy-Chowdhury explains the
process of adding a new language.
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site and the language
is soon to be introduced in various other products," he adds.
The efforts of Microsoft, Google and other developers have begun to
produce results. Page views of major Hindi news Web sites are rising fast, and
most of the popular Hindi newspapers have a Web presence now. "In the last two
years, page views of navbharattimes.com have increased significantly and half of
them come through Google, as Net users generally search for a specific news item
or query," says Nagar. Yahoo, with headquarters in California, formed a
partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi
portal. "The Jagran relationship helps us gain significant traction among Indian
Internet users. From all the audience measures for this product, this has been a
resounding success," says Gopal Krishna, head of Yahoo's work for emerging
market audiences. Since Yahoo and Jagran started working together, page views
have "grown to about 14 million from one million a year and a half earlier,"
says Upendra Swami, who heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also
gaining popularity. Started in July 2003, Hindi Wikipedia now has more than
36,000 articles. "It now appears to be the 52nd largest Wikipedia in size
compared to the over 260 individual language Wikipedias," says Jay Walsh, head
of communications at the California-based Wikimedia Foundation. "Considering
there are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi
and Urdu on the Internet? "The major challenge is Internet penetration and PC
prices. The moment we have better Internet penetration, especially in smaller
towns, and PC prices go down, Hindi and Indian languages can flourish on the
Net," says Karnik of webdunia.com. India had more than 49 million Internet users
in June 2008, out of which about 9 million used the Internet regularly,
according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many
local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT
experts agree. "Localization is the key to success in countries like India. In
order to get the widest audience reach, one has to look at Hindi, because in a
country of over a billion people, English is spoken by less than 80 million
people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web
is the democratization of access to information," he says, adding that the
Internet is not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web?
"That's a tough question. Right now it is not much," says Mishra. But
Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical
volume. "If we look at how the Internet developed in the US, it may provide a
useful analogy. First came content, which was mostly produced by people who had
a passion for putting up content they cared about. Traffic and monetization was
not the motive. Second came growing readership as people started discovering
content. This set off a virtuous cycle in which content eventually became a
viable, monetizable business. Third were the application developers, who could
now focus on moving the online experience beyond passive consumption of
information to interactivity, community building, service delivery and a host of
other innovations," Roy-Chowdhury says. "India's market was stuck in phase one
for a long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in the year 1978.
It was collected for a comparative study of American, British and Indian
English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal
agency for the development of Indian language corpora. It has coordinated with
various Indian agencies and universities to develop corpora totalling more than
45 million words in the Scheduled Languages of India, which is also a part of
the TDIL program. The Enabling Minority Language Engineering (EMILLE) program
provides corpora, architecture and tools for Asian languages. It has a
monolingual corpus which contains approximately 96,157,000 words and a parallel
corpus consisting of 200,000 words of English text, which supports translation
into Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of 'One Million Pages'
of knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' for a repository of Hindi words, a database and
dialect mapping of Hindi. The Department of Information Technology of the
Government of India has started a project for developing Indian language
corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium
project for building parallel annotated corpora under the leadership of Dr.
Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also
English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date back to 1980 and involved
prominent institutions such as IIT Kanpur, University of Hyderabad, NCST
Mumbai and CDAC Pune. During the late 1990s, many new projects were
undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur
University Kolkata. In April 2008, TDIL started a consortium-mode project for
building computational tools and Sanskrit-Hindi MT under the leadership of Amba
Kulkarni (University of Hyderabad). The goal of this project is to build
children's stories using multimedia and e-learning content.
4.5.2 Anglabharati
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R. M. K. Sinha.
It is a rule-based system with approximately 1,750 rules and 54,000 lexical
words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available at http://anglahindi.iitk.ac.in. To develop automated
translation systems for regional languages, the Anglabharti architecture has
been adopted by various Indian institutes, for example IIT Guwahati.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004, with new
approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully-automatic, general-purpose,
high-quality machine translation (FGH-MT) system. Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, launched in 1996 for translation of government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra software is
available in desktop, network and web-based forms. It is based on the
Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as
well as the Hindi grammar. Initially, it was domain-specific, covering Personal
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present, it
also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but it is now also
available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliament proceedings
such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin
Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership of
Prof. Pushpak Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence generated by the
system, the user can supply new words, phrases and sentences to the system and
obtain a newly interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based
approach with a statistical approach. It is used for translation from English to
Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT
system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has small sets of transfer rules. Users
can access the system at http://www.aukbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn Treebank annotation
system for part-of-speech tagging. It translates English sentences into
Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
During 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre Processing,
Translation Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query, it searches deep into Indian-language content and tries to return
the appropriate output. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business, education
and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides an automated translation facility from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool
for translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development of corpus creation/collection and content syndication. Moreover, it
provides language consultancy. It has developed various applications in Indian
languages such as My Webdunia, Searching, Language Portals, 24 Dunia, Games,
Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the
development of Indian language software. It provides Indian language enablement
technology to many state governments and the central government in e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed quality Unicode-based fonts for
major Indian languages. It has specifically developed the Shree-Lipi Gurjrati
package for the Gujarati language, which is useful in the DTP sector, corporate
offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has been aware of the need for language
technology. Since 1970, the Department of Electronics and the Department of
Official Language have been involved in developing Indian language technology.
Consequently, ISCII (Indian Script Code for Information Interchange) was
developed for Indian languages on the pattern of ASCII (American Standard Code
for Information Interchange). Also, Indian Languages Transliteration (ITRANS),
developed by Avinash Chopde, represents Indian language alphabets in terms of
ASCII (Madhavi et al 2005). The Department of Information Technology under the
Ministry of Communication and Information Technology is also making efforts for
the proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defence Research and Development Organisation), the Department of
Atomic Energy, the All India Council of Technical Education and UGC (University
Grants Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide funds for research to development agencies. As an
end result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for
Indian Languages) program. TDIL decides the major and minor goals for Indian
language technology and provides standards for language technology. The TDIL
journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term
goals for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating
fast. In the last year alone, there must have been more than 10 to 15 Indian
local search engines launched, some smaller and some bigger, some with huge
funding and some with none. This space is so crowded right now that it is
difficult to know who is really winning. However, we attempt to put forth a
brief overview of the current scenario. Here are some of those that fall in the
localized Indian search engine category:
Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefindin, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP
(Unique Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More Info: India online 2008, India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More Info: India online 2008, India online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More Info: India online 2008, India online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when information is accessed using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no
documents containing this information exist. In reality, these search engines
face a recall problem when dealing with Indian languages because of multiple
spellings, morphological variants of keywords and English keywords written in
Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query
for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is
shown to return 0 documents in Table 4.15; however, a small rephrasing of the
query in Table 4.16 shows that documents containing these keywords are found by
the second search. Merely stating that there is a recall problem is not
sufficient, though. The next obvious question is how much of a problem it is.
In other words, the problem needs to be quantified. For this purpose we
conducted many experiments to determine at what levels this recall problem
occurs, and by how much.
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query                                    Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                      0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम          0        0       0
Table 4.16 Improved Recall
Hindi Query                                    Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                         8820     92      12
वलडम टर ड सटय आतॊकीअटक                         331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                  708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                   37100    7400    93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined morphological
structure and a well-defined grammar, but the grammatical and structural
standards are seldom followed, for various reasons. One of them is the language
diversity of India. Including Hindi, about 28 languages are spoken in India,
and Hindi, being the official language of the Union, is influenced by the
regional languages, which produces dialectal variation not only in speech but
also in writing. Every language attaches markers to a root word to construct
new words: English uses suffixes such as s, es and ing, while Hindi uses
maatraas and suffixes such as एँ, यों and ओं. For example, Yojnaaon (योजनाओं) and
Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना,
"planning" in English). It is desirable to reduce all the morphological
variants of a word to a single canonical form. This process is called word
stemming, and the canonical form is called the root word or base word.
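The stemming step described above can be sketched as a small suffix-stripping routine. This is only an illustration under stated assumptions: the suffix list `SUFFIX_RULES` and the function name `light_stem` are hypothetical, cover just a few plural and oblique endings, and are not the stemmer used in this study.

```python
# Illustrative suffix rules (an assumption, not an exhaustive rule set).
# Each entry is (suffix to strip, replacement); longer suffixes come first.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ाओं", "ा"),   # योजनाओं -> योजना
    ("ाएँ", "ा"),   # योजनाएँ -> योजना
    ("ाएं", "ा"),   # same ending spelled with anusvara
    ("ों", ""),     # झीलों -> झील
]


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Return a rough root form by applying the first matching suffix rule."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Refuse to produce very short stems from short words.
            if len(stem) >= min_stem_len:
                return stem
    return word


if __name__ == "__main__":
    for w in ["योजनाओं", "योजनाएँ", "योजना", "रोगियों", "झीलों"]:
        print(w, "->", light_stem(w))
```

A rule list like this over-stems and under-stems in many cases; it is meant only to show how variants such as Yojnaaon and Yojnaayein can be mapped to one canonical keyword before matching.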
4.8.2 Phonetic nature of Hindi Language and Spelling variations
Spelling variations in the language can be attributed mainly to the phonetic
nature of Indian languages, multiple dialects, the transliteration of proper
names, words borrowed from regional and foreign languages, and the phonetic
variety of the Indian language alphabets. Together, the variety in the
alphabet, the different dialects and the influence of regional and foreign
languages have resulted in several spellings of the same word. For example, the
Hindi word अंग्रेजी (angrējī, meaning English) has several possible spelling
variations.
There are numerous words which are phonetically equivalent but are written
differently. The word school can be written in Hindi in different ways (सक र
सक र सक र). When information is searched for the single standard keyword school
सक र and for its non-standard Hindi phonetic equivalent सक र, Google shows 69
million results for the former and 14 million for the latter. Hindi is also
influenced by the other regional languages, which results in phonetic variety
of words. For example, the English word school (सक र in Hindi) is pronounced and
written as ISKOOL (इसक र) by the majority of the population of India in
different states. For the Hindi word ISKOOL (इसक र), more than two thousand
results are found. Search engines should be capable of retrieving results for
phonetically equivalent forms of the keywords entered. A user may use any of
these forms when searching, and search engines should support all phonetically
equivalent words.
No particular standard exists for writing the keywords used to fetch Hindi web
data either. Each phonetically equivalent keyword in a query produces different
results, i.e. a different set of documents is retrieved, with very little
overlap. A native Hindi user may not be aware of these phonetic issues in Hindi
IR and may miss relevant information.
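One way to make phonetically equivalent spellings comparable is to reduce each word to a crude spelling key before matching. The sketch below is a minimal illustration, assuming three common sources of variation (nukta, chandrabindu versus anusvara, and a nasal consonant with virama versus anusvara); the names `spelling_key` and `group_variants` are hypothetical and this is not the method used by any of the search engines discussed.

```python
import re
import unicodedata
from collections import defaultdict

NUKTA = "\u093C"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# A nasal consonant + virama directly before another consonant (न् in हिन्दी).
NASAL_CONJUNCT = re.compile("[ङञणनम]\u094D(?=[\u0915-\u0939])")


def spelling_key(word: str) -> str:
    """Map common spelling variants of a Hindi word to one comparison key."""
    decomposed = unicodedata.normalize("NFD", word)  # split off any nukta
    no_nukta = decomposed.replace(NUKTA, "")
    nasal_unified = no_nukta.replace(CHANDRABINDU, ANUSVARA)
    return NASAL_CONJUNCT.sub(ANUSVARA, nasal_unified)


def group_variants(words):
    """Group words that share the same spelling key."""
    groups = defaultdict(list)
    for w in words:
        groups[spelling_key(w)].append(w)
    return dict(groups)


if __name__ == "__main__":
    print(group_variants(["आन्दोलन", "आंदोलन", "आँदोलन", "हिन्दी", "हिंदी"]))
```

The key is deliberately lossy: it is only meant for grouping candidate variants of a query keyword, and it does not handle vowel-length variation or inserted vowels such as the ISKOOL form mentioned above.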
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर, अलंकार.
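The idea of searching with synonyms can be sketched as simple query expansion: each keyword that has known synonyms is replaced in turn, producing one variant query per synonym. The synonym table below is a toy, hypothetical example built from the words mentioned above; a real system would load a lexical resource instead.

```python
from typing import Dict, List

# Toy synonym table (hypothetical): keyword -> alternative words.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query: str,
                         synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query followed by its single-substitution variants."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alt in synonyms.get(token, []):
            variant = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    for q in expand_with_synonyms("सोने के आभूषण"):
        print(q)
```

Each variant can then be submitted separately and the result lists merged, which is one simple way to recover documents that use a different word for the same concept.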
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results, as the following examples
show. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e.
"to sleep". In the context of the above query the word सोना means gold, but the
query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means "common"; however,
in Hindi it also means "mango", so the query can be interpreted as "mango is
people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in
a given context is an intricate task. A word can have more than one meaning,
and its meaning depends on the context of the sentence. For example, कर (tax)
has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कर
means hand or arm, with the synonyms हसत फ ह आच शफय; and in yet another context
कर means "to do" (करना).
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Speakers are
sometimes unable to recall the Hindi equivalent of an English word. For
instance, words such as road, bus, pen, television, radio, please, rail, email,
password, insurance, internet, director and department are used even by
uneducated Indians, without awareness of the language these words come from.
Most Indians use these words in English rather than in their native language.
English influences Hindi not only in speech but also in writing, which becomes
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table
4.17 lists queries randomly selected from this set, which throw light on the
effect of the root word on the performance of Hindi language search engines.
Table 4.18 shows examples of the effect of morphological factors on Hindi
queries.
Table 4.17 List of Hindi queries
S No   Query in Hindi                   Meaning in English
1      ब यतवष मवन                       Indian rain forest
1.1    ब यत मवष मवनो                     Indian rain forests
2      हव ईद घमटन क क यण                Reason for air crash
2.1    हव ईद घमटन ओ क क यण              Reason for air crashes
3      ब यतभफ रीज न व रीब ष             Language spoken in India
3.1    ब यतभफ रीज न व रीब ष ए           Languages spoken in India
3.2    ब यतभफ रीज न व रीब ष ओ           Languages spoken in India
4      षवर पतह न ऩयझ र                  Lake on the verge of extinction
4.1    षवर पतह न ऩयझ र                  Lakes on the verge of extinction
5      पर क ततकआऩद                     Natural calamity
5.1    पर क ततकआऩद ए                   Natural calamities
6      ऩ कीपरज तत                       Bird species
6.1    ऩकषमोकीपरज तत                    Bird's species
6.2    ऩकषमोकीपरज ततम                   Bird's species
7      क षषसभसम                        Agriculture problem
7.1    क षषसभसम ओ                      Agriculture problems
8      कीटन शकक इसत भ र                Use of pesticide
8.1    कीटन शकोक इसत भ र               Use of pesticides
9      भ नमसकय ग                       Mental illness
9.1    भ नमसकय चगमो                    Mental patients
9.2    भ नमसकय ग                       Mental patient
10     गर भ णषवक सम जन                 Policy for village
10.1   गर भ णषवक सम जन ओ               Policies for village
10.2   गर भ णषवक सम जन ए               Policies for village
11     परभ खक षषक दर                   Major agricultural office
11.1   परभ खक षषक नदरो                  Major agricultural offices
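Table 4.18 Effect of morphological factors on Hindi queries
(For each root word: the listing of keywords shown by Google, Bing and Guruji,
followed by the documents returned for each morphological variant.)
1   Root words: ब यत वष मवन
    Listing of keywords (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
    No     Morphological variant   Google    Bing     Guruji
    1.1    ब यत वष मवन             50500     4680     485
    1.2    ब यत मवष मवनो            40400     680      61
2   Root word: द घमटन
    Listing of keywords (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
    No     Morphological variant   Google    Bing     Guruji
    2.1    द घमटन                  133000    2410     284
    2.2    द घमटन ओ                117000    420      23
3   Root word: ब ष
    Listing of keywords (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
    No     Morphological variant   Google    Bing     Guruji
    3.1    ब ष                     161000    8330     961
    3.2    ब ष ए                   6200      935      188
    3.3    ब ष ओ                   6090      441      356
4   Root word: झ र
    Listing of keywords (Google, Bing, Guruji): झ रझ रो झ र झ र
    No     Morphological variant   Google    Bing     Guruji
    4.1    झ र                     4740      278      25
    4.2    झ र                     1270      28       1
5   Root word: आऩद
    Listing of keywords (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
    No     Morphological variant   Google    Bing     Guruji
    5.1    आऩद                    102000    4030     410
    5.2    आऩद ए                  1160      64       20
6   Root words: ऩ परज तत
    Listing of keywords (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
    No     Morphological variant   Google    Bing     Guruji
    6.1    ऩ परज तत                48200     1670     98
    6.2    ऩकषमोपरज तत             47600     1150     84
    6.3    ऩकषमोपरज ततम            33800     747      25
7   Root word: सभसम
    Listing of keywords (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
    No     Morphological variant   Google    Bing     Guruji
    7.1    सभसम                   584000    30200    1889
    7.2    सभसम ओ                 584000    7150     1356
8   Root word: कीटन शक
    Listing of keywords (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
    No     Morphological variant   Google    Bing     Guruji
    8.1    कीटन शक                36300     1360     333
    8.2    कीटन शको               35800     800      270
9   Root word: य ग
    Listing of keywords (Google, Bing, Guruji): य गोय ग य ग य ग
    No     Morphological variant   Google    Bing     Guruji
    9.1    य ग                     205000    21600    1423
    9.2    य चगमो                  128000    3280     239
    9.3    य ग                     112000    6280     647
10  Root word: म जन
    Listing of keywords (Google, Bing, Guruji): म जन ओ म जन म जन म जन
    No     Morphological variant   Google    Bing     Guruji
    10.1   म जन                   673000    18500    3343
    10.2   म जन ओ                 669000    6020     990
    10.3   म जन ए                 673000    2860     416
11  Root word: क दर
    Listing of keywords (Google, Bing, Guruji): क दरीमक दर क दर क दर
    No     Morphological variant   Google    Bing     Guruji
    11.1   क दर                   261000    11300    655
    11.2   क नदरो                  29500     1850     105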
Table 4.19 Precision values of the three search engines
S No   Query                          Precision@10
                                      Google   Bing   Guruji
1.1    ब यतवष मवन                     0.5      0.3    0.1
1.2    ब यत मवष मवनो                   0.3      0.1    0.1
2.1    हव ईद घमटन क क यण              0.5      0.3    0.1
2.2    हव ईद घमटन ओ क क यण            0.3      0.3    0.1
3.1    ब यतभफ रीज न व रीब ष           0.9      0.5    0.5
3.2    ब यतभफ रीज न व रीब ष ए         0.5      0.4    0.3
3.3    ब यतभफ रीज न व रीब ष ओ         0.5      0.2    0.2
4.1    षवर पतह न ऩयझ र                0.5      0.3    0.2
4.2    षवर पतह न ऩयझ र                0.5      0.3    0
5.1    पर क ततकआऩद                   1        0.4    0.3
5.2    पर क ततकआऩद ए                 0.6      0.6    0.1
6.1    ऩ कीपरज तत                     1        0.7    0.2
6.2    ऩकषमोकीपरज तत                  0.9      0.6    0.1
6.3    ऩकषमोकीपरज ततम                 0.9      0.6    0.1
7.1    क षषसभसम                      0.7      0.3    0.2
7.2    क षषसभसम ओ                    0.7      0.4    0.2
8.1    कीटन शकक इसत भ र              1        0.8    0.2
8.2    कीटन शकोक इसत भ र             0.9      0.6    0.2
9.1    भ नमसकय ग                     0.7      0.6    0.4
9.2    भ नमसकय चगमो                  0.9      0.6    0.3
9.3    भ नमसकय ग                     0.9      0.5    0.4
10.1   गर भ णषवक सम जन               0.9      0.7    0.3
10.2   गर भ णषवक सम जन ओ             1        0.6    0
10.3   गर भ णषवक सम जन ए             0.8      0.6    0.2
11.1   परभ खक षषक दर                 0.5      0.4    0
11.2   परभ खक षषक नदरो                0.4      0.2    0.2
Figure 4.1 Precision values of the three search engines (series: P Google, P
Bing, P Guruji; x-axis: query numbers 1.1 to 11.1; y-axis: 0 to 1)
Documents returned by all three search engines are more numerous when the query
is submitted with the root word. This justifies searching with the root word,
because in general we get better results with keywords in their root form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root
form. Bing and Guruji do not appear to index in that form, which would explain
why they retrieve fewer documents than Google. The overall comparison of the
three search engines in the tables above shows that, in general, the quantity
of results retrieved increases when keywords are used in their root form. For
search engines, however, the quality of results is more important than the
quantity. Figure 4.1 and Table 4.19 compare the precision values of the three
search engines. The precision value is calculated from the top 10 results of
each search engine. On close observation, the precision of Google is high for
almost all queries. Since, as noted above, Google appears to index keywords in
their root form, the relevance of its results is also higher than that of the
other two search engines. This indicates that morphological variation in the
keywords affects not only the quantity but also the quality of results.
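The precision value over the top 10 results can be computed as in the following minimal sketch. The relevance judgments in the example are hypothetical, and dividing by k even when fewer than k results are returned is a design assumption, not something stated in the study.

```python
from typing import Sequence


def precision_at_k(relevance_flags: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k ranked results that were judged relevant.

    relevance_flags holds one boolean per result, in rank order.
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance_flags)[:k]
    return sum(1 for flag in top if flag) / k


if __name__ == "__main__":
    # Hypothetical judgments: 5 of the first 10 results are relevant.
    judged = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judged))
```

With this convention a query that returns only three documents, two of them relevant, scores 0.2 rather than 0.67, which penalizes engines that retrieve very little.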
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for phonetically
equivalent forms of the keywords entered. A user may use any of these forms
when searching, and search engines should support all phonetically equivalent
words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision obtained.
Table 4.20 Results of the search engine on Phonetic nature of Hindi language
(For each Hindi query with standard keywords: the phonetic variations of the
keywords, followed by the Google results for queries having those keywords.)
1   Hindi query: ससजिमो भ िहयीर ऩद थम
    Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
    Query keywords              No of Results   Precision@10
    ससजिमोिहयीर                 97              0.9
    ससजिमो जहयीर                632             0.9
    सिबजमो िहयीर                194             0.9
2   Hindi query: आसभान छ त भहॊगाई
    Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
    Query keywords              No of Results   Precision@10
    आसभ न भहॊगाई               35300           1.0
    आसभ न भहॉगाई               1040            1.0
    आसभ भह ग ई                14              0.6
    आसभाॊ भह ग ई               563             0.7
3   Hindi query: भरषटाचाय स आिादी
    Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
    Query keywords              No of Results   Precision@10
    भरषटाचायआज दी              211000          0.8
    भरशटाचायआज दी              214             0.6
    बयषटाचाय आज दी             447             0.7
    भरषटट च यआिादी             1090000         0.9
    भरशट च य आिादी             1040            0.7
    बयषटट च यआिादी             1190            0.8
4   Hindi query: अननाहिाय क आनदोरन
    Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
    Query keywords              No of Results   Precision@10
    अनन हज य आनदोरन           84700           0.3
    अनन हज य आॊदोरन           85100           0.8
    अनन हज य क आॉदोरन         78              0.6
    अनना हिाय क आनद रन        399             0.5
    अनन हिाय क आनद रन         3260000         1.0
5   Hindi query: फयोिगायी सभसम सभ ध न
    Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
    Query keywords              No of Results   Precision@10
    फयोिगायी                   9650            0.9
    फयोिगायी                   80600           1.0
    फयोिगायी                   170             0.7
    फयोिगायी                   30              0.5
Figure 4.2 Precision Charts for Phonetic nature of Hindi language (five panels,
Query No 1 to Query No 5)
The above table and figure show that search engines return documents for the
various phonetically equivalent Hindi queries. No particular standard exists
for writing the keywords used to fetch Hindi web data. Each phonetically
equivalent keyword in a query produces different results, i.e. a different set
of documents is retrieved, with very little overlap. The precision chart shows
that the degree of relevance for queries containing phonetically equivalent
keywords is the same or nearly equal. A native Hindi user may not be aware of
these phonetic issues in Hindi IR and may miss relevant information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values across the three
search engines.
Table 4.21 Effect of word synonyms on Hindi IR
(Documents returned and precision per 10 results for each search engine.)
1   Query: स न क आब षण
    Standard Hindi words and synonyms: आब षणगहन
    S No   Query                Google    Per 10   Bing    Per 10   Guruji   Per 10
    1.1    स न क आब षण          217000    0.8      3250    0.7      381      0.5
    1.2    स न क गहन            188000    0.8      2590    0.6      389      0.5
    1.3    स न क ज वय           78900     0.8      1670    0.7      311      0.6
    1.4    स न क अर क य         9490      0.5      633     0.4      70       0
    1.5    स न क आबयण           493       0.5      38      0.3      1        0
2   Query: क र फ दर
    Standard Hindi words and synonyms: फ दर
    S No   Query                Google    Per 10   Bing    Per 10   Guruji   Per 10
    2.1    क र फ दर             233000    0.7      7510    0.7      733      0.3
    2.2    क र भ घ              40700     0.9      1500    0.8      99       0.6
    2.3    क र जरधय             1570      0.6      54      0.6      2        0.2
3   Query: सतर सशिकतकयण
    Standard Hindi words and synonyms: सतर न यी
    S No   Query                Google    Per 10   Bing    Per 10   Guruji   Per 10
    3.1    सतर सशिकतकयण        9950      0.9      1570    0.7      760      0.6
    3.2    न यीसशिकतकयण        29300     0.9      1910    0.9      736      0.4
    3.3    भहहर सशिकतकयण       96300     0.9      5160    0.8      1091     0.3
    3.4    औयतसशिकतकयण         7670      0.8      680     0.7      510      0.2
4   Query: मसक दयक अ हक य
    Standard Hindi words and synonyms: अ हक य
    S No   Query                Google    Per 10   Bing    Per 10   Guruji   Per 10
    4.1    मसक दयक अ हक य       1990      0.4      18      0.3      60       0.6
    4.2    मसक दयक अमबभ न       2400      0.6      304     0.2      16       0.1
    4.3    मसक दयक घभ ड         495       0.5      54      0.6      9        0.1
5   Query: व रग ओ
    Standard Hindi words and synonyms: व
    S No   Query                Google    Per 10   Bing    Per 10   Guruji   Per 10
    5.1    व रग ओ               6960      1        698     0.8      29       0.5
    5.2    ऩ डरग ओ              13400     1        1080    0.9      143      0.9
    5.3    दयखतरग ओ             481       0.5      19      0.5      0        0
6   Query: आ खद न
    Standard Hindi words and synonyms: आ ख
    S No   Query                Google    Per 10   Bing    Per 10   Guruji   Per 10
    6.1    आ खद न               34000     0.7      3690    0.5      312      0.3
    6.2    न तरद न              77500     1        3240    1        159      0.9
    6.3    च द न                2450      0.8      427     0.1      36       0.1
Figure 4.3 Comparison of precision values against three search engines (x-axis:
query numbers 1.1 to 6.3; y-axis: 0 to 1)
The examples above show that using Hindi keywords together with their synonyms
improves information retrieval for a query in Hindi. Synonyms affect not only
the quantity of documents returned but also their quality.
The above table and figure show that Google returns more documents than the
other two search engines and that Guruji returns the fewest; the reason may be
that fewer documents are available to it or that its indexing is poor. However,
we are more interested in the quality of results than in their quantity. As far
as quality is concerned, Google and Bing clearly provide better results than
Guruji. In the average case Google still ranks first, that is, its precision
values are higher than those of Bing and Guruji. It thus becomes clear that
changing a keyword into a synonym yields equivalent results. Synonyms of
keywords therefore play an important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected ones. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns give the meaning of the query in English for the respective
context of the ambiguous keyword.
Table 4.22 List of randomly selected ambiguous queries
S No   Query                    For keyword as     In English
1      नारी को सोना पसंद है        सोना (Gold)         Women like gold
                                सोना (To sleep)     Women like to sleep
2      आम लोगों की पसंद          आम (Common)        Common man's choice
                                आम (Mango)         Mango is people's choice
3      बाल विकास और पोषण        बाल (Children)      Child Development and Nutrition
                                बाल (Hair)          Hair Development and Nutrition
4      सपेरों का फन              फन (Art)            Art of snake charmers
                                फन (Snake head)     Snake charmer's snake head
5      युद्ध में कुल विनाश          कुल (Aggregate)     Aggregate destruction in wars
                                कुल (Family)        Destruction of families in war
The ambiguous queries listed above were tested on three search engines: Google,
Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query   Ambiguous   Documents   Context      Results   Context      Results   Other
        keyword     returned                 found                  found     context
1       सोना        50800       Gold         5         To sleep     2         3
2       आम          488000      Common       3         Mango        3         4
3       बाल         2900000     Children     7         Hair         3         0
4       फन          184         Art          0         Snake head   10        0
5       कुल         17800       Aggregate    2         Family       3         5
Table 4.24 Ambiguity test for Bing
Query   Ambiguous   Documents   Context      Results   Context      Results   Other
        keyword     returned                 found                  found     context
1       सोना        2680        Gold         2         To sleep     2         6
2       आम          17800       Common       3         Mango        3         4
3       बाल         4030        Children     6         Hair         2         2
4       फन          25          Art          0         Snake head   9         1
5       कुल         1900        Aggregate    0         Family       2         8
Table 4.25 Ambiguity test for Guruji
Query   Ambiguous   Documents          Context      Results   Context      Results   Other
        keyword     returned                        found                  found     context
1       सोना        109                Gold         0         To sleep     0         10
2       आम          6756               Common       3         Mango        0         7
3       बाल         635                Children     5         Hair         2         3
4       फन          No Results Found   Art          na        Snake head   na        na
5       कुल         84                 Aggregate    0         Family       0         10
The results in the tables above show that all three search engines return
documents without differentiating between the contexts of the keyword in the
query. The last column, labeled "Other Context", holds the number of results
that are not relevant to the query supplied, or that contain the keywords in
some other, non-required context. The results make it clear that all search
engines return documents in different contexts, so search engines can be said
to underperform when supplied with ambiguous queries. The numbers in the "Other
Context" column signify the deviation from relevance. For example, for the
query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context"
column contains 5 documents for Google, 8 documents for Bing and all 10
documents for Guruji.
For another query, सपेरों का फन (art of snake charmers), which has the second
context "snake charmer's snake head", the retrieved documents are expected to
be in the context "art". The results show, however, that Google returns all 10
documents in the non-required context (snake head) and Bing returns 9, whereas
Guruji fails to retrieve even a single document. In such a scenario it becomes
important for search engines to address the issue of ambiguity in keywords in
order to obtain better results.
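The context columns of the ambiguity tables can be derived from per-result judgments with a simple tally, sketched below. The function name `tally_contexts` and the label strings are hypothetical; the example labels are arranged only to reproduce the counts reported for Google on query 5, and do not represent the actual judged documents.

```python
from collections import Counter
from typing import Dict, Sequence


def tally_contexts(labels: Sequence[str], sense_a: str, sense_b: str) -> Dict[str, int]:
    """Count top-ranked results per keyword sense; everything else is 'other'."""
    counts = Counter(labels)
    other = len(labels) - counts[sense_a] - counts[sense_b]
    return {sense_a: counts[sense_a], sense_b: counts[sense_b], "other": other}


if __name__ == "__main__":
    # One label per top-10 result: which sense of the keyword the page uses.
    labels = ["aggregate"] * 2 + ["family"] * 3 + ["unrelated"] * 5
    print(tally_contexts(labels, "aggregate", "family"))
```

The "other" count corresponds to the "Other Context" column, which the discussion above treats as the deviation from relevance.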
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but also in writing, which becomes
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval, as the following example explains.
Example: the English word exercise is written in Hindi as (एकसयस इज). The word
exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज
एकसयस इज एकस यस इज एकसयस इज. Following this pattern of phonetic variation, a
variety of popular keywords and queries from various domains were tested in the
experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
English word   Hindi keywords (Google transliteration; standard Hindi keyword;   Google results (in the same order)
               phonetic equivalents)
Woman          वोभन; व भन; व भ न; व भ न                                        42200; 101000; 3520; 34800
Insurance      इनसयाॊस; इ शम यस; इ शम य स; इनशम य नस                            1220; 246000; 1390; 3710
Cancer         क सय; क सय; क नसय; क नसय                                      1880000; 35700; 6490; 1820
Hospital       हॉसटऩटर; ह िसऩटर; ह िसऩटर; ह सऩ टर                              404000; 263200; 2440; 100
Corruption     कोररसततओन; कयपशन; कयऩशन; क यपशन                               1890; 368000; 1110; 58
Computer       कॊ तमटय; कमपम टय; कमपम टय; क पम टय                              4450000; 1040000; 2610000; 537000
University     उननवशसतम; म तनवमसमटी; म तनव मसमटी; म तनवसमटी                     1540; 1070000; 1420; 3270
Director       डियकटय; ड मय कटय; ड इय कटय; ड मय कटय                           4600; 735000; 97000; 8300
Accident       एसकसिट; एकस डट; एकस ड नट; ऐिकसडट                              26300; 125000; 2510; 5160
Parliament     ऩशरअभट; ऩ मरमम भट; ऩ मरमभट; ऩ मरमम भट                           3240; 38000; 639; 1120
Specialist     टऩशसअशरटट; सऩ मशममरसट; सऩ शमरसट; सऩ श मरसट                     143; 1440; 51300; 9820
Expert         एकसऩट; एकसऩटम; ऐकसऩटम; ऐकसऩटम                                 269000; 8730; 182; 8
Policy         ऩोशरटम; ऩॉमरस; ऩ मरस; ऩ मरस                                     609; 395000; 69700; 289000
Table 4.26 Keywords along with their phonetic variants written in Hindi
The above table shows that the search engine returns documents for every
single-keyword query: documents are returned, in large numbers, for all
phonetic variants of the keywords. It can also be seen that people have their
own ways of writing Hindi words and that no standard is followed for storing
Hindi data on the web. Documents are likewise retrieved for every phonetic
variant of an English keyword written in Hindi script. The Google
transliteration column of the table shows the keywords obtained with the Google
transliteration tool. Table 4.26 shows that transliteration does not provide
the correct Hindi word in most cases. For example, the correct transliteration
of the word University should be म तनवमसमटी, whereas Google transliteration
provides the word उतनव मसमतम, which is completely wrong. The table shows that
1540 documents were retrieved for the wrong keyword उतनव मसमतम (University),
and the same holds for other single-word queries: Insurance इनस य स, Parliament
ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. This
suggests that Hindi website developers make use of unchecked and non-standard
transliteration, which makes the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of
English influence on Hindi IR, in terms of precision and the quantity of
documents retrieved. The Hindi query is transformed into its variants by the
software (design and working discussed in the next chapter) by replacing the
Hindi keywords with English keywords written in Hindi, without changing the
meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by its English equivalent, in both cases without changing the meaning
of the original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign" (फॉरेन) and
निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two
levels as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
The original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein)
supplied by the user is thus transformed into two equivalent senses containing
a mixture of English and Hindi, while the meaning of the query remains the
same. From the sample set of one hundred queries, some randomly selected
queries are presented below in Table 4.27.
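The two-level transformation can be sketched as follows. This is a hedged illustration only, not the software described in the next chapter: the lexicon `ENGLISH_IN_HINDI` and the function `transform_query` are hypothetical, and a real system would need a much larger mapping plus handling of multiword keywords.

```python
from itertools import combinations
from typing import Dict, List

# Hypothetical lexicon: Hindi keyword -> English word written in Devanagari.
ENGLISH_IN_HINDI: Dict[str, str] = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_query(query: str,
                    lexicon: Dict[str, str] = ENGLISH_IN_HINDI) -> Dict[str, List[str]]:
    """Build level 1 (one word replaced) and level 2 (several replaced) variants."""
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in lexicon]
    levels: Dict[str, List[str]] = {"level_1": [], "level_2": []}
    for size in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, size):
            variant = [lexicon[tok] if i in chosen else tok
                       for i, tok in enumerate(tokens)]
            key = "level_1" if size == 1 else "level_2"
            levels[key].append(" ".join(variant))
    return levels


if __name__ == "__main__":
    result = transform_query("विदेशी निवेश भारत में")
    for level, variants in result.items():
        print(level, variants)
```

Note that this sketch enumerates every single-word replacement at level 1, whereas the example in the text shows one representative variant per level.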
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi: Tabular representation
(Google search results for the original Hindi query and its two influenced
variants.)
1   In English: Health and blood donation
    Hindi query:   सव सथ औय यकतद न                  1520
    Level 1:       हलथ औय यकतद न                   8360
    Level 2:       हलथ औय जरि_िोनिन               1020
2   In English: Treatment for Blood pressure
    Hindi query:   यकतच ऩ क इर ज                   95800
    Level 1:       जरिपरिय क इर ज                  37200
    Level 2:       जरिपरिय क टरीटभट                2150
3   In English: Cardiologist
    Hindi query:   रदम चचककतसक                    241000
    Level 1:       हाट िॉकटय                       99100
    Level 2:       काडिमोरासिटट                    1770
4   In English: Government Employment Policy
    Hindi query:   सयक य दव य य जग य म जन          1840000
    Level 1:       सयक य दव य य जग य टकीभ          51600
    Level 2:       सयक य दव य एमपतरॉमभट सकीभ       93
5   In English: Foreign investment in India
    Hindi query:   षवद श तनव श ब यत भ               658000
    Level 1:       पायन तनव श ब यत भ               527
    Level 2:       पायनइनवटटभट ब यत भ              66
6   In English: Corruption free India
    Hindi query:   भरषटट च य भ कत ब यत             181000
    Level 1:       कयतिन भ कत ब यत                19700
    Level 2:       कयतिन फरी इॊडिमा                47000
Figure 4.4 Transformed Queries into two equivalent senses (six panels, Hindi
Query 1 to Hindi Query 6, each showing the Hindi query, Level 1 and Level 2)
The above table and figure show that documents are returned for the original as
well as the transformed Hindi queries, and that the quantity of retrieved
documents is considerable. For search engines the quality of results is more
important than the quantity; Table 4.28 and Figure 4.5 are therefore presented
below for the analysis of precision values. Three popular search engines,
namely Google, Bing and AltaVista, were used for retrieving web results.
Table 4.28 Analysis of precision values: Tabular representation
(Precision@10 for each Hindi query and its influenced variants.)
No   Query                                          Google   Bing   AltaVista
1    Hindi query:  सव सथऔययकतद न                   0.9      0.9    0.9
     Level 1:      ह लथऔययकतद न                    0.9      0.8    0.9
     Level 2:      ह लथऔयबरड_ड न शन               0.9      0.8    0.8
2    Hindi query:  यकतच ऩक इर ज                    0.8      0.8    0.8
     Level 1:      बरडपर शयक इर ज                  0.9      0.7    0.7
     Level 2:      बरडपर शयक टरीटभट                0.7      0.5    0.5
3    Hindi query:  रदमचचककतसक                     0.9      0.9    0.9
     Level 1:      ह टमचचककतसक                    0.8      0.7    0.7
     Level 2:      क रड मम र िजसट                  1        1      1
4    Hindi query:  सयक यदव य य जग यम जन            1        0.8    0.6
     Level 1:      सयक यदव य य जग यसकीभ            0.9      0.8    0.7
     Level 2:      सयक यदव य एमपरॉमभटसकीभ          0.8      0.8    0.8
5    Hindi query:  षवद श तनव शब यतभ                 0.9      0.9    0.9
     Level 1:      प य नतनव शब यतभ                 0.8      0.8    0.9
     Level 2:      प य नइनव सटभटब यतभ              0.8      0.4    0.4
6    Hindi query:  भरषटट च यभ कतब यत               1        1      1
     Level 1:      कयपशनभ कतब यत                  1        1      1
     Level 2:      कयपशनफरीइ रडम                  1        1      1
Figure 45 Analysis of Inter precision values
From the above tables and figures, it can be clearly seen that Hindi data of a
similar nature can be mined against Hindi queries by transforming them into
variants that include English keywords written in Hindi script. The
transformation of queries resulted in an increase in retrieved data. The
relevance of the retrieved data can also be seen in the precision columns. For
every Hindi query and its transformed variations, the degree of relevance of the
documents is very close, equal or improved. For example, for the Hindi query
सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the
transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन,
9 of the first 10 documents are also relevant. A similar pattern holds for the
rest of the Hindi queries, as shown in the table above. Without transformation
of Hindi queries, the user may miss relevant information, since a Hindi user may
not be aware that such information exists on the web and may be unable to
formulate the query variations that reflect English influence. From the above
table, it can be said that English-influenced Hindi information is present on
the web and is increasing day by day. By including English keywords in Hindi
script in the query, the scope of searching in Hindi and obtaining relevant
information can be increased.
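
The precision values discussed above are "precision at 10": the share of the
first 10 returned documents judged relevant. A minimal Python sketch of that
calculation follows; the function name and the list of relevance judgments are
illustrative assumptions, not taken from the study.

```python
def precision_at_k(judgments, k=10):
    """Return the fraction of the first k results that were judged relevant.

    judgments: list of booleans in ranked order (True = relevant).
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = sum(1 for judged in judgments[:k] if judged)
    return relevant / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
example = [True] * 9 + [False]
score = precision_at_k(example)
```

With 9 relevant documents among the first 10, the score is 0.9, which is how a
table entry of 0.9 can be read.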
[Figure 4.5 chart: Inter Precision Chart. X axis: HQ1 to HQ6, each with L1 and
L2 (HQ = Hindi Query, L = Level). Series: Google, Bing, AltaVista. The plotted
values are the precision values listed in Table 4.28.]
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at the
root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the next
chapter) which acts as an interface between the user and the search engines.
With the help of this tool, the user can widen the scope of search on the web in
the Hindi language [66][67].
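
The idea of transforming a query at the interface level can be sketched as
follows. This is only an illustration under stated assumptions: the small
lexicon, its Devanagari spellings and the function name are hypothetical and do
not describe the tool presented in the next chapter. Each Hindi keyword is
optionally replaced by an English equivalent written in Devanagari, and every
combination is emitted as a query variant.

```python
from itertools import product

# Hypothetical lexicon: Hindi keyword -> English equivalents in Devanagari.
ENGLISH_INFLUENCED = {
    "स्वास्थ्य": ["हेल्थ"],
    "रक्तदान": ["ब्लड डोनेशन"],
}


def query_variants(query, lexicon=ENGLISH_INFLUENCED):
    """Return the original query followed by its English-influenced variants."""
    # For every word, the candidates are the word itself plus its equivalents.
    options = [[word] + lexicon.get(word, []) for word in query.split()]
    variants = []
    for combination in product(*options):
        variant = " ".join(combination)
        if variant not in variants:
            variants.append(variant)
    return variants


query = "स्वास्थ्य और रक्तदान"
variants = query_variants(query)
```

The first element of the result is always the original query, so an interface
could submit each variant to a search engine in turn and merge the results.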
ह + भ = हभ
The conjunct ज + ञ = ज्ञ is pronounced as ग्य (gya) in Hindi. Conjuncts are
treated as a single unit, and a maatraa is placed before the entire conjunct.
There are hundreds of conjuncts, but most are easily discernible.
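
In Unicode text, a conjunct of this kind is normally stored as its consonants
joined by the virama sign (U+094D), and the font draws the combined shape. A
minimal Python sketch, assuming that representation (the helper name
make_conjunct is illustrative, not from the source):

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def make_conjunct(*consonants):
    """Join consonant letters with the virama to form a conjunct sequence."""
    return VIRAMA.join(consonants)


# ja (U+091C) + nya (U+091E) gives the conjunct pronounced "gya" in Hindi.
gya = make_conjunct("\u091C", "\u091E")
names = [unicodedata.name(ch) for ch in gya]
```

The conjunct looks like one unit on screen but is stored as three code points:
the first consonant, the virama and the second consonant.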
Punctuation
Hindi has one punctuation sign of its own, the viraam, a vertical line which
terminates a sentence. Other punctuation, such as commas and question marks, is
borrowed from English. In modern typography, periods are also used in place of
the viraam.
[59][60]
4.3 Unicode and fonts
Computers store characters by assigning a number to each one. This
process is known as encoding. Most of us are familiar with ASCII, which is a
7-bit encoding of the characters used in English (it can represent at most 128
characters). With the passage of time, the need was felt for a single encoding
that could contain enough characters to accommodate all the languages of the
world. To enable sharing of information, this encoding would need to be a
universally accepted standard. That standard is Unicode. Unicode gives a unique
number (code point) to each character. Its code space runs from U+0000 to
U+10FFFF, so every code point fits in 32 bits, and it can potentially cover the
characters of all known languages.
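
The difference between the ASCII range and the wider Unicode code space can be
checked directly in Python, whose strings are sequences of Unicode code points.
The helper names below are illustrative.

```python
def code_point(ch):
    """Format the Unicode code point of a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


def fits_in_ascii(ch):
    """ASCII is a 7-bit code, so only code points 0 to 127 are representable."""
    return ord(ch) < 128


latin_a = code_point("A")            # Latin capital letter A
devanagari_ka = code_point("\u0915")  # Devanagari letter KA
```

The Latin letter falls inside the 128 ASCII values, while the Devanagari letter
has a code point far outside that range and therefore needs Unicode.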
There is also another international standard, ISO 10646 of the
International Organization for Standardization (ISO), which defines the
Universal Character Set (UCS). Fortunately, the participants of both projects
(ISO and Unicode) realized around 1991 that two different unified character sets
were not what the world needed. They joined their efforts and worked together on
creating a single encoding. Both projects still exist and publish their
respective standards independently, but they have agreed to keep the encodings
of the Unicode and ISO 10646 standards compatible.
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a
particular character, but that is not all. They must also define how this value
will be represented in bits when stored in a computer file or transmitted over
the Internet. The Unicode Standard defines three encoding forms for this
purpose. They allow the same data to be transmitted in a byte-, word- or
double-word-oriented format (i.e. in code units of 8, 16 or 32 bits). All three
encoding forms encode the same common character repertoire and can be
efficiently transformed into one another without loss of data. The three
encoding forms defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. It transforms all Unicode
characters into a variable-length encoding of bytes. Its advantages are that
the Unicode characters corresponding to the familiar ASCII set have the same
byte values as in ASCII, and that Unicode characters transformed into UTF-8 can
be used with much existing software without extensive rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to
characters with economical use of storage. It is reasonably compact: all the
heavily used characters fit into a single 16-bit code unit, while all other
characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single
code unit access to characters is desired. Each Unicode character is encoded in
a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
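
The three encoding forms can be compared with Python's built-in codecs. The
sketch below encodes an ASCII letter, a Devanagari letter and a character
outside the 16-bit range, and records how many bytes each form needs. The
little-endian codec variants are used only so that no byte order mark is added;
the sample characters are an arbitrary choice for illustration.

```python
SAMPLES = {
    "latin A": "A",
    "devanagari KA": "\u0915",
    "supplementary": "\U0001F600",  # outside the 16-bit range
}

CODECS = ["utf-8", "utf-16-le", "utf-32-le"]


def encoded_sizes(ch):
    """Return the number of bytes the character needs in each encoding form."""
    return {codec: len(ch.encode(codec)) for codec in CODECS}


sizes = {label: encoded_sizes(ch) for label, ch in SAMPLES.items()}

# All three forms carry the same information and convert without loss.
round_trip = all(
    ch.encode(codec).decode(codec) == ch
    for ch in SAMPLES.values()
    for codec in CODECS
)
```

The Devanagari letter takes three bytes in UTF-8, two in UTF-16 and four in
UTF-32, which is the storage trade-off the descriptions above refer to.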
4.3.2 UTF-8
UTF-8 has the benefit that the ASCII characters are still represented as a
single byte, providing compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other values. Any
document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme. The
original design allowed sequences of 2 to 6 bytes; the current definition
(RFC 3629) limits a sequence to 4 bytes, and the most commonly used characters
are at most three bytes long. Non-ASCII characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of
which has the most significant bit set. This means that all bytes representing
non-ASCII characters are invalid under ASCII encoding (since all ASCII
characters stored in bytes have their most significant bit not set). This allows
an application to differentiate between ASCII and non-ASCII characters: bytes
representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII
character indicates how many bytes follow for this character. All further bytes
in the multibyte sequence are continuation bytes that carry the remaining bits
of the actual character [61].
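
The role of the first byte can be illustrated with a small Python sketch that
reads the leading bits of a byte and reports the length of the sequence it
starts. It checks bit patterns only and assumes the modern four-byte limit, so
it is not a full validator; the function name is an illustrative choice.

```python
def utf8_sequence_length(lead_byte):
    """Return the sequence length announced by a UTF-8 lead byte."""
    if lead_byte < 0x80:           # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC0 <= lead_byte < 0xE0:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead_byte < 0xF0:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead_byte < 0xF8:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")


def is_continuation(byte):
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


encoded = "\u0915".encode("utf-8")   # Devanagari letter KA
announced = utf8_sequence_length(encoded[0])
```

For the Devanagari letter KA the encoded bytes are E0 A4 95: the lead byte
announces a three-byte sequence, and the two following bytes are continuation
bytes with their most significant bit set.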
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a
side-by-side comparison of a few will often reveal structural similarities even
in the modern letterforms. With minor historical exceptions, they are written
from left to right. They are all abugidas, in which most symbols stand for a
consonant plus an inherent vowel (usually the sound a). Word-initial vowels in
many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected
consonant. Absence of the inherent vowel, when that occurs, is frequently marked
with a special sign. In the Unicode Standard, this sign is denoted by the
Sanskrit word virama. In some languages, another designation is preferred. In
Hindi, for example, the word hal refers to the character itself and halant
refers to the consonant that has its inherent vowel suppressed; in Tamil, the
word pulli is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character, with
its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri
Lanka in the south, and from Pakistan in the west to the easternmost islands of
Indonesia, are derived from the ancient Brahmi script. The oldest lengthy
inscriptions of India, the edicts of Ashoka from the third century BCE, were
written in two scripts, Kharoshthi and Brahmi. These are both ultimately of
Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written
from right to left, was supplanted by Brahmi and its derivatives. The
descendants of Brahmi spread with myriad changes throughout the subcontinent and
outlying islands. There are said to be some 200 different scripts deriving from
it. By the eleventh century, the modern script known as Devanagari was in
ascendancy in India proper as the major script of Sanskrit literature. The
northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan;
the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are
all encoded according to a common plan, so that comparable characters are in the
same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII)
encoding for these scripts and makes use of a virama. Sinhala has a virama-based
model but is not structurally mapped to ISCII. Tibetan stands apart, using a
subjoined consonant model for conjoined consonants, reflecting its somewhat
different structure and usage. The Limbu script makes use of an explicit
encoding of syllable-final consonants. Many of the character names in this group
of scripts represent the same sounds, and naming conventions are similar across
the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan
script and the Southeast Asian scripts, are historically connected with the
Devanagari script as descendants of the ancient Brahmi script. The entire family
of scripts shares a large number of structural features. The principles of the
Indic scripts are covered in some detail in this introduction to the Devanagari
script. The remaining introductions to the Indic scripts are abbreviated but
highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988
(Indian Script Code for Information Interchange). The ISCII standard of 1988
differs from, and is an update of, earlier ISCII standards issued in 1983 and
1986. The Unicode Standard encodes Devanagari characters in the same relative
positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988
standard. The same character code layout is followed for eight other Indic
scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada and Malayalam. This parallel code layout emphasizes the
structural similarities of the Brahmi scripts and follows the stated intention
of the Indian coding standards to enable one-to-one mappings between analogous
coding positions in different scripts in the family. Sinhala, Tibetan, Thai,
Lao, Khmer, Myanmar and other scripts depart to a greater extent from the
Devanagari structural pattern, so the Unicode Standard does not attempt to
provide any direct mappings for these scripts to the Devanagari order.
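
Because of this parallel layout, a character in one of these nine blocks can be
mapped to its analogous position in another block by adding the difference
between the block start addresses. The Python sketch below illustrates the idea
for Devanagari. The block start values are those of the Unicode code charts; the
function name and the choice to leave other characters unchanged are assumptions
made for illustration. It does not check whether the target position is
actually assigned, so it is a demonstration of the layout rather than a complete
transliterator.

```python
# Start of each block that follows the common ISCII-based layout.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}


def shift_from_devanagari(text, target):
    """Move Devanagari characters to the analogous positions of another script."""
    offset = BLOCK_START[target] - BLOCK_START["devanagari"]
    shifted = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            shifted.append(chr(cp + offset))  # same relative position
        else:
            shifted.append(ch)                # spaces, digits, Latin text
    return "".join(shifted)


# KA, MA, LA in Devanagari mapped to the same letters in Bengali.
bengali = shift_from_devanagari("\u0915\u092E\u0932", "bengali")
```

Some positions are unassigned in some scripts, which is one reason the layout
facilitates transliteration only to some degree.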
In November 1991, at the time The Unicode Standard, Version 1.0 was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G, "Extended Character Set for Vedic". Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is built
on phonological principles and tends to correspond quite closely to
pronunciation. The orthographic syllable is built up of alphabetic pieces, the
actual letters of the Devanagari script. These pieces consist of three distinct
character types: consonant letters, independent vowels and dependent vowel
signs. In a text sequence, these characters are stored in logical (phonetic)
order [62].
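
Logical order means that a dependent vowel sign is stored after its consonant
even when it is drawn to the left of it, as with the vowel sign i. The Python
sketch below classifies characters into the three types named above using code
point ranges of the Devanagari block. The ranges cover only the core letters and
are a simplification assumed for illustration, as is the function name.

```python
def char_type(ch):
    """Classify a Devanagari character by its position in the block."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if cp == 0x094D:
        return "virama"
    return "other"


# The syllable "ki": consonant KA followed in storage by vowel sign I,
# although the vowel sign is displayed to the left of the consonant.
syllable = "\u0915\u093F"
types = [char_type(ch) for ch in syllable]
```

The stored sequence is consonant first, vowel sign second, which matches the
order in which the syllable is pronounced.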
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in
their daily lives. They are sending and receiving e-mails, searching for
information, reading e-papers, blogging and launching Web sites in their own
languages. Two American IT companies, Microsoft and Google, have played a big
role in making this possible.
A decade ago, there were many problems involved in using Indian languages on
the Internet. "There was a mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents. But the advent of Unicode support for Hindi and Urdu
changed all that." The concept of a new character encoding from the Unicode
Consortium, a nonprofit in California whose members include Google, IBM,
Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India,
proved to be a boon for Indian languages. Microsoft incorporated the Hindi
Unicode font Mangal in its operating system in 2001. "Since then, the Hindi
Unicode support has been a part of all subsequent upgrades of Microsoft's
operating systems. Also, providing Input Method Editor facilities gives users
the option to use different types of keyboards," says Meghashyam Karanam,
product manager, vision and localization, at Microsoft India. The earlier system
could incorporate only 127 characters, which is not enough for the Hindi
Devanagari script. The Unicode system can incorporate up to 65000 characters. As
most computers in India use Microsoft's operating system, this ensured that the
Hindi font was available to most of them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support for 13 Indian languages, such as Hindi, Tamil, Kannada,
Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in
Washington State, partnered with one of the early Hindi portals, webdunia.com,
to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi
version of Microsoft Office as well as for language interface packs," says
Jaideep Karnik, general manager for content and localization at webdunia.com.
The Indore, Madhya Pradesh-based company has an office in the United States and
helps major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the
California-based company has launched various products in the past two years.
With the Google Hindi and Urdu search engines, one can search all the Hindi and
Urdu Web pages available on the Internet, including those that are not in a
Unicode font. "Google offers searching in 13 languages (Hindi, Tamil, Kannada,
Malayalam and Telugu, to name a few), Gmail in five languages, and Google
transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that
Google has added to its offerings," says Rahul Roy-Chowdhury, product manager
at Google India. To use the search function, "users can type Hindi words in
Roman script and a drop-down menu suggests several Hindi phrases. By selecting
the appropriate query, users can search for Hindi content without even typing in
Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users.
Google News is available in Hindi. With the Google translation engine, one can
type English words and get a list of suggested synonyms in Hindi. A
transliteration tool allows users to type any word in English, hit the space
bar and get the same word in a different language. Roy-Chowdhury explains the
process of adding a new language.
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language
is soon to be introduced in various other products," he adds.
The efforts of Microsoft, Google and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of
the popular Hindi newspapers have a Web presence now. "In the last two years,
page views of navbharattimes.com have increased significantly, and half of them
come through Google, as Net users generally search for a specific news item or
query," says Nagar. Yahoo, with headquarters in California, formed a partnership
with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The
Jagran relationship helps us gain significant traction among Indian Internet
users. From all the audience measures for this product, this has been a
resounding success," says Gopal Krishna, head of Yahoo's work for emerging
market audiences. Since Yahoo and Jagran started working together, page views
have "grown to about 14 million from one million a year and a half earlier,"
says Upendra Swami, who heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering there
are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and
Urdu on the Internet? "The major challenge is Internet penetration and PC
prices. The moment we have better Internet penetration, especially in smaller
towns, and PC prices go down, Hindi and Indian languages can flourish on the
Net," says Karnik of webdunia.com. India had more than 49 million Internet
users in June 2008, of whom about 9 million used the Internet regularly,
according to a study by Juxtconsult India, a research company. "There is a big
opportunity in Indian languages. Studies showed that only 28 percent of Indian
Web surfers preferred English on the Web, but as good quality content in Indian
languages was not easily available, they did not visit many local language
sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree.
"Localization is the key to success in countries like India. In order to get
the widest audience reach, one has to look at Hindi, because in a country of
over a billion people, English is spoken by less than 80 million people," says
Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is
not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we
look at how the Internet developed in the U.S., it may provide a useful
analogy. First came content, which was mostly produced by people who had a
passion for putting up content they cared about. Traffic and monetization was
not the motive. Second came growing readership, as people started discovering
content. This set off a virtuous cycle in which content eventually became a
viable, monetizable business. Third were the application developers, who could
now focus on moving the online experience beyond passive consumption of
information to interactivity, community building, service delivery and a host
of other innovations," Roy-Chowdhury says. "India's market was stuck in phase
one for a long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of
Indian English. It was developed under the leadership of Prof. S.V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in the year
1978. It was collected for a comparative study of American, British and Indian
English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal
agency for the development of Indian language corpora. It has coordinated with
various Indian agencies and universities to develop corpora of more than 45
million words in the Scheduled Languages of India, which is also a part of the
TDIL program. The Enabling Minority Language Engineering (EMILLE) program
provides corpora, architecture and tools for Asian languages. It has a
monolingual corpus which contains approximately 96,157,000 words, and a parallel
corpus consisting of 200,000 words of text in English, which helps in
translation into Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus which is a repository of "One Million Pages" of
knowledge-based text. Mahatma Gandhi International University has started the
project "Hindi Samghraha" as a repository of a Hindi words database and a
dialect mapping of Hindi. The Department of Information Technology of the
Government of India has started a project for developing Indian language
corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium
project for building parallel annotated corpora under the leadership of Dr.
Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also
English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field date from 1980 and involved
prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST
Mumbai and CDAC Pune. In the late 1990s, many new projects were undertaken by
IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University
Kolkata. Since April 2008, TDIL has run a consortium-mode project for building
computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni
(University of Hyderabad). The goal of this project is to build children's
stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof. R.M.K. Sinha. It
is a rule-based system and has approximately 1,750 rules and 54,000 lexical
words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. To develop
automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation (FGH-MT) system. Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was conceived by the
Government of India in 1996 for the translation of government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra software
is available in desktop, network and web-based forms. It is based on the
Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent
the English as well as the Hindi grammar. Initially it was domain specific,
covering Personnel Administration, specifically Gazette Notifications, Office
Orders, Office Memorandums and Circulars; gradually the domains were expanded.
At present it also covers domains such as Banking, Transportation and
Agriculture. Earlier, Mantra technology was only for English to Hindi
translation, but currently it is also available for English to other Indian
languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system
for translating parliament proceedings such as Papers to be Laid on the Table
[PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis.
The Secretariat of the Rajya Sabha (the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership of
Prof. Pushpak Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence generated by
the system, the user can provide feedback in the form of new words, phrases and
sentences and obtain a newly interpreted translation. The Shiva MT system is
available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti is a rule-based system that also uses a statistical approach. It is used
for translation from English to Indian languages (Hindi, Marathi and Telugu).
Users can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K B Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has small sets of transfer rules. Users
can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn Treebank
annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the
paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre Processing,
Translation Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine. It was founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with the assistance of
Sequoia Capital. Guruji.com uses crawling technology based on proprietary
algorithms. For any query it goes deep into Indian language content and tries
to return the most appropriate results. The Guruji search engine covers a
range of specific content: news, entertainment, travel, astrology,
literature, business, education and more.
4.5.12.2 Google
The internet search giant Google also supports major Indian languages such as
Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides automated translation from English to Indian languages.
The Google Transliteration Input Method Editor is currently available for
languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as
Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a
syllable-based conversion model. WikiBhasa is Microsoft's multilingual
content creation tool for translating Wikipedia pages into other languages.
The source language in WikiBhasa is English, and the target language can be
any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest
and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and to the central government for their
e-governance programs. It has developed software for multilingual content
creation for newspaper publishing, and has also developed high-quality
Unicode-based fonts for major Indian languages. It has specifically developed
the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in
the DTP sector, in corporate offices and in the e-Governance program of the
Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970 the Department of
Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently ISCII (Indian Script Code
for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In
addition, Indian Languages Transliteration (ITRANS), developed by Avinash
Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al
2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also making efforts towards the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council for Technical Education
and the UGC (University Grants Commission), are also involved, directly and
indirectly, in research and development of language technology. All these
agencies help to develop important areas of research and provide research
funds to development agencies. As an end result, IndoWordNet was developed
for the Indian languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian
language technology and provides standards for language technology. The TDIL
journal Vishvabharata (Jan 2010) outlined short-term, intermediate and
long-term goals for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very
fast. In the last year alone more than 10-15 Indian local search engines must
have been launched, some smaller and some bigger, some with huge funding and
some with none. The space is now so crowded that it is difficult to know who
is really winning. Nevertheless, we attempt a brief overview of the current
scenario. The following fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol,
burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along
with Ask Laila, which launched a couple of days back. There are also
localized versions of the big players Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition). It is too early to pass judgment on any of them. These
are testing stages, and every start-up is adding new features and improving
its services.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying popular present-day techniques, shows a severe problem of low recall
when information is accessed using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return '0'
search results for Indian language queries, giving the impression that no
documents containing this information exist. In reality these search engines
face a recall problem when dealing with Indian languages because of multiple
spellings, morphological variants of keywords, and English keywords written
in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query
for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ)
returns '0' documents in Table 4.15; however, a small rephrasing of the query
in Table 4.16 shows that documents containing these keywords do exist. Simply
saying that there is a recall problem is not sufficient. The next obvious
question is 'how much of a problem is it?' In other words, the problem needs
to be quantified. For this purpose we conducted many experiments to determine
at what levels this recall problem occurs, and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological
structure and a well-defined grammar. In practice, however, the grammatical
and structural standards of the language are seldom followed, for various
reasons. One reason is the linguistic diversity of India. Including Hindi,
about 28 languages are spoken in India, and Hindi, being the official
language of India, is influenced by the regional languages, which results in
dialectal variation not only in speech but also in writing. Every language
uses markers that are attached to a root word to construct new words: English
uses s, es and ing, and Hindi uses MAATRAAS such as ऐ म ा ओ. For example,
Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of
the Hindi root word Yojnaa (योजना, "planning" in English). It is desirable to
combine all the morphological variants of a word into a single canonical
form. This process is called word stemming, and the canonical form is called
the root word or base word.
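The idea of mapping morphological variants to one canonical form can be sketched with a light suffix stripper. This is only a minimal illustration under stated assumptions: the suffix list, the `MIN_STEM_LENGTH` guard and the function names `stem` and `canonical_query` are hypothetical and are not taken from this chapter; a practical Hindi stemmer needs a much larger suffix inventory and exception handling.

```python
# Illustrative plural/oblique suffixes (an assumption, not from the chapter),
# written as Unicode escapes so the code points are unambiguous.
SUFFIXES = sorted(
    [
        "\u093f\u092f\u094b\u0902",  # ियों
        "\u093f\u092f\u093e\u0901",  # ियाँ
        "\u093f\u092f\u093e\u0902",  # ियां
        "\u0913\u0902",              # ओं
        "\u090f\u0901",              # एँ
        "\u090f\u0902",              # एं
        "\u094b\u0902",              # ों
        "\u0947\u0902",              # ें
    ],
    key=len,
    reverse=True,  # try the longest suffixes first
)

MIN_STEM_LENGTH = 2  # hypothetical guard against over-stemming short words


def stem(word: str) -> str:
    """Strip the longest matching suffix; return the word unchanged otherwise."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def canonical_query(query: str) -> str:
    """Stem every whitespace-separated keyword of a query."""
    return " ".join(stem(token) for token in query.split())
```

With this sketch, Yojnaaon and Yojnaayein both reduce to Yojnaa, so a query containing either variant can be matched against documents indexed under the root word.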
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, the transliteration of proper
names, words borrowed from regional and foreign languages, and the phonetic
variety of the Indian language alphabets. The variety in the alphabet, the
different dialects and the influence of regional and foreign languages have
resulted in several spellings of the same word. For example, the Hindi word
अ गर ज (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in
writing. The word school can be written in Hindi in different ways
(सक र सक र सक र). When information is searched for the standard keyword
school (सक र) and for a non-standard phonetically equivalent keyword
(सक र), Google shows 69 million results for the former and 14 million for the
latter. Hindi is influenced by the other regional languages, which results in
phonetic variety of words. For example, the English word school (सक र in
Hindi) is pronounced and written as ISKOOL (इसक र) by a large part of the
population in different states of India. For the Hindi word ISKOOL (इसक र),
more than two thousand results are found. Search engines should be capable of
retrieving results for words that are phonetically equivalent to the keywords
entered. A user may use any keyword for searching, and search engines should
support all its phonetically equivalent forms.
Also, no particular standard exists for writing the keyword used to fetch
Hindi web data. Each phonetically equivalent keyword in a query produces
different results, i.e. a different set of documents is retrieved, with
little overlap. A native Hindi user may not be aware of these phonetic issues
in Hindi IR and may miss information relevant to his or her needs.
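One way a retrieval system could support phonetically equivalent spellings is to index every word under a normalized key, so that variants collide on the same key. The following is a rough sketch only: the three rules (dropping the nukta, treating chandrabindu as anusvara, and writing a nasal consonant plus virama before another consonant as anusvara) and the name `spelling_key` are assumptions chosen for illustration, not rules proposed in this chapter, and they cover only a small part of the variation described above.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below a consonant, as in the loan sound "za"
CHANDRABINDU = "\u0901"  # nasalization sign
ANUSVARA = "\u0902"      # nasal sign
# A nasal consonant followed by virama and then another consonant.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])"
)


def spelling_key(word: str) -> str:
    """Return a lookup key shared by some common spelling variants of a word."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    key = unicodedata.normalize("NFD", word)
    key = key.replace(NUKTA, "")
    key = key.replace(CHANDRABINDU, ANUSVARA)
    key = NASAL_CLUSTER.sub(ANUSVARA, key)
    return key
```

The key is meant for matching only, not for display: both the index terms and the query keywords would be passed through `spelling_key` before comparison.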
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information
retrieval system. For example, the Hindi word आभूषण (ornament in English) has
the following commonly spoken synonyms: गहना, जेवर and अलंकार.
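Synonym handling at query time can be sketched as simple query expansion: each keyword that has an entry in a synonym table produces one additional query variant. The table below is a hypothetical one-entry example built from the ornament synonyms mentioned above, and `expand_query` is an illustrative name; a real system would load a thesaurus (for instance a WordNet-style resource) and would also have to inflect the substituted word.

```python
from typing import Dict, List

# Hypothetical synonym table for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query followed by one variant per known synonym."""
    tokens = query.split()
    variants = [" ".join(tokens)]
    for position, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variant = tokens[:position] + [alternative] + tokens[position + 1:]
            variants.append(" ".join(variant))
    return variants
```

Each variant can then be submitted separately and the result lists merged, which is the manual procedure this chapter applies when it compares a query with its synonym forms.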
4.8.4 Ambiguous Words
Ambiguous words reduce the relevancy of the results. The examples below show
this very clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning,
i.e. to sleep. In the context of the above query the word सोना means gold,
but the query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means common; however,
in Hindi it also means mango, so the query can be interpreted as "mango is
people's choice".
Many words are polysemous in nature, and finding the correct sense of a word
in a given context is an intricate task. A word can have more than one
meaning, and its meaning depends on the context of the sentence. For example,
कर (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another
context कर (hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another
context कर (to do) corresponds to करना.
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Indians
are sometimes unable to recall the Hindi equivalent of an English word. For
instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even
by uneducated Indians, without awareness of the language these words come
from. Most Indians use these words in English rather than in their native
language. English has influenced Hindi not only in speech but in writing too.
This becomes especially evident in Hindi literature on the web. The influence
of English on Hindi has been observed to be one of the most important
parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown
in the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from this set, which throw light
on the effect of the root word on the performance of Hindi language search
engines. Table 4.18 shows examples of the effect of morphological factors on
Hindi queries.
Table 4.17 List of Hindi queries
S No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Birds' species
6.2 | ऩकषमोकीपरज ततम | Birds' species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
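Table 4.18 Effect of morphological factors on Hindi queries
For each root word: the keywords listed by the search engines (Google, Bing,
Guruji), followed by one row per morphological variant.
S No | Morphological variant | Documents returned: Google | Bing | Guruji
1 Root word: ब यत वष मवन
Keywords listed: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2 Root word: द घमटन
Keywords listed: द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3 Root word: ब ष
Keywords listed: ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4 Root word: झ र
Keywords listed: झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5 Root word: आऩद
Keywords listed: आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6 Root word: ऩ परज तत
Keywords listed: ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7 Root word: सभसम
Keywords listed: सभसम ए सभसम ओ सभसम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8 Root word: कीटन शक
Keywords listed: कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9 Root word: य ग
Keywords listed: य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10 Root word: म जन
Keywords listed: म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11 Root word: क दर
Keywords listed: क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105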
Figure 4.1 Precision values of the three search engines [chart: precision
from 0 to 1 per query number, with one series each for Google, Bing and
Guruji]
Table 4.19 Precision values of the three search engines
S No | Query | Precision (top 10): Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This supports searching for
documents with keywords in their root form, since in general the root form
yields better results.
It has also been observed that only Google lists morphological variants of
the root words, whereas Bing and Guruji list only the root word supplied, in
almost all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their
root form. Bing and Guruji do not index in that form, which is why they
retrieve fewer documents than Google. The overall comparison of the three
search engines in the tables above shows that, in general, the quantity of
results retrieved increases when the keywords are used in their root form.
For search engines, however, the quality of results is more important than
the quantity. Figure 4.1 and Table 4.19 compare the precision values of the
three search engines. The precision value is calculated over the top 10
results of each search engine. On closely observing the results, we can say
that the precision of Google is high in almost all queries. Since, as
mentioned above, Google indexes keywords in their root form, the relevancy of
its results is also higher than that of the other two search engines. This
indicates that morphological variation in the keywords affects not only the
quantity but also the quality of results.
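The precision figures in this discussion are computed over the top 10 results, so a value of 0.5 means that five of the first ten results were judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` is illustrative, the relevance judgments are assumed to be made manually, and counting missing results as non-relevant when fewer than ten are returned is an assumption, since the chapter does not state how such cases were scored.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` holds one boolean per returned result, in rank order.
    If fewer than k results were returned, the missing positions count
    as non-relevant (an assumption, not stated in the chapter).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged in top if judged) / k
```

For example, a result list whose first ten entries contain five relevant documents gives `precision_at_k(...) == 0.5`.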
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any keyword
for searching, and search engines should support all its phonetically
equivalent forms.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. Table 4.20 shows the number of results and the
precision obtained for each query.
Table 4.20 Results of the search engine on the phonetic nature of Hindi
language
Columns: Hindi query with standard keywords | Phonetic variations of the
keywords | Query tested on Google | No of results | Precision (top 10)
ससजिमो भ िहयीर ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 0.9
ससजिमो जहयीर 632 0.9
सिबजमो िहयीर 194 0.9
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 1.0
आसभ न भहॉगाई 1040 1.0
आसभ भह ग ई 14 0.6
आसभाॊ भह ग ई 563 0.7
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 0.8
भरशटाचायआज दी 214 0.6
बयषटाचाय आज दी 447 0.7
भरषटट च यआिादी 1090000 0.9
भरशट च य आिादी 1040 0.7
बयषटट च यआिादी 1190 0.8
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन 84700 0.3
अनन हज य आॊदोरन 85100 0.8
अनन हज य क आॉदोरन 78 0.6
अनना हिाय क आनद रन 399 0.5
अनन हिाय क आनद रन 3260000 1.0
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 0.9
फयोिगायी 80600 1.0
फयोिगायी 170 0.7
फयोिगायी 30 0.5
Figure 4.2 Precision charts for the phonetic nature of Hindi language [five
pie charts, Query No 1 to Query No 5, each showing the share of precision
among the phonetic variants of the query]
The above table and figure show that the search engine returns documents for
each of the phonetically equivalent Hindi queries. It is observed that no
particular standard exists for writing the keyword used to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces different
results, i.e. a different set of documents is retrieved, with little overlap.
The precision charts show that the degree of relevance for queries containing
phonetically equivalent keywords is almost the same. A native Hindi user may
not be aware of these phonetic issues in Hindi IR and may miss information
relevant to his or her needs.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information
retrieval system. For example, the Hindi word आभूषण (ornament in English) has
the following commonly spoken synonyms: गहना, जेवर and अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
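Table 4.21 Effect of word synonyms on Hindi IR
For each query: the standard Hindi word and its synonyms, followed by one row
per query variant with documents returned and precision over the top 10.
S No | Query | Google | Per 10 | Bing | Per 10 | Guruji | Per 10
1 Query: स न क आब षण; standard Hindi words and synonyms: आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 Query: क र फ दर; standard Hindi words and synonyms: फ दर फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 Query: सतर सशिकतकयण; standard Hindi words and synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 Query: मसक दयक अ हक य; standard Hindi words and synonyms: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 Query: व रग ओ; standard Hindi words and synonyms: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 Query: आ खद न; standard Hindi words and synonyms: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1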
Figure 4.3 Comparison of precision values against three search engines
[chart: precision from 0 to 1 for query variants 1.1 to 6.3]
From the examples above it is observed that using Hindi keywords together
with their synonyms improves information retrieval for a query in Hindi. The
use of synonyms affects not only the quantity of documents returned but also
their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the
fewest; the reason may be the availability of fewer documents or poor
indexing. However, we are more interested in the quality of results than in
their quantity. As far as quality is concerned, it can be clearly seen that
Google and Bing provide better results than Guruji. On average Google still
ranks first, i.e. its precision values are higher than those of Bing and
Guruji. It is thus clear that changing a keyword into its synonym can yield
equivalent results. Therefore synonyms of keywords play an important role in
Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected ones. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the
fifth column holds the same ambiguous keyword in the other context. The
fourth and sixth columns give the meaning of the query in English with
respect to the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war
The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
From the results in the tables above it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In these tables the last column, labelled "Other
Context", holds the number of results that are not relevant to the query
supplied, i.e. documents that contain the keyword in some other, unwanted
context. The results show that all the search engines return documents in
different contexts; search engines therefore underperform when supplied with
ambiguous queries. The numbers in the "Other Context" column signify the
deviation from relevance. For example, for the query युद्ध में कुल विनाश
(aggregate destruction in wars) the "Other Context" column contains 5
documents for Google, 8 for Bing and all 10 for Guruji.
For another query, सपेरों का फन (art of snake charmers), which has a second
context (snake charmer's snake head), the retrieved documents are expected to
be in the context "art". The results show, however, that Google returns all
10 documents in the unwanted context (snake head) and Bing returns 9, whereas
Guruji fails to retrieve even a single document. In this scenario it becomes
important for search engines to address the issue of ambiguity in keywords in
order to obtain better results.
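The per-context counts reported in the ambiguity tables can be produced by labelling each of the top 10 results with the sense in which the keyword occurs and then tallying the labels. The sketch below assumes such labels are assigned by a human judge; the names `tally_contexts` and `intended_share` and the label strings are hypothetical, introduced only for illustration.

```python
from collections import Counter


def tally_contexts(labels, senses):
    """Count results per known sense; any other label falls under 'other'."""
    counts = Counter(label if label in senses else "other" for label in labels)
    row = {sense: counts.get(sense, 0) for sense in senses}
    row["other"] = counts.get("other", 0)
    return row


def intended_share(row, intended):
    """Share of judged results that match the intended sense of the query."""
    total = sum(row.values())
    return row[intended] / total if total else 0.0
```

Applied to the first Google row (five results for gold, two for to sleep, three in other contexts), `tally_contexts` yields 5, 2 and 3, and `intended_share` for the gold sense is 0.5.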
4.13 Discussion: Influence of English on Hindi Information Retrieval
English has influenced Hindi not only in speech but in writing too. This
becomes especially evident in Hindi literature on the web. The influence of
English on Hindi has been observed to be one of the most important parameters
for Hindi information retrieval, as the following example illustrates.
Example: The English word exercise is written in Hindi as (एकसयस इज). The
word exercise (एकसयस इज) has the following phonetic variations in Hindi:
एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following the phonetic variations in
this example, a variety of popular keywords and queries from various domains
were tested in the experiments.
In the following figure a sample set of common and popular English keywords
along with their phonetic variants written in Hindi can be seen
English Words
Google Transliteration
Standard Hindi Keywords
Phonetic Equivalents Search Engine Google
Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800
Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710
Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820
Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100
Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000
0 537000
University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270
Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300
Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160
Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
3240 38000 639 1120
specialist टऩशसअशरटट
सऩ मशममरसट
सऩ शमरसट सऩ श मरसट
143 1440 51300 9820
Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 136
Table 426 keywords along with their phonetic variants written in Hindi
From the above table it can be observed that, for a single-keyword query, the search engine returns documents not only for the standard keyword but also for every phonetic variant of it, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web: documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be concluded that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
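One way to see how differently spelled forms of the same keyword might be brought together is to reduce each spelling to a comparison key. The following is a minimal sketch only, not the method used in this work (the tool itself is described in the next chapter). The function names `variant_key` and `group_variants` and the three folding rules (drop the nukta, treat chandrabindu as anusvara, treat a half nasal consonant as anusvara) are illustrative assumptions that cover only a small part of the variation seen in Table 4.26.

```python
import unicodedata

NUKTA = "\u093c"         # DEVANAGARI SIGN NUKTA
VIRAMA = "\u094d"        # DEVANAGARI SIGN VIRAMA
ANUSVARA = "\u0902"      # DEVANAGARI SIGN ANUSVARA
CHANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # nga, nya, nna, na, ma


def variant_key(word):
    """Fold a few common spelling differences into one comparison key.

    Assumed, simplified rules: remove the nukta, map chandrabindu to
    anusvara, and map a nasal consonant followed by virama to anusvara.
    """
    text = unicodedata.normalize("NFD", word)  # split precomposed nukta letters
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    for nasal in NASALS:
        text = text.replace(nasal + VIRAMA, ANUSVARA)
    return text


def group_variants(words):
    """Group spellings that share the same comparison key."""
    groups = {}
    for word in words:
        groups.setdefault(variant_key(word), []).append(word)
    return groups


# Two spellings of "cancer": one with anusvara, one with a half "na".
spellings = ["\u0915\u0948\u0902\u0938\u0930", "\u0915\u0948\u0928\u094d\u0938\u0930"]
for key, members in group_variants(spellings).items():
    print(key, members)
```

A key of this kind only merges spellings that differ by the listed rules; variants that differ in vowel length or in the choice of consonant would need further rules or a phonetic matching scheme.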
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both precision and the number of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query. The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, again without changing the meaning of the original Hindi query.
Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
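The two-level transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the software developed in this work: the function name `transform_query`, the plain whitespace tokenisation and the small `equivalents` dictionary are hypothetical, and the Devanagari spellings in the example are one possible rendering of the romanised query "videshi nivesh Bharat mein".

```python
from itertools import combinations


def transform_query(query, equivalents):
    """Return query variants grouped by level.

    Level 1: exactly one Hindi keyword is replaced by its English
    equivalent written in Hindi script.
    Level 2: more than one keyword is replaced.
    """
    words = query.split()
    replaceable = [i for i, w in enumerate(words) if w in equivalents]
    levels = {1: [], 2: []}
    for count in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, count):
            variant = [equivalents[w] if i in chosen else w
                       for i, w in enumerate(words)]
            levels[1 if count == 1 else 2].append(" ".join(variant))
    return levels


# Hypothetical dictionary: Hindi keyword -> English keyword in Hindi script.
equivalents = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
for level, variants in transform_query("विदेशी निवेश भारत में", equivalents).items():
    print("Level", level, variants)
```

Each variant would then be submitted to the search engine as a separate query. With two replaceable keywords the sketch yields two Level 1 variants and one Level 2 variant.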
[Figure 4.4: Queries transformed into two equivalent senses. Chart of result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants.]
Table 4.27: Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation)
| In English | Hindi Query | Influenced Hindi Query, Level 1 | Influenced Hindi Query, Level 2 | Google results (Hindi Query) | Google results (Level 1) | Google results (Level 2) |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.
Table 4.28: Analysis of precision values (tabular representation)
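Precision at 10 for the Hindi Query (HQ) and its influenced variants (L1 = Level 1, L2 = Level 2):

| Hindi Query | Level 1 | Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1.0 | 0.9 | 0.7 | 1.0 | 0.9 | 0.7 | 1.0 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1.0 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |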
Figure 4.5: Analysis of inter-precision values
From the above tables and figures it can clearly be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the rest of the Hindi queries behave similarly, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, because the Hindi user may not be aware that such information exists on the web and may be unable to formulate the variant query that reflects English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.
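The precision values discussed above are the fraction of the first 10 retrieved documents that are relevant. A minimal sketch of that calculation is given below; the function name `precision_at_k` and the representation of relevance judgments as a list of booleans (one per ranked result) are assumptions made for illustration.

```python
def precision_at_k(judgments, k=10):
    """Fraction of the first k results that were judged relevant.

    judgments: sequence of truthy/falsy values in rank order,
    True meaning the document at that rank is relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(judgments)[:k]
    return sum(1 for relevant in top if relevant) / k


# 9 of the first 10 documents judged relevant, as in the example above.
print(precision_at_k([True] * 9 + [False]))
```

Dividing by k rather than by the number of results returned means that a query returning fewer than k documents is not rewarded for its short result list.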
[Inter-precision chart: precision values on Google, Bing and AltaVista for HQ1 to HQ6 and their L1 and L2 variants, where HQ = Hindi Query and L = Level]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to reduce the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
4.3.1 Various Encoding Forms
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms that specify how a particular character is represented in bits. The three encoding forms allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms defined by the Unicode Consortium are:
UTF-8
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32
UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
UTF stands for UCS Transformation Format.
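The difference between the three encoding forms can be seen by encoding the same characters and counting the bytes. The sketch below uses Python's built-in codecs; the helper name `encoded_sizes` is illustrative, and the big-endian codec variants are chosen only so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Number of bytes one character occupies in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


# Latin capital A, Devanagari letter KA, and a character outside the
# 16-bit range (U+1F600), which needs a pair of 16-bit code units.
for ch in ("A", "\u0915", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

A Devanagari letter takes three bytes in UTF-8 but a single 16-bit code unit in UTF-16, which is why UTF-16 is the more compact of the two for Hindi text, while UTF-8 is more compact for text that is mostly ASCII.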
4.3.2 UTF-8
UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme. The original design allowed sequences of 2 to 6 bytes; the current definition (RFC 3629) limits a sequence to 4 bytes, and the most commonly used characters need at most three bytes. Non-ASCII characters are encoded as follows.
Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit clear). This allows an application to differentiate between ASCII and non-ASCII characters: bytes representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character and carries the leading bits of the code point. All further bytes in the multibyte sequence carry the remaining bits of the character [61].
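The role of the first byte can be illustrated with a short sketch that reads the sequence length from the leading bits, following the bit patterns of the current UTF-8 definition (sequences of at most 4 bytes). The function names `utf8_sequence_length` and `is_continuation` are illustrative.

```python
def utf8_sequence_length(lead_byte):
    """Length of a UTF-8 sequence, read from the bits of its first byte."""
    if (lead_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: ASCII, single byte
    if (lead_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (lead_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (lead_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")


def is_continuation(byte):
    """Continuation bytes always have the form 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


data = "\u0915".encode("utf-8")  # Devanagari letter KA
print([f"{b:08b}" for b in data])
print(utf8_sequence_length(data[0]), all(is_continuation(b) for b in data[1:]))
```

Because lead bytes and continuation bytes use disjoint bit patterns, a decoder that starts reading in the middle of a character can skip forward to the next lead byte and resynchronise.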
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virama. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
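The parallel code layout means that a character can be moved from one of these scripts to the analogous position in another by a constant offset. The sketch below illustrates this with the Unicode block start values of the nine scripts named above; the function name `shift_script` is illustrative. It is a rough positional mapping only: not every position is assigned in every script (Tamil, for example, has far fewer consonants), so the result is not guaranteed to be a valid or correct transliteration.

```python
# First code point of each 128-position block that follows the ISCII layout.
BLOCK_START = {
    "Devanagari": 0x0900, "Bengali": 0x0980, "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80, "Oriya": 0x0B00, "Tamil": 0x0B80,
    "Telugu": 0x0C00, "Kannada": 0x0C80, "Malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source, target):
    """Move each character of the source block to the same relative
    position in the target block; leave all other characters unchanged."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    shifted = []
    for ch in text:
        code = ord(ch)
        if src <= code < src + BLOCK_SIZE:
            shifted.append(chr(code - src + dst))
        else:
            shifted.append(ch)
    return "".join(shifted)


# Devanagari KA (U+0915) maps to Bengali KA (U+0995) and Tamil KA (U+0B95).
for target in ("Bengali", "Tamil"):
    print(target, f"U+{ord(shift_script(chr(0x0915), 'Devanagari', target)):04X}")
```

A practical converter would add a table of exceptions for positions that are unassigned or used differently in the target script.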
In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
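The character types and the logical storage order can be inspected directly with Python's `unicodedata` module. The sketch below is illustrative: the function name `describe` and the classification by character name are assumptions, and the range used for independent vowels (U+0904 to U+0914) is a simplification that leaves out a few rarely used letters encoded elsewhere in the block.

```python
import unicodedata


def describe(word):
    """List (code point, Unicode name, character type) for each character."""
    rows = []
    for ch in word:
        name = unicodedata.name(ch, "UNKNOWN")
        if "VOWEL SIGN" in name:
            kind = "dependent vowel sign"
        elif name.endswith("SIGN VIRAMA"):
            kind = "virama"
        elif "LETTER" in name:
            # Simplified range for independent vowels (assumption).
            kind = "independent vowel" if 0x0904 <= ord(ch) <= 0x0914 else "consonant"
        else:
            kind = "other"
        rows.append((f"U+{ord(ch):04X}", name, kind))
    return rows


# The word "Hindi" in Devanagari: HA, vowel sign I, NA, virama, DA, vowel sign II.
for row in describe("\u0939\u093f\u0928\u094d\u0926\u0940"):
    print(row)
```

In the stored sequence the vowel sign I follows the consonant HA, as it does in pronunciation, even though it is displayed to the left of the consonant; the virama between NA and DA marks the consonant cluster.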
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.
A decade ago, there were many problems involved in using Indian languages on the Internet. "There was mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The concept of a new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, the Hindi Unicode support has been a part of all subsequent upgradations of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system, in its original 16-bit design, can incorporate up to about 65,000 characters (later versions extend this to over a million code points). As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support for 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.
The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good-quality content in Indian languages was not easily available, they did not visit many local-language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now, it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the U.S., it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English; it was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop more than 45 million words of corpora in the scheduled languages of India, which is also part of the TDIL programme. The Enabling Minority Language Engineering (EMILLE) programme provides corpus architecture and tools for Asian languages. It has a monolingual corpus, which contains approximately 96,157,000 words, and a parallel corpus consisting of 200,000 words of English text, which supports translation into Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus and a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha', a repository of Hindi words and a dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been noted since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.
4.5.2 Anglabharati
IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is a rule-based system with approximately 1750 rules and 54,000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It is based on the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent English as well as Hindi grammar. Initially it was domain-specific, covering personnel administration, specifically gazette notifications, office orders, office memorandums and circulars; gradually the domains were expanded. At present it also covers domains such as banking, transportation and agriculture. Earlier, Mantra technology was only for English-to-Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat (Rajya Sabha being the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages.
UNL-based machine translation is being developed under the leadership of
Prof. Pushpak Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the
user is not satisfied with the translated sentence generated by the system,
the user can supply new words, phrases and sentences as feedback and obtain
a newly interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical methods. It is used
for the translation of English into Indian languages (Hindi, Marathi and
Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has a small set of transfer rules. Users
can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the
paradigm of foreign machine translation systems such as RUSLAN and CESILKO.
The system architecture consists of three processing modules: pre-processing,
translation engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
Language Search
4.5.12.1 Engine Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query it goes deep into Indian language content and tries to return
the appropriate output. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business,
education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as
Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a
syllable-based conversion model. WikiBhasa is Microsoft's multilingual content
creation tool for translating Wikipedia pages into other languages. The source
language in WikiBhasa is English and the target language can be any Indian
local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development of corpus creation and collection and content syndication. Moreover,
it provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest
and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government for
e-governance programs. It has developed software for multilingual content
creation for publishing newspapers and has also developed high-quality
Unicode-based fonts for major Indian languages. It has specifically developed
the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in
the DTP sector, corporate offices and the e-governance program of the
Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970, the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In
addition, Indian Languages Transliteration (ITRANS), developed by Avinash
Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et
al. 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also working on the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies such as the Ministry of Human Resource,
DRDO (Defense Research and Development Organization), the Department of
Atomic Energy, the All India Council of Technical Education and UGC
(University Grants Commission) are also involved, directly and indirectly, in
research and development of language technology. All these agencies help
develop important areas of research and provide research funds to development
agencies. As an end result, IndoWordNet was developed for the Indian
languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, more than 10-15 Indian local search engines must have
been launched, some smaller and some bigger, some with huge funding and some
with none. This space is so crowded right now that it is difficult to know
who is really winning. However, we attempt to put forth a brief overview of
the current scenario.
Here are some of those that fall in the localized Indian search engine
category: Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP
(Unique Selling Proposition) or other. It is too early to pass judgment on
any of them. These are testing stages, and every start-up is adding new
features and making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced while Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no
documents containing this information exist. In reality, these search engines
face a recall problem while dealing with Indian languages due to multiple
spellings, morphological variants of keywords and English keywords written in
Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for
"world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is
shown to result in 0 documents in Table 4.15; however, a small rephrasing of
the query in Table 4.16 shows that documents containing these keywords do
exist. But just saying that we have a recall problem may not be sufficient.
The next obvious question is how much of a problem it is. In other words, we
need to somehow quantify the problem. For this purpose we conducted many
experiments to determine at what levels this recall problem occurs, and by
how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                        0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम              0        0       0
Table 4.16 Improved recall
Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                             8820     92      12
वलडम टर ड सटय आतॊकीअटक                              331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                     708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                       37100    7400    93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are seldom followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, there are about
28 languages spoken in India, and Hindi, being the National Language of India,
is influenced by the regional languages, which results in changes in dialect
not only in speaking but in writing as well. Every language attaches markers
to a root word to construct new words (English uses s, es and ing; Hindi uses
MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein
(योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना,
"planning" in English). It is desirable to combine all the morphological
variants of a word into a single canonical form. This process is called word
stemming, and the canonical form is called the root word or base word.
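The stemming step described above can be sketched as a small suffix-stripping routine. This is only an illustrative sketch: the rule list `RULES`, the function name `light_stem` and the minimum stem length are assumptions made for the example, not the stemmer used by any of the search engines discussed in this chapter.

```python
# Illustrative (suffix, replacement) rules, longest suffix first.
# The rule list is an assumption for this sketch and is far from complete.
RULES = [
    ("ियों", "ी"),  # e.g. रोगियों -> रोगी
    ("ियाँ", "ी"),
    ("ओं", ""),     # e.g. योजनाओं -> योजना
    ("एं", ""),     # e.g. योजनाएं -> योजना
    ("एँ", ""),
    ("ों", ""),     # e.g. केंद्रों -> केंद्र
    ("ें", ""),
]


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Map a plural or oblique form to a canonical root by suffix stripping."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            # Refuse to strip when almost nothing of the word would remain.
            if len(stem) >= min_stem_len:
                return stem
    return word


if __name__ == "__main__":
    for form in ["योजना", "योजनाओं", "योजनाएं", "रोगियों", "केंद्रों"]:
        print(form, "->", light_stem(form))
```

With rules like these, the variants Yojnaaon and Yojnaayein both reduce to the root Yojnaa, so a query and a document that use different variants can be matched on the same index term.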
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages and multiple dialects, transliteration
of proper names, words borrowed from regional and foreign languages, and the
phonetic variety in the Indian language alphabet. The variety in the alphabet,
different dialects and the influence of regional and foreign languages have
resulted in spelling variations of the same word. For example, the Hindi word
अंग्रेजी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word school can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword for school (सक र) and
for a non-standard phonetically equivalent keyword (सक र), Google shows
69 million results for the former and 14 million for the latter. Hindi is
influenced by other regional languages, which results in phonetic variety of
words; for example, the English word school (सक र in Hindi) is pronounced and
written as ISKOOL (इस्कूल) by much of the population of India in different
states. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are
found. Search engines should be capable of retrieving results for words that
are phonetically equivalent to the keywords entered. A user may use any
keyword for searching, and search engines should support all phonetically
equivalent words.
Also, no particular standard exists for writing the keyword to fetch Hindi web
data. For every phonetically equivalent keyword in the query, the results
vary, i.e. a different set of documents is retrieved with little repetition.
The native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
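One way to make phonetically equivalent spellings retrievable together is to map each keyword to a normalized key before indexing and searching. The sketch below is a hedged illustration only: the function name `phonetic_key` and the three rules it applies (dropping the nukta, treating chandrabindu as anusvara, and writing a half nasal consonant before another consonant as anusvara) are assumptions chosen for the example, and they are deliberately crude.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्

# A half nasal consonant (ङ ञ ण न म + virama) followed by another consonant.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word: str) -> str:
    """Return a crude spelling-insensitive key for a Devanagari word."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = NASAL_CLUSTER.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for variant in ["आन्दोलन", "आंदोलन", "आँदोलन"]:
        print(variant, "->", phonetic_key(variant))
```

Indexing documents and queries by such a key would let the three spellings of the word for "movement" above retrieve the same documents. Real systems need a much richer rule set, for example for short and long vowel confusion, which this sketch does not attempt.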
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for the information
retrieval system. For example, the Hindi word आभूषण (ornament in English) has
the following commonly spoken synonyms: गहना, जेवर, अलंकार.
4.8.4 Ambiguous Words
Ambiguous words reduce the relevancy of the results. The examples
below show this aspect very clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning,
i.e. to sleep. In the context of the above query the word सोना means gold, but
the query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query आम means common; however,
in Hindi it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word
in a given context is an intricate task. One word can have more than one
meaning, and the meaning depends on the context of the sentence. For example,
कर (tax) has the synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स in one context; in
another context कर (hand or arm) has the synonyms हस्त, बाहु, आच शफय; and in
yet another context कर (to do) corresponds to करना.
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways; it has
affected the pronunciation of Hindi words. Many English words have been
localized in India, and some of them appear as if they were native Hindi
words. Indians are sometimes unable to find a Hindi equivalent for an English
word. For instance, words such as road, bus, pen, television, radio, please,
rail, email, password, insurance, internet, director and department are used
even by uneducated Indians without being aware of the language those words
come from. Most Indians use these words in English rather than in their
native language. English influences Hindi not only in speech but also in
writing, and this becomes more evident in Hindi literature on the web. The
influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown
in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from the set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S.No   Query in Hindi                   Meaning in English
1      ब यतवष मवन                        Indian rain forest
1.1    ब यत मवष मवनो                     Indian rain forests
2      हव ईद घमटन क क यण                 Reason for air crash
2.1    हव ईद घमटन ओ क क यण               Reason for air crashes
3      ब यतभफ रीज न व रीब ष              Language spoken in India
3.1    ब यतभफ रीज न व रीब ष ए            Languages spoken in India
3.2    ब यतभफ रीज न व रीब ष ओ            Languages spoken in India
4      षवर पतह न ऩयझ र                   Lake on the verge of extinction
4.1    षवर पतह न ऩयझ र                   Lakes on the verge of extinction
5      पर क ततकआऩद                       Natural calamity
5.1    पर क ततकआऩद ए                     Natural calamities
6      ऩ कीपरज तत                        Bird species
6.1    ऩकषमोकीपरज तत                     Bird's species
6.2    ऩकषमोकीपरज ततम                    Bird's species
7      क षषसभसम                          Agriculture problem
7.1    क षषसभसम ओ                        Agriculture problems
8      कीटन शकक इसत भ र                  Use of pesticide
8.1    कीटन शकोक इसत भ र                 Use of pesticides
9      भ नमसकय ग                         Mental illness
9.1    भ नमसकय चगमो                      Mental patients
9.2    भ नमसकय ग                         Mental Patient
10     गर भ णषवक सम जन                   Policy for village
10.1   गर भ णषवक सम जन ओ                 Policies for village
10.2   गर भ णषवक सम जन ए                 Policies for village
11     परभ खक षषक दर                     Major agricultural office
11.1   परभ खक षषक नदरो                   Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S.No   Root word / morphological variant        Documents returned
                                                Google    Bing     Guruji
1      Root words: ब यत वष मवन
       Listing of keywords (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1    ब यत वष मवन                               50500     4680     485
1.2    ब यत मवष मवनो                             40400     680      61
2      Root word: द घमटन
       Listing of keywords (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
2.1    द घमटन                                    133000    2410     284
2.2    द घमटन ओ                                  117000    420      23
3      Root word: ब ष
       Listing of keywords (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
3.1    ब ष                                       161000    8330     961
3.2    ब ष ए                                     6200      935      188
3.3    ब ष ओ                                     6090      441      356
4      Root word: झ र
       Listing of keywords (Google, Bing, Guruji): झ रझ रो झ र झ र
4.1    झ र                                       4740      278      25
4.2    झ र                                       1270      28       1
5      Root word: आऩद
       Listing of keywords (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
5.1    आऩद                                       102000    4030     410
5.2    आऩद ए                                     1160      64       20
6      Root words: ऩ परज तत
       Listing of keywords (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1    ऩ परज तत                                  48200     1670     98
6.2    ऩकषमोपरज तत                               47600     1150     84
6.3    ऩकषमोपरज ततम                              33800     747      25
7      Root word: सभसम
       Listing of keywords (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1    सभसम                                      584000    30200    1889
7.2    सभसम ओ                                    584000    7150     1356
8      Root word: कीटन शक
       Listing of keywords (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
8.1    कीटन शक                                   36300     1360     333
8.2    कीटन शको                                  35800     800      270
9      Root word: य ग
       Listing of keywords (Google, Bing, Guruji): य गोय ग य ग य ग
9.1    य ग                                       205000    21600    1423
9.2    य चगमो                                    128000    3280     239
9.3    य ग                                       112000    6280     647
10     Root word: म जन
       Listing of keywords (Google, Bing, Guruji): म जन ओ म जन म जन म जन
10.1   म जन                                      673000    18500    3343
10.2   म जन ओ                                    669000    6020     990
10.3   म जन ए                                    673000    2860     416
11     Root word: क दर
       Listing of keywords (Google, Bing, Guruji): क दरीमक दर क दर क दर
11.1   क दर                                      261000    11300    655
11.2   क नदरो                                    29500     1850     105
Table 4.19 Precision values of the three search engines
S.No   Query                             Precision @10
                                         Google   Bing   Guruji
1.1    ब यतवष मवन                         0.5      0.3    0.1
1.2    ब यत मवष मवनो                      0.3      0.1    0.1
2.1    हव ईद घमटन क क यण                  0.5      0.3    0.1
2.2    हव ईद घमटन ओ क क यण                0.3      0.3    0.1
3.1    ब यतभफ रीज न व रीब ष               0.9      0.5    0.5
3.2    ब यतभफ रीज न व रीब ष ए             0.5      0.4    0.3
3.3    ब यतभफ रीज न व रीब ष ओ             0.5      0.2    0.2
4.1    षवर पतह न ऩयझ र                    0.5      0.3    0.2
4.2    षवर पतह न ऩयझ र                    0.5      0.3    0
5.1    पर क ततकआऩद                        1        0.4    0.3
5.2    पर क ततकआऩद ए                      0.6      0.6    0.1
6.1    ऩ कीपरज तत                         1        0.7    0.2
6.2    ऩकषमोकीपरज तत                      0.9      0.6    0.1
6.3    ऩकषमोकीपरज ततम                     0.9      0.6    0.1
7.1    क षषसभसम                           0.7      0.3    0.2
7.2    क षषसभसम ओ                         0.7      0.4    0.2
8.1    कीटन शकक इसत भ र                   1        0.8    0.2
8.2    कीटन शकोक इसत भ र                  0.9      0.6    0.2
9.1    भ नमसकय ग                          0.7      0.6    0.4
9.2    भ नमसकय चगमो                       0.9      0.6    0.3
9.3    भ नमसकय ग                          0.9      0.5    0.4
10.1   गर भ णषवक सम जन                    0.9      0.7    0.3
10.2   गर भ णषवक सम जन ओ                  1        0.6    0
10.3   गर भ णषवक सम जन ए                  0.8      0.6    0.2
11.1   परभ खक षषक दर                      0.5      0.4    0
11.2   परभ खक षषक नदरो                    0.4      0.2    0.2
Figure 4.1 Precision values of the three search engines (precision from 0 to 1 for Google, Bing and Guruji, plotted per query number)
It has been observed that all three search engines return more documents
when a query with the root word is submitted. This justifies searching for
documents with the root word because, in general, we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological
variants of root words, whereas Bing and Guruji list only the root word
supplied, in almost all the sample queries listed in the table above.
From the above results it appears that only Google indexes document
keywords in their root form. Bing and Guruji do not index in that form, which
is why the number of documents they retrieve is lower than Google's. The
overall comparison of results from the three search engines in the tables
above shows that, in general, the quantity of results retrieved increases when
keywords are used in their root form. For search engines, however, the quality
of results is more important than the quantity. Figure 4.1 and Table 4.19
compare the precision values of the three search engines. The precision
value is calculated from the top 10 results of each search engine. On closely
observing the results, we can say that the precision value for Google is high
in almost all queries. Since, as mentioned above, Google indexes keywords in
their root form, the relevancy of its results is also high in comparison with
the other two search engines. This indicates that not only the quantity but
also the quality of results is affected by morphological variations in the
keywords.
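The precision measure used above, computed over the top 10 results, can be written down in a few lines. This is a minimal sketch: the function name `precision_at_k` and the list of manual relevance judgments are hypothetical, introduced only for illustration.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results that were judged relevant.

    `relevance` holds one judgment per returned result, in rank order.
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the denominator stays k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


if __name__ == "__main__":
    # Hypothetical judgments for one query: 5 of the first 10 hits relevant.
    judgments = [True, False, True, True, False,
                 False, True, False, True, False]
    print(precision_at_k(judgments))
```

A value of 0.5 in Table 4.19 therefore corresponds to five relevant documents among the first ten results returned for that query.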
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any
keyword for searching, and search engines should support all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. Table 4.20 shows the number of results and the
precision obtained for each.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Query keywords submitted to Google        No. of results   Precision @10
1  Hindi query (standard keywords in bold): ससजिमो भ िहयीर ऩद थम
   Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
   ससजिमोिहयीर                             97               0.9
   ससजिमो जहयीर                            632              0.9
   सिबजमो िहयीर                            194              0.9
2  Hindi query (standard keywords in bold): आसभान छ त भहॊगाई
   Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
   आसभ न भहॊगाई                            35300            1.0
   आसभ न भहॉगाई                            1040             1.0
   आसभ भह ग ई                              14               0.6
   आसभाॊ भह ग ई                            563              0.7
3  Hindi query (standard keywords in bold): भरषटाचाय स आिादी
   Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
   भरषटाचायआज दी                           211000           0.8
   भरशटाचायआज दी                           214              0.6
   बयषटाचाय आज दी                          447              0.7
   भरषटट च यआिादी                          1090000          0.9
   भरशट च य आिादी                          1040             0.7
   बयषटट च यआिादी                          1190             0.8
4  Hindi query (standard keywords in bold): अननाहिाय क आनदोरन
   Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
   अनन हज य आनदोरन                         84700            0.3
   अनन हज य आॊदोरन                         85100            0.8
   अनन हज य क आॉदोरन                       78               0.6
   अनना हिाय क आनद रन                      399              0.5
   अनन हिाय क आनद रन                       3260000          1.0
5  Hindi query (standard keywords in bold): फयोिगायी सभसम सभ ध न
   Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
   फयोिगायी                                9650             0.9
   फयोिगायी                                80600            1.0
   फयोिगायी                                170              0.7
   फयोिगायी                                30               0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)
In the above table and figure it can be clearly seen that search engines
return a handful of documents for various phonetically equivalent Hindi
queries. It is observed that no particular standard exists for writing the
keyword to fetch Hindi web data. For every phonetically equivalent keyword in
the query, the results vary, i.e. a different set of documents is retrieved
with little repetition. From the precision charts it is clearly observed that
the degree of relevance for queries containing phonetically equivalent
keywords is nearly equal. The native Hindi user may not be aware of the
phonetic issues in Hindi IR and may miss relevant information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for the information
retrieval system. For example, the Hindi word आभूषण (ornament in English) has
the following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values obtained from
the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
S.No   Query                     Documents returned and precision @10
                                 Google    P@10   Bing    P@10   Guruji   P@10
1      Query: स न क आब षण
       Standard Hindi words and synonyms: आब षणगहन
1.1    स न क आब षण               217000    0.8    3250    0.7    381      0.5
1.2    स न क गहन                 188000    0.8    2590    0.6    389      0.5
1.3    स न क ज वय                78900     0.8    1670    0.7    311      0.6
1.4    स न क अर क य              9490      0.5    633     0.4    70       0
1.5    स न क आबयण                493       0.5    38      0.3    1        0
2      Query: क र फ दर
       Standard Hindi words and synonyms: फ दर
2.1    क र फ दर                  233000    0.7    7510    0.7    733      0.3
2.2    क र भ घ                   40700     0.9    1500    0.8    99       0.6
2.3    क र जरधय                  1570      0.6    54      0.6    2        0.2
3      Query: सतर सशिकतकयण
       Standard Hindi words and synonyms: सतर न यी
3.1    सतर सशिकतकयण              9950      0.9    1570    0.7    760      0.6
3.2    न यीसशिकतकयण              29300     0.9    1910    0.9    736      0.4
3.3    भहहर सशिकतकयण             96300     0.9    5160    0.8    1091     0.3
3.4    औयतसशिकतकयण               7670      0.8    680     0.7    510      0.2
4      Query: मसक दयक अ हक य
       Standard Hindi words and synonyms: अ हक य
4.1    मसक दयक अ हक य            1990      0.4    18      0.3    60       0.6
4.2    मसक दयक अमबभ न            2400      0.6    304     0.2    16       0.1
4.3    मसक दयक घभ ड              495       0.5    54      0.6    9        0.1
5      Query: व रग ओ
       Standard Hindi words and synonyms: व
5.1    व रग ओ                    6960      1      698     0.8    29       0.5
5.2    ऩ डरग ओ                   13400     1      1080    0.9    143      0.9
5.3    दयखतरग ओ                  481       0.5    19      0.5    0        0
6      Query: आ खद न
       Standard Hindi words and synonyms: आ ख
6.1    आ खद न                    34000     0.7    3690    0.5    312      0.3
6.2    न तरद न                   77500     1      3240    1      159      0.9
6.3    च द न                     2450      0.8    427     0.1    36       0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together
with their synonyms improves information retrieval for a query in Hindi.
Not only the quantity of documents returned but also their quality is
affected by using synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines, and that Guruji returns the
fewest; the reason may be that fewer documents are available to it or that
its indexing is poor. However, we are more interested in the quality of
results than in the quantity. As far as quality is concerned, it can be
clearly seen that Google and Bing provide better results than Guruji. In the
average case Google still stands first, which means that its precision values
are higher than those of Bing and Guruji. Thus it becomes clear that by
changing a keyword into its synonym, equivalent results can be obtained.
Therefore it is evident that synonyms of keywords play an important role in
Hindi information retrieval.
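The role of synonyms discussed above suggests a simple form of query expansion: submit the original query together with variants in which one keyword is replaced by a synonym. The sketch below is illustrative only. The dictionary `SYNONYMS` is a tiny hypothetical stand-in for a real lexical resource (it reuses the ornament example from this chapter), and `expand_query` is a name introduced for the example.

```python
from typing import Dict, List

# Hypothetical synonym dictionary; a real system would draw on a lexical
# resource rather than a hand-written table.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query plus variants with one keyword replaced."""
    terms = query.split()
    variants = [terms]
    for position, term in enumerate(terms):
        for alternative in synonyms.get(term, []):
            # Replace a single term at a time; no cross-product of synonyms.
            variants.append(terms[:position] + [alternative] + terms[position + 1:])
    return [" ".join(variant) for variant in variants]


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Replacing one term at a time keeps the number of variants linear in the size of the synonym lists; the results for each variant would then be merged and re-ranked, a step this sketch leaves out.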
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly
selected ambiguous queries below. In Table 4.22, the second column contains
the five queries in Hindi, the third column holds the ambiguous keyword in one
context and the fifth column holds the same ambiguous keyword in the other
context. The fourth and sixth columns hold the meaning of the queries in
English with respect to the ambiguous keyword in context.
Table 4.22 List of randomly selected ambiguous queries
S.No  Query                  For keyword as     In English                             For keyword as     In English
1     नारी को सोना पसंद है    सोना (Gold)        Women like gold                        सोना (To sleep)    Women like to sleep
2     आम लोगों की पसंद        आम (Common)        Common man's choice                    आम (Mango)         Mango is people's choice
3     बाल विकास और पोषण      बाल (Children)     Child development and nutrition        बाल (Hair)         Hair development and nutrition
4     सपेरों का फन            फन (Art)           Art of snake charmers                  फन (Snake head)    Snake charmer's snake head
5     युद्ध में कुल विनाश      कुल (Aggregate)    Aggregate destruction in wars          कुल (Family)       Destruction of families in war
The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query  Ambiguous  Documents   Context     Results  Context      Results  Other
       keyword    returned                found                 found    context
1      सोना       50800       Gold        5        To sleep     2        3
2      आम         488000      Common      3        Mango        3        4
3      बाल        2900000     Children    7        Hair         3        0
4      फन         184         Art         0        Snake head   10       0
5      कुल        17800       Aggregate   2        Family       3        5
Table 4.24 Ambiguity test for Bing
Query  Ambiguous  Documents   Context     Results  Context      Results  Other
       keyword    returned                found                 found    context
1      सोना       2680        Gold        2        To sleep     2        6
2      आम         17800       Common      3        Mango        3        4
3      बाल        4030        Children    6        Hair         2        2
4      फन         25          Art         0        Snake head   9        1
5      कुल        1900        Aggregate   0        Family       2        8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in Tables 4.23 to 4.25 it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, non-required context. Since all three search engines return documents in different contexts, it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

The query सपेरों का फन (art of snake charmers) has a second reading (snake charmer's snake head). The retrieved documents are expected to be in the context "art", but Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
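The "Other context" counts can be condensed into one figure per search engine. The following is a minimal sketch, assuming the counts transcribed from Tables 4.23 to 4.25 and taking the mean share of "other context" results as the summary; the names `RESULTS` and `other_context_share` are illustrative and not part of the original study.

```python
# (first context, second context, other context) counts among the first 10
# results, transcribed from Tables 4.23 to 4.25. None marks the Guruji query
# that returned no results.
RESULTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def other_context_share(rows):
    """Mean fraction of inspected results that fall under 'other context'."""
    answered = [row for row in rows if row is not None]
    return sum(row[2] / sum(row) for row in answered) / len(answered)


for engine, rows in RESULTS.items():
    print(f"{engine}: {other_context_share(rows):.2f}")
```

A higher share means more of the inspected results missed both listed readings of the keyword. Queries with no results are skipped rather than counted as zero, which is a design choice of this sketch.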
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: the English word "exercise" is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.
Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
(Hindi forms are listed in the order: Google transliteration, standard Hindi keyword, phonetic equivalents. Google result counts follow the same order.)
English word | Hindi forms | Google result counts
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
From Table 4.26 it can be observed that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, and in large numbers. It can also be seen that people have their own ways of writing such words in Hindi and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of both the precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing the Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.
Table 4.26 (continued)
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query for the two levels is transformed as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore, the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
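The two-level transformation can be sketched in a few lines. This is only an illustration, not the software described in the next chapter: the lookup table `ENGLISH_IN_HINDI`, the function `transform` and the Devanagari spellings of "foreign" and "investment" are assumptions made for the example.

```python
# Hypothetical lookup: Hindi keyword -> English word written in Devanagari.
# The spellings are illustrative assumptions, not data from the study.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",         # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",     # nivesh -> "investment"
}


def transform(query, max_replacements):
    """Replace up to max_replacements known Hindi keywords, left to right."""
    words, replaced = [], 0
    for word in query.split():
        if replaced < max_replacements and word in ENGLISH_IN_HINDI:
            words.append(ENGLISH_IN_HINDI[word])
            replaced += 1
        else:
            words.append(word)
    return " ".join(words)


query = "विदेशी निवेश भारत में"    # videshi nivesh Bharat mein
level1 = transform(query, 1)      # one keyword replaced
level2 = transform(query, 2)      # more than one keyword replaced
print(level1)
print(level2)
```

Words without an entry in the lookup (here भारत and में) pass through unchanged, so the meaning of the query is preserved at both levels.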
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4 Transformed queries into two equivalent senses
[Bar chart of Google result counts for Hindi Queries 1 to 6, each shown for the original query, Level 1 and Level 2.]
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
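Table 4.28 Analysis of precision values (tabular representation)
(Precision is measured over the first 10 results. Each engine column lists the values for the Hindi query, Level 1 and Level 2, in that order.)
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1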
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be seen that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For a Hindi query and its transformed variations, the degree of relevance of the documents is in most cases very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant. A similar pattern holds for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries the user may miss relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that reflect the influence of English. The table also indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
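The precision values discussed here are precision over the first 10 results. A minimal sketch of that measure follows; the function name `precision_at_k` and the relevance judgments in the example are hypothetical and stand in for a manual assessment of each returned document.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that were judged relevant.

    relevance: list of booleans in rank order, True if the result is relevant.
    """
    return sum(relevance[:k]) / k


# Hypothetical judgments for one query: 9 of the first 10 results relevant.
judgments = [True, True, True, False, True, True, True, True, True, True]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned means that a query with fewer than k results is penalized, which is the usual convention for this measure.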
[Inter-precision chart: precision values for Google, Bing and AltaVista plotted for HQ1 to HQ6, each at the original query, L1 and L2 (HQ = Hindi Query, L = Level). The plotted values are those listed in Table 4.28.]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may run into performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].
UTF stands for UCS Transformation Format.
4.3.2 UTF-8
UTF-8 has the benefit that ASCII characters are still represented as a single byte, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. Any document created using the ASCII encoding is a valid UTF-8 document.
Non-ASCII characters are encoded using a variable-length scheme. The original design allowed sequences of 2 to 6 bytes, although the current definition (RFC 3629) limits them to 4 bytes; the most commonly used characters, including the Devanagari letters, need at most three bytes. Non-ASCII characters are encoded as follows.
Each non-ASCII character is encoded as a sequence of several bytes, each of which has its most significant bit set. This means that all bytes representing non-ASCII characters are invalid under the ASCII encoding (since every byte that stores an ASCII character has its most significant bit clear). An application can therefore differentiate between ASCII and non-ASCII characters, and bytes representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence that represents a non-ASCII character indicates how many bytes follow for this character. All further bytes in the multibyte sequence carry the remaining bits of the character [61].
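The byte layout described above can be checked directly. The sketch below assumes the current four-byte limit of UTF-8 and uses the Devanagari letter KA (U+0915) as an example; the helper name `utf8_sequence_length` is illustrative.

```python
def utf8_sequence_length(first_byte):
    """Length of a UTF-8 sequence, read from the high bits of its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: single-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: start of a 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: start of a 4-byte sequence
        return 4
    raise ValueError("continuation byte or invalid first byte")


for ch in ("A", "\u0915"):           # U+0915 is DEVANAGARI LETTER KA
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(f"U+{ord(ch):04X}: {bits} (length {utf8_sequence_length(data[0])})")
```

The ASCII letter occupies one byte with its top bit clear, while the Devanagari letter occupies three bytes, all with the top bit set.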
4.3.3 Unicode and Devanagari
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the
Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages another designation is preferred. In Hindi, for example, the word hal refers to the character itself and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.
The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from, and is an update of, earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G, Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation. The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
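Logical order can be seen by listing the code points of a short word. The sketch below uses Python's standard `unicodedata` module on the word "Hindi" written in Devanagari; the choice of word is an assumption made for illustration.

```python
import unicodedata

# The word "Hindi" in Devanagari, written as explicit code points.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"

for ch in word:
    # Print each stored code point with its Unicode character name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

The vowel sign I (U+093F) is stored after the consonant HA even though it is drawn to the left of it, and the virama (U+094D) after NA suppresses the inherent vowel so that NA and DA form a conjunct. All six code points fall inside the Devanagari block U+0900–U+097F.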
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.
A decade ago there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65,000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font was available to most of them as they upgraded the operating software.
In 2004 the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006 Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name a few, Gmail in five languages, and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site and the language is soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers now have a Web presence. "In the last two years, page views of navbharattimes.com have increased significantly and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, of whom about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good-quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people English is spoken by fewer than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96,157,000 words and a parallel corpus consisting of 200,000 words of text in English, which helps in translation involving Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus and a repository of "One Million Pages" of knowledge-based text. Mahatma Gandhi International University has started the project "Hindi Samghraha" for a repository of a Hindi word database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively young. Early efforts in this field date back to 1980 and involved prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. In the late 1990s many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University, Kolkata. In April 2008 TDIL started a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It is
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 112
a rule-based system and has approximately 1750 rules 54000 lexical words
divided into 46 to 58 paradigms It uses pseudo Interlingua named as PLIL
(Pseudo Lingua for Indian Language) as an intermediate language The
architecture of Anglabharti has six modules Morphological analyzer Parser
Pseudo code generator Sense disambiguator Target text generators and Post-
editor Hindi version of Anglabharti is AnglaHindi which is web based
application which is also available for use at httpanglahindiiitkacin To
develop automated translator system for regional languages Anglabharti
architecture has been adopted by various Indian institutes for example IIT
Guwahati
4.5.3 Anubharti
Prof RMK Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully-automatic, general-purpose,
high-quality machine translation (FGH-MT) system. Its software can
translate text from one Indian language into another, based on
Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) was initiated by the Indian
Government in 1996 for translating government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network and web-based forms. It uses the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent both English and
Hindi grammar. Initially it was domain-specific, covering Personal
Administration (specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars); gradually the domains were expanded. At present
it also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra supported only English-to-Hindi translation, but it is now
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha
Secretariat (the Rajya Sabha is the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. UNL-based
machine translation is being developed under the leadership of Prof Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence that the
system generates, the user can supply new words, phrases and
sentences as feedback and obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti is a rule-based system that also uses a statistical approach. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K B Chandrasekhar Research Centre of Anna University, Chennai has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre Processing,
Translation Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms. For
any query, it searches deep into Indian language content and tries to return
appropriate results. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player that supports the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development of corpus creation/collection and content syndication. Moreover, it
provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and to the central government for e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers, as well as high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gurjrati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this. Since 1970, the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In
addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et
al., 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also working towards the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council for Technical Education and
UGC (University Grants Commission), are also involved, directly and indirectly,
in research and development of language technology. All these agencies help
develop important areas of research and provide research funds to development
agencies. As a result, IndoWordNet was developed for Indian languages on the
pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The India-centric localized search engine market is saturating fast, real fast. In
the last year alone, more than 10-15 Indian local search engines must have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt a brief overview of the current scenario.
Here are some that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
| Websites | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 37 | 35 |
| Yahoo | 32 | 25 |
| Rediff | 7 | 4 |
| Orkut | 6 | 7 |
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 81 | 76 |
| Yahoo | 7 | 7 |
| Wikipedia | 3 | 6 |
| English | 3 | 4 |
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 65 | 34 |
| Yahoo | 12 | 29 |
| Rediff | 4 | 15 |
| Teluguone | 2 | 0 |
| Guruji | 1 | 18 |
| Raftaar | 1 | 0.2 |
| Hindi | 1 | NA |
| Webdunia | 1 | 0.7 |
| Khoj | 0.7 | 2 |
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when information is accessed using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem with Indian languages because of multiple spellings,
morphological variants of keywords, and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, the Hindi query "world trade center
aatank-waadi hamlaa" (the first row of Table 4.15) is shown to return
0 documents; however, a small rephrasing of the query in Table 4.16
shows that these keywords exist in the second search result. But just saying that we
have a recall problem is not sufficient. The next obvious question is
'how much of a problem is it?' In other words, we need to
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs, and by how much.
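To make "quantifying the problem" concrete, recall can be computed as the fraction of known relevant documents that a query actually retrieves. The following is a minimal sketch only: the function name `recall` and the document identifiers are hypothetical, and it assumes a set of relevant documents has already been judged by hand, which the text does not describe.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without any relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical document identifiers, for illustration only.
relevant_docs = {"d1", "d2", "d3", "d4"}

# A query that returns no results has zero recall even though
# relevant documents exist.
print(recall([], relevant_docs))

# A rephrased query that finds two of the four relevant documents.
print(recall(["d2", "d4", "d9"], relevant_docs))
```

A query returning 0 results, as in Table 4.15, therefore scores a recall of 0 whenever at least one relevant document exists on the web.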
Table 4.15 Problems faced while searching in Hindi: Low recall
| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0 |
| इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0 |
Table 4.16 Improved Recall
| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12 |
| वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1 |
| इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1 |
| बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93 |
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but grammatical and
structural standards are often not followed, for various reasons. One
reason is the linguistic diversity of India. Including Hindi, about 28
languages are spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which leads to dialectal variation not only in
speech but also in writing. Every language attaches markers to a root word to
construct new words: English uses s, es and ing, while Hindi uses MAATRAAS
(vowel signs) such as ओं and एँ. For example, Yojnaaon (योजनाओं) and
Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa
(योजना, 'planning' in English). It is desirable to reduce all the
morphological variants of a word to a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
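The stemming step described above can be sketched as a small suffix-stripping routine. This is a minimal illustration under stated assumptions, not the method used by any of the search engines discussed: the rule list `SUFFIX_RULES`, the function name `stem` and the minimum stem length are all invented for this example, and a practical Hindi stemmer needs a far larger rule set.

```python
import unicodedata

# Illustrative (suffix, replacement) rules, longest suffix first.
# Code points are spelled out so the rules do not depend on font rendering.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -iyon  -> -ii (e.g. plural oblique)
    ("\u093f\u092f\u093e\u0901", "\u0940"),  # -iyaan -> -ii
    ("\u0913\u0902", ""),                    # -on written with the independent vowel
    ("\u090f\u0901", ""),                    # -en with chandrabindu
    ("\u090f\u0902", ""),                    # -en with anusvara
    ("\u094b\u0902", ""),                    # -on written as a matra
    ("\u0947\u0902", ""),                    # -en written as a matra
]


def stem(word, min_stem_len=2):
    """Strip one known plural/oblique suffix, keeping at least min_stem_len code points."""
    word = unicodedata.normalize("NFC", word)
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


for variant in ["योजना", "योजनाओं", "योजनाएँ"]:
    print(variant, "->", stem(variant))
```

With these rules, the three forms of Yojnaa given in the text reduce to the same root, so a query in any one form could be matched against documents containing the others.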
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages, multiple dialects, transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabet. The variety in the alphabet,
the different dialects and the influence of regional and foreign languages have
resulted in spelling variations of the same word. For example, the Hindi word
अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word 'school' in Hindi can be written in different ways (स्कूल and its
non-standard phonetic equivalents).
When information is searched for the standard keyword for 'school' and for a
non-standard Hindi phonetic equivalent, Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by
the other regional languages, which results in phonetic variety of words. For
example, the English word 'school' (स्कूल in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by much of the population of India in different states. For the
Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search
engines should be capable of retrieving results for words that are phonetically
equivalent to the keywords entered. A user may use any keyword for searching,
and search engines should support all phonetically equivalent
words.
Also, no particular standard exists for writing the keywords used to fetch Hindi web
data. For every phonetically equivalent keyword in a query, the results vary;
that is, a different set of documents is retrieved, with little repetition.
A native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information of his/her use.
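One way a retrieval system could treat phonetically equivalent spellings alike is to map each keyword to a normalized key before matching. The sketch below is an assumption-laden illustration, not a description of any search engine named in this chapter: the function name `phonetic_key` and its three rules (drop the nukta, treat chandrabindu as anusvara, write a nasal consonant before a stop as anusvara) are chosen for this example only.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"

# Stop consonants before which a half nasal is commonly written as anusvara.
STOPS = "कखगघचछजझटठडढतथदधपफबभ"
NASAL_CLUSTER = re.compile("[ङञणनम]" + VIRAMA + "(?=[" + STOPS + "])")


def phonetic_key(word):
    """Return a normalized form shared by common spelling variants of a word."""
    word = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    word = word.replace(NUKTA, "")               # e.g. za is matched like ja
    word = word.replace(CHANDRABINDU, ANUSVARA)  # one nasalization mark only
    word = NASAL_CLUSTER.sub(ANUSVARA, word)     # half nasal + stop -> anusvara + stop
    return unicodedata.normalize("NFC", word)


pairs = [("महँगाई", "महंगाई"), ("आन्दोलन", "आंदोलन")]
for a, b in pairs:
    print(a, b, phonetic_key(a) == phonetic_key(b))
```

These rules cover only marks that alternate in writing; variants that add or drop a whole syllable, such as the ISKOOL form of 'school' mentioned above, would need additional handling.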
4.8.3 Words Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण ('ornament' in English) has the following
commonly spoken synonyms: गहना, जेवर, अलंकार.
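A retrieval system could compensate for this by expanding a query with synonyms of its keywords. The following is a rough sketch only: the dictionary `SYNONYMS` and the function `expand_query` are hypothetical, the synonym spellings are reconstructed for illustration, and substituted words are not re-inflected to agree with the rest of the query.

```python
# Hypothetical synonym dictionary holding the example word from the text.
SYNONYMS = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per known synonym."""
    tokens = query.split()
    variants = [tokens]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variants.append(tokens[:i] + [alternative] + tokens[i + 1:])
    return [" ".join(variant) for variant in variants]


for variant in expand_query("सोने के आभूषण"):
    print(variant)
```

Each variant can then be submitted separately and the result lists merged, so that a document using any one of the synonyms has a chance of being retrieved.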
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples
below show this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to
sleep. In the context of the above query the word सोना means gold, but the query
can also be interpreted as 'women like to sleep'.
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means common;
however, in Hindi it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in a
given context is an intricate task. One word can have more than one meaning, and
the meaning depends on the context of the sentence. For example, कर (tax) has the
synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स in one context; in another context कर
(hand or arm) has the synonyms हस्त, बाँह and आच शफय; and in yet another context
कर (to do) corresponds to करना.
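The idea of choosing a sense from context can be illustrated with a very small overlap-based heuristic, loosely in the spirit of the Lesk approach to word sense disambiguation. Everything here is an assumption made for illustration: the cue-word lists in `SENSES` are invented, and the function `disambiguate` is not taken from any system described in this chapter.

```python
# Hypothetical cue words for each sense of an ambiguous keyword.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहने", "चांदी", "भाव", "कीमत"},
        "to sleep": {"रात", "नींद", "बिस्तर", "जागना"},
    }
}


def disambiguate(keyword, context_tokens, senses=SENSES):
    """Pick the sense whose cue words overlap most with the context, or None."""
    context = set(context_tokens)
    scores = {sense: len(cues & context) for sense, cues in senses[keyword].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# The query from the text carries no cue word, so it stays ambiguous.
print(disambiguate("सोना", "नारी को सोना पसंद है".split()))

# A query with supporting context resolves to one sense.
print(disambiguate("सोना", ["सोना", "चांदी", "भाव"]))
```

The first call returns no sense at all, which mirrors the point made above: a short query such as 'women like gold' gives a search engine nothing to decide between the two meanings.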
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways; it has
affected the pronunciation of Hindi words, and many English words have been
localized in India. Some of these words appear as if they were native Hindi words,
and Indians are sometimes unable to recall the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
Indians with no formal education, without any awareness of the language these
words come from. Most Indians use these words in English rather than in their
native language. English influences Hindi not only in speech but in writing too.
This becomes especially evident in Hindi literature on the web.
The influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from this set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
| S No | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Bird's species |
| 6.2 | ऩकषमोकीपरज ततम | Bird's species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |
Table 4.18 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 4.19 Precision values of the three search engines
| S No | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |
Figure 4.1 Precision values of the three search engines (chart of the Google, Bing and Guruji precision values, 0 to 1, for queries 1.1 to 11.1)
It has been observed that all three search engines return more documents
when the query is submitted with the keyword in its root form. This supports
searching with root words, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological
variants of the root word, whereas Bing and Guruji list only the root word
as supplied, in almost all the sample queries in the table above.
These results suggest that only Google indexes document
keywords in their root form. Bing and Guruji do not appear to index in that form,
which would explain why they retrieve fewer documents than
Google. The overall comparison of the three search engines in the tables
above shows that, in general, the quantity of results increases when
keywords are used in their root form. For search engines, however, the quality of
results matters more than the quantity. Figure 4.1 and Table 4.19 compare
the precision values of the three search engines. Precision
is calculated over the top 10 results of each search engine. On closer
observation, the precision of Google is high for
almost all queries. Since, as noted above, Google appears to index keywords in
their root form, the relevance of its results is also higher
than that of the other two search engines. This indicates that morphological
variation in keywords affects not only the quantity but also the quality of
results.
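The precision measure used here, computed over the top 10 results, can be sketched as follows. This is an illustrative reading of the text rather than the exact procedure followed in the study: the function name `precision_at_k` is hypothetical, the relevance judgments are made up, and dividing by k even when fewer than k results are returned is an assumption.

```python
def precision_at_k(relevance_judgments, k=10):
    """Share of the top k results judged relevant (True means relevant)."""
    top = relevance_judgments[:k]
    if not top:
        return 0.0
    # Assumption: missing results count as non-relevant, so divide by k.
    return sum(1 for judged_relevant in top if judged_relevant) / k


# Made-up judgments for the first ten results of one query.
judgments = [True, True, False, True, False, True, False, False, True, False]
print(precision_at_k(judgments))
```

Five relevant results among the top ten give a precision of 0.5, which is the scale on which the values in Table 4.19 are reported.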
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any
keyword for searching, and search engines should support all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. The table below shows the results and precision offered by
the search engine.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart per query, Query No 1 to Query No 5)
The table and figure above show that the search engine returns only a
handful of documents for some phonetically equivalent Hindi queries. No
particular standard exists for writing the keywords used to fetch Hindi
web data. For every phonetically equivalent keyword in a query, the results
vary; that is, a different set of documents is retrieved, with little
repetition. The precision charts show that the degree of
relevance for queries containing phonetically equivalent keywords is nearly
equal. A native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information of his/her use.
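The observation that phonetically equivalent queries retrieve different documents "with little repetition" can be quantified by measuring the overlap between two result lists. The sketch below uses the Jaccard coefficient as one possible measure; this choice, the function name `jaccard` and the result identifiers are assumptions for illustration and are not taken from the study.

```python
def jaccard(results_a, results_b):
    """Overlap between two result lists: shared items divided by all distinct items."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 1.0  # two empty result lists are trivially identical
    return len(a & b) / len(a | b)


# Made-up result identifiers for two spellings of the same keyword.
results_standard = ["u1", "u2", "u3"]
results_variant = ["u3", "u4", "u5"]
print(jaccard(results_standard, results_variant))
```

A value near 0 means the two spellings lead to almost disjoint sets of documents, which is the situation in which a user who knows only one spelling misses relevant material.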
4.11 Discussion: Words Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण ('ornament' in English) has the
following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below show the
comparison of precision values for the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
S NO Query Standard
Hindi
Words
Synonyms Documents Returned
Google Per
10
Bing Per
10
Guruj
i
Per
10
1 स न क आब
षण
आब षणगहन
11 स न क आब षण 217000 08 3250 07 381 05
12 स न क गहन 188000 08 2590 06 389 05
13 स न क ज वय 78900 08 1670 07 311 06
14 स न क अर क य 9490 05 633 04 70 0
15 स न क आबयण 493 05 38 03 1 0
2 क र फ दर फ दर
21 क र फ दर 233000 07 7510 07 733 03
22 क र भ घ 40700 09 1500 08 99 06
23 क र जरधय 1570 06 54 06 2 02
3 सतर सशिकतकयण
सतर न यी
31 सतर सशिकतकयण
9950 09 1570 07 760 06
32 न यीसशिकतकयण
29300 09 1910 09 736 04
33 भहहर सशिकतकयण
96300 09 5160 08 1091 03
34 औयतसशिकतकयण
7670 08 680 07 510 02
4 मसक दयक अ हक य
अ हक य
41 मसक दयक अ हक य
1990 04 18 03 60 06
42 मसक दयक अमबभ न
2400 06 304 02 16 01
43 मसक दयक घभ ड
495 05 54 06 9 01
5 व रग ओ व
51 व रग ओ 6960 1 698 08 29 05
52 ऩ डरग ओ 13400 1 1080 09 143 09
53 दयखतरग ओ 481 05 19 05 0 0
6 आ खद न आ ख
61 आ खद न 34000 07 3690 05 312 03
62 न तरद न 77500 1 3240 1 159 09
63 च द न 2450 08 427 01 36 01
Figure 4.3 Comparison of precision values against three search engines
The examples above show that using Hindi keywords together with their
synonyms improves information retrieval for a query in Hindi.
Using synonyms of Hindi keywords affects not only the quantity of documents
returned but also their quality.
The table and figure above show that Google returns more documents than the
other two search engines and that Guruji returns the fewest; the reason may be
that fewer documents are available to it, or poor indexing. However, we are more
interested in the quality of results than in their quantity. As far as quality is
concerned, Google and Bing clearly provide better results than Guruji. In the
average case Google still ranks first; that is, its precision values are
higher than those of Bing and Guruji. Thus it becomes clear
that by changing a keyword into its synonym, equivalent results can be obtained.
Synonyms of keywords therefore play an important role in
Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly
selected queries below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the fifth
column holds the same ambiguous keyword in another context. The fourth and sixth
columns give the meaning of the query in English for the respective sense of the
ambiguous keyword.
Table 4.22 List of randomly selected ambiguous queries

S.No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query | Ambiguous keyword | Documents returned (Google) | Context A | Results found | Context B | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing

Query | Ambiguous keyword | Documents returned (Bing) | Context A | Results found | Context B | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji

Query | Ambiguous keyword | Documents returned (Guruji) | Context A | Results found | Context B | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In these tables, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, unintended context. The results make it clear that all the search engines return documents in different contexts; it can therefore be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unintended context (snake head), Bing returns 9, and Guruji fails to retrieve even a single document. In such a scenario, it becomes important for search engines to address ambiguity in keywords in order to obtain better results.
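The deviation from relevance described above can be summarized per search engine. The following is a minimal sketch, not part of the original study: it takes the "Other context" counts from Tables 4.23 to 4.25, assumes that each count is out of the first 10 results, and skips the one Guruji query that returned no results. The names `other_context` and `mean_deviation` are illustrative.

```python
# "Other context" counts per query (queries 1-5), taken from Tables 4.23-4.25.
# None marks the Guruji query for which no results were found.
TOP_K = 10  # assumption: counts are out of the first 10 results

other_context = {
    "Google": [3, 4, 0, 0, 5],
    "Bing": [6, 4, 2, 1, 8],
    "Guruji": [10, 7, 3, None, 10],
}


def mean_deviation(counts, top_k=TOP_K):
    """Average share of the top-k results that fall outside both listed contexts."""
    valid = [c for c in counts if c is not None]
    return sum(valid) / (top_k * len(valid))


deviation = {engine: mean_deviation(c) for engine, c in other_context.items()}

for engine, value in deviation.items():
    print(f"{engine}: {value:.2f}")
```

On these figures the average share of results in an unintended context is lowest for Google and highest for Guruji, which matches the ordering discussed in the text.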
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word exercise is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi

English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Documents returned by Google (same order)
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
From the above table, it can be observed that the search engine returns documents not only for the standard single-keyword query but also for all the phonetic variants of the keyword, and in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the Google transliteration column (set in bold in the original) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be inferred that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
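To make the effect of spelling variation concrete, the share of documents that a user would not reach by searching only the standard Hindi keyword can be estimated from the counts in Table 4.26. This is a rough sketch under a stated assumption: the document sets for the four spellings are treated as disjoint, so the shares are upper bounds. Only three rows are used, and the names `counts` and `share_outside_standard` are illustrative.

```python
# Google document counts from Table 4.26, in column order:
# [Google transliteration, standard Hindi keyword, phonetic variant 1, phonetic variant 2]
counts = {
    "Woman": [42200, 101000, 3520, 34800],
    "University": [1540, 1070000, 1420, 3270],
    "Expert": [269000, 8730, 182, 8],
}

STANDARD = 1  # index of the standard Hindi keyword column


def share_outside_standard(row):
    """Fraction of all returned documents that belong to non-standard spellings.

    Assumes the result sets of the spellings do not overlap (an upper bound).
    """
    total = sum(row)
    return (total - row[STANDARD]) / total


shares = {word: share_outside_standard(row) for word, row in counts.items()}

for word, share in shares.items():
    print(f"{word}: {share:.1%} of documents use a non-standard spelling")
```

For a word such as University almost all documents use the standard spelling, whereas for Expert most documents are indexed under the transliterated form, so a single-spelling query misses most of them.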
In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of precision and the quantity of documents retrieved. The Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.
Table 4.26 (continued)
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by the English equivalents, without changing the meaning of the original Hindi query.

Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent variants containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
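The two-level transformation can be sketched as follows. This is only an illustration of the idea, not the software described in the next chapter: it works on the romanized form of the example query, and the lexicon `english_equivalent` and the function `transform` are hypothetical. In practice the English replacements would be written in Devanagari script.

```python
from itertools import combinations

# Hypothetical lexicon: Hindi keyword (romanized) -> English equivalent
english_equivalent = {"videshi": "foreign", "nivesh": "investment"}


def transform(query, lexicon, n_replacements):
    """Return every variant of the query with exactly n keywords replaced."""
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok in lexicon]
    variants = []
    for positions in combinations(replaceable, n_replacements):
        new_tokens = list(tokens)
        for i in positions:
            new_tokens[i] = lexicon[new_tokens[i]]
        variants.append(" ".join(new_tokens))
    return variants


query = "videshi nivesh bharat mein"
level1 = transform(query, english_equivalent, 1)  # one word replaced
level2 = transform(query, english_equivalent, 2)  # more than one word replaced

print(level1)
print(level2)
```

Level 1 yields one variant per replaceable keyword, including "foreign nivesh bharat mein" from the example; level 2 yields "foreign investment bharat mein".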
Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

In English | Hindi query | Level 1 | Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4 Transformed queries into two equivalent senses (result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants)
From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.
Table 4.28 Analysis of precision values (tabular representation); precision is measured over the first 10 results

Hindi query | Level 1 | Level 2 | Google (Hindi query, Level 1, Level 2) | Bing (Hindi query, Level 1, Level 2) | AltaVista (Hindi query, Level 1, Level 2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter-precision values (HQ: Hindi query; L: level; series: Google, Bing, AltaVista)
From the above tables and figures, it can clearly be seen that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are likewise relevant; the same pattern holds for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variants that reflect English influence. From the above table, it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
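The precision figures can be averaged per search engine and per transformation level to compare them at a glance. The sketch below is an illustrative calculation, not part of the original analysis: it uses the precision values over the first 10 results for the six queries in Table 4.28, and the names `precision_at_10` and `mean_by_level` are assumptions.

```python
# Precision over the first 10 results, one tuple per query:
# (original Hindi query, Level 1 variant, Level 2 variant), from Table 4.28.
precision_at_10 = {
    "Google": [(0.9, 0.9, 0.9), (0.8, 0.9, 0.7), (0.9, 0.8, 1.0),
               (1.0, 0.9, 0.8), (0.9, 0.8, 0.8), (1.0, 1.0, 1.0)],
    "Bing": [(0.9, 0.8, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
             (0.8, 0.8, 0.8), (0.9, 0.8, 0.4), (1.0, 1.0, 1.0)],
    "AltaVista": [(0.9, 0.9, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
                  (0.6, 0.7, 0.8), (0.9, 0.9, 0.4), (1.0, 1.0, 1.0)],
}


def mean_by_level(rows):
    """Mean precision for the original query, Level 1 and Level 2, rounded to 3 places."""
    n = len(rows)
    return tuple(round(sum(row[i] for row in rows) / n, 3) for i in range(3))


means = {engine: mean_by_level(rows) for engine, rows in precision_at_10.items()}

for engine, (hq, l1, l2) in means.items():
    print(f"{engine}: Hindi query {hq}, Level 1 {l1}, Level 2 {l2}")
```

Averaged over these six queries, the mean precision of the transformed variants stays within about 0.15 of the original queries for all three engines, although individual Level 2 variants drop as low as 0.4.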
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the actual Hindi writing standard, which widens the gap between Hindi web data and users.

Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
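An interface-level approach of this kind can be pictured as a thin layer that sends each query variant to an unmodified search engine and merges the results. The following is a minimal sketch under stated assumptions, not the tool described in the next chapter: `search_with_variants` is a hypothetical helper, and `fake_search` with its `FAKE_INDEX` is a stand-in for a real search engine call.

```python
from typing import Callable, Iterable, List


def search_with_variants(variants: Iterable[str],
                         search: Callable[[str], List[str]],
                         limit: int = 10) -> List[str]:
    """Run every query variant and merge the results, dropping duplicates.

    Results keep the order in which they are first seen, so results of
    earlier variants come first.
    """
    seen = set()
    merged = []
    for query in variants:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:limit]


# Hypothetical stand-in for a search engine: query -> ranked document ids
FAKE_INDEX = {
    "videshi nivesh bharat mein": ["doc-a", "doc-b"],
    "foreign nivesh bharat mein": ["doc-b", "doc-c"],
    "foreign investment bharat mein": ["doc-d"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


results = search_with_variants(FAKE_INDEX.keys(), fake_search)
print(results)
```

Because the expansion happens outside the search engine, the engine's index and ranking are left untouched, which is the point of solving the problem at the interface level.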
Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages, another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature. This northern branch includes such modern scripts as Bengali, Gurmukhi and Tibetan; the southern branch includes scripts such as Malayalam and Tamil.

The major official scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts and makes use of a virama. Sinhala has a virama-based model but is not structurally mapped to ISCII. Tibetan stands apart, using a subjoined consonant model for conjoined consonants, reflecting its somewhat different structure and usage. The Limbu script makes use of an explicit encoding of syllable-final consonants. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
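The role of the virama as a combining character can be inspected with Python's standard `unicodedata` module. This is a small illustrative sketch; the conjunct क्ष (KA + virama + SSA) is chosen only as an example.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (halant)
KA = "\u0915"      # Devanagari letter KA
SSA = "\u0937"     # Devanagari letter SSA

name = unicodedata.name(VIRAMA)              # official character name
category = unicodedata.category(VIRAMA)      # general category (nonspacing mark)
combining_class = unicodedata.combining(VIRAMA)

# A conjunct is stored as consonant + virama + consonant. It is usually
# rendered as a single ligature but remains three code points in memory.
conjunct = KA + VIRAMA + SSA

print(name, category, combining_class)
print(conjunct, len(conjunct))
```

For search, this means that two strings that look alike on screen may still differ in their underlying code point sequence, so comparisons should be made on the stored sequence.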
4.3.4 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa and Santali.

All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
4.3.4.1 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.

In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire, except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G (Extended Character Set for Vedic). Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information.
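The parallel code layout means that a character at a given offset in the Devanagari block has its analogue at the same offset in the blocks of the other eight scripts. The sketch below illustrates this with a simple offset shift; it is a rough illustration rather than a transliteration scheme. The helper `to_script` is hypothetical, the block start values are the Unicode block boundaries, and positions that are unassigned in the target script are left unchanged.

```python
import unicodedata

# First code point of each Unicode block that follows the ISCII-based layout
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def to_script(text, target, source="devanagari"):
    """Map each character to the same relative position in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # Keep the original character if the target position is unassigned
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


ka_bengali = to_script("\u0915", "bengali")  # Devanagari KA -> Bengali KA
ka_tamil = to_script("\u0915", "tamil")      # Devanagari KA -> Tamil KA
kha_tamil = to_script("\u0916", "tamil")     # Tamil has no KHA at this position

print(ka_bengali, ka_tamil, kha_tamil)
```

The Tamil case shows why the mapping is only partial: not every position is filled in every script, which is one reason the layout facilitates transliteration only to some degree.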
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.

The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order [62].
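The character types and the logical storage order can be illustrated on the word हिन्दी (Hindi). This is a simplified sketch: the function `char_type` is hypothetical and covers only the core ranges of the Devanagari block (independent vowels U+0904 to U+0914, consonants U+0915 to U+0939, dependent vowel signs U+093E to U+094C, and the virama U+094D), not the later extensions.

```python
def char_type(ch):
    """Classify a character by its position in the core Devanagari ranges."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    if cp == 0x094D:
        return "virama"
    return "other"


# The word "Hindi" in Devanagari: HA, vowel sign I, NA, virama, DA, vowel sign II
word = "\u0939\u093F\u0928\u094D\u0926\u0940"
types = [char_type(ch) for ch in word]

for ch, kind in zip(word, types):
    print(f"U+{ord(ch):04X} {kind}")
```

The vowel sign I (U+093F) is drawn to the left of HA on screen but is stored after it, which is what storing characters in logical (phonetic) order means in practice.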
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led millions of non-English-speaking Indians to discover uses of the Internet in their daily lives. They are sending and receiving e-mails, searching for information, reading e-papers, blogging and launching Web sites in their own languages. Two American IT companies, Microsoft and Google, have played a big role in making this possible.
A decade ago, there were many problems involved in using Indian languages on the Internet. "There was a mismatch of fonts and keyboard layouts, which made it impossible to read any Hindi document if the user did not have the same fonts. There was chaos: more than 50 fonts and 20 keyboards were being used, and if two users were following different styles, there was no way to read the other person's documents." But the advent of Unicode support for Hindi and Urdu changed all that. The new character encoding from the Unicode Consortium, a nonprofit in California whose members include Google, IBM, Oracle, Microsoft, Sun Microsystems, Yahoo and the Government of India, proved to be a boon for Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its operating system in 2001. "Since then, Hindi Unicode support has been a part of all subsequent upgrades of Microsoft's operating systems. Also, providing Input Method Editor facilities gives users the option to use different types of keyboards," says Meghashyam Karanam, product manager, vision and localization, at Microsoft India. The earlier system could incorporate only 127 characters, which is not enough for the Hindi Devanagari script. The Unicode system can incorporate up to 65000 characters. As most computers in India use Microsoft's operating system, this ensured that the Hindi font became available to most of them as they upgraded the operating software.
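The difference between a 7-bit character set and Unicode can be shown in a few lines. This is a small illustrative sketch using Python's built-in codecs; the word हिन्दी is chosen only as an example.

```python
# The word "Hindi" in Devanagari, written with escapes for clarity
word = "\u0939\u093F\u0928\u094D\u0926\u0940"

# A 7-bit character set such as ASCII has no positions for Devanagari
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# In UTF-8 each Devanagari code point in U+0900..U+097F takes three bytes
utf8_bytes = word.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_ok, len(word), len(utf8_bytes), round_trip == word)
```

Because the byte sequence identifies the characters themselves rather than glyph positions in a particular font, the text can be read back on any system with Unicode support, whichever font is installed.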
In 2004, the Hindi version of Microsoft Office 2003, which included Word, Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft Office 2007 is also available. "It includes Hindi language interface packs that allow users to create documents and communicate with others in Hindi. Users can also navigate using the menus and toolbars that are in Hindi. We have received a very good response from the Hindi users," says Karanam. Urdu language support is available in Windows Vista and Office 2007. Another Microsoft initiative is Project Bhasha, which was launched in 2003 and now provides support to 13 Indian languages, such as Hindi, Tamil, Kannada, Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington State, partnered with one of the early Hindi portals, webdunia.com, to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi version of Microsoft Office as well as for language interface packs," says Jaideep Karnik, general manager for content and localization at webdunia.com. The Indore, Madhya Pradesh-based company has an office in the United States and helps major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the superstructure. Realizing the potential of Indian languages, the California-based company has launched various products in the past two years. With the Google Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages available on the Internet, including those that are not in a Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu, to name a few; Gmail in five languages; and Google transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu. Urdu is the most recent language that Google has added to its offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use the search function, "users can type Hindi words in Roman script and a drop-down menu suggests several Hindi phrases. By selecting the appropriate query, users can search for Hindi content without even typing in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English users. Google News is available in Hindi. With the Google translation engine, one can type English words and get a list of suggested synonyms in Hindi. A transliteration tool allows users to type any word in English, hit the space bar and get the same word in a different language. Roy-Chowdhury explains the process of adding a new language.
"Google offers products first in Google Labs and waits for feedback from users for a couple of months. Then the feedback is collated and the product is updated before introducing the language with its other offerings like Gmail, Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently available in Google's transliteration offering on the Google Labs Web site, and the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult.

IT experts agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi because, in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first Indian language corpus for Indian English; it was developed under the leadership of Prof. S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop corpora of more than 45 million words in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering (EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus containing approximately 96157000 words and a parallel corpus consisting of 200000 words of text in English, which helps in the translation of Bengali, Hindi, Punjabi and other languages.

C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus that forms a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' for a repository of a Hindi words database and dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively young. Early efforts in this field have been noted since 1980, involving prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University, Kolkata. Since April 2008, TDIL has run a consortium-mode project for building computational tools and a Sanskrit-Hindi MT system under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology, from English to Indian languages, under the leadership of Prof. R.M.K. Sinha. It is a rule-based system with approximately 1750 rules and 54000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo-code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application that is available for use at http://anglahindi.iitk.ac.in. To develop automated translation systems for regional languages, the Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is
based on a hybridized example-based approach. The second phase of both
projects (Anglabharti II and Anubharti II) started in 2004 with new
approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development
project for Indian languages and English undertaken by CIF (Chinmaya
International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation (FGH-MT) system. Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of
Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit
Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for translating government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra
software is available in desktop, network and web-based forms. It is based
on the Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to
represent the English as well as the Hindi grammar. Initially it was domain
specific, covering Personnel Administration (specifically Gazette
Notifications, Office Orders, Office Memorandums and Circulars); gradually
the domains were expanded. At present it also covers domains such as
Banking, Transportation and Agriculture. Earlier, Mantra technology
supported only English to Hindi translation, but it is now also available
for English to other Indian languages such as Gujarati, Bengali and Telugu.
MANTRA-Rajyasabha is a system for translating parliamentary proceedings such
as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II,
List of Business [LOB] and Synopsis. The Secretariat of the Rajya Sabha (the
upper house of the Parliament of India) provides funds for updating the
MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine
translation system for English to Hindi. UNL is a United Nations project for
developing an interlingua for the world's languages. UNL-based machine
translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad University
has developed an English-Kannada MT system. It is based on the transfer
approach and Universal Clause Structure Grammar (UCSG). The project is
funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the
user is not satisfied with the translated sentence generated by the system,
the user can supply new words, phrases and sentences as feedback and obtain
a newly interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with a statistical approach. It is
used for translating English into Indian languages (Hindi, Marathi and
Telugu). Users can access the Shakti MT system at http://shakti.iiit.net
[24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai has
developed a machine-aided Tamil to Hindi translation system. The system is
based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has a small set of transfer rules. Users can
access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
the processing of Unicode-encoded text for text manipulation. The system
uses the Penn Treebank annotation system for part-of-speech tagging. It
translates an English sentence into Unicode-based Bengali text. Users can
access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl
4.5.11 Punjabi to Hindi Machine Translation System
In 2007 Josan and Lehal at Punjabi University, Patiala designed a Punjabi to
Hindi machine translation system. The system is built on the paradigm of
foreign machine translation systems such as RUSLAN and CESILKO. The system
architecture consists of three processing modules: Pre-Processing,
Translation Engine and Post-Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two IIT
Delhi graduates, Anurag Dod and Gaurav Mishra, with the assistance of
Sequoia Capital. Guruji.com uses crawling technology based on proprietary
algorithms. For any query it digs deep into Indian language content and
tries to return the appropriate output. The Guruji search engine covers a
range of specific content: news, entertainment, travel, astrology,
literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such
as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides automated translation from English to Indian
languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as
Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a
syllable-based conversion model. WikiBhasa is Microsoft's multilingual
content creation tool for translating Wikipedia pages into multilingual
pages. The source language in WikiBhasa is English and the target language
can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest
and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the
development of Indian language software. It provides Indian language
enablement technology to many state governments and to the central
government for their e-governance programs. It has developed software for
multilingual content creation for newspaper publishing and has also
developed high-quality Unicode-based fonts for major Indian languages. In
particular, it has developed the Shree-Lipi Gujarati package for the
Gujarati language, which is useful in the DTP sector, in corporate offices
and in the e-governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has been aware of the need for Indian language
technology. Since 1970 the Department of Electronics and the Department of
Official Language have been involved in developing Indian language
technology. Consequently ISCII (Indian Script Code for Information
Interchange) was developed for Indian languages on the pattern of ASCII
(American Standard Code for Information Interchange). In addition, Indian
languages Transliteration (ITRANS), developed by Avinash Chopde, represents
Indian language alphabets in terms of ASCII (Madhavi et al. 2005). The
Department of Information Technology under the Ministry of Communication and
Information Technology is also working for the proliferation of language
technology in India. Other Indian government ministries, departments and
agencies, such as the Ministry of Human Resource, DRDO (Defence Research and
Development Organisation), the Department of Atomic Energy, the All India
Council for Technical Education and the UGC (University Grants Commission),
are also involved directly and indirectly in research and development of
language technology. All these agencies help develop important areas of
research and provide research funds to development agencies. As an end
result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian
language technology and provides standards for language technology. The TDIL
journal Vishvabharata (Jan 2010) outlined short-term, intermediate and
long-term goals for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very
fast. In the last year alone more than 10 to 15 Indian local search engines
appear to have been launched: some smaller and some bigger, some with huge
funding and some with none. This space is so crowded right now that it is
difficult to know who is really winning. However, we attempt to give a brief
overview of the current scenario. The following fall into the category of
localized Indian search engines:
Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition). It is too early to pass judgment on any of them. These
are testing stages, and every start-up is adding new features and improving
its services.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by
JuxtConsult, 2008-2009 Report
4.6.1.1 Most Used Websites

Website      2008 Stats          2009 Stats
Google       37                  35
Yahoo        32                  25
Rediff       7                   4
Orkut        6                   7
More info    India Online 2008   India Online 2009 [65]

Table 4.12 Most Used Websites

4.6.1.2 Info Search English

Website      2008 Stats          2009 Stats
Google       81                  76
Yahoo        7                   7
Wikipedia    3                   6
English      3                   4
More info    India Online 2008   India Online 2009 [65]

Table 4.13 Info Search English

4.6.1.3 Info Search Local Language

Website      2008 Stats          2009 Stats
Google       65                  34
Yahoo        12                  29
Rediff       4                   15
Teluguone    2                   0
Guruji       1                   18
Raftaar      1                   0.2
Hindi        1                   NA
Webdunia     1                   0.7
Khoj         0.7                 2
More info    India Online 2008   India Online 2009 [65]

Table 4.14 Info Search Local Language
4.7 Problems Faced while Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low
recall when information is accessed using Indian language queries. For
instance, popular web search engines such as Google, Yahoo and Guruji often
return 0 search results for Indian language queries, giving the impression
that no documents containing this information exist. In reality these search
engines face a recall problem when dealing with Indian languages, owing to
multiple spellings, morphological variants of keywords and English keywords
written in Hindi. Table 4.15 illustrates a few such cases. For example, the
Hindi query for "world trade center aatank-waadi hamlaa"
(वलडम टर ड स नटय आत कव दी हरभ) is shown to return 0 documents in Table
4.15, whereas a small rephrasing of the query in Table 4.16 shows that these
keywords do exist, appearing in the second search result. But merely saying
that we have a recall problem may not be sufficient. The next obvious
question is: how much of a problem is it? In other words, we need to
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs and by how much.
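As a rough illustration of what quantifying recall involves, the following minimal Python sketch computes recall as the fraction of known relevant documents that a search engine actually returned. The function name `recall` and the document identifiers are hypothetical and are not taken from the experiments in this chapter; the sketch also assumes that the set of relevant documents is known in advance, which is rarely the case for the open web.

```python
def recall(retrieved_ids, relevant_ids):
    """Fraction of the known relevant documents that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        # Recall is undefined without relevant documents; 0.0 is used here
        # purely as a convention for this sketch.
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Hypothetical example: four relevant documents exist, the engine finds one.
print(recall(["d1", "d9"], ["d1", "d2", "d3", "d4"]))
```

A query that returns 0 results, as in the cases described above, gives a recall of 0 no matter how many relevant documents exist.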
Table 4.15 Problems faced while searching in Hindi: low recall

Hindi Query                                      Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                    0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम         0        0       0

Table 4.16 Improved recall

Hindi Query                                      Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                          8820     92      12
वलडम टर ड सटय आतॊकीअटक                           331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                 708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                   37100    7400    93

4.8 Factors Affecting the Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are rarely followed, for various reasons. One reason is
the language diversity of India. Including Hindi, about 28 languages are
spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which results in dialectal variation
not only in speaking but also in writing. Every language attaches markers to
a root word to construct new words: English uses suffixes such as -s, -es
and -ing, and Hindi uses MAATRAAS (vowel signs) and vowel endings. For
example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are morphological
variants of the root word Yojnaa (योजना, "planning" in English). It is
desirable to combine all the morphological variants of a word into a single
canonical form. This process is called word stemming, and the canonical form
is called the root word or base word.
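The idea of reducing morphological variants to a root word can be sketched with a very light suffix-stripping stemmer. The Python sketch below is only an illustration under stated assumptions: the suffix list `SUFFIX_RULES`, the function name `stem` and the minimum stem length are choices made for this example, not a published Hindi stemmer, and a handful of plural endings is far from covering Hindi morphology.

```python
# Illustrative (suffix, replacement) rules for a few common plural endings.
# This list is an assumption made for the sketch, not a complete rule set.
SUFFIX_RULES = [
    ("ियों", "ी"),   # पक्षियों -> पक्षी
    ("ओं", ""),      # योजनाओं -> योजना
    ("एं", ""),      # योजनाएं -> योजना
    ("एँ", ""),      # योजनाएँ -> योजना
    ("ों", ""),      # झीलों -> झील
    ("ें", ""),      # झीलें -> झील
]


def stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


for variant in ["योजना", "योजनाओं", "योजनाएं"]:
    print(variant, "->", stem(variant))
```

Applying `stem` to both the query keywords and the indexed terms would let all three variants match the same root. A rule list this short will over-stem or under-stem many words, so it should be read as a demonstration of the principle only.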
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, the transliteration of proper
names, words borrowed from regional and foreign languages, and the phonetic
variety of the Indian language alphabet. Together these factors produce
spelling variations of the same word; for example, the Hindi word अंग्रेजी
(angrējī, meaning "English") has several possible spellings. There are
numerous words that are phonetically equivalent but vary in writing. The
word "school" can be written in Hindi in several different ways. When
information is searched using the standard keyword स्कूल ("school") and a
non-standard phonetic equivalent, Google shows 69 million results for the
former and 14 million for the latter. Hindi is also influenced by other
regional languages, which adds to the phonetic variety of words: for
example, the English word "school" (स्कूल in Hindi) is pronounced and
written as ISKOOL (इस्कूल) by the majority of the population in different
states of India. For the Hindi word ISKOOL (इस्कूल) more than two thousand
results are found. Search engines should be capable of retrieving results
for the phonetic equivalents of the keywords entered. A user may use any
keyword for searching, and search engines should support all phonetically
equivalent words.
No particular standard exists for writing the keywords used to fetch Hindi
web data. For each phonetically equivalent keyword in a query the results
vary, i.e. a different set of documents is retrieved, with little overlap. A
native Hindi user may not be aware of the phonetic issues in Hindi IR and
may therefore miss relevant information.
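One way a search system could reduce the effect of such spelling variation is to map phonetically equivalent spellings to a common key before matching. The following Python sketch is a simplified illustration: the function name `normalize_hindi` and the three rules it applies (dropping the nukta, treating chandrabindu as anusvara, and writing a half nasal consonant before another consonant as anusvara) are assumptions chosen for this example. It does not handle variants such as स्कूल and इस्कूल, which would need additional rules.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
NASALS = "ङञणनम"
CONSONANT = "[\u0915-\u0939]"  # Devanagari consonants from क to ह


def normalize_hindi(word):
    """Map common phonetic spelling variants of a Hindi word to one key."""
    # Decompose precomposed nukta letters (for example ज़) into base + nukta.
    word = unicodedata.normalize("NFD", word)
    # Rule 1: ignore the nukta, so ज़ and ज are treated alike.
    word = word.replace(NUKTA, "")
    # Rule 2: treat chandrabindu and anusvara as the same nasal marker.
    word = word.replace(CHANDRABINDU, ANUSVARA)
    # Rule 3: a half nasal consonant before a consonant becomes anusvara.
    word = re.sub(f"[{NASALS}]{VIRAMA}(?={CONSONANT})", ANUSVARA, word)
    return unicodedata.normalize("NFC", word)


print(normalize_hindi("हिन्दी") == normalize_hindi("हिंदी"))
print(normalize_hindi("आन्दोलन") == normalize_hindi("आंदोलन"))
```

Indexing and querying on the normalized key would let the variant spellings retrieve the same documents, at the cost of occasionally conflating words that are genuinely different.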
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing
the right word can be difficult for people as well as for an information
retrieval system. For example, the Hindi word आभूषण ("ornament" in English)
has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
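Synonyms can be exploited at query time by issuing one query variant per synonym. The Python sketch below assumes a small hand-made synonym table; the name `SYNONYMS`, its single entry and the function `expand_query` are hypothetical and stand in for a real lexical resource such as a Hindi WordNet. The synonyms are listed in the inflected forms that fit the example query, because the sketch does not handle grammatical agreement.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query plus one variant per synonym of each token."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


for variant in expand_query("सोने के आभूषण"):
    print(variant)
```

The result lists returned for the variants could then be merged, so that a document using any of the synonyms is retrieved.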
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples below show
this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it has another
meaning, "to sleep". In the context of the above query the word सोना means
gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means "common";
however, in Hindi it also means "mango", so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word
in a given context is an intricate task. One word can have more than one
meaning, and its meaning depends on the context of the sentence. For
example, कर means "tax" in one context, with synonyms such as ब्याज, शुल्क,
सूद, महसूल and टैक्स; in another context कर means "hand" or "arm", with
synonyms such as हस्त and बाँह; and in yet another context कर means "to do"
(करना).
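A very simple way to pick a sense from context is to count how many cue words for each sense occur in the rest of the query, in the spirit of a simplified Lesk approach. The Python sketch below is only illustrative: the `SENSES` table, its cue words and the function `disambiguate` are assumptions made for this example, not a resource used in this study.

```python
# Hypothetical cue words for each sense of an ambiguous keyword.
SENSES = {
    "सोना": {
        "gold": {"गहना", "आभूषण", "चांदी", "धातु", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर", "आराम"},
    },
}


def disambiguate(word, context_tokens, senses=SENSES):
    """Return the sense whose cue words overlap most with the context,
    or None when the context contains no cue word at all."""
    context = set(context_tokens)
    best_sense, best_overlap = None, 0
    for sense, cues in senses.get(word, {}).items():
        overlap = len(cues & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("सोना", "रात को जल्दी सोना आराम देता है".split()))
print(disambiguate("सोना", "नारी को सोना पसंद है".split()))
```

For the query नारी को सोना पसंद है the sketch returns None, because none of the surrounding words is a cue for either sense. This mirrors the difficulty described above: short queries often carry too little context to resolve the ambiguity.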
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, and it
has affected the pronunciation of Hindi words. Many English words have been
localized in India, and some of them appear as if they were native Hindi
words. Indians are sometimes unable to find the Hindi equivalent of an
English word. For instance, words such as road, bus, pen, television, radio,
please, rail, email, password, insurance, internet, director and department
are used even by uneducated Indians, without awareness of which language the
words belong to. Most Indians use these words in English rather than in
their native language. English influences Hindi not only in speaking but
also in writing, and this becomes especially evident in Hindi literature on
the web. The influence of English on Hindi has been observed to be one of
the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown
in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists queries selected randomly from that set; they throw light
on the effect of the root word on the performance of Hindi language search
engines. Table 4.18 shows examples of the effect of morphological factors on
Hindi queries.

Table 4.17 List of Hindi queries

S.No   Query in Hindi                  Meaning in English
1      ब यतवष मवन                     Indian rain forest
1.1    ब यत मवष मवनो                  Indian rain forests
2      हव ईद घमटन क क यण              Reason for air crash
2.1    हव ईद घमटन ओ क क यण            Reason for air crashes
3      ब यतभफ रीज न व रीब ष           Language spoken in India
3.1    ब यतभफ रीज न व रीब ष ए         Languages spoken in India
3.2    ब यतभफ रीज न व रीब ष ओ         Languages spoken in India
4      षवर पतह न ऩयझ र                Lake on the verge of extinction
4.1    षवर पतह न ऩयझ र                Lakes on the verge of extinction
5      पर क ततकआऩद                    Natural calamity
5.1    पर क ततकआऩद ए                  Natural calamities
6      ऩ कीपरज तत                     Bird species
6.1    ऩकषमोकीपरज तत                  Birds' species
6.2    ऩकषमोकीपरज ततम                 Birds' species
7      क षषसभसम                       Agriculture problem
7.1    क षषसभसम ओ                     Agriculture problems
8      कीटन शकक इसत भ र               Use of pesticide
8.1    कीटन शकोक इसत भ र              Use of pesticides
9      भ नमसकय ग                      Mental illness
9.1    भ नमसकय चगमो                   Mental patients
9.2    भ नमसकय ग                      Mental patient
10     गर भ णषवक सम जन                Policy for village
10.1   गर भ णषवक सम जन ओ              Policies for village
10.2   गर भ णषवक सम जन ए              Policies for village
11     परभ खक षषक दर                  Major agricultural office
11.1   परभ खक षषक नदरो                Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries

For each root word, the first line gives the root word and the keywords
listed by Google, Bing and Guruji (in that order); the numbered lines give
the documents returned for each morphological variant.

S.No   Morphological variant      Documents returned
                                  Google    Bing     Guruji
1      Root word: ब यत वष मवन
       Keywords listed: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1    ब यत वष मवन               50500     4680     485
1.2    ब यत मवष मवनो             40400     680      61
2      Root word: द घमटन
       Keywords listed: द घमटन द घमटन ओ द घमटन द घमटन
2.1    द घमटन                    133000    2410     284
2.2    द घमटन ओ                  117000    420      23
3      Root word: ब ष
       Keywords listed: ब ष ब ष ओ ब ष ब ष
3.1    ब ष                       161000    8330     961
3.2    ब ष ए                     6200      935      188
3.3    ब ष ओ                     6090      441      356
4      Root word: झ र
       Keywords listed: झ रझ रो झ र झ र
4.1    झ र                       4740      278      25
4.2    झ र                       1270      28       1
5      Root word: आऩद
       Keywords listed: आऩद आऩद ओ आऩद आऩद
5.1    आऩद                       102000    4030     410
5.2    आऩद ए                     1160      64       20
6      Root word: ऩ परज तत
       Keywords listed: ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1    ऩ परज तत                  48200     1670     98
6.2    ऩकषमोपरज तत               47600     1150     84
6.3    ऩकषमोपरज ततम              33800     747      25
7      Root word: सभसम
       Keywords listed: सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1    सभसम                      584000    30200    1889
7.2    सभसम ओ                    584000    7150     1356
8      Root word: कीटन शक
       Keywords listed: कीटन शक कीटन शको कीटन शक कीटन शक
8.1    कीटन शक                   36300     1360     333
8.2    कीटन शको                  35800     800      270
9      Root word: य ग
       Keywords listed: य गोय ग य ग य ग
9.1    य ग                       205000    21600    1423
9.2    य चगमो                    128000    3280     239
9.3    य ग                       112000    6280     647
10     Root word: म जन
       Keywords listed: म जन ओ म जन म जन म जन
10.1   म जन                      673000    18500    3343
10.2   म जन ओ                    669000    6020     990
10.3   म जन ए                    673000    2860     416
11     Root word: क दर
       Keywords listed: क दरीमक दर क दर क दर
11.1   क दर                      261000    11300    655
11.2   क नदरो                    29500     1850     105
Table 4.19 Precision values of the three search engines

S.No   Query                          Precision@10
                                      Google   Bing   Guruji
1.1    ब यतवष मवन                    0.5      0.3    0.1
1.2    ब यत मवष मवनो                 0.3      0.1    0.1
2.1    हव ईद घमटन क क यण             0.5      0.3    0.1
2.2    हव ईद घमटन ओ क क यण           0.3      0.3    0.1
3.1    ब यतभफ रीज न व रीब ष          0.9      0.5    0.5
3.2    ब यतभफ रीज न व रीब ष ए        0.5      0.4    0.3
3.3    ब यतभफ रीज न व रीब ष ओ        0.5      0.2    0.2
4.1    षवर पतह न ऩयझ र               0.5      0.3    0.2
4.2    षवर पतह न ऩयझ र               0.5      0.3    0
5.1    पर क ततकआऩद                   1        0.4    0.3
5.2    पर क ततकआऩद ए                 0.6      0.6    0.1
6.1    ऩ कीपरज तत                    1        0.7    0.2
6.2    ऩकषमोकीपरज तत                 0.9      0.6    0.1
6.3    ऩकषमोकीपरज ततम                0.9      0.6    0.1
7.1    क षषसभसम                      0.7      0.3    0.2
7.2    क षषसभसम ओ                    0.7      0.4    0.2
8.1    कीटन शकक इसत भ र              1        0.8    0.2
8.2    कीटन शकोक इसत भ र             0.9      0.6    0.2
9.1    भ नमसकय ग                     0.7      0.6    0.4
9.2    भ नमसकय चगमो                  0.9      0.6    0.3
9.3    भ नमसकय ग                     0.9      0.5    0.4
10.1   गर भ णषवक सम जन               0.9      0.7    0.3
10.2   गर भ णषवक सम जन ओ             1        0.6    0
10.3   गर भ णषवक सम जन ए             0.8      0.6    0.2
11.1   परभ खक षषक दर                 0.5      0.4    0
11.2   परभ खक षषक नदरो               0.4      0.2    0.2

Figure 4.1 Precision values of the three search engines
[Figure: precision (0 to 1) plotted against query number for Google, Bing
and Guruji]
It has been observed that all three search engines return more documents
when the query is submitted with the root word. This justifies searching for
documents using the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological variants of
the root words, whereas Bing and Guruji list only the root word supplied, in
almost all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their
root form. Bing and Guruji do not appear to index in that form, which would
explain why they retrieve fewer documents than Google. The overall
comparison of results from the three search engines in the tables above
shows that, in general, the quantity of results retrieved increases when the
keywords are used in their root form. For search engines, however, the
quality of results is more important than the quantity. Figure 4.1 and Table
4.19 compare the precision values of the three search engines. The precision
value is calculated from the top 10 results of each search engine. On
closely observing the results, we can say that the precision of Google is
high in almost all queries. Since, as mentioned above, Google appears to
index keywords in their root form, the relevance of its results is also
higher than that of the other two search engines. This indicates that
morphological variations in the keywords affect not only the quantity but
also the quality of results.
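Precision over the top 10 results, as used in this discussion, is the number of relevant results among the first 10 divided by 10. A minimal Python sketch is given below; the function name `precision_at_k` and the relevance judgments in the example are hypothetical, since the relevance judgments behind the reported values are not part of the text.

```python
def precision_at_k(relevance_flags, k=10):
    """Precision over the top k results.

    relevance_flags is a ranked list of booleans, True where the result at
    that rank was judged relevant. Missing ranks count as non-relevant.
    """
    top_k = list(relevance_flags)[:k]
    return sum(1 for is_relevant in top_k if is_relevant) / k


# Hypothetical judgments: 5 of the first 10 results are relevant.
judgments = [True, True, False, True, False, False, True, True, False, False]
print(precision_at_k(judgments))
```

Dividing by k even when fewer than k results are returned penalizes an engine that retrieves very few documents; dividing by the number of results actually returned is a common alternative.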
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for the phonetic
equivalents of the keywords entered. A user may use any keyword for
searching, and search engines should support all phonetically equivalent
words.
The following are randomly selected queries from the set of 50 queries
tested on the Google search engine. Table 4.20 shows the number of results
and the precision obtained for each phonetic variation.

Table 4.20 Results of the search engine on the phonetic nature of Hindi
language
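Each block gives the Hindi query (standard keywords in bold in the
original), the phonetic variations of its keywords, and the Google results
for queries using those keywords.

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
Keywords in Google query         No. of results   Precision@10
ससजिमोिहयीर                      97               0.9
ससजिमो जहयीर                     632              0.9
सिबजमो िहयीर                     194              0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
Keywords in Google query         No. of results   Precision@10
आसभ न भहॊगाई                     35300            1.0
आसभ न भहॉगाई                     1040             1.0
आसभ भह ग ई                       14               0.6
आसभाॊ भह ग ई                     563              0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
Keywords in Google query         No. of results   Precision@10
भरषटाचायआज दी                    211000           0.8
भरशटाचायआज दी                    214              0.6
बयषटाचाय आज दी                   447              0.7
भरषटट च यआिादी                   1090000          0.9
भरशट च य आिादी                   1040             0.7
बयषटट च यआिादी                   1190             0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
Keywords in Google query         No. of results   Precision@10
अनन हज य आनदोरन                  84700            0.3
अनन हज य आॊदोरन                  85100            0.8
अनन हज य क आॉदोरन                78               0.6
अनना हिाय क आनद रन               399              0.5
अनन हिाय क आनद रन                3260000          1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
Keywords in Google query         No. of results   Precision@10
फयोिगायी                         9650             0.9
फयोिगायी                         80600            1.0
फयोिगायी                         170              0.7
फयोिगायी                         30               0.5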
Figure 4.2 Precision charts for phonetic nature of Hindi language
[Figure: five pie charts, Query No 1 to Query No 5, one slice per query
variation]
In the above table and figure it can be clearly seen that search engines
return a handful of documents for various phonetically equivalent Hindi
queries. It is observed that no particular standard exists for writing the
keywords used to fetch Hindi web data. For each phonetically equivalent
keyword in a query the results vary, i.e. a different set of documents is
retrieved, with little overlap. From the precision charts it is clearly
observed that the degree of relevance for queries containing phonetically
equivalent keywords is almost the same. A native Hindi user may not be aware
of the phonetic issues in Hindi IR and may therefore miss relevant
information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing
the right word can be difficult for people as well as for an information
retrieval system. For example, the Hindi word आभूषण ("ornament" in English)
has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
Table 4.21 Effect of word synonyms on Hindi IR

For each query, the first line gives the query and its standard Hindi word
and synonyms; the numbered lines give the documents returned and the
precision over the top 10 results (P@10) for each search engine.

S.No   Query                   Google    P@10   Bing    P@10   Guruji   P@10
1      Query: स न क आब षण; standard word and synonyms: आब षणगहन
1.1    स न क आब षण            217000    0.8    3250    0.7    381      0.5
1.2    स न क गहन              188000    0.8    2590    0.6    389      0.5
1.3    स न क ज वय             78900     0.8    1670    0.7    311      0.6
1.4    स न क अर क य           9490      0.5    633     0.4    70       0
1.5    स न क आबयण             493       0.5    38      0.3    1        0
2      Query: क र फ दर; standard word and synonyms: फ दर फ दर
2.1    क र फ दर               233000    0.7    7510    0.7    733      0.3
2.2    क र भ घ                40700     0.9    1500    0.8    99       0.6
2.3    क र जरधय               1570      0.6    54      0.6    2        0.2
3      Query: सतर सशिकतकयण; standard word and synonyms: सतर न यी
3.1    सतर सशिकतकयण           9950      0.9    1570    0.7    760      0.6
3.2    न यीसशिकतकयण           29300     0.9    1910    0.9    736      0.4
3.3    भहहर सशिकतकयण          96300     0.9    5160    0.8    1091     0.3
3.4    औयतसशिकतकयण            7670      0.8    680     0.7    510      0.2
4      Query: मसक दयक अ हक य; standard word and synonyms: अ हक य
4.1    मसक दयक अ हक य         1990      0.4    18      0.3    60       0.6
4.2    मसक दयक अमबभ न         2400      0.6    304     0.2    16       0.1
4.3    मसक दयक घभ ड           495       0.5    54      0.6    9        0.1
5      Query: व रग ओ; standard word and synonyms: व
5.1    व रग ओ                 6960      1      698     0.8    29       0.5
5.2    ऩ डरग ओ                13400     1      1080    0.9    143      0.9
5.3    दयखतरग ओ               481       0.5    19      0.5    0        0
6      Query: आ खद न; standard word and synonyms: आ ख
6.1    आ खद न                 34000     0.7    3690    0.5    312      0.3
6.2    न तरद न                77500     1      3240    1      159      0.9
6.3    च द न                  2450      0.8    427     0.1    36       0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together
with their synonyms improves information retrieval for a query in Hindi.
Using synonyms of Hindi keywords affects not only the quantity of documents
returned but also their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the
fewest; the reason may be that fewer documents are available to it or that
its indexing is poor. However, we are more interested in the quality of
results than in the quantity. As far as quality is concerned, it can be
clearly seen that Google and Bing provide better results than Guruji, and on
average Google still ranks first, i.e. its precision values are higher than
those of Bing and Guruji. Thus it becomes clear that equivalent results can
be obtained by replacing a keyword with its synonym. Therefore synonyms of
keywords play an important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected
queries below. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The
fourth and sixth columns give the meaning of the query in English with
respect to the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries

S.No  Query                  Keyword as        In English                        Keyword as        In English
1     नारी को सोना पसंद है   सोना (Gold)       Women like gold                   सोना (To sleep)   Women like to sleep
2     आम लोगों की पसंद       आम (Common)       Common man's choice               आम (Mango)        Mango is people's choice
3     बाल विकास और पोषण      बाल (Children)    Child development and nutrition   बाल (Hair)        Hair development and nutrition
4     सपेरों का फन           फन (Art)          Art of snake charmers             फन (Snake head)   Snake charmer's snake head
5     युद्ध में कुल विनाश    कुल (Aggregate)   Aggregate destruction in wars     कुल (Family)      Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

Query  Ambiguous  Documents  Context    Results  Context     Results  Other
       keyword    returned              found                found    context
1      सोना       50800      Gold       5        To sleep    2        3
2      आम         488000     Common     3        Mango       3        4
3      बाल        2900000    Children   7        Hair        3        0
4      फन         184        Art        0        Snake head  10       0
5      कुल        17800      Aggregate  2        Family      3        5

Table 4.24 Ambiguity test for Bing

Query  Ambiguous  Documents  Context    Results  Context     Results  Other
       keyword    returned              found                found    context
1      सोना       2680       Gold       2        To sleep    2        6
2      आम         17800      Common     3        Mango       3        4
3      बाल        4030       Children   6        Hair        2        2
4      फन         25         Art        0        Snake head  9        1
5      कुल        1900       Aggregate  0        Family      2        8
Table 425 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | na | Snake head | na | na |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the above tables, it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In each table, the last column, labeled "Other Context",
holds the number of results that are not relevant to the query supplied, that
is, documents that contain the keyword in some other, non-required context. All
three search engines return documents in several contexts, so it can be said
that search engines underperform when supplied with ambiguous queries. The
numbers in the "Other Context" column signify the deviation from relevance. For
example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the
"Other Context" column contains 5 documents for Google, 8 for Bing, and all 10
for Guruji.

For another query, सपेरों का फन, the retrieved documents are expected to be in
the context "art" (art of snake charmers) rather than the other context (snake
charmer's snake head). However, Google returns all 10 documents in the
non-required context (snake head), Bing returns 9, and Guruji fails to retrieve
even a single document. Search engines therefore need to address ambiguity in
keywords in order to obtain better results.
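The per-context counts in Tables 423 to 425 can be turned into shares of the judged results. The following is a minimal sketch, assuming the three counts per engine (intended context, alternate context, other context) partition the judged top results; the function name `context_shares` and the dictionary layout are illustrative assumptions, not part of the study.

```python
# Judgments for query 5 ("aggregate destruction in wars"), copied from
# Tables 423-425 as (intended context, alternate context, other context).
judgments = {
    "Google": (2, 3, 5),
    "Bing": (0, 2, 8),
    "Guruji": (0, 0, 10),
}


def context_shares(intended, alternate, other):
    """Return the fraction of judged results in each of the three buckets."""
    total = intended + alternate + other
    if total == 0:
        return None  # the engine returned nothing to judge
    return (intended / total, alternate / total, other / total)


shares = {engine: context_shares(*counts) for engine, counts in judgments.items()}
for engine, (hit, alt, other) in shares.items():
    print(f"{engine}: intended={hit:.1f} alternate={alt:.1f} other={other:.1f}")
```

A larger "other" share corresponds to a larger deviation from relevance in the sense used above.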
413 Discussion Influence of English on Hindi Information retrieval
The English language influences Hindi not only in speech but also in writing,
and this is especially evident in Hindi content on the web. The influence of
English on Hindi has been observed to be one of the most important parameters
for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as एकसयस इज. The
word has the following phonetic variations in Hindi: एकसयस इज,
एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety
of popular keywords and queries from various domains have been tested in the
experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
Table 426 Keywords along with their phonetic variants written in Hindi

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | Google results: standard | Google results: equivalent 1 | Google results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

From the above table it can be observed that the search engine returns
documents not only for the single-keyword query but also for every phonetic
variant of the keyword, and in large numbers. It can also be seen that people
have their own ways of writing such words in Hindi and that no standard is
followed for storing Hindi data on the web. In the above table, the Google
transliteration column shows the keywords obtained with the Google
transliteration tool. Table 426 shows that transliteration does not provide the
correct Hindi word in most cases. For example, the correct transliteration of
the word "University" should be म तनवमसमटी, whereas Google transliteration
provides the word उतनव मसमतम, which is completely wrong. Nevertheless, 1540
documents have been retrieved for the wrong keyword उतनव मसमतम (University),
and the same holds for other single-word queries: Insurance (इनस य स),
Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and
Specialist (सऩ मसअमरसट). It follows that Hindi website developers make use of
unchecked and non-standard transliteration, which makes the Hindi IR process a
difficult task.

In the next example, multiword Hindi queries are selected to test the effect of
English influence on the precision and quantity of documents retrieved. Each
Hindi query is transformed into its variants by the software (its design and
working are discussed in the next chapter), which replaces the Hindi keywords
with English keywords written in Hindi without changing the meaning of the
query.
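Returning to the single-keyword counts of Table 426, a short calculation shows how many reported documents sit behind non-standard spellings. This is a minimal sketch using the four Google counts for "University"; the key names and the function `variant_share` are illustrative assumptions, and because result sets for different spellings may overlap, the sum is only an upper bound on distinct documents.

```python
# Google result counts for the four spellings of "University" in Table 426:
# Google transliteration, standard Hindi keyword, and two phonetic equivalents.
counts = {
    "google_transliteration": 1540,
    "standard_keyword": 1070000,
    "phonetic_equivalent_1": 1420,
    "phonetic_equivalent_2": 3270,
}


def variant_share(counts, standard_key="standard_keyword"):
    """Return (documents reported only under non-standard spellings, their share)."""
    total = sum(counts.values())
    non_standard = total - counts[standard_key]
    return non_standard, non_standard / total


missed, share = variant_share(counts)
print(f"{missed} documents ({share:.2%}) are reported under non-standard spellings")
```

A user who searches only with the standard keyword would not be shown the documents counted in `missed`.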
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by its English equivalent. In both cases the meaning of the original
Hindi query is unchanged. Example:
The English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न)
and तनव श means "investment" (इनव सटभट). The query is transformed at the two
levels as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh
Bharat mein) supplied by the user is transformed into two equivalent forms that
mix English and Hindi while the meaning of the query remains the same. From the
sample set of one hundred queries, some randomly selected queries are presented
below in Table 427.
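The two-level transformation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the software from the next chapter: the function `transform_query`, the romanised tokens, and the small lookup table `equivalents` are all hypothetical, and the sketch enumerates every possible single-word and multi-word replacement.

```python
from itertools import combinations


def transform_query(tokens, english_equivalents):
    """Return (level1, level2) variants of a tokenised Hindi query.

    level1: exactly one replaceable word swapped for its English equivalent.
    level2: two or more replaceable words swapped.
    """
    positions = [i for i, tok in enumerate(tokens) if tok in english_equivalents]

    def swap(chosen):
        # Rebuild the query, replacing only the tokens at the chosen positions.
        return " ".join(
            english_equivalents[tok] if i in chosen else tok
            for i, tok in enumerate(tokens)
        )

    level1 = [swap({i}) for i in positions]
    level2 = [
        swap(set(combo))
        for size in range(2, len(positions) + 1)
        for combo in combinations(positions, size)
    ]
    return level1, level2


# Romanised form of the example query from the text.
query = "videshi nivesh bharat mein".split()
equivalents = {"videshi": "foreign", "nivesh": "investment"}  # hypothetical lookup

level1, level2 = transform_query(query, equivalents)
print(level1)
print(level2)
```

In practice the replacements would be English words written in Devanagari; romanised tokens are used here only to keep the example readable.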
Table 427 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi: tabular representation

| In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 44 Transformed Queries into two equivalent senses (chart of Google
result counts for Hindi Query 1 to 6 at the original query, Level 1, and
Level 2)
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines, however, the quality
of results is more important than the quantity, so Table 428 and Figure 45 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing, and AltaVista, are used for retrieving web
results.
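Table 428 Analysis of precision values (Precision@10): tabular representation

| Hindi query (HQ) | Influenced query, Level 1 (L1) | Influenced query, Level 2 (L2) | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |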
Figure 45 Analysis of Inter precision values
From the above tables and figures it can be seen that Hindi data of a similar
nature can be mined for Hindi queries by transforming them into variants that
include English keywords written in Hindi. The transformation of queries
resulted in an increase in retrieved data, and the relevance of the retrieved
data can be seen in the precision columns. For every Hindi query and its
transformed variations, the degree of relevance of the documents is very close,
equal, or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the
first 10 documents are relevant, and for the transformed queries of similar
nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents
are again relevant. A similar pattern holds for the rest of the Hindi queries,
as shown in the table above. Without transformation of Hindi queries, the user
may miss relevant information, because the Hindi user may not be aware that
such information is present on the web and may be unable to formulate the query
variation that reflects English influence. From the above table it can be said
that English-influenced Hindi information is present on the web and is
increasing day by day. By including English keywords in Hindi script in the
query, the scope of searching in Hindi and getting relevant information can be
increased.
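The precision figures discussed above are precision over the first 10 results. A minimal sketch of that measure follows, assuming each result has been judged relevant (1) or not relevant (0); the function name `precision_at_k` and the sample judgment list are hypothetical.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that were judged relevant.

    `relevance` is a list of 0/1 judgments in rank order. If fewer than k
    results were returned, the missing positions count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
judged = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
print(precision_at_k(judged))
```

Dividing by `k` rather than by the number of results returned penalises an engine that returns fewer than 10 documents, which is one common convention; dividing by the number returned is another.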
[Figure 45: inter-precision chart for Google, Bing, and AltaVista over HQ1 to HQ6 and their L1 and L2 variants; HQ = Hindi query, L = level]
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
The relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms, and English-equivalent Hindi keywords are implemented at
root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the
next chapter) that acts as an interface between the user and the search
engines. With the help of this tool, the user can widen the scope of search on
the web in the Hindi language [66][67].
434 Devanagari U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern
historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to
write other related languages of India (such as Marathi) and of Nepal (Nepali).
In addition, the Devanagari script is used to write the following languages:
Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali,
Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa, and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the
Tibetan script, and the Southeast Asian scripts, are historically connected
with the Devanagari script as descendants of the ancient Brahmi script. The
entire family of scripts shares a large number of structural features. The
principles of the Indic scripts are covered in some detail in this introduction
to the Devanagari script. The remaining introductions to the Indic scripts are
abbreviated but highlight any differences from Devanagari where appropriate.
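Because the Devanagari block occupies the fixed range U+0900 to U+097F, a query can be checked for Devanagari content with a simple range test. The sketch below is an illustration only; the helper names `is_devanagari` and `devanagari_ratio` are assumptions introduced here, and the test covers only this one block, not the later Devanagari extension blocks.

```python
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def is_devanagari(ch):
    """True if the character's code point lies in the Devanagari block."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


def devanagari_ratio(text):
    """Share of non-space characters that fall in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(ch) for ch in chars) / len(chars)


# The word "Hindi" in Devanagari (six code points) followed by the Latin letters "IR".
sample = "\u0939\u093f\u0928\u094d\u0926\u0940 IR"
print(devanagari_ratio(sample))
```

A ratio like this could help an interface decide whether a query is written in Hindi script before applying Hindi-specific processing.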
4341 Standards
The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian
Script Code for Information Interchange). The ISCII standard of 1988 differs
from and is an update of earlier ISCII standards issued in 1983 and 1986. The
Unicode Standard encodes Devanagari characters in the same relative positions
as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The
same character code layout is followed for eight other Indic scripts in the
Unicode Standard: Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
and Malayalam. This parallel code layout emphasizes the structural similarities
of the Brahmi scripts and follows the stated intention of the Indian coding
standards to enable one-to-one mappings between analogous coding positions in
different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar,
and other scripts depart to a greater extent from the Devanagari structural
pattern, so the Unicode Standard does not attempt to provide any direct
mappings for these scripts to the Devanagari order.
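The parallel code layout means that a character's offset within its block identifies the analogous position in a sister script. The sketch below illustrates this with the letter KA; the dictionary `BLOCK_START` and the function `analogous_char` are names introduced here for illustration, and the mapping is purely positional, so a target position may be unassigned in scripts that lack the corresponding letter.

```python
import unicodedata

# Starting code points of the nine Indic blocks that share the parallel layout.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}


def analogous_char(ch, target_script, source_script="Devanagari"):
    """Map a character to the same relative position in another script's block."""
    offset = ord(ch) - BLOCK_START[source_script]
    if not 0 <= offset < 0x80:
        raise ValueError(f"{ch!r} is not in the {source_script} block")
    return chr(BLOCK_START[target_script] + offset)


ka = "\u0915"  # DEVANAGARI LETTER KA
for script in ("Bengali", "Tamil", "Telugu"):
    mapped = analogous_char(ka, script)
    print(script, f"U+{ord(mapped):04X}", unicodedata.name(mapped))
```

This positional mapping is not a transliteration scheme; it only demonstrates the shared layout that the encoding preserves.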
In November 1991, at the time The Unicode Standard, Version 1.0, was published,
the Bureau of Indian Standards published a new version of ISCII in Indian
Standard (IS) 13194:1991. This new version partially modified the layout and
repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G: Extended Character Set for Vedic. Modern, non-Vedic texts
encoded with ISCII-1991 may be automatically converted to Unicode code points
and back to their original encoding without loss of information.
4342 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts constitute
abugidas, a cross between syllabic writing systems and alphabetic writing
systems. The effective unit of these writing systems is the orthographic
syllable, consisting of a consonant and vowel (CV) core and, optionally, one or
more preceding consonants, with a canonical structure of (((C)C)C)V. The
orthographic syllable need not correspond exactly with a phonological syllable,
especially when a consonant cluster is involved, but the writing system is
built on phonological principles and tends to correspond quite closely to
pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters
of the Devanagari script. These pieces consist of three distinct character
types: consonant letters, independent vowels, and dependent vowel signs. In a
text sequence, these characters are stored in logical (phonetic) order [62].
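Logical order can be seen in a syllable whose vowel sign is displayed to the left of the consonant but stored after it. The sketch below classifies code points into the three character types named above; the function `char_type` and its ranges are a simplification introduced here, covering only the core letters and vowel signs of the block.

```python
import unicodedata


def char_type(ch):
    """Classify a Devanagari character using the core ranges of the block."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant letter"
    if 0x093E <= cp <= 0x094C:
        return "dependent vowel sign"
    return "other"


# The syllable "ki": the vowel sign I is drawn to the left of KA when rendered,
# yet it follows KA in the stored sequence.
syllable = "\u0915\u093F"
for ch in syllable:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", char_type(ch))
```

For retrieval, this means string comparison and indexing operate on the stored phonetic order, independent of how a renderer positions the glyphs.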
44 Indian Languages on internet
The rise of Hindi, Urdu, and other Indian languages on the Web has led millions
of non-English-speaking Indians to discover uses of the Internet in their daily
lives. They are sending and receiving e-mails, searching for information,
reading e-papers, blogging, and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on the
Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode Consortium, a
nonprofit in California whose members include Google, IBM, Oracle, Microsoft,
Sun Microsystems, Yahoo, and the Government of India, proved to be a boon for
Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its
operating system in 2001. "Since then, Hindi Unicode support has been a part of
all subsequent upgrades of Microsoft's operating systems. Also, providing Input
Method Editor facilities gives users the option to use different types of
keyboards," says Meghashyam Karanam, product manager, vision and localization,
at Microsoft India. The earlier system could incorporate only 127 characters,
which is not enough for the Hindi Devanagari script. The Unicode system can
incorporate up to 65000 characters. As most computers in India use Microsoft's
operating system, this ensured that the Hindi font was available to most of
them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint, and Outlook, was launched. Now the Hindi version of
Microsoft Office 2007 is also available. "It includes Hindi language interface
packs that allow users to create documents and communicate with others in
Hindi. Users can also navigate using the menus and toolbars that are in Hindi.
We have received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support to 13 Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani, and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch
its MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps major
software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the
California-based company has launched various products in the past two years.
With the Google Hindi and Urdu search engines, one can search all the Hindi and
Urdu Web pages available on the Internet, including those that are not in a
Unicode font. "Google offers searching in 13 languages, Hindi, Tamil, Kannada,
Malayalam, and Telugu to name a few; Gmail in five languages; and Google
transliteration in Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi,
Nepali, Punjabi, Tamil, Telugu, and Urdu. Urdu is the most recent language that
Google has added to its offerings," says Rahul Roy-Chowdhury, product manager
at Google India. To use the search function, "users can type Hindi words in
Roman script and a drop-down menu suggests several Hindi phrases. By selecting
the appropriate query, users can search for Hindi content without even typing
in Hindi," says Roy-Chowdhury. Google has more useful tools for non-English
users. Google News is available in Hindi. With the Google translation engine,
one can type English words and get a list of suggested synonyms in Hindi. A
transliteration tool allows users to type any word in English, hit the space
bar, and get the same word in a different language. Roy-Chowdhury explains the
process of adding a new language:
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is
updated before introducing the language with its other offerings like Gmail,
Search, Blogger, Translate, and Orkut, to name a few." "Urdu is currently
available in Google's transliteration offering on the Google Labs Web site, and
the language is soon to be introduced in various other products," he adds.

The efforts of Microsoft, Google, and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of
the popular Hindi newspapers have a Web presence now. "In the last two years,
page views of navbharattimes.com have increased significantly, and half of them
come through Google, as Net users generally search for a specific news item or
query," says Nagar. Yahoo, with headquarters in California, formed a
partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi
portal. "The Jagran relationship helps us gain significant traction among
Indian Internet users. From all the audience measures for this product, this
has been a resounding success," says Gopal Krishna, head of Yahoo's work for
emerging market audiences. Since Yahoo and Jagran started working together,
page views have "grown to about 14 million from one million a year and a half
earlier," says Upendra Swami, who heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering there
are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10800 articles.

What are the challenges that still remain in the popularization of Hindi and
Urdu on the Internet? "The major challenge is Internet penetration and PC
prices. The moment we have better Internet penetration, especially in smaller
towns, and PC prices go down, Hindi and Indian languages can flourish on the
Net," says Karnik of webdunia.com. India had more than 49 million Internet
users in June 2008, out of which about 9 million used the Internet regularly,
according to a study by Juxtconsult India, a research company.
"There is a big opportunity in Indian languages. Studies showed that only 28
percent of Indian Web surfers preferred English on the Web, but as good quality
content in Indian languages was not easily available, they did not visit many
local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT
experts agree. "Localization is the key to success in countries like India. In
order to get the widest audience reach, one has to look at Hindi, because in a
country of over a billion people English is spoken by less than 80 million
people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The
Web is the democratization of access to information," he says, adding that the
Internet is not a luxury but a powerful tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we
look at how the Internet developed in the US, it may provide a useful analogy.
First came content, which was mostly produced by people who had a passion for
putting up content they cared about. Traffic and monetization was not the
motive. Second came growing readership, as people started discovering content.
This set off a virtuous cycle in which content eventually became a viable,
monetizable business. Third were the application developers, who could now
focus on moving the online experience beyond passive consumption of information
to interactivity, community building, service delivery, and a host of other
innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a
long time. And I believe it has recently entered phase two." [63]
45 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first Indian language
corpus for Indian English. It was developed under the leadership of Prof. S. V.
Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains
approximately one million words of Indian English drawn from materials
published in the year 1978. It was collected for a comparative study of
American, British, and Indian English (Dash). The Central Institute of Indian
Languages (CIIL) is a nodal agency for the development of Indian language
corpora. It has coordinated with various Indian agencies and universities to
develop corpora of more than 45 million words in the Scheduled Languages of
India, which is also a part of the TDIL program. The Enabling Minority Language
Engineering (EMILLE) program provides corpus architecture and tools for Asian
languages. It has a monolingual corpus that contains approximately 96157000
words and a parallel corpus that consists of 200000 words of text in English,
which helps in the translation of Bengali, Hindi, Punjabi, and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian
languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali,
Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual
parallel corpus that serves as a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' as a repository of a Hindi words database and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU,
New Delhi. It involves 11 Indian languages and also English.
451 Machine Translation in India
Although translation in India is old, machine translation is comparatively
young. Early efforts in this field have been made since 1980 by prominent
institutions such as IIT Kanpur, University of Hyderabad, NCST Mumbai, and CDAC
Pune. During the late 1990s, many new projects were undertaken by IIT Mumbai,
IIIT Hyderabad, AU-KBC Centre Chennai, and Jadavpur University Kolkata. In
April 2008, TDIL started a consortium-mode project for building computational
tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University
of Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
452 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology for
English to Indian languages under the leadership of Prof. R. M. K. Sinha. It is
a rule-based system with approximately 1750 rules and 54000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo code generator, sense disambiguator, target text generators, and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. To develop
automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
453 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is
based on a hybridized example-based approach. The second phase of both projects
(Anglabharti II and Anubharti II) started in 2004 with new approaches and some
structural changes.
454 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development
project for Indian languages and English undertaken by CIF (Chinmaya
International Foundation). It aims at a fully automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
455 Mantra
The Machine Assisted Translation Tool (Mantra) was conceived by the Indian
Government in 1996 for the translation of government orders, notifications,
circulars, and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network, and web-based forms. It is based on the Lexicalized Tree
Adjoining Grammar (LTAG) formalism to represent the English as well as the
Hindi grammar. Initially it was domain specific, covering Personnel
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums, and Circulars; gradually the domains were expanded. At present it
also covers domains such as Banking, Transportation, and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it
is also available for English to other Indian languages such as Gujarati,
Bengali, and Telugu. MANTRA-Rajyasabha is a system for translating parliament
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB], and Synopsis. The Rajya Sabha
Secretariat of the Rajya Sabha (the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. UNL-based
machine translation is being developed under the leadership of Prof. Pushpak
Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). The project
is funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the system-generated translation,
the user can supply new words, phrases and
sentences as feedback and obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti is a rule-based system that also uses a statistical approach. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has a small set of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports processing
of Unicode-encoded text for text manipulation. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre-Processing,
Translation Engine and Post-Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawl technology based on proprietary algorithms. For
any query, it goes deep into Indian-language content and tries to return
appropriate results. The Guruji search engine covers a range of specific content: news,
entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into multilingual pages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development of corpus creation/collection and content syndication. Moreover, it
provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government for e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gujarati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970, the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In addition,
Indian languages Transliteration (ITRANS), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et
al., 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also working on the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defense Research and Development Organization), the Department of
Atomic Energy, the All India Council of Technical Education and the UGC (University Grants
Commission), are also involved, directly or indirectly, in research and
development of language technology. All these agencies help develop important
areas of research and provide research funds to development agencies. As an
end result, IndoWordNet was developed for Indian languages on the pattern of the
English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In the
last year alone, more than 10-15 Indian local search engines must have been
launched: some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to give a brief overview of the current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefind.in, along with Ask Laila, which launched a couple of days back. There are
also localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by
JuxtConsult (2008-2009 Report)
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information with Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem with Indian languages because of multiple spellings,
morphological variants of keywords and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, the Hindi query for "world trade center
aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that these keywords exist in the second search result. But just saying that we have a
recall problem is not sufficient. The next obvious question is
how much of a problem it is. In other words, we need to
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs, and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved Recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but grammatical and
structural standards are rarely followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, there are about 28
languages spoken in India, and Hindi, being the National Language of India, is
influenced by the regional languages, which changes its dialects not only in
speech but also in writing. Every language attaches markers to a root word to
construct new words: English uses suffixes such as -s, -es and -ing, and Hindi
uses maatraas such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and
Yojnaayein (योजनाएँ) are morphological
variants of the root word Yojnaa (योजना; planning in English). It is desirable to map all the
morphological variants of a word to a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
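Collapsing morphological variants into one canonical form can be sketched as a longest-match suffix stripper. The Python sketch below is illustrative only: the rule list and the name `stem_hindi` are assumptions made for this example, covering a handful of plural endings. It is not a complete Hindi stemmer, and it is not the method used by any of the search engines discussed in this chapter.

```python
# (suffix, replacement) pairs. This tiny rule set is an assumption made
# for illustration; a usable stemmer needs far more rules and exceptions.
SUFFIX_RULES = [
    ("ियों", "ी"),  # पक्षियों -> पक्षी
    ("ओं", ""),     # योजनाओं -> योजना
    ("एँ", ""),     # योजनाएँ -> योजना
    ("एं", ""),     # योजनाएं -> योजना (anusvara spelling)
    ("ों", ""),     # झीलों -> झील
]
# Try longer suffixes first so that "ियों" is checked before "ों".
SUFFIX_RULES.sort(key=lambda rule: len(rule[0]), reverse=True)


def stem_hindi(word: str, min_stem_len: int = 2) -> str:
    """Return a root-like form of `word` by stripping one known suffix."""
    for suffix, replacement in SUFFIX_RULES:
        remaining = len(word) - len(suffix)
        # Refuse to strip when too little of the word would be left.
        if word.endswith(suffix) and remaining >= min_stem_len:
            return word[:remaining] + replacement
    return word


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएँ"]:
        print(variant, "->", stem_hindi(variant))
```

Applying the same function to both the indexed terms and the query terms is what lets a query in one variant match documents written in another.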
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages, multiple dialects, transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabet. The variety in the alphabet,
the different dialects and the influence of regional and foreign languages have resulted in
spelling variations of the same word. For example, the Hindi word
अ गर ज (angrējī, meaning English) has several possible spelling variations.
There are numerous words that are phonetically equivalent but vary in writing.
The word school can be written in Hindi in different ways (सक र, सक र, सक र).
When information is searched for the standard keyword for school (सक र) and for a
non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by
other regional languages, which results in phonetic variety in words. For
example, the English word school (सक र in Hindi) is pronounced and written as
ISKOOL (इसक र) by the majority of the population of India in different states. For the
Hindi word ISKOOL (इसक र), more than two thousand results are found. Search
engines should be capable of retrieving results for phonetically equivalent
forms of the keywords entered. A user may use any keyword for searching,
and search engines should be able to support all phonetically equivalent
words.
Also, no particular standard exists for writing a keyword to fetch Hindi web
data. For each phonetically equivalent keyword in a query, the results
vary, i.e. a different set of documents is retrieved, with little overlap.
A native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
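One way to make phonetically equivalent spellings retrieve the same documents is to reduce every keyword to a normalized key before indexing and before querying. The Python sketch below is a minimal illustration under stated assumptions: the three rules (drop the nukta, treat chandrabindu as anusvara, treat a half nasal consonant before another consonant as anusvara) and the name `phonetic_key` are choices made for this example, not an established standard and not the behaviour of any search engine named here.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्

# A nasal consonant (ङ ञ ण न म) + virama directly before another consonant,
# for example the half न in आन्दोलन.
NASAL_CONJUNCT = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word: str) -> str:
    """Map common spelling variants of a Hindi word to one comparison key."""
    # NFD splits precomposed nukta letters (such as U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = NASAL_CONJUNCT.sub(ANUSVARA, text)
    return text


if __name__ == "__main__":
    for spelling in ["आन्दोलन", "आंदोलन", "आँदोलन"]:
        print(spelling, "->", phonetic_key(spelling))
```

The key is meant only for matching, not for display. Rules like these deliberately over-merge (the nukta can distinguish genuinely different words), so a real system would have to weigh the gain in recall against the loss in precision.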
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the following
commonly spoken synonyms: गहन ज वयअर क य
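Synonym handling in retrieval is often done by query expansion: the original query is issued together with copies in which a term is replaced by a synonym. The Python sketch below assumes a small hand-made synonym dictionary; the dictionary contents (this sketch's reading of the ornament example), the name `expand_query` and the whitespace tokenization are all illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical synonym dictionary: ornament and three near synonyms.
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the query followed by variants with one term swapped for a synonym."""
    terms = query.split()
    variants = [query]
    for position, term in enumerate(terms):
        for alternative in synonyms.get(term, []):
            candidate = " ".join(terms[:position] + [alternative] + terms[position + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


if __name__ == "__main__":
    for variant in expand_query("आभूषण"):
        print(variant)
```

The sketch swaps words as they are and does not adjust inflection or agreement in the surrounding words, so in practice it would be combined with stemming.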
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples
below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to
sleep. In the context of the query above, सोना means gold, but the query can also be
interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the query above it means common;
however, in Hindi it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in a
given context is an intricate task. One word can have more than one meaning, and
the meaning depends on the context of the sentence. For example, कय means tax
in one context (synonyms: बम ज श लक स द भहस ऱ ट कस), hand or arm in another
(synonyms: हसत फ ह आच शफय), and to do (कयन) in yet another.
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways. It has
affected the pronunciation of Hindi words, and many English words have been
localized in India. Some of these words appear as if they were native Hindi words,
and Indians are sometimes unable to recall the Hindi equivalent of the English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
Indians with no formal education, without awareness of the language the words
come from. Most Indians use these words in English rather than in their native language. English
has influenced Hindi not only in speech but in writing too.
This becomes especially evident in Hindi literature on the web.
The influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from that set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental Patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Figure 4.1 Precision values of the three search engines
[Figure: precision values (0 to 1) of Google, Bing and Guruji for queries 1.1 to 11.1]
Table 4.19 Precision values of the three search engines
S. No. | Query | Precision@10 Google | Precision@10 Bing | Precision@10 Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
It has been observed that all three search engines return more documents
when the query is submitted with the root word. This justifies searching for
documents using the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological
variants of the root words, whereas Bing and Guruji list only the root word
supplied, in almost all the sample queries in the table above.
From these results it is evident that only Google indexes document
keywords in their root form. Bing and Guruji do not index in that form, which is
why they retrieve fewer documents than
Google. The overall comparison of results from the three search engines in the tables
above shows that, in general, the quantity of results retrieved increases when
keywords are used in their root form. For search engines, however, the quality of
results is more important than the quantity. Figure 4.1 and Table 4.19 show the
comparison of the precision values of the three search engines. The precision
value is calculated from the top 10 results of each search engine. On closely
observing the results, we can say that the precision value for Google is high in
almost all queries. Since, as mentioned above, Google indexes keywords in their root
form, it can be said that the relevance of its results is also higher
than that of the other two search engines. This indicates that not only the quantity
but also the quality of results is affected by morphological variations in
keywords.
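The precision values discussed above are precision at 10: the fraction of the top 10 results that are judged relevant. A minimal Python sketch, assuming relevance judgments are available as a list of booleans in rank order. The name `precision_at_k` is hypothetical, and dividing by k even when fewer than k results are returned is a choice made in this sketch.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the first k ranked results that are marked relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevance)[:k]
    # Missing results (fewer than k returned) count as non-relevant.
    return sum(1 for is_relevant in top_k if is_relevant) / k


if __name__ == "__main__":
    # Five relevant results among the top ten.
    judgments = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judgments))
```

Because the denominator is fixed at k, an engine that returns only one (relevant) result scores 0.1 rather than 1.0, which matches the way low result counts depress the values in the tables.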
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for
phonetically equivalent forms of the keywords entered. A user may use any
keyword for searching, and search engines should be able to support all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. The table below shows the results and precision obtained.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2 Precision charts for phonetic nature of Hindi language
[Figure: five pie charts, Query No. 1 to Query No. 5, with one slice per phonetic variant of each query]
In the above table and figure it can be clearly seen that search engines return a
handful of documents for various phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing a keyword to fetch Hindi
web data. For each phonetically equivalent keyword in a query, the results
vary, i.e. a different set of documents is retrieved, with little
overlap. From the precision chart it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost the same.
A native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the
following commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3 below show the
comparison of precision values across the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
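S. No. | Query | Google docs | Google P@10 | Bing docs | Bing P@10 | Guruji docs | Guruji P@10
1 | स न क आब षण (standard Hindi words and synonyms: आब षणगहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (standard Hindi words and synonyms: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi words and synonyms: सतर न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi words and synonyms: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi words and synonyms: व)
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (standard Hindi words and synonyms: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1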
Figure 4.3 Comparison of precision values against three search engines
[Figure: precision values (0 to 1) per query, 1.1 to 6.3, for the three search engines]
From the examples above it is observed that using Hindi keywords together with their
synonyms improves information retrieval for a query in Hindi.
Using synonyms of Hindi keywords affects not only the quantity of documents
returned but also their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines, and that Guruji returns the
fewest; the reason may be the
availability of fewer documents or poor indexing. However, we are more interested in
the quality of results than in the quantity. As far as quality is concerned, it can be
clearly seen that Google and Bing provide better-quality results than Guruji. In the
average case Google still ranks first, that is, its precision values
are higher than those of Bing and Guruji. It thus becomes clear
that changing a keyword into a synonym can yield equivalent results.
Therefore it is evident that synonyms of keywords play an important role in
Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected ones. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the fifth
column holds the same ambiguous keyword in the other context. The fourth and sixth
columns give the meaning of the query in English with respect to the ambiguous
keyword in that context.
Table 4.22 List of randomly selected ambiguous queries
S. No. | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war
The ambiguous queries in Table 4.22 were tested against
three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 134
Table 425 Ambiguity test for Guruji
From the results in the tables above, it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In each table, the last column, labeled "Other Context",
holds the number of results that are not relevant to the query supplied, or
that contain the keyword in some other, non-required context. The results show
that all the search engines return documents in different contexts; search
engines therefore underperform when supplied with ambiguous queries. The
numbers in the "Other Context" column signify the deviation from relevance.
For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars),
the "Other Context" column contains 5 documents for Google, 8 for Bing and all
10 for Guruji.
In another query, सपेरों का फन (art of snake charmers), the keyword has a second
context (the snake charmer's snake head). The retrieved documents are expected
to be in the context "art", but the results show that Google returns all 10
documents in the non-required context (snake head) and Bing returns 9, whereas
Guruji fails to retrieve even a single document. In such a scenario it becomes
important for search engines to address the ambiguity of keywords in order to
obtain better results.
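The deviation described above can be tallied with a few lines of code. The following is only a minimal sketch: it uses the top-10 counts reported for Google in the ambiguity table (results in the first context column, in the second context column, and in "Other Context"), and the names `GOOGLE_TOP10` and `expected_share` are illustrative, not part of the study's software.

```python
# Top-10 counts per query from the Google ambiguity table:
# (first/expected context, second/alternative context, other context)
GOOGLE_TOP10 = {
    1: (5, 2, 3),
    2: (3, 3, 4),
    3: (7, 3, 0),
    4: (0, 10, 0),
    5: (2, 3, 5),
}


def expected_share(counts):
    """Fraction of the judged results that fall in the expected context."""
    total = sum(counts)
    return counts[0] / total if total else 0.0


shares = {query: expected_share(c) for query, c in GOOGLE_TOP10.items()}
for query, share in shares.items():
    print(f"Query {query}: {share:.0%} of judged results in the expected context")
```

A share well below 100% indicates that the engine did not separate the senses of the ambiguous keyword for that query.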
Query | Ambiguous keyword | Documents returned (Guruji) | Expected context | Results found | Alternative context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but in writing too,
and this becomes especially evident in Hindi literature on the web. The
influence of English on Hindi has been observed to be one of the most important
parameters for Hindi information retrieval, as the following example explains.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). The
word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज,
एकस यस इज, एकसयस इज. Following the pattern of such phonetic variations, a
variety of popular keywords and queries from various domains have been tested
in the experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
English word | Hindi forms (in order: Google transliteration, standard Hindi keyword, two phonetic equivalents) | Google result counts (same order)
Woman | वोभन व भन व भ न व भ न | 42200 101000 3520 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 246000 1390 3710
Cancer | क सय क सय क नसय क नसय | 1880000 35700 6490 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 263200 2440 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 368000 1110 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 1040000 2610000 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 1070000 1420 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 735000 97000 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 125000 2510 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 38000 639 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 1440 51300 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that, for a single-keyword query, the
search engine returns documents not only for the standard keyword but also for
every phonetic variant of it, and in large numbers. It can also be seen that
people have their own ways of writing these words in Hindi and that no standard
is followed for storing Hindi data on the web: documents are retrieved for
every phonetic variant of an English keyword written in Hindi script. In the
above table, the column with bold Hindi entries shows the keywords obtained
with the Google transliteration tool. Table 4.26 shows that transliteration
does not provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be म तनवमसमटी, whereas Google
transliteration provides उतनव मसमतम, which is completely wrong. The table
shows that 1540 documents were nevertheless retrieved for the wrong keyword
उतनव मसमतम (University), and the same holds for the other single-word queries:
Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy
(ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website
developers use unchecked and non-standard transliteration, which makes the
Hindi IR process a difficult task.
Table 4.26 (continued): Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000

In the next example, multiword Hindi queries are selected to test the effect
of English influence on Hindi IR, in terms of both precision and the quantity
of documents retrieved. Each Hindi query is transformed into its variants by
the software (whose design and working are discussed in the next chapter) by
replacing Hindi keywords with English keywords written in Hindi script, without
changing the meaning of the query. The queries are transformed at two levels.
At the first level, one Hindi word is replaced by its English equivalent; at
the second level, more than one word is replaced by English equivalents, again
without changing the meaning of the original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ" (videshi nivesh Bharat mein), where the Hindi keyword
षवद श means "foreign" (written in Hindi script as प य न) and तनव श means
"investment" (इनव सटभट). The query is transformed at the two levels as:
Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query supplied by the user is thus transformed into two
equivalent forms containing a mixture of English and Hindi, while the meaning
of the query remains the same. From the sample set of one hundred queries, some
randomly selected queries are presented below in Table 4.27.
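The two-level transformation can be sketched in code. This is a minimal illustration, not the software described in the next chapter: the lookup table `ENGLISH_EQUIVALENTS` and the function `transform_query` are hypothetical names, and the words are given in Roman transliteration for readability, whereas a real tool would hold Devanagari strings.

```python
from itertools import combinations

# Hypothetical lookup table: Hindi keyword -> English equivalent.
# Romanised here; in practice both sides would be written in Hindi script.
ENGLISH_EQUIVALENTS = {
    "videshi": "foreign",
    "nivesh": "investment",
}


def transform_query(query, equivalents=ENGLISH_EQUIVALENTS):
    """Return (level1, level2) variants of a whitespace-separated query.

    Level 1 replaces exactly one Hindi keyword with its English
    equivalent; level 2 replaces more than one keyword.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in equivalents]
    level1, level2 = [], []
    for count in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, count):
            variant = [
                equivalents[word] if i in positions else word
                for i, word in enumerate(words)
            ]
            target = level1 if count == 1 else level2
            target.append(" ".join(variant))
    return level1, level2


level1, level2 = transform_query("videshi nivesh bharat mein")
print(level1)
print(level2)
```

For this query the sketch yields two level-1 variants (one per replaceable keyword) and one level-2 variant; the example in the text shows only one variant per level.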
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000

Figure 4.4 Transformed queries into two equivalent senses [bar chart: Google
result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants]
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines, however, the quality
of results matters more than the quantity; Table 4.28 and Figure 4.5 therefore
present an analysis of precision values. Three popular search engines, namely
Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28 Analysis of precision values (tabular representation)
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Precision@10 Google (Hindi query, L1, L2) | Precision@10 Bing (Hindi query, L1, L2) | Precision@10 AltaVista (Hindi query, L1, L2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be seen that Hindi data of a similar
nature can be mined for Hindi queries by transforming them into variants that
include English keywords written in Hindi script. The transformation of queries
resulted in an increase in retrieved data. The relevance of the retrieved data
can be seen in the precision columns. For every Hindi query and its transformed
variations, the degree of relevance of the documents is very close, equal or
improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10
documents returned by Google are relevant, and for the transformed queries of
similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10
documents are again relevant; a similar pattern is seen for the rest of the
Hindi queries, as shown in the table above. Without transformation of Hindi
queries, the user may miss relevant information, because a Hindi user may not
be aware that such information is present on the web and may be unable to
formulate the variant queries that reflect the influence of English. From the
above table it can be said that English-influenced Hindi information is present
on the web and is increasing day by day. By including English keywords written
in Hindi script in the query, the scope of searching in Hindi and of obtaining
relevant information can be increased.
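The precision values discussed above are precision at 10: the share of the first ten returned documents judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the list of relevance judgments are illustrative assumptions, since the study's own judgments are reported only as totals.

```python
def precision_at_k(judgments, k=10):
    """Precision at k: relevant documents among the first k results, divided by k.

    `judgments` is a ranked list of booleans (True = relevant). If fewer
    than k results were returned, the missing ranks count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for relevant in judgments[:k] if relevant) / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
example = [True] * 9 + [False]
print(precision_at_k(example))
```

Dividing by k rather than by the number of results returned means that an engine returning no documents at all scores 0 for that query.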
[Inter-precision chart: precision at 10 for Google, Bing and AltaVista; x-axis
HQ1 L1 L2 to HQ6 L1 L2, where HQ = Hindi Query and L = Level]
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
Relevant information can nevertheless be mined by transforming the Hindi
queries. Search engines neither transform the query nor find keyword
equivalents, because they may run into performance and throughput problems if
parameters such as Hindi phonetics, synonyms and English-equivalent Hindi
keywords are implemented at the root level. This problem can, however, be
solved at the interface level. Therefore, to lessen the effort required of a
Hindi user searching for such information, a software tool has been developed
(described in detail in the next chapter) that acts as an interface between the
user and the search engines. With the help of this tool, the user can widen the
scope of search on the web in the Hindi language [66][67].
Unicode Standard does not attempt to provide any direct mappings for these
scripts to the Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was
published, the Bureau of Indian Standards published a new version of ISCII in
Indian Standard (IS) 13194:1991. This new version partially modified the layout
and repertoire of the ISCII-1988 standard. Because of these events, the Unicode
Standard does not precisely follow the layout of the current version of ISCII.
Nevertheless, the Unicode Standard remains a superset of the ISCII-1991
repertoire, except for a number of new Vedic extension characters defined in IS
13194:1991 Annex G: Extended Character Set for Vedic. Modern, non-Vedic
texts encoded with ISCII-1991 may be automatically converted to Unicode code
points and back to their original encoding without loss of information.
4.3.4.2 Encoding Principles
The writing systems that employ Devanagari and other Indic scripts
constitute abugidas, a cross between syllabic writing systems and alphabetic
writing systems. The effective unit of these writing systems is the
orthographic syllable, consisting of a consonant and vowel (CV) core and,
optionally, one or more preceding consonants, with a canonical structure of
(((C)C)C)V. The orthographic syllable need not correspond exactly with a
phonological syllable, especially when a consonant cluster is involved, but the
writing system is built on phonological principles and tends to correspond
quite closely to pronunciation. The orthographic syllable is built up of
alphabetic pieces, the actual letters of the Devanagari script. These pieces
consist of three distinct character types: consonant letters, independent
vowels and dependent vowel signs. In a text sequence, these characters are
stored in logical (phonetic) order [62].
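Logical order can be inspected directly with Python's standard `unicodedata` module. The sketch below is only an illustration; the sample word हिन्दी ("Hindi") is chosen here and is not taken from the cited source. Its first dependent vowel sign is drawn to the left of the consonant, yet it is stored after the consonant it follows in pronunciation.

```python
import unicodedata

# The word "Hindi" in Devanagari, spelled out as code points:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Collect the character names in stored (logical) order.
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}  {name}")
```

The consonant letter HA comes first in memory and VOWEL SIGN I second, although the sign is rendered before the consonant. The VIRAMA between NA and DA marks the consonant cluster of the second orthographic syllable.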
4.4 Indian Languages on the Internet
The rise of Hindi, Urdu and other Indian languages on the Web has led
millions of non-English-speaking Indians to discover uses of the Internet in
their daily lives. They are sending and receiving e-mails, searching for
information, reading e-papers, blogging and launching Web sites in their own
languages. Two American IT companies, Microsoft and Google, have played a big
role in making this possible.
A decade ago, there were many problems involved in using Indian languages on
the Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos: more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles, there was no way to read the other
person's documents." But the advent of Unicode support for Hindi and Urdu
changed all that. The new character encoding from the Unicode Consortium, a
nonprofit in California whose members include Google, IBM, Oracle, Microsoft,
Sun Microsystems, Yahoo and the Government of India, proved to be a boon for
Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its
operating system in 2001. "Since then, Hindi Unicode support has been a part of
all subsequent upgrades of Microsoft's operating systems. Also, providing Input
Method Editor facilities gives users the option to use different types of
keyboards," says Meghashyam Karanam, product manager, vision and localization,
at Microsoft India. The earlier system could incorporate only 127 characters,
which is not enough for the Hindi Devanagari script. The Unicode system can
incorporate up to 65000 characters. As most computers in India use Microsoft's
operating system, this ensured that the Hindi font became available to most of
them as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. The Hindi version of Microsoft
Office 2007 is now also available. "It includes Hindi language interface packs
that allow users to create documents and communicate with others in Hindi.
Users can also navigate using the menus and toolbars that are in Hindi. We have
received a very good response from the Hindi users," says Karanam. Urdu
language support is available in Windows Vista and Office 2007. Another
Microsoft initiative is Project Bhasha, which was launched in 2003 and now
provides support for 13 Indian languages, such as Hindi, Tamil, Kannada,
Punjabi, Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in
Washington State, partnered with one of the early Hindi portals, webdunia.com,
to launch its MSN Hindi portal. "Webdunia also provided support for the Hindi
version of Microsoft Office as well as for language interface packs," says
Jaideep Karnik, general manager for content and localization at webdunia.com.
The Indore, Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products. If Microsoft built the base
for Hindi, Google was ready to put up the superstructure. Realizing the
potential of Indian languages, the California-based company has launched
various products in the past two years. With the Google Hindi and Urdu search
engines, one can search all the Hindi and Urdu Web pages available on the
Internet, including those that are not in a Unicode font. "Google offers
searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to name
a few; Gmail in five languages; and Google transliteration in Bengali,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu
and Urdu. Urdu is the most recent language that Google has added to its
offerings," says Rahul Roy-Chowdhury, product manager at Google India. To use
the search function, "users can type Hindi words in Roman script and a
drop-down menu suggests several Hindi phrases. By selecting the appropriate
query, users can search for Hindi content without even typing in Hindi," says
Roy-Chowdhury. Google has more useful tools for non-English users. Google News
is available in Hindi. With the Google translation engine, one can type English
words and get a list of suggested synonyms in Hindi. A transliteration tool
allows users to type any word in English, hit the space bar and get the same
word in a different language. Roy-Chowdhury explains the process of adding a
new language.
"Google offers products first in Google Labs and waits for feedback from users
for a couple of months. Then the feedback is collated and the product is
updated before introducing the language with its other offerings like Gmail,
Search, Blogger, Translate and Orkut, to name a few." "Urdu is currently
available in Google's transliteration offering on the Google Labs Web site, and
the language is soon to be introduced in various other products," he adds. The
efforts of Microsoft, Google and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast, and most of
the popular Hindi newspapers now have a Web presence. "In the last two years,
page views of navbharattimes.com have increased significantly, and half of them
come through Google, as Net users generally search for a specific news item or
query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users.
From all the audience measures for this product, this has been a resounding
success," says Gopal Krishna, head of Yahoo's work for emerging market
audiences. Since Yahoo and Jagran started working together, page views have
"grown to about 14 million from one million a year and a half earlier," says
Upendra Swami, who heads the Internet team at Jagran. Hindi Wikipedia, hosted
by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in
July 2003, Hindi Wikipedia now has more than 36000 articles. "It now appears to
be the 52nd largest Wikipedia in size, compared to the over 260 individual
language Wikipedias," says Jay Walsh, head of communications at the
California-based Wikimedia Foundation. "Considering there are millions of Hindi
speakers, it is certainly an important part of the Wikimedia Foundation's
mission to support the growth of this project," says Walsh. Urdu Wikipedia,
started in January 2004, has more than 10800 articles. What are the challenges
that still remain in the popularization of Hindi and Urdu on the Internet? "The
major challenge is Internet penetration and PC prices. The moment we have
better Internet penetration, especially in smaller towns, and PC prices go
down, Hindi and Indian languages can flourish on the Net," says Karnik of
webdunia.com. India had more than 49 million Internet users in June 2008, of
which about 9 million used the Internet regularly, according to a study by
Juxtconsult India, a research company. "There is a big opportunity in Indian
languages. Studies showed that only 28 percent of Indian Web surfers preferred
English on the Web, but as good-quality content in Indian languages was not
easily available, they did not visit many local language sites," says
Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to
get the widest audience reach, one has to look at Hindi, because in a country
of over a billion people, English is spoken by less than 80 million people,"
says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is
not a luxury but a powerful tool to improve life. But is Hindi earning enough
revenue to be dubbed successful on the Web? "That's a tough question. Right
now, it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to
come once Hindi reaches a critical volume. "If we look at how the Internet
developed in the US, it may provide a useful analogy. First came content, which
was mostly produced by people who had a passion for putting up content they
cared about. Traffic and monetization was not the motive. Second came growing
readership, as people started discovering content. This set off a virtuous
cycle in which content eventually became a viable, monetizable business. Third
were the application developers, who could now focus on moving the online
experience beyond passive consumption of information to interactivity,
community building, service delivery and a host of other innovations,"
Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And
I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian
English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in the year
1978, collected for a comparative study of American, British and Indian English
(Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for
the development of Indian language corpora. It has coordinated with various
Indian agencies and universities to develop corpora of more than 45 million
words in the Scheduled Languages of India, which is also a part of the TDIL
program. The Enabling Minority Language Engineering
(EMILLE) program provides corpora, architecture and tools for Asian languages.
It has a monolingual corpus, which contains approximately 96157000 words, and a
parallel corpus consisting of 200000 words of text in English, which supports
translation involving Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus that serves as a repository of 'One Million Pages'
of knowledge-based text. Mahatma Gandhi International University has started
the project 'Hindi Samghraha' to build a repository of Hindi words and a
dialect mapping of Hindi. The Department of Information Technology of the
Government of India has started a project for developing Indian language
corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium
project for building parallel annotated corpora under the leadership of Dr.
Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages as well as
English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Efforts in this field date from 1980 and involved
prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST
Mumbai and CDAC Pune. During the late 1990s, many new projects were undertaken
by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University
Kolkata. Since April 2008, TDIL has run a consortium-mode project for building
computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni
(University of Hyderabad). The goal of this project is to build children's
stories using multimedia and e-learning content.
4.5.2 Anglabharati
IIT Kanpur has developed the Anglabharti machine translation technology,
from English to Indian languages, under the leadership of Prof. R. M. K. Sinha.
It is a rule-based system with approximately 1750 rules and 54000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo-code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at httpanglahindiiitkacin. The Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004, with new
approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully-automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) is a brainchild of the
Indian Government, started in 1996, for the translation of government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra software
is available in desktop, network and web-based forms. It uses the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent English as well as Hindi
grammar. Initially it was domain-specific, covering Personnel Administration,
specifically Gazette Notifications, Office Orders, Office Memorandums and
Circulars; gradually the domains were expanded. At present it also covers
domains such as Banking, Transportation and Agriculture. Earlier, Mantra
technology was only for English to Hindi translation, but it is now also
available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings
such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin
Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of
the Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based
machine translation system for English to Hindi. UNL is a United Nations
project for developing an interlingua for the world's languages. The UNL-based
machine translation system is being developed under the leadership of Prof.
Pushpak Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). The project is
funded by the Karnataka Government and is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the translated sentence generated by
the system, the user can supply new words, phrases and sentences as feedback
and obtain a newly interpreted translation. The Shiva MT system is available at
(httpebmtserciiscernetinmtloginhtml). Shakti combines a rule-based approach
with statistical methods. It is used for translation from English to Indian
languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at
(httpshaktiiiitnet) [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The system is
based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has a small set of transfer rules. Users can
access the system at httpwwwaukbcorgresearch_areasnlpdemomat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn Treebank
annotation system for part-of-speech tagging. It translates an English sentence
into Unicode-based Bengali text. Users can access the system at
httpbengalinuxsourceforgenetcgibinanubadokindexpl
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at the Punjab University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the
paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT: Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query, it digs deep into Indian-language content and tries to return
the appropriate output. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business,
education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to Indian
languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam,
Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as
Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a
syllable-based conversion model. WikiBhasa is Microsoft's multilingual content
creation tool for translating Wikipedia pages into other languages: the source
language in WikiBhasa is English, and the target language can be any Indian
local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.1.2.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian-language software. It provides Indian language enablement
technology to many state governments and the central government for e-governance
programs. It has developed software for multilingual content creation for
newspaper publishing and has also developed high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gujarati package for the Gujarati language, which is useful in the DTP sector,
corporate offices, and the e-governance program of the Government of Gujarat.
4.5.1.3 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970 the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In addition,
ITRANS (Indian Languages Transliteration), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et
al. 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also working towards the
proliferation of language technology in India. Other Indian government
ministries, departments, and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the Department of
Atomic Energy, the All India Council of Technical Education, and UGC (University Grants
Commission), are also involved, directly and indirectly, in research and
development of language technology. All these agencies help develop important
areas of research and provide research funds to development agencies. As an
end result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.1.3.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate, and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very fast. In
the last year alone, more than 10-15 Indian local search engines must have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is now so crowded that it is difficult to know who is really
winning. Nevertheless, we attempt a brief overview of the current scenario.
The following fall into the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo, and
lemmefindin, along with Ask Laila, which launched a couple of days ago. In addition,
there are localized versions of the big players Google, Yahoo, and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition). It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
improving its services.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Websites     2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies
using present-day popular techniques shows a severe problem of low recall
when accessing information with Indian language queries. For instance,
popular web search engines such as Google, Yahoo, and Guruji often return 0
search results for Indian language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a recall
problem when dealing with Indian languages because of multiple spellings,
morphological variants of keywords, and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, a Hindi query for "world trade center
aatank-waadi hamlaa", "वलडम टर ड स नटय आत कव दी हरभ", is shown to return
0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that documents containing these keywords do exist. Merely stating that there is a
recall problem is not sufficient, though. The next obvious question
is "how much of a problem is it?" In other words, we need to
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query                                          Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                         0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम             0        0       0
Table 4.16 Improved recall
Hindi Query                                          Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                              8820     92      12
वलडम टर ड सटय आतॊकीअटक                               331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                     708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                       37100    7400    93
4.8 Factors Affecting the Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are seldom followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, about 28
languages are spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which results in dialectal variation not only in
speech but also in writing. Every language attaches markers to a root word to
construct new words: English uses s, es, and ing, and Hindi uses MAATRAAS such as
ऐ म ा ओ. For example, Yojnaaon म जन ओ and Yojnaayein म जन ए (planning in
English) are morphological
variants of the root word Yojnaa म जन. It is desirable to combine all the
morphological variants of a word into a single canonical form. This process is
called word stemming, and the canonical form is called the root word or base
word.
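The stemming step described above can be sketched as a small suffix-stripping routine. This is only an illustration under stated assumptions, not the stemmer used in this study: the suffix list is a hand-picked, incomplete set of common plural and oblique endings, and the names `SUFFIXES`, `MIN_STEM_LENGTH`, and `stem` are hypothetical.

```python
# Illustrative suffix list (an assumption, far from complete), written with
# Unicode escapes so that the code points are unambiguous.
SUFFIXES = [
    "\u093f\u092f\u094b\u0902",  # -iyon, as in rogiyon -> rog
    "\u0913\u0902",              # -on after a long vowel, as in yojnaaon -> yojnaa
    "\u090f\u0902",              # -yein written with anusvara
    "\u090f\u0901",              # -yein written with chandrabindu
    "\u094b\u0902",              # -on written as a vowel sign, as in jheelon -> jheel
    "\u0947\u0902",              # -en written as a vowel sign
]
# Try longer suffixes first so that -iyon wins over the shorter -on.
SUFFIXES.sort(key=len, reverse=True)

# Never strip a suffix if fewer than this many code points would remain.
MIN_STEM_LENGTH = 2


def stem(word):
    """Return the word with the first matching suffix removed, if any."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# The root Yojnaa and its two variants from the text reduce to one form.
yojnaa = "\u092f\u094b\u091c\u0928\u093e"
variants = [yojnaa, yojnaa + "\u0913\u0902", yojnaa + "\u090f\u0902"]
print({stem(v) for v in variants} == {yojnaa})
```

A rule list like this over-strips some words and misses irregular forms, so a practical stemmer would need a larger suffix set and an exception list.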
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages, multiple dialects, the transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabets. The variety in the alphabet,
the different dialects, and the influence of regional and foreign languages have resulted in
spelling variations of the same word. For example, the Hindi word
अ गर ज (angrējī, meaning English) has several possible spelling variations.
There are numerous words that are phonetically equivalent but differ in writing.
The word school can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword school सक र and the non-standard
phonetically equivalent Hindi keyword सक र, Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by
other regional languages, which results in phonetic variety of words; for
example, the English word school (सक र in Hindi) is pronounced and written as
ISKOOL इसक र by the majority of the population in different states of India. For the
Hindi word ISKOOL इसक र, more than two thousand results are found. Search
engines should be capable of retrieving results for words that are phonetically equivalent
to the keywords entered. Users may use any keyword for searching,
and search engines should support all phonetically equivalent
words.
Also, no particular standard exists for writing keywords to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces variation in the
results, i.e., a different set of documents is retrieved with little repetition.
A native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss relevant information.
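One way a search system could treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalized key before matching. The sketch below is an assumption for illustration, not a method described in this chapter: it folds three common sources of variation (nukta, chandrabindu versus anusvara, and a half nasal consonant versus anusvara), and the name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
# Nasal consonants nga, nya, nna, na, ma, which are often replaced by
# anusvara when they appear as a half letter before another consonant.
NASAL_CONSONANTS = "\u0919\u091e\u0923\u0928\u092e"

# A nasal consonant plus virama, followed by any consonant (ka to ha).
HALF_NASAL = re.compile(
    "[" + NASAL_CONSONANTS + "]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word):
    """Map spelling variants of a Devanagari word to one crude key."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    return HALF_NASAL.sub(ANUSVARA, text)


# Three spellings of "andolan": half na, anusvara, and chandrabindu.
spellings = [
    "\u0906\u0928\u094d\u0926\u094b\u0932\u0928",
    "\u0906\u0902\u0926\u094b\u0932\u0928",
    "\u0906\u0901\u0926\u094b\u0932\u0928",
]
print(len({phonetic_key(s) for s in spellings}))
```

The key is deliberately coarse: dropping the nukta, for example, also merges letters that some writers keep distinct, so it is better suited to grouping candidate variants than to deciding that two words are identical.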
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations, and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the following
commonly spoken synonyms: गहन ज वयअर क य
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples
below show this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (न यी क स न ऩस द ह )
In this query the word स न (gold) is ambiguous, as it has another meaning, i.e., to
sleep. In the context of the above query the word स न means gold, but the query can also be
interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आभ र गो की ऩस द)
Here the word आभ is ambiguous. In the above query it means common;
however, in Hindi it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous. Finding the correct sense of a word in a
given context is an intricate task: one word has more than one meaning, and
the meaning depends on the context of the sentence. For example, कय (tax) has the
synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कय
(hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कय (to do) corresponds to कयन.
4.8.5 Influence of English on Hindi Information Retrieval
English has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been
localized in India, and some of them appear as if they were native Hindi words.
Indians are sometimes unable to recall the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director, and department are used even by
uneducated Indians without awareness of the language these words come from. Most
Indians use these words in English rather than in their native language. English
influences Hindi not only in speech but in writing too,
which becomes more evident in Hindi literature on the web.
The influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the
following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from this set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S No    Query in Hindi                    Meaning in English
1       ब यतवष मवन                        Indian rain forest
1.1     ब यत मवष मवनो                     Indian rain forests
2       हव ईद घमटन क क यण                 Reason for air crash
2.1     हव ईद घमटन ओ क क यण               Reason for air crashes
3       ब यतभफ रीज न व रीब ष               Language spoken in India
3.1     ब यतभफ रीज न व रीब ष ए             Languages spoken in India
3.2     ब यतभफ रीज न व रीब ष ओ             Languages spoken in India
4       षवर पतह न ऩयझ र                   Lake on the verge of extinction
4.1     षवर पतह न ऩयझ र                   Lakes on the verge of extinction
5       पर क ततकआऩद                       Natural calamity
5.1     पर क ततकआऩद ए                     Natural calamities
6       ऩ कीपरज तत                        Bird species
6.1     ऩकषमोकीपरज तत                     Bird's species
6.2     ऩकषमोकीपरज ततम                    Bird's species
7       क षषसभसम                          Agriculture problem
7.1     क षषसभसम ओ                        Agriculture problems
8       कीटन शकक इसत भ र                  Use of pesticide
8.1     कीटन शकोक इसत भ र                 Use of pesticides
9       भ नमसकय ग                         Mental illness
9.1     भ नमसकय चगमो                      Mental patients
9.2     भ नमसकय ग                         Mental patient
10      गर भ णषवक सम जन                   Policy for village
10.1    गर भ णषवक सम जन ओ                 Policies for village
10.2    गर भ णषवक सम जन ए                 Policies for village
11      परभ खक षषक दर                     Major agricultural office
11.1    परभ खक षषक नदरो                   Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No    Root word / morphological variant    Documents returned: Google   Bing    Guruji
1       Root word: ब यत वष मवन
        Listing of keywords (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1     ब यत वष मवन                          50500     4680    485
1.2     ब यत मवष मवनो                        40400     680     61
2       Root word: द घमटन
        Listing of keywords (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
2.1     द घमटन                               133000    2410    284
2.2     द घमटन ओ                             117000    420     23
3       Root word: ब ष
        Listing of keywords (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
3.1     ब ष                                  161000    8330    961
3.2     ब ष ए                                6200      935     188
3.3     ब ष ओ                                6090      441     356
4       Root word: झ र
        Listing of keywords (Google, Bing, Guruji): झ रझ रो झ र झ र
4.1     झ र                                  4740      278     25
4.2     झ र                                  1270      28      1
5       Root word: आऩद
        Listing of keywords (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
5.1     आऩद                                  102000    4030    410
5.2     आऩद ए                                1160      64      20
6       Root word: ऩ परज तत
        Listing of keywords (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1     ऩ परज तत                             48200     1670    98
6.2     ऩकषमोपरज तत                          47600     1150    84
6.3     ऩकषमोपरज ततम                         33800     747     25
7       Root word: सभसम
        Listing of keywords (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1     सभसम                                 584000    30200   1889
7.2     सभसम ओ                               584000    7150    1356
8       Root word: कीटन शक
        Listing of keywords (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
8.1     कीटन शक                              36300     1360    333
8.2     कीटन शको                             35800     800     270
9       Root word: य ग
        Listing of keywords (Google, Bing, Guruji): य गोय ग य ग य ग
9.1     य ग                                  205000    21600   1423
9.2     य चगमो                               128000    3280    239
9.3     य ग                                  112000    6280    647
10      Root word: म जन
        Listing of keywords (Google, Bing, Guruji): म जन ओ म जन म जन म जन
10.1    म जन                                 673000    18500   3343
10.2    म जन ओ                               669000    6020    990
10.3    म जन ए                               673000    2860    416
11      Root word: क दर
        Listing of keywords (Google, Bing, Guruji): क दरीमक दर क दर क दर
11.1    क दर                                 261000    11300   655
11.2    क नदरो                               29500     1850    105
Table 4.19 Precision values of the three search engines
S No    Query                             Precision at 10: Google   Bing   Guruji
1.1     ब यतवष मवन                        0.5    0.3    0.1
1.2     ब यत मवष मवनो                     0.3    0.1    0.1
2.1     हव ईद घमटन क क यण                 0.5    0.3    0.1
2.2     हव ईद घमटन ओ क क यण               0.3    0.3    0.1
3.1     ब यतभफ रीज न व रीब ष               0.9    0.5    0.5
3.2     ब यतभफ रीज न व रीब ष ए             0.5    0.4    0.3
3.3     ब यतभफ रीज न व रीब ष ओ             0.5    0.2    0.2
4.1     षवर पतह न ऩयझ र                   0.5    0.3    0.2
4.2     षवर पतह न ऩयझ र                   0.5    0.3    0
5.1     पर क ततकआऩद                       1      0.4    0.3
5.2     पर क ततकआऩद ए                     0.6    0.6    0.1
6.1     ऩ कीपरज तत                        1      0.7    0.2
6.2     ऩकषमोकीपरज तत                     0.9    0.6    0.1
6.3     ऩकषमोकीपरज ततम                    0.9    0.6    0.1
7.1     क षषसभसम                          0.7    0.3    0.2
7.2     क षषसभसम ओ                        0.7    0.4    0.2
8.1     कीटन शकक इसत भ र                  1      0.8    0.2
8.2     कीटन शकोक इसत भ र                 0.9    0.6    0.2
9.1     भ नमसकय ग                         0.7    0.6    0.4
9.2     भ नमसकय चगमो                      0.9    0.6    0.3
9.3     भ नमसकय ग                         0.9    0.5    0.4
10.1    गर भ णषवक सम जन                   0.9    0.7    0.3
10.2    गर भ णषवक सम जन ओ                 1      0.6    0
10.3    गर भ णषवक सम जन ए                 0.8    0.6    0.2
11.1    परभ खक षषक दर                     0.5    0.4    0
11.2    परभ खक षषक नदरो                   0.4    0.2    0.2
Figure 4.1 Precision values of the three search engines (P Google, P Bing, and P Guruji plotted per query on a scale from 0 to 1)
It has been observed that all three search engines return more documents
when a query with the root word is submitted. This justifies searching for
documents with the root word because, in general, we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological
variants of root words, whereas Bing and Guruji list only the root word
supplied, in almost all the sample queries listed in the table above.
From the above results it appears that only Google indexes document
keywords in their root form. Bing and Guruji do not index in that form, which is the
reason the number of documents they retrieve is lower than that of
Google. The overall comparison of results from the three search engines in the tables
above shows that, in general, the quantity of results retrieved increases when
keywords are used in their root form. For search engines, however, the quality of
results is more important than the quantity. Figure 4.1 and Table 4.19 show the
comparison of the precision values of the three search engines. The precision
value is calculated from the top 10 results of each search engine. On closely
observing the results, we can say that the precision of Google is high for
almost all queries. Since, as mentioned above, Google indexes keywords in their root
form, the relevance of its results is also high
in comparison with the other two search engines, which indicates that not only the quantity
but also the quality of results is affected by morphological variations in
keywords.
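The precision figure used above, computed over the top 10 results, can be written down as a short function. This is a sketch of the standard precision-at-k measure; the function name `precision_at_k` and the choice to count missing results as non-relevant when an engine returns fewer than k documents are assumptions, since the chapter does not state how such cases were handled.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top k results that were judged relevant.

    `relevance` is a list of booleans in rank order, one per returned
    result. If fewer than k results were returned, the missing ranks
    are treated as non-relevant (an assumption of this sketch).
    """
    top = relevance[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


# Example: 5 of the first 10 results were judged relevant.
judgements = [True, False, True, True, False, False, True, False, True, False]
print(precision_at_k(judgements))
```

Because the relevance judgements are made by a person inspecting each result, the measure reflects result quality rather than the raw number of documents returned.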
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. Users may use any
keyword for searching, and search engines should support all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. The table below shows the results and precision offered by
Google.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Query 1
Hindi query (standard keywords): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
Keywords in query            No of results   Precision at 10
ससजिमोिहयीर                  97              0.9
ससजिमो जहयीर                 632             0.9
सिबजमो िहयीर                 194             0.9
Query 2
Hindi query (standard keywords): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
Keywords in query            No of results   Precision at 10
आसभ न भहॊगाई                 35300           1.0
आसभ न भहॉगाई                 1040            1.0
आसभ भह ग ई                   14              0.6
आसभाॊ भह ग ई                 563             0.7
Query 3
Hindi query (standard keywords): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
Keywords in query            No of results   Precision at 10
भरषटाचायआज दी                211000          0.8
भरशटाचायआज दी                214             0.6
बयषटाचाय आज दी               447             0.7
भरषटट च यआिादी               1090000         0.9
भरशट च य आिादी               1040            0.7
बयषटट च यआिादी               1190            0.8
Query 4
Hindi query (standard keywords): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
Keywords in query            No of results   Precision at 10
अनन हज य आनदोरन              84700           0.3
अनन हज य आॊदोरन              85100           0.8
अनन हज य क आॉदोरन            78              0.6
अनना हिाय क आनद रन           399             0.5
अनन हिाय क आनद रन            3260000         1.0
Query 5
Hindi query (standard keywords): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
Keywords in query            No of results   Precision at 10
फयोिगायी                     9650            0.9
फयोिगायी                     80600           1.0
फयोिगायी                     170             0.7
फयोिगायी                     30              0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No 1 to Query No 5)
In the above table and figure it can be clearly seen that search engines return a
handful of documents for various phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing keywords to fetch Hindi
web data. Each phonetically equivalent keyword in a query produces variation in
the results, i.e., a different set of documents is retrieved with little
repetition. From the precision charts it is clear that the degree of
relevance for queries containing phonetically equivalent keywords is
nearly equal. A native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations, and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the
following commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3 below show the
comparison of precision values across the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
(Documents returned and precision per 10 results for each search engine)
S No    Query                    Google     Per 10   Bing    Per 10   Guruji   Per 10
1       Query: स न क आब षण       Standard Hindi words / synonyms: आब षणगहन
1.1     स न क आब षण              217000     0.8      3250    0.7      381      0.5
1.2     स न क गहन                188000     0.8      2590    0.6      389      0.5
1.3     स न क ज वय               78900      0.8      1670    0.7      311      0.6
1.4     स न क अर क य             9490       0.5      633     0.4      70       0
1.5     स न क आबयण               493        0.5      38      0.3      1        0
2       Query: क र फ दर          Standard Hindi words / synonyms: फ दर
2.1     क र फ दर                 233000     0.7      7510    0.7      733      0.3
2.2     क र भ घ                  40700      0.9      1500    0.8      99       0.6
2.3     क र जरधय                 1570       0.6      54      0.6      2        0.2
3       Query: सतर सशिकतकयण      Standard Hindi words / synonyms: सतर न यी
3.1     सतर सशिकतकयण             9950       0.9      1570    0.7      760      0.6
3.2     न यीसशिकतकयण             29300      0.9      1910    0.9      736      0.4
3.3     भहहर सशिकतकयण            96300      0.9      5160    0.8      1091     0.3
3.4     औयतसशिकतकयण              7670       0.8      680     0.7      510      0.2
4       Query: मसक दयक अ हक य     Standard Hindi words / synonyms: अ हक य
4.1     मसक दयक अ हक य           1990       0.4      18      0.3      60       0.6
4.2     मसक दयक अमबभ न           2400       0.6      304     0.2      16       0.1
4.3     मसक दयक घभ ड             495        0.5      54      0.6      9        0.1
5       Query: व रग ओ            Standard Hindi words / synonyms: व
5.1     व रग ओ                   6960       1        698     0.8      29       0.5
5.2     ऩ डरग ओ                  13400      1        1080    0.9      143      0.9
5.3     दयखतरग ओ                 481        0.5      19      0.5      0        0
6       Query: आ खद न            Standard Hindi words / synonyms: आ ख
6.1     आ खद न                   34000      0.7      3690    0.5      312      0.3
6.2     न तरद न                  77500      1        3240    1        159      0.9
6.3     च द न                    2450       0.8      427     0.1      36       0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their
synonyms improves information retrieval for a query in Hindi.
Using synonyms of Hindi keywords affects not only the quantity of documents
returned but also their quality.
From the above table and figure it can be observed that Google returns
more documents than the other two search engines, and the Guruji search engine
returns the fewest; the reason may be
the availability of fewer documents or poor indexing. However, we are more interested in
the quality of results than the quantity. As far as quality is concerned, it can be
clearly seen that Google and Bing provide better results than Guruji. In the
average case Google still ranks first, that is, the precision values of
Google are higher than those of Bing and Guruji. Thus it becomes clear
that changing a keyword into its synonym can yield equivalent results.
Therefore, synonyms of keywords play an important role in
Hindi information retrieval.
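The role of synonyms discussed above suggests expanding a query into its synonym variants before searching. The following is a minimal sketch of that idea, not the software used in this study: `expand_query` and the thesaurus dictionary are hypothetical, and the romanized words are an assumed reading of the ornament example, used here only so the tokens are easy to follow.

```python
from itertools import product


def expand_query(query, thesaurus):
    """Return the query plus every variant obtained by substituting
    synonyms for the words that have an entry in the thesaurus."""
    # For each token, the candidates are the token itself and its synonyms.
    options = [[token] + thesaurus.get(token, []) for token in query.split()]
    return [" ".join(choice) for choice in product(*options)]


# Hypothetical thesaurus entry (romanized for readability; a real system
# would hold Devanagari strings, for example drawn from a Hindi WordNet).
THESAURUS = {"aabhushan": ["gahne", "zevar", "alankar"]}

for variant in expand_query("sone ke aabhushan", THESAURUS):
    print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is one way to recover documents that use a different synonym from the one the user typed.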
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly
selected queries below. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the fifth
column holds the same ambiguous keyword in the other context. The fourth and sixth
columns give the meaning of the query in English with respect to the ambiguous
keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S No   Query / keyword sense: meaning in English
1      न यी क स न ऩस द ह
       स न (Gold): Women like gold
       स न (To sleep): Women like to sleep
2      आभ र गो की ऩस द
       आभ (Common): Common man's choice
       आभ (Mango): Mango is people's choice
3      फ र षवक सऔय ऩ षण
       फ र (Children): Child development and nutrition
       फ र (Hair): Hair development and nutrition
4      सऩ यो क पन
       पन (Art): Art of snake charmers
       पन (Snake head): Snake charmer's snake head
5      म दध भ क र षवन श
       क र (Aggregate): Aggregate destruction in wars
       क र (Family): Destruction of families in war
The ambiguous queries listed above were tested against
three search engines: Google, Bing, and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       स न                 50800                Gold        5               To sleep     2               3
2       आभ                  488000               Common      3               Mango        3               4
3       फ र                 2900000              Children    7               Hair         3               0
4       पन                  184                  Art         0               Snake head   10              0
5       क र                 17800                Aggregate   2               Family       3               5
Table 4.24 Ambiguity test for Bing
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       स न                 2680                 Gold        2               To sleep     2               6
2       आभ                  17800                Common      3               Mango        3               4
3       फ र                 4030                 Children    6               Hair         2               2
4       पन                  25                   Art         0               Snake head   9               1
5       क र                 1900                 Aggregate   0               Family       2               8
Table 4.25 Ambiguity test for Guruji
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       स न                 109                  Gold        0               To sleep     0               10
2       आभ                  6756                 Common      3               Mango        0               7
3       फ र                 635                  Children    5               Hair         2               3
4       पन                  No results found     Art         na              Snake head   na              na
5       क र                 84                   Aggregate   0               Family       0               10
From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In the above tables the last column, labeled "Other
Context", holds the number of results that are not relevant to the query supplied,
that is, documents that contain the keyword in some other, unwanted context.
From the results it is clear that all the search engines return documents in different
contexts. Therefore it can be said that search engines underperform when supplied
with ambiguous queries. The numbers in the "Other Context" column signify
the deviation from relevance. For example, for the query म दध भ क र षवन श
(aggregate destruction in wars) the "Other Context" column contains 5
documents for Google, 8 documents for Bing, and all 10
documents for Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake
charmer's snake head), the retrieved documents are expected to be in the context "art", but
the results show that Google returns all 10
documents in the unwanted context (snake head) and Bing returns 9,
whereas Guruji fails to retrieve even a single document. In such a scenario it
becomes important for search engines to address the issue of ambiguity in
keywords to obtain better results.
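One simple way to address keyword ambiguity of the kind shown above is to choose a sense from the other words in the query. The sketch below illustrates the idea with a toy cue-word overlap, assuming a hand-built sense inventory; `SENSE_CUES`, `pick_sense`, and the romanized cue words are all hypothetical and are not part of the experiments reported here.

```python
# Hypothetical sense inventory: for each ambiguous keyword, a set of cue
# words per sense (romanized for readability; entries are illustrative).
SENSE_CUES = {
    "aam": {
        "common": {"log", "logon", "aadmi", "janta"},
        "mango": {"phal", "ped", "ras", "meetha"},
    },
}


def pick_sense(query, keyword, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the query.

    Returns None when the keyword is unknown, when no cue word occurs in
    the query, or when two senses tie, i.e. the query stays ambiguous.
    """
    context = set(query.lower().split()) - {keyword}
    scores = {
        sense: len(cues & context)
        for sense, cues in sense_cues.get(keyword, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


print(pick_sense("aam logon ki pasand", "aam"))
```

Short queries often contain no cue word at all, as in the gold and sleep example, in which case a sketch like this leaves the query ambiguous and the system would have to ask the user or return results for both senses.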
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but in
writing too, which becomes more evident in Hindi literature on the web.
The influence of English on Hindi has been observed to be one of
the most important parameters for Hindi information retrieval, as the
following example shows.
Example: The English word exercise is written in Hindi as (एकसयस इज). The
word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज
एकसयस इज एकस यस इज एकसयस इज. Following such phonetic
variations, a variety of popular keywords and
queries from various domains were tested in the experiments.
The following table presents a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
(Hindi keywords are listed in the order Google transliteration, standard Hindi keyword, phonetic equivalents; Google results follow the same order.)
English word   Hindi keywords                                   Google results
Woman          वोभन व भन व भ न व भ न                            42200     101000    3520      34800
Insurance      इनसयाॊस इ शम यस इ शम य स इनशम य नस               1220      246000    1390      3710
Cancer         क सय क सय क नसय क नसय                            1880000   35700     6490      1820
Hospital       हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर                  404000    263200    2440      100
Corruption     कोररसततओन कयपशन कयऩशन क यपशन                     1890      368000    1110      58
Computer       कॊ तमटय कमपम टय कमपम टय क पम टय                  4450000   1040000   2610000   537000
University     उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी         1540      1070000   1420      3270
Director       डियकटय ड मय कटय ड इय कटय ड मय कटय                4600      735000    97000     8300
Accident       एसकसिट एकस डट एकस ड नट ऐिकसडट                    26300     125000    2510      5160
Parliament     ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट               3240      38000     639       1120
Specialist     टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट          143       1440      51300     9820
Expert         एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम                       269000    8730      182       8
Policy         ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस                         609       395000    69700     289000
From the above table it can be observed that the search engine returns documents
for single-keyword queries and that documents are also returned, in large numbers,
for all phonetic variants of the keywords. It can also be seen that people have
their own ways of writing Hindi words and that no standard is followed for
storing Hindi data on the web. Documents are retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above table
the first Hindi entry in each row is the keyword obtained
using the Google transliteration tool. Table 4.26 shows that transliteration does
not provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be म तनवमसमटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. It is
evident from the table above that 1540 documents were retrieved for
the wrong keyword उतनव मसमतम (University), and the same holds for other single-word
queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy
ऩ मरसम, Specialist सऩ मसअमरसट. It can be inferred that Hindi website developers
make use of unchecked and non-standard transliteration, which makes Hindi IR
a difficult task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on the precision and quantity of documents
retrieved. The Hindi query is transformed into its variants by the software (its design
and working are discussed in the next chapter), which replaces Hindi keywords with
English keywords written in Hindi without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by its English equivalent, without changing the meaning of the original
Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign" (फारेन)
and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two
levels as:
फारेन निवेश भारत में (Foreign nivesh Bharat mein)
फारेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh
Bharat mein") supplied by the user is transformed into two equivalent forms
containing a mixture of English and Hindi, while the meaning of the query
remains the same. From the sample set of one hundred queries, some randomly
selected queries are presented below in Table 4.27.
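The two-level transformation described above can be sketched in a few lines of Python. This is only an illustration, not the software discussed in the next chapter: the lexicon `ENGLISH_IN_HINDI`, its two entries and the function name `transform_query` are assumptions made for this example. Unlike the worked example, which shows one variant per level, the sketch enumerates every single-word replacement at level 1 and every multi-word replacement at level 2.

```python
from itertools import combinations

# Hypothetical lexicon: Hindi keyword -> English equivalent written in Hindi script.
# The two entries follow the example in the text; a real lexicon would be far larger.
ENGLISH_IN_HINDI = {
    "विदेशी": "फारेन",          # videshi -> "foreign"
    "निवेश": "इन्वेस्टमेंट",     # nivesh  -> "investment"
}


def transform_query(query, lexicon=ENGLISH_IN_HINDI):
    """Return (level1, level2) lists of query variants.

    Level 1 replaces exactly one replaceable keyword; level 2 replaces two
    or more. Word order and all other words are left unchanged.
    """
    words = query.split()
    positions = [i for i, w in enumerate(words) if w in lexicon]

    def replace(indices):
        return " ".join(lexicon[w] if i in indices else w
                        for i, w in enumerate(words))

    level1 = [replace({i}) for i in positions]
    level2 = [replace(set(combo))
              for size in range(2, len(positions) + 1)
              for combo in combinations(positions, size)]
    return level1, level2


if __name__ == "__main__":
    level1, level2 = transform_query("विदेशी निवेश भारत में")
    for variant in level1 + level2:
        print(variant)
```

Each variant can then be submitted to a search engine alongside the original query, so that documents using either the Hindi or the transliterated English keyword are reached.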
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)
In English | Hindi Query | Influenced Hindi Query, Level 1 | Influenced Hindi Query, Level 2 | Google search results (Hindi Query / Level 1 / Level 2)
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 / 8360 / 1020
Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 / 37200 / 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 / 99100 / 1770
Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 / 51600 / 93
Foreign investment in India | विदेशी निवेश भारत में | फारेन निवेश भारत में | फारेन इन्वेस्टमेंट भारत में | 658000 / 527 / 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 / 19700 / 47000
Figure 4.4 Transformed queries into two equivalent senses (bar chart of Google result counts for Hindi Queries 1 to 6, each shown for the original query, Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the number of
retrieved documents is considerable. For search engines, however, the quality of
results matters more than the quantity; therefore Table 4.28 and Figure 4.5 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, are used for retrieving web results.
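Table 4.28 Analysis of precision values (tabular representation)
Precision is measured over the first 10 results; each cell lists Hindi Query / Level 1 / Level 2.
Hindi Query | Influenced Hindi Query, Level 1 | Influenced Hindi Query, Level 2 | Google | Bing | AltaVista
स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8
रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5
हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9 / 0.8 / 1 | 0.9 / 0.7 / 1 | 0.9 / 0.7 / 1
सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8
विदेशी निवेश भारत में | फारेन निवेश भारत में | फारेन इन्वेस्टमेंट भारत में | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4
भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1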
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be seen that Hindi data of a similar
nature can be mined for Hindi queries by transforming them into variants that
include English keywords written in Hindi. The transformation of queries
resulted in an increase in retrieved data. The relevance of the retrieved data
can also be seen in the precision columns. For every Hindi query and its
transformed variations, the degree of relevance of the documents is very close,
equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of
the first 10 documents are relevant, and for the transformed queries of similar
nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents
are also relevant; the same pattern repeats for the rest of the Hindi queries,
as shown in the table above. Without transformation of Hindi queries, the user
may miss the chance of retrieving relevant information, because the Hindi user
may not be aware that such information is present on the web and may be unable
to formulate the query variation that reflects English influence. From the above
table it can be said that English-influenced Hindi information is present on the
web and is increasing day by day. By including English keywords in Hindi script
in the query, the scope of searching in Hindi and getting relevant information
can be increased.
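The precision values discussed above are precision over the first 10 results. A minimal sketch of that measure is given below; the function name `precision_at_k` and the sample relevance judgments are illustrative assumptions, not data from the study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results judged relevant.

    `relevance` holds one truthy/falsy judgment per result, in rank order.
    If fewer than k results were returned, the missing positions are
    counted as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


# Hypothetical judgments for one query: 9 of the first 10 results are relevant.
judgments = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(precision_at_k(judgments))  # 0.9
```

Computing the same value for the original query and for each transformed variant gives one row of a table such as the one above.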
[Figure 4.5 chart: Inter Precision Chart. Horizontal axis: HQ1 to HQ6, each with L1 and L2 (HQ = Hindi Query, L = Level). Series: Google, Bing, AltaVista.]
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the actual Hindi writing
standard, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at
the root level. However, this problem can be solved at the interface level.
Therefore, to reduce the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the next
chapter) that acts as an interface between the user and search engines. With the
help of this tool, the user can widen the scope of search on the web in Hindi
[66][67].
reading e-papers, blogging and launching Web sites in their own languages. Two
American IT companies, Microsoft and Google, have played a big role in making
this possible.
A decade ago there were many problems involved in using Indian languages on
the Internet. "There was mismatch of fonts and keyboard layouts, which made it
impossible to read any Hindi document if the user did not have the same fonts.
There was chaos; more than 50 fonts and 20 keyboards were being used, and if
two users were following different styles there was no way to read the other
person's documents. But the advent of Unicode support for Hindi and Urdu
changed all that." The new character encoding from the Unicode Consortium (a
nonprofit in California whose members include Google, IBM, Oracle, Microsoft,
Sun Microsystems, Yahoo and the Government of India) proved to be a boon for
Indian languages. Microsoft incorporated the Hindi Unicode font Mangal in its
operating system in 2001. "Since then the Hindi Unicode support has been a part
of all subsequent upgradations of Microsoft's operating systems. Also, providing
Input Method Editor facilities gives users the option to use different types of
keyboards," says Meghashyam Karanam, product manager, vision and localization,
at Microsoft India. The earlier system could incorporate only 127 characters,
which is not enough for the Hindi Devanagari script. The Unicode system can
incorporate up to 65,000 characters. As most computers in India use Microsoft's
operating system, this ensured that the Hindi font was available to most of them
as they upgraded the operating software.
In 2004, the Hindi version of Microsoft Office 2003, which included Word,
Excel, PowerPoint and Outlook, was launched. Now the Hindi version of Microsoft
Office 2007 is also available. "It includes Hindi language interface packs that
allow users to create documents and communicate with others in Hindi. Users can
also navigate using the menus and toolbars that are in Hindi. We have received a
very good response from the Hindi users," says Karanam. Urdu language support is
available in Windows Vista and Office 2007. Another Microsoft initiative is
Project Bhasha, which was launched in 2003 and now provides support to 13
Indian languages such as Hindi, Tamil, Kannada, Punjabi,
Konkani and Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch
its MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps major
software developers localize their products.
If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the California-based
company has launched various products in the past two years. With the Google
Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages
available on the Internet, including those that are not in Unicode font. "Google
offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu to
name a few, Gmail in five languages, and Google transliteration in Bengali,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and
Urdu. Urdu is the most recent language that Google has added to its offerings,"
says Rahul Roy-Chowdhury, product manager at Google India. To use the search
function, "users can type Hindi words in Roman script and a drop down menu
suggests several Hindi phrases. By selecting the appropriate query, users can
search for Hindi content without even typing in Hindi," says Roy-Chowdhury.
Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in English, hit the space bar and get the same word in a different
language. Roy-Chowdhury explains the process of adding a new language: "Google
offers products first in Google Labs and waits for feedback from users for a
couple of months. Then the feedback is collated and the product is updated
before introducing the language with its other offerings like Gmail, Search,
Blogger, Translate and Orkut, to name a few." "Urdu is currently available in
Google's transliteration offering on the Google Labs Web site, and the language
is soon to be introduced in various other products," he adds. The efforts of
Microsoft, Google and other developers have begun to produce results. Page
views of major Hindi news Web sites are rising fast, and most of the popular
Hindi newspapers have a Web presence now. "In the last two years, page views of
navbharattimes.com have increased significantly, and half of them come through
Google, as Net users generally search for a specific news item or query," says
Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik
Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users.
From all the audience measures for this product, this has been a resounding
success," says Gopal Krishna, head of Yahoo's work for emerging market
audiences. Since Yahoo and Jagran started working together, page views have
"grown to about 14 million from one million a year and a half earlier," says
Upendra Swami, who heads the Internet team at Jagran.
Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering there
are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.
What are the challenges that still remain in the popularization of Hindi and
Urdu on the Internet? "The major challenge is Internet penetration and PC
prices. The moment we have better Internet penetration, especially in smaller
towns, and PC prices go down, Hindi and Indian languages can flourish on the
Net," says Karnik of webdunia.com. India had more than 49 million Internet users
in June 2008, out of which about 9 million used the Internet regularly,
according to a study by Juxtconsult India, a research company. "There is a big
opportunity in Indian languages. Studies showed that only 28 percent of Indian
Web surfers preferred English on the Web, but as good quality content in Indian
languages was not easily available, they did not visit many local language
sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to
get the widest audience reach, one has to look at Hindi, because in a country of
over a billion people English is spoken by less than 80 million people," says
Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is
not a luxury but a powerful tool to improve life.
But is Hindi earning enough revenue to be dubbed successful on the Web? "That's
a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury
thinks revenue is bound to come once Hindi reaches a critical volume. "If we
look at how the Internet developed in the US, it may provide a useful analogy.
First came content, which was mostly produced by people who had a passion for
putting up content they cared about. Traffic and monetization was not the
motive. Second came growing readership, as people started discovering content.
This set off a virtuous cycle in which content eventually became a viable,
monetizable business. Third were the application developers, who could now focus
on moving the online experience beyond passive consumption of information to
interactivity, community building, service delivery and a host of other
innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a
long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian
English. It was developed under the leadership of Prof. S. V. Shastri at Shivaji
University, Kolhapur, India, in 1988. KCIE contains approximately one million
words of Indian English drawn from materials published in 1978. It was collected
for a comparative study of American, British and Indian English (Dash). The
Central Institute of Indian Languages (CIIL) is a nodal agency for the
development of Indian language corpora. It has coordinated with various Indian
agencies and universities to develop corpora of more than 45 million words in
the Scheduled Languages of India, which is also a part of the TDIL program. The
Enabling Minority Language Engineering
(EMILLE) program provides the corpus architecture and tools for Asian
languages. It has a monolingual corpus containing approximately 96,157,000
words and a parallel corpus consisting of 200,000 words of text in English,
which helps in the translation of Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian
languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali,
Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual
parallel corpus and a repository of 'One Million Pages' of knowledge-based text.
Mahatma Gandhi International University has started the project 'Hindi
Samghraha' to build a repository of a Hindi word database and a dialect mapping
of Hindi. The Department of Information Technology of the Government of India
has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU,
New Delhi. It involves 11 Indian languages as well as English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively
young. Efforts in this field date from 1980 and involved prominent institutions
such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and CDAC Pune.
During the late 1990s, many new projects were undertaken by IIT Mumbai, IIIT
Hyderabad, AU-KBC Centre Chennai and Jadavpur University Kolkata. Since April
2008, TDIL has run a consortium-mode project for building computational tools
and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharati
IIT Kanpur has developed the Anglabharti machine translation technology for
translation from English to Indian languages under the leadership of Prof.
R. M. K. Sinha. It is
a rule-based system with approximately 1750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo
Lingua for Indian Languages) as an intermediate language. The architecture of
Anglabharti has six modules: morphological analyzer, parser, pseudo-code
generator, sense disambiguator, target text generators and post-editor. The
Hindi version of Anglabharti is AnglaHindi, a web-based application available
at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted
by various Indian institutes, for example IIT Guwahati, to develop automated
translation systems for regional languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is
based on a hybridized example-based approach. The second phase of both projects
(Anglabharti II and Anubharti II) started in 2004 with new approaches and some
structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development
project for Indian languages and English undertaken by CIF (Chinmaya
International Foundation). It is a fully-automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was a brainchild of the Indian
Government in 1996 for the translation of government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available
in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining
Grammar (LTAG) formalism to represent English as well as Hindi grammar.
Initially it was domain-specific, covering Personnel Administration,
specifically Gazette Notifications, Office Orders, Office Memorandums and
Circulars; gradually the domains were expanded. At present it also covers
domains such as Banking, Transportation and Agriculture. Earlier, Mantra
technology supported only English to Hindi translation, but it is now also
available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings
such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin
Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine
translation system for English to Hindi. UNL is a United Nations project for
developing an interlingua for the world's languages. The UNL-based machine
translation system is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad University has
developed an English-Kannada MT system. It is based on the transfer approach and
Universal Clause Structure Grammar (UCSG). The project is funded by the
Karnataka Government and is applicable in the domain of government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user
is not satisfied with the translated sentence generated by the system, the user
can supply new words, phrases and sentences as feedback and obtain a newly
interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines a rule-based
approach with a statistical approach. It is used for translation from English to
Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT
system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. It is based on the
Anusaaraka machine translation system and follows a lexicon translation
approach. It also has a small set of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali.
It is developed in the Perl programming language, which supports the processing
of Unicode-encoded text. The system uses the Penn Treebank annotation system for
part-of-speech tagging. It translates an English sentence into Unicode-based
Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to
Hindi machine translation system. The system is built on the paradigm of foreign
machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT: Indian Language
Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two IIT Delhi
graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital.
Guruji.com uses crawling technology based on proprietary algorithms. For any
query, it searches deep into Indian-language content and tries to return
appropriate results. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business, education
and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such as
Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides automated translation from English to Indian languages.
The Google Transliteration Input Method Editor is currently available for
languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali,
Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer
applications. The tool supports major Indian languages such as Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion
model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of Indian
language technology in areas such as text translation, software localization and
website localization. It is also involved in research and development on corpus
creation and collection and on content syndication. Moreover, it provides
language consultancy. It has developed various applications in Indian languages,
such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail,
Greetings, Classifieds, Quiz, Quest, Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement technology
to many state governments and the central government in e-governance programs.
It has developed software for multilingual content creation for publishing
newspapers, and has also developed high-quality Unicode-based fonts for major
Indian languages. It has specifically developed the Shree-Lipi Gurjrati package
for the Gujarati language, which is useful in the DTP sector, corporate offices
and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970, the Department of
Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script Code
for Information Interchange) was developed for Indian languages on the pattern
of ASCII (American Standard Code for Information Interchange). In addition,
Indian languages Transliteration (ITRANS), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et al 2005). The
Department of Information Technology under the Ministry of Communication and
Information Technology is also making efforts for the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defence Research and Development Organisation), the Department of Atomic
Energy, the All India Council of Technical Education and UGC (University Grants
Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide research funds to development agencies. As an end
result, IndoWordNet was developed for the Indian languages on the pattern of the
English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The India-centric localized search engine market is saturating fast. In the last
year alone, more than 10-15 Indian local search engines must have been launched,
some smaller and some bigger, some with huge funding and some with none. This
space is so crowded right now that it is difficult to know who is really
winning. However, we attempt to put forth a brief overview of the current
scenario. Here are some of those that fall in the localized Indian search engine
category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefindin, along with Ask Laila, which launched a couple of days back. We also
have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition) or other. It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features
and making its services better.
4.6.1 Most Used Search Tools in India for Web Activity: A Survey by Juxt
Consult (2008-2009 Report)
4.6.1.1 Most Used Websites
Websites | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, applying
present-day popular techniques, shows a severe problem of low recall when
accessing information using Indian language queries. For instance, popular web
search engines such as Google, Yahoo and Guruji often return 0 search results
for Indian language queries, giving the impression that no documents containing
this information exist. In reality, these search engines face a recall problem
when dealing with Indian languages because of multiple spellings, morphological
variants of keywords and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, the Hindi query for "world trade
center aatank-waadi hamlaa" ("वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला") is shown to
return 0 documents in Table 4.15; however, a small rephrasing of the query in
Table 4.16 shows that these keywords do exist in the second set of search
results. But merely saying that we have a recall problem may not be sufficient.
The next obvious question is 'how much of a problem is it?' In other words, we
need to somehow quantify the problem. For this purpose we conducted many
experiments to determine at what levels this recall problem occurs and by how
much.
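One simple way to put a number on the problem is the share of queries for which an engine returns nothing at all. The sketch below is an illustrative assumption about how such a figure could be computed, not the procedure used in the experiments; the function name `zero_result_rate` is hypothetical, and the sample counts are the Google hit counts reported in this section's two recall tables.

```python
def zero_result_rate(hit_counts):
    """Share of queries for which the engine returned no documents."""
    if not hit_counts:
        raise ValueError("need at least one query")
    return sum(1 for hits in hit_counts if hits == 0) / len(hit_counts)


# Google hit counts for the original queries and for their rephrased variants.
original_queries = [0, 0]
rephrased_queries = [8820, 331, 708, 37100]

print(zero_result_rate(original_queries))   # 1.0
print(zero_result_rate(rephrased_queries))  # 0.0
```

A drop in this rate after rephrasing indicates that the documents exist and that the original failure was one of recall rather than of missing content.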
Table 4.15: Problems faced while searching in Hindi: low recall
| Hindi Query | Google | Yahoo | Guruji |
| --- | --- | --- | --- |
| वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0 |
| इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0 |

Table 4.16: Improved recall
| Hindi Query | Google | Yahoo | Guruji |
| --- | --- | --- | --- |
| वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12 |
| वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1 |
| इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1 |
| बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93 |

4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological
structure and grammar. In practice, however, grammatical and structural
standards are rarely followed, for various reasons. One reason is the linguistic
diversity of India: including Hindi, about 28 languages are spoken in India, and
Hindi, being the national language of India, is influenced by the regional
languages. This changes the dialects not only in speech but in writing as well.
Every language attaches markers to a root word to construct new words: English
uses suffixes such as -s, -es and -ing, and Hindi uses maatraas such as
ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are
morphological variants of the Hindi root word Yojnaa (योजना, "planning" in
English). It is desirable to reduce all the morphological variants of a word to
a single canonical form. This process is called word stemming, and the
canonical form is called the root word or base word.
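The idea of reducing morphological variants to one root word can be sketched in a few lines of Python. This is a minimal suffix-stripping illustration, not the stemmer used in this study: the suffix list and the function name `stem` are assumptions chosen only to cover the Yojnaa example above, and a usable Hindi stemmer would need a far larger rule set.

```python
import unicodedata

# Assumed, illustrative plural endings (not an exhaustive Hindi suffix list).
SUFFIXES = [
    "\u0913\u0902",  # ओं
    "\u090f\u0901",  # एँ
    "\u090f\u0902",  # एं
    "\u094b\u0902",  # ों
    "\u0947\u0902",  # ें
]


def stem(word):
    """Strip one known plural ending, keeping at least two code points."""
    word = unicodedata.normalize("NFC", word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएँ"]:
        print(variant, "->", stem(variant))
```

Under these assumptions all three forms map to the same root, so a query containing any of them could be matched against documents indexed under the root.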
4.8.2 Phonetic nature of Hindi Language and Spelling variations
Spelling variations in the language can be attributed mainly to the phonetic
nature of Indian languages, multiple dialects, the transliteration of proper
names, words borrowed from regional and foreign languages, and the phonetic
variety of the Indian language alphabet. Together these factors produce several
spellings of the same word; the Hindi word अंग्रेजी (angrējī, meaning
"English"), for example, has several possible spelling variations.
There are numerous words that are phonetically equivalent but written
differently. The word "school" can be written in Hindi in different ways
(सक र सक र सक र). When the standard keyword स्कूल (school) and a non-standard
phonetic equivalent सक र are searched, Google shows 69 million results for the
former and 14 million for the latter. Hindi is also influenced by other regional
languages, which results in phonetic variety: the English word "school" (स्कूल
in Hindi) is pronounced and written as ISKOOL (इस्कूल) by the majority of the
population in different states of India. For the Hindi word ISKOOL (इस्कूल),
more than two thousand results are found. Search engines should be capable of
retrieving results for the phonetic equivalents of the keywords entered. A user
may use any of these keywords, and search engines should support all
phonetically equivalent words.
Moreover, no particular standard exists for writing the keywords used to fetch
Hindi web data. Each phonetically equivalent keyword in a query changes the
results; that is, a different set of documents is retrieved, with very little
repetition. A native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information.
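One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalized key. The sketch below is an assumption-laden illustration, not the method used in this study: the folding rules (drop the nukta, treat chandrabindu as anusvara, shorten long i/u vowels) and the names `phonetic_key` and `group_variants` are hypothetical, and the rules do not cover cases such as the prosthetic vowel in ISKOOL.

```python
import unicodedata
from collections import defaultdict

NUKTA = "\u093c"

# Assumed folding rules; each maps one Devanagari sign to a coarser one.
FOLD = str.maketrans({
    "\u0901": "\u0902",  # chandrabindu -> anusvara
    "\u0940": "\u093f",  # long i matra -> short i matra
    "\u0942": "\u0941",  # long u matra -> short u matra
    "\u0908": "\u0907",  # independent long I -> short I
    "\u090a": "\u0909",  # independent long U -> short U
})


def phonetic_key(word):
    """Return a coarse key shared by some common spelling variants."""
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed.replace(NUKTA, "").translate(FOLD)


def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = defaultdict(list)
    for word in words:
        groups[phonetic_key(word)].append(word)
    return list(groups.values())


if __name__ == "__main__":
    print(group_variants(["स्कूल", "स्कुल", "अंग्रेज़ी", "अंग्रेजी"]))
```

A search interface could issue one query per member of a group, or index documents under the key, so that a user typing either spelling reaches the same documents.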
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament) has the following commonly spoken
synonyms: गहना, जेवर, अलंकार.
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples below show
this clearly. Consider the following query:
In English: Women like gold
In Hindi: नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it has another meaning,
"to sleep". In the context of the query above, सोना means gold, but the query
can also be interpreted as "women like to sleep".
Another query:
In English: The common people's choice
In Hindi: आम लोगों की पसंद
Here the word आम is ambiguous. In the query above it means "common"; however,
in Hindi it also means "mango", so the query can be interpreted as "mango is
people's choice".
Many words are polysemous. Finding the correct sense of a word in a given
context is an intricate task, because one word has more than one meaning and
the meaning depends on the context of the sentence. For example, कर means "tax"
in one context (with synonyms बम ज श लक स द भहस ऱ ट कस), "hand or arm" in
another (हसत फ ह आच शफय), and "to do" (करना) in a third.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Indians are
sometimes unable to recall the Hindi equivalent of an English word. For
instance, words such as road, bus, pen, television, radio, please, rail, email,
password, insurance, internet, director and department are used even by
uneducated Indians, who may not be aware which language the words come from.
Most Indians use these words in English rather than in their native language.
English influences Hindi not only in speech but in writing too, and this
becomes especially evident in Hindi literature on the web. The influence of
English on the Hindi language has been observed to be one of the most important
parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from this set, which throw light on
the effect of the root word on the performance of Hindi language search
engines. Table 4.18 shows examples of the effect of morphological factors on
Hindi queries.

Table 4.17: List of Hindi queries
| S No | Query in Hindi | Meaning in English |
| --- | --- | --- |
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Birds' species |
| 6.2 | ऩकषमोकीपरज ततम | Birds' species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |
Table 4.18: Effect of morphological factors on Hindi queries
(The "Keywords listed" column gives the keywords listed by Google, Bing and
Guruji, in that order.)
| S No | Root word(s) | Keywords listed (Google, Bing, Guruji) | Morphological variant | Documents returned: Google | Bing | Guruji |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | | |
| 1.1 | | | ब यत वष मवन | 50500 | 4680 | 485 |
| 1.2 | | | ब यत मवष मवनो | 40400 | 680 | 61 |
| 2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | | | |
| 2.1 | | | द घमटन | 133000 | 2410 | 284 |
| 2.2 | | | द घमटन ओ | 117000 | 420 | 23 |
| 3 | ब ष | ब ष ब ष ओ ब ष ब ष | | | | |
| 3.1 | | | ब ष | 161000 | 8330 | 961 |
| 3.2 | | | ब ष ए | 6200 | 935 | 188 |
| 3.3 | | | ब ष ओ | 6090 | 441 | 356 |
| 4 | झ र | झ रझ रो झ र झ र | | | | |
| 4.1 | | | झ र | 4740 | 278 | 25 |
| 4.2 | | | झ र | 1270 | 28 | 1 |
| 5 | आऩद | आऩद आऩद ओ आऩद आऩद | | | | |
| 5.1 | | | आऩद | 102000 | 4030 | 410 |
| 5.2 | | | आऩद ए | 1160 | 64 | 20 |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत | | | | |
| 6.1 | | | ऩ परज तत | 48200 | 1670 | 98 |
| 6.2 | | | ऩकषमोपरज तत | 47600 | 1150 | 84 |
| 6.3 | | | ऩकषमोपरज ततम | 33800 | 747 | 25 |
| 7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | | | |
| 7.1 | | | सभसम | 584000 | 30200 | 1889 |
| 7.2 | | | सभसम ओ | 584000 | 7150 | 1356 |
| 8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | | | |
| 8.1 | | | कीटन शक | 36300 | 1360 | 333 |
| 8.2 | | | कीटन शको | 35800 | 800 | 270 |
| 9 | य ग | य गोय ग य ग य ग | | | | |
| 9.1 | | | य ग | 205000 | 21600 | 1423 |
| 9.2 | | | य चगमो | 128000 | 3280 | 239 |
| 9.3 | | | य ग | 112000 | 6280 | 647 |
| 10 | म जन | म जन ओ म जन म जन म जन | | | | |
| 10.1 | | | म जन | 673000 | 18500 | 3343 |
| 10.2 | | | म जन ओ | 669000 | 6020 | 990 |
| 10.3 | | | म जन ए | 673000 | 2860 | 416 |
| 11 | क दर | क दरीमक दर क दर क दर | | | | |
| 11.1 | | | क दर | 261000 | 11300 | 655 |
| 11.2 | | | क नदरो | 29500 | 1850 | 105 |
Table 4.19: Precision values of the three search engines
| S No | Query | Precision@10: Google | Bing | Guruji |
| --- | --- | --- | --- | --- |
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1: Precision values of the three search engines (x-axis: query number;
y-axis: precision from 0 to 1; series: Google, Bing, Guruji)
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching for
documents with the root word, because in general better results are obtained
with keywords in their root form.
It has also been observed that only Google lists the morphological variants of
the root words, whereas Bing and Guruji list only the root word supplied, in
almost all the sample queries in the table above.
The results indicate that only Google indexes documents by the root form of
their keywords. Bing and Guruji do not index in that form, which is why they
retrieve fewer documents than Google. The overall comparison of results from
the three search engines in the tables above shows that, in general, the
quantity of results retrieved increases when keywords are used in their root
form. For search engines, however, the quality of results is more important
than the quantity. Figure 4.1 and Table 4.19 compare the precision values of
the three search engines. The precision value is calculated from the top 10
results of each search engine. A close look at the results shows that the
precision of Google is high for almost all queries. Since, as mentioned above,
Google indexes keywords in their root form, the relevance of its results is
also higher than that of the other two search engines. This indicates that
morphological variations in the keywords affect not only the quantity but also
the quality of the results.
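The precision values reported above are computed over the top 10 results of each engine. A minimal sketch of that calculation follows; the function name `precision_at_k` and the representation of relevance judgments as a list of booleans are assumptions made for illustration, and results missing from a short result list are counted as not relevant.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` holds one boolean per returned result, in rank order.
    If fewer than k results are returned, the missing ranks count as
    not relevant, so the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


if __name__ == "__main__":
    # Nine of the first ten results judged relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```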
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for the phonetic
equivalents of the keywords entered. A user may use any of these keywords, and
search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision obtained.
Table 4.20: Results of the search engine on the phonetic nature of Hindi language
| Hindi query (standard keywords in bold in the original) | Phonetic variations of the keywords | Google results for query having keywords | No of results | Precision@10 |
| --- | --- | --- | --- | --- |
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |
Figure 4.2: Precision charts for the phonetic nature of Hindi language (one
chart each for Query No. 1 to Query No. 5 and their phonetic variants)
The table and figure above show that search engines return documents for each
of the phonetically equivalent Hindi queries. No particular standard exists for
writing the keywords used to fetch Hindi web data. Each phonetically equivalent
keyword in a query changes the results; that is, a different set of documents
is retrieved, with very little repetition. The precision chart shows that the
degree of relevance for queries containing phonetically equivalent keywords is
the same or nearly equal. A native Hindi user may not be aware of the phonetic
issues in Hindi IR and may miss relevant information.
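Because each phonetic variant retrieves a largely different set of documents, a user who wants full coverage has to combine the result lists. The following sketch shows one simple way to do that, by interleaving the ranked lists and dropping repeats. It is an illustration under stated assumptions, not the tool described in this study: the function name `merge_results` is hypothetical, and results are represented as plain URL strings.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked result lists rank by rank, dropping duplicates.

    `result_lists` holds one ranked list of URLs per query variant.
    The first occurrence of each URL is kept.
    """
    merged = []
    seen = set()
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    variant_a = ["doc1", "doc2"]
    variant_b = ["doc2", "doc3", "doc4"]
    print(merge_results([variant_a, variant_b]))
```

Interleaving by rank keeps the best results of every variant near the top; other merging policies (for example, weighting the standard spelling more heavily) would be equally plausible.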
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament) has the following commonly spoken
synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
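Table 4.21: Effect of word synonyms on Hindi IR
(Rows with a whole-number S No give the base query and its standard Hindi word
and synonyms as extracted; the numbered sub-rows give the documents returned
and the precision over the top 10 results.)
| S No | Query | Standard Hindi words / synonyms | Google | Google P@10 | Bing | Bing P@10 | Guruji | Guruji P@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | स न क आब षण | आब षणगहन | | | | | | |
| 1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर | | | | | | |
| 2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | |
| 4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | |
| 5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | |
| 6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |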
Figure 4.3: Comparison of precision values against three search engines
(x-axis: query number 1.1 to 6.3; y-axis: precision from 0 to 1)
The examples above show that using Hindi keywords together with their synonyms
improves information retrieval for a Hindi query. Using synonyms of Hindi
keywords affects not only the quantity of documents returned but also their
quality.
The table and figure above show that Google returns more documents than the
other two search engines and that Guruji returns the fewest; the reason may be
that fewer documents are available to it or that its indexing is poor. We are,
however, more interested in the quality of results than in their quantity. As
far as quality is concerned, Google and Bing clearly provide better results
than Guruji, and on average Google still ranks first; that is, its precision
values are higher than those of Bing and Guruji. It is thus clear that changing
a keyword into a synonym yields equivalent results. Synonyms of keywords
therefore play an important role in Hindi information retrieval.
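Replacing a keyword by its synonyms, as in the sub-queries of Table 4.21, can be sketched as a small query expansion routine. The synonym dictionary below is an assumed toy example built from the ornament example in this section (with plural forms chosen so that the phrases stay grammatical), and the function name `expand_query` is hypothetical; a real system would draw synonyms from a thesaurus or WordNet-style resource.

```python
# Assumed toy synonym list for illustration only.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms):
    """Return the original query plus one variant per synonym substitution.

    Only one keyword is replaced at a time, and the word order is kept.
    """
    tokens = query.split()
    variants = [query]
    for index, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variant = " ".join(tokens[:index] + [alternative] + tokens[index + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Each variant can then be submitted separately and the results compared or combined, which mirrors how the sub-queries in the table were evaluated one by one.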
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are
presented below. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the
fifth column holds the same keyword in the other context. The fourth and sixth
columns give the meaning of the query in English for the respective context of
the ambiguous keyword.
Table 4.22: List of randomly selected ambiguous queries
| S No | Query | Keyword (first context) | In English | Keyword (second context) | In English |
| --- | --- | --- | --- | --- | --- |
| 1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in Tables 4.23 to 4.25.
In each table, the "Results found" columns count how many of the top 10 results
fall in the given context, and "Other context" counts the remaining results.

Table 4.23: Ambiguity test for Google
| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing
| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |
Table 4.25: Ambiguity test for Guruji
| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

The results in the tables above show that all three search engines return
documents without differentiating between the contexts of the keyword in the
query. The last column, labeled "Other context", holds the number of results
that are not relevant to the query supplied, or that contain the keywords in
some other, unwanted context. The results make it clear that all the search
engines return documents in different contexts; search engines therefore
underperform when supplied with ambiguous queries. The numbers in the "Other
context" column indicate the deviation from relevance. For example, for the
query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context"
column contains 5 documents for Google, 8 documents for Bing and all 10
documents for Guruji.
For another query, सपेरों का फन (art of snake charmers), which has the
alternative reading "snake charmer's snake head", the retrieved documents are
expected to be in the context "art". The results show, however, that Google
returns all 10 documents in the unwanted context (snake head) and Bing returns
9, whereas Guruji fails to retrieve even a single document. In such a scenario
it becomes important for search engines to address the issue of ambiguity in
keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but in writing too, and this
becomes especially evident in Hindi literature on the web. The influence of
English on the Hindi language has been observed to be one of the most important
parameters for Hindi information retrieval, as the following example shows.
Example: the English word "exercise" is written in Hindi as एक्सरसाइज. It has
the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज,
एकसयस इज. Following this pattern of phonetic variation, a variety of popular
keywords and queries from various domains were tested in the experiments.
Table 4.26 shows a sample set of common and popular English keywords along
with their phonetic variants written in Hindi.
Table 4.26: Keywords along with their phonetic variants written in Hindi
(The Hindi column lists, in order, the Google transliteration, the standard
Hindi keyword and two phonetic equivalents; the four result counts are the
Google results for these four keywords in the same order.)
| English word | Hindi keywords (Google transliteration, standard keyword, phonetic equivalents) | Google transliteration | Standard keyword | Equivalent 1 | Equivalent 2 |
| --- | --- | --- | --- | --- | --- |
| Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000 |

The table above shows that the search engine returns documents for
single-keyword queries, and that documents are also returned, in large numbers,
for all phonetic variants of the keywords. It can also be seen that people have
their own ways of representing Hindi words and that no standard is followed for
storing Hindi data on the web. Documents are retrieved for every phonetic
variant of an English keyword written in Hindi script. The Google
transliteration column (bold Hindi entries in the original table) shows the
keywords obtained with the Google transliteration tool. Table 4.26 shows that
transliteration does not provide the correct Hindi word in most cases. For
example, the correct transliteration of the word University should be
म तनवमसमटी, whereas Google transliteration provides उतनव मसमतम, which is
completely wrong. The table shows that 1540 documents were nevertheless
retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for
other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption
क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. This suggests that Hindi
website developers use unchecked, non-standard transliteration, which makes
Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of
English influence on the precision and quantity of documents retrieved. The
Hindi query is transformed into its variants by the software (its design and
working are discussed in the next chapter), which replaces Hindi keywords with
English keywords written in Hindi without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by its English equivalent, in both cases without changing the meaning
of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign"
(फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed
at the two levels as follows:
फॉरेन निवेश भारत में (Foreign nivesh bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
The original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein)
supplied by the user is thus transformed into two equivalent queries that mix
English and Hindi while the meaning remains the same. From the sample set of
one hundred queries, some randomly selected queries are presented below in
Table 4.27.
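The two-level transformation described above can be sketched as follows. This is only an illustration of the idea, not the software discussed in the next chapter: the function name `transform_query`, the dictionary of English equivalents and the Devanagari spellings of "foreign" and "investment" are assumptions made for this example.

```python
from itertools import combinations

# Assumed dictionary: Hindi keyword -> English equivalent in Devanagari.
ENGLISH_EQUIVALENTS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_query(query, english_equivalents):
    """Build level 1 (one word replaced) and level 2 (several replaced) variants."""
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in english_equivalents]
    levels = {1: [], 2: []}
    for size in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, size):
            variant = [
                english_equivalents[token] if index in positions else token
                for index, token in enumerate(tokens)
            ]
            levels[1 if size == 1 else 2].append(" ".join(variant))
    return levels


if __name__ == "__main__":
    result = transform_query("विदेशी निवेश भारत में", ENGLISH_EQUIVALENTS)
    print("Level 1:", result[1])
    print("Level 2:", result[2])
```

In this sketch level 1 contains every single-word replacement, whereas the example in the text shows only one of them; which level 1 variant to submit is a design choice left open here.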
Table 4.27: Transformed queries into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)
| In English | Hindi query | Influenced Hindi query: Level 1 | Level 2 | Google results: Hindi query | Level 1 | Level 2 |
| --- | --- | --- | --- | --- | --- | --- |
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4: Transformed queries into two equivalent senses (one chart each for
Hindi Query 1 to Hindi Query 6, comparing the original query with its Level 1
and Level 2 variants)
The table and figure above show that documents are returned for the original
as well as the transformed Hindi queries, and that the quantity of retrieved
documents is considerable. For search engines, however, the quality of results
is more important than the quantity, so Table 4.28 and Figure 4.5 are presented
below for the analysis of precision values. Three popular search engines,
namely Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28: Analysis of precision values (tabular representation)
(Each precision cell gives Precision@10 for the Hindi query, Level 1 and
Level 2, in that order.)
| Hindi query | Influenced Hindi query: Level 1 | Level 2 | Google | Bing | AltaVista |
| --- | --- | --- | --- | --- | --- |
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 / 0.8 / 1 | 0.9 / 0.7 / 1 | 0.9 / 0.7 / 1 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1 |
Figure 4.5: Analysis of inter-precision values (inter-precision chart; x-axis:
HQ (Hindi Query) 1 to 6, each with L (Level) 1 and 2; series: Google, Bing,
AltaVista)
The tables and figures above show that Hindi data of a similar nature can be
mined for Hindi queries by transforming the queries into variants that include
English keywords written in Hindi. Transforming the queries increased the
amount of retrieved data. The relevance of the retrieved data can be seen in
the precision columns. For every Hindi query and its transformed variations,
the degree of relevance of the documents is very close, equal or improved. For
example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents
are relevant, and for the transformed queries of similar nature, हेल्थ और
रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant
on Google; the pattern is similar for the rest of the Hindi queries in the
table above. Without transformation of Hindi queries, users may miss relevant
information, because a Hindi user may not be aware that such information is
present on the web and may be unable to formulate query variations based on
English influence. The table above suggests that English-influenced Hindi
information is present on the web and is increasing day by day. Including
English keywords in Hindi script in the query widens the scope of searching in
Hindi and of obtaining relevant information.
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the actual Hindi writing
standard, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
might face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords were implemented at
the root level. This problem can, however, be solved at the interface level.
Therefore, to lessen the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the
next chapter) that acts as an interface between the user and the search
engines. With the help of this tool, the user can widen the scope of search on
the web in Hindi [66][67].
Konkani, Oriya. In 2006, Microsoft, headquartered in Redmond in Washington
State, partnered with one of the early Hindi portals, webdunia.com, to launch its
MSN Hindi portal. "Webdunia also provided support for the Hindi version of
Microsoft Office as well as for language interface packs," says Jaideep Karnik,
general manager for content and localization at webdunia.com. The Indore,
Madhya Pradesh-based company has an office in the United States and helps
major software developers localize their products.

If Microsoft built the base for Hindi, Google was ready to put up the
superstructure. Realizing the potential of Indian languages, the California-based
company has launched various products in the past two years. With the Google
Hindi and Urdu search engines, one can search all the Hindi and Urdu Web pages
available on the Internet, including those that are not in Unicode font. "Google
offers searching in 13 languages, Hindi, Tamil, Kannada, Malayalam and Telugu
to name a few, Gmail in five languages, and Google transliteration in Bengali,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and
Urdu. Urdu is the most recent language that Google has added to its offerings,"
says Rahul Roy-Chowdhury, product manager at Google India. To use the search
function, "users can type Hindi words in Roman script and a drop down menu
suggests several Hindi phrases. By selecting the appropriate query, users can
search for Hindi content without even typing in Hindi," says Roy-Chowdhury.

Google has more useful tools for non-English users. Google News is available in
Hindi. With the Google translation engine, one can type English words and get a
list of suggested synonyms in Hindi. A transliteration tool allows users to type
any word in English, hit the space bar and get the same word in a different
language. Roy-Chowdhury explains the process of adding a new language: "Google
offers products first in Google Labs and waits for feedback from users for a
couple of months. Then the feedback is collated and the product is updated before
introducing the language with its other offerings like Gmail, Search, Blogger,
Translate and Orkut, to name a few." "Urdu is currently available in Google's
transliteration offering on the Google Labs Web site and the language is soon to
be introduced in various other products," he adds.

The efforts of Microsoft, Google and other developers have begun to produce
results. Page views of major Hindi news Web sites are rising fast and most of the
popular Hindi newspapers have a Web presence now. "In the last two years, page
views of navbharattimes.com have increased significantly and half of them come
through Google, as Net users generally search for a specific news item or query,"
says Nagar. Yahoo, with headquarters in California, formed a partnership with
Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran
relationship helps us gain significant traction among Indian Internet users. From
all the audience measures for this product, this has been a resounding success,"
says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since
Yahoo and Jagran started working together, page views have "grown to about 14
million from one million a year and a half earlier," says Upendra Swami, who
heads the Internet team at Jagran.

Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining
popularity. Started in July 2003, Hindi Wikipedia now has more than 36,000
articles. "It now appears to be the 52nd largest Wikipedia in size, compared to
the over 260 individual language Wikipedias," says Jay Walsh, head of
communications at the California-based Wikimedia Foundation. "Considering there
are millions of Hindi speakers, it is certainly an important part of the
Wikimedia Foundation's mission to support the growth of this project," says
Walsh. Urdu Wikipedia, started in January 2004, has more than 10,800 articles.

What are the challenges that still remain in the popularization of Hindi and Urdu
on the Internet? "The major challenge is Internet penetration and PC prices. The
moment we have better Internet penetration, especially in smaller towns, and PC
prices go down, Hindi and Indian languages can flourish on the Net," says Karnik
of webdunia.com. India had more than 49 million Internet users in June 2008, out
of which about 9 million used the Internet regularly, according to a study by
Juxtconsult India, a research company. "There is a big opportunity in Indian
languages. Studies showed that only 28 percent of Indian Web surfers preferred
English on the Web, but as good quality content in Indian languages was not
easily available, they did not visit many local language sites," says Mrutyunjay
Mishra, co-founder of Juxtconsult. IT experts agree. "Localization is the key to
success in countries like India. In order to get the widest audience reach, one
has to look at Hindi, because in a country of over a billion people, English is
spoken by less than 80 million people," says Krishna of Yahoo. Google's
Roy-Chowdhury agrees with him. "The Web is the democratization of access to
information," he says, adding that the Internet is not a luxury but a powerful
tool to improve life.

But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a
tough question. Right now, it is not much," says Mishra. But Roy-Chowdhury thinks
revenue is bound to come once Hindi reaches a critical volume. "If we look at how
the Internet developed in the US, it may provide a useful analogy. First came
content, which was mostly produced by people who had a passion for putting up
content they cared about. Traffic and monetization was not the motive. Second
came growing readership as people started discovering content. This set off a
virtuous cycle in which content eventually became a viable, monetizable business.
Third were the application developers, who could now focus on moving the online
experience beyond passive consumption of information to interactivity, community
building, service delivery and a host of other innovations," Roy-Chowdhury says.
"India's market was stuck in phase one for a long time. And I believe it has
recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian
English. It was developed under the leadership of Prof. S.V. Shastri at Shivaji
University, Kolhapur, India, in 1988. KCIE contains approximately one million
words of Indian English drawn from materials published in 1978. It was collected
for a comparative study of American, British and Indian English (Dash). The
Central Institute of Indian Languages (CIIL) is a nodal agency for the
development of Indian language corpora. It has coordinated with various Indian
agencies and universities to develop more than 45 million words of corpora in
the Scheduled Languages of India, which is also a part of the TDIL program. The
Enabling Minority Language Engineering (EMILLE) program provides the corpus
architecture and tools for Asian languages. It has a monolingual corpus of
approximately 96,157,000 words and a parallel corpus consisting of 200,000 words
of text in English, which helps in the translation of Bengali, Hindi, Punjabi
and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian
languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali,
Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual
parallel corpus and a repository of "One Million Pages" of knowledge-based text.
Mahatma Gandhi International University has started the project "Hindi
Samghraha" to build a repository of Hindi words and a dialect mapping of Hindi.
The Department of Information Technology of the Government of India has started
the Indian Language Corpora Initiative (ILCI), a project for developing Indian
language corpora. ILCI is a consortium project for building parallel annotated
corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves
11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively
young. Early efforts in this field have been under way since 1980, involving
prominent institutions such as IIT Kanpur, the University of Hyderabad, NCST
Mumbai and C-DAC Pune. During the late 1990s, many new projects were undertaken
by IIT Bombay, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University
Kolkata. In April 2008, TDIL started a consortium-mode project for building
computational tools and a Sanskrit-Hindi MT system under the leadership of Amba
Kulkarni (University of Hyderabad). The goal of this project is to build
children's stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology from
English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is a
rule-based system and has approximately 1,750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo
Lingua for Indian Languages) as an intermediate language. The architecture of
Anglabharti has six modules: morphological analyzer, parser, pseudo-code
generator, sense disambiguator, target text generators and post-editor. The
Hindi version of Anglabharti is AnglaHindi, a web-based application available at
http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by
various Indian institutes, for example IIT Guwahati, to develop automated
translation systems for regional languages.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based
on a hybridized example-based approach. The second phase of both projects
(Anglabharti II and Anubharti II) started in 2004 with new approaches and some
structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development
project for Indian languages and English undertaken by CIF (Chinmaya
International Foundation). It is a fully-automatic, general-purpose,
high-quality machine translation system (FGH-MT). Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started in 1996 for the translation of government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra software is
available in desktop, network and web-based forms. It is based on the
Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent English as well
as Hindi grammar. Initially it was domain specific, covering Personnel
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present it
also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali
and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha
Secretariat of the Rajya Sabha (the upper house of the Parliament of India)
provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine
translation system for English to Hindi. UNL is a United Nations project for
developing an interlingua for the world's languages. UNL-based machine
translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project is
funded by the Karnataka Government and is applicable in the domain of government
circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user
is not satisfied with the system-generated translation, the user can supply new
words, phrases and sentences to the system and obtain a newly interpreted
translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti combines rule-based and
statistical approaches. It is used for translation from English to Indian
languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at
http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has small sets of transfer rules. Users
can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali.
It is developed in the Perl programming language, which supports processing of
Unicode-encoded text. The system uses the Penn Treebank annotation system for
part-of-speech tagging. It translates English sentences into Unicode-based
Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to
Hindi machine translation system. The system is built on the paradigm of foreign
machine translation systems such as RUSLAN and CESILKO. The system architecture
consists of three processing modules: Pre-Processing, Translation Engine and
Post-Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Guruji: Indian Language Search Engine
Guruji.com is the first Indian language search engine, founded by two IIT Delhi
graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital.
Guruji.com uses crawling technology based on proprietary algorithms. For any
query, it searches deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and
more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such as
Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides automated translation from English to Indian languages.
The Google Transliteration Input Method Editor is currently available for
languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali,
Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer
applications. The tool supports major Indian languages such as Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion
model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of Indian
language technology in areas such as text translation, software localization and
website localization. It is also involved in research and development of corpus
creation and collection, and content syndication. Moreover, it provides language
consultancy. It has developed various applications in Indian languages such as
My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail,
Greetings, Classifieds, Quiz, Quest, Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of
Indian language software. It provides Indian language enablement technology to
many state governments and the central government in e-governance programs. It
has developed software for multilingual content creation for publishing
newspapers and has also developed high-quality Unicode-based fonts for major
Indian languages. It has specifically developed the Shree-Lipi Gurjrati package
for the Gujarati language, which is useful in the DTP sector, corporate offices
and the e-governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of the need for language technology. Since 1970,
the Department of Electronics and the Department of Official Language have been
involved in developing Indian language technology. Consequently, ISCII (Indian
Script Code for Information Interchange) was developed for Indian languages on
the pattern of ASCII (American Standard Code for Information Interchange). In
addition, Indian Languages Transliteration (ITRANS), developed by Avinash
Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al.
2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also working towards the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defense Research and Development Organization), the Department of Atomic
Energy, the All India Council of Technical Education and UGC (University Grants
Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide funds for research to development agencies. As an
end result, IndoWordNet was developed for Indian languages on the pattern of the
English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very fast.
In the last year alone, more than 10-15 Indian local search engines must have
been launched, some smaller and some bigger, some with huge funding and some
with none. This space is so crowded right now that it is difficult to know who
is really winning. However, we attempt to put forth a brief overview of the
current scenario.
Here are some of those that fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol,
burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along
with Ask Laila, which launched a couple of days back. We also have localized
versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition). It is too early to pass judgment on any of them. These are
testing stages, and every start-up is adding new features and making its
services better.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, applying
present-day popular techniques, shows a severe problem of low recall when
accessing information using Indian language queries. For instance, popular web
search engines such as Google, Yahoo and Guruji often return 0 results for
Indian language queries, giving the impression that no documents containing this
information exist. In reality, these search engines face a recall problem while
dealing with Indian languages due to multiple spellings, morphological variants
of keywords and English keywords written in Hindi. Table 4.15 illustrates a few
such cases. For example, a Hindi query for "world trade center aatank-waadi
hamlaa" ("वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला") is shown to return 0 documents in
Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that
documents with these keywords do turn up in the second search. But just saying
that we have a recall problem may not be sufficient. The next obvious question
is how much of a problem it is. In other words, we need to quantify the problem.
For this purpose, we conducted many experiments to determine at what levels this
recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query                                          Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेन्टर आतंकवादी हमला                     0        0       0
इन्डियन इंस्टिट्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च          0        0       0

Table 4.16 Improved Recall
Hindi Query                                          Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                         8820     92      12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                         331      10      1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च          708      50      1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध                37100    7400    93

4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined morphological
structure and a well-defined grammar. However, the grammatical and structural
standards of the language are seldom followed, for various reasons. One reason
is the language diversity of India. Including Hindi, about 28 languages are
spoken in India, and Hindi, being the official language of India, is influenced
by the regional languages, which results in dialectal changes not only in speech
but also in writing. Every language attaches markers to a root word to construct
new words (English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For
example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) in Hindi (planning in
English) are morphological variants of the root word Yojnaa (योजना). It is
desirable to combine all the morphological variants of a word into a single
canonical form. This process is called word stemming, and the canonical form is
called the root word or base word.
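The stemming step described above can be sketched as a small suffix-stripping routine. This is only an illustration: the rule list `SUFFIX_RULES`, the function names `stem` and `canonical_query`, and the `min_stem_len` guard are assumptions made for this sketch, not the rule set of any search engine or of the tool discussed in this thesis, and a real Hindi stemmer needs a far larger rule set plus exception handling.

```python
import unicodedata

# Illustrative (suffix, replacement) rules, checked in this order.
# Assumed for the sketch only; not a complete Hindi rule set.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -iyon -> -ii  (rogiyon -> rogii)
    ("\u0913\u0902", ""),                    # -on after aa  (yojnaaon -> yojnaa)
    ("\u090f\u0901", ""),                    # -en with chandrabindu (yojnaayein -> yojnaa)
    ("\u090f\u0902", ""),                    # -en with anusvara
    ("\u094b\u0902", ""),                    # -on matra     (keetnaashakon -> keetnaashak)
    ("\u0947\u0902", ""),                    # -en matra     (jheelen -> jheel)
]


def stem(word, min_stem_len=2):
    """Return a canonical (root) form by stripping one known plural/oblique suffix."""
    word = unicodedata.normalize("NFC", word)
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when enough of the word remains to be a plausible root.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


def canonical_query(query):
    """Stem every whitespace-separated token of a query."""
    return " ".join(stem(token) for token in query.split())
```

With these rules, the variants Yojnaaon and Yojnaayein both reduce to Yojnaa, so a query containing either variant can be matched against documents indexed under the root word.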
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, transliteration of proper names,
words borrowed from regional and foreign languages, and the phonetic variety of
the Indian language alphabet. The variety in the alphabet, the different
dialects and the influence of regional and foreign languages have resulted in
spelling variations of the same word. For example, the Hindi word अंग्रेज़ी
(angrējī, meaning English) has several possible spelling variations.
There are numerous words that are phonetically equivalent but vary in writing.
The word school can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword school (स्कूल) and a
non-standard phonetically equivalent Hindi keyword (सक र), Google shows 69
million results for the former and 14 million for the latter. Hindi is
influenced by other regional languages, which results in phonetic variety of
words. For example, the English word school (स्कूल in Hindi) is pronounced and
written as ISKOOL (इस्कूल) by the majority of the population of India in
different states. For the Hindi word ISKOOL (इस्कूल), more than two thousand
results are found. Search engines should be capable of retrieving results for
the phonetically equivalent forms of the keywords entered. A user may use any of
these keywords for searching, and search engines should support all phonetically
equivalent words.
Also, no particular standard exists for writing the keywords used to fetch Hindi
web data. The results vary for every phonetically equivalent keyword in the
query, i.e. a different set of documents is retrieved, with little repetition. A
native Hindi user may not be aware of the phonetic issues in Hindi IR and may
miss relevant information.
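One way to treat phonetically equivalent spellings alike is to map each keyword to a coarse phonetic key before matching. The sketch below is a hedged illustration: the function name `phonetic_key` and the particular folding rules (dropping the nukta, folding chandrabindu into anusvara, shortening long i/u vowel signs, and dropping a leading "i" before "s" plus halant, as in ISKOOL) are assumptions chosen to cover the examples in this section, not an established standard.

```python
import unicodedata

NUKTA = "\u093c"

# Assumed folding rules for this sketch only.
_FOLD = str.maketrans({
    "\u0901": "\u0902",  # chandrabindu -> anusvara
    "\u0940": "\u093f",  # long ii vowel sign -> short i
    "\u0942": "\u0941",  # long uu vowel sign -> short u
})

# "i" + "sa" + halant at the start of a word (as in ISKOOL).
_PROTHETIC_I = "\u0907\u0938\u094d"


def phonetic_key(word):
    """Map spelling variants of a Hindi word to one coarse key."""
    # NFD splits precomposed nukta letters into base letter + nukta,
    # so the nukta can then be removed uniformly.
    folded = unicodedata.normalize("NFD", word).replace(NUKTA, "")
    folded = folded.translate(_FOLD)
    if folded.startswith(_PROTHETIC_I):
        folded = folded[1:]  # drop the leading vowel, keep "s" + halant
    return folded


def same_sound(word_a, word_b):
    """True when two spellings share the same phonetic key."""
    return phonetic_key(word_a) == phonetic_key(word_b)
```

Under these assumptions the standard spelling of school, its short-vowel spelling and the ISKOOL form all share one key, so an interface could submit each known variant of a keyword, or index documents by key, instead of relying on the exact spelling typed by the user.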
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning. A word often has near-synonyms that
differ from it solely in these nuances of meaning. Choosing the right word can
be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर, अलंकार.
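Synonym handling can be sketched as query expansion: each keyword that has known synonyms is replaced in turn to produce alternative queries. The function name `expand_query` and the one-entry `SYNONYMS` table are hypothetical and serve only as an example; a real system would draw on a lexical resource such as a Hindi WordNet.

```python
def expand_query(query, synonyms):
    """Return the original query plus one variant per synonym substitution."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            candidate = " ".join(tokens[:i] + [alternative] + tokens[i + 1:])
            if candidate not in variants:  # avoid duplicate queries
                variants.append(candidate)
    return variants


# Tiny illustrative table built from the example in this section.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}

if __name__ == "__main__":
    for variant in expand_query("आभूषण", SYNONYMS):
        print(variant)
```

Each variant can then be submitted to the search engine separately and the result lists merged, which keeps the expansion at the interface level rather than inside the engine.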
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples below show
this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e.
to sleep. In the context of the above query the word सोना means gold, but the
query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means common; however, in
Hindi it also means mango. So the above query can be interpreted as "mango is
people's choice".
Many words are polysemous in nature. Finding the correct sense of a word in a
given context is an intricate task. One word can have more than one meaning, and
the meaning of a word depends on the context of the sentence. For example, कर
(tax) has the synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स in one context; in
another context कर (hand or arm) has the synonyms हसत फ ह आच शफय; and in yet
another context कर (to do) corresponds to करना.
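A very simple way to pick a sense from context is to count how many context words overlap with cue words attached to each sense, in the spirit of the simplified Lesk approach. The sketch below is an assumption-laden illustration: the function name `pick_sense` and the cue-word lists in `SONA_SENSES` are invented for the example and are not taken from any lexicon.

```python
def pick_sense(context_tokens, senses):
    """Return the sense whose cue words overlap most with the context.

    `senses` maps a sense label to a list of cue words. Returns None
    when no sense shares any word with the context, i.e. the query
    stays ambiguous.
    """
    context = set(context_tokens)
    best_sense, best_overlap = None, 0
    for sense, cues in senses.items():
        overlap = len(context & set(cues))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical cue words for the two senses of the word for gold / to sleep.
SONA_SENSES = {
    "gold": ["आभूषण", "गहना", "कीमत"],
    "sleep": ["रात", "नींद", "बिस्तर"],
}
```

For the query "women like gold" none of the cue words occurs, so the function returns None; this mirrors the point made above that short queries often carry too little context to resolve the sense.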
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the
pronunciation of Hindi words. Many English words have been localized in India,
and some of them appear as if they were native Hindi words. Indians are
sometimes unable to find the Hindi equivalent of an English word. For instance,
words such as road, bus, pen, television, radio, please, rail, email, password,
insurance, internet, director and department are used even by uneducated
Indians, without awareness of which language the words come from. Most Indians
use these words in English rather than in their native language. English
influences Hindi not only in speech but in writing too. This becomes especially
evident in Hindi literature on the web. The influence of English on Hindi has
been observed to be one of the most important parameters for Hindi information
retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table
4.17 lists randomly selected queries from this set, which throw light on the
effect of the root word on the performance of Hindi language search engines.
Table 4.18 shows examples of the effect of morphological factors on Hindi
queries.
Table 4.17 List of Hindi queries
S. No.   Query in Hindi                        Meaning in English
1        भारत वर्षावन                           Indian rain forest
1.1      भारतीय वर्षावनों                       Indian rain forests
2        हवाई दुर्घटना के कारण                  Reason for air crash
2.1      हवाई दुर्घटनाओं के कारण                Reason for air crashes
3        भारत में बोली जाने वाली भाषा           Language spoken in India
3.1      भारत में बोली जाने वाली भाषाएँ         Languages spoken in India
3.2      भारत में बोली जाने वाली भाषाओं         Languages spoken in India
4        विलुप्त होने पर झील                    Lake on the verge of extinction
4.1      विलुप्त होने पर झीलें                  Lakes on the verge of extinction
5        प्राकृतिक आपदा                         Natural calamity
5.1      प्राकृतिक आपदाएँ                       Natural calamities
6        पक्षी की प्रजाति                       Bird species
6.1      पक्षियों की प्रजाति                    Bird's species
6.2      पक्षियों की प्रजातियाँ                 Bird's species
7        कृषि समस्या                            Agriculture problem
7.1      कृषि समस्याओं                          Agriculture problems
8        कीटनाशक का इस्तेमाल                    Use of pesticide
8.1      कीटनाशकों का इस्तेमाल                  Use of pesticides
9        मानसिक रोग                             Mental illness
9.1      मानसिक रोगियों                         Mental patients
9.2      मानसिक रोगी                            Mental patient
10       ग्रामीण विकास योजना                    Policy for village
10.1     ग्रामीण विकास योजनाओं                  Policies for village
10.2     ग्रामीण विकास योजनाएँ                  Policies for village
11       प्रमुख कृषि केंद्र                     Major agricultural office
11.1     प्रमुख कृषि केन्द्रों                  Major agricultural offices
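Table 4.18 Effect of morphological factors on Hindi queries
S. No.   Root word / Morphological variant      Documents Returned
                                                Google    Bing     Guruji
1        Root words: भारत, वर्षावन
         Listing of keywords (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1      भारत वर्षावन                           50500     4680     485
1.2      भारतीय वर्षावनों                       40400     680      61
2        Root word: दुर्घटना
         Listing of keywords (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
2.1      दुर्घटना                               133000    2410     284
2.2      दुर्घटनाओं                             117000    420      23
3        Root word: भाषा
         Listing of keywords (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
3.1      भाषा                                   161000    8330     961
3.2      भाषाएँ                                 6200      935      188
3.3      भाषाओं                                 6090      441      356
4        Root word: झील
         Listing of keywords (Google, Bing, Guruji): झ रझ रो झ र झ र
4.1      झील                                    4740      278      25
4.2      झीलें                                  1270      28       1
5        Root word: आपदा
         Listing of keywords (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
5.1      आपदा                                   102000    4030     410
5.2      आपदाएँ                                 1160      64       20
6        Root words: पक्षी, प्रजाति
         Listing of keywords (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1      पक्षी प्रजाति                          48200     1670     98
6.2      पक्षियों प्रजाति                       47600     1150     84
6.3      पक्षियों प्रजातियाँ                    33800     747      25
7        Root word: समस्या
         Listing of keywords (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1      समस्या                                 584000    30200    1889
7.2      समस्याओं                               584000    7150     1356
8        Root word: कीटनाशक
         Listing of keywords (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
8.1      कीटनाशक                                36300     1360     333
8.2      कीटनाशकों                              35800     800      270
9        Root word: रोग
         Listing of keywords (Google, Bing, Guruji): य ग य गोय ग य ग य ग
9.1      रोग                                    205000    21600    1423
9.2      रोगियों                                128000    3280     239
9.3      रोगी                                   112000    6280     647
10       Root word: योजना
         Listing of keywords (Google, Bing, Guruji): म जन ओ म जन म जन म जन
10.1     योजना                                  673000    18500    3343
10.2     योजनाओं                                669000    6020     990
10.3     योजनाएँ                                673000    2860     416
11       Root word: केंद्र
         Listing of keywords (Google, Bing, Guruji): क दर क दरीमक दर क दर क दर
11.1     केंद्र                                 261000    11300    655
11.2     केन्द्रों                              29500     1850     105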
Table 4.19 Precision values of the three search engines
S. No.   Query                                  Precision @ 10
                                                Google   Bing   Guruji
1.1      भारत वर्षावन                           0.5      0.3    0.1
1.2      भारतीय वर्षावनों                       0.3      0.1    0.1
2.1      हवाई दुर्घटना के कारण                  0.5      0.3    0.1
2.2      हवाई दुर्घटनाओं के कारण                0.3      0.3    0.1
3.1      भारत में बोली जाने वाली भाषा           0.9      0.5    0.5
3.2      भारत में बोली जाने वाली भाषाएँ         0.5      0.4    0.3
3.3      भारत में बोली जाने वाली भाषाओं         0.5      0.2    0.2
4.1      विलुप्त होने पर झील                    0.5      0.3    0.2
4.2      विलुप्त होने पर झीलें                  0.5      0.3    0
5.1      प्राकृतिक आपदा                         1        0.4    0.3
5.2      प्राकृतिक आपदाएँ                       0.6      0.6    0.1
6.1      पक्षी की प्रजाति                       1        0.7    0.2
6.2      पक्षियों की प्रजाति                    0.9      0.6    0.1
6.3      पक्षियों की प्रजातियाँ                 0.9      0.6    0.1
7.1      कृषि समस्या                            0.7      0.3    0.2
7.2      कृषि समस्याओं                          0.7      0.4    0.2
8.1      कीटनाशक का इस्तेमाल                    1        0.8    0.2
8.2      कीटनाशकों का इस्तेमाल                  0.9      0.6    0.2
9.1      मानसिक रोग                             0.7      0.6    0.4
9.2      मानसिक रोगियों                         0.9      0.6    0.3
9.3      मानसिक रोगी                            0.9      0.5    0.4
10.1     ग्रामीण विकास योजना                    0.9      0.7    0.3
10.2     ग्रामीण विकास योजनाओं                  1        0.6    0
10.3     ग्रामीण विकास योजनाएँ                  0.8      0.6    0.2
11.1     प्रमुख कृषि केंद्र                     0.5      0.4    0
11.2     प्रमुख कृषि केन्द्रों                  0.4      0.2    0.2

Figure 4.1 Precision values of the three search engines (series: P Google, P
Bing, P Guruji; y-axis: precision from 0 to 1; x-axis: queries 1.1, 2.1, 3.1,
3.3, 4.2, 5.2, 6.2, 7.1, 8.1, 9.1, 9.3, 10.2, 11.1)
It has been observed that all three search engines return more documents when the query is submitted with the keyword in its root form. This justifies searching with root words, because in general we get better results with keywords in their root form.
It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On closely observing the results, we can say that the precision of Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also high in comparison with the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
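The precision values reported above are computed over the top 10 results of each search engine. The following is a minimal sketch of that calculation in Python, assuming each of the top results has already been judged relevant or not by hand. The function name `precision_at_k` and the sample judgements are illustrative assumptions and are not taken from the study.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Return the fraction of the top-k results that were judged relevant.

    `relevance` holds one manual judgement per result, in ranked order.
    If the engine returned fewer than k results, the missing ranks count
    as non-relevant, which is the usual convention for precision at k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(1 for judged_relevant in top_k if judged_relevant) / k


# Hypothetical judgements: 5 of the first 10 results are relevant.
judgements = [True, True, False, True, False, False, True, False, True, False]
print(precision_at_k(judgements))
```

Dividing by k rather than by the number of results returned keeps the score comparable across engines that return very different numbers of documents.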
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may type any spelling of a keyword, and the search engine should support all phonetically equivalent forms.
The following queries were randomly selected from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision at 10 obtained for each.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No 1 to Query No 5, covering each query and its phonetic variants)
From the above table and figure it can be clearly seen that the search engine returns documents for the various phonetically equivalent Hindi queries. No particular standard exists for writing a keyword to fetch Hindi web data. For every phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved, with very little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may therefore miss relevant information.
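One possible way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a normalised key. The sketch below is an illustration only and is not the method used in the study. It assumes three common sources of variation in Devanagari spelling: chandrabindu versus anusvara, an optional nukta, and a half nasal consonant written instead of an anusvara. The names `phonetic_key` and `group_variants` are hypothetical.

```python
import unicodedata
from collections import defaultdict

CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
NUKTA = "\u093c"
VIRAMA = "\u094d"
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # nga, nya, nna, na, ma


def is_consonant(ch: str) -> bool:
    """True for the basic Devanagari consonants KA..HA."""
    return "\u0915" <= ch <= "\u0939"


def phonetic_key(word: str) -> str:
    """Map a Devanagari word to a lossy key shared by common spelling variants."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "").replace(CHANDRABINDU, ANUSVARA)
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # A nasal + virama followed by a consonant is treated as an anusvara.
        if (ch in NASALS and i + 2 < len(text)
                and text[i + 1] == VIRAMA and is_consonant(text[i + 2])):
            out.append(ANUSVARA)
            i += 2  # skip the nasal and the virama
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def group_variants(words):
    """Group spellings that share the same phonetic key."""
    groups = defaultdict(list)
    for word in words:
        groups[phonetic_key(word)].append(word)
    return dict(groups)


# "andolan" written with a half na and with an anusvara (illustrative input).
variants = ["\u0906\u0928\u094d\u0926\u094b\u0932\u0928",
            "\u0906\u0902\u0926\u094b\u0932\u0928"]
print(len(group_variants(variants)))  # both spellings fall into one group
```

The key is deliberately lossy: it can merge words that are genuinely different, so it suits query expansion better than exact matching.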
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word (आब षण) (ornament in English) has the following commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3, presented below, compare the precision values of the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
S NO Query Standard
Hindi
Words
Synonyms Documents Returned
Google Per
10
Bing Per
10
Guruj
i
Per
10
1 स न क आब
षण
आब षणगहन
11 स न क आब षण 217000 08 3250 07 381 05
12 स न क गहन 188000 08 2590 06 389 05
13 स न क ज वय 78900 08 1670 07 311 06
14 स न क अर क य 9490 05 633 04 70 0
15 स न क आबयण 493 05 38 03 1 0
2 क र फ दर फ दर
21 क र फ दर 233000 07 7510 07 733 03
22 क र भ घ 40700 09 1500 08 99 06
23 क र जरधय 1570 06 54 06 2 02
3 सतर सशिकतकयण
सतर न यी
31 सतर सशिकतकयण
9950 09 1570 07 760 06
32 न यीसशिकतकयण
29300 09 1910 09 736 04
33 भहहर सशिकतकयण
96300 09 5160 08 1091 03
34 औयतसशिकतकयण
7670 08 680 07 510 02
4 मसक दयक अ हक य
अ हक य
41 मसक दयक अ हक य
1990 04 18 03 60 06
42 मसक दयक अमबभ न
2400 06 304 02 16 01
43 मसक दयक घभ ड
495 05 54 06 9 01
5 व रग ओ व
51 व रग ओ 6960 1 698 08 29 05
52 ऩ डरग ओ 13400 1 1080 09 143 09
53 दयखतरग ओ 481 05 19 05 0 0
6 आ खद न आ ख
61 आ खद न 34000 07 3690 05 312 03
62 न तरद न 77500 1 3240 1 159 09
63 च द न 2450 08 427 01 36 01
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns the most documents and Guruji the fewest; the reason may be that Guruji has fewer documents available or indexes them poorly. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and on average Google still ranks first, that is, its precision values are higher than those of Bing and Guruji. Thus it becomes clear that changing a keyword into its synonym yields equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
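The synonym experiment above amounts to submitting the same query several times, each time with one keyword swapped for a synonym. A minimal sketch of that expansion step is given below. The function name `expand_with_synonyms`, the synonym dictionary and the romanised example terms are assumptions made for illustration; the study itself works with Devanagari keywords.

```python
from typing import Dict, List, Sequence


def expand_with_synonyms(query_terms: Sequence[str],
                         synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by one variant per synonym.

    Each variant replaces exactly one keyword with one of its synonyms,
    leaving the rest of the query unchanged.
    """
    variants = [list(query_terms)]
    for position, term in enumerate(query_terms):
        for alternative in synonyms.get(term, []):
            variant = list(query_terms)
            variant[position] = alternative
            if variant not in variants:  # avoid submitting duplicates
                variants.append(variant)
    return [" ".join(variant) for variant in variants]


# Romanised stand-ins for a "gold ornaments" query (illustrative only).
query = ["sone", "ke", "aabhushan"]
synonym_table = {"aabhushan": ["gahne", "zevar"]}
for variant in expand_with_synonyms(query, synonym_table):
    print(variant)
```

Each variant can then be sent to the search engine and scored separately, as was done for the rows of Table 4.21.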
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In figure 35 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in another context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
The ambiguous queries of Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Table 4.24 Ambiguity test for Bing
S.No | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war
Query | Ambiguous keyword | Documents returned (Google) | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5
Query | Ambiguous keyword | Documents returned (Bing) | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the tables, the last column, labelled "Other Context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, unwanted context. The results make clear that all the search engines return documents in different contexts; it can therefore be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars) the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context (art). The results show, however, that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
Query | Ambiguous keyword | Documents returned (Guruji) | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | na | Snake head | na | na
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but in writing too, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.
Example: The English word exercise is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
English Words
Google Transliteration
Standard Hindi Keywords
Phonetic Equivalents Search Engine Google
Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800
Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710
Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820
Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100
Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000
0 537000
University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270
Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300
Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160
Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
3240 38000 639 1120
specialist टऩशसअशरटट
सऩ मशममरसट
सऩ शमरसट सऩ श मरसट
143 1440 51300 9820
Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that the search engine returns documents for a single-keyword query and also for every phonetic variant of the keyword, in large numbers. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that in most cases transliteration does not provide the correct Hindi word. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It appears that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
The queries are converted at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced, without changing the meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
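The two-level transformation described above can be sketched as follows. This is an illustration under stated assumptions, not the software discussed in the next chapter: the function name `transform_levels` and the equivalents dictionary are hypothetical, and romanised terms stand in for the Devanagari keywords. Level 1 replaces exactly one replaceable keyword; Level 2 replaces two or more.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def transform_levels(terms: Sequence[str],
                     english_equivalents: Dict[str, str]
                     ) -> Tuple[List[str], List[str]]:
    """Return (level 1 variants, level 2 variants) of a query.

    Level 1: one keyword replaced by its English equivalent.
    Level 2: two or more keywords replaced.
    """
    replaceable = [i for i, term in enumerate(terms)
                   if term in english_equivalents]

    def apply(positions: Set[int]) -> str:
        # Swap in the English equivalent only at the chosen positions.
        return " ".join(english_equivalents[term] if i in positions else term
                        for i, term in enumerate(terms))

    level1 = [apply({i}) for i in replaceable]
    level2 = [apply(set(chosen))
              for size in range(2, len(replaceable) + 1)
              for chosen in combinations(replaceable, size)]
    return level1, level2


# Romanised form of the example query "videshi nivesh Bharat mein".
query = ["videshi", "nivesh", "bharat", "mein"]
equivalents = {"videshi": "foreign", "nivesh": "investment"}
level1, level2 = transform_levels(query, equivalents)
print(level1)
print(level2)
```

The sketch enumerates every possible replacement, whereas the example in the text shows a single variant per level; picking one variant per level is a straightforward restriction of the same idea.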
Figure 4.4 Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, showing Google result counts for the original query and its Level 1 and Level 2 variants)
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation
Influenced Hindi Query Google
In English Hindi Query Level 1 Level 2 Search Results
Health and blood donation
सव सथ औय यकतद न
हलथ औय यकतद न
हलथ औय जरि_िोनिन
1520 8360 1020
Treatment for Blood pressure
यकतच ऩ क इर ज
जरिपरिय क इर ज
जरिपरिय क टरीटभट
95800 37200 2150
Cardiologist रदम चचककतसक हाट िॉकटय काडिमोरासिटट 241000 99100 1770
Government Employment Policy
सयक य दव य य जग य म जन
सयक य दव य य जग य टकीभ
सयक य दव य एमपतरॉमभट सकीभ
1840000 51600 93
Foreign investment in
India
षवद श तनव श ब यत भ
पायन तनव श ब यत भ
पायनइनवटटभट ब यत भ
658000 527 66
Corruption free India
भरषटट च य भ कत ब यत
कयतिन भ कत ब यत
कयतिन फरी इॊडिमा
181000 19700 47000
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of documents retrieved is considerable. For search engines the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28 Analysis of precision values: tabular representation
Hindi
Query
Influenced Hindi Query Precision 10
Level 1 Level 2 Google Bing AltaVista
सव सथऔययकतद न
ह लथऔययकतद न
ह लथऔयबरड_ड न शन
09 09 09 09 08 08 09 09 08
यकतच ऩक इर ज
बरडपर शयक इर ज
बरडपर शयक टरीटभट
08 09 07 08 07 05 08 07 05
रदमचचककतसक
ह टमचचककतसक
क रड मम र िजसट 09 08 1 09 07 1 09 07 1
सयक यदव य य जग यम जन
सयक यदव य य जग यसकीभ
सयक यदव य एमपरॉमभटसकीभ 1 09 08 08 08 08 06 07 08
षवद श तनव शब यतभ
प य नतनव शब यतभ
प य नइनव सटभटब यतभ
09 08 08 09 08 04 09 09 04
भरषटट च यभ कतब यत
कयपशनभ कतब यत
कयपशनफरीइ रडम
1 1 1 1 1 1 1 1 1
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision column. For every Hindi query and its transformed variations, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same holds for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variations that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
Inter Precision Chart: precision at 10 of Google, Bing and AltaVista for each Hindi query (HQ1 to HQ6) and its Level 1 and Level 2 variants (L1, L2)
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. This problem can, however, be solved at the interface level. Therefore, to lessen the effort a Hindi user must make to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in Hindi [66][67].
Microsoft, Google and other developers have begun to produce results. Page views of major Hindi news Web sites are rising fast, and most of the popular Hindi newspapers have a Web presence now. "In the last two years, page views of navbharattimes.com have increased significantly, and half of them come through Google, as Net users generally search for a specific news item or query," says Nagar. Yahoo, with headquarters in California, formed a partnership with Dainik Jagran a year and a half ago for the newspaper's Hindi portal. "The Jagran relationship helps us gain significant traction among Indian Internet users. From all the audience measures for this product, this has been a resounding success," says Gopal Krishna, head of Yahoo's work for emerging market audiences. Since Yahoo and Jagran started working together, page views have "grown to about 14 million from one million a year and a half earlier," says Upendra Swami, who heads the Internet team at Jagran. Hindi Wikipedia, hosted by the nonprofit Wikimedia Foundation, is also gaining popularity. Started in July 2003, Hindi Wikipedia now has more than 36000 articles. "It now appears to be the 52nd largest Wikipedia in size, compared to the over 260 individual language Wikipedias," says Jay Walsh, head of communications at the California-based Wikimedia Foundation. "Considering there are millions of Hindi speakers, it is certainly an important part of the Wikimedia Foundation's mission to support the growth of this project," says Walsh. Urdu Wikipedia, started in January 2004, has more than 10800 articles. What are the challenges that still remain in the popularization of Hindi and Urdu on the Internet? "The major challenge is Internet penetration and PC prices. The moment we have better Internet penetration, especially in smaller towns, and PC prices go down, Hindi and Indian languages can flourish on the Net," says Karnik of webdunia.com. India had more than 49 million Internet users in June 2008, out of which about 9 million used the Internet regularly, according to a study by Juxtconsult India, a research company. "There is a big opportunity in Indian languages. Studies showed that only 28 percent of Indian Web surfers preferred English on the Web, but as good quality content in Indian languages was not easily available, they did not visit many local language sites," says Mrutyunjay Mishra, co-founder of Juxtconsult. IT experts
agree. "Localization is the key to success in countries like India. In order to get the widest audience reach, one has to look at Hindi, because in a country of over a billion people, English is spoken by less than 80 million people," says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the democratization of access to information," he says, adding that the Internet is not a luxury but a powerful tool to improve life. But is Hindi earning enough revenue to be dubbed successful on the Web? "That's a tough question. Right now it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come once Hindi reaches a critical volume. "If we look at how the Internet developed in the US, it may provide a useful analogy. First came content, which was mostly produced by people who had a passion for putting up content they cared about. Traffic and monetization was not the motive. Second came growing readership, as people started discovering content. This set off a virtuous cycle in which content eventually became a viable, monetizable business. Third were the application developers, who could now focus on moving the online experience beyond passive consumption of information to interactivity, community building, service delivery and a host of other innovations," Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first Indian language corpus. It was developed for Indian English under the leadership of Prof. S.V. Shastri at Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one million words of Indian English drawn from materials published in the year 1978. It was collected for a comparative study of American, British and Indian English (Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for the development of Indian language corpora. It has coordinated with various Indian agencies and universities to develop more than 45 million corpora in the Scheduled Languages of India, which is also a part of the TDIL program. The Enabling Minority Language Engineering
(EMILLE) program provides corpus architecture and tools for Asian languages. It has a monolingual corpus of approximately 96157000 words and a parallel corpus consisting of 200000 words of English text, which helps in the translation of Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a multilingual parallel corpus, a repository of 'One Million Pages' of knowledge-based text. Mahatma Gandhi International University has started the project 'Hindi Samghraha' as a repository for a database of Hindi words and for dialect mapping of Hindi. The Department of Information Technology of the Government of India has started a project for developing Indian language corpora, the Indian Language Corpora Initiative (ILCI). ILCI is a consortium project for building parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU, New Delhi. It involves 11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively young. Efforts in this field date from 1980 and involved prominent institutions such as IIT Kanpur, University of Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s many new projects were undertaken by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai and Jadavpur University Kolkata. Since April 2008 TDIL has run a consortium-mode project for building computational tools and Sanskrit-Hindi MT under the leadership of Amba Kulkarni (University of Hyderabad). The goal of this project is to build children's stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology from English to Indian languages under the leadership of Prof. R.M.K. Sinha. It is
a rule-based system with approximately 1750 rules and 54000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: morphological analyzer, parser, pseudo code generator, sense disambiguator, target text generators and post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available for use at httpanglahindiiitkacin. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti during 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic general-purpose high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was initiated by the Indian Government in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to
provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent both English and Hindi grammar. Initially it was domain specific, covering Personal Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier Mantra technology was only for English to Hindi translation, but currently it is also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an interlingua for the world's languages. The UNL-based machine translation system is being developed under the leadership of Prof. Pushpak Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applied in the domain of government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence the system generates, the user can supply new words, phrases, and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical methods. It is used for translation from English to Indian languages (Hindi, Marathi, and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has developed a machine-aided Tamil-to-Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach, supplemented by a small set of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging and translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi-to-Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. Its architecture consists of three processing modules: Pre Processing, Translation Engine, and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine Guruji
Guruji.com is the first Indian language search engine. It was founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it searches deep into Indian-language content and tries to return appropriate results. The Guruji search engine covers a range of content: news, entertainment, travel, astrology, literature, business, education, and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam, and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu, and is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages: the source language in WikiBhasa is English, and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that supports the development of Indian language technology in areas such as text translation, software localization, and website localization. It is also involved in research and development on corpus creation and collection and on content syndication, and it offers language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. In particular, it has developed the Shree-Lipi Gurjarati package for the Gujarati language, which is used in the DTP sector, in corporate offices, and in the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this need. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. As a result, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian Languages Transliteration (ITRANS), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working to spread language technology in India. Other government ministries, departments, and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education, and the UGC (University Grants Commission), are involved directly or indirectly in research and development of language technology. These agencies help develop important areas of research and provide research funds to development agencies. As one result, IndoWordNet was developed for Indian languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for it. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate, and long-term goals for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In the last year alone, there must have been more than 10 to 15 Indian local search engines launched, some small and some large, some with huge funding and some with none. The space is now so crowded that it is difficult to know who is really winning. We nevertheless attempt a brief overview of the current scenario.
Search engines that fall in the localized Indian category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo, and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo, and MSN.
Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website     2008 Stats   2009 Stats
Google      37           35
Yahoo       32           25
Rediff      7            4
Orkut       6            7
More Info: India online 2008, India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website     2008 Stats   2009 Stats
Google      81           76
Yahoo       7            7
Wikipedia   3            6
English     3            4
More Info: India online 2008, India online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website     2008 Stats   2009 Stats
Google      65           34
Yahoo       12           29
Rediff      4            15
Teluguone   2            0
Guruji      1            18
Raftaar     1            0.2
Hindi       1            NA
Webdunia    1            0.7
Khoj        0.7          2
More Info: India online 2008, India online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed with Indian language queries. For instance, popular web search engines such as Google, Yahoo, and Guruji often return 0 results for Indian language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेंटर आतंकवादी हमला) returns 0 documents in Table 4.15, whereas a small rephrasing of the query in Table 4.16 shows that documents containing these keywords are found by the second search. Merely saying that there is a recall problem is not sufficient, however. The obvious next question is how large the problem is; in other words, it needs to be quantified. For this purpose we conducted a number of experiments to determine at what levels this recall problem occurs and by how much.
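One way to put a number on the recall problem is to compare what each query variant retrieves with a set of documents judged relevant. The sketch below is only illustrative: the document identifiers, the relevance judgments, and the helper name `recall` are all hypothetical and are not taken from the experiments reported in this chapter.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)

# Hypothetical relevance judgments for one information need.
relevant_docs = {"d1", "d2", "d3", "d4"}

# Hypothetical result lists for two spellings of the same query.
results_by_variant = {
    "variant_a": [],                  # the engine reports 0 results
    "variant_b": ["d2", "d9", "d4"],  # a small rephrasing of the query
}

recall_by_variant = {
    name: recall(docs, relevant_docs)
    for name, docs in results_by_variant.items()
}

# Recall when the result lists of all variants are pooled together.
pooled = recall(
    [doc for docs in results_by_variant.values() for doc in docs],
    relevant_docs,
)
```

Comparing the per-variant values with the pooled value indicates how much a single spelling of the query loses relative to all spellings taken together.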
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query                                        Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकवादी हमला                        0        0       0
इन्डियन इंस्टिच्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च               0        0       0
Table 4.16 Improved Recall
Hindi Query                                        Google   Yahoo   Guruji
वर्ल्ड ट्रेड सेंटर आतंकी हमला                           8820     92      12
वर्ल्ड ट्रेड सेंटर आतंकी अटैक                           331      10      1
इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च               708      50      1
भारतीय संस्थान स्वास्थ्य शिक्षा और शोध                   37100    7400    93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, its grammatical and structural standards are seldom followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of the Republic of India, is influenced by the regional languages, which changes its dialects in writing as well as in speech. Every language attaches markers to a root word to construct new words: English uses s, es, and ing, and Hindi uses maatraas (vowel signs) and endings such as ओं and एँ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
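Word stemming of this kind can be sketched as simple suffix stripping. The suffix list below is a small illustrative assumption, not a complete or validated rule set for Hindi, and the function name `stem` is hypothetical; it only shows how variants such as योजनाओं and योजनाएँ can be mapped to the root योजना.

```python
import unicodedata

# Illustrative plural/oblique endings only (an assumption for this sketch);
# a real stemmer needs a much larger, linguistically validated list.
SUFFIXES = sorted(
    ["ियों", "ियाँ", "ओं", "एँ", "एं", "ों", "ें"],
    key=len,
    reverse=True,  # try the longest endings first
)

def stem(word, min_root_len=2):
    """Strip one known ending, keeping at least min_root_len characters."""
    word = unicodedata.normalize("NFC", word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root_len:
            return word[: -len(suffix)]
    return word

variants = ["योजना", "योजनाओं", "योजनाएँ"]
roots = {stem(v) for v in variants}  # all three collapse to one root
```

Indexing and querying on the stemmed form lets a search for any one variant match documents that contain the others.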
4.8.2 Phonetic nature of Hindi Language and Spelling variations
Spelling variations can be attributed mainly to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these produce several spellings of the same word. The Hindi word अंग्रेज़ी (angrējī, meaning English), for example, has several possible spelling variations.
Numerous words are phonetically equivalent but differ in writing. The word school can be written in Hindi in several different ways. When information is searched for with the standard keyword स्कूल (school) and with a non-standard phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words: for example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by a majority of the population in different states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may use any of these keywords, and the search engine should support all phonetically equivalent forms.
No particular standard exists for writing the keywords used to fetch Hindi web data. The results vary for every phonetically equivalent keyword in a query; that is, a different set of documents is retrieved, with very little repetition. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.
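One common way to make phonetically equivalent spellings match is to normalize both the query and the indexed text to a single written form. The sketch below covers three typical sources of variation: the nukta, the chandrabindu, and a nasal consonant written out before a consonant of its own class instead of as an anusvara. The rule set and the function name `normalize_spelling` are assumptions made for illustration; they are not the method used by any of the search engines discussed here.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
HALANT = "\u094d"        # ्

# Each nasal consonant paired with the stops of its own varga (class).
VARGAS = {
    "ङ": "कखगघ",
    "ञ": "चछजझ",
    "ण": "टठडढ",
    "न": "तथदध",
    "म": "पफबभ",
}

def normalize_spelling(text):
    """Map common spelling variants of a word to one canonical form."""
    # NFC decomposes precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFC", text)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    for nasal, stops in VARGAS.items():
        # A half nasal before a stop of its own class becomes an anusvara.
        text = re.sub(f"{nasal}{HALANT}(?=[{stops}])", ANUSVARA, text)
    return text

forms = ["आन्दोलन", "आंदोलन", "आँदोलन"]
canonical = {normalize_spelling(f) for f in forms}  # a single form remains
```

Normalization of this kind merges only variants that follow a regular pattern; dialect forms such as इस्कूल for स्कूल would still need a separate variant dictionary.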
4.8.3 Words Synonyms
A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
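A retrieval system can compensate for the user's word choice by expanding the query with synonyms. The sketch below is a minimal illustration: the synonym dictionary holds a single hand-written entry, and the names `SYNONYMS` and `expand_query` are hypothetical. A real system would draw its synonyms from a lexical resource.

```python
# Tiny hand-written synonym dictionary (an assumption for this sketch).
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार", "आभरण"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the original query plus one variant per synonym substitution."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variant = " ".join(tokens[:i] + [alternative] + tokens[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants

expanded = expand_query("सोने के आभूषण")
```

Each expanded query can be submitted separately and the result lists merged, so that documents using any of the synonyms are retrieved.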
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it also means "to sleep". In the context of the query सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In this query it means common, but in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".
Many words are polysemous. A word can have more than one meaning, the meaning depends on the context of the sentence, and finding the correct sense in a given context is an intricate task. For example, कर means tax in one context (with synonyms such as शुल्क, सूद, महसूल, and टैक्स), hand or arm in another (with synonyms such as हस्त and बाहु), and to do (करना) in a third.
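A simple way to guess the intended sense of an ambiguous keyword is to count how many words in the rest of the query overlap with cue words associated with each sense (a simplified, Lesk-style overlap). The cue words and the names `SENSES` and `guess_sense` below are illustrative assumptions, not a tested lexicon.

```python
# Hand-picked cue words per sense (assumptions for this sketch only).
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहने", "धातु", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर", "आराम"},
    },
}

def guess_sense(keyword, query, senses=SENSES):
    """Return the sense whose cue words overlap most with the query."""
    context = set(query.split()) - {keyword}
    scores = {
        sense: len(cues & context) for sense, cues in senses[keyword].items()
    }
    best = max(scores, key=scores.get)
    # With no overlapping cue word the query stays ambiguous.
    return best if scores[best] > 0 else None

sense_a = guess_sense("सोना", "सोना आभूषण कीमत")
sense_b = guess_sense("सोना", "रात नींद सोना")
sense_c = guess_sense("सोना", "नारी को सोना पसंद है")
```

The third query carries no cue word for either sense, so the function returns `None`; this is exactly the situation in which a search engine cannot tell the two readings apart from the query alone.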
4.8.5 Influence of English on Hindi Information retrieval
English has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Speakers are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director, and department are used even by people with no formal education, without awareness of which language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S. No.   Query in Hindi                      Meaning in English
1        भारत वर्षावन                         Indian rain forest
1.1      भारतीय वर्षावनों                      Indian rain forests
2        हवाई दुर्घटना का कारण                 Reason for air crash
2.1      हवाई दुर्घटनाओं का कारण               Reason for air crashes
3        भारत में बोली जाने वाली भाषा          Language spoken in India
3.1      भारत में बोली जाने वाली भाषाएँ         Languages spoken in India
3.2      भारत में बोली जाने वाली भाषाओं        Languages spoken in India
4        विलुप्त होने पर झील                   Lake on the verge of extinction
4.1      विलुप्त होने पर झीलें                  Lakes on the verge of extinction
5        प्राकृतिक आपदा                       Natural calamity
5.1      प्राकृतिक आपदाएँ                      Natural calamities
6        पक्षी की प्रजाति                      Bird species
6.1      पक्षियों की प्रजाति                    Bird's species
6.2      पक्षियों की प्रजातियाँ                  Bird's species
7        कृषि समस्या                          Agriculture problem
7.1      कृषि समस्याओं                        Agriculture problems
8        कीटनाशक का इस्तेमाल                  Use of pesticide
8.1      कीटनाशकों का इस्तेमाल                 Use of pesticides
9        मानसिक रोग                          Mental illness
9.1      मानसिक रोगियों                       Mental patients
9.2      मानसिक रोगी                         Mental patient
10       ग्रामीण विकास योजना                  Policy for village
10.1     ग्रामीण विकास योजनाओं                Policies for village
10.2     ग्रामीण विकास योजनाएँ                 Policies for village
11       प्रमुख कृषि केंद्र                      Major agricultural office
11.1     प्रमुख कृषि केन्द्रों                    Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
Listing of keywords
S. No.   Root word(s)     Google                                            Bing             Guruji
1        भारत वर्षावन     ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
2        दुर्घटना          दुर्घटना, दुर्घटनाओं                                 दुर्घटना          दुर्घटना
3        भाषा             भाषा, भाषाओं                                      भाषा             भाषा
4        झील              झील, झीलों                                        झील              झील
5        आपदा             आपदा, आपदाओं                                      आपदा             आपदा
6        पक्षी प्रजाति     पक्षी, पक्षियों, प्रजाति, प्रजातियों, प्रजातियाँ       पक्षी प्रजाति     पक्षी प्रजाति
7        समस्या           समस्याएँ, समस्याओं, समस्या                          समस्या           समस्या
8        कीटनाशक          कीटनाशक, कीटनाशकों                                कीटनाशक          कीटनाशक
9        रोग              रोगों, रोग                                        रोग              रोग
10       योजना            योजनाओं, योजना                                    योजना            योजना
11       केंद्र            केंद्रीय, केंद्र                                    केंद्र            केंद्र
Documents returned
No.      Morphological variant    Google    Bing     Guruji
1.1      भारत वर्षावन             50500     4680     485
1.2      भारतीय वर्षावनों          40400     680      61
2.1      दुर्घटना                  133000    2410     284
2.2      दुर्घटनाओं                117000    420      23
3.1      भाषा                     161000    8330     961
3.2      भाषाएँ                    6200      935      188
3.3      भाषाओं                   6090      441      356
4.1      झील                      4740      278      25
4.2      झीलें                     1270      28       1
5.1      आपदा                     102000    4030     410
5.2      आपदाएँ                    1160      64       20
6.1      पक्षी प्रजाति             48200     1670     98
6.2      पक्षियों प्रजाति           47600     1150     84
6.3      पक्षियों प्रजातियाँ         33800     747      25
7.1      समस्या                   584000    30200    1889
7.2      समस्याओं                 584000    7150     1356
8.1      कीटनाशक                  36300     1360     333
8.2      कीटनाशकों                35800     800      270
9.1      रोग                      205000    21600    1423
9.2      रोगियों                   128000    3280     239
9.3      रोगी                     112000    6280     647
10.1     योजना                    673000    18500    3343
10.2     योजनाओं                  669000    6020     990
10.3     योजनाएँ                   673000    2860     416
11.1     केंद्र                    261000    11300    655
11.2     केन्द्रों                  29500     1850     105
Table 4.19 Precision values of the three search engines
S. No.   Query                                Precision (top 10)
                                              Google   Bing   Guruji
1.1      भारत वर्षावन                          0.5      0.3    0.1
1.2      भारतीय वर्षावनों                       0.3      0.1    0.1
2.1      हवाई दुर्घटना का कारण                  0.5      0.3    0.1
2.2      हवाई दुर्घटनाओं का कारण                0.3      0.3    0.1
3.1      भारत में बोली जाने वाली भाषा           0.9      0.5    0.5
3.2      भारत में बोली जाने वाली भाषाएँ          0.5      0.4    0.3
3.3      भारत में बोली जाने वाली भाषाओं         0.5      0.2    0.2
4.1      विलुप्त होने पर झील                    0.5      0.3    0.2
4.2      विलुप्त होने पर झीलें                   0.5      0.3    0
5.1      प्राकृतिक आपदा                        1        0.4    0.3
5.2      प्राकृतिक आपदाएँ                       0.6      0.6    0.1
6.1      पक्षी की प्रजाति                       1        0.7    0.2
6.2      पक्षियों की प्रजाति                     0.9      0.6    0.1
6.3      पक्षियों की प्रजातियाँ                   0.9      0.6    0.1
7.1      कृषि समस्या                           0.7      0.3    0.2
7.2      कृषि समस्याओं                         0.7      0.4    0.2
8.1      कीटनाशक का इस्तेमाल                   1        0.8    0.2
8.2      कीटनाशकों का इस्तेमाल                  0.9      0.6    0.2
9.1      मानसिक रोग                           0.7      0.6    0.4
9.2      मानसिक रोगियों                        0.9      0.6    0.3
9.3      मानसिक रोगी                          0.9      0.5    0.4
10.1     ग्रामीण विकास योजना                   0.9      0.7    0.3
10.2     ग्रामीण विकास योजनाओं                 1        0.6    0
10.3     ग्रामीण विकास योजनाएँ                  0.8      0.6    0.2
11.1     प्रमुख कृषि केंद्र                       0.5      0.4    0
11.2     प्रमुख कृषि केन्द्रों                     0.4      0.2    0.2
Figure 4.1 Precision values of the three search engines (series P Google, P Bing, and P Guruji on a scale of 0 to 1, plotted for queries 1.1, 2.1, 3.1, 3.3, 4.2, 5.2, 6.2, 7.1, 8.1, 9.1, 9.3, 10.2, and 11.1)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general we get better results with keywords in their root form.
It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries listed in the table above.
These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results matters more than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On close observation, the precision of Google is high for almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
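The precision value over the top 10 results can be computed as in the sketch below. The relevance judgments in the example are made up for illustration, and `precision_at_k` is a hypothetical helper name.

```python
def precision_at_k(judgments, k=10):
    """Share of the top-k results judged relevant.

    judgments is a ranked list of 1 (relevant) or 0 (not relevant).
    Missing positions count as not relevant, so the divisor is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(judgments[:k]) / k

# Made-up judgments for the first ten results of one query.
example = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
p10 = precision_at_k(example)

# An engine that returns only three results, all relevant.
p10_short = precision_at_k([1, 1, 1])
```

Dividing by k even when fewer than k results are returned penalizes an engine that retrieves very few documents, which matches how low document counts and low precision appear together in the tables above.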
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be able to retrieve results for words that are phonetically equivalent to the keywords entered. A user may use any of these keywords, and the search engine should support all phonetically equivalent forms.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Columns: Google results for query having keywords; No. of results; Precision (top 10)

Hindi query (standard keywords in bold): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर              97         0.9
ससजिमो जहयीर             632        0.9
सिबजमो िहयीर             194        0.9

Hindi query (standard keywords in bold): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई             35300      1.0
आसभ न भहॉगाई             1040       1.0
आसभ भह ग ई               14         0.6
आसभाॊ भह ग ई              563        0.7

Hindi query (standard keywords in bold): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी            211000     0.8
भरशटाचायआज दी            214        0.6
बयषटाचाय आज दी           447        0.7
भरषटट च यआिादी           1090000    0.9
भरशट च य आिादी           1040       0.7
बयषटट च यआिादी           1190       0.8

Hindi query (standard keywords in bold): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन          84700      0.3
अनन हज य आॊदोरन           85100      0.8
अनन हज य क आॉदोरन         78         0.6
अनना हिाय क आनद रन       399        0.5
अनन हिाय क आनद रन        3260000    1.0

Hindi query (standard keywords in bold): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी                 9650       0.9
फयोिगायी                 80600      1.0
फयोिगायी                 170        0.7
फयोिगायी                 30         0.5
Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart each for Query No 1 to Query No 5)
The table and figure above show that the search engine returns documents for each of the phonetically equivalent forms of a Hindi query, in widely varying numbers. No particular standard exists for writing the keywords used to fetch Hindi web data. The results vary for every phonetically equivalent keyword in a query; that is, a different set of documents is retrieved, with very little repetition. The precision charts show that the degree of relevance is nearly the same for queries containing phonetically equivalent keywords. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss information relevant to his or her needs.
4.11 Discussion: Words Synonyms
A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
Documents returned and precision (per 10 results) for each search engine
S. No.   Query                   Google    Per 10   Bing    Per 10   Guruji   Per 10
1        सोने के आभूषण (standard Hindi word and synonyms: आभूषण, गहने)
1.1      सोने के आभूषण           217000    0.8      3250    0.7      381      0.5
1.2      सोने के गहने             188000    0.8      2590    0.6      389      0.5
1.3      सोने के जेवर             78900     0.8      1670    0.7      311      0.6
1.4      सोने के अलंकार           9490      0.5      633     0.4      70       0
1.5      सोने के आभरण            493       0.5      38      0.3      1        0
2        काले बादल (standard Hindi word: बादल)
2.1      काले बादल               233000    0.7      7510    0.7      733      0.3
2.2      काले मेघ                 40700     0.9      1500    0.8      99       0.6
2.3      काले जलधर               1570      0.6      54      0.6      2        0.2
3        स्त्री सशक्तिकरण (standard Hindi word and synonyms: स्त्री, नारी)
3.1      स्त्री सशक्तिकरण          9950      0.9      1570    0.7      760      0.6
3.2      नारी सशक्तिकरण           29300     0.9      1910    0.9      736      0.4
3.3      महिला सशक्तिकरण          96300     0.9      5160    0.8      1091     0.3
3.4      औरत सशक्तिकरण            7670      0.8      680     0.7      510      0.2
4        सिकंदर का अहंकार (standard Hindi word: अहंकार)
4.1      सिकंदर का अहंकार         1990      0.4      18      0.3      60       0.6
4.2      सिकंदर का अभिमान         2400      0.6      304     0.2      16       0.1
4.3      सिकंदर का घमंड           495       0.5      54      0.6      9        0.1
5        वृक्ष लगाओ (standard Hindi word: वृक्ष)
5.1      वृक्ष लगाओ               6960      1        698     0.8      29       0.5
5.2      पेड़ लगाओ                13400     1        1080    0.9      143      0.9
5.3      दरख्त लगाओ              481       0.5      19      0.5      0        0
6        आँख दान (standard Hindi word: आँख)
6.1      आँख दान                 34000     0.7      3690    0.5      312      0.3
6.2      नेत्र दान                 77500     1        3240    1        159      0.9
6.3      चक्षु दान                 2450      0.8      427     0.1      36       0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Both the quantity and the quality of the documents returned are affected by the choice of synonym.
The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. We are, however, more interested in the quality of the results than in their quantity. In terms of quality, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first: its precision values are higher than those of Bing and Guruji. It is also clear that changing a keyword into a synonym can yield equivalent results. Synonyms of keywords therefore play an important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S. No.   Query                     For keyword as      In English                         For keyword as      In English
1        नारी को सोना पसंद है       सोना (Gold)         Women like gold                    सोना (To sleep)     Women like to sleep
2        आम लोगों की पसंद          आम (Common)         Common man's choice                आम (Mango)          Mango is people's choice
3        बाल विकास और पोषण         बाल (Children)      Child development and nutrition    बाल (Hair)          Hair development and nutrition
4        सपेरों का फन              फन (Art)            Art of snake charmers              फन (Snake head)     Snake charmer's snake head
5        युद्ध में कुल विनाश         कुल (Aggregate)     Aggregate destruction in wars      कुल (Family)        Destruction of families in war
The ambiguous queries listed above were tested on three search engines: Google, Bing, and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                50800                Gold        5               To sleep     2               3
2       आम                  488000               Common      3               Mango        3               4
3       बाल                 2900000              Children    7               Hair         3               0
4       फन                  184                  Art         0               Snake head   10              0
5       कुल                 17800                Aggregate   2               Family       3               5
Table 4.24 Ambiguity test for Bing
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                2680                 Gold        2               To sleep     2               6
2       आम                  17800                Common      3               Mango        3               4
3       बाल                 4030                 Children    6               Hair         2               2
4       फन                  25                   Art         0               Snake head   9               1
5       कुल                 1900                 Aggregate   0               Family       2               8
Table 4.25 Ambiguity test for Guruji
Query   Ambiguous keyword   Documents returned   Context     Results found   Context      Results found   Other context
1       सोना                109                  Gold        0               To sleep     0               10
2       आम                  6756                 Common      3               Mango        0               7
3       बाल                 635                  Children    5               Hair         2               3
4       फन                  No results found     Art         na              Snake head   na              na
5       कुल                 84                   Aggregate   0               Family       0               10
The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other Context", holds the number of results that are not relevant to the query supplied or that contain the keyword in some other, unwanted context. All the search engines return documents from different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.
For another query, सपेरों का फन (art of snake charmers), which can also be read as the snake charmer's snake head, the retrieved documents are expected to be in the context of art. The results show, however, that Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address ambiguity in keywords in order to obtain better results.
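The three ambiguity tables can be summarized by the share of inspected results that fall in the first-listed context of each query. The sketch below uses the counts reported in Tables 4.23 to 4.25; treating the first-listed context as the intended one is an assumption of this sketch, and `intended_share` is a hypothetical helper name.

```python
# (first context, second context, other context) counts out of 10 results,
# one tuple per query; None marks the query with no results on Guruji.
RESULTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}

def intended_share(rows):
    """Share of inspected results that fall in the first-listed context."""
    scored = [row for row in rows if row is not None]
    inspected = sum(sum(row) for row in scored)
    return sum(row[0] for row in scored) / inspected

shares = {engine: intended_share(rows) for engine, rows in RESULTS.items()}
```

Under this assumption roughly a third of the Google results, and a smaller share for Bing and Guruji, fall in the first-listed context, which is consistent with the observation that none of the engines separates the contexts of an ambiguous keyword.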
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but in writing too, and this is
most evident in Hindi literature on the web. The influence of English on Hindi
has been observed to be one of the most important parameters for Hindi
information retrieval, as the following example shows.
Example: The English word exercise is written in Hindi as (एकसयस इज). It has
the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज
एकसयस इज. Based on phonetic variations of this kind, a variety of popular
keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | Google results: standard | Google results: equivalent 1 | Google results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

Table 4.26 Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns
documents for single-keyword queries, and that documents are also returned, in
large numbers, for every phonetic variant of a keyword. It can also be seen
that people have their own ways of representing Hindi words and that no
standard is followed for storing Hindi data on the web. Documents are retrieved
for every phonetically variant English keyword written in Hindi script. The
Google Transliteration column of the table holds the keywords obtained with the
Google transliteration tool. Table 4.26 shows that transliteration does not
provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be म तनवमसमटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. The
table shows that 1540 documents were nevertheless retrieved for the wrong
keyword उतनव मसमतम (University), and the same holds for other single-word
queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन),
Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi
website developers make use of unchecked and non-standard transliteration,
which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect
of English influence on the precision and quantity of documents retrieved. The
Hindi query is transformed into its variants by the software (its design and
working are discussed in the next chapter), which replaces Hindi keywords with
English keywords written in Hindi without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by English equivalents. In both cases the meaning of the original
Hindi query is unchanged. Example:
The English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न)
and तनव श means "investment" (इनव सटभट). The query is transformed for the two
levels as
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh
Bharat mein) supplied by the user is transformed into two equivalent queries
that mix English and Hindi while the meaning of the query remains the same.
Some randomly selected queries from the sample set of one hundred queries are
presented below in Table 4.27.
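
The two-level transformation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the software described in the next chapter: the function name `transform_query`, the dictionary `EQUIVALENTS` and the romanized tokens standing in for the Devanagari keywords are all hypothetical.

```python
from itertools import combinations


def transform_query(tokens, equivalents):
    """Return (level1, level2) variants of a tokenized query.

    level1: exactly one replaceable keyword is swapped for its English equivalent.
    level2: two or more replaceable keywords are swapped.
    """
    # Positions of tokens that have an English equivalent in the dictionary.
    positions = [i for i, tok in enumerate(tokens) if tok in equivalents]
    level1, level2 = [], []
    for size in range(1, len(positions) + 1):
        for chosen in combinations(positions, size):
            variant = [equivalents[tok] if i in chosen else tok
                       for i, tok in enumerate(tokens)]
            (level1 if size == 1 else level2).append(" ".join(variant))
    return level1, level2


# Hypothetical dictionary; romanized stand-ins for the Devanagari keywords.
EQUIVALENTS = {"videshi": "foreign", "nivesh": "investment"}

level1, level2 = transform_query("videshi nivesh bharat mein".split(), EQUIVALENTS)
print(level1)
print(level2)
```

Note that the sketch enumerates every single-word replacement at Level 1, whereas the example in the text shows only one of them.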
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)

| In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (chart of Google
result counts for Hindi Query 1 to Hindi Query 6 and their Level 1 and Level 2
variants)
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines, however, the quality
of results matters more than the quantity, so Table 4.28 and Figure 4.5 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, are used for retrieving web results.
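Table 4.28 Analysis of precision values (tabular representation). Values are
Precision@10 for the Hindi query (HQ) and its Level 1 (L1) and Level 2 (L2)
variants.

| Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |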
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be seen that Hindi data of a similar
nature can be mined for Hindi queries by transforming them into variants that
include English keywords written in Hindi. Transforming the queries increased
the amount of data retrieved. The relevance of the retrieved data can be seen
in the precision columns: for every Hindi query and its transformed variations,
the degree of relevance of the documents is very close, equal or improved. For
example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents
returned by Google are relevant, and for the transformed queries of similar
nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10
documents are again relevant. The same pattern repeats for the rest of the
Hindi queries, as shown in the table above. Without transformation of Hindi
queries the user may miss relevant information, because a Hindi user may not be
aware that such information is present on the web and may be unable to
formulate the variant query that reflects the influence of English. From the
above table it can be said that English-influenced Hindi information is present
on the web and is increasing day by day. By including English keywords in Hindi
script in the query, the scope of searching in Hindi and of getting relevant
information can be increased.
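
The precision values discussed above are Precision@10: the share of the first 10 results that are judged relevant. A minimal sketch of that calculation follows; the function name `precision_at_k` and the list of judgments are illustrative assumptions, not data from the study.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k ranked results that are judged relevant.

    relevance: list of booleans in rank order (True = relevant).
    Missing results (fewer than k returned) count as non-relevant.
    """
    return sum(relevance[:k]) / k


# Illustrative judgments: 9 of the first 10 documents are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned means that a query with no results scores 0, which matches how an empty result page is treated as a retrieval failure.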
Inter Precision Chart: HQ (Hindi Query) and L (Level) for HQ1 to HQ6, each with
L1 and L2; series: Google, Bing, AltaVista
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the actual Hindi writing
standard, which widens the gap between Hindi web data and users.
The relevant information can be mined by transforming the Hindi queries.
Search engines neither transform the query nor find keyword equivalents,
because they may run into performance and throughput problems if parameters
such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are
implemented at the root level. However, this problem can be solved at the
interface level. Therefore, to lessen the effort a Hindi user needs to search
for such information, a software tool has been developed (described in detail
in the next chapter) that acts as an interface between the user and the search
engines. With the help of this tool the user can widen the scope of search on
the web in the Hindi language [66][67].
agree. "Localization is the key to success in countries like India. In order to
get the widest audience reach, one has to look at Hindi, because in a country
of over a billion people English is spoken by less than 80 million people,"
says Krishna of Yahoo. Google's Roy-Chowdhury agrees with him. "The Web is the
democratization of access to information," he says, adding that the Internet is
not a luxury but a powerful tool to improve life. But is Hindi earning enough
revenue to be dubbed successful on the Web? "That's a tough question. Right now
it is not much," says Mishra. But Roy-Chowdhury thinks revenue is bound to come
once Hindi reaches a critical volume. "If we look at how the Internet developed
in the US, it may provide a useful analogy. First came content, which was
mostly produced by people who had a passion for putting up content they cared
about. Traffic and monetization was not the motive. Second came growing
readership, as people started discovering content. This set off a virtuous
cycle in which content eventually became a viable, monetizable business. Third
were the application developers, who could now focus on moving the online
experience beyond passive consumption of information to interactivity,
community building, service delivery and a host of other innovations,"
Roy-Chowdhury says. "India's market was stuck in phase one for a long time. And
I believe it has recently entered phase two." [63]
4.5 Development of Language Corpora in Indian Languages
The Kolhapur Corpus of Indian English (KCIE) was the first corpus of Indian
English. It was developed under the leadership of Prof. S. V. Shastri at
Shivaji University, Kolhapur, India, in 1988. KCIE contains approximately one
million words of Indian English drawn from materials published in the year
1978, collected for a comparative study of American, British and Indian English
(Dash). The Central Institute of Indian Languages (CIIL) is a nodal agency for
the development of Indian language corpora. It has coordinated with various
Indian agencies and universities to develop corpora of more than 45 million
words in the Scheduled Languages of India, which is also a part of the TDIL
program. The Enabling Minority Language Engineering (EMILLE) program provides
corpus architecture and tools for Asian languages. It has a monolingual corpus
containing approximately 96,157,000 words and a parallel corpus consisting of
200,000 words of text in English, which helps in translation for Bengali,
Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12 Indian
languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Nepali,
Oriya, Malayalam, Bangla and Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus and a repository of "One Million Pages" of
knowledge-based text. Mahatma Gandhi International University has started the
project "Hindi Samghraha", a repository of a Hindi words database and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr. Girish Nath Jha, JNU,
New Delhi. It involves 11 Indian languages and English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is comparatively
young. Early efforts in this field date from 1980 and involved prominent
institutions such as IIT Kanpur, the University of Hyderabad, NCST Mumbai and
C-DAC Pune. During the late 1990s many new projects were undertaken by IIT
Mumbai, IIIT Hyderabad, the AU-KBC Centre, Chennai, and Jadavpur University,
Kolkata. In April 2008 TDIL started a consortium-mode project for building
computational tools and a Sanskrit-Hindi MT system under the leadership of Amba
Kulkarni (University of Hyderabad). The goal of this project is to build
children's stories using multimedia and e-learning content.
4.5.2 Anglabharti
IIT Kanpur has developed the Anglabharti machine translation technology, from
English to Indian languages, under the leadership of Prof. R. M. K. Sinha. It
is a rule-based system with approximately 1750 rules and 54,000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Languages) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application available for use at http://anglahindi.iitk.ac.in. The Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R. M. K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is
based on a hybridized example-based approach. The second phase of both projects
(Anglabharti II and Anubharti II) started in 2004 with new approaches and some
structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development
project for Indian languages and English undertaken by CIF (Chinmaya
International Foundation). It is a fully automatic, general-purpose,
high-quality machine translation (FGH-MT) system. Its software can translate
text from one Indian language into another, based on Panini's Ashtadhyayi
(grammar rules). It is developed at the International Institute of Information
Technology, Hyderabad (IIIT-H), and the Department of Sanskrit Studies,
University of Hyderabad.
4.5.5 Mantra
The Machine Assisted Translation Tool (Mantra) was a brainchild of the Indian
Government in 1996, intended for the translation of government orders,
notifications, circulars and legal documents from English to Hindi. The main
goal was to provide translation tools to government agencies. Mantra software
is available in desktop, network and web-based forms. It is based on the
Lexicalized Tree Adjoining Grammar (LTAG) formalism, which is used to represent
the English as well as the Hindi grammar. Initially it was domain specific,
covering Personnel Administration (specifically Gazette Notifications, Office
Orders, Office Memorandums and Circulars); gradually the domains were expanded.
At present it also covers domains such as Banking, Transportation and
Agriculture. Earlier, Mantra technology was available only for English to Hindi
translation, but it is now also available for English to other Indian languages
such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for
translating parliamentary proceedings such as Papers to be Laid on the Table
[PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis.
The Rajya Sabha Secretariat (the Rajya Sabha being the upper house of the
Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine
translation system for English to Hindi. UNL is a United Nations project for
developing an interlingua for the world's languages. UNL-based machine
translation is being developed under the leadership of Prof. Pushpak
Bhattacharya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad University has
developed an English-Kannada MT system. It is based on the transfer approach
and Universal Clause Structure Grammar (UCSG). The project is funded by the
Karnataka Government and is applicable in the domain of government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user
is not satisfied with a translated sentence generated by the system, the user
can supply new words, phrases and sentences as feedback and obtain a newly
interpreted translation. The Shiva MT system is available at
http://ebmt.serc.iisc.ernet.in/mt/login.html. Shakti is a rule-based system
that also draws on a statistical approach. It is used for translation from
English to Indian languages (Hindi, Marathi and Telugu). Users can access the
Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The system is
based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can
access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali.
It is developed in the Perl programming language, which supports the processing
of Unicode-encoded text for text manipulation. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007 Josan and Lehal at Punjabi University, Patiala, designed a Punjabi to
Hindi machine translation system. The system is built on the paradigm of
foreign machine translation systems such as RUSLAN and CESILKO. The system
architecture consists of three processing modules: Pre Processing, Translation
Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT - Indian Language
Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine. It was founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query it goes deep into Indian language content and tries to return
appropriate output. The Guruji search engine covers a range of specific
content: news, entertainment, travel, astrology, literature, business,
education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such as
Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides an automated translation facility from English to Indian
languages. The Google Transliteration Input Method Editor is currently
available for languages including Bengali, Gujarati, Hindi, Kannada, Malayalam,
Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer
applications. The tool supports major Indian languages such as Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu, and is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool
for translating Wikipedia pages into multilingual pages. In WikiBhasa the
source language is English and the target language can be any Indian local
language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of Indian
language technology in areas such as text translation, software localization
and website localization. It is also involved in research and development in
corpus creation and collection and in content syndication. Moreover, it
provides language consultancy. It has developed various applications in Indian
languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games,
Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement technology
to many state governments and to the central government for e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers, as well as high-quality Unicode-based fonts for major
Indian languages. It has specifically developed the Shree-Lipi Gujarati package
for the Gujarati language, which is useful in the DTP sector, in corporate
offices and in the e-governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this. Since 1970 the Department of
Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently ISCII (Indian Script Code
for Information Interchange) was developed for Indian languages on the pattern
of ASCII (American Standard Code for Information Interchange). In addition,
Indian Languages Transliteration (ITRANS), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et al., 2005).
The Department of Information Technology under the Ministry of Communication
and Information Technology is also working towards the proliferation of
language technology in India. Other Indian government ministries, departments
and agencies, such as the Ministry of Human Resource, DRDO (Defense Research
and Development Organization), the Department of Atomic Energy, the All India
Council of Technical Education and the UGC (University Grants Commission), are
also involved, directly and indirectly, in research and development of language
technology. All these agencies help develop important areas of research and
provide research funds to development agencies. As an end result, IndoWordNet
was developed for the Indian languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides the standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone, more than 10 to 15 Indian local search engines must have
been launched, some small and some big, some with huge funding and some with
none. The space is now so crowded that it is difficult to know who is really
winning. Nevertheless, we attempt to put forth a brief overview of the current
scenario. The following fall into the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol,
burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefind.in, along
with Ask Laila, which launched a couple of days ago. There are also localized
versions of the giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition). It is too early to pass judgment on any of them. These
are testing stages, and every start-up is adding new features and improving its
services.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt
Consult (2008-2009 Report)
4.6.1.1 Most Used Websites

| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 37 | 35 |
| Yahoo | 32 | 25 |
| Rediff | 7 | 4 |
| Orkut | 6 | 7 |

More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English

| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 81 | 76 |
| Yahoo | 7 | 7 |
| Wikipedia | 3 | 6 |
| English | 3 | 4 |

More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language

| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 65 | 34 |
| Yahoo | 12 | 29 |
| Rediff | 4 | 15 |
| Teluguone | 2 | 0 |
| Guruji | 1 | 18 |
| Raftaar | 1 | 0.2 |
| Hindi | 1 | NA |
| Webdunia | 1 | 0.7 |
| Khoj | 0.7 | 2 |

More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when information is accessed using Indian language queries. For instance,
popular web search engines such as Google, Yahoo and Guruji often return 0
search results for Indian language queries, giving the impression that no
documents containing the information exist. In reality these search engines
face a recall problem with Indian languages because of multiple spellings,
morphological variants of keywords and English keywords written in Hindi.
Table 4.15 illustrates a few such cases. For example, the Hindi query for
"world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is
shown to return 0 documents in Table 4.15; however, a small rephrasing of the
query in Table 4.16 shows that these keywords do exist in the results of the
second search. But merely saying that we have a recall problem may not be
sufficient. The obvious next question is how much of a problem it is; in other
words, we need to quantify the problem. For this purpose we conducted many
experiments to determine at what levels this recall problem occurs, and by how
much.
| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0 |
| इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0 |

Table 4.15 Problems faced while searching in Hindi: low recall

| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12 |
| वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1 |
| इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1 |
| बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93 |

Table 4.16 Improved recall

4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined morphological
structure and a well-defined grammar, but the grammatical and structural
standards of the language are seldom followed, for various reasons. One reason
is the linguistic diversity of India. Including Hindi, about 28 languages are
spoken in India, and Hindi, being the official language of India, is influenced
by the regional languages, which changes its dialects not only in speech but in
writing as well. Every language attaches markers to a root word to construct
new words: English uses -s, -es and -ing, and Hindi uses maatraas such as
ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) are
morphological variants of the root word Yojnaa (योजना, "planning" in English).
It is desirable to combine all the morphological variants of a word into a
single canonical form. This process is called word stemming, and the canonical
form is called the root word or base word.
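
Word stemming as described above can be illustrated with a very small suffix stripper. This is a sketch under stated assumptions, not a complete Hindi stemmer: the suffix list `SUFFIXES` and the function name `stem` are hypothetical, and only the plural markers from the Yojnaa example are covered.

```python
# Hypothetical, deliberately tiny suffix list; longer suffixes are tried first.
SUFFIXES = sorted(["ओं", "एं", "एँ"], key=len, reverse=True)


def stem(word):
    """Strip one known suffix so that variants map to a shared root form."""
    for suffix in SUFFIXES:
        # Keep at least two code points of the word so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


for variant in ["योजनाओं", "योजनाएं", "योजना"]:
    print(variant, "->", stem(variant))
```

A practical stemmer would need a far larger suffix inventory and rules for vowel-sign changes; the point here is only that both variants collapse to the same root before matching.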
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, the transliteration of proper
names, words borrowed from regional and foreign languages, and the phonetic
variety in the Indian language alphabet. The variety in the alphabet, the
different dialects and the influence of regional and foreign languages have
resulted in spelling variations of the same word. For example, the Hindi word
अंग्रेजी (angrējī, meaning English) has several possible spelling variations.
There are numerous words that are phonetically equivalent but vary in writing.
The word school can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword school (सक र) and for a
non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by the other
regional languages, which results in phonetic variety in words. For example,
the English word school (सक र in Hindi) is pronounced and written as ISKOOL
(इसक र) by the majority of the population of India in different states. For
the Hindi word ISKOOL (इसक र) more than two thousand results are found. Search
engines should be capable of retrieving results for words that are phonetically
equivalent to the keywords entered. A user may use any keyword for searching,
and search engines should be able to support all phonetically equivalent words.
Moreover, no particular standard exists for writing the keyword used to fetch
Hindi web data. The results vary for each phonetically equivalent keyword in
the query, that is, a different set of documents is retrieved with very little
repetition. The native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information of use to him or her.
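
One way to treat phonetically equivalent spellings alike is to reduce each keyword to a normalized key and compare the keys. The sketch below is an assumption-laden illustration, not the method used in this study: the folding rules (drop the nukta, map chandrabindu to anusvara, fold long i and u onto short ones) and the name `phonetic_key` are hypothetical choices.

```python
import unicodedata

NUKTA = "\u093c"          # dot marking borrowed sounds, as in the letter for "za"
CHANDRABINDU = "\u0901"   # nasalisation sign
ANUSVARA = "\u0902"       # nasal dot
# Long vowels folded onto short ones (independent letters and vowel signs).
VOWEL_FOLD = {
    "\u0908": "\u0907",   # independent long i -> short i
    "\u090a": "\u0909",   # independent long u -> short u
    "\u0940": "\u093f",   # vowel sign long i -> short i
    "\u0942": "\u0941",   # vowel sign long u -> short u
}


def phonetic_key(word):
    """Return a spelling-insensitive key for a Devanagari word."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    decomposed = unicodedata.normalize("NFD", word)
    out = []
    for ch in decomposed:
        if ch == NUKTA:
            continue
        if ch == CHANDRABINDU:
            ch = ANUSVARA
        out.append(VOWEL_FOLD.get(ch, ch))
    return "".join(out)


print(phonetic_key("अंग्रेजी") == phonetic_key("अँग्रेज़ी"))
```

Variants that add or drop a whole syllable, such as the ISKOOL form of school, are not merged by these rules and would need an explicit variant list or a more elaborate phonetic scheme.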
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर, अलंकार.
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples
below show this aspect clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it has another meaning, i.e. to
sleep. In the context of the above query the word सोना means gold, but the query can also be
interpreted as “women like to sleep”.
Another query: (In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common;
however, in Hindi it also means mango, so the query can be interpreted as
“mango is people's choice”.
Many words are polysemous in nature. Finding the correct sense of a word in a
given context is an intricate task: one word can have more than one meaning, and
its meaning depends on the context of the sentence. For example, कर (tax) has the
synonyms ब्याज, शुल्क, सूद, महसूल and टैक्स in one context; in another context कर
(hand or arm) has the synonyms हस्त, बाहु, आच शफय; and in a third context कर (to do) corresponds to करना.
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways. It has
affected the pronunciation of Hindi words, and many English words have been
localized in India. Some of these words appear as if they were native Hindi words,
and Indians are sometimes unable to recall the Hindi equivalent of the English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director and department are used even by
uneducated Indians, without being aware of the language those words come from. Most
Indians use these words in English rather than in their native language. English
has influenced Hindi not only in speaking but in writing too, which
becomes more evident in Hindi literature on the web.
The influence of English on Hindi has been observed to be one of the most
important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from the set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Table 4.19 Precision values of the three search engines
S No | Query | Precision at 10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (series: P Google, P Bing, P Guruji; x-axis: query numbers 1.1 to 11.1; y-axis: precision from 0 to 1)
It has been observed that all three search engines return more documents
when the query is submitted with the root word. This justifies searching for
documents using the root word because, in general, we get better results with
keywords in their root form.
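Reducing a query keyword to its root form before submission can be sketched as a simple suffix-stripping step. The rules below are an assumption made for illustration: PLURAL_RULES is a hypothetical list covering only a few Hindi plural endings, and to_root and normalise_query are names introduced for this example. It is not the procedure used by any of the search engines discussed here.

```python
# Hypothetical suffix rules (suffix, replacement), longest suffix first.
# They cover only a few plural endings and are an assumption for illustration.
PLURAL_RULES = [
    ("\u093F\u092F\u094B\u0902", "\u0940"),  # -iyon -> -i  (rogiyon -> rogi)
    ("\u0913\u0902", ""),                    # -on after aa (bhashaon -> bhasha)
    ("\u090F\u0902", ""),                    # -en after aa (bhashaen -> bhasha)
    ("\u090F\u0901", ""),                    # same ending with chandrabindu
    ("\u094B\u0902", ""),                    # -on          (jheelon -> jheel)
    ("\u0947\u0902", ""),                    # -en
]


def to_root(word, min_stem=2):
    """Strip the first matching plural suffix, keeping at least min_stem characters."""
    for suffix, replacement in PLURAL_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


def normalise_query(query):
    """Apply to_root to every whitespace-separated keyword of the query."""
    return " ".join(to_root(word) for word in query.split())
```

The min_stem guard keeps short function words (for example the postposition written as "\u092E\u0947\u0902") from being truncated; a production stemmer would need a dictionary check as well.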
It has also been observed that only Google lists morphological
variants of the root words, whereas Bing and Guruji list only the root word
supplied, in almost all the sample queries listed in the table above.
From the above results it is evident that only Google indexes document
keywords in their root form. Bing and Guruji do not index in that form, which is the
reason the number of documents retrieved in their case is lower than for
Google. The overall comparison of results from the three search engines in the tables
above shows that, in general, the quantity of results retrieved increases when the
keywords are used in their root form. For search engines, however, the quality of
results is more important than the quantity. Figure 4.1 and Table 4.19 show the
comparison of the precision values of the three search engines. The precision
value is calculated from the top 10 results of each search engine. On closely
observing the results, we can say that the precision value for Google is high in
almost all queries. Since, as mentioned above, Google does its indexing in the root form
of keywords, it can be said that the relevance of the results is also higher in Google
than in the other two search engines, which indicates that not only the quantity
but also the quality of results is affected by morphological variations in the
keywords.
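The precision value over the top 10 results can be written as a short function. This is a minimal sketch: the name precision_at_k and the input format (one relevance judgement per returned result) are assumptions, and the denominator is kept at k even when fewer than k results are returned, which is one possible reading of the values reported in the tables.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top-k results judged relevant.

    judgements: list of booleans in rank order, True if the result is relevant.
    The denominator stays at k even if fewer than k results were returned.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for relevant in judgements[:k] if relevant) / k


if __name__ == "__main__":
    # Five relevant documents among the first ten results.
    print(precision_at_k([True, False, True, True, False,
                          False, True, False, True, False]))
```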
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any
keyword for searching, and search engines should support all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. The table below shows the results and the precision obtained.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Query keywords submitted to Google | No of results | Precision at 10

Hindi query 1 (standard keywords): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Hindi query 2 (standard keywords): आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Hindi query 3 (standard keywords): भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Hindi query 4 (standard keywords): अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Hindi query 5 (standard keywords): फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for the phonetic nature of Hindi language (panels: Query No 1 to Query No 5)
In the above table and figure it can be clearly seen that search engines return a
handful of documents for the various phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing the keyword to fetch Hindi
web data. Every phonetically equivalent keyword in the query produces a variation in
the results, i.e. a different set of documents is retrieved with little
repetition. From the precision chart it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost the same
or nearly equal. The native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information of his/her use.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic “dictionary” meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the
following commonly spoken synonyms: गहने, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below show the
comparison of precision values across the three search engines.
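Table 4.21 Effect of word synonyms on Hindi IR
S No | Query | Google: documents returned | Per 10 | Bing: documents returned | Per 10 | Guruji: documents returned | Per 10

1 | Query: स न क आब षण | Standard Hindi word and synonyms: आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2 | Query: क र फ दर | Standard Hindi word: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3 | Query: सतर सशिकतकयण | Standard Hindi word and synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4 | Query: मसक दयक अ हक य | Standard Hindi word: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5 | Query: व रग ओ | Standard Hindi word: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6 | Query: आ खद न | Standard Hindi word: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1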
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their
synonyms improves information retrieval for a query in Hindi.
Not only the quantity of documents returned but also their quality is affected by
using synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns
more documents than the other two search engines and that Guruji returns the
fewest; the reason may be the
availability of fewer documents or poor indexing. However, we are more interested in
the quality of results than in their quantity. As far as quality is concerned, it can be
clearly seen that Google and Bing provide better-quality data than Guruji. In the
average case Google still stands first, which means that its precision values
are higher than those of Bing and Guruji. Thus it becomes clear
that by changing a keyword into its synonym, equivalent results can be obtained.
Therefore it is evident that synonyms of keywords play an important role in the
process of Hindi information retrieval.
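Replacing a keyword by its synonyms can be sketched as a simple query expansion step. The sketch below is illustrative only: the SYNONYMS dictionary is a hypothetical, hand-written lexicon (its Hindi entries are reconstructed from the ornament example and are an assumption), and expand_with_synonyms is a name introduced for this example rather than part of the tool described in this work.

```python
# Hypothetical synonym lexicon; the entries are assumptions reconstructed from
# the ornament example and would come from a thesaurus in practice.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms):
    """Return the original query followed by one variant per synonym substitution."""
    words = query.split()
    variants = [query]
    for position, word in enumerate(words):
        for synonym in synonyms.get(word, []):
            variants.append(" ".join(words[:position] + [synonym] + words[position + 1:]))
    return variants


if __name__ == "__main__":
    for variant in expand_with_synonyms("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Only one keyword is substituted per variant, which keeps the number of generated queries linear in the number of synonyms.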
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected queries. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the fifth
column holds the same ambiguous keyword in the other context. The fourth and sixth
columns hold the meaning of the queries in English with respect to the ambiguous
keyword in context.
Table 4.22 List of randomly selected ambiguous queries
S No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in Table 4.22 were tested against
three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In the tables, the last column, labelled “Other
context”, holds the number of results which are not relevant to the query supplied
or which contain the keyword in some other, non-required context.
From the results it is clear that all search engines return documents in different
contexts. Therefore it can be said that search engines underperform when supplied
with ambiguous queries. The numbers in the column labelled “Other context” signify
the deviation from relevance. For example, for the query युद्ध में कुल विनाश
(aggregate destruction in wars) the column “Other context” contains 5
documents for Google, 8 documents for Bing and all 10
documents for Guruji.
For another query, सपेरों का फन (art of snake charmers), which has the other context
“snake charmer's snake head”, the retrieved documents are expected to be in the context
“art”, but from the above results it can be seen that Google returns all 10
documents in the non-required context (snake head) and Bing returns 9 documents,
whereas Guruji fails to retrieve even a single document. In this scenario it
becomes important for search engines to address the issue of ambiguity in
keywords in order to obtain better results.
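The deviation from relevance read off the “Other context” column can be summarised per search engine with a few lines of arithmetic. The sketch below copies the “Other context” counts from Tables 4.23 to 4.25; the function name mean_deviation and the choice to skip the query for which Guruji returned no results are assumptions made for this illustration.

```python
# "Other context" counts for queries 1-5, copied from Tables 4.23 to 4.25.
# None marks the query for which Guruji returned no results.
OTHER_CONTEXT = {
    "Google": [3, 4, 0, 0, 5],
    "Bing": [6, 4, 2, 1, 8],
    "Guruji": [10, 7, 3, None, 10],
}


def mean_deviation(counts, k=10):
    """Average share of the top-k results that fall outside both listed contexts."""
    judged = [count for count in counts if count is not None]
    return sum(judged) / (k * len(judged))


if __name__ == "__main__":
    for engine, counts in OTHER_CONTEXT.items():
        print(engine, round(mean_deviation(counts), 2))
```

Under these assumptions the averages come to 0.24 for Google, 0.42 for Bing and 0.75 for Guruji over the queries each engine answered.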
4.13 Discussion: Influence of English on Hindi Information Retrieval
English has influenced Hindi not only in speaking but in
writing too, which becomes more evident in Hindi literature on the web.
The influence of English on Hindi has been observed to be one of
the most important parameters for Hindi information retrieval, as the
following example explains.
Example: the English word exercise is written in Hindi as एक्सरसाइज. The
word exercise has the following phonetic variations in Hindi: एकसयस इज,
एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic
variation shown in this example, a variety of popular keywords and
queries from various domains have been tested in the experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
English Words
Google Transliteration
Standard Hindi Keywords
Phonetic Equivalents Search Engine Google
Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800
Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710
Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820
Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100
Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000
0 537000
University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270
Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300
Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160
Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
3240 38000 639 1120
specialist टऩशसअशरटट
सऩ मशममरसट
सऩ शमरसट सऩ श मरसट
143 1440 51300 9820
Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that the search engine returns documents
for a single-keyword query, and that documents for all phonetic variants of the keyword are
also returned in large numbers. It can also be seen that people have
their own way of writing Hindi words and that no standard is followed for
storing Hindi data on the web. Documents are also retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above table
the column with bold Hindi entries shows the keywords obtained by
using the Google transliteration tool. Table 4.26 shows that transliteration does
not provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be यूनिवर्सिटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. It is
clearly evident from the table above that 1540 documents have been retrieved for
the wrong keyword उतनव मसमतम (University), and the same holds for other single-word
queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy
(ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers
make use of unchecked and non-standard transliteration, which makes the Hindi IR
process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on Hindi IR in terms of precision and quantity of documents
retrieved. Each Hindi query is transformed into its variants by the software (its design
and working are discussed in the next chapter) by replacing the Hindi keywords with
English keywords written in Hindi, without changing the meaning of the query.
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
The queries are converted at two levels. At the first level one Hindi word is replaced
by its English equivalent, and at the second level more than one word is replaced by
its English equivalent, without changing the meaning of the original
Hindi query. Example:
An English query “Foreign investment in India” can be written in Hindi as
“विदेशी निवेश भारत में”, where the Hindi keyword विदेशी means “Foreign” (फॉरेन)
and निवेश means “investment” (इन्वेस्टमेंट). The query for the two levels is
transformed as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query “विदेशी निवेश भारत में” (videshi
nivesh Bharat mein) supplied by the user is transformed into two equivalent
queries containing a mixture of English and Hindi, where the meaning
of the query remains the same. From the sample set of one hundred queries, some
randomly selected queries are presented below in Table 4.27.
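The two-level transformation described above can be sketched as follows. This is an illustration under stated assumptions and not the software described in the next chapter: transform_query is a name introduced for this example, the mapping ENGLISH_IN_HINDI is a hypothetical two-entry lexicon whose Devanagari spellings are reconstructed from the foreign investment example, and Level 1 is taken to replace the first replaceable keyword only.

```python
# Hypothetical lexicon of Hindi keywords and their English equivalents written
# in Devanagari; the spellings are assumptions reconstructed from the example.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_query(query, english_equivalents):
    """Return (level1, level2) variants of a Hindi query.

    level1 replaces the first replaceable keyword; level2 replaces all of them
    and is produced only when at least two keywords can be replaced.
    A level that cannot be produced is returned as None.
    """
    words = query.split()
    positions = [i for i, word in enumerate(words) if word in english_equivalents]
    level1 = level2 = None
    if positions:
        first = positions[0]
        level1 = " ".join(
            words[:first] + [english_equivalents[words[first]]] + words[first + 1:]
        )
    if len(positions) > 1:
        level2 = " ".join(english_equivalents.get(word, word) for word in words)
    return level1, level2


if __name__ == "__main__":
    level1, level2 = transform_query("विदेशी निवेश भारत में", ENGLISH_IN_HINDI)
    print(level1)
    print(level2)
```

Words without an entry in the lexicon are left untouched, so the meaning of the query is preserved as long as the lexicon entries are true equivalents.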
Table 4.27 Queries transformed into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4 Transformed queries into two equivalent senses (panels: Hindi Query 1 to Hindi Query 6, each with Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for the
original as well as the transformed Hindi queries, and the quantity of retrieved
documents is considerable. For search engines the quality of results is
more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented
below for the analysis of precision values. Three popular search engines, namely
Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28 Analysis of precision values (tabular representation)
Query | Variant | Precision at 10: Google | Bing | AltaVista
1 | Hindi query: सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1: ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2: ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query: यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1: बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2: बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query: रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1: ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2: क रड मम र िजसट | 1 | 1 | 1
4 | Hindi query: सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
4 | Level 1: सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2: सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query: षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1: प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2: प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query: भरषटट च यभ कतब यत | 1 | 1 | 1
6 | Level 1: कयपशनभ कतब यत | 1 | 1 | 1
6 | Level 2: कयपशनफरीइ रडम | 1 | 1 | 1
Figure 4.5 Analysis of inter-precision values (series: Google, Bing, AltaVista; x-axis: HQ = Hindi Query, L = Level, for HQ1 to HQ6 with L1 and L2)
From the above tables and figures it can be clearly seen that Hindi data of a similar
nature can be mined out for Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in retrieved data. The relevance of the retrieved
data can also be seen in the precision column. For every Hindi query and its
transformed variations, the degree of relevance of the documents is very close,
equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 out of the first 10
documents are relevant, and for the transformed queries of a similar nature,
ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are
relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table
above. Without transformation of Hindi queries the user may miss the chance of
retrieving relevant information, as the Hindi user may not be aware of the
presence of such information on the web and may be unable to formulate the
query variations that arise from the influence of English. From the above table it
can be said that English-influenced Hindi information is present on the web and is increasing
day by day. By including English keywords in Hindi script in
the query, the scope of searching in Hindi and of getting relevant
information can be increased.
The process of Hindi IR becomes more difficult because of the structure of
the Hindi language. Generally, people do not follow the actual Hindi writing standard,
which widens the gap between Hindi web data and users.
Relevant information can be mined out by transforming the Hindi queries.
Search engines neither transform the query nor find keyword
equivalents, because they may face performance and throughput problems if
parameters such as Hindi phonetics, synonyms and English-equivalent Hindi
keywords are implemented at root level. However, this problem can be solved at
the interface level. Therefore, to lessen the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in
the next chapter) which acts as an interface between the user and the search
engines. With the help of this tool the user can widen the scope of search on the web in
Hindi [66][67].
(EMILLE) program provides the corpora, architecture and tools for Asian
languages. It has a monolingual corpus which contains approximately 96157000
words and a parallel corpus consisting of 200000 words of text in English, which
helps in the translation of Bengali, Hindi, Punjabi and other languages.
C-DAC Noida has developed the parallel text corpus Gyan-Nidhi for 12
Indian languages (Hindi, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada,
Nepali, Oriya, Malayalam, Bangla, Assamese) and English. Gyan-Nidhi is a
multilingual parallel corpus which is a repository of 'One Million Pages' of
knowledge-based text. Mahatma Gandhi International University has started the
project 'Hindi Samghraha' for a repository of Hindi words, a database and dialect
mapping of Hindi. The Department of Information Technology of the Government of
India has started a project for developing Indian language corpora, the Indian
Language Corpora Initiative (ILCI). ILCI is a consortium project for building
parallel annotated corpora under the leadership of Dr Girish Nath Jha, JNU, New
Delhi. It involves 11 Indian languages and also English.
4.5.1 Machine Translation in India
Although translation in India is old, machine translation is
comparatively young. Early efforts in this field have been noticed since 1980,
involving prominent institutions such as IIT Kanpur, the University of
Hyderabad, NCST Mumbai and CDAC Pune. During the late 1990s many new
projects were initiated by IIT Mumbai, IIIT Hyderabad, AU-KBC Centre Chennai
and Jadavpur University Kolkata. TDIL started a
consortium-mode project in April 2008 for building computational tools and
a Sanskrit-Hindi MT system under the leadership of Amba Kulkarni (University of
Hyderabad). The goal of this project is to build children's stories using
multimedia and e-learning content.
4.5.2 Anglabharati
IIT Kanpur has developed the Anglabharti machine translation technology
from English to Indian languages under the leadership of Prof RMK Sinha. It is
a rule-based system and has approximately 1750 rules and 54000 lexical words
divided into 46 to 58 paradigms. It uses a pseudo-interlingua named PLIL
(Pseudo Lingua for Indian Language) as an intermediate language. The
architecture of Anglabharti has six modules: morphological analyzer, parser,
pseudo code generator, sense disambiguator, target text generators and
post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based
application which is available for use at http://anglahindi.iitk.ac.in. To
develop automated translation systems for regional languages, the Anglabharti
architecture has been adopted by various Indian institutes, for example IIT
Guwahati.
4.5.3 Anubharti
Prof RMK Sinha developed Anubharti during 1995 at IIT Kanpur.
Anubharti is based on a hybridized example-based approach. The second phase of
both projects (Anglabharti II and Anubharti II) started in 2004 with
new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and
development project for Indian languages and English undertaken by CIF
(Chinmaya International Foundation). It is a fully-automatic general-purpose
high-quality machine translation system (FGH-MT). It has software which can
translate text from one Indian language into another, based
on Panini's Ashtadhyayi (grammar rules). It is developed at the International
Institute of Information Technology Hyderabad (IIIT-H) and the Department of
Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian
Government, started during 1996 for the translation of government orders, notifications,
circulars and legal documents from English to Hindi. The main goal was to
provide the translation tools to government agencies Mantra software is available
in all forms such as desktop network and web based It is based on Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar Initially it was domain specific such as Personal
Administration specifically Gazette Notifications Office Orders Office
Memorandums and Circulars gradually the domains were expanded At present
it also covers domains like Banking Transportation and Agriculture etc Earlier
Mantra technology was only for English to Hindi translation but currently it is
also available for English to other Indian Languages such as Gujarati Bengali and
Telugu MANTRA-Rajyasabha is a system for translating the parliament
proceedings such as papers to be laid on the Table [PLOT] Bulletin Part-I
Bulletin Part- II List of Business [LOB] and Synopsis Rajya Sabha Secretariat of
Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRARajyasabha system
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. The
UNL-based machine translation system is being developed under the leadership
of Prof. Pushpak Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of the University of
Hyderabad has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). The project
is funded by the Karnataka Government and is applicable to the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the system-generated translation,
the user can supply new words, phrases, and sentences to the system and
obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical methods. It is used for
the translation of English to Indian languages (Hindi, Marathi, and Telugu).
Users can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K. B. Chandrasekhar Research Centre of Anna University, Chennai has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a
lexicon translation approach. It also has a small set of transfer rules. Users
can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports
processing of Unicode-encoded text. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates English
sentences into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: pre-processing,
translation engine, and post-processing.
4.5.12 Contribution of Private Companies in Evolving the ILT: Indian
Language Search Engine
4.5.12.1 Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, and backed by Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms.
For any query it searches deep into Indian-language content and tries to return
the most appropriate results. The Guruji search engine covers a range of
specific content: news, entertainment, travel, astrology, literature, business,
education, and more.
4.5.12.2 Google
Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam,
and Punjabi, and provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages such as Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil, and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool
for translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of
Indian language technology in areas such as text translation, software
localization, and website localization. It is also involved in research and
development on corpus creation and collection and on content syndication.
Moreover, it provides language consultancy. It has developed various
applications in Indian languages such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development
of Indian language software. It provides Indian language enablement
technology to many state governments and the central government for
e-governance programs. It has developed software for multilingual content
creation for newspaper publishing and has also developed high-quality
Unicode-based fonts for major Indian languages. It has specifically developed
the Shree-Lipi Gujarati package for the Gujarati language, which is useful in
the DTP sector, corporate offices, and the e-governance program of the
Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970, the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In
addition, the Indian languages Transliteration scheme (ITRANS), developed by
Avinash Chopde, represents Indian language alphabets in terms of ASCII
(Madhavi et al. 2005). The Department of Information Technology under the
Ministry of Communication and Information Technology is also working on the
proliferation of language technology in India. Other Indian government
ministries, departments, and agencies, such as the Ministry of Human Resource
Development, DRDO (Defence Research and Development Organisation), the
Department of Atomic Energy, the All India Council for Technical Education, and
the UGC (University Grants Commission), are also involved, directly and
indirectly, in research and development of language technology. All these
agencies help develop important areas of research and provide research funds to
development agencies. As a result, IndoWordNet was developed for the Indian
languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate, and long-term goals
for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In
the last year alone there must have been more than 10 to 15 Indian local search
engines launched, some smaller and some bigger, some with huge funding and some
with none. The space is so crowded right now that it is difficult to know who
is really winning. However, we attempt a brief overview of the current scenario.
The following fall in the localized Indian search engine category:
Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo, and
lemmefind.in, along with Ask Laila, which launched a couple of days back. There
are also localized versions of the big players Google, Yahoo, and MSN.
Each of these Indian search engines has come forward with some
USP (Unique Selling Proposition). It is too early to pass judgment on any
of them. These are testing stages, and every start-up is adding new features and
improving its services.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Websites | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance, popular
web search engines such as Google, Yahoo, and Guruji often return 0
search results for Indian language queries, giving the impression that no
documents containing this information exist. In reality, these search engines
face a recall problem when dealing with Indian languages, due to multiple
spellings, morphological variants of keywords, and English keywords written in
Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for
"world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is
shown to return 0 documents in Table 4.15; however, a small rephrasing of the
query in Table 4.16 shows that these keywords exist in the second search result.
But just saying we have a recall problem may not be sufficient. The next obvious
question is how much of a problem it is. In other words, we need to somehow
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs and by how much.
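One way to make "quantifying the problem" concrete is the usual definition of recall: the fraction of the known relevant documents that a query actually retrieves. The sketch below is only an illustration under that assumption; the document identifiers and the judged pool are hypothetical and are not data from the experiments reported here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical pool of documents judged relevant to one information need.
relevant_docs = {"d1", "d2", "d3", "d4"}

# A spelling variant that returns no results has recall 0.
print(recall([], relevant_docs))

# A rephrased variant that finds two of the four relevant documents.
print(recall(["d2", "d4", "d9"], relevant_docs))
```

On the open web the full set of relevant documents is unknown, so in practice such a figure can only be estimated against a pooled or sampled set of judged documents.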
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved Recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined
morphological structure and a well-defined grammar, but the grammatical and
structural standards are rarely followed, for various reasons. One of the
reasons is the language diversity of India. Including Hindi, there are about 28
languages spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which results in dialect changes not only
in speaking but in writing too. Every language attaches markers to a root word
to construct new words (English uses s, es, and ing; Hindi uses MAATRAAS such as
ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are
morphological variants of the root word Yojnaa (योजना, planning in English). It
is desirable to combine all the morphological variants of a word into a single
canonical form. This process is called word stemming, and the canonical form is
called the root word or base word.
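The stemming step described above can be sketched as a light suffix stripper. This is a minimal illustration, not the procedure used in the experiments: the rule list `SUFFIX_RULES`, the minimum stem length, and the function name `light_stem` are assumptions made for the example, and the rules cover only a few plural and oblique endings.

```python
# (suffix, replacement) pairs, checked in order. Illustrative, not exhaustive.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ओं", ""),     # योजनाओं -> योजना
    ("एं", ""),     # योजनाएं -> योजना
    ("एँ", ""),
    ("ों", ""),     # झीलों -> झील
    ("ें", ""),
]
MIN_STEM_LENGTH = 2  # in code points; protects short words such as "में"


def light_stem(word):
    """Strip the first matching suffix, or return the word unchanged."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


for form in ["योजना", "योजनाओं", "योजनाएं", "झीलों"]:
    print(form, "->", light_stem(form))
```

Applying the same function to both the indexed keywords and the query keywords would map all three forms of योजना to one canonical entry, which is the effect the text asks for.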
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to
the phonetic nature of Indian languages, multiple dialects, transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabet. The variety in the alphabet,
different dialects, and the influence of regional and foreign languages have
resulted in spelling variations of the same word. For example, the Hindi word
अंग्रेजी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing.
The word school in Hindi can be written in different ways (सक र सक र सक र).
When information is searched for the standard keyword school (सक र) and the
non-standard phonetically equivalent keyword (सक र), Google shows 69 million
results for the former and 14 million for the latter. Hindi is influenced by
other regional languages, which results in phonetic variety of words. For
example, the English word school (सक र in Hindi) is pronounced and written as
ISKOOL (इस्कूल) by a majority of the population in different states of India.
For the Hindi word ISKOOL (इस्कूल), more than two thousand results are found.
Search engines should be capable of retrieving results for words phonetically
equivalent to the keywords entered. A user may use any keyword for searching,
and search engines should support all phonetically equivalent words.
Also, no particular standard exists for writing the keywords used to fetch
Hindi web data. Each phonetically equivalent keyword in a query yields different
results, i.e. a different set of documents is retrieved with little repetition.
The native Hindi user may not be aware of the phonetic issues in Hindi IR and
may miss information relevant to his/her use.
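A search engine can support some phonetically equivalent spellings by mapping each keyword to a normalized key before indexing and matching. The sketch below is an assumption-laden illustration, not a method described in this chapter: the function name `phonetic_key` and the three rules (drop the nukta, treat chandrabindu as anusvara, write a nasal consonant plus virama before a consonant as anusvara) are chosen for the example and cover only a small share of real variation. It does not handle vowel-length variants.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"

# A nasal consonant followed by virama, directly before another consonant,
# for example the half "न" in "आन्दोलन".
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def phonetic_key(word):
    """Map common spelling variants of a Hindi word to one comparison key."""
    word = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    word = word.replace(NUKTA, "")               # ज़ -> ज
    word = word.replace(CHANDRABINDU, ANUSVARA)  # महँगाई -> महंगाई
    word = NASAL_CLUSTER.sub(ANUSVARA, word)     # आन्दोलन -> आंदोलन
    return unicodedata.normalize("NFC", word)


for a, b in [("महँगाई", "महंगाई"), ("आन्दोलन", "आंदोलन"), ("ज़हरीले", "जहरीले")]:
    print(a, b, phonetic_key(a) == phonetic_key(b))
```

The key is meant only for comparison; the original spelling should still be stored and displayed, because the normalization deliberately discards distinctions that matter in careful writing.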
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations, and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the
following commonly spoken synonyms: गहना, जेवर, अलंकार.
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results, as the examples
below show. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e.
to sleep. In the context of the above query the word सोना means gold, but the
query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common;
however, in Hindi it also means mango, so the query can be interpreted as
"mango is people's choice".
Many words are polysemous in nature. Finding the correct sense of a word in a
given context is an intricate task: a word may have more than one meaning, and
the meaning depends on the context of the sentence. For example, कर (tax) has
the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कर
(hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another context कर
(to do) corresponds to करना.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been
localized in India, and some of them appear as if they were native Hindi words.
Indians are sometimes unable to find the Hindi equivalent of an English word.
For instance, words such as road, bus, pen, television, radio, please, rail,
email, password, insurance, internet, director, and department are used even by
uneducated Indians, without awareness of the language these words come from.
Most Indians use these words in English rather than in their native language.
English influences Hindi not only in speech but in writing too, which becomes
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root
word. Table 4.17 lists randomly selected queries from the set,
which throw light on the effect of the root word on the performance of Hindi
language search engines. Table 4.18 shows examples of the effect of
morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No | Root words / Morphological variant | Listing of keywords (Google, Bing, Guruji) | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | |
1.1 | ब यत वष मवन | | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | |
2.1 | द घमटन | | 133000 | 2410 | 284
2.2 | द घमटन ओ | | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | | |
3.1 | ब ष | | 161000 | 8330 | 961
3.2 | ब ष ए | | 6200 | 935 | 188
3.3 | ब ष ओ | | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | | |
4.1 | झ र | | 4740 | 278 | 25
4.2 | झ र | | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | | |
5.1 | आऩद | | 102000 | 4030 | 410
5.2 | आऩद ए | | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत | | |
6.1 | ऩ परज तत | | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | |
7.1 | सभसम | | 584000 | 30200 | 1889
7.2 | सभसम ओ | | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | |
8.1 | कीटन शक | | 36300 | 1360 | 333
8.2 | कीटन शको | | 35800 | 800 | 270
9 | य ग | य गोय ग य ग य ग | | |
9.1 | य ग | | 205000 | 21600 | 1423
9.2 | य चगमो | | 128000 | 3280 | 239
9.3 | य ग | | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | | |
10.1 | म जन | | 673000 | 18500 | 3343
10.2 | म जन ओ | | 669000 | 6020 | 990
10.3 | म जन ए | | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | | |
11.1 | क दर | | 261000 | 11300 | 655
11.2 | क नदरो | | 29500 | 1850 | 105
Table 4.19 Precision values of the three search engines
S No | Query | Precision @10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines [chart: precision (0 to 1) by query number for Google, Bing, and Guruji]
It has been observed that all three search engines return more documents
when a query with the root word is submitted. This justifies searching for
documents with the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological
variants of the root words, whereas Bing and Guruji list only the root word
supplied, in almost all the sample queries in the table above.
The above results suggest that only Google indexes document keywords
in their root form. Bing and Guruji do not appear to index in that form, which
would explain why they retrieve fewer documents than Google. The overall
comparison of results from the three search engines in the tables
above shows that, in general, the quantity of results retrieved increases when
keywords are used in their root form. For search engines, however, the quality
of results is more important than the quantity. Figure 4.1 and Table 4.19
compare the precision values of the three search engines. The precision
value is calculated from the top 10 results of each search engine. On closely
observing the results, we can say that the precision of Google is high in
almost all queries. Since, as mentioned above, Google appears to index keywords
in their root form, the relevance of its results is also higher than that of
the other two search engines, which indicates that not only the quantity
but also the quality of results is affected by morphological variations in the
keywords.
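The precision values above are computed over the top 10 results of each engine. A minimal sketch of that calculation follows, assuming each of the top results has been judged relevant or not by hand; the function name `precision_at_k` and the sample judgments are illustrative and are not taken from the experiments.

```python
def precision_at_k(judgments, k=10):
    """Share of the top k results judged relevant.

    judgments: list of booleans in rank order, True meaning relevant.
    Missing results (fewer than k returned) count as not relevant.
    """
    return sum(judgments[:k]) / k


# Hypothetical judgments for one query: 5 of the top 10 results are relevant.
sample = [True, True, False, True, False, False, True, False, True, False]
print(precision_at_k(sample))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns only one or two documents, which matches how a user experiences a nearly empty result page.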
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words phonetically
equivalent to the keywords entered. A user may use any keyword for searching,
and search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. Table 4.20 shows the number of results and the
precision for each phonetic variant.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Query No | Keywords in query | No of results | Precision @10
1 | Hindi query (standard keywords): ससजिमो भ िहयीर ऩद थम | |
1 | Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर | |
1 | ससजिमोिहयीर | 97 | 0.9
1 | ससजिमो जहयीर | 632 | 0.9
1 | सिबजमो िहयीर | 194 | 0.9
2 | Hindi query (standard keywords): आसभान छ त भहॊगाई | |
2 | Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई | |
2 | आसभ न भहॊगाई | 35300 | 1.0
2 | आसभ न भहॉगाई | 1040 | 1.0
2 | आसभ भह ग ई | 14 | 0.6
2 | आसभाॊ भह ग ई | 563 | 0.7
3 | Hindi query (standard keywords): भरषटाचाय स आिादी | |
3 | Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी | |
3 | भरषटाचायआज दी | 211000 | 0.8
3 | भरशटाचायआज दी | 214 | 0.6
3 | बयषटाचाय आज दी | 447 | 0.7
3 | भरषटट च यआिादी | 1090000 | 0.9
3 | भरशट च य आिादी | 1040 | 0.7
3 | बयषटट च यआिादी | 1190 | 0.8
4 | Hindi query (standard keywords): अननाहिाय क आनदोरन | |
4 | Phonetic variations of the keywords: अनन हज य आ द रनआ द रन | |
4 | अनन हज य आनदोरन | 84700 | 0.3
4 | अनन हज य आॊदोरन | 85100 | 0.8
4 | अनन हज य क आॉदोरन | 78 | 0.6
4 | अनना हिाय क आनद रन | 399 | 0.5
4 | अनन हिाय क आनद रन | 3260000 | 1.0
5 | Hindi query (standard keywords): फयोिगायी सभसम सभ ध न | |
5 | Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी | |
5 | फयोिगायी | 9650 | 0.9
5 | फयोिगायी | 80600 | 1.0
5 | फयोिगायी | 170 | 0.7
5 | फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language [five pie charts, Query No 1 to Query No 5, one slice per phonetic variant of the query]
In the above table and figure it can be clearly seen that search engines return
only a handful of documents for some phonetically equivalent Hindi queries. It
is observed that no particular standard exists for writing the keywords used to
fetch Hindi web data. Each phonetically equivalent keyword in a query yields
different results, i.e. a different set of documents is retrieved with little
repetition. From the precision charts it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is nearly
equal. The native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss information relevant to his/her use.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations, and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the
following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the
three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
S No | Query | Standard Hindi words / synonyms | Google: documents returned | Google: per 10 | Bing: documents returned | Bing: per 10 | Guruji: documents returned | Guruji: per 10
1 | स न क आब षण | आब षणगहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a query in Hindi.
Not only the quantity of documents returned but also their quality is affected
by using synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines, and Guruji returns the fewest;
the reason may be the availability of fewer documents or poor indexing.
However, we are more interested in the quality of results than in the quantity.
As far as quality is concerned, it can be clearly seen that Google and Bing
provide better results than Guruji. In the average case Google still ranks
first, i.e. its precision values are higher than those of Bing and Guruji.
Thus it becomes clear that by changing a keyword into its synonym, equivalent
results can be obtained. Therefore synonyms of keywords play an important role
in Hindi information retrieval.
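One simple way to exploit synonyms is query expansion: issue the original query together with variants in which a keyword is replaced by a synonym, then merge the results. The sketch below assumes a hand-built synonym table; the table `SYNONYMS`, the function `expand_query`, and the sample query are hypothetical, with the single entry taken from the आभूषण example in the text. It does not adjust grammatical agreement around the substituted word.

```python
# Hypothetical synonym table; the entry follows the example given in the text.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query plus one variant per synonym substitution."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return variants


for variant in expand_query("आभूषण उद्योग"):
    print(variant)
```

A real system would draw the synonym table from a lexical resource such as a Hindi wordnet rather than a hard-coded dictionary.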
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected queries. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns hold the English meaning of the query with respect to the
ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S No | Query | Keyword in first context | In English | Keyword in second context | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war
The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing, and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 134
Table 4.25 Ambiguity test for Guruji (context counts are out of the top 10 results)
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keyword in some other, non-required context. All three engines return documents in mixed contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.
For the query सपेरों का फन (art of snake charmers), which can also be read as "snake charmer's snake head", the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the non-required context (snake head), Bing returns 9, and Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address ambiguity in keywords in order to obtain better results.
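The deviation from relevance discussed for the ambiguity tests can be summarized per engine. The following is a minimal Python sketch, not part of the original study: it averages the "Other context" counts reported in Tables 4.23 to 4.25 over the queries that returned results. The function name mean_deviation, and the choice to treat Guruji's empty result for query 4 as missing rather than as zero, are assumptions made for illustration.

```python
# "Other context" counts out of the top 10 results, copied from Tables 4.23-4.25.
# Query 4 returned no results on Guruji, so it is recorded as None.
OTHER_CONTEXT = {
    "Google": [3, 4, 0, 0, 5],
    "Bing": [6, 4, 2, 1, 8],
    "Guruji": [10, 7, 3, None, 10],
}
TOP_K = 10


def mean_deviation(counts, top_k=TOP_K):
    """Average share of top-k results counted in the 'Other context' column."""
    scored = [c for c in counts if c is not None]  # skip queries with no results
    return sum(scored) / (top_k * len(scored))


for engine, counts in OTHER_CONTEXT.items():
    print(f"{engine}: {mean_deviation(counts):.2f}")
# Google: 0.24
# Bing: 0.42
# Guruji: 0.75
```

A higher value means a larger share of the top results fell outside both listed senses of the keyword. The figure does not capture results returned in the wrong listed sense, such as the snake-head results for query 4.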
4.13 Discussion: Influence of English on Hindi Information Retrieval
English has its influence on Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word exercise is written in Hindi as (एकसयस इज). The word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern, a variety of popular keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi (the last four columns are Google result counts for the four Hindi forms, in the same order)
English word | Google Transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results: transliteration | Results: standard | Results: equivalent 1 | Results: equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000
Table 4.26 shows that the search engine returns documents not only for the standard single-keyword query but also for every phonetic variant of the keyword, and in large numbers. It also shows that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google Transliteration column holds the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent, in both cases without changing the meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is therefore transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
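The two-level transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software described in the next chapter: the lookup table ENGLISH_EQUIVALENT and the function transform are hypothetical, the tokens are romanized for readability (the actual tool would emit the English words in Devanagari script), and "level 2" is taken here to mean replacing every keyword that has an equivalent.

```python
# Hypothetical lookup table: Hindi keyword -> English equivalent (romanized).
ENGLISH_EQUIVALENT = {"videshi": "foreign", "nivesh": "investment"}


def transform(query, table=ENGLISH_EQUIVALENT):
    """Return the (level 1, level 2) variants of a whitespace-separated query."""
    tokens = query.split()
    # Positions of the tokens that have an English equivalent in the table.
    replaceable = [i for i, tok in enumerate(tokens) if tok.lower() in table]

    def swap(positions):
        return " ".join(
            table[tok.lower()] if i in positions else tok
            for i, tok in enumerate(tokens)
        )

    level1 = swap(set(replaceable[:1]))  # level 1: only the first keyword replaced
    level2 = swap(set(replaceable))      # level 2: every replaceable keyword replaced
    return level1, level2


print(transform("videshi nivesh bharat mein"))
# ('foreign nivesh bharat mein', 'foreign investment bharat mein')
```

If a query has only one replaceable keyword, both levels come out identical in this sketch; a real tool would need to decide how to handle that case.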
Table 4.27 Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation; the last three columns are Google search result counts)
In English | Hindi Query | Level 1 | Level 2 | Results: Hindi Query | Results: Level 1 | Results: Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4 Queries transformed into two equivalent senses (bar chart of Google result counts for Hindi Query 1 to 6, each at the original query, Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results matters more than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.
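Table 4.28 Analysis of precision values (tabular representation; Precision@10 for the original Hindi query and its Level 1 and Level 2 variants)
Query | Variant | Query text | Google | Bing | AltaVista
1 | Hindi Query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi Query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi Query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1 | 1 | 1
4 | Hindi Query | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi Query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi Query | भरषटट च यभ कतब यत | 1 | 1 | 1
6 | Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
6 | Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1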
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variations the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant; the same pattern repeats for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries the user may miss relevant information, since the Hindi user may not be aware that such information is present on the web and may be unable to formulate a query variation that accounts for English influence. The table above indicates that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
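The precision values discussed here count how many of the first 10 returned documents are relevant. As a minimal sketch of that measure, assuming relevance judgements are available as a ranked list of booleans (the function name precision_at_k is illustrative, and dividing by k even when fewer than k results are returned is a convention chosen for this sketch):

```python
def precision_at_k(relevant_flags, k=10):
    """Fraction of the first k ranked results that were judged relevant.

    relevant_flags: list of booleans, best-ranked result first.
    """
    top_k = relevant_flags[:k]
    return sum(1 for flag in top_k if flag) / k


# Nine of the first ten documents judged relevant, as in the example above.
judgements = [True] * 9 + [False]
print(precision_at_k(judgements))  # 0.9
```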
[Figure 4.5: inter-precision chart. Horizontal axis: HQ1 to HQ6 (HQ = Hindi Query), each followed by its L1 and L2 (L = Level) variants. Series: Google, Bing, AltaVista. The plotted values are those of Table 4.28.]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at root level. However, this problem can be solved at interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].
a rule-based system and has approximately 1750 rules and 54000 lexical words divided into 46 to 58 paradigms. It uses a pseudo-Interlingua named PLIL (Pseudo Lingua for Indian Languages) as an intermediate language. The architecture of Anglabharti has six modules: Morphological analyzer, Parser, Pseudo code generator, Sense disambiguator, Target text generators and Post-editor. The Hindi version of Anglabharti is AnglaHindi, a web-based application available at http://anglahindi.iitk.ac.in. The Anglabharti architecture has been adopted by various Indian institutes, for example IIT Guwahati, to develop automated translation systems for regional languages.
4.5.3 Anubharti
Prof. R.M.K. Sinha developed Anubharti in 1995 at IIT Kanpur. Anubharti is based on a hybridized example-based approach. The second phase of both projects (Anglabharti II and Anubharti II) started in 2004 with new approaches and some structural changes.
4.5.4 Anusaaraka
Anusaaraka is a Natural Language Processing (NLP) research and development project for Indian languages and English, undertaken by CIF (Chinmaya International Foundation). It is a fully-automatic, general-purpose, high-quality machine translation system (FGH-MT). Its software can translate text from one Indian language into another, based on Panini's Ashtadhyayi (grammar rules). It is developed at the International Institute of Information Technology, Hyderabad (IIIT-H) and the Department of Sanskrit Studies, University of Hyderabad.
4.5.5 Mantra
Machine Assisted Translation Tool (Mantra) is a brainchild of the Indian Government, started in 1996 for the translation of government orders, notifications, circulars and legal documents from English to Hindi. The main goal was to provide translation tools to government agencies. Mantra software is available in desktop, network and web-based forms. It uses the Lexicalized Tree Adjoining Grammar (LTAG) formalism to represent the English as well as the Hindi grammar. Initially it was domain specific, covering Personal Administration, specifically Gazette Notifications, Office Orders, Office Memorandums and Circulars; gradually the domains were expanded. At present it also covers domains such as Banking, Transportation and Agriculture. Earlier, Mantra technology was only for English to Hindi translation, but it is now also available for English to other Indian languages such as Gujarati, Bengali and Telugu. MANTRA-Rajyasabha is a system for translating parliamentary proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I, Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of the Rajya Sabha (the upper house of the Parliament of India) provides funds for updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL) based machine translation system for English to Hindi. UNL is a United Nations project for developing an Interlingua for the world's languages. UNL-based machine translation is being developed under the leadership of Prof. Pushpak Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad University has developed an English-Kannada MT system. It is based on the transfer approach and Universal Clause Structure Grammar (UCSG). The project is funded by the Karnataka Government and is applicable in the domain of government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility: if the user is not satisfied with the translated sentence generated by the system, the user can supply new words, phrases and sentences as feedback and obtain a newly interpreted translation. The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti combines a rule-based approach with statistical methods. It is used for translation from English to Indian languages (Hindi, Marathi and Telugu). Users can access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K B Chandrasekhar Research Centre of Anna University, Chennai has developed a machine-aided Tamil to Hindi translation system. The system is based on the Anusaaraka machine translation system and follows a lexicon translation approach. It also has small sets of transfer rules. Users can access the system at http://www.au-kbc.org/research_areas/nlp/demo/mat.
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali. It is developed in the Perl programming language, which supports processing of Unicode-encoded text. The system uses the Penn Treebank annotation system for part-of-speech tagging. It translates English sentences into Unicode-based Bengali text. Users can access the system at http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl.
4.5.11 Punjabi to Hindi Machine Translation System
In 2007, Josan and Lehal at Punjabi University, Patiala designed a Punjabi to Hindi machine translation system. The system is built on the paradigm of foreign machine translation systems such as RUSLAN and CESILKO. The system architecture consists of three processing modules: Pre Processing, Translation Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Guruji: Indian Language Search Engine
Guruji.com is the first Indian language search engine, founded by two IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital. Guruji.com uses crawling technology based on proprietary algorithms. For any query it searches deep into Indian language content and tries to return appropriate results. The Guruji search engine covers a range of specific content: news, entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides automated translation from English to Indian languages. The Google Transliteration Input Method Editor is currently available for languages including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion model. WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia pages into other languages. The source language in WikiBhasa is English, and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication, and it provides language consultancy. It has developed various applications in Indian languages, such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and Calendar.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt. Ltd. is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, and high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government has been aware of this need. Since 1970 the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. As a result, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, Indian languages Transliteration (ITRANS) was developed by Avinash Chopde; ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et al. 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As an end result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL decides the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].
4.6 Search Engines Available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very fast. In the last year alone, more than 10 to 15 Indian local search engines must have been launched, some small and some big, some with huge funding and some with none. This space is so crowded right now that it is difficult to know who is really winning. Nevertheless, we attempt to give a brief overview of the current scenario. Some of those that fall in the localized Indian search engine category are Guruji, Raftaar, Hinkhoj, Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique Selling Proposition) or other. It is too early to pass judgment on any of them. These are testing stages, and every start-up is adding new features and improving its services.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by Juxt Consult (2008-2009 Report)
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More info: India online 2008, India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search: English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More info: India online 2008, India online 2009 [65]
Table 4.13 Info Search: English
4.6.1.3 Info Search: Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More info: India online 2008, India online 2009 [65]
Table 4.14 Info Search: Local Language
4.7 Problems Faced While Searching in Hindi: Low Recall
A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when accessing information using Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing this information exist. In reality, these search engines face a recall problem when dealing with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) returns 0 documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows that these keywords do exist in the search results. But just saying that we have a recall problem may not be sufficient. The next obvious question is how much of a problem it is. In other words, we need to quantify the problem. For this purpose we conducted many experiments to determine at what levels this recall problem occurs, and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall (number of results returned)
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall (number of results returned)
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined morphological structure and a well-defined grammar, but the grammatical and structural standards of the language are seldom followed, for various reasons. One reason is the language diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes its dialects not only in speech but in writing as well. Every language attaches markers to a root word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना, planning in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
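Word stemming as described here can be illustrated with a very small suffix-stripping routine. This is a sketch under stated assumptions, not the stemmer used in the study: the suffix list covers only the plural markers of the Yojnaa example, the names SUFFIXES and stem are hypothetical, and the Devanagari spellings are standard forms of the transliterated words given in the text.

```python
# Illustrative plural markers only; a real Hindi stemmer needs a much larger,
# linguistically validated suffix list.
SUFFIXES = ("ओं", "एं", "एँ")
MIN_STEM_LENGTH = 2  # do not strip if the remaining stem would be too short


def stem(word, suffixes=SUFFIXES):
    """Strip the first matching suffix and return the candidate root word."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for variant in ("योजनाओं", "योजनाएं", "योजना"):
    print(variant, "->", stem(variant))
```

All three forms map to the same root योजना, so a query containing any of them could be matched against documents containing the others.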
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabet. Together these result in spelling variations of the same word; for example, the Hindi word अंग्रेजी (angrējī, meaning English) has several possible spellings. There are numerous words that are phonetically equivalent but vary in writing. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword school सक र and for a non-standard Hindi phonetic equivalent सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in phonetic variety of words. For example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL इसक र by the majority of the population in different states of India. For the Hindi word ISKOOL इसक र, more than two thousand results are found. Search engines should be capable of retrieving results for the phonetically equivalent forms of the keywords entered. The user may use any keyword for searching, and search engines should support all phonetically equivalent words.
Also, no particular standard exists for writing keywords to fetch Hindi web data. For each phonetically equivalent keyword in the query the results vary, i.e. a different set of documents is retrieved, with the least repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information.
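One way to support phonetically equivalent spellings is to reduce each keyword to a coarse key, so that variants of the same word collide. The sketch below is an illustration only and is not the method of the study: the folding rules (long vowels folded to short ones, halant and nukta dropped, a leading इ before स removed) and the four spellings of "school" used in the example are assumptions chosen for demonstration.

```python
import unicodedata

# Fold long vowels to their short counterparts (matras and independent vowels).
VOWEL_FOLD = str.maketrans({"ू": "ु", "ी": "ि", "ऊ": "उ", "ई": "इ"})
DROPPED = {"\u094d", "\u093c"}  # halant (virama) and nukta


def phonetic_key(word):
    """Return a coarse key shared by phonetically similar spellings."""
    key = unicodedata.normalize("NFC", word).translate(VOWEL_FOLD)
    key = "".join(ch for ch in key if ch not in DROPPED)
    # Heuristic: drop a prothetic "इ" before "स" (ISKOOL -> SKOOL).
    # This would also mangle genuine words beginning with "इस".
    if key.startswith("इस") and len(key) > 3:
        key = key[1:]
    return key


for spelling in ("स्कूल", "सकूल", "स्कुल", "इस्कूल"):
    print(spelling, "->", phonetic_key(spelling))
```

All four spellings produce the same key, so an interface layer could expand a query to every stored spelling that shares the key of the word the user typed.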
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
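Synonyms can be handled at query time by issuing one query per synonym. The following minimal sketch assumes a hand-built synonym table; the names SYNONYMS and expand_query are hypothetical, and the single entry uses standard spellings of the ornament example given above.

```python
# Hypothetical synonym table: keyword -> near synonyms.
SYNONYMS = {"आभूषण": ["गहना", "जेवर", "अलंकार"]}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return variants


print(expand_query("आभूषण"))
```

Because near synonyms differ in nuance, the expanded queries can lower precision even as they raise recall, so the results of the variants would still need to be checked for relevance.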
4.8.4 Ambiguous Words
Ambiguous words reduce the relevancy of the results. The examples below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can be interpreted as "mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. One word can have more than one meaning, and the meaning depends on the context of the sentence. Example: कय (Tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context, कय (Hand or arms) has हसत फ ह आच शफय in another context, and कय (to do) corresponds to कयन in yet another context.
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without awareness of which language the words belong to. Most Indians use these words in English rather than in their native language. English has its influence on Hindi not only in speech but in writing too, and this becomes more evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No | Root word(s) and listing of keywords (Google, Bing, Guruji) | Morphological variant | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | |
1.1 | | ब यत वष मवन | 50500 | 4680 | 485
1.2 | | ब यत मवष मवनो | 40400 | 680 | 61
2 | द घमटन द घमटन द घमटन ओ द घमटन द घमटन | | | |
2.1 | | द घमटन | 133000 | 2410 | 284
2.2 | | द घमटन ओ | 117000 | 420 | 23
3 | ब ष ब ष ब ष ओ ब ष ब ष | | | |
3.1 | | ब ष | 161000 | 8330 | 961
3.2 | | ब ष ए | 6200 | 935 | 188
3.3 | | ब ष ओ | 6090 | 441 | 356
4 | झ र झ रझ रो झ र झ र | | | |
4.1 | | झ र | 4740 | 278 | 25
4.2 | | झ र | 1270 | 28 | 1
5 | आऩद आऩद आऩद ओ आऩद आऩद | | | |
5.1 | | आऩद | 102000 | 4030 | 410
5.2 | | आऩद ए | 1160 | 64 | 20
6 | ऩ परज तत ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत | | | |
6.1 | | ऩ परज तत | 48200 | 1670 | 98
6.2 | | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | | ऩकषमोपरज ततम | 33800 | 747 | 25
7 | सभसम सभसम ए सभसम ओ सभ सम सभसम सभसम | | | |
7.1 | | सभसम | 584000 | 30200 | 1889
7.2 | | सभसम ओ | 584000 | 7150 | 1356
8 | कीटन शक कीटन शक कीटन शको कीटन शक कीटन शक | | | |
8.1 | | कीटन शक | 36300 | 1360 | 333
8.2 | | कीटन शको | 35800 | 800 | 270
9 | य ग य गोय ग य ग य ग | | | |
9.1 | | य ग | 205000 | 21600 | 1423
9.2 | | य चगमो | 128000 | 3280 | 239
9.3 | | य ग | 112000 | 6280 | 647
10 | म जन म जन ओ म जन म जन म जन | | | |
10.1 | | म जन | 673000 | 18500 | 3343
10.2 | | म जन ओ | 669000 | 6020 | 990
10.3 | | म जन ए | 673000 | 2860 | 416
11 | क दर क दरीमक दर क दर क दर | | | |
11.1 | | क दर | 261000 | 11300 | 655
11.2 | | क नदरो | 29500 | 1850 | 105
Figure 4.1 Precision values of the three search engines (chart of precision, 0 to 1, per query for Google, Bing and Guruji)
Table 4.19 Precision values of the three search engines
S No | Query | Precision @10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
It has been observed that all three search engines return more documents
when the query is submitted with the root word. This supports searching for
documents with the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists the morphological
variants of root words, whereas Bing and Guruji list only the root word
supplied, in almost all the sample queries in the table above.
The above results indicate that only Google indexes document
keywords in their root form. Bing and Guruji do not index in that form, which is the
reason the number of documents retrieved in their case is lower than for
Google. The overall comparison of results from the three search engines in the tables
above shows that, in general, the quantity of results retrieved increases when the
keywords are used in their root form. In the case of search engines, the quality of
results is more important than the quantity. Figure 4.1 and Table 4.19 show the
comparison of the precision values of the three search engines. The precision
value is calculated from the top 10 results of each search engine. On closely
observing the results, we can say that the precision value for Google is high in
almost all queries. Since, as mentioned above, Google does its indexing in the root form
of keywords, it can be said that the relevancy of the results is also higher in Google
than in the other two search engines. This indicates that not only the quantity
but also the quality of results is affected by morphological variations in the
keywords.
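The precision value used above (relevant results among the top 10) can be sketched as follows. This is a minimal illustration, assuming the relevance of each ranked result has been judged by hand; the function name `precision_at_k` and the choice to divide by k even when fewer than k results come back are assumptions, not details given in the text.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a ranked list of booleans (or 0/1 values), best result
    first. Dividing by k (not by the number of results returned) means an
    engine that returns fewer than k documents is penalised for the gap.
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


# Hypothetical judgements for one query: 7 of the first 10 results relevant.
judgements = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(precision_at_k(judgements))
```

Run per query and per search engine, this yields values on the same 0 to 1 scale as Table 4.19.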
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for
phonetically equivalent forms of the keywords entered. A user may use any
spelling of a keyword for searching, and search engines should support all
phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. The table below shows the number of results and the precision obtained.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query variant submitted to Google | No of results | Precision @10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart for each of Query No 1 to Query No 5)
In the above table and figure it can be clearly seen that the search engine returns
documents for each of the phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing a keyword to fetch Hindi
web data. For every phonetically equivalent keyword in the query there is variation in
the results, i.e. a different set of documents is retrieved with little
repetition. From the precision charts it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost the same.
A native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information of his/her use.
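One way to treat phonetically equivalent spellings as the same keyword is to reduce each word to a normalised key before matching. The sketch below is only an illustration under stated assumptions: the three rules (drop the nukta, write chandrabindu as anusvara, write a nasal consonant plus virama before a stop as anusvara) and the names `phonetic_key` and `group_variants` are choices made here, not rules taken from the study, and the example words are written in standard Unicode Devanagari.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in the letter for "za"
CHANDRABINDU = "\u0901"  # nasalisation sign
ANUSVARA = "\u0902"      # nasal dot above
VIRAMA = "\u094d"        # vowel killer (halant)
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # the five nasal consonants

# A nasal consonant + virama directly before a non-nasal stop (ka..bha) is
# commonly also written as an anusvara; doubled nasals are left untouched.
_NASAL_CLUSTER = re.compile(
    f"[{NASALS}]{VIRAMA}(?=[\u0915-\u092d])(?![{NASALS}])"
)


def phonetic_key(word):
    """Collapse common Devanagari spelling variants into one lookup key."""
    text = unicodedata.normalize("NFD", word)  # splits precomposed nukta letters
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    return _NASAL_CLUSTER.sub(ANUSVARA, text)


def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = {}
    for word in words:
        groups.setdefault(phonetic_key(word), []).append(word)
    return groups


# Three spellings of "andolan" (movement) fall into a single group.
print(group_variants(["आन्दोलन", "आंदोलन", "आँदोलन"]))
```

A key of this kind only merges spellings that differ by these three conventions; variants that differ in vowel length or conjunct choice would need further rules.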
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the
following commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3, presented below, show the
comparison of precision values across the three search engines.
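Table 4.21 Effect of word synonyms on Hindi IR
S No | Query | Documents returned: Google | Precision @10 | Bing | Precision @10 | Guruji | Precision @10
1 | स न क आब षण (standard Hindi word and synonyms: आब षणगहन) | | | | | |
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (standard Hindi word: फ दर) | | | | | |
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi word and synonyms: सतर न यी) | | | | | |
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi word: अ हक य) | | | | | |
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi word: व) | | | | | |
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (standard Hindi word: आ ख) | | | | | |
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1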
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their
synonyms improves information retrieval for a query in Hindi.
Not only the quantity of documents returned but also their quality is affected by
using synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns
more documents than the other two search engines and that Guruji returns the
fewest; the reason may be the
availability of fewer documents or poor indexing. However, we are interested in
the quality of results rather than the quantity. As far as quality is concerned, it can be
clearly seen that Google and Bing provide better-quality results than Guruji. In the
average case Google still ranks first, which means that the precision values for
Google are higher than those of Bing and Guruji. Thus it becomes clear
that by changing a keyword into its synonym, equivalent results can be obtained.
Therefore synonyms of keywords play an important role in
Hindi information retrieval.
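The synonym substitution described above can be sketched as a small query-expansion helper. This is an illustrative sketch only: the function name `expand_with_synonyms`, the one-substitution-per-variant policy and the hand-built synonym dictionary are assumptions, and the example words are a standard Unicode rendering of the ornament example rather than entries copied from Table 4.21.

```python
def expand_with_synonyms(query_terms, synonyms):
    """Return the original query followed by one variant per synonym.

    `query_terms` is the query split into words; `synonyms` maps a word to
    a list of alternatives. Each variant replaces exactly one word.
    """
    variants = [list(query_terms)]
    for position, term in enumerate(query_terms):
        for alternative in synonyms.get(term, []):
            variant = list(query_terms)
            variant[position] = alternative
            if variant not in variants:  # skip duplicates
                variants.append(variant)
    return [" ".join(variant) for variant in variants]


# Hypothetical synonym list for "ornament".
synonyms = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}
for query in expand_with_synonyms(["सोने", "के", "आभूषण"], synonyms):
    print(query)
```

Each generated variant would be submitted as a separate query, and the per-variant result counts and precision values compared as in Table 4.21.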
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected ambiguous queries. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the fifth
column holds the same ambiguous keyword in another context. The fourth and sixth
columns hold the meaning of the query in English with respect to the ambiguous
keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S No | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like Gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child Development and Nutrition | फ र(Hair) | Hair Development and Nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन(Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र(family) | Destruction of families in war
The ambiguous queries listed above were tested against
three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | na | Snake head | na | na
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10
From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In the above tables, the last column, labeled "Other
Context", holds the number of results which are not relevant to the query supplied
or which contain the keyword in some other, unwanted context.
From the results it is clear that all search engines return documents in different
contexts. Therefore it can be said that search engines underperform when supplied
with ambiguous queries. The numbers in the "Other Context" column signify
the deviation from relevance. For example, for the query म दध भ क र षवन श
(aggregate destruction in wars), the "Other Context" column contains 5
documents for Google, 8 documents for Bing and all 10
documents for Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake
charmer's snake head), the retrieved documents are expected to be in the context "art", but
from the above results it can be seen that Google returns all 10
documents in the unwanted context (snake head) and Bing returns 9 documents,
whereas Guruji fails to retrieve even a single document. In this scenario it
becomes important for search engines to address the issue of ambiguity in
keywords to obtain better results.
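The counts in the ambiguity tables can be summarised per search engine as the share of top-10 results that match the intended sense. The sketch below assumes that the first context column is the intended sense, as the discussion of queries 4 and 5 suggests; the function name `intended_precision` and the tuple layout are illustrative choices, and the rows are transcribed from the Google and Bing tables.

```python
def intended_precision(rows, k=10):
    """Mean share of the top-k results that match the intended sense.

    Each row is (intended, alternate, other): result counts for the first
    context, the second context and the "Other Context" column. Rows whose
    counts do not add up to k are rejected so transcription slips surface.
    """
    if not rows:
        return 0.0
    for row in rows:
        if sum(row) != k:
            raise ValueError(f"counts {row} do not add up to {k}")
    return sum(intended for intended, _, _ in rows) / (k * len(rows))


GOOGLE = [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)]
BING = [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)]

print(intended_precision(GOOGLE), intended_precision(BING))
```

Guruji is left out of the example because its fourth query returned no results, so its rows would have to be averaged over four queries rather than five.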
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but in
writing too, and this becomes especially evident in Hindi literature on the web.
The influence of English on the Hindi language has been observed to be one of
the important parameters for Hindi information retrieval, as
the following example shows.
Example: the English word exercise is written in Hindi as (एकसयस इज). The
word exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज
एकसयस इज एकस यस इज एकसयस इज. Following the phonetic
variations in this example, a variety of popular keywords and
queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
English word | Google transliteration, standard Hindi keyword, phonetic equivalents | Google results for each keyword, in the same order
Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8
From the above table it can be observed that the search engine returns documents
for single-keyword queries; documents for all phonetic variants of the keywords are
also returned, in large numbers. It can also be seen that people have
their own way of writing Hindi words and that no standard is followed for
storing Hindi data on the web. Documents are also retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above table
the column with bold Hindi entries shows the keywords obtained by
using the Google transliteration tool. Table 4.26 shows that transliteration does
not provide the correct Hindi word in most cases. For example, the correct
transliteration for the word University should be म तनवमसमटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. It is
evident from the table above that 1540 documents were retrieved for
the wrong keyword उतनव मसमतम (University), and the same holds for other single-word
queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy
(ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers
make use of unchecked and non-standard transliteration, which makes the Hindi IR
process a difficult task.
Table 4.26 (continued)
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000
In the next example, multiword Hindi queries are selected to test the effect
of English influence on Hindi IR, in terms of precision and the quantity of documents
retrieved. The Hindi query is transformed into its variants by the software (design
and working discussed in the next chapter), which replaces Hindi keywords with
English keywords written in Hindi script without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced
by its English equivalent; at the second level, more than one word is replaced by
the English equivalents, again without changing the meaning of the original
Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न)
and तनव श means "investment" (इनव सटभट). The query for the two levels is
transformed as:
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi
nivesh Bharat mein) supplied by the user is transformed into two equivalent
forms containing a mixture of English and Hindi, where the meaning
of the query remains the same. From the sample set of one hundred queries, some
randomly selected queries are presented below in Table 4.27.
Figure 4.4 Transformed queries into two equivalent senses (charts of result counts for Hindi Query 1 to Hindi Query 6 and their Level 1 and Level 2 variants)
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi: tabular representation
In English | Hindi query | Influenced Hindi query, Level 1 | Level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
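The two-level transformation described above can be sketched in a few lines. This is a hedged illustration, not the software discussed in the next chapter: the function name `transform_levels`, the exhaustive enumeration of substitutions and the romanised example words (taken from the "videshi nivesh Bharat mein" example) are assumptions made for the sketch.

```python
from itertools import combinations


def transform_levels(terms, english_equivalents):
    """Return (level1, level2) query variants.

    level1: queries with exactly one replaceable word swapped for its
            English equivalent.
    level2: queries with two or more replaceable words swapped.
    `english_equivalents` maps a Hindi word to the English word that would
    be written in Hindi script.
    """
    positions = [i for i, term in enumerate(terms) if term in english_equivalents]

    def swap(chosen):
        return " ".join(
            english_equivalents[term] if i in chosen else term
            for i, term in enumerate(terms)
        )

    level1 = [swap({i}) for i in positions]
    level2 = [
        swap(set(combo))
        for size in range(2, len(positions) + 1)
        for combo in combinations(positions, size)
    ]
    return level1, level2


query = ["videshi", "nivesh", "bharat", "mein"]
equivalents = {"videshi": "foreign", "nivesh": "investment"}
level1, level2 = transform_levels(query, equivalents)
print(level1)
print(level2)
```

For a query with many replaceable words the number of Level 2 variants grows quickly, so a practical tool would probably cap or rank the combinations it submits.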
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and the quantity of retrieved
documents is considerable. In the case of search engines, the quality of results is
more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented
below for the analysis of precision values. Three popular search engines, namely
Google, Bing and AltaVista, are used for retrieving web results.
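Table 4.28 Analysis of precision values: tabular representation
Query form | Query | Precision @10: Google | Bing | AltaVista
Hindi query 1 | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
Hindi query 2 | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
Hindi query 3 | रदमचचककतसक | 0.9 | 0.9 | 0.9
Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
Level 2 | क रड मम र िजसट | 1 | 1 | 1
Hindi query 4 | सयक यदव य य जग यम जन | 1 | 0.8 | 0.6
Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
Hindi query 5 | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
Hindi query 6 | भरषटट च यभ कतब यत | 1 | 1 | 1
Level 1 | कयपशनभ कतब यत | 1 | 1 | 1
Level 2 | कयपशनफरीइ रडम | 1 | 1 | 1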
Figure 4.5 Analysis of inter-precision values (inter-precision chart: precision for each Hindi query (HQ1 to HQ6) and its Level 1 (L1) and Level 2 (L2) variants on Google, Bing and AltaVista)
From the above tables and figures it can be clearly seen that Hindi data of a similar
nature can be mined for Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in retrieved data. The relevance of the retrieved
data can also be seen in the precision columns. For every Hindi query and its
transformed variations, the degree of relevance of documents is very close,
equal or improved. E.g. for the Hindi query सव सथ औय यकतद न, 9 out of the first 10
documents are relevant, and for the transformed queries of similar nature,
ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are
relevant as well; the same pattern repeats for the rest of the Hindi queries, as shown in the table
above. Without transformation of Hindi queries the user may miss the chance of
retrieving relevant information, as a Hindi user may not be aware of the
presence of such information on the web and may be unable to formulate the
query variations that reflect English influence. From the above table it
can be said that English-influenced Hindi information is present on the web and is increasing
day by day. By including English keywords in Hindi script in
the query, the scope of searching in Hindi and of getting relevant
information can be increased.
The process of Hindi IR becomes more difficult because of the structure of
the Hindi language. People generally do not follow the standard Hindi spelling,
which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries.
Search engines neither transform the query nor find keyword
equivalents, because they may face performance and throughput problems if
parameters like Hindi phonetics, synonyms and English-equivalent Hindi
keywords are implemented at the root level. However, this problem can be solved at
the interface level. Therefore, to lessen the effort a Hindi user needs to search for such
information, software has been developed (described in detail in
the next chapter) which acts as an interface between the user and search
engines. With the help of this tool the user can widen the scope of search on the web in
the Hindi language [66][67].
provide the translation tools to government agencies. Mantra software is available
in all forms, such as desktop, network and web based. It is based on the Lexicalized
Tree Adjoining Grammar (LTAG) formalism to represent the English as well
as the Hindi grammar. Initially it was domain specific, covering Personal
Administration, specifically Gazette Notifications, Office Orders, Office
Memorandums and Circulars; gradually the domains were expanded. At present
it also covers domains such as Banking, Transportation and Agriculture. Earlier,
Mantra technology was only for English to Hindi translation, but currently it is
also available for English to other Indian languages such as Gujarati, Bengali and
Telugu. MANTRA-Rajyasabha is a system for translating parliament
proceedings such as Papers to be Laid on the Table [PLOT], Bulletin Part-I,
Bulletin Part-II, List of Business [LOB] and Synopsis. The Rajya Sabha Secretariat of
the Rajya Sabha (the upper house of the Parliament of India) provides funds for
updating the MANTRA-Rajyasabha system.
4.5.6 UNL-based MT System between English, Hindi and Marathi
IIT Bombay has developed a Universal Networking Language (UNL)
based machine translation system for English to Hindi. UNL is a United
Nations project for developing an interlingua for the world's languages. UNL-based
machine translation is being developed under the leadership of Prof. Pushpak
Bhattacharyya, IIT Bombay.
4.5.7 English-Kannada MT System
The Department of Computer and Information Sciences of Hyderabad
University has developed an English-Kannada MT system. It is based on the
transfer approach and Universal Clause Structure Grammar (UCSG). This project
is funded by the Karnataka Government, and it is applicable in the domain of
government circulars.
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the
user: if the user is not satisfied with the system-generated
translation, the user can feed new words, phrases and
sentences back to the system and obtain a newly interpreted translation.
The Shiva MT system is available at http://ebmt.serc.iisc.ernet.in/mt/login.html.
Shakti is a rule-based system that also uses a statistical approach. It is used for the
translation of English to Indian languages (Hindi, Marathi and Telugu). Users can
access the Shakti MT system at http://shakti.iiit.net [24].
4.5.9 Tamil-Hindi MAT System
The K B Chandrasekhar Research Centre of Anna University, Chennai, has
developed a machine-aided Tamil to Hindi translation system. The translation
system is based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access the
system at http://www.au-kbc.org/research_areas/nlp/demo/mat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to
Bengali. It is developed in the Perl programming language, which supports processing
of Unicode-encoded text for text manipulation. The system uses the Penn
Treebank annotation system for part-of-speech tagging. It translates an English
sentence into Unicode-based Bengali text. Users can access the system at
http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl
4.5.11 Punjabi to Hindi Machine Translation System
During 2007, Josan and Lehal at Punjabi University, Patiala, designed a
Punjabi to Hindi machine translation system. The system is built on the paradigm
of foreign machine translation systems such as RUSLAN and CESILKO. The
system architecture consists of three processing modules: Pre Processing,
Translation Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine Guruji
Guruji.com is the first Indian language search engine, founded by two
IIT Delhi graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia
Capital. Guruji.com uses crawling technology based on proprietary algorithms. For
any query, it goes deep into Indian-language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content: news,
entertainment, travel, astrology, literature, business, education and more.
4.5.12.2 Google
The Internet search giant Google also supports major Indian languages
such as Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam
and Punjabi, and provides automated translation from English to
Indian languages. The Google Transliteration Input Method Editor is currently
available for languages including Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of
computer applications. The tool supports major Indian languages such as Bengali,
Hindi, Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based
conversion model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages. The source language in
WikiBhasa is English, and the target language can be any Indian local
language(s).
4.5.12.4 Webdunia
Webdunia is an important private player which assists the development of
Indian language technology in areas such as text translation, software
localization and website localization. It is also involved in research and
development on corpus creation/collection and content syndication. Moreover, it
provides language consultancy. It has developed various
applications in Indian languages, such as My Webdunia, Searching, Language
Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest,
Calendar, etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development
of Indian-language software. It provides Indian language enablement
technology to many state governments and the central government in e-governance
programs. It has developed software for multilingual content creation for
publishing newspapers and has also developed high-quality Unicode-based
fonts for major Indian languages. It has specifically developed the Shree-Lipi
Gurjrati package for the Gujarati language, which is useful in the DTP sector,
corporate offices and the e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970 the Department
of Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently, ISCII (Indian Script
Code for Information Interchange) was developed for Indian languages on the
pattern of ASCII (American Standard Code for Information Interchange). In addition,
Indian Languages Transliteration (ITRANS) was developed by Avinash Chopde;
ITRANS represents Indian language alphabets in terms of ASCII (Madhavi et
al., 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also making efforts towards the
proliferation of language technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defence Research and Development Organisation), the Department of
Atomic Energy, the All India Council of Technical Education and UGC (University Grants
Commission), are also involved directly and indirectly in research and
development of language technology. All these agencies help develop important
areas of research and provide funds for research to development agencies. As an
end result, IndoWordNet was developed for the Indian languages on the pattern of
the English WordNet.
4.5.1.3.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL sets the major and minor goals for Indian language
technology and provides standards for it. The TDIL journal Vishvabharata
(Jan 2010) outlined short-term, intermediate and long-term goals for developing
language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very fast.
In the last year alone, probably more than 10-15 Indian local search engines
were launched, some smaller and some bigger, some with huge funding and some
with none. This space is now so crowded that it is difficult to know who is
really winning. However, we attempt to give a brief overview of the current
scenario. The localized Indian search engines include Guruji, Raftaar, Hinkhoj
Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj,
nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which
launched a couple of days back. There are also localized versions of the big
giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition). It is too early to pass judgment on any of them. These
are testing stages, and every start-up is adding new features and improving
its services.
4.6.1 Most Used Search Tools in India for Web Activity: a Survey by JuxtConsult,
2008-2009 Report
4.6.1.1 Most Used Websites
Website      2008 Stats   2009 Stats
Google       37           35
Yahoo        32           25
Rediff       7            4
Orkut        6            7
More Info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website      2008 Stats   2009 Stats
Google       81           76
Yahoo        7            7
Wikipedia    3            6
English      3            4
More Info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website      2008 Stats   2009 Stats
Google       65           34
Yahoo        12           29
Rediff       4            15
Teluguone    2            0
Guruji       1            18
Raftaar      1            0.2
Hindi        1            NA
Webdunia     1            0.7
Khoj         0.7          2
More Info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, using
present-day popular techniques, shows a severe problem of low recall when
information is accessed with Indian language queries. For instance, popular web
search engines such as Google, Yahoo and Guruji often return 0 search results
for Indian language queries, giving the impression that no documents containing
the information exist. In reality, these search engines face a recall problem
with Indian languages because of multiple spellings, morphological variants of
keywords, and English keywords written in Hindi. Table 4.15 illustrates a few
such cases. For example, the Hindi query for "world trade center aatank-waadi
hamlaa", "वलडम टर ड स नटय आत कव दी हरभ", returns 0 documents in Table 4.15;
however, a small rephrasing of the query in Table 4.16 shows that documents
containing these keywords do exist. Merely saying that there is a recall
problem is not sufficient, though. The obvious next question is "how much of a
problem is it?" In other words, the problem needs to be quantified. For this
purpose we conducted many experiments to determine at what levels this recall
problem occurs, and by how much.
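The quantity being measured can be stated precisely. The sketch below is only an illustration, not the procedure used in these experiments: it assumes a small, hand-judged set of relevant documents (the identifiers doc1, doc2, ... are hypothetical), which is rarely available on the open web, where the number of documents returned usually has to serve as a rough proxy.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical judgments: four documents are relevant to the information need.
relevant_docs = {"doc1", "doc2", "doc3", "doc4"}

# A query spelling that matches nothing, and a rephrased query that matches some.
no_hits = recall([], relevant_docs)
rephrased = recall(["doc1", "doc3", "doc9"], relevant_docs)
print(no_hits, rephrased)
```

A query that returns 0 documents has recall 0 even though relevant documents exist, which is exactly the situation the tables in this section describe.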
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड स नटय आत कव दी हरभ                        0        0       0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम            0        0       0
Table 4.16 Improved Recall
Hindi Query                                        Google   Yahoo   Guruji
वलडम टर ड सटय आतॊकी हभर                             8820     92      12
वलडम टर ड सटय आतॊकीअटक                              331      10      1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम                   708      50      1
बायतीम सॊटथान टवाटम शिा औय िोध                      37100    7400    93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological
structure and grammar. In practice, however, grammatical and structural
standards are often not followed, for various reasons. One reason is the
language diversity of India. Including Hindi, there are about 28 languages
spoken in India, and Hindi, being the official language of India, is influenced
by the regional languages, which leads to dialectal variation not only in
speaking but also in writing. Every language attaches markers to a root word to
construct new words: English uses suffixes such as s, es and ing, and Hindi
uses MAATRAAS such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein
(योजनाएं) are morphological variants of the Hindi root word Yojnaa (योजना,
planning in English). It is desirable to combine all the morphological variants
of a word into a single canonical form. This process is called word stemming,
and the canonical form is called the root word or base word.
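Stemming of this kind can be sketched as simple suffix stripping. The code below is a minimal illustration under stated assumptions, not the stemmer used in this study: the suffix list is a small hypothetical sample of plural and oblique endings, and a usable stemmer would need a much fuller inventory and exception handling.

```python
import unicodedata

# Hypothetical, deliberately small list of Hindi plural/oblique endings.
SUFFIXES = ["ओं", "एं", "एँ", "ों", "ें"]


def stem(word: str) -> str:
    """Strip one known suffix, keeping at least two characters of the stem."""
    word = unicodedata.normalize("NFC", word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of surface forms that reduce to it."""
    groups = {}
    for w in words:
        groups.setdefault(stem(w), []).append(w)
    return groups


variants = ["योजना", "योजनाओं", "योजनाएं"]
print(group_by_stem(variants))
```

With this list, the three forms of Yojnaa from the example above fall into one group, so a query for any of them could be matched against documents containing the others.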
4.8.2 Phonetic nature of Hindi Language and Spelling variations
Spelling variations arise mainly from the phonetic nature of Indian languages,
multiple dialects, the transliteration of proper names, words borrowed from
regional and foreign languages, and the phonetic variety of the Indian language
alphabet. Together, these factors produce several spellings of the same word.
For example, the Hindi word अ गर ज (angrējī, meaning English) has several
possible spelling variations.
There are numerous words that are phonetically equivalent but vary in writing.
The word school can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword for school, सक र, and for
a non-standard phonetically equivalent keyword, सक र, Google shows 69 million
results for the former and 14 million for the latter. Hindi is influenced by
the other regional languages, which results in phonetic variety of words. For
example, the English word school (सक र in Hindi) is pronounced and written as
ISKOOL (इसक र) by the majority of the population in different states of India.
For the Hindi word ISKOOL (इसक र), more than two thousand results are found.
Search engines should be capable of retrieving results for all phonetic
equivalents of the keywords entered, since the user may use any of them for
searching.
Also, no particular standard exists for writing keywords to fetch Hindi web
data. Each phonetically equivalent keyword in a query produces different
results, i.e., a different set of documents is retrieved, with little overlap.
A native Hindi user may not be aware of these phonetic issues in Hindi IR and
may miss information relevant to his/her needs.
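One way to make phonetically equivalent spellings match is to reduce each word to a coarse matching key before comparison. The sketch below is an assumption-laden illustration, not a method taken from this study: the rules (drop the nukta, treat chandrabindu as anusvara, write a nasal consonant plus virama before a consonant as anusvara, and ignore vowel length for i and u) are a hypothetical minimal set, and the key is lossy by design, so it should be used only for matching and never for display.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"

# Nasal consonant + virama, but only when another consonant follows.
NASAL_CLUSTER = re.compile("[ङञणनम]\u094d(?=[\u0915-\u0939])")

# Collapse long vowel signs to their short counterparts.
VOWEL_LENGTH = str.maketrans({"ी": "ि", "ू": "ु"})


def phonetic_key(word: str) -> str:
    """Return a coarse key shared by common spelling variants of a word."""
    w = unicodedata.normalize("NFD", word)   # splits precomposed nukta letters
    w = w.replace(NUKTA, "")
    w = w.replace(CHANDRABINDU, ANUSVARA)
    w = NASAL_CLUSTER.sub(ANUSVARA, w)
    return w.translate(VOWEL_LENGTH)


for a, b in [("हिन्दी", "हिंदी"), ("आन्दोलन", "आँदोलन"), ("कम्प्यूटर", "कंप्यूटर")]:
    print(a, b, phonetic_key(a) == phonetic_key(b))
```

Indexing documents and queries by such a key would let the variants retrieve the same set of documents, at the cost of occasionally merging words that are genuinely different.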
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहने, जेवर, अलंकार.
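Synonym handling is commonly implemented as query expansion. The following is a minimal sketch, assuming a hand-built synonym dictionary (the `SYNONYMS` table and the function name `expand_query` are hypothetical); a real system would draw synonyms from a lexical resource and would need to rank the expanded results.

```python
# Hypothetical synonym table built from the example above.
SYNONYMS = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym substitution."""
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alt in synonyms.get(token, []):
            variant = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants


for q in expand_query("सोने के आभूषण"):
    print(q)
```

Each variant can then be submitted separately and the result lists merged, which is one way to recover documents that use a different word for the same concept.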
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results, as the examples below
show. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it also means "to sleep".
In the context of the query above, सोना means gold, but the query can also be
interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the query above it means "common"; however,
in Hindi it also means "mango", so the query can also be interpreted as
"mango is people's choice".
Many words are polysemous. Finding the correct sense of a word in a given
context is an intricate task, because one word can have more than one meaning
and the meaning depends on the context of the sentence. For example, कर means
tax in one context (synonyms: ब्याज, शुल्क, सूद, महसूल, टैक्स), hand or arm in
another (synonyms: हस्त, बाहु, आच शफय) and "to do" (करना) in a third.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Indians are
sometimes unable to recall the Hindi equivalent of an English word. For
instance, words such as road, bus, pen, television, radio, please, rail, email,
password, insurance, internet, director and department are used even by
uneducated Indians, without awareness of the language those words come from.
Most Indians use these words in English rather than in their native language.
English influences Hindi not only in speaking but in writing too, and this
becomes especially evident in Hindi literature on the web. The influence of
English on Hindi has been observed to be one of the most important parameters
for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from this set, which throw light on
the effect of the root word on the performance of Hindi language search
engines. Table 4.18 shows examples of the effect of morphological factors on
Hindi queries.
Table 4.17 List of Hindi queries
S No   Query in Hindi                   Meaning in English
1      ब यतवष मवन                       Indian rain forest
1.1    ब यत मवष मवनो                    Indian rain forests
2      हव ईद घमटन क क यण               Reason for air crash
2.1    हव ईद घमटन ओ क क यण             Reason for air crashes
3      ब यतभफ रीज न व रीब ष             Language spoken in India
3.1    ब यतभफ रीज न व रीब ष ए           Languages spoken in India
3.2    ब यतभफ रीज न व रीब ष ओ           Languages spoken in India
4      षवर पतह न ऩयझ र                  Lake on the verge of extinction
4.1    षवर पतह न ऩयझ र                  Lakes on the verge of extinction
5      पर क ततकआऩद                      Natural calamity
5.1    पर क ततकआऩद ए                    Natural calamities
6      ऩ कीपरज तत                       Bird species
6.1    ऩकषमोकीपरज तत                    Bird's species
6.2    ऩकषमोकीपरज ततम                   Bird's species
7      क षषसभसम                         Agriculture problem
7.1    क षषसभसम ओ                       Agriculture problems
8      कीटन शकक इसत भ र                 Use of pesticide
8.1    कीटन शकोक इसत भ र                Use of pesticides
9      भ नमसकय ग                        Mental illness
9.1    भ नमसकय चगमो                     Mental patients
9.2    भ नमसकय ग                        Mental patient
10     गर भ णषवक सम जन                  Policy for village
10.1   गर भ णषवक सम जन ओ                Policies for village
10.2   गर भ णषवक सम जन ए                Policies for village
11     परभ खक षषक दर                    Major agricultural office
11.1   परभ खक षषक नदरो                  Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No   Morphological variant      Documents returned: Google   Bing    Guruji

1  Root words: ब यत वष मवन
   Listing of keywords (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
   1.1   ब यत वष मवन              50500    4680    485
   1.2   ब यत मवष मवनो            40400    680     61
2  Root word: द घमटन
   Listing of keywords (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
   2.1   द घमटन                   133000   2410    284
   2.2   द घमटन ओ                 117000   420     23
3  Root word: ब ष
   Listing of keywords (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
   3.1   ब ष                      161000   8330    961
   3.2   ब ष ए                    6200     935     188
   3.3   ब ष ओ                    6090     441     356
4  Root word: झ र
   Listing of keywords (Google, Bing, Guruji): झ रझ रो झ र झ र
   4.1   झ र                      4740     278     25
   4.2   झ र                      1270     28      1
5  Root word: आऩद
   Listing of keywords (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
   5.1   आऩद                      102000   4030    410
   5.2   आऩद ए                    1160     64      20
6  Root words: ऩ परज तत
   Listing of keywords (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
   6.1   ऩ परज तत                 48200    1670    98
   6.2   ऩकषमोपरज तत              47600    1150    84
   6.3   ऩकषमोपरज ततम             33800    747     25
7  Root word: सभसम
   Listing of keywords (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
   7.1   सभसम                     584000   30200   1889
   7.2   सभसम ओ                   584000   7150    1356
8  Root word: कीटन शक
   Listing of keywords (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
   8.1   कीटन शक                  36300    1360    333
   8.2   कीटन शको                 35800    800     270
9  Root word: य ग
   Listing of keywords (Google, Bing, Guruji): य गोय ग य ग य ग
   9.1   य ग                      205000   21600   1423
   9.2   य चगमो                   128000   3280    239
   9.3   य ग                      112000   6280    647
10 Root word: म जन
   Listing of keywords (Google, Bing, Guruji): म जन ओ म जन म जन म जन
   10.1  म जन                     673000   18500   3343
   10.2  म जन ओ                   669000   6020    990
   10.3  म जन ए                   673000   2860    416
11 Root word: क दर
   Listing of keywords (Google, Bing, Guruji): क दरीमक दर क दर क दर
   11.1  क दर                     261000   11300   655
   11.2  क नदरो                   29500    1850    105
Figure 4.1 Precision values of the three search engines (series: P Google,
P Bing, P Guruji; one group of bars per query)
Table 4.19 Precision values of the three search engines
S No   Query                            Precision (top 10): Google   Bing   Guruji
1.1    ब यतवष मवन                       0.5   0.3   0.1
1.2    ब यत मवष मवनो                    0.3   0.1   0.1
2.1    हव ईद घमटन क क यण               0.5   0.3   0.1
2.2    हव ईद घमटन ओ क क यण             0.3   0.3   0.1
3.1    ब यतभफ रीज न व रीब ष             0.9   0.5   0.5
3.2    ब यतभफ रीज न व रीब ष ए           0.5   0.4   0.3
3.3    ब यतभफ रीज न व रीब ष ओ           0.5   0.2   0.2
4.1    षवर पतह न ऩयझ र                  0.5   0.3   0.2
4.2    षवर पतह न ऩयझ र                  0.5   0.3   0
5.1    पर क ततकआऩद                      1     0.4   0.3
5.2    पर क ततकआऩद ए                    0.6   0.6   0.1
6.1    ऩ कीपरज तत                       1     0.7   0.2
6.2    ऩकषमोकीपरज तत                    0.9   0.6   0.1
6.3    ऩकषमोकीपरज ततम                   0.9   0.6   0.1
7.1    क षषसभसम                         0.7   0.3   0.2
7.2    क षषसभसम ओ                       0.7   0.4   0.2
8.1    कीटन शकक इसत भ र                 1     0.8   0.2
8.2    कीटन शकोक इसत भ र                0.9   0.6   0.2
9.1    भ नमसकय ग                        0.7   0.6   0.4
9.2    भ नमसकय चगमो                     0.9   0.6   0.3
9.3    भ नमसकय ग                        0.9   0.5   0.4
10.1   गर भ णषवक सम जन                  0.9   0.7   0.3
10.2   गर भ णषवक सम जन ओ                1     0.6   0
10.3   गर भ णषवक सम जन ए                0.8   0.6   0.2
11.1   परभ खक षषक दर                    0.5   0.4   0
11.2   परभ खक षषक नदरो                  0.4   0.2   0.2
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching with the
root word, because in general we get better results with keywords in their
root form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in
almost all the sample queries listed in the table above.
These results suggest that only Google indexes document keywords in their root
form. Bing and Guruji do not index in that form, which is why they retrieve
fewer documents than Google. The overall comparison of the three search
engines in the tables above shows that, in general, the quantity of results
retrieved increases when keywords are used in their root form. For search
engines, however, the quality of results is more important than the quantity.
Figure 4.1 and Table 4.19 compare the precision values of the three search
engines, where precision is calculated over the top 10 results of each search
engine. On close observation, the precision of Google is high in almost all
queries. Since, as mentioned above, Google indexes keywords in their root
form, the relevance of its results is also higher than that of the other two
search engines. This indicates that morphological variation in the keywords
affects not only the quantity but also the quality of results.
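The precision figure used here can be computed as follows. This is a minimal sketch under stated assumptions: relevance judgments are given as a list of booleans in rank order, and the count of relevant results is divided by k even when fewer than k results are returned (the usual precision-at-k convention; the text does not say how short result lists were handled).

```python
def precision_at_k(judgments, k=10):
    """Precision over the top k results; judgments are booleans in rank order."""
    top = list(judgments)[:k]
    return sum(1 for relevant in top if relevant) / k


# Hypothetical judgments for one query: 5 of the first 10 results are relevant.
example = [True, False, True, True, False, False, True, False, True, False]
print(precision_at_k(example))
```

A value of 0.5 in Table 4.19 therefore corresponds to 5 relevant documents among the first 10 returned.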
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for all phonetic
equivalents of the keywords entered, since the user may use any of them for
searching.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the results and precision
offered by the search engine.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Google results for query having keywords      No of Results   Precision (top 10)

1  Hindi query (standard keywords in bold): ससजिमो भ िहयीर ऩद थम
   Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
   ससजिमोिहयीर                                97              0.9
   ससजिमो जहयीर                               632             0.9
   सिबजमो िहयीर                               194             0.9
2  Hindi query (standard keywords in bold): आसभान छ त भहॊगाई
   Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
   आसभ न भहॊगाई                               35300           1.0
   आसभ न भहॉगाई                               1040            1.0
   आसभ भह ग ई                                 14              0.6
   आसभाॊ भह ग ई                               563             0.7
3  Hindi query (standard keywords in bold): भरषटाचाय स आिादी
   Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
   भरषटाचायआज दी                              211000          0.8
   भरशटाचायआज दी                              214             0.6
   बयषटाचाय आज दी                             447             0.7
   भरषटट च यआिादी                             1090000         0.9
   भरशट च य आिादी                             1040            0.7
   बयषटट च यआिादी                             1190            0.8
4  Hindi query (standard keywords in bold): अननाहिाय क आनदोरन
   Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
   अनन हज य आनदोरन                            84700           0.3
   अनन हज य आॊदोरन                            85100           0.8
   अनन हज य क आॉदोरन                          78              0.6
   अनना हिाय क आनद रन                         399             0.5
   अनन हिाय क आनद रन                          3260000         1.0
5  Hindi query (standard keywords in bold): फयोिगायी सभसम सभ ध न
   Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
   फयोिगायी                                   9650            0.9
   फयोिगायी                                   80600           1.0
   फयोिगायी                                   170             0.7
   फयोिगायी                                   30              0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart
each for Query No 1 to Query No 5)
In the above table and figure it can be clearly seen that search engines return
a handful of documents for various phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing keywords to fetch Hindi
web data. Each phonetically equivalent keyword in a query produces different
results, i.e., a different set of documents is retrieved, with little overlap.
From the precision chart it is clearly observed that the degree of relevance
for queries containing phonetically equivalent keywords is almost the same. A
native Hindi user may not be aware of these phonetic issues in Hindi IR and may
miss information relevant to his/her needs.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहने, जेवर, अलंकार.
Table 4.21 and Figure 4.3, presented below, compare the precision values of the
three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
Documents returned and precision per 10 results
S No   Query                  Google    Per 10   Bing    Per 10   Guruji   Per 10

1  Query: स न क आब षण   Standard Hindi words / synonyms: आब षणगहन
   1.1   स न क आब षण          217000    0.8      3250    0.7      381      0.5
   1.2   स न क गहन            188000    0.8      2590    0.6      389      0.5
   1.3   स न क ज वय           78900     0.8      1670    0.7      311      0.6
   1.4   स न क अर क य         9490      0.5      633     0.4      70       0
   1.5   स न क आबयण           493       0.5      38      0.3      1        0
2  Query: क र फ दर   Standard Hindi words / synonyms: फ दर
   2.1   क र फ दर             233000    0.7      7510    0.7      733      0.3
   2.2   क र भ घ              40700     0.9      1500    0.8      99       0.6
   2.3   क र जरधय             1570      0.6      54      0.6      2        0.2
3  Query: सतर सशिकतकयण   Standard Hindi words / synonyms: सतर न यी
   3.1   सतर सशिकतकयण         9950      0.9      1570    0.7      760      0.6
   3.2   न यीसशिकतकयण         29300     0.9      1910    0.9      736      0.4
   3.3   भहहर सशिकतकयण        96300     0.9      5160    0.8      1091     0.3
   3.4   औयतसशिकतकयण          7670      0.8      680     0.7      510      0.2
4  Query: मसक दयक अ हक य   Standard Hindi words / synonyms: अ हक य
   4.1   मसक दयक अ हक य       1990      0.4      18      0.3      60       0.6
   4.2   मसक दयक अमबभ न       2400      0.6      304     0.2      16       0.1
   4.3   मसक दयक घभ ड         495       0.5      54      0.6      9        0.1
5  Query: व रग ओ   Standard Hindi words / synonyms: व
   5.1   व रग ओ               6960      1        698     0.8      29       0.5
   5.2   ऩ डरग ओ              13400     1        1080    0.9      143      0.9
   5.3   दयखतरग ओ             481       0.5      19      0.5      0        0
6  Query: आ खद न   Standard Hindi words / synonyms: आ ख
   6.1   आ खद न               34000     0.7      3690    0.5      312      0.3
   6.2   न तरद न              77500     1        3240    1        159      0.9
   6.3   च द न                2450      0.8      427     0.1      36       0.1
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a query in Hindi. Using
synonyms of Hindi keywords affects not only the quantity of documents returned
but also their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines, and that Guruji returns the
fewest; the reason may be that fewer documents are available to it or that its
indexing is poor. However, we are interested in the quality of results rather
than the quantity. As far as quality is concerned, it can be clearly seen that
Google and Bing provide better results than Guruji. In the average case Google
still stands first, that is, its precision values are higher than those of
Bing and Guruji. Thus it becomes clear that replacing a keyword with a synonym
yields equivalent results. Synonyms of keywords therefore play an important
role in a Hindi information retrieval system.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected
queries below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns give the English meaning of the query for the respective
sense of the keyword.
Table 4.22 List of randomly selected ambiguous queries
S No  Query                   For keyword as    In English                          For keyword as     In English
1     नारी को सोना पसंद है      सोना (Gold)       Women like gold                     सोना (To sleep)    Women like to sleep
2     आम लोगों की पसंद         आम (Common)       Common man's choice                 आम (Mango)         Mango is people's choice
3     बाल विकास और पोषण       बाल (Children)    Child development and nutrition     बाल (Hair)         Hair development and nutrition
4     सपेरों का फन             फन (Art)          Art of snake charmers               फन (Snake head)    Snake charmer's snake head
5     युद्ध में कुल विनाश        कुल (Aggregate)   Aggregate destruction in wars       कुल (Family)       Destruction of families in war
The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query  Ambiguous keyword  Documents returned  Context     Results found  Context      Results found  Other context
1      सोना               50800               Gold        5              To sleep     2              3
2      आम                 488000              Common      3              Mango        3              4
3      बाल                2900000             Children    7              Hair         3              0
4      फन                 184                 Art         0              Snake head   10             0
5      कुल                17800               Aggregate   2              Family       3              5
Table 4.24 Ambiguity test for Bing
Query  Ambiguous keyword  Documents returned  Context     Results found  Context      Results found  Other context
1      सोना               2680                Gold        2              To sleep     2              6
2      आम                 17800               Common      3              Mango        3              4
3      बाल                4030                Children    6              Hair         2              2
4      फन                 25                  Art         0              Snake head   9              1
5      कुल                1900                Aggregate   0              Family       2              8
Table 4.25 Ambiguity test for Guruji
Query  Ambiguous keyword  Documents returned  Context     Results found  Context      Results found  Other context
1      सोना               109                 Gold        0              To sleep     0              10
2      आम                 6756                Common      3              Mango        0              7
3      बाल                635                 Children    5              Hair         2              3
4      फन                 No Results Found    Art         na             Snake head   na             na
5      कुल                84                  Aggregate   0              Family       0              10
From the results in the tables above it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In these tables the last column, labeled "Other context",
holds the number of results that are not relevant to the query supplied, or
that contain the keywords in some other, non-required context. The results
make clear that all the search engines return documents in different contexts;
search engines therefore underperform when supplied with ambiguous queries. The
numbers in the "Other context" column signify the deviation from relevance.
For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars),
the "Other context" column contains 5 documents for Google, 8 documents for
Bing and all 10 documents for Guruji.
For another query, सपेरों का फन (art of snake charmers; the other context is
snake charmer's snake head), the retrieved documents are expected to be in the
context "art". The results show, however, that Google returns all 10 documents
in the non-required context (snake head) and Bing returns 9 such documents,
whereas Guruji fails to retrieve even a single document. In this scenario it
becomes important for search engines to address the issue of ambiguity in
keywords in order to obtain better results.
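The counts in the ambiguity tables can be turned into shares of the top 10 results to make the deviation from relevance directly comparable across engines. The sketch below is illustrative only: the names `SenseCounts` and `sense_shares` are hypothetical, and it assumes that the first context column of each table is the intended sense of the query.

```python
from typing import NamedTuple, Optional


class SenseCounts(NamedTuple):
    intended: int   # results in the intended sense of the keyword
    alternate: int  # results in the other sense of the keyword
    other: int      # results in neither sense ("Other context")


def sense_shares(counts: SenseCounts) -> Optional[dict]:
    """Share of judged results per category, or None if nothing was judged."""
    total = sum(counts)
    if total == 0:
        return None
    return {name: value / total for name, value in counts._asdict().items()}


# Counts for query 5 (aggregate destruction in wars) as reported in the tables.
query5 = {
    "Google": SenseCounts(2, 3, 5),
    "Bing": SenseCounts(0, 2, 8),
    "Guruji": SenseCounts(0, 0, 10),
}
for engine, counts in query5.items():
    print(engine, sense_shares(counts))
```

Read this way, only the share in the intended sense counts towards precision, and the remaining share measures how far the engine drifted into the wrong context.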
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speaking but in writing too, and this
becomes especially evident in Hindi literature on the web. The influence of
English on Hindi has been observed to be one of the most important parameters
for Hindi information retrieval, as the following example shows.
Example: the English word exercise is written in Hindi as (एकसयस इज). The
word exercise (एकसयस इज) has the following phonetic variations in Hindi:
एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the pattern of
phonetic variation shown in this example, a variety of popular keywords and
queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
(search engine: Google)
English word   Hindi keywords (Google transliteration, standard Hindi keyword, phonetic equivalents)   Google results (same order)
Woman          वोभन व भन व भ न व भ न                              42200     101000    3520      34800
Insurance      इनसयाॊस इ शम यस इ शम य स इनशम य नस                  1220      246000    1390      3710
Cancer         क सय क सय क नसय क नसय                              1880000   35700     6490      1820
Hospital       हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर                    404000    263200    2440      100
Corruption     कोररसततओन कयपशन कयऩशन क यपशन                      1890      368000    1110      58
Computer       कॊ तमटय कमपम टय कमपम टय क पम टय                    4450000   1040000   2610000   537000
University     उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी          1540      1070000   1420      3270
Director       डियकटय ड मय कटय ड इय कटय ड मय कटय                  4600      735000    97000     8300
Accident       एसकसिट एकस डट एकस ड नट ऐिकसडट                      26300     125000    2510      5160
Parliament     ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट                 3240      38000     639       1120
Specialist     टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट            143       1440      51300     9820
Expert         एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम                        269000    8730      182       8
Policy         ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस                            609       395000    69700     289000
From the table above it can be observed that the search engine returns
documents for single-keyword queries, and that documents are also returned, in
large numbers, for all phonetic variants of the keywords. It can also be seen
that people have their own ways of writing Hindi words and that no standard is
followed for storing Hindi data on the web. Documents are retrieved for every
phonetic variant of an English keyword written in Hindi script. In the table,
the column with bold Hindi entries shows the keywords obtained with the Google
transliteration tool. Table 4.26 shows that transliteration does not provide
the correct Hindi word in most cases. For example, the correct transliteration
of the word University should be म तनवमसमटी, whereas Google transliteration
provides the word उतनव मसमतम, which is completely wrong. The table shows that
1540 documents were nevertheless retrieved for the wrong keyword उतनव मसमतम
(University), and the same holds for other single-word queries: Insurance
(इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम)
and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers make
use of unchecked and non-standard transliteration, which makes the Hindi IR
process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on Hindi IR in terms of precision and quantity of
documents retrieved. The Hindi query is transformed into its variants by the
software (its design and working are discussed in the next chapter), which
replaces Hindi keywords with English keywords written in Hindi without
changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by its English equivalent, in both cases without changing the meaning
of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न)
and तनव श means "investment" (इनव सटभट). The query is transformed for the two
levels as follows:
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein)
supplied by the user is thus transformed into two equivalent senses containing
a mixture of English and Hindi, while the meaning of the query remains the
same. From the sample set of one hundred queries, some randomly selected
queries are presented below in Table 4.27.
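The two-level transformation can be sketched in a few lines. This is an illustration under stated assumptions, not the software described in the next chapter: the `LOANWORDS` table, its Devanagari spellings of the English words, and the function name `transform` are hypothetical, and every keyword found in the table is treated as replaceable.

```python
from itertools import combinations

# Hypothetical mapping from Hindi keywords to English words written in Hindi.
LOANWORDS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}


def transform(query, loanwords=LOANWORDS):
    """Return {1: variants with one word replaced, 2: variants with several replaced}."""
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in loanwords]
    levels = {1: [], 2: []}
    for size in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, size):
            variant = [
                loanwords[token] if i in chosen else token
                for i, token in enumerate(tokens)
            ]
            levels[1 if size == 1 else 2].append(" ".join(variant))
    return levels


result = transform("विदेशी निवेश भारत में")
print(result[1])
print(result[2])
```

For the example query this yields two level 1 variants (one per replaceable keyword) and one level 2 variant; the example in the text shows only one of the possible level 1 variants.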
Table 4.27 Queries transformed into two equivalent senses containing a mixture
of English and Hindi (tabular representation)
In English | Hindi Query | Influenced Hindi Query Level 1 | Influenced Hindi Query Level 2 | Google results (Hindi Query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4 Queries transformed into two equivalent senses (Google search results
for Hindi Query 1 to Hindi Query 6 at Level 1 and Level 2)
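The two-level transformation described above can be sketched in a few lines of
Python. This is only an illustrative sketch, not the software developed in this
study: the lexicon `LOANWORDS`, the function name `transform` and the Devanagari
spellings (read from the transliteration "videshi nivesh Bharat mein") are
assumptions made for the example.

```python
# Hypothetical lexicon: Hindi keyword -> English equivalent written in Devanagari.
LOANWORDS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform(query, lexicon=LOANWORDS):
    """Return (level1, level2) variants of a Hindi query.

    Level 1 replaces only the first keyword found in the lexicon.
    Level 2 replaces every keyword found in the lexicon and is produced
    only when more than one keyword can be replaced.
    A level that cannot be built is returned as None.
    """
    tokens = query.split()
    hits = [i for i, tok in enumerate(tokens) if tok in lexicon]
    level1 = level2 = None
    if hits:
        first = list(tokens)
        first[hits[0]] = lexicon[tokens[hits[0]]]
        level1 = " ".join(first)
    if len(hits) > 1:
        level2 = " ".join(lexicon.get(tok, tok) for tok in tokens)
    return level1, level2


level1, level2 = transform("विदेशी निवेश भारत में")
print(level1)
print(level2)
```

In practice the lexicon would have to be built from observed usage on the web,
since the same English word can be written in Devanagari in more than one way.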
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the number of
retrieved documents is considerable. For search engines, however, the quality of
results is more important than the quantity; therefore Table 4.28 and Figure 4.5
are presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, are used for retrieving web results.
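Table 4.28 Analysis of precision values (tabular representation)
Precision@10 is given per search engine for the Hindi Query (HQ) and its Level 1 (L1) and Level 2 (L2) variants
Hindi Query | Influenced Hindi Query Level 1 | Influenced Hindi Query Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1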
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be clearly seen that Hindi data of a
similar nature can be mined against Hindi queries by transforming the queries
into variants that include English keywords written in Hindi script. The
transformation of queries resulted in an increase in retrieved data. The
relevance of the retrieved data can be seen in the precision columns: for every
Hindi query and its transformed variations, the degree of relevance of the
documents is very close, equal or improved. For example, for the Hindi query
सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the
transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन,
9 of the first 10 documents are also relevant; the same pattern repeats for the
rest of the Hindi queries shown in the table above. Without transformation of
Hindi queries the user may miss relevant information, because the Hindi user may
not be aware that such information is present on the web and may be unable to
formulate the query variation that reflects English influence. From the above
table it can be said that English-influenced Hindi information is present on the
web and is increasing day by day. By including English keywords in Hindi script
in the query, the scope of searching in Hindi and of getting relevant
information can be increased.
Inter-precision chart for Figure 4.5 (x-axis: HQ1 to HQ6, each with L1 and L2;
HQ = Hindi Query, L = Level; series: Google, Bing, AltaVista)
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard for writing Hindi,
which widens the gap between Hindi web data and users.
The relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at the
root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the next
chapter) that acts as an interface between the user and search engines. With the
help of this tool the user can widen the scope of search on the web in the Hindi
language [66][67].
4.5.8 SHIVA and SHAKTI MT
Shiva is an example-based system. It provides a feedback facility to the user:
if the user is not satisfied with the translated sentence generated by the
system, the user can provide feedback in the form of new words, phrases and
sentences and obtain a newly translated sentence. The Shiva MT system is
available at (httpebmtserciiscernetinmtloginhtml).
Shakti is a rule-based system that also uses a statistical approach. It is used
for the translation of English to Indian languages (Hindi, Marathi and Telugu).
Users can access the Shakti MT system at (httpshaktiiiitnet) [24].
4.5.9 Tamil-Hindi MAT System
The K B Chandrasekhar Research Centre of Anna University, Chennai, has developed
a machine-aided Tamil to Hindi translation system. The translation system is
based on the Anusaaraka machine translation system and follows a lexicon
translation approach. It also has small sets of transfer rules. Users can access
the system at httpwwwaukbcorgresearch_areasnlpdemomat
4.5.10 Anubadok
Anubadok is a software system for machine translation from English to Bengali.
It is developed in the Perl programming language, which supports the processing
of Unicode-encoded text. The system uses the Penn Treebank annotation system for
part-of-speech tagging. It translates an English sentence into Unicode-based
Bengali text. Users can access the system at
httpbengalinuxsourceforgenetcgibinanubadokindexpl
4.5.11 Punjabi to Hindi Machine Translation System
During 2007, Josan and Lehal at Punjabi University, Patiala, designed a Punjabi
to Hindi machine translation system. The system is built on the paradigm of
foreign machine translation systems such as RUSLAN and CESILKO. The system
architecture consists of three processing modules: Pre Processing, Translation
Engine and Post Processing.
4.5.12 Contribution of Private Companies in Evolving the ILT
4.5.12.1 Indian Language Search Engine Guruji
Guruji.com is the first Indian language search engine, founded by two IIT Delhi
graduates, Anurag Dod and Gaurav Mishra, with backing from Sequoia Capital.
Guruji.com uses crawling technology based on proprietary algorithms. For any
query it goes deep into Indian language content and tries to return the
appropriate output. The Guruji search engine covers a range of specific content:
news, entertainment, travel, astrology, literature, business, education and
more.
4.5.12.2 Google
The internet search giant Google also supports major Indian languages such as
Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and
Punjabi, and provides automated translation from English to Indian languages.
The Google Transliteration Input Method Editor is currently available for
languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali,
Punjabi, Tamil, Telugu and Urdu.
4.5.12.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer
applications. The tool supports major Indian languages such as Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu. It is based on a syllable-based conversion
model. WikiBhasa is Microsoft's multilingual content creation tool for
translating Wikipedia pages into other languages: the source language in
WikiBhasa is English and the target language can be any Indian local language.
4.5.12.4 Webdunia
Webdunia is an important private player that assists the development of Indian
language technology in areas such as text translation, software localization and
website localization. It is also involved in research and development of corpus
creation/collection and content syndication. Moreover, it provides language
consultancy. It has developed various applications in Indian languages, such as
My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail,
Greetings, Classifieds, Quiz, Quest, Calendar etc.
4.5.12.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development of
Indian language software. It provides Indian language enablement technology to
many state governments and the central government for e-governance programs. It
has developed software for multilingual content creation for publishing
newspapers and has also developed quality Unicode-based fonts for major Indian
languages. It has specifically developed the Shree-Lipi Gurjrati package for the
Gujarati language, which is useful in the DTP sector, corporate offices and the
e-Governance program of the Government of Gujarat.
4.5.13 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970 the Department of
Electronics and the Department of Official Language have been involved in
developing Indian language technology. Consequently ISCII (Indian Script Code
for Information Interchange) was developed for Indian languages on the pattern
of ASCII (American Standard Code for Information Interchange). In addition,
Indian languages Transliteration (ITRANS), developed by Avinash Chopde,
represents Indian language alphabets in terms of ASCII (Madhavi et al 2005). The
Department of Information Technology under the Ministry of Communication and
Information Technology is also making efforts for the proliferation of language
technology in India. Other Indian government ministries, departments and
agencies, such as the Ministry of Human Resource, DRDO (Defence Research and
Development Organisation), the Department of Atomic Energy, the All India
Council of Technical Education and UGC (University Grants Commission), are also
involved directly and indirectly in research and development of language
technology. All these agencies help develop important areas of research and
provide funds for research to development agencies. As an end result,
IndoWordNet was developed for the Indian languages on the pattern of the English
WordNet.
4.5.13.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides the standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In the
last year alone, more than 10-15 Indian local search engines must have been
launched, some smaller and some bigger, some with huge funding and some with
none. This space is so crowded right now that it is difficult to know who is
really winning. However, we attempt to give a brief overview of the current
scenario. Some of those that fall in the localized Indian search engine category
are Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol,
burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along
with Ask Laila, which launched a couple of days back. There are also localized
versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition) or other. It is too early to pass judgment on any of them.
These are testing stages, and every start-up is adding new features and making
its services better.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt
Consult (2008-2009 Report)
4.6.1.1 Most Used Websites
Websites | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
The preliminary investigation into typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when accessing information using Indian language queries. For instance, popular
web search engines such as Google, Yahoo and Guruji often return 0 search
results for Indian language queries, giving the impression that no documents
containing this information exist. In reality these search engines face a recall
problem while dealing with Indian languages, due to multiple spellings,
morphological variants of keywords and English keywords in Hindi. Table 4.15
illustrates a few such cases. For example, a Hindi query for "world trade center
aatank-waadi hamlaa" (वर्ल्ड ट्रेड सेंटर आतंकवादी हमला) is shown to return 0
documents in Table 4.15; however, a small rephrasing of the query in Table 4.16
shows that documents containing these keywords are found by the second search.
But just saying that we have a recall problem may not be sufficient. The next
obvious question is how much of a problem it is; in other words, we need to
quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language. It has a well-defined morphological
structure and a well-defined grammar, but the grammatical and structural
standard is rarely followed, for various reasons. One of the reasons is the
language diversity in India. Including Hindi, there are about 28 languages
spoken in India, and Hindi, being the official language of India, is influenced
by the regional languages, which results in changes of dialect not only in
speaking but in writing also. Every language uses markers that are attached to a
root word to construct new words (English uses s, es and ing; Hindi uses
MAATRAAS such as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein
(योजनाएं) are morphological variants of the root word Yojnaa (योजना, "planning"
in English). It is desirable to combine all the morphological variants of a word
into a single canonical form. This process is called word stemming, and the
canonical form is called the root word or base word.
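The stemming idea described above can be illustrated with a minimal sketch in
Python. It is a simple suffix-stripping stemmer, not the method used by any of
the search engines discussed in this chapter. The suffix list, the function name
`stem` and the minimum stem length are assumptions chosen for illustration, and
the Devanagari spellings are read from the transliterations Yojnaa, Yojnaaon and
Yojnaayein.

```python
# Illustrative plural/oblique endings only; a real stemmer needs a fuller list.
SUFFIXES = ("ओं", "एं", "एँ", "ों", "ें")


def stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for variant in ("योजना", "योजनाओं", "योजनाएं"):
    print(variant, "->", stem(variant))
```

All three forms reduce to the same root, so a query containing any of them could
be matched against documents containing the others.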
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages and their multiple dialects, the transliteration of
proper names, words borrowed from regional and foreign languages, and the
phonetic variety of the Indian language alphabets. Together these have resulted
in spelling variations of the same word. For example, the Hindi word अ गर ज
(angrējī, meaning English) has several possible spelling variations.
There are numerous words that are phonetically equivalent but vary in writing.
The word school can be written in Hindi in different ways (सक र सक र सक र).
When information is searched for the standard keyword school (सक र) and for a
non-standard Hindi phonetic equivalent (सक र), Google shows 69 million results
for the former and 14 million for the latter. Hindi is influenced by other
regional languages, which results in phonetic variety of words; for example, the
English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र)
by the majority of the population in different states of India. For the Hindi
word ISKOOL (इसक र) more than two thousand results are found. Search engines
should be capable of retrieving results for words that are phonetically
equivalent to the keywords entered. The user may use any keyword for searching,
and search engines should support all phonetically equivalent words.
Also, no particular standard exists for writing the keywords used to fetch Hindi
web data. For each phonetically equivalent keyword in the query the results
vary, i.e. a different set of documents is retrieved with little repetition. The
native Hindi user may not be aware of the phonetic issues in Hindi IR and may
miss relevant information of his/her use.
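One way to treat phonetically equivalent spellings as the same keyword is to map
each word to a normalized key before matching. The Python sketch below is only
an illustration under stated assumptions: the three rules (dropping the nukta,
treating chandrabindu as anusvara, and writing a half nasal consonant before
another consonant as anusvara) and the function name `phonetic_key` are chosen
for the example and cover only a small part of the variation discussed above.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
VIRAMA = "\u094d"
NASALS = "ङञणनम"
# Half nasal consonant (nasal + virama) followed by another consonant.
HALF_NASAL = re.compile(f"[{NASALS}]{VIRAMA}(?=[\u0915-\u0939])")


def phonetic_key(word):
    """Return a normalized key shared by some phonetically equivalent spellings."""
    text = unicodedata.normalize("NFD", word)  # splits nukta letters into base + nukta
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = HALF_NASAL.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


pairs = [("हिन्दी", "हिंदी"), ("महँगाई", "महंगाई"), ("आज़ादी", "आजादी")]
for a, b in pairs:
    print(a, b, phonetic_key(a) == phonetic_key(b))
```

Two spellings with the same key can then be submitted as alternative queries, or
indexed under the same entry.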
4.8.3 Words Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आब षण (ornament in English) has the following
commonly spoken synonyms: गहन, ज वय, अर क य
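Synonym-based query expansion can be sketched as follows. This is a minimal
Python illustration, not the tool developed in this study: the dictionary
`SYNONYMS`, the function name `expand_query` and the Devanagari spellings of the
ornament example are assumptions made for the sketch.

```python
from itertools import product

# Hypothetical synonym dictionary: keyword -> commonly used synonyms.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by every synonym-substituted variant."""
    options = [[tok] + synonyms.get(tok, []) for tok in query.split()]
    return [" ".join(choice) for choice in product(*options)]


for variant in expand_query("सोने के आभूषण"):
    print(variant)
```

Each variant can be sent to the search engine separately and the result lists
merged, which is how the effect of synonyms on the number and relevance of
retrieved documents can be compared.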
4.8.4 Ambiguous words
Ambiguous words reduce the relevancy of the results. The examples below show
this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (न यी क स न ऩस द ह )
In this query the word स न (gold) is ambiguous, as it has another meaning, i.e.
to sleep. In the context of the above query the word स न means gold, but the
query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आभ र गो की ऩस द)
Here the word आभ is ambiguous. In the above query it means common; however, in
Hindi it also means mango, so the query can also be interpreted as "mango is
people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in
a given context is an intricate task. One word can have more than one meaning,
and the meaning of a word depends on the context of the sentence. For example,
कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another
context कय (hand or arm) has the synonyms हसत फ ह आच शफय; and in yet another
context कय (to do) corresponds to कयन.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways; it has
affected the pronunciation of Hindi words, and many English words have been
localized in India. Some of these words appear as if they were native Hindi
words, and Indians are sometimes unable to find the equivalent native word for
an English one. For instance, words such as road, bus, pen, television, radio,
please, rail, email, password, insurance, internet, director and department are
used even by uneducated Indians without being aware of the language those words
come from. Most Indians use these words in English rather than in their native
language. English has its influence over Hindi not only in speaking but in
writing too, and this becomes more evident in Hindi literature on the web. The
influence of English on the Hindi language has been observed to be one of the
very important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from the set, which throw light on
the effect of the root word on the performance of Hindi language search engines.
Table 4.18 shows examples of the effect of morphological factors on Hindi
queries.
Table 4.17 List of Hindi queries
S No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
Table 4.18 Effect of morphological factors on Hindi queries
S No | Root words | Listing of keywords (Google Bing Guruji)
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन
3 | ब ष | ब ष ब ष ओ ब ष ब ष
4 | झ र | झ रझ रो झ र झ र
5 | आऩद | आऩद आऩद ओ आऩद आऩद
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक
9 | य ग | य ग य गोय ग य ग य ग
10 | म जन | म जन ओ म जन म जन म जन
11 | क दर | क दर क दरीमक दर क दर क दर
S No | Morphological variants | Documents returned (Google) | Documents returned (Bing) | Documents returned (Guruji)
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105
Table 4.19 Precision values of the three search engines
S No | Query | Precision@10 (Google) | Precision@10 (Bing) | Precision@10 (Guruji)
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (y-axis: precision from
0 to 1; x-axis: query numbers 1.1 to 11.1; series: P Google, P Bing, P Guruji)
It has been observed that all three search engines return more documents when
the query with the root word is submitted. This justifies searching for
documents with the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological variants of root
words, whereas Bing and Guruji list only the root word supplied, in almost all
the sample queries listed in the table above.
From the above results it is evident that only Google indexes document keywords
in their root form. Bing and Guruji do not index in that form, which is why the
number of documents they retrieve is smaller than for Google. The overall
comparison of results from the three search engines in the tables above shows
that in general the quantity of results retrieved increases when keywords are
used in their root form. For search engines the quality of results is more
important than the quantity. Figure 4.1 and Table 4.19 show the comparison of
the precision values of the three search engines. The precision value is
calculated from the top 10 results of each search engine. On closely observing
the results we can say that the precision value for Google is high in almost all
queries. Since, as mentioned above, Google does its indexing in the root form of
keywords, the relevancy of its results is also high in comparison to the other
two search engines, which indicates that not only the quantity but also the
quality of results is affected by morphological variations in the keywords.
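The precision value over the top 10 results can be computed as in the following
Python sketch. The function name `precision_at_k` and the list of relevance
judgments are assumptions for illustration; in this study the relevance of each
returned document was judged manually.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` holds one True/False judgment per returned result, in rank
    order. Ranks that are missing (fewer than k results) count as not relevant.
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned means a query that
returns only a few documents cannot score higher than one that fills the first
page with relevant results.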
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. The user may use any keyword
for searching, and search engines should support all phonetically equivalent
words.
The following are randomly selected queries from the set of 50 queries tested on
the Google search engine. The table below shows the results and precision
offered by Google.
Table 4.20 Results of the search engine on phonetic nature of Hindi language
Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Keywords in Google query | No of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language (Query No 1 to
Query No 5)
In the above table and figure it can be clearly seen that search engines return
a handful of documents for the various phonetically equivalent Hindi queries. It
is observed that no particular standard exists for writing the keywords used to
fetch Hindi web data. For each phonetically equivalent keyword in the query the
results vary, i.e. a different set of documents is retrieved with little
repetition. From the precision chart it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost the
same. The native Hindi user may not be aware of the phonetic issues in Hindi IR
and may miss relevant information of his/her use.
4.11 Discussion: Words Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आब षण (ornament in English) has the following commonly
spoken synonyms: गहन, ज वय, अर क य
Table 4.21 and Figure 4.3 are presented below; they show the comparison of
precision values for the three search engines.
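Table 4.21 Effect of word synonyms on Hindi IR
S No | Query | Standard Hindi words and synonyms | Documents returned (Google) | Precision@10 (Google) | Documents returned (Bing) | Precision@10 (Bing) | Documents returned (Guruji) | Precision@10 (Guruji)
1 | स न क आब षण | आब षण, गहन | | | | | |
1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर | | | | | |
2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर, न यी | | | | | |
3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य | | | | | |
4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व | | | | | |
5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख | | | | | |
6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1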
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a query in the Hindi language.
Not only the quantity of documents returned but also their quality is affected
by using synonyms of Hindi keywords.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the fewest;
the reason may be the availability of fewer documents or poor indexing. However,
we are more interested in the quality of results than in the quantity. As far as
quality is concerned, it can be clearly seen that Google and Bing provide
better-quality data than Guruji, and in the average case Google still stands
first, i.e. the precision values for Google are higher than those for Bing and
Guruji. Thus it becomes clear that by changing a keyword into its synonym,
equivalent results can be obtained. Therefore it is evident that synonyms of
keywords play an important role in the process of Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented
below. In figure 35, the second column contains the five queries in Hindi, the third
column holds the ambiguous keyword in one context, and the fifth column holds the same
ambiguous keyword in the other context. The fourth and sixth columns give the meaning of
the query in English with respect to the ambiguous keyword in each context.
[Figure 4.3 chart: x-axis query numbers 11 to 63, y-axis precision from 0 to 1; legend entries recovered: Bing, Guruji]
Table 4.22 List of randomly selected ambiguous queries
S.No | Query | For keyword as | In English | For keyword as | In English
1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition
4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war

The ambiguous queries listed in the table above were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the tables above, it is observed that all three search engines return
documents without differentiating between the contexts of the keyword in the query. In
each table, the last column, labelled "Other Context", holds the number of results that
are not relevant to the query supplied or that contain the keyword in some other,
unintended context. The results make it clear that all three search engines return
documents in different contexts; search engines therefore underperform when supplied with
ambiguous queries. The numbers in the "Other Context" column signify the deviation from
relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars),
the "Other Context" column contains 5 documents for Google, 8 for Bing and all 10 for
Guruji.
For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's
snake head), the retrieved documents are expected to be in the context "art". The results
show, however, that Google returns all 10 documents in the unintended context (snake
head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In
this scenario it becomes important for search engines to address the ambiguity of
keywords in order to obtain better results.
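The "deviation from relevance" discussed for the ambiguity tests can be condensed into one figure per search engine. The sketch below is illustrative only: it copies the top-10 counts from Tables 4.23 to 4.25, assumes that the first context listed for each query (gold, common, children, art, aggregate) is the intended one, and skips the Guruji query that returned no results. The names `TOP10_COUNTS` and `intended_precision` are introduced for this example and do not come from the study.

```python
# Top-10 counts per query as (intended context, second context, other context).
# Values are copied from Tables 4.23-4.25; None marks "No Results Found".
TOP10_COUNTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_precision(rows):
    """Share of inspected top-10 results that match the intended context."""
    answered = [row for row in rows if row is not None]
    inspected = sum(sum(row) for row in answered)
    intended = sum(row[0] for row in answered)
    return intended / inspected if inspected else 0.0


for engine, rows in TOP10_COUNTS.items():
    print(f"{engine}: {intended_precision(rows):.2f}")
```

Pooling the counts over all answered queries keeps the calculation simple; averaging per-query ratios instead would give the same result here because every answered query contributes exactly 10 inspected documents.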
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, and this becomes
especially evident in Hindi content on the web. The influence of English on Hindi has
been observed to be one of the most important parameters for Hindi information
retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as एकसयस इज. It has the
following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज.
Following this pattern of phonetic variation, a variety of popular keywords and queries
from various domains were tested in the experiments.
The following table presents a sample set of common and popular English keywords along
with their phonetic variants written in Hindi.
English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Documents returned by Google (one count per form, in order)
Woman | वोभन व भन व भ न व भ न | 42200 101000 3520 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 246000 1390 3710
Cancer | क सय क सय क नसय क नसय | 1880000 35700 6490 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 263200 2440 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 368000 1110 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 1040000 261000 0 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 1070000 1420 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 735000 97000 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 125000 2510 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 38000 639 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 1440 51300 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 8730 182 8
Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that, for a single-keyword query, the search
engine returns documents for every phonetic variant of the keyword, and in large
numbers. It can also be seen that people represent Hindi words in their own ways and
that no standard is followed for storing Hindi data on the web. Documents are retrieved
for every phonetic variant of an English keyword written in Hindi script. In the table,
the column with bold Hindi entries shows the keywords obtained with the Google
transliteration tool. Table 4.26 shows that transliteration does not produce the correct
Hindi word in most cases. For example, the correct transliteration of the word
"University" should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम,
which is completely wrong. The table shows that 1540 documents were nevertheless
retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other
single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन),
Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website
developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult
task.
In the next example, multiword Hindi queries are selected to test the effect of English
influence on the precision and the quantity of documents retrieved. The software (its
design and working are discussed in the next chapter) transforms each Hindi query into
variants by replacing Hindi keywords with English keywords written in Hindi script,
without changing the meaning of the query.
(Table 4.26, continued) Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 395000 69700 289000
The queries are transformed at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is replaced
by English equivalents, without changing the meaning of the original Hindi query.
Example:
An English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न)
and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied
by the user is thus transformed into two equivalent variants that mix English and Hindi,
while the meaning of the query remains the same. From the sample set of one hundred
queries, some randomly selected queries are presented below in Table 4.27.
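The two-level transformation can be sketched in a few lines. This is not the software described in the next chapter; it is a minimal illustration that assumes a small hand-built lexicon (`ENGLISH_IN_DEVANAGARI`, a hypothetical name) mapping Hindi keywords to English words written in Devanagari. The Devanagari spellings below are assumed standard forms chosen for the example, not spellings taken from the experiments.

```python
# Hypothetical lexicon: Hindi keyword -> English word written in Devanagari.
ENGLISH_IN_DEVANAGARI = {
    "विदेशी": "फॉरेन",  # foreign
    "निवेश": "इनवेस्टमेंट",  # investment
}


def transform_query(query, lexicon):
    """Return (level1, level2) variants of a Hindi query.

    Level 1 replaces only the first keyword found in the lexicon.
    Level 2 replaces every keyword found in the lexicon and is None
    when fewer than two keywords can be replaced.
    """
    words = query.split()
    hits = [i for i, word in enumerate(words) if word in lexicon]
    if not hits:
        return None, None
    level1 = list(words)
    level1[hits[0]] = lexicon[words[hits[0]]]
    level2 = None
    if len(hits) > 1:
        level2 = " ".join(lexicon.get(word, word) for word in words)
    return " ".join(level1), level2


level1, level2 = transform_query("विदेशी निवेश भारत में", ENGLISH_IN_DEVANAGARI)
print(level1)
print(level2)
```

Each variant would then be submitted to the search engine alongside the original query, so that documents written with either the Hindi keyword or its English equivalent can be retrieved.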
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi Query | Level 1 | Level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4 Transformed queries into two equivalent senses
[Chart: Google result counts for Hindi Query 1 to Hindi Query 6, each with its Level 1 and Level 2 variants]
From the above table and figure, it is evident that documents are returned for the
original as well as the transformed Hindi queries, and in considerable numbers. For
search engines, however, the quality of results is more important than the quantity;
Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision
values. Three popular search engines, namely Google, Bing and AltaVista, were used for
retrieving web results.
Table 4.28 Analysis of precision values (tabular representation)
Precision at 10 is given per search engine for the Hindi query (HQ), Level 1 (L1) and Level 2 (L2), in that order.
Hindi Query | Level 1 | Level 2 | Google (HQ, L1, L2) | Bing (HQ, L1, L2) | AltaVista (HQ, L1, L2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter-precision values
From the above tables and figures, it can be clearly seen that Hindi data of a similar
nature can be mined by transforming Hindi queries into variants that include English
keywords written in Hindi script. The transformation of queries increased the amount of
data retrieved. The relevance of the retrieved data can be seen in the precision
columns: for every Hindi query and its transformed variants, the degree of relevance of
the documents is very close, equal or improved. For example, for the Hindi query
सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed
queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10
documents are also relevant; the same pattern repeats for the rest of the Hindi queries,
as shown in the table above. Without transformation of Hindi queries, the user may miss
relevant information, because the Hindi user may not be aware that such information is
present on the web and may be unable to formulate query variants based on the influence
of English. From the above table it can be said that English-influenced Hindi
information is present on the web and is increasing day by day. By including English
keywords in Hindi script in the query, the scope of searching in Hindi and of obtaining
relevant information can be increased.
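The precision figures quoted here are of the "relevant documents among the first 10" kind. A minimal sketch of that calculation follows; the function name `precision_at_k` and the sample judgments are assumptions for illustration, and the choice to keep the denominator at k when fewer than k results are returned is one common convention rather than something the study states.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k results judged relevant.

    `relevance` is a list of booleans in ranked order (True = relevant).
    If fewer than k results were returned, the missing ones count as
    not relevant, so the denominator stays k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Comparing this value for a Hindi query and for each of its transformed variants shows whether the extra documents retrieved by a variant are as relevant as those retrieved by the original query.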
[Inter Precision Chart: precision at 10 for each Hindi query (HQ1 to HQ6) and its Level 1 (L1) and Level 2 (L2) variants; series: Google, Bing, AltaVista]
The process of Hindi IR becomes more difficult because of the structure of the Hindi
language. People generally do not follow the standard Hindi writing conventions, which
widens the gap between Hindi web data and users.
Relevant information can be mined by transforming Hindi queries. Search engines neither
transform the query nor find keyword equivalents, because they may face performance and
throughput problems if parameters such as Hindi phonetics, synonyms and
English-equivalent Hindi keywords are implemented at the root level. This problem can,
however, be solved at the interface level. Therefore, to lessen the effort a Hindi user
needs to search for such information, a software tool has been developed (described in
detail in the next chapter) that acts as an interface between the user and the search
engines. With the help of this tool, the user can widen the scope of search on the web
in the Hindi language [66][67].
system architecture consists of three processing modules: Pre Processing, Translation
Engine and Post Processing.
4.5.1.2 Contribution of Private Companies in Evolving the ILT - Indian language Search
4.5.1.2.1 Engine Guruji
Guruji.com is the first Indian language search engine, founded by two IIT Delhi
graduates, Anurag Dod and Gaurav Mishra, and backed by Sequoia Capital. Guruji.com uses
crawling technology based on proprietary algorithms. For any query, it searches deep
into Indian-language content and tries to return appropriate results. The Guruji search
engine covers a range of specific content: news, entertainment, travel, astrology,
literature, business, education and more.
4.5.1.2.2 Google
The Internet search giant Google also supports major Indian languages such as Hindi,
Bengali, Telugu, Marathi, Tamil, Gujarati, Kannada, Malayalam and Punjabi, and provides
automated translation from English to Indian languages. The Google Transliteration Input
Method Editor is currently available for languages including Bengali, Gujarati, Hindi,
Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu and Urdu.
4.5.1.2.3 Microsoft Indic Input Tool
Microsoft has developed the Indic Input Tool for the Indianisation of computer
applications. The tool supports major Indian languages such as Bengali, Hindi, Kannada,
Malayalam, Tamil and Telugu, and is based on a syllable-based conversion model.
WikiBhasa is Microsoft's multilingual content creation tool for translating Wikipedia
pages into multilingual pages; its source language is English, and its target language
can be any Indian local language.
4.5.1.2.4 Webdunia
Webdunia is an important private player that assists the development of Indian language
technology in areas such as text translation, software localization and website
localization. It is also involved in research and development on corpus
creation/collection and content syndication, and it provides language consultancy. It
has developed various applications in Indian languages, such as My Webdunia, Searching,
Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest and
Calendar.
4.5.1.2.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development of Indian
language software. It provides Indian language enablement technology to many state
governments and to the central government for e-governance programs. It has developed
software for multilingual content creation for newspaper publishing, as well as
high-quality Unicode-based fonts for major Indian languages. It has specifically
developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in
the DTP sector, corporate offices and the e-governance program of the Government of
Gujarat.
4.5.1.3 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970, the Department of Electronics
and the Department of Official Language have been involved in developing Indian language
technology. Consequently, ISCII (Indian Script Code for Information Interchange) was
developed for Indian languages on the pattern of ASCII (American Standard Code for
Information Interchange). In addition, Indian languages Transliteration (ITRANS),
developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII
(Madhavi et al., 2005). The Department of Information Technology under the Ministry of
Communication and Information Technology is also working towards the proliferation of
language technology in India. Other Indian government ministries, departments and
agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and
Development Organisation), the Department of Atomic Energy, the All India Council of
Technical Education and the UGC (University Grants Commission), are also involved,
directly and indirectly, in research and development of language technology. All these
agencies help develop important areas of research and provide research funds to
development agencies. As an end result, IndoWordNet was developed for the Indian
languages on the pattern of the English WordNet.
4.5.1.3.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian Languages)
program. TDIL decides the major and minor goals for Indian language technology and
provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010)
outlined short-term, intermediate and long-term goals for developing language technology
in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In the last
year alone, more than 10 to 15 Indian local search engines must have been launched, some
smaller and some bigger, some with huge funding and some with none. The space is now so
crowded that it is difficult to know who is really winning. Nevertheless, we attempt a
brief overview of the current scenario.
Here are some of those that fall into the localized Indian search engine category:
Guruji Raftaar Hinkhoj Hindi Search Engine Yanthram Justdial Tolmolbol burrp Dwaar
onyomo khoj nirantar bhramara gladoo lemmefindin, along with Ask Laila, which launched
a couple of days back. There are also localized versions of the big players Google,
Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique Selling
Proposition). It is too early to pass judgment on any of them. These are testing stages,
and every start-up is adding new features and improving its services.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult (2008-2009 Report)
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More Info: India online 2008, India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More Info: India online 2008, India online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More Info: India online 2008, India online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, applying
popular present-day techniques, shows a severe problem of low recall when information is
accessed using Indian language queries. For instance, popular web search engines such
as Google, Yahoo and Guruji often return 0 search results for Indian language queries,
giving the impression that no documents containing the information exist. In reality,
these search engines face a recall problem with Indian languages because of multiple
spellings, morphological variants of keywords and English keywords written in Hindi.
Table 4.15 illustrates a few such cases. For example, a Hindi query for "world trade
center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) is shown to return 0
documents in Table 4.15; however, a small rephrasing of the query in Table 4.16 shows
that these keywords exist in the second search result. Merely stating that there is a
recall problem may not be sufficient, though. The next obvious question is how much of
a problem it is; in other words, the problem needs to be quantified. For this purpose we
conducted many experiments to determine at what levels this recall problem occurs and by
how much.
Table 4.15 Problems faced while searching in Hindi: Low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0

Table 4.16 Improved Recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93

4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological structure
and grammar. The grammatical and structural standards of the language are, however,
seldom followed, for various reasons. One reason is the language diversity of India.
Including Hindi, about 28 languages are spoken in India, and Hindi, being the official
language of India, is influenced by the regional languages, which produces changes in
dialect not only in speech but also in writing. Every language attaches markers to a
root word to construct new words (English uses s, es and ing; Hindi uses MAATRAAS such
as ऐ म ा ओ). For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are
morphological variants of the root word Yojnaa (योजना, "planning" in English). It is
desirable to reduce all the morphological variants of a word to a single canonical
form. This process is called word stemming, and the canonical form is called the root
word or base word.
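A light suffix-stripping stemmer illustrates how such variants can be reduced to one root form. The sketch below is an assumption-laden example, not the stemmer used in this study: the rule list `SUFFIX_RULES` covers only a few common plural and oblique endings, and the function name `light_stem` is introduced here for illustration.

```python
# Illustrative plural/oblique endings, longest first: (suffix, replacement).
# This short list is an assumption for the example, not a complete rule set.
SUFFIX_RULES = [
    ("ियों", "ी"),  # पक्षियों -> पक्षी
    ("ाओं", "ा"),  # योजनाओं -> योजना
    ("ाएं", "ा"),  # योजनाएं -> योजना
    ("ाएँ", "ा"),  # same ending written with chandrabindu
    ("ों", ""),  # झीलों -> झील
    ("ें", ""),  # झीलें -> झील
]


def light_stem(word, min_stem_length=2):
    """Strip the first matching suffix; leave very short words untouched."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_length:
                return stem
    return word


for variant in ["योजना", "योजनाओं", "योजनाएं"]:
    print(variant, "->", light_stem(variant))
```

The minimum stem length guards against stripping short function words such as में, which would otherwise lose their final matra.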
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic nature of
Indian languages, multiple dialects, the transliteration of proper names, words borrowed
from regional and foreign languages, and the phonetic variety of the Indian language
alphabet. The variety in the alphabet, the different dialects and the influence of
regional and foreign languages have resulted in spelling variations of the same word.
For example, the Hindi word अंग्रेजी (angrējī, meaning "English") has several possible
spelling variations.
Numerous words are phonetically equivalent but vary in writing. The word "school" can be
written in Hindi in different ways (सक र, सक र, सक र). When information is searched for
the standard keyword सक र (school) and for a non-standard, phonetically equivalent
keyword सक र, Google shows 69 million results for the former and 14 million for the
latter. Hindi is also influenced by the other regional languages, which results in
phonetic variety; for example, the English word "school" (सक र in Hindi) is pronounced
and written as ISKOOL (इसक र) by a large part of the population in different states of
India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found.
Search engines should be capable of retrieving results for the phonetically equivalent
forms of the keywords entered: the user may use any form of a keyword, and search
engines should support all phonetically equivalent words.
Also, no particular standard exists for writing the keywords used to fetch Hindi web
data. For every phonetically equivalent keyword in the query, the results vary, i.e. a
different set of documents is retrieved, with little repetition. The native Hindi user
may not be aware of the phonetic issues in Hindi IR and may miss relevant information of
use to him or her.
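One way to treat such spelling variants as equivalent is to reduce each keyword to a normalized matching key before comparison. The sketch below is a simplified illustration under stated assumptions: it handles only three kinds of variation (nukta, chandrabindu versus anusvara, and a nasal consonant with halant written instead of anusvara), the name `spelling_key` is hypothetical, and the key is meant for matching only, since it can merge a few genuinely different words.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
HALANT = "\u094d"
NASALS = "ङञणनम"

# A nasal consonant + halant directly before another consonant (U+0915-U+0939).
NASAL_CLUSTER = re.compile(f"[{NASALS}]{HALANT}(?=[\u0915-\u0939])")


def spelling_key(word):
    """Return a matching key that ignores a few common spelling variations."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta.
    key = unicodedata.normalize("NFD", word)
    key = key.replace(NUKTA, "")
    key = key.replace(CHANDRABINDU, ANUSVARA)
    return NASAL_CLUSTER.sub(ANUSVARA, key)


for first, second in [("हिन्दी", "हिंदी"), ("सेन्टर", "सेंटर"), ("माँ", "मां")]:
    print(first, second, spelling_key(first) == spelling_key(second))
```

Variants that add or drop a whole syllable, such as the ISKOOL form of "school", are not covered by character-level normalization of this kind and would need a separate list of phonetic equivalents.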
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to
its basic "dictionary" meaning, and a word often has near-synonyms that differ from it
solely in these nuances of meaning. Choosing the right word can be difficult for people
as well as for an information retrieval system. For example, the Hindi word आभूषण
("ornament" in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
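Synonym-based query expansion can be sketched as follows. The example assumes a small hand-written synonym table (`SYNONYMS`, a hypothetical name) and a sample query chosen for illustration; a practical system would more likely draw synonyms from a lexical resource such as a Hindi WordNet.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms):
    """Return the original query followed by one variant per synonym."""
    variants = [query]
    words = query.split()
    for position, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = words[:position] + [alternative] + words[position + 1:]
            variants.append(" ".join(variant))
    return variants


for variant in expand_with_synonyms("सोने का आभूषण", SYNONYMS):
    print(variant)
```

Each variant replaces exactly one keyword, so the number of queries grows with the number of synonyms rather than with every combination of them.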
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of results, as the examples below show clearly.
Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, "to sleep".
In the context of the query above the word सोना means gold, but the query can also be
interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the query above it means "common"; in Hindi, however,
it also means "mango", so the query can be interpreted as "mango is people's choice".
Many words are polysemous. Finding the correct sense of a word in a given context is an
intricate task: one word has more than one meaning, and the meaning depends on the
context of the sentence. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस
in one context, कय (hand or arm) has हसत फ ह आच शफय in another context, and कय (to do)
has कयन in yet another context.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the
pronunciation of Hindi words. Many English words have been localized in India, and some
appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi
equivalent for an English word. For instance, words such as road, bus, pen, television,
radio, please, rail, email, password, insurance, internet, director and department are
used even by uneducated Indians, without awareness of the language the words come from.
Most Indians use these words in English rather than in their native language. English
influences Hindi not only in speech but also in writing, which becomes especially
evident in Hindi content on the web. The influence of English on Hindi has been observed
to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the
following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17
lists queries randomly selected from this set, which throw light on the effect of the
root word on the performance of Hindi language search engines. Table 4.18 shows examples
of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries
S.No | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
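Table 4.18 Effect of morphological factors on Hindi queries
Rows numbered with a whole number give the root word and the keywords listed by Google, Bing and Guruji (in source order); rows numbered n.m give a morphological variant and the documents returned.
S.No | Root word or morphological variant | Listing of keywords (Google, Bing, Guruji) | Documents returned: Google | Bing | Guruji
1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | |
1.1 | ब यत वष मवन | | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | | 40400 | 680 | 61
2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | |
2.1 | द घमटन | | 133000 | 2410 | 284
2.2 | द घमटन ओ | | 117000 | 420 | 23
3 | ब ष | ब ष ब ष ओ ब ष ब ष | | |
3.1 | ब ष | | 161000 | 8330 | 961
3.2 | ब ष ए | | 6200 | 935 | 188
3.3 | ब ष ओ | | 6090 | 441 | 356
4 | झ र | झ रझ रो झ र झ र | | |
4.1 | झ र | | 4740 | 278 | 25
4.2 | झ र | | 1270 | 28 | 1
5 | आऩद | आऩद आऩद ओ आऩद आऩद | | |
5.1 | आऩद | | 102000 | 4030 | 410
5.2 | आऩद ए | | 1160 | 64 | 20
6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत | | |
6.1 | ऩ परज तत | | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | | 33800 | 747 | 25
7 | सभसम | सभसम ए सभसम ओ सभ सम सभसम सभसम | | |
7.1 | सभसम | | 584000 | 30200 | 1889
7.2 | सभसम ओ | | 584000 | 7150 | 1356
8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | |
8.1 | कीटन शक | | 36300 | 1360 | 333
8.2 | कीटन शको | | 35800 | 800 | 270
9 | य ग | य गोय ग य ग य ग | | |
9.1 | य ग | | 205000 | 21600 | 1423
9.2 | य चगमो | | 128000 | 3280 | 239
9.3 | य ग | | 112000 | 6280 | 647
10 | म जन | म जन ओ म जन म जन म जन | | |
10.1 | म जन | | 673000 | 18500 | 3343
10.2 | म जन ओ | | 669000 | 6020 | 990
10.3 | म जन ए | | 673000 | 2860 | 416
11 | क दर | क दरीमक दर क दर क दर | | |
11.1 | क दर | | 261000 | 11300 | 655
11.2 | क नदरो | | 29500 | 1850 | 105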
Table 4.19 Precision values of the three search engines
S.No | Query | Precision at 10: Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1 Precision values of the three search engines
[Chart: precision (0 to 1) per query number from 1.1 to 11.1; series: P Google, P Bing, P Guruji]
All three search engines return more documents when the query is submitted with the root word. This supports searching with keywords in their root form, because root-form keywords generally give better results.
Only Google lists morphological variants of the root words, whereas Bing and Guruji list only the supplied root word for almost all the sample queries in the table above.
These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not, which is why they retrieve fewer documents than Google. Across the tables above, the quantity of results retrieved generally increases when the keywords are used in their root form. For search engines, however, the quality of results matters more than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines, calculated over the top 10 results of each engine. A close look at the results shows that Google's precision is high for almost all queries. Since Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. Morphological variation in the keywords therefore affects not only the quantity but also the quality of the results.
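The precision values discussed above are computed over the top 10 results. A minimal sketch of that calculation follows, assuming each result has already been judged relevant or not by hand; the function name and the sample judgments are hypothetical and are not taken from the experiments.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevance_flags)[:k]
    # If fewer than k results were returned, the missing ones count as non-relevant.
    return sum(1 for flag in top_k if flag) / k


# Hypothetical relevance judgments for the first 10 results of one query.
judged = [True, True, False, True, True, False, True, False, False, False]
print(precision_at_k(judged))
```

Dividing by k rather than by the number of results returned keeps the values comparable between an engine that returns thousands of documents and one that returns only a few.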
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be able to retrieve results for the phonetic equivalents of the keywords entered. A user may type any spelling of a keyword, and the search engine should support all phonetically equivalent forms.
The following queries were randomly selected from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Google query | No. of results | Precision at 10
ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)
The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing the keywords used to fetch Hindi web data. Every phonetically equivalent keyword changes the results: a different set of documents is retrieved, with very little overlap. The precision charts show that the degree of relevance is almost the same for queries containing phonetically equivalent keywords. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
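One way to cover phonetically equivalent spellings is to issue every combination of keyword variants as a separate query. The sketch below illustrates this under the assumption that the variant spellings for each keyword are already known; the function name and the sample spellings are illustrative and do not come from the experiments.

```python
from itertools import product


def expand_query(keyword_variants):
    """Return every query formed by choosing one spelling for each keyword.

    keyword_variants is a list with one entry per keyword; each entry is
    the list of spellings accepted for that keyword.
    """
    return [" ".join(choice) for choice in product(*keyword_variants)]


# Illustrative spellings: two forms of each of two keywords.
variants = [["आसमान", "आसमां"], ["महंगाई", "महँगाई"]]
for query in expand_query(variants):
    print(query)
```

The number of queries grows as the product of the variant counts, so a practical tool would probably need to cap the number of variants used per keyword.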
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and it often has near-synonyms that differ from it solely in these nuances. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word (आब षण), "ornament" in English, has the following commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
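Table 4.21 Effect of word synonyms on Hindi IR
S. No. | Query | Google documents | Google precision at 10 | Bing documents | Bing precision at 10 | Guruji documents | Guruji precision at 10
1 | स न क आब षण (standard Hindi word and synonyms as extracted: आब षणगहन)
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर (standard Hindi word: फ दर)
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण (standard Hindi word and synonyms as extracted: सतर न यी)
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य (standard Hindi word: अ हक य)
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ (standard Hindi word: व)
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न (standard Hindi word: आ ख)
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1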
Figure 4.3 Comparison of precision values against three search engines (one group per query, 1.1 to 6.3; vertical axis from 0 to 1)
The examples above show that using synonyms of Hindi keywords improves information retrieval for a Hindi query. Synonyms affect not only the quantity of documents returned but also their quality.
The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because its indexing is poor. However, we are more interested in the quality of results than in their quantity. In terms of quality, Google and Bing clearly provide better results than Guruji, and on average Google ranks first: its precision values are higher than those of Bing and Guruji. Replacing a keyword with a synonym therefore yields equivalent results, and synonyms of keywords play an important role in Hindi information retrieval.
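The "average case" comparison above can be reproduced by averaging the precision values per engine. The following is a small sketch using only the precision values of rows 1.1 to 1.5 and 2.1 to 2.3 of the synonym table; the function name is hypothetical, and the subset is chosen only to keep the example short.

```python
from statistics import mean

# Precision-at-10 values copied from rows 1.1-1.5 and 2.1-2.3 of the synonym table.
precision = {
    "Google": [0.8, 0.8, 0.8, 0.5, 0.5, 0.7, 0.9, 0.6],
    "Bing": [0.7, 0.6, 0.7, 0.4, 0.3, 0.7, 0.8, 0.6],
    "Guruji": [0.5, 0.5, 0.6, 0.0, 0.0, 0.3, 0.6, 0.2],
}


def rank_by_mean_precision(table):
    """Return (engine, mean precision) pairs, best engine first."""
    averages = {engine: mean(values) for engine, values in table.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


for engine, average in rank_by_mean_precision(precision):
    print(engine, round(average, 3))
```

On this subset the ordering is Google, then Bing, then Guruji, which agrees with the ranking stated in the text.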
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context.
Table 4.22 List of randomly selected ambiguous queries
S. No. | Query | Keyword in first context | In English | Keyword in second context | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like Gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child Development and Nutrition | फ र(Hair) | Hair Development and Nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन(Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र(family) | Destruction of families in war
The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | na | Snake head | na | na
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10
The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query or that contain the keyword in another, unwanted context. All the search engines return documents in different contexts, so they underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.
For another query, सऩ यो क पन (art of snake charmers), which has a second context (snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, Google returns all 10 documents in the unwanted context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. Search engines therefore need to address keyword ambiguity in order to return better results.
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi content on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.
Example: The English word exercise is written in Hindi as (एकसयस इज). It has the following phonetic variations in Hindi: एकसयस इज एकसयस इज एकस यस इज एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
English word | Hindi forms (Google transliteration, standard Hindi keyword, two phonetic equivalents) | Google results: transliteration | standard keyword | first equivalent | second equivalent
Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000
The table above shows that the search engine returns documents for a single-keyword query and also for every phonetic variant of the keyword, often in very large numbers. People evidently have their own ways of writing Hindi words, and no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. Even so, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It appears that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are used to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter), which replaces Hindi keywords with English keywords written in Hindi script without changing the meaning of the query.
The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced. In both cases the meaning of the original Hindi query is unchanged. Example:
The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:
प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is thus transformed into two equivalent queries that mix English and Hindi while keeping the same meaning. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
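The two-level transformation described above can be sketched as a dictionary lookup over the query tokens. This is only an illustration and not the software described in the next chapter: the mapping, its spellings and the function name are assumptions, and the choice of which keyword to replace at level 1 (here, the first replaceable one) is also assumed.

```python
# Hypothetical mapping from Hindi keywords to English words written in Devanagari.
ENGLISH_IN_HINDI = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query, mapping=ENGLISH_IN_HINDI):
    """Return (level1, level2) variants of a whitespace-tokenised query.

    Level 1 replaces only the first replaceable keyword; level 2 replaces
    every replaceable keyword. A level that does not apply is None.
    """
    tokens = query.split()
    positions = [i for i, token in enumerate(tokens) if token in mapping]
    if not positions:
        return None, None
    level1 = list(tokens)
    level1[positions[0]] = mapping[tokens[positions[0]]]
    if len(positions) < 2:
        return " ".join(level1), None
    level2 = [mapping.get(token, token) for token in tokens]
    return " ".join(level1), " ".join(level2)


level1, level2 = transform_levels("विदेशी निवेश भारत में")
print(level1)
print(level2)
```

Words that have no entry in the mapping are left untouched, so the Hindi function words of the query survive both levels.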
Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi: tabular representation (Google search results)
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Results for Hindi query | Results for Level 1 | Results for Level 2
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000
Figure 4.4 Queries transformed into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each showing the Hindi query, Level 1 and Level 2)
The table and figure above show that documents are returned for the original as well as the transformed Hindi queries, and in considerable quantity. For search engines, however, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 below present an analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used for retrieving web results.
Table 4.28 Analysis of precision values: tabular representation (precision at 10, given for the Hindi query / Level 1 / Level 2)
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 / 0.8 / 1 | 0.9 / 0.7 / 1 | 0.9 / 0.7 / 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1
Figure 4.5 Analysis of inter-precision values
The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming the queries into variants that include English keywords written in Hindi script. The transformation increases the amount of data retrieved, and the precision columns show that the retrieved data remain relevant. For every Hindi query and its transformed variants, the degree of relevance is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant. The same pattern repeats for the rest of the Hindi queries in the table above. Without query transformation, users may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the English-influenced query variants. The table also suggests that English-influenced Hindi content is present on the web and is increasing day by day. Including English keywords written in Hindi script in the query can therefore widen the scope of searching in Hindi and of finding relevant information.
[Figure 4.5 chart: inter-precision chart for HQ1 to HQ6 (HQ = Hindi Query), each with levels L1 and L2; series: Google, Bing, AltaVista]
Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level may cause performance and throughput problems. However, this problem can be solved at the interface level. To reduce the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of web search in Hindi [66][67].
4.5.1.2.4 Webdunia
Webdunia is an important private player that assists the development of Indian language technology in areas such as text translation, software localization and website localization. It is also involved in research and development on corpus creation and collection and on content syndication, and it provides language consultancy. It has developed various Indian-language applications such as My Webdunia, Searching, Language Portals, 24 Dunia, Games, Dosti, Mail, Greetings, Classifieds, Quiz, Quest, Calendar, etc.
4.5.1.2.5 Modular InfoTech
Modular InfoTech Pvt Ltd is a pioneering private company in the development of Indian language software. It provides Indian language enablement technology to many state governments and to the central government for e-governance programs. It has developed software for multilingual content creation for newspaper publishing, as well as high-quality Unicode-based fonts for major Indian languages. It has specifically developed the Shree-Lipi Gurjrati package for the Gujarati language, which is useful in the DTP sector, corporate offices and the e-Governance program of the Government of Gujarat.
4.5.1.3 Government Effort for Evolving Language Technology
The Indian government was aware of this fact. Since 1970, the Department of Electronics and the Department of Official Language have been involved in developing Indian language technology. Consequently, ISCII (Indian Script Code for Information Interchange) was developed for Indian languages on the pattern of ASCII (American Standard Code for Information Interchange). In addition, ITRANS (Indian languages Transliteration), developed by Avinash Chopde, represents Indian language alphabets in terms of ASCII (Madhavi et al 2005). The Department of Information Technology under the Ministry of Communication and Information Technology is also working towards the proliferation of language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource Development, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council for Technical Education and the UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology. All these agencies help develop important areas of research and provide research funds to development agencies. As a result, IndoWordNet was developed for the Indian languages on the pattern of the English WordNet.
4.5.1.3.1 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian Languages) program. TDIL sets the major and minor goals for Indian language technology and provides standards for language technology. The TDIL journal Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals for developing language technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating fast. In the last year alone, more than 10 to 15 Indian local search engines appear to have been launched, some small and some big, some with huge funding and some with none. The space is now so crowded that it is difficult to know who is really winning. Nevertheless, we attempt a brief overview of the current scenario.
Search engines in the localized Indian category include Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial, Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and lemmefindin, along with Ask Laila, which launched a couple of days back. There are also localized versions of the big players Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique Selling Proposition). It is too early to pass judgment on any of them: these are testing stages, and every start-up is adding new features and improving its services.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Website | 2008 Stats | 2009 Stats
Google | 37 | 35
Yahoo | 32 | 25
Rediff | 7 | 4
Orkut | 6 | 7
More Info: India online 2008, India online 2009 [65]
Table 4.12 Most Used Websites
4.6.1.2 Info Search English
Website | 2008 Stats | 2009 Stats
Google | 81 | 76
Yahoo | 7 | 7
Wikipedia | 3 | 6
English | 3 | 4
More Info: India online 2008, India online 2009 [65]
Table 4.13 Info Search English
4.6.1.3 Info Search Local Language
Website | 2008 Stats | 2009 Stats
Google | 65 | 34
Yahoo | 12 | 29
Rediff | 4 | 15
Teluguone | 2 | 0
Guruji | 1 | 18
Raftaar | 1 | 0.2
Hindi | 1 | NA
Webdunia | 1 | 0.7
Khoj | 0.7 | 2
More Info: India online 2008, India online 2009 [65]
Table 4.14 Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies using present-day popular techniques shows a severe problem of low recall when information is accessed with Indian language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa" (वलडम टर ड स नटय आत कव दी हरभ) returns 0 documents in Table 4.15, whereas a small rephrasing of the query in Table 4.16 shows that these keywords do exist in the second search. Merely stating that there is a recall problem is not sufficient, however. The obvious next question is how serious the problem is; in other words, the problem needs to be quantified. For this purpose we conducted many experiments to determine at what levels the recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall
Hindi query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological structure and grammar. However, its grammatical and structural standards are often not followed, for various reasons. One reason is the language diversity of India: including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes its dialects not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, and Hindi uses matras (MAATRAAS) such as ऐ म ा ओ. For example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) are morphological variants of the root word Yojnaa (म जन, "planning" in English). It is desirable to reduce all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
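The stemming idea can be illustrated with a very small suffix stripper. This is a sketch only, not the stemmer used in the study: the suffix list covers just the two plural endings of the Yojnaa example (in two common spellings), the function name is hypothetical, and a usable Hindi stemmer would need a far larger rule set.

```python
# Endings handled by this sketch: the plural markers seen in the Yojnaa example.
SUFFIXES = ("ओं", "एँ", "एं")


def stem(word, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


root = "योजना"
for suffix in SUFFIXES:
    variant = root + suffix
    print(variant, "->", stem(variant))
```

The minimum stem length guards against stripping a short word down to nothing, which is a common failure of naive suffix removal.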
4.8.2 Phonetic nature of Hindi Language and Spelling variations
Spelling variations can be attributed mainly to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabets. Together these factors produce several spellings of the same word. For example, the Hindi word अ गर ज (angrējī, meaning English) has several possible spelling variations.
Numerous words are phonetically equivalent but are written differently. The word school can be written in Hindi in different ways (सक र सक र सक र). When information is searched for the standard keyword school सक र and for the non-standard phonetically equivalent keyword सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words. For example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL इसक र by the majority of the population in different states of India, and more than two thousand results are found for the Hindi word ISKOOL इसक र. Search engines should therefore be able to retrieve results for the phonetic equivalents of the keywords entered: a user may type any spelling, and the search engine should support all phonetically equivalent forms.
Moreover, no particular standard exists for writing the keywords used to fetch Hindi web data. Every phonetically equivalent keyword changes the results: a different set of documents is retrieved, with very little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
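Some spelling variation can be reduced before matching by mapping variants to a common key. The sketch below handles two variant types that are common in Hindi text, nukta versus no nukta and chandrabindu versus anusvara. It is an assumption-laden illustration rather than a method from this study: the function name is hypothetical, and other variant types (vowel length, conjunct spellings, inserted vowels as in ISKOOL) are not covered.

```python
import unicodedata

NUKTA = "\u093c"         # DEVANAGARI SIGN NUKTA
CHANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU
ANUSVARA = "\u0902"      # DEVANAGARI SIGN ANUSVARA


def normalize_spelling(word):
    """Map some Devanagari spelling variants of a word to one comparison key."""
    # NFD splits precomposed nukta letters (for example U+095B) into base + nukta.
    decomposed = unicodedata.normalize("NFD", word)
    without_nukta = decomposed.replace(NUKTA, "")
    # Treat chandrabindu and anusvara as the same nasalisation mark.
    return without_nukta.replace(CHANDRABINDU, ANUSVARA)


# The two spellings in each pair produce the same key.
print(normalize_spelling("महँगाई") == normalize_spelling("महंगाई"))
print(normalize_spelling("आज़ादी") == normalize_spelling("आजादी"))
```

The key is meant for comparing or grouping spellings, not for display, since it discards distinctions that some writers consider meaningful.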
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and it often has near-synonyms that differ from it solely in these nuances. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word (आब षण), "ornament" in English, has the following commonly spoken synonyms: गहन ज वयअर क य
4.8.4 Ambiguous words
Ambiguous words reduce the relevancy of the results. The examples
below show this aspect very clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it has another meaning,
i.e. to sleep. In the context of the above query the word सोना means gold, but
the query can also be interpreted as "women like to sleep".
Another query: (In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means common; however,
in Hindi it also means mango, so the query can also be interpreted as
"mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in
a given context is an intricate task: one word can have more than one meaning,
and its meaning depends on the context of the sentence. For example, कर means
tax in one context (with synonyms बम ज श लक स द भहस ऱ ट कस), hand or arm in
another context (हसत फ ह आच शफय), and to do (करना) in yet another context.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Speakers are
sometimes unable to recall the Hindi equivalent of an English word. For
instance, words such as road, bus, pen, television, radio, please, rail, email,
password, insurance, internet, director and department are used even by people
with no formal education, without their being aware of the language these words
come from. Most Indians use these words in English rather than in their native
language. English has influenced Hindi not only in speech but in writing too,
which becomes more evident when we look at Hindi literature on the web.
The influence of English on the Hindi language has been observed to be one of
the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists queries randomly selected from this set, which throw light on
the effect of the root word on the performance of Hindi language search
engines. Table 4.18 shows examples of the effect of morphological factors on
Hindi queries.
Table 4.17: List of Hindi queries
S.No.  Query in Hindi  Meaning in English
1  ब यतवष मवन  Indian rain forest
1.1  ब यत मवष मवनो  Indian rain forests
2  हव ईद घमटन क क यण  Reason for air crash
2.1  हव ईद घमटन ओ क क यण  Reason for air crashes
3  ब यतभफ रीज न व रीब ष  Language spoken in India
3.1  ब यतभफ रीज न व रीब ष ए  Languages spoken in India
3.2  ब यतभफ रीज न व रीब ष ओ  Languages spoken in India
4  षवर पतह न ऩयझ र  Lake on the verge of extinction
4.1  षवर पतह न ऩयझ र  Lakes on the verge of extinction
5  पर क ततकआऩद  Natural calamity
5.1  पर क ततकआऩद ए  Natural calamities
6  ऩ कीपरज तत  Bird species
6.1  ऩकषमोकीपरज तत  Bird's species
6.2  ऩकषमोकीपरज ततम  Bird's species
7  क षषसभसम  Agriculture problem
7.1  क षषसभसम ओ  Agriculture problems
8  कीटन शकक इसत भ र  Use of pesticide
8.1  कीटन शकोक इसत भ र  Use of pesticides
9  भ नमसकय ग  Mental illness
9.1  भ नमसकय चगमो  Mental patients
9.2  भ नमसकय ग  Mental patient
10  गर भ णषवक सम जन  Policy for village
10.1  गर भ णषवक सम जन ओ  Policies for village
10.2  गर भ णषवक सम जन ए  Policies for village
11  परभ खक षषक दर  Major agricultural office
11.1  परभ खक षषक नदरो  Major agricultural offices
Table 4.18: Effect of morphological factors on Hindi queries
Group rows give the root words and the keyword listings shown by Google, Bing
and Guruji; numbered rows give each morphological variant and the documents
returned.
S.No.  Morphological variant  Google  Bing  Guruji
1  Root words: ब यत वष मवन; listings: ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1  ब यत वष मवन  50500  4680  485
1.2  ब यत मवष मवनो  40400  680  61
2  Root word: द घमटन; listings: द घमटन द घमटन ओ द घमटन द घमटन
2.1  द घमटन  133000  2410  284
2.2  द घमटन ओ  117000  420  23
3  Root word: ब ष; listings: ब ष ब ष ओ ब ष ब ष
3.1  ब ष  161000  8330  961
3.2  ब ष ए  6200  935  188
3.3  ब ष ओ  6090  441  356
4  Root word: झ र; listings: झ रझ रो झ र झ र
4.1  झ र  4740  278  25
4.2  झ र  1270  28  1
5  Root word: आऩद; listings: आऩद आऩद ओ आऩद आऩद
5.1  आऩद  102000  4030  410
5.2  आऩद ए  1160  64  20
6  Root words: ऩ परज तत; listings: ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1  ऩ परज तत  48200  1670  98
6.2  ऩकषमोपरज तत  47600  1150  84
6.3  ऩकषमोपरज ततम  33800  747  25
7  Root word: सभसम; listings: सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1  सभसम  584000  30200  1889
7.2  सभसम ओ  584000  7150  1356
8  Root word: कीटन शक; listings: कीटन शक कीटन शको कीटन शक कीटन शक
8.1  कीटन शक  36300  1360  333
8.2  कीटन शको  35800  800  270
9  Root word: य ग; listings: य गोय ग य ग य ग
9.1  य ग  205000  21600  1423
9.2  य चगमो  128000  3280  239
9.3  य ग  112000  6280  647
10  Root word: म जन; listings: म जन ओ म जन म जन म जन
10.1  म जन  673000  18500  3343
10.2  म जन ओ  669000  6020  990
10.3  म जन ए  673000  2860  416
11  Root word: क दर; listings: क दरीमक दर क दर क दर
11.1  क दर  261000  11300  655
11.2  क नदरो  29500  1850  105
Figure 4.1: Precision values of the three search engines (precision from 0 to 1
for each query number, for Google, Bing and Guruji)
Table 4.19: Precision values of the three search engines
S.No.  Query  Precision@10: Google  Bing  Guruji
1.1  ब यतवष मवन  0.5  0.3  0.1
1.2  ब यत मवष मवनो  0.3  0.1  0.1
2.1  हव ईद घमटन क क यण  0.5  0.3  0.1
2.2  हव ईद घमटन ओ क क यण  0.3  0.3  0.1
3.1  ब यतभफ रीज न व रीब ष  0.9  0.5  0.5
3.2  ब यतभफ रीज न व रीब ष ए  0.5  0.4  0.3
3.3  ब यतभफ रीज न व रीब ष ओ  0.5  0.2  0.2
4.1  षवर पतह न ऩयझ र  0.5  0.3  0.2
4.2  षवर पतह न ऩयझ र  0.5  0.3  0
5.1  पर क ततकआऩद  1  0.4  0.3
5.2  पर क ततकआऩद ए  0.6  0.6  0.1
6.1  ऩ कीपरज तत  1  0.7  0.2
6.2  ऩकषमोकीपरज तत  0.9  0.6  0.1
6.3  ऩकषमोकीपरज ततम  0.9  0.6  0.1
7.1  क षषसभसम  0.7  0.3  0.2
7.2  क षषसभसम ओ  0.7  0.4  0.2
8.1  कीटन शकक इसत भ र  1  0.8  0.2
8.2  कीटन शकोक इसत भ र  0.9  0.6  0.2
9.1  भ नमसकय ग  0.7  0.6  0.4
9.2  भ नमसकय चगमो  0.9  0.6  0.3
9.3  भ नमसकय ग  0.9  0.5  0.4
10.1  गर भ णषवक सम जन  0.9  0.7  0.3
10.2  गर भ णषवक सम जन ओ  1  0.6  0
10.3  गर भ णषवक सम जन ए  0.8  0.6  0.2
11.1  परभ खक षषक दर  0.5  0.4  0
11.2  परभ खक षषक नदरो  0.4  0.2  0.2
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching for
documents with the root word because, in general, we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root
form. Bing and Guruji do not appear to index in that form, which would explain
why they retrieve fewer documents than Google. The overall comparison of the
three search engines in the tables above shows that, in general, the quantity
of results retrieved increases when the keywords are used in their root form.
For search engines, however, the quality of results is more important than the
quantity. Figure 4.1 and Table 4.19 compare the precision values of the three
search engines. The precision value is calculated from the top 10 results of
each search engine. On closely observing the results we can say that the
precision of Google is high for almost all queries. Since Google appears to
index keywords in their root form, the relevancy of its results is also higher
than that of the other two search engines, which indicates that not only the
quantity but also the quality of results is affected by morphological
variations in the keywords.
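The precision value used in these tables can be written down as a short function. This is a minimal sketch, assuming that each of the top 10 results has been judged relevant or not relevant by hand and that precision is the number of relevant results divided by 10; the function name `precision_at_k` is illustrative.

```python
def precision_at_k(relevance, k=10):
    """Precision over the top k results.

    `relevance` is a list of booleans (or 0/1 values), one per result,
    in ranked order. If fewer than k results are returned, the missing
    positions count as not relevant.
    """
    top = list(relevance)[:k]
    if not top:
        return 0.0
    return sum(1 for judged in top if judged) / k


# Five relevant results among the top ten.
judgements = [True, True, False, True, False, False, True, False, True, False]
print(precision_at_k(judgements))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns only a few documents, which matters for the smaller indexes compared here.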
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. The user may use any keyword
for searching, and search engines should support all of its phonetically
equivalent forms.
The following queries were randomly selected from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision obtained.
Table 4.20: Results of the search engine on the phonetic nature of Hindi language
Group rows give the Hindi query (standard keywords were set in bold) and the
phonetic variations of its keywords; the rows below each group give the query
submitted to Google, the number of results and the precision@10.
Query 1: ससजिमो भ िहयीर ऩद थम; phonetic variations: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर  97  0.9
ससजिमो जहयीर  632  0.9
सिबजमो िहयीर  194  0.9
Query 2: आसभान छ त भहॊगाई; phonetic variations: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई  35300  1.0
आसभ न भहॉगाई  1040  1.0
आसभ भह ग ई  14  0.6
आसभाॊ भह ग ई  563  0.7
Query 3: भरषटाचाय स आिादी; phonetic variations: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी  211000  0.8
भरशटाचायआज दी  214  0.6
बयषटाचाय आज दी  447  0.7
भरषटट च यआिादी  1090000  0.9
भरशट च य आिादी  1040  0.7
बयषटट च यआिादी  1190  0.8
Query 4: अननाहिाय क आनदोरन; phonetic variations: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन  84700  0.3
अनन हज य आॊदोरन  85100  0.8
अनन हज य क आॉदोरन  78  0.6
अनना हिाय क आनद रन  399  0.5
अनन हिाय क आनद रन  3260000  1.0
Query 5: फयोिगायी सभसम सभ ध न; phonetic variations: फ य जग यी फय जग यी फ य जा यी
फयोिगायी  9650  0.9
फयोिगायी  80600  1.0
फयोिगायी  170  0.7
फयोिगायी  30  0.5
Figure 4.2: Precision charts for the phonetic nature of Hindi language (one
chart for each of Query No. 1 to Query No. 5)
In the above table and figure it can be clearly seen that search engines return
a handful of documents for the various phonetically equivalent Hindi queries.
It is observed that no particular standard exists for writing the keywords used
to fetch Hindi web data. For every phonetically equivalent keyword in the query
the results vary, i.e. a different set of documents is retrieved with very
little repetition. From the precision charts it is clearly observed that the
degree of relevance for queries containing phonetically equivalent keywords is
the same or nearly equal. The native Hindi user may not be aware of these
phonetic issues in Hindi IR and may miss information relevant to his or her
needs.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes
in addition to its basic "dictionary" meaning, and a word often has near
synonyms that differ from it solely in these nuances of meaning. Choosing the
right word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आभूषण (ornament in English) has the
following commonly spoken synonyms: गहने, ज़ेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
Table 4.21: Effect of word synonyms on Hindi IR
Group rows give the query with its standard Hindi word and synonyms; numbered
rows give the documents returned and the precision@10 for each search engine.
S.No.  Query  Google  P@10  Bing  P@10  Guruji  P@10
1  Query: स न क आब षण; standard word and synonyms: आब षणगहन
1.1  स न क आब षण  217000  0.8  3250  0.7  381  0.5
1.2  स न क गहन  188000  0.8  2590  0.6  389  0.5
1.3  स न क ज वय  78900  0.8  1670  0.7  311  0.6
1.4  स न क अर क य  9490  0.5  633  0.4  70  0
1.5  स न क आबयण  493  0.5  38  0.3  1  0
2  Query: क र फ दर; standard word: फ दर
2.1  क र फ दर  233000  0.7  7510  0.7  733  0.3
2.2  क र भ घ  40700  0.9  1500  0.8  99  0.6
2.3  क र जरधय  1570  0.6  54  0.6  2  0.2
3  Query: सतर सशिकतकयण; standard word and synonyms: सतर न यी
3.1  सतर सशिकतकयण  9950  0.9  1570  0.7  760  0.6
3.2  न यीसशिकतकयण  29300  0.9  1910  0.9  736  0.4
3.3  भहहर सशिकतकयण  96300  0.9  5160  0.8  1091  0.3
3.4  औयतसशिकतकयण  7670  0.8  680  0.7  510  0.2
4  Query: मसक दयक अ हक य; standard word: अ हक य
4.1  मसक दयक अ हक य  1990  0.4  18  0.3  60  0.6
4.2  मसक दयक अमबभ न  2400  0.6  304  0.2  16  0.1
4.3  मसक दयक घभ ड  495  0.5  54  0.6  9  0.1
5  Query: व रग ओ; standard word: व
5.1  व रग ओ  6960  1  698  0.8  29  0.5
5.2  ऩ डरग ओ  13400  1  1080  0.9  143  0.9
5.3  दयखतरग ओ  481  0.5  19  0.5  0  0
6  Query: आ खद न; standard word: आ ख
6.1  आ खद न  34000  0.7  3690  0.5  312  0.3
6.2  न तरद न  77500  1  3240  1  159  0.9
6.3  च द न  2450  0.8  427  0.1  36  0.1
Figure 4.3: Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a query in Hindi. Using
synonyms of Hindi keywords affects not only the quantity of documents returned
but also their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the fewest;
the reason may be the availability of fewer documents or poor indexing.
However, we are more interested in the quality of results than in their
quantity. As far as quality is concerned, it can be clearly seen that Google
and Bing provide better data than Guruji, and in the average case Google still
stands first, i.e. its precision values are higher than those of Bing and
Guruji. Thus it becomes clear that equivalent results can be obtained by
changing a keyword into its synonym. Therefore synonyms of keywords play an
important role in the process of Hindi information retrieval.
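Replacing a keyword by its synonyms, as done by hand for the queries in Table 4.21, can be sketched as a small query-expansion routine. This is an illustration under stated assumptions, not the software described in this work: the function name `expand_with_synonyms` and the tiny synonym dictionary are hypothetical, and a real system would need a proper Hindi thesaurus or wordnet.

```python
def expand_with_synonyms(query, synonyms):
    """Return the original query plus one variant per synonym substitution.

    `synonyms` maps a keyword to a list of alternative words. Only one
    keyword is replaced at a time, so each variant stays close to the
    original query.
    """
    tokens = query.split()
    variants = [query]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            candidate = " ".join(tokens[:i] + [alternative] + tokens[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


# Illustrative dictionary: synonyms of "ornament".
SYNONYMS = {"आभूषण": ["गहने", "ज़ेवर", "अलंकार"]}

variants = expand_with_synonyms("सोने के आभूषण", SYNONYMS)
for variant in variants:
    print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is how the different rows of the table were obtained manually.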
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected ambiguous queries. In Table 4.22 the second column contains the five
queries in Hindi, the third column holds the ambiguous keyword in one context
and the fifth column holds the same ambiguous keyword in the other context. The
fourth and sixth columns hold the meaning of the query in English with respect
to the ambiguous keyword in each context.
Table 4.22: List of randomly selected ambiguous queries
S.No.  Query  Keyword in first context  In English  Keyword in second context  In English
1  नारी को सोना पसंद है  सोना (Gold)  Women like gold  सोना (To sleep)  Women like to sleep
2  आम लोगों की पसंद  आम (Common)  Common man's choice  आम (Mango)  Mango is people's choice
3  बाल विकास और पोषण  बाल (Children)  Child development and nutrition  बाल (Hair)  Hair development and nutrition
4  सपेरों का फन  फन (Art)  Art of snake charmers  फन (Snake head)  Snake charmer's snake head
5  युद्ध में कुल विनाश  कुल (Aggregate)  Aggregate destruction in wars  कुल (Family)  Destruction of families in war
The ambiguous queries listed above were tested against three search engines:
Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23: Ambiguity test for Google
Query  Ambiguous keyword  Documents returned  First context (results found)  Second context (results found)  Other context
1  सोना  50800  Gold 5  To sleep 2  3
2  आम  488000  Common 3  Mango 3  4
3  बाल  2900000  Children 7  Hair 3  0
4  फन  184  Art 0  Snake head 10  0
5  कुल  17800  Aggregate 2  Family 3  5
Table 4.24: Ambiguity test for Bing
Query  Ambiguous keyword  Documents returned  First context (results found)  Second context (results found)  Other context
1  सोना  2680  Gold 2  To sleep 2  6
2  आम  17800  Common 3  Mango 3  4
3  बाल  4030  Children 6  Hair 2  2
4  फन  25  Art 0  Snake head 9  1
5  कुल  1900  Aggregate 0  Family 2  8
Table 4.25: Ambiguity test for Guruji
Query  Ambiguous keyword  Documents returned  First context (results found)  Second context (results found)  Other context
1  सोना  109  Gold 0  To sleep 0  10
2  आम  6756  Common 3  Mango 0  7
3  बाल  635  Children 5  Hair 2  3
4  फन  No results found  Art n/a  Snake head n/a  n/a
5  कुल  84  Aggregate 0  Family 0  10
From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. The last column, labeled "Other context", holds the
number of results which are not relevant to the query supplied, or which
contain the keyword in some other, non-required context. From the results it is
clear that all search engines return documents in different contexts, so search
engines underperform when supplied with ambiguous queries. The numbers in the
"Other context" column signify the deviation from relevance. For example, for
the query युद्ध में कुल विनाश (aggregate destruction in wars) the "Other context"
column contains 5 documents for Google, 8 documents for Bing and all 10
documents for Guruji.
For another query, सपेरों का फन (art of snake charmers), which has the
alternative reading "snake charmer's snake head", the retrieved documents are
expected to be in the context of art. From the results, however, it can be seen
that Google returns all 10 documents in the non-required context (snake head)
and Bing returns 9, whereas Guruji fails to retrieve even a single document. In
this scenario it becomes important for search engines to address the issue of
ambiguity in keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information retrieval
English has influenced Hindi not only in speech but in writing too, which
becomes more evident when we look at Hindi literature on the web. The influence
of English on the Hindi language has been observed to be one of the most
important parameters for Hindi information retrieval, as the following example
illustrates.
Example: the English word exercise is written in Hindi as एक्सरसाइज. The word
has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज,
एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of
popular keywords and queries from various domains were tested in the
experiments.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
Table 4.26: Keywords along with their phonetic variants written in Hindi
Each row gives the English word, its Google transliteration, the standard Hindi
keyword and its phonetic equivalents, followed by the Google result counts for
these four Hindi forms in the same order.
Woman  वोभन  व भन  व भ न  व भ न  42200  101000  3520  34800
Insurance  इनसयाॊस  इ शम यस  इ शम य स  इनशम य नस  1220  246000  1390  3710
Cancer  क सय  क सय  क नसय  क नसय  1880000  35700  6490  1820
Hospital  हॉसटऩटर  ह िसऩटर  ह िसऩटर  ह सऩ टर  404000  263200  2440  100
Corruption  कोररसततओन  कयपशन  कयऩशन  क यपशन  1890  368000  1110  58
Computer  कॊ तमटय  कमपम टय  कमपम टय  क पम टय  4450000  1040000  2610000  537000
University  उननवशसतम  म तनवमसमटी  म तनव मसमटी  म तनवसमटी  1540  1070000  1420  3270
Director  डियकटय  ड मय कटय  ड इय कटय  ड मय कटय  4600  735000  97000  8300
Accident  एसकसिट  एकस डट  एकस ड नट  ऐिकसडट  26300  125000  2510  5160
Parliament  ऩशरअभट  ऩ मरमम भट  ऩ मरमभट  ऩ मरमम भट  3240  38000  639  1120
Specialist  टऩशसअशरटट  सऩ मशममरसट  सऩ शमरसट  सऩ श मरसट  143  1440  51300  9820
Expert  एकसऩट  एकसऩटम  ऐकसऩटम  ऐकसऩटम  269000  8730  182  8
From the above table it can be observed that for a single-keyword query the
search engine returns documents for all phonetic variants of the keyword, and
in large numbers. It can also be seen that people have their own ways of
writing Hindi words and that no standard is followed for storing Hindi data on
the web. Documents are also retrieved for every phonetic variant of an English
keyword written in Hindi script. In the above table the column with bold Hindi
entries shows the keywords obtained with the Google transliteration tool. Table
4.26 shows that transliteration does not provide the correct Hindi word in most
cases. For example, the correct transliteration of the word University should
be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम,
which is completely wrong. It is evident from the table that 1540 documents
have been retrieved for the wrong keyword उतनव मसमतम (University), and the same
holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट,
Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be
concluded that Hindi website developers make use of unchecked and non-standard
transliteration, which makes the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on the precision and quantity of documents retrieved in
Hindi IR. The Hindi query is transformed into its variants by the software
(its design and working are discussed in the next chapter), which replaces
Hindi keywords with English keywords written in Hindi without changing the
meaning of the query.
Table 4.26 (continued): Policy  ऩोशरटम  ऩॉमरस  ऩ मरस  ऩ मरस  609  395000  69700  289000
The queries are converted at two levels. At the first level one Hindi word is
replaced by its English equivalent, and at the second level more than one word
is replaced by the English equivalents, in both cases without changing the
meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign" (फॉरेन)
and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two
levels as
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh
Bharat mein) supplied by the user is transformed into two equivalent senses
containing a mixture of English and Hindi, while the meaning of the query
remains the same. From the sample set of one hundred queries, some randomly
selected queries are presented below in Table 4.27.
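The two-level transformation described above can be sketched as follows. This is only an illustration of the replacement rule as stated in the text, not the software presented in the next chapter: the function name `transform_query` and the small dictionary of English equivalents written in Devanagari are hypothetical.

```python
def transform_query(query, english_equivalents):
    """Return the Level 1 and Level 2 variants of a Hindi query.

    Level 1 replaces the first keyword that has an English equivalent.
    Level 2 replaces every keyword that has one; it is produced only
    when more than one keyword can be replaced.
    """
    tokens = query.split()
    replaceable = [i for i, t in enumerate(tokens) if t in english_equivalents]
    if not replaceable:
        return []

    level1 = list(tokens)
    first = replaceable[0]
    level1[first] = english_equivalents[tokens[first]]
    variants = [" ".join(level1)]

    if len(replaceable) > 1:
        level2 = [english_equivalents.get(t, t) for t in tokens]
        variants.append(" ".join(level2))
    return variants


# Illustrative dictionary: Hindi keyword -> English word in Hindi script.
EQUIVALENTS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}

for variant in transform_query("विदेशी निवेश भारत में", EQUIVALENTS):
    print(variant)
```

For the example query this prints the same two variants as listed above; words without an entry in the dictionary, such as भारत and में, are left unchanged.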
Figure 4.4: Transformed queries in two equivalent senses (one chart for each of
Hindi Query 1 to Hindi Query 6, showing the original query, Level 1 and
Level 2)
Table 4.27: Queries transformed into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)
Google search results are given in the order Hindi query, Level 1, Level 2.
1  Health and blood donation
   Hindi query: सव सथ औय यकतद न
   Level 1: हलथ औय यकतद न
   Level 2: हलथ औय जरि_िोनिन
   Google search results: 1520, 8360, 1020
2  Treatment for blood pressure
   Hindi query: यकतच ऩ क इर ज
   Level 1: जरिपरिय क इर ज
   Level 2: जरिपरिय क टरीटभट
   Google search results: 95800, 37200, 2150
3  Cardiologist
   Hindi query: रदम चचककतसक
   Level 1: हाट िॉकटय
   Level 2: काडिमोरासिटट
   Google search results: 241000, 99100, 1770
4  Government employment policy
   Hindi query: सयक य दव य य जग य म जन
   Level 1: सयक य दव य य जग य टकीभ
   Level 2: सयक य दव य एमपतरॉमभट सकीभ
   Google search results: 1840000, 51600, 93
5  Foreign investment in India
   Hindi query: षवद श तनव श ब यत भ
   Level 1: पायन तनव श ब यत भ
   Level 2: पायनइनवटटभट ब यत भ
   Google search results: 658000, 527, 66
6  Corruption free India
   Hindi query: भरषटट च य भ कत ब यत
   Level 1: कयतिन भ कत ब यत
   Level 2: कयतिन फरी इॊडिमा
   Google search results: 181000, 19700, 47000
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines the quality of results
is more important than the quantity; therefore Table 4.28 and Figure 4.5 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, are used for retrieving web
results.
Table 4.28: Analysis of precision values (tabular representation)
HQ is the original Hindi query; L1 and L2 are the influenced Hindi queries at
Level 1 and Level 2. Values are precision@10.
Query  Hindi text  Google  Bing  AltaVista
HQ1  सव सथऔययकतद न  0.9  0.9  0.9
HQ1 L1  ह लथऔययकतद न  0.9  0.8  0.9
HQ1 L2  ह लथऔयबरड_ड न शन  0.9  0.8  0.8
HQ2  यकतच ऩक इर ज  0.8  0.8  0.8
HQ2 L1  बरडपर शयक इर ज  0.9  0.7  0.7
HQ2 L2  बरडपर शयक टरीटभट  0.7  0.5  0.5
HQ3  रदमचचककतसक  0.9  0.9  0.9
HQ3 L1  ह टमचचककतसक  0.8  0.7  0.7
HQ3 L2  क रड मम र िजसट  1  1  1
HQ4  सयक यदव य य जग यम जन  1  0.8  0.6
HQ4 L1  सयक यदव य य जग यसकीभ  0.9  0.8  0.7
HQ4 L2  सयक यदव य एमपरॉमभटसकीभ  0.8  0.8  0.8
HQ5  षवद श तनव शब यतभ  0.9  0.9  0.9
HQ5 L1  प य नतनव शब यतभ  0.8  0.8  0.9
HQ5 L2  प य नइनव सटभटब यतभ  0.8  0.4  0.4
HQ6  भरषटट च यभ कतब यत  1  1  1
HQ6 L1  कयपशनभ कतब यत  1  1  1
HQ6 L2  कयपशनफरीइ रडम  1  1  1
Figure 4.5: Analysis of inter-precision values
From the above tables and figures it can be clearly seen that Hindi data of a
similar nature can be mined for Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in retrieved data. The relevance of the
retrieved data can be seen in the precision columns. For every Hindi query and
its transformed variations the degree of relevance of the documents is very
close, equal or improved. For example, on Google, for the Hindi query
सव सथ औय यकतद न 9 of the first 10 documents are relevant, and for the
transformed queries of similar nature, ह लथ औय यकत द न and
ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; a similar
pattern holds for most of the other Hindi queries shown in the table above.
Without transformation of Hindi queries the user may miss the chance of
retrieving relevant information, as the Hindi user may not be aware of the
presence of such information on the web and may be unable to formulate the
query variations that reflect the influence of English. From the above table it
can be said that English-influenced Hindi information is present on the web and
is increasing day by day. By including English keywords in Hindi script in the
query, the scope of searching in Hindi and of getting relevant information can
be increased.
[Chart for Figure 4.5: inter-precision values for HQ1 to HQ6 (HQ = Hindi query,
L1 and L2 = transformation levels) on Google, Bing and AltaVista]
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the actual Hindi writing
standard, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at
root level. However, this problem can be solved at interface level. Therefore,
to lessen the effort a Hindi user needs to search for such information,
software has been developed (described in detail in the next chapter) which
acts as an interface between the user and the search engines. With the help of
this tool the user can widen the scope of search on the web in the Hindi
language [66][67].
proliferation of Language Technology in India. Other Indian government
ministries, departments and agencies, such as the Ministry of Human Resource,
DRDO (Defence Research and Development Organisation), the Department of
Atomic Energy, the All India Council for Technical Education and UGC
(University Grants Commission), are also involved, directly and indirectly, in
research and development of Language Technology. All these agencies help
develop important areas of research and provide funds for research to
development agencies. As an end result, IndoWordNet was developed for the
Indian languages on the pattern of the English WordNet.
45131 TDIL Program
The Government of India launched the TDIL (Technology Development for Indian
Languages) program. TDIL decides the major and minor goals for Indian language
technology and provides standards for language technology. The TDIL journal
Vishvabharata (Jan 2010) outlined short-term, intermediate and long-term goals
for developing Language Technology in India [64].
4.6 Search Engines available in Hindi: Hindi Online Search Tools
The market for India-centric localized search engines is saturating very fast.
In the last year alone, more than 10 to 15 Indian local search engines must
have been launched, some smaller and some bigger, some with huge funding and
some with none. This space is so crowded right now that it is difficult to know
who is really winning. However, we attempt to give a brief overview of the
current scenario.
Here are some of those that fall into the localized Indian search engine
category: Guruji, Raftaar, Hinkhoj Hindi Search Engine, Yanthram, Justdial,
Tolmolbol, burrp, Dwaar, onyomo, khoj, nirantar, bhramara, gladoo and
lemmefindin, along with Ask Laila, which launched a couple of days back. We
also have localized versions of the big giants Google, Yahoo and MSN.
Each of these Indian search engines has come forward with some USP (Unique
Selling Proposition) or other. It is too early to pass judgment on any of them.
These are testing stages, and every start-up is adding new features and making
its services better.
4.6.1 Most Used Search Tools in India for web activity: a Survey by Juxt
Consult, 2008-2009 Report
4.6.1.1 Most Used Websites
Websites  2008 Stats  2009 Stats
Google  37  35
Yahoo  32  25
Rediff  7  4
Orkut  6  7
More Info: India online 2008, India online 2009 [65]
Table 4.12: Most Used Websites
4.6.1.2 Info Search English
Website  2008 Stats  2009 Stats
Google  81  76
Yahoo  7  7
Wikipedia  3  6
English  3  4
More Info: India online 2008, India online 2009 [65]
Table 4.13: Info Search English
4.6.1.3 Info Search Local Language
Website  2008 Stats  2009 Stats
Google  65  34
Yahoo  12  29
Rediff  4  15
Teluguone  2  0
Guruji  1  18
Raftaar  1  02
Hindi  1  NA
Webdunia  1  07
Khoj  07  2
More Info: India online 2008, India online 2009 [65]
Table 4.14: Info Search Local Language
4.7 Problems faced while searching in Hindi: Low recall
A preliminary investigation of typical information access technologies, using
present-day popular techniques, shows a severe problem of low recall when
information is accessed with Indian language queries. For instance, popular web
search engines such as Google, Yahoo and Guruji often return 0 search results
for Indian language queries, giving the impression that no documents containing
this information exist. In reality these search engines face a recall problem
when dealing with Indian languages because of multiple spellings, morphological
variants of keywords and English keywords written in Hindi. Table 4.15
illustrates a few such cases. For example, the Hindi query for "world trade
center aatank-waadi hamlaa", "वर्ल्ड ट्रेड सेंटर आतंकवादी हमला", is shown to
return 0 documents in Table 4.15; however, a small rephrasing of the query in
Table 4.16 shows that documents containing these keywords do exist. But just
saying that we have a recall problem may not be sufficient. The next obvious
question is how much of a problem it is. In other words, we need to quantify
the problem. For this purpose we conducted many experiments to determine at
what levels this recall problem occurs and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0
इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0
Table 4.16 Improved recall
Hindi Query | Google | Yahoo | Guruji
वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12
वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1
इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1
बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93
4.8 Factors Affecting the Performance of Hindi Search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, the grammatical and structural standards are seldom followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which leads to dialectal variation not only in speech but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, and Hindi uses maatraas such as ऐ म ा ओ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं) in Hindi are morphological variants of the root word Yojnaa (योजना; planning in English). It is desirable to map all the morphological variants of a word to a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
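The idea of mapping morphological variants to one canonical form can be sketched as a small suffix-stripping routine. The suffix list below is a hypothetical, deliberately short set of plural endings chosen to cover the Yojnaa example; it is not the stemmer used in this study, and a practical Hindi stemmer would need a much larger list plus exception handling.

```python
# Illustrative plural endings (an assumption, not taken from the study).
# Each suffix is removed only if a reasonably long stem remains.
SUFFIXES = ["ओं", "एं", "एँ", "ों"]


def stem(word: str, min_stem_len: int = 2) -> str:
    """Strip the first matching suffix from a Devanagari word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def stem_query(query: str) -> str:
    """Stem every whitespace-separated keyword of a query."""
    return " ".join(stem(token) for token in query.split())


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएं"]:
        print(variant, "->", stem(variant))
```

With this sketch both plural variants reduce to the root योजना, so a query containing either variant could be matched against documents indexed under the root word.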
4.8.2 Phonetic Nature of Hindi Language and Spelling Variations
Spelling variations can be attributed mainly to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabet. Together these factors produce several spellings of the same word. For example, the Hindi word अ गर ज (angrējī, meaning English) has several possible spelling variations.
Numerous words are phonetically equivalent but written differently. The word school can be written in Hindi in different ways (सक र सक र सक र). When Google is searched with the standard keyword for school (सक र) and with a non-standard phonetic equivalent (सक र), it shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words; for example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be able to retrieve results for the phonetic equivalents of the keywords entered, because users may search with any of these variants.
Moreover, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query yields a different result set, i.e. a different set of documents is retrieved with little overlap. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
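One way to treat phonetically equivalent spellings as the same keyword is to reduce each word to a normalized key before matching. The sketch below is only an illustration under stated assumptions: the three rules (dropping the nukta, mapping chandrabindu to anusvara, and rewriting a nasal consonant plus virama before a stop of the same class as anusvara) are chosen for this example and cover only a small part of the variation described above. The function name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

VIRAMA = "\u094D"
ANUSVARA = "\u0902"
CHANDRABINDU = "\u0901"
NUKTA = "\u093C"

# Nasal consonant -> stops of the same articulation class (illustrative rule).
HOMORGANIC = {
    "ङ": "कखगघ",
    "ञ": "चछजझ",
    "ण": "टठडढ",
    "न": "तथदध",
    "म": "पफबभ",
}

# Matches e.g. "न्" only when it is followed by त, थ, द or ध.
_NASAL_CLUSTER = re.compile(
    "|".join(f"{nasal}{VIRAMA}(?=[{stops}])" for nasal, stops in HOMORGANIC.items())
)


def phonetic_key(word: str) -> str:
    """Return a normalized key shared by some common spelling variants."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    return _NASAL_CLUSTER.sub(ANUSVARA, text)


if __name__ == "__main__":
    variants = ["आन्दोलन", "आंदोलन", "आँदोलन"]
    print({phonetic_key(v) for v in variants})  # a single key for all three
```

Variants such as an added initial vowel (the ISKOOL case) are not handled by these rules and would need a separate mapping.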
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results, as the following examples show. Consider the query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it also means "to sleep". In the context of the query सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In this query it means "common", but in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".
Many words are polysemous. A word can have more than one meaning, its meaning depends on the context of the sentence, and finding the correct sense in a given context is an intricate task. For example, कय (tax) has the synonyms बम ज श लक स द भहस ऱ ट कस in one context; in another context कय means hand or arm, with the synonyms हसत फ ह आच शफय; and in yet another context कय means "to do" (कयन).
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by speakers with no formal education, often without awareness of which language the words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries
S. No. | Query in Hindi | Meaning in English
1 | ब यतवष मवन | Indian rain forest
1.1 | ब यत मवष मवनो | Indian rain forests
2 | हव ईद घमटन क क यण | Reason for air crash
2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes
3 | ब यतभफ रीज न व रीब ष | Language spoken in India
3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India
3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India
4 | षवर पतह न ऩयझ र | Lake on the verge of extinction
4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction
5 | पर क ततकआऩद | Natural calamity
5.1 | पर क ततकआऩद ए | Natural calamities
6 | ऩ कीपरज तत | Bird species
6.1 | ऩकषमोकीपरज तत | Bird's species
6.2 | ऩकषमोकीपरज ततम | Bird's species
7 | क षषसभसम | Agriculture problem
7.1 | क षषसभसम ओ | Agriculture problems
8 | कीटन शकक इसत भ र | Use of pesticide
8.1 | कीटन शकोक इसत भ र | Use of pesticides
9 | भ नमसकय ग | Mental illness
9.1 | भ नमसकय चगमो | Mental patients
9.2 | भ नमसकय ग | Mental patient
10 | गर भ णषवक सम जन | Policy for village
10.1 | गर भ णषवक सम जन ओ | Policies for village
10.2 | गर भ णषवक सम जन ए | Policies for village
11 | परभ खक षषक दर | Major agricultural office
11.1 | परभ खक षषक नदरो | Major agricultural offices
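Table 4.18 Effect of morphological factors on Hindi queries
Variant rows: S. No. | Morphological variant | Documents returned by Google | Bing | Guruji

1 Root words: ब यत वष मवन
Listing of keywords (Google, Bing, Guruji): ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन
1.1 | ब यत वष मवन | 50500 | 4680 | 485
1.2 | ब यत मवष मवनो | 40400 | 680 | 61

2 Root word: द घमटन
Listing of keywords (Google, Bing, Guruji): द घमटन द घमटन ओ द घमटन द घमटन
2.1 | द घमटन | 133000 | 2410 | 284
2.2 | द घमटन ओ | 117000 | 420 | 23

3 Root word: ब ष
Listing of keywords (Google, Bing, Guruji): ब ष ब ष ओ ब ष ब ष
3.1 | ब ष | 161000 | 8330 | 961
3.2 | ब ष ए | 6200 | 935 | 188
3.3 | ब ष ओ | 6090 | 441 | 356

4 Root word: झ र
Listing of keywords (Google, Bing, Guruji): झ रझ रो झ र झ र
4.1 | झ र | 4740 | 278 | 25
4.2 | झ र | 1270 | 28 | 1

5 Root word: आऩद
Listing of keywords (Google, Bing, Guruji): आऩद आऩद ओ आऩद आऩद
5.1 | आऩद | 102000 | 4030 | 410
5.2 | आऩद ए | 1160 | 64 | 20

6 Root words: ऩ परज तत
Listing of keywords (Google, Bing, Guruji): ऩ ऩकषमोपरज ततपरज ततमोपरज तत म ऩ परज तत ऩ परज तत
6.1 | ऩ परज तत | 48200 | 1670 | 98
6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84
6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25

7 Root word: सभसम
Listing of keywords (Google, Bing, Guruji): सभसम ए सभसम ओ सभ सम सभसम सभसम
7.1 | सभसम | 584000 | 30200 | 1889
7.2 | सभसम ओ | 584000 | 7150 | 1356

8 Root word: कीटन शक
Listing of keywords (Google, Bing, Guruji): कीटन शक कीटन शको कीटन शक कीटन शक
8.1 | कीटन शक | 36300 | 1360 | 333
8.2 | कीटन शको | 35800 | 800 | 270

9 Root word: य ग
Listing of keywords (Google, Bing, Guruji): य गोय ग य ग य ग
9.1 | य ग | 205000 | 21600 | 1423
9.2 | य चगमो | 128000 | 3280 | 239
9.3 | य ग | 112000 | 6280 | 647

10 Root word: म जन
Listing of keywords (Google, Bing, Guruji): म जन ओ म जन म जन म जन
10.1 | म जन | 673000 | 18500 | 3343
10.2 | म जन ओ | 669000 | 6020 | 990
10.3 | म जन ए | 673000 | 2860 | 416

11 Root word: क दर
Listing of keywords (Google, Bing, Guruji): क दरीमक दर क दर क दर
11.1 | क दर | 261000 | 11300 | 655
11.2 | क नदरो | 29500 | 1850 | 105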
Table 4.19 Precision values of the three search engines
S. No. | Query | Precision@10 Google | Bing | Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (chart of precision, 0 to 1, by query number for Google, Bing and Guruji)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This supports searching with root words, because in general better results are obtained with keywords in their root form.
It has also been observed that only Google lists the morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root form and that Bing and Guruji do not, which would explain why they retrieve fewer documents than Google. Overall, the comparison of the three search engines in the tables above shows that the quantity of results generally increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On close observation, the precision of Google is high for almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
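The precision figure used in these tables, computed over the top 10 results, can be sketched as follows. This is a minimal illustration: the function name and the choice to divide by k even when fewer than k results are returned are assumptions, and the relevance judgments themselves must come from a human assessor.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevance` holds one judgment per returned result, in rank order.
    Missing results (fewer than k returned) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


if __name__ == "__main__":
    # Hypothetical judgments for one query: 5 of the first 10 are relevant.
    judgments = [True, True, False, True, False, False, True, False, True, False]
    print(precision_at_k(judgments))
```

A value of 0.5 in the tables therefore corresponds to 5 relevant documents among the first 10 returned.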
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be able to retrieve results for the phonetic equivalents of the keywords entered, because users may search with any of these variants.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision offered by Google.
Table 4.20 Results of the search engine on the phonetic nature of Hindi language
Variant rows: Query keywords submitted to Google | No. of results | Precision@10

Hindi query 1 (standard keywords in bold in the original): ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Hindi query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Hindi query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Hindi query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Hindi query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, with one segment per query variant)
The table and figure above show that the search engine returns documents for each of the phonetically equivalent Hindi queries. No particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query yields a different result set, i.e. a different set of documents is retrieved with little overlap. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is the same or nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may therefore miss relevant information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
Table 4.21 Effect of word synonyms on Hindi IR
Variant rows: S. No. | Query | Google documents returned | Google precision@10 | Bing documents returned | Bing precision@10 | Guruji documents returned | Guruji precision@10

1 Query: स न क आब षण; standard Hindi words and synonyms: आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2 Query: क र फ दर; standard Hindi word: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3 Query: सतर सशिकतकयण; standard Hindi words and synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4 Query: मसक दयक अ हक य; standard Hindi word: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5 Query: व रग ओ; standard Hindi word: व
5.1 | व रग ओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6 Query: आ खद न; standard Hindi word: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1
Figure 4.3 Comparison of precision values against three search engines (chart of precision, 0 to 1, by query number 1.1 to 6.3)
The examples above show that using Hindi keywords together with their synonyms improves information retrieval for a Hindi query. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.
The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. However, we are interested in the quality of results rather than the quantity. In terms of quality, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus, replacing a keyword with a synonym retrieves equivalent results, and synonyms of keywords evidently play an important role in Hindi information retrieval.
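Query expansion with synonyms can be sketched as generating one query variant per synonym of each keyword. The dictionary below contains only the आभूषण example from the text, the sample query is invented for illustration, and the function name is hypothetical; the sketch substitutes words without adjusting inflection, which a real system would need to handle.

```python
from typing import Dict, List

# Synonym list taken from the example in the text; a real system would
# need a full Hindi thesaurus (assumption: keys are exact surface forms).
SYNONYMS: Dict[str, List[str]] = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query followed by one variant per synonym."""
    tokens = query.split()
    variants = [query]
    for position, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            replaced = tokens[:position] + [synonym] + tokens[position + 1:]
            variants.append(" ".join(replaced))
    return variants


if __name__ == "__main__":
    for variant in expand_query("आभूषण की दुकान"):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is the effect measured in Table 4.21.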
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented in Table 4.22. The second column contains the query in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context of the ambiguous keyword.
Table 4.22 List of randomly selected ambiguous queries
S. No. | Query | Keyword in first context | In English | Keyword in second context | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war
The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | na | Snake head | na | na
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
The results in the tables above show that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column of each table, labeled "Other context", holds the number of results that are not relevant to the query supplied, i.e. documents that contain the keyword in some other, unwanted context. The results make clear that all the search engines return documents in different contexts, so search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.
For the query सपेरों का फन (art of snake charmers), which has the alternative reading "snake charmer's snake head", the retrieved documents are expected to be in the context of art. However, Google returns all 10 documents in the unwanted context (snake head), Bing returns 9 such documents, and Guruji fails to retrieve even a single document. Search engines therefore need to address the issue of ambiguity in keywords to obtain better results.
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word exercise is written in Hindi as एकसयस इज, and it has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested.
The following table lists a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
English word | Hindi forms, in order: Google transliteration, standard Hindi keyword, phonetic equivalents | Google results for each form, in the same order
Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800
Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710
Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820
Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100
Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58
Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000
University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270
Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300
Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160
Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120
Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820
Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8
Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000
The table shows that, for single-keyword queries, the search engine returns documents not only for the standard keyword but also for every phonetic variant, often in large numbers. It also shows that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the entries shown in bold in the original are the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were nevertheless retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR in terms of precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.
The queries are transformed at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, in both cases without changing the meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent queries that mix English and Hindi while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
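The two-level transformation can be sketched as a dictionary lookup over the query keywords. This is not the software described in the next chapter: the lookup table, the Devanagari spellings of the English words, and the rule that level 1 replaces the first replaceable keyword are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical lookup: Hindi keyword -> English equivalent written in
# Devanagari. The spellings are illustrative; many variants exist in practice.
ENGLISH_IN_HINDI: Dict[str, str] = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(
    query: str, lookup: Dict[str, str] = ENGLISH_IN_HINDI
) -> Tuple[Optional[str], Optional[str]]:
    """Return (level 1, level 2) variants of a Hindi query.

    Level 1 replaces one keyword, level 2 replaces every keyword found in
    the lookup; a level is None when it cannot be formed.
    """
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in lookup]

    def rewrite(positions: Set[int]) -> str:
        return " ".join(
            lookup[token] if i in positions else token
            for i, token in enumerate(tokens)
        )

    level1 = rewrite(set(replaceable[:1])) if replaceable else None
    level2 = rewrite(set(replaceable)) if len(replaceable) > 1 else None
    return level1, level2


if __name__ == "__main__":
    print(transform_levels("विदेशी निवेश भारत में"))
```

For the example query this yields one variant with only "foreign" replaced and one with both "foreign" and "investment" replaced, mirroring the two levels described above.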
Table 4.27 Queries transformed into two equivalent senses containing a mixture of both English and Hindi: tabular representation
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google search results (Hindi query, Level 1, Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520, 8360, 1020
Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800, 37200, 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000, 99100, 1770
Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000, 51600, 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000, 527, 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000, 19700, 47000
Figure 4.4 Queries transformed into two equivalent senses (one chart per query, Hindi Query 1 to 6, comparing the number of results for the Hindi query, Level 1 and Level 2)
The table and figure above show that documents are returned for the original as well as the transformed Hindi queries and that the quantity of retrieved documents is considerable. For search engines the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28 Analysis of precision values: tabular representation
Hindi query | Influenced query, Level 1 | Influenced query, Level 2 | Precision@10 Google (Hindi query, Level 1, Level 2) | Bing (Hindi query, Level 1, Level 2) | AltaVista (Hindi query, Level 1, Level 2)
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5 Analysis of inter precision values (precision chart for Google, Bing and AltaVista; HQ = Hindi Query, L = Level)
The tables and figures above show that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries increased the amount of data retrieved. The relevance of the retrieved data can be seen in the precision columns. For every Hindi query and its transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same pattern repeats for the rest of the Hindi queries in the table above. Without transformation of Hindi queries, users may miss relevant information, because they may not be aware that such information is present on the web and may be unable to formulate the query variants that reflect English influence. The table also indicates that English-influenced Hindi information is present on the web and is increasing day by day. Including English keywords written in Hindi script in the query widens the scope of searching in Hindi and of getting relevant information.
Hindi IR is made more difficult by the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, possibly because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level would cause performance and throughput problems. However, this problem can be solved at the interface level. Therefore, to reduce the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool the user can widen the scope of search on the web in the Hindi language [66][67].
of them. These are testing stages, and every start-up is adding new features and
making its services better.
4.6.1 Most used search tools in India for web activity: a survey by Juxt Consult, 2008-2009 report
4.6.1.1 Most used websites

| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 37 | 35 |
| Yahoo | 32 | 25 |
| Rediff | 7 | 4 |
| Orkut | 6 | 7 |

More info: India Online 2008, India Online 2009 [65]
Table 4.12 Most used websites

4.6.1.2 Info search: English

| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 81 | 76 |
| Yahoo | 7 | 7 |
| Wikipedia | 3 | 6 |
| English | 3 | 4 |

More info: India Online 2008, India Online 2009 [65]
Table 4.13 Info search: English

4.6.1.3 Info search: local language

| Website | 2008 Stats | 2009 Stats |
|---|---|---|
| Google | 65 | 34 |
| Yahoo | 12 | 29 |
| Rediff | 4 | 15 |
| Teluguone | 2 | 0 |
| Guruji | 1 | 18 |
| Raftaar | 1 | 0.2 |
| Hindi | 1 | NA |
| Webdunia | 1 | 0.7 |
| Khoj | 0.7 | 2 |

More info: India Online 2008, India Online 2009 [65]
Table 4.14 Info search: local language
4.7 Problems faced while searching in Hindi: low recall
A preliminary investigation into typical information access technologies,
applying present-day popular techniques, shows a severe problem of low recall
when information is accessed with Indian-language queries. For instance, popular
web search engines such as Google, Yahoo and Guruji often return 0 search
results for Indian-language queries, giving the impression that no documents
containing this information exist. In reality, these search engines face a
recall problem when dealing with Indian languages because of multiple spellings,
morphological variants of keywords and English keywords written in Hindi.
Table 4.15 illustrates a few such cases. For example, the Hindi query for "world
trade center aatank-waadi hamlaa", "वर्ल्ड ट्रेड सेंटर आतंकवादी हमला", is shown
to return 0 documents in Table 4.15; however, a small rephrasing of the query in
Table 4.16 shows that these keywords do exist in the results of the second
search. But just saying that we have a recall problem may not be sufficient. The
next obvious question is "how much of a problem is it?" In other words, we need
to quantify the problem. For this purpose we conducted many experiments to
determine at what levels this recall problem occurs, and by how much.
Table 4.15 Problems faced while searching in Hindi: low recall

| Hindi query | Google | Yahoo | Guruji |
|---|---|---|---|
| वर्ल्ड ट्रेड सेंटर आतंकवादी हमला | 0 | 0 | 0 |
| इन्डियन इंस्टिट्यूट हेल्थ एजुकेशन ऐन्ड रिसर्च | 0 | 0 | 0 |

Table 4.16 Improved recall

| Hindi query | Google | Yahoo | Guruji |
|---|---|---|---|
| वर्ल्ड ट्रेड सेंटर आतंकी हमला | 8820 | 92 | 12 |
| वर्ल्ड ट्रेड सेंटर आतंकी अटैक | 331 | 10 | 1 |
| इंडियन इंस्टिट्यूट स्वास्थ्य शिक्षा और रिसर्च | 708 | 50 | 1 |
| भारतीय संस्थान स्वास्थ्य शिक्षा और शोध | 37100 | 7400 | 93 |

4.8 Factors affecting performance of Hindi search
4.8.1 Morphological factors
Hindi is a morphologically rich language. It has a well-defined morphological
structure and a well-defined grammar, but the grammatical and structural
standards of the language are rarely followed, for various reasons. One reason
is the linguistic diversity of India. Including Hindi, there are about 28
languages spoken in India, and Hindi, being the official language of India, is
influenced by the regional languages, which results in dialectal changes not
only in speech but also in writing. Every language attaches markers to a root
word to construct new words: English uses s, es and ing, and Hindi uses maatraas
such as एं and ओं. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएं)
are morphological variants of the Hindi root word Yojnaa (योजना, "planning" in
English). It is desirable to combine all the morphological variants of a word
into a single canonical form. This process is called word stemming, and the
canonical form is called the root word or base word.
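The stemming idea described above can be sketched as a small rule-based suffix stripper. This is only a minimal illustration: the rule list `SUFFIX_RULES` and the function name `stem` are assumptions made for this example, cover only a few plural and oblique endings, and are not the stemmer used in this study.

```python
# Illustrative suffix rules (assumed, far from exhaustive): (suffix, replacement).
# Rules are tried in order, so the longest suffix comes first.
SUFFIX_RULES = [
    ("\u093f\u092f\u094b\u0902", "\u0940"),  # -iyon -> -ii, e.g. rogiyon -> rogii
    ("\u0913\u0902", ""),                    # -on after a vowel, e.g. yojnaaon -> yojnaa
    ("\u090f\u0902", ""),                    # -en after a vowel, e.g. yojnaayein -> yojnaa
    ("\u094b\u0902", ""),                    # -on after a consonant, e.g. jhiilon -> jhiil
    ("\u0947\u0902", ""),                    # -en after a consonant, e.g. jhiilen -> jhiil
]


def stem(word: str, min_stem: int = 2) -> str:
    """Return a crude root form by stripping one known suffix, if any."""
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least min_stem code points so short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["योजना", "योजना" + "\u0913\u0902", "योजना" + "\u090f\u0902"]:
        print(w, "->", stem(w))
```

With these rules the variants of Yojnaa given above reduce to the same root, so a query and a document that use different variants can be matched on the root form. A production stemmer would need a much larger rule set and an exception list.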
4.8.2 Phonetic nature of Hindi language and spelling variations
The major reasons for spelling variations can be attributed to the phonetic
nature of Indian languages, multiple dialects, the transliteration of proper
names, words borrowed from regional and foreign languages, and the phonetic
variety of the Indian-language alphabets. Together these factors have resulted
in spelling variations of the same word. For example, the Hindi word अंग्रेजी
(angrējī, meaning English) has several possible spelling variations.
There are numerous words that are phonetically equivalent but differ in writing.
The word school can be written in Hindi in different ways (स्कूल and its
phonetic variants). When information is searched for the standard keyword स्कूल
(school) and for a non-standard phonetically equivalent keyword, Google shows 69
million results for the former and 14 million for the latter. Hindi is
influenced by other regional languages, which results in a phonetic variety of
words: for example, the English word school (स्कूल in Hindi) is pronounced and
written as ISKOOL (इस्कूल) by a large part of the population in different
states of India. For the Hindi word ISKOOL (इस्कूल), more than two thousand
results are found. Search engines should be capable of retrieving results for
words that are phonetically equivalent to the keywords entered. The user may use
any keyword for searching, and search engines should support all phonetically
equivalent words.
Also, no particular standard exists for writing the keywords used to fetch Hindi
web data. The results vary for every phonetically equivalent keyword in the
query, i.e. a different set of documents is retrieved, with very little
repetition. The native Hindi user may not be aware of the phonetic issues in
Hindi IR and may miss relevant information that would be of use to him or her.
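One way to treat phonetically equivalent spellings as the same keyword is to map each word to a normalized key before matching. The sketch below is a rough illustration under stated assumptions: the function name `phonetic_key` and the four normalization rules (drop the nukta, merge chandrabindu with anusvara, write a nasal consonant plus virama as anusvara, shorten long i and u) are choices made for this example, not the method of this study, and they will merge some words that are genuinely different.

```python
import re
import unicodedata

NUKTA = "\u093c"
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"
# A nasal consonant (nga, nya, nna, na, ma) followed by virama and another consonant.
NASAL_CLUSTER = re.compile("[\u0919\u091e\u0923\u0928\u092e]\u094d(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Map a Devanagari word to a crude key shared by common spelling variants."""
    s = unicodedata.normalize("NFD", word)    # splits precomposed nukta letters
    s = s.replace(NUKTA, "")                  # za -> ja, fa -> pha, ...
    s = s.replace(CHANDRABINDU, ANUSVARA)     # treat both nasal marks alike
    s = NASAL_CLUSTER.sub(ANUSVARA, s)        # half nasal consonant -> anusvara
    s = s.replace("\u0940", "\u093f")         # long ii sign -> short i sign
    s = s.replace("\u0942", "\u0941")         # long uu sign -> short u sign
    return s


if __name__ == "__main__":
    a = "हि" + "न\u094d" + "दी"    # spelled with a half na
    b = "हि" + "\u0902" + "दी"     # spelled with anusvara
    print(phonetic_key(a) == phonetic_key(b))
```

Grouping index terms or query variants by such a key lets one query reach documents written with any of the variants. The rules do not cover every case discussed above; a prothetic vowel as in ISKOOL, for instance, would need an additional rule.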
4.8.3 Word synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर and अलंकार.
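Synonym handling at the query side can be sketched as a simple expansion step that produces one query variant per known synonym. The dictionary `SYNONYMS` and the function `expand_query` below are hypothetical names for this illustration; the dictionary holds only the example word from the text, and a real system would draw on a full thesaurus.

```python
# Hypothetical synonym dictionary holding only the example from the text.
SYNONYMS = {
    "आभूषण": ["गहना", "जेवर", "अलंकार"],
}


def expand_query(query: str, synonyms=SYNONYMS) -> list:
    """Return the original query followed by one variant per known synonym."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for syn in synonyms.get(word, []):
            # Replace only the word at position i and keep the rest unchanged.
            variants.append(" ".join(words[:i] + [syn] + words[i + 1:]))
    return variants


if __name__ == "__main__":
    keyword = next(iter(SYNONYMS))
    for variant in expand_query("सोने के " + keyword):
        print(variant)
```

Each variant can then be submitted separately and the result lists merged, which is one way to reach documents that use a different synonym from the one the user typed.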
4.8.4 Ambiguous words
Ambiguous words reduce the relevancy of the results. The examples below show
this aspect very clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning,
i.e. to sleep. In the context of the above query the word सोना means gold, but
the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common; however, in
Hindi it also means mango, so the query can be interpreted as "mango is
people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in
a given context is an intricate task. One word can have more than one meaning,
and the meaning depends on the context of the sentence. For example, कर (tax)
has synonyms such as शुल्क, महसूल and टैक्स, among others, in one context; कर
(hand or arm) has synonyms such as हस्त and बाहु, among others, in another
context; and कर (to do) corresponds to करना in yet another context.
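A very simple way to pick a sense from context is to count how many cue words associated with each sense appear elsewhere in the query. The sketch below is only an illustration of that idea: the sense inventory `SENSES`, its cue words and the function `guess_sense` are all assumptions made for this example, and real word-sense disambiguation needs far richer resources.

```python
# Hypothetical sense inventory: for each ambiguous keyword, a few cue words per sense.
SENSES = {
    "सोना": {
        "gold": ("आभूषण", "गहना", "चांदी"),
        "to sleep": ("नींद", "रात", "बिस्तर"),
    },
}


def guess_sense(keyword, query_words, senses=SENSES):
    """Return the sense whose cue words overlap most with the rest of the query.

    Returns None when the keyword is unknown or no cue word is present,
    i.e. when the query itself gives no evidence for either sense.
    """
    others = set(query_words) - {keyword}
    scores = {
        sense: len(others & set(cues))
        for sense, cues in senses.get(keyword, {}).items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)  # on a tie, the first listed sense wins
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    kw = next(iter(SENSES))
    print(guess_sense(kw, ["नारी", "को", kw, "पसंद", "है"]))  # no cue word present
```

For the query "women like gold" discussed above, such a lookup finds no cue word and returns `None`, which reflects the point of this section: the query alone does not tell a search engine which sense is meant.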
4.8.5 Influence of English on Hindi information retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some of them appear as if they were native Hindi words. Indians are
sometimes unable to find the Hindi equivalent of an English word. For instance,
words such as road, bus, pen, television, radio, please, rail, email, password,
insurance, internet, director and department are used even by uneducated
Indians, without awareness of the language those words come from. Most Indians
use these words in English rather than in their native language. English
influences Hindi not only in speech but also in writing, and this becomes most
evident in Hindi literature on the web. The influence of English on the Hindi
language has been observed to be one of the most important parameters for Hindi
information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological factors
We took a sample set of 50 queries to test the effect of the root word.
Table 4.17 lists randomly selected queries from this set, which throw light on
the effect of the root word on the performance of Hindi-language search
engines. Table 4.18 shows examples of the effect of morphological factors on
Hindi queries.

Table 4.17 List of Hindi queries

| S. No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | भारत वर्षावन | Indian rain forest |
| 1.1 | भारतीय वर्षावनों | Indian rain forests |
| 2 | हवाई दुर्घटना के कारण | Reason for air crash |
| 2.1 | हवाई दुर्घटनाओं के कारण | Reason for air crashes |
| 3 | भारत में बोली जाने वाली भाषा | Language spoken in India |
| 3.1 | भारत में बोली जाने वाली भाषाएं | Languages spoken in India |
| 3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India |
| 4 | विलुप्त होने पर झील | Lake on the verge of extinction |
| 4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction |
| 5 | प्राकृतिक आपदा | Natural calamity |
| 5.1 | प्राकृतिक आपदाएं | Natural calamities |
| 6 | पक्षी की प्रजाति | Bird species |
| 6.1 | पक्षियों की प्रजाति | Bird's species |
| 6.2 | पक्षियों की प्रजातियां | Bird's species |
| 7 | कृषि समस्या | Agriculture problem |
| 7.1 | कृषि समस्याओं | Agriculture problems |
| 8 | कीटनाशक का इस्तेमाल | Use of pesticide |
| 8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides |
| 9 | मानसिक रोग | Mental illness |
| 9.1 | मानसिक रोगियों | Mental patients |
| 9.2 | मानसिक रोगी | Mental patient |
| 10 | ग्रामीण विकास योजना | Policy for village |
| 10.1 | ग्रामीण विकास योजनाओं | Policies for village |
| 10.2 | ग्रामीण विकास योजनाएं | Policies for village |
| 11 | प्रमुख कृषि केंद्र | Major agricultural office |
| 11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices |
Table 4.18 Effect of morphological factors on Hindi queries (documents returned for the root word and its morphological variants)

| S. No. | Root word(s) | Variant No. | Keyword | Google | Bing | Guruji |
|---|---|---|---|---|---|---|
| 1 | भारत वर्षावन | 1.1 | भारत वर्षावन | 50500 | 4680 | 485 |
| | | 1.2 | भारतीय वर्षावनों | 40400 | 680 | 61 |
| 2 | दुर्घटना | 2.1 | दुर्घटना | 133000 | 2410 | 284 |
| | | 2.2 | दुर्घटनाओं | 117000 | 420 | 23 |
| 3 | भाषा | 3.1 | भाषा | 161000 | 8330 | 961 |
| | | 3.2 | भाषाएं | 6200 | 935 | 188 |
| | | 3.3 | भाषाओं | 6090 | 441 | 356 |
| 4 | झील | 4.1 | झील | 4740 | 278 | 25 |
| | | 4.2 | झीलें | 1270 | 28 | 1 |
| 5 | आपदा | 5.1 | आपदा | 102000 | 4030 | 410 |
| | | 5.2 | आपदाएं | 1160 | 64 | 20 |
| 6 | पक्षी प्रजाति | 6.1 | पक्षी प्रजाति | 48200 | 1670 | 98 |
| | | 6.2 | पक्षियों प्रजाति | 47600 | 1150 | 84 |
| | | 6.3 | पक्षियों प्रजातियां | 33800 | 747 | 25 |
| 7 | समस्या | 7.1 | समस्या | 584000 | 30200 | 1889 |
| | | 7.2 | समस्याओं | 584000 | 7150 | 1356 |
| 8 | कीटनाशक | 8.1 | कीटनाशक | 36300 | 1360 | 333 |
| | | 8.2 | कीटनाशकों | 35800 | 800 | 270 |
| 9 | रोग | 9.1 | रोग | 205000 | 21600 | 1423 |
| | | 9.2 | रोगियों | 128000 | 3280 | 239 |
| | | 9.3 | रोगी | 112000 | 6280 | 647 |
| 10 | योजना | 10.1 | योजना | 673000 | 18500 | 3343 |
| | | 10.2 | योजनाओं | 669000 | 6020 | 990 |
| | | 10.3 | योजनाएं | 673000 | 2860 | 416 |
| 11 | केंद्र | 11.1 | केंद्र | 261000 | 11300 | 655 |
| | | 11.2 | केन्द्रों | 29500 | 1850 | 105 |
Table 4.19 Precision values of the three search engines

| S. No. | Query | Google P@10 | Bing P@10 | Guruji P@10 |
|---|---|---|---|---|
| 1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1 |
| 1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1 |
| 2.1 | हवाई दुर्घटना के कारण | 0.5 | 0.3 | 0.1 |
| 2.2 | हवाई दुर्घटनाओं के कारण | 0.3 | 0.3 | 0.1 |
| 3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5 |
| 3.2 | भारत में बोली जाने वाली भाषाएं | 0.5 | 0.4 | 0.3 |
| 3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2 |
| 4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2 |
| 4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0 |
| 5.1 | प्राकृतिक आपदा | 1 | 0.4 | 0.3 |
| 5.2 | प्राकृतिक आपदाएं | 0.6 | 0.6 | 0.1 |
| 6.1 | पक्षी की प्रजाति | 1 | 0.7 | 0.2 |
| 6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1 |
| 6.3 | पक्षियों की प्रजातियां | 0.9 | 0.6 | 0.1 |
| 7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2 |
| 7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2 |
| 8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2 |
| 9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4 |
| 9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3 |
| 9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4 |
| 10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3 |
| 10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0 |
| 10.3 | ग्रामीण विकास योजनाएं | 0.8 | 0.6 | 0.2 |
| 11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0 |
| 11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2 |

Figure 4.1 Precision values of the three search engines [chart: x-axis query numbers 1.1 to 11.1; y-axis precision from 0 to 1; series P Google, P Bing, P Guruji]
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching for
documents with the root word, because in general we get better results with
keywords in their root form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
From these results it appears that only Google indexes document keywords in
their root form. Bing and Guruji do not index in that form, which is why the
number of documents they retrieve is lower than for Google. The overall
comparison of results from the three search engines in the tables above shows
that, in general, the quantity of results retrieved increases when the keywords
are used in their root form. For search engines, however, the quality of
results is more important than the quantity. Figure 4.1 and Table 4.19 compare
the precision values of the three search engines. The precision value is
calculated over the top 10 results of each search engine. On closely observing
the results, we can say that the precision value for Google is high in almost
all queries. Since, as mentioned above, Google does its indexing in the root
form of keywords, the relevancy of its results is also higher than that of the
other two search engines. This indicates that morphological variations in the
keywords affect not only the quantity but also the quality of results.
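The precision values reported here are computed over the top 10 results, i.e. the number of relevant documents among the first 10 divided by 10. A minimal sketch of that calculation follows; the function name `precision_at_k` and the representation of relevance judgements as a list of booleans are assumptions made for this illustration.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of manual judgements in rank order
    (True/1 for relevant, False/0 for not relevant).
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


if __name__ == "__main__":
    # Nine of the first ten results judged relevant.
    judgements = [True] * 9 + [False]
    print(precision_at_k(judgements))
```

Dividing by k rather than by the number of results actually returned means that an engine returning fewer than 10 documents is penalized for the missing ones, which matches a precision of 0 for a query with no results.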
4.10 Discussion: Phonetic nature of Hindi language and spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. The user may use any keyword
for searching, and search engines should support all phonetically equivalent
words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision offered by Google.

Table 4.20 Results of the search engine on phonetic nature of Hindi language

| Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Google query with keywords | No. of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |
Figure 4.2 Precision charts for phonetic nature of Hindi language [pie charts, one each for Query No. 1 to Query No. 5]
In the above table and figure it can be clearly seen that search engines return
documents for the various phonetically equivalent Hindi queries. It is observed
that no particular standard exists for writing the keywords used to fetch Hindi
web data. The results vary for every phonetically equivalent keyword in the
query, i.e. a different set of documents is retrieved, with very little
repetition. From the precision charts it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is the same
or nearly equal. The native Hindi user may not be aware of the phonetic issues
in Hindi IR and may miss relevant information that would be of use to him or
her.
4.11 Discussion: Word synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहना, जेवर and अलंकार.
Table 4.21 and Figure 4.3, presented below, compare the precision values
obtained from the three search engines.
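Table 4.21 Effect of word synonyms on Hindi IR (documents returned and precision over the top 10 results)

| S. No. | Query | Google | Google P@10 | Bing | Bing P@10 | Guruji | Guruji P@10 |
|---|---|---|---|---|---|---|---|
| 1 | सोने के आभूषण (standard Hindi word: आभूषण) | | | | | | |
| 1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | काले बादल (standard Hindi word: बादल) | | | | | | |
| 2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | स्त्री सशक्तिकरण (standard Hindi words: स्त्री, नारी) | | | | | | |
| 3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | सिकंदर का अहंकार (standard Hindi word: अहंकार) | | | | | | |
| 4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | वृक्ष लगाओ (standard Hindi word: वृक्ष) | | | | | | |
| 5.1 | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आंख दान (standard Hindi word: आंख) | | | | | | |
| 6.1 | आंख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | नेत्र दान | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |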
Figure 4.3 Comparison of precision values against three search engines [chart: x-axis query numbers 1.1 to 6.3; y-axis precision from 0 to 1]
From the examples above it is observed that using Hindi keywords together with
their synonyms improves information retrieval for a query in the Hindi
language. Using synonyms of Hindi keywords affects not only the quantity of
documents returned but also their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the fewest;
the reason may be the availability of fewer documents or poor indexing.
However, we are interested in the quality of results rather than the quantity.
As far as quality is concerned, it can be clearly seen that Google and Bing
provide better data than Guruji, and on average Google still ranks first, i.e.
the precision values for Google are higher than those for Bing and Guruji in
this case. Thus it becomes clear that equivalent results can be obtained by
changing a keyword into its synonym. Therefore, synonyms of keywords play an
important role in the process of Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are
presented below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns give the English meaning of the query for the ambiguous
keyword in each context.
Table 4.22 List of randomly selected ambiguous queries

| S. No. | Query | Keyword as | In English | Keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | NA | Snake head | NA | NA |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. The last column, labelled "Other context", holds the
number of results that are not relevant to the query supplied, or that contain
the keywords in some other, unwanted context. From the results it is clear that
all the search engines return documents in different contexts. Therefore it can
be said that search engines underperform when supplied with ambiguous queries.
The numbers in the "Other context" column signify the deviation from relevance.
For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars),
the "Other context" column contains 5 documents for Google, 8 documents for
Bing and all 10 documents for Guruji.
For another query, सपेरों का फन (art of snake charmers; other context: snake
charmer's snake head), the retrieved documents are expected to be in the
context "art", but the results show that Google returns all 10 documents in the
unwanted context (snake head) and Bing returns 9, whereas Guruji fails to
retrieve even a single document. In this scenario it becomes important for
search engines to address the issue of ambiguity in keywords in order to obtain
better results.
4.13 Discussion: Influence of English on Hindi information retrieval
English influences Hindi not only in speech but also in writing, and this
becomes most evident in Hindi literature on the web. The influence of English
on the Hindi language has been observed to be one of the most important
parameters for Hindi information retrieval, as the following example shows.
Example: the English word exercise is written in Hindi as एक्सरसाइज, and this
word has several phonetic variations in Hindi script. Following such phonetic
variations, a variety of popular keywords and queries from various domains were
tested in the experiments.
The following table presents a sample set of common and popular English
keywords along with their phonetic variants written in Hindi. The last four
columns give the number of Google results for the keyword in each of the four
Hindi columns, in the same order.

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results (transliteration) | Results (standard) | Results (equivalent 1) | Results (equivalent 2) |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

Table 4.26 Keywords along with their phonetic variants written in Hindi
From the above table it can be observed that the search engine returns
documents for single-keyword queries, and that documents are also returned, in
large numbers, for all phonetic variants of the keywords. It can also be seen
that people have their own ways of writing Hindi words and that no standard is
followed for storing Hindi data on the web. Documents are retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above
table, the Google transliteration column (bold Hindi entries in the original)
shows the keywords obtained by using the Google transliteration tool.
Table 4.26 shows that transliteration does not provide the correct Hindi word
in most cases. For example, the correct transliteration of the word University
should be यूनिवर्सिटी, whereas Google transliteration provides the word
उनिवेर्सित्य, which is completely wrong. It is evident from the table that 1540
documents were retrieved for this wrong keyword for University, and the same
holds for other single-word queries such as Insurance, Parliament, Corruption,
Policy and Specialist. It can be concluded that Hindi website developers make
use of unchecked and non-standard transliteration, which makes the Hindi IR
process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of
English influence on Hindi IR in terms of the precision and quantity of
documents retrieved. The Hindi query is transformed into its variants by the
software (whose design and working are discussed in the next chapter) by
replacing Hindi keywords with English keywords written in Hindi script, without
changing the meaning of the query.
The queries are transformed at two levels. At the first level, one Hindi word
is replaced by its English equivalent, and at the second level more than one
word is replaced by English equivalents, without changing the meaning of the
original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign"
(फॉरेन) and निवेश means "investment" (इनवेस्टमेंट). The query for the two
levels is transformed as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इनवेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh
Bharat mein") supplied by the user is transformed into two equivalent queries
containing a mixture of English and Hindi, while the meaning of the query
remains the same. From the sample set of one hundred queries, some randomly
selected queries are presented below in Table 4.27.
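The two-level transformation described above can be sketched as a dictionary lookup over the query words. This is a minimal illustration only: the mapping `LOANWORDS` contains just the two keywords from the example, and the function `transform_levels` is a hypothetical name, not the software described in the next chapter.

```python
# Hypothetical mapping from Hindi keywords to English words written in Devanagari.
LOANWORDS = {
    "विदेशी": "फॉरेन",        # videshi -> foreign
    "निवेश": "इनवेस्टमेंट",   # nivesh  -> investment
}


def transform_levels(query: str, loanwords=LOANWORDS):
    """Return (level1, level2) variants of a Hindi query.

    Level 1 replaces only the first replaceable keyword; level 2 replaces
    every replaceable keyword and exists only if there is more than one.
    A level that cannot be built is returned as None.
    """
    words = query.split()
    positions = [i for i, w in enumerate(words) if w in loanwords]
    level1 = level2 = None
    if positions:
        first = positions[0]
        level1 = " ".join(
            words[:first] + [loanwords[words[first]]] + words[first + 1:]
        )
    if len(positions) > 1:
        level2 = " ".join(loanwords.get(w, w) for w in words)
    return level1, level2


if __name__ == "__main__":
    hindi_words = list(LOANWORDS)
    query = f"{hindi_words[0]} {hindi_words[1]} भारत में"
    print(transform_levels(query))
```

The original query and both variants can then be submitted to a search engine separately, which is how the result counts and precision values for the original query, Level 1 and Level 2 can be compared.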
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation (Google search results)

| In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Results (Hindi query) | Results (Level 1) | Results (Level 2) |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इनवेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses [charts of result counts for Hindi Query 1 to Hindi Query 6, each showing the Hindi query, Level 1 and Level 2]
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines the quality of results
is more important than the quantity; therefore Table 4.28 and Figure 4.5 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, were used to retrieve web results.

Table 4.28 Analysis of precision values: tabular representation (precision over the top 10 results)

| Query | Level | Google | Bing | AltaVista |
|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | Hindi query | 0.9 | 0.9 | 0.9 |
| हेल्थ और रक्तदान | Level 1 | 0.9 | 0.8 | 0.9 |
| हेल्थ और ब्लड_डोनेशन | Level 2 | 0.9 | 0.8 | 0.8 |
| रक्तचाप का इलाज | Hindi query | 0.8 | 0.8 | 0.8 |
| ब्लडप्रेशर का इलाज | Level 1 | 0.9 | 0.7 | 0.7 |
| ब्लडप्रेशर का ट्रीटमेंट | Level 2 | 0.7 | 0.5 | 0.5 |
| हृदय चिकित्सक | Hindi query | 0.9 | 0.9 | 0.9 |
| हार्ट चिकित्सक | Level 1 | 0.8 | 0.7 | 0.7 |
| कार्डियोलॉजिस्ट | Level 2 | 1 | 1 | 1 |
| सरकार द्वारा रोजगार योजना | Hindi query | 1 | 0.8 | 0.6 |
| सरकार द्वारा रोजगार स्कीम | Level 1 | 0.9 | 0.8 | 0.7 |
| सरकार द्वारा एम्प्लॉयमेंट स्कीम | Level 2 | 0.8 | 0.8 | 0.8 |
| विदेशी निवेश भारत में | Hindi query | 0.9 | 0.9 | 0.9 |
| फॉरेन निवेश भारत में | Level 1 | 0.8 | 0.8 | 0.9 |
| फॉरेन इनवेस्टमेंट भारत में | Level 2 | 0.8 | 0.4 | 0.4 |
| भ्रष्टाचार मुक्त भारत | Hindi query | 1 | 1 | 1 |
| करप्शन मुक्त भारत | Level 1 | 1 | 1 | 1 |
| करप्शन फ्री इंडिया | Level 2 | 1 | 1 | 1 |
Figure 4.5 Analysis of inter-precision values
From the above tables and figures it can be clearly seen that Hindi data of a
similar nature can be mined for Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in the data retrieved. The relevance of the
retrieved data can be seen in the precision columns. For every Hindi query and
its transformed variations, the degree of relevance of the documents is very
close, equal or improved. For example, for the Hindi query स्वास्थ्य और
रक्तदान, 9 of the first 10 documents are relevant, and for the transformed
queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of
the first 10 documents are relevant; the same pattern repeats for the rest of
the Hindi queries, as shown in the table above. Without transformation of Hindi
queries, the user may miss the chance of retrieving relevant information,
because the Hindi user may not be aware that such information is present on the
web and may be unable to formulate query variations that reflect the influence
of English. From the above table it can be said that English-influenced Hindi
information is present on the web and is increasing day by day. By including
English keywords written in Hindi script in the query, the scope of searching
in Hindi and of retrieving relevant information can be increased.
(Figure 4.5 chart: inter-precision values for HQ1 to HQ6, each with levels L1 and L2; HQ = Hindi query, L = level; series: Google, Bing, AltaVista.)
The process of Hindi IR is made more difficult by the structure of the Hindi language. In general, people do not follow the standard conventions of written Hindi, which widens the gap between Hindi web data and its users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are handled at the root level. However, this problem can be solved at the interface level. Therefore, to reduce the effort a Hindi user must spend on such searches, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of a web search in the Hindi language [66][67].
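The interface-level idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool described in the next chapter: `search_with_variants`, `make_variants` and `search` are hypothetical names, and `search` stands in for whatever call returns a ranked list of result URLs from a search engine.

```python
from typing import Callable, Iterable, List


def search_with_variants(
    query: str,
    make_variants: Callable[[str], Iterable[str]],
    search: Callable[[str], List[str]],
    limit: int = 10,
) -> List[str]:
    """Run the original query and its variants, merging results without duplicates.

    Results of the original query come first, followed by results that
    only the variants returned (hypothetical merging policy).
    """
    seen = set()
    merged: List[str] = []
    # The original query is always tried first; variants equal to it are skipped.
    queries = [query] + [v for v in make_variants(query) if v != query]
    for q in queries:
        for url in search(q):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]
```

Because the variants are generated and merged outside the search engine, the engine's own index is left untouched, which is the point of handling the problem at the interface level.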
Rediff 4 15
Teluguone 2 0
Guruji 1 18
Raftaar 1 02
Hindi 1 NA
Webdunia 1 07
Khoj 07 2
More info: India Online 2008, India Online 2009 [65]
Table 4.14: Info Search, Local Language
4.7 Problems faced while searching in Hindi: Low recall

A preliminary investigation of typical information access technologies, applying present-day popular techniques, shows a severe problem of low recall when information is accessed using Indian-language queries. For instance, popular web search engines such as Google, Yahoo and Guruji often return 0 search results for Indian-language queries, giving the impression that no documents containing the information exist. In reality, these search engines face a recall problem with Indian languages because of multiple spellings, morphological variants of keywords, and English keywords written in Hindi. Table 4.15 illustrates a few such cases. For example, the Hindi query for "world trade center aatank-waadi hamlaa", "वलडम टर ड स नटय आत कव दी हरभ", returns 0 documents in Table 4.15; however, a small rephrasing of the query, shown in Table 4.16, does return documents containing these keywords. Merely stating that there is a recall problem may not be sufficient. The obvious next question is how much of a problem it is; in other words, the problem needs to be quantified. For this purpose, we conducted many experiments to determine at what levels this recall problem occurs, and by how much.
Table 4.15: Problems faced while searching in Hindi: low recall (number of results)

| Hindi query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0 |
| इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0 |

Table 4.16: Improved recall (number of results)

| Hindi query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12 |
| वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1 |
| इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1 |
| बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93 |

4.8 Factors affecting performance of Hindi search

4.8.1 Morphological Factors

Hindi is a morphologically rich language with a well-defined morphological structure and grammar. However, its grammatical and structural standards are often not followed, for various reasons. One reason is the linguistic diversity of India. Including Hindi, there are about 28 languages spoken in India, and Hindi, being the national language of India, is influenced by the regional languages, which results in a change of dialect not only in speaking but also in writing. Every language attaches markers to a root word to construct new words: English uses s, es and ing, while Hindi uses MAATRAAS such as ऐ म ा ओ. For example, Yojnaaon (म जन ओ) and Yojnaayein (म जन ए) in Hindi are morphological variants of the root word Yojnaa (म जन; planning in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
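As a rough illustration of the stemming idea, the sketch below strips a few common Hindi plural markers from the end of a word. It is a minimal suffix-stripping sketch, not the stemmer used in this study: the suffix list is a small, hypothetical sample, and the function name `stem` and the `min_stem_len` guard are assumptions made for the example.

```python
# Hypothetical, non-exhaustive list of plural markers, written as code points:
#   \u0913\u0902 = "ओं", \u090f\u0902 = "एं", \u090f\u0901 = "एँ",
#   \u094b\u0902 = "ों", \u0947\u0902 = "ें"
SUFFIXES = (
    "\u0913\u0902",
    "\u090f\u0902",
    "\u090f\u0901",
    "\u094b\u0902",
    "\u0947\u0902",
)


def stem(word: str, min_stem_len: int = 2) -> str:
    """Return the word with one known suffix removed, or the word unchanged."""
    for suffix in SUFFIXES:
        # Keep at least min_stem_len characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word
```

With this list, the two variants romanized above as Yojnaaon and Yojnaayein both reduce to the form romanized as Yojnaa, so a query and a document using different variants can be matched on the same root.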
4.8.2 Phonetic nature of Hindi Language and Spelling variations

The major causes of spelling variation can be attributed to the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabets. Together, these factors have resulted in several spellings of the same word. For example, the Hindi word अ गर ज (angrējī, meaning English) has several possible spelling variations.

Numerous words are phonetically equivalent but are written differently. The word school can be written in Hindi in different ways (सक र, सक र, सक र). When information is searched for the standard keyword सक र (school) and for a non-standard, phonetically equivalent keyword सक र, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which adds to the phonetic variety of words; for example, the English word school (सक र in Hindi) is pronounced and written as ISKOOL (इसक र) by the majority of the population in different states of India. For the Hindi word ISKOOL (इसक र), more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered: a user may use any of these keywords, and the search engine should support all phonetically equivalent forms.

Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Every phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with little repetition. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
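One way to treat phonetically equivalent spellings alike is to map each word to a normalized key before comparison. The sketch below is an assumption-laden illustration, not a method taken from this study: it applies three hypothetical rules (drop the nukta dot, treat chandrabindu as anusvara, and write a half nasal consonant before another consonant as anusvara), and the name `phonetic_key` is invented for the example.

```python
import re
import unicodedata

NUKTA = "\u093c"          # dot below, as in the letter for "za"
CHANDRABINDU = "\u0901"   # nasalization sign
ANUSVARA = "\u0902"       # nasal dot
VIRAMA = "\u094d"         # halant, marks a half consonant
NASAL_CONSONANTS = "\u0919\u091e\u0923\u0928\u092e"  # nga, nya, nna, na, ma

# A nasal consonant + virama that is followed by another consonant (U+0915..U+0939).
_HALF_NASAL = re.compile("[" + NASAL_CONSONANTS + "]" + VIRAMA + "(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Map spelling variants of a Devanagari word to one comparison key."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    return _HALF_NASAL.sub(ANUSVARA, text)
```

Under these rules, three common spellings of the word romanized as andolan (with a half na, with anusvara, and with chandrabindu) receive the same key, as do the spellings of azadi with and without the nukta. The key is meant for matching only; it is not a corrected spelling.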
4.8.3 Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.
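Synonym handling of this kind can be sketched as a simple query expansion step. The example below is illustrative only: the function name `expand_with_synonyms` and the dictionary-based lookup are assumptions, and a real system would need a proper Hindi synonym resource.

```python
from typing import Dict, List


def expand_with_synonyms(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the query followed by one variant per synonym of each keyword."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            # Replace only the word at position i, keep the rest of the query.
            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


# Usage with romanized, hypothetical entries for "ornament":
# expand_with_synonyms("sone ke aabhushan", {"aabhushan": ["gehne", "zevar"]})
```

Each variant can then be submitted as a separate query, which is how the synonym experiments reported later in this chapter are organized (one row per synonym).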
4.8.4 Ambiguous words

Ambiguous words reduce the relevance of the results, as the examples below show. Consider the following query:

(In English) (Women like gold)
(In Hindi) (न यी क स न ऩस द ह )

In this query the word स न (gold) is ambiguous, as it has another meaning, "to sleep". In the context of the above query स न means gold, but the query can also be interpreted as "women like to sleep".

Another query:

(In English) (The common people's choice)
(In Hindi) (आभ र गो की ऩस द)

Here the word आभ is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can also be interpreted as "mango is people's choice".

Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task. A word can have more than one meaning, and its meaning depends on the context of the sentence. For example, कय means tax in one context (synonyms: बम ज श लक स द भहस ऱ ट कस), hand or arm in another (synonyms: हसत फ ह आच शफय), and "to do" (कयन) in yet another.
4.8.5 Influence of English on Hindi Information retrieval

The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without their being aware of the language these words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors

We took a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi-language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

| S. No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | ब यतवष मवन | Indian rain forest |
| 1.1 | ब यत मवष मवनो | Indian rain forests |
| 2 | हव ईद घमटन क क यण | Reason for air crash |
| 2.1 | हव ईद घमटन ओ क क यण | Reason for air crashes |
| 3 | ब यतभफ रीज न व रीब ष | Language spoken in India |
| 3.1 | ब यतभफ रीज न व रीब ष ए | Languages spoken in India |
| 3.2 | ब यतभफ रीज न व रीब ष ओ | Languages spoken in India |
| 4 | षवर पतह न ऩयझ र | Lake on the verge of extinction |
| 4.1 | षवर पतह न ऩयझ र | Lakes on the verge of extinction |
| 5 | पर क ततकआऩद | Natural calamity |
| 5.1 | पर क ततकआऩद ए | Natural calamities |
| 6 | ऩ कीपरज तत | Bird species |
| 6.1 | ऩकषमोकीपरज तत | Bird's species |
| 6.2 | ऩकषमोकीपरज ततम | Bird's species |
| 7 | क षषसभसम | Agriculture problem |
| 7.1 | क षषसभसम ओ | Agriculture problems |
| 8 | कीटन शकक इसत भ र | Use of pesticide |
| 8.1 | कीटन शकोक इसत भ र | Use of pesticides |
| 9 | भ नमसकय ग | Mental illness |
| 9.1 | भ नमसकय चगमो | Mental patients |
| 9.2 | भ नमसकय ग | Mental patient |
| 10 | गर भ णषवक सम जन | Policy for village |
| 10.1 | गर भ णषवक सम जन ओ | Policies for village |
| 10.2 | गर भ णषवक सम जन ए | Policies for village |
| 11 | परभ खक षषक दर | Major agricultural office |
| 11.1 | परभ खक षषक नदरो | Major agricultural offices |
Table 4.18: Effect of morphological factors on Hindi queries

Listing of keywords shown by each search engine:

| S. No. | Root word(s) | Listing of keywords (Google / Bing / Guruji) |
|---|---|---|
| 1 | ब यत, वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन |
| 2 | द घमटन | द घमटन द घमटन ओ / द घमटन / द घमटन |
| 3 | ब ष | ब ष ब ष ओ / ब ष / ब ष |
| 4 | झ र | झ रझ रो / झ र / झ र |
| 5 | आऩद | आऩद आऩद ओ / आऩद / आऩद |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज ततम / ऩ परज तत / ऩ परज तत |
| 7 | सभसम | सभसम ए सभसम ओ सभसम / सभसम / सभसम |
| 8 | कीटन शक | कीटन शक कीटन शको / कीटन शक / कीटन शक |
| 9 | य ग | य गोय ग / य ग / य ग |
| 10 | म जन | म जन ओ म जन / म जन / म जन |
| 11 | क दर | क दरीमक दर / क दर / क दर |

Documents returned for each morphological variant:

| S. No. | Morphological variant | Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | ब यत वष मवन | 50500 | 4680 | 485 |
| 1.2 | ब यत मवष मवनो | 40400 | 680 | 61 |
| 2.1 | द घमटन | 133000 | 2410 | 284 |
| 2.2 | द घमटन ओ | 117000 | 420 | 23 |
| 3.1 | ब ष | 161000 | 8330 | 961 |
| 3.2 | ब ष ए | 6200 | 935 | 188 |
| 3.3 | ब ष ओ | 6090 | 441 | 356 |
| 4.1 | झ र | 4740 | 278 | 25 |
| 4.2 | झ र | 1270 | 28 | 1 |
| 5.1 | आऩद | 102000 | 4030 | 410 |
| 5.2 | आऩद ए | 1160 | 64 | 20 |
| 6.1 | ऩ परज तत | 48200 | 1670 | 98 |
| 6.2 | ऩकषमोपरज तत | 47600 | 1150 | 84 |
| 6.3 | ऩकषमोपरज ततम | 33800 | 747 | 25 |
| 7.1 | सभसम | 584000 | 30200 | 1889 |
| 7.2 | सभसम ओ | 584000 | 7150 | 1356 |
| 8.1 | कीटन शक | 36300 | 1360 | 333 |
| 8.2 | कीटन शको | 35800 | 800 | 270 |
| 9.1 | य ग | 205000 | 21600 | 1423 |
| 9.2 | य चगमो | 128000 | 3280 | 239 |
| 9.3 | य ग | 112000 | 6280 | 647 |
| 10.1 | म जन | 673000 | 18500 | 3343 |
| 10.2 | म जन ओ | 669000 | 6020 | 990 |
| 10.3 | म जन ए | 673000 | 2860 | 416 |
| 11.1 | क दर | 261000 | 11300 | 655 |
| 11.2 | क नदरो | 29500 | 1850 | 105 |
Table 4.19: Precision values of the three search engines (precision over the top 10 results)

| S. No. | Query | Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.0 |
| 5.1 | पर क ततकआऩद | 1.0 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1.0 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1.0 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1.0 | 0.6 | 0.0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0.0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1: Precision values of the three search engines (x-axis: query number; y-axis: precision from 0 to 1; series: Google, Bing, Guruji)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This supports searching for documents using the root word, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists the morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

From the above results, it appears that only Google indexes document keywords in their root form; Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the number of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, we can say that the precision for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also high in comparison with the other two search engines, which indicates that morphological variation in keywords affects not only the quantity but also the quality of results.
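The precision measure used throughout this chapter, the fraction of the top 10 results that are relevant, can be written down as a short function. This is a minimal sketch: the name `precision_at_k` is an assumption, and the relevance judgments are supplied by hand, as one boolean per ranked result.

```python
from typing import Sequence


def precision_at_k(relevant_flags: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    relevant_flags[i] is True when the result at rank i + 1 is relevant.
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the denominator stays k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevant_flags)[:k]
    return sum(1 for flag in top if flag) / k


# Example: 9 relevant documents among the first 10 results.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))  # 0.9
```

Keeping the denominator fixed at k means that an engine returning only one or two documents for a query cannot score a high precision simply because those few documents are relevant.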
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations

Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any of these keywords, and the search engine should support all phonetically equivalent forms.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained.
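Table 4.20: Results of the search engine on the phonetic nature of Hindi language (Google; precision over the top 10 results)

| No. | Hindi query (standard keywords) | Phonetic variations of the keywords | Keywords used in the query | No. of results | Precision |
|---|---|---|---|---|---|
| 1 | ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| 1.1 | | | ससजिमो जहयीर | 632 | 0.9 |
| 1.2 | | | सिबजमो िहयीर | 194 | 0.9 |
| 2 | आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| 2.1 | | | आसभ न भहॉगाई | 1040 | 1.0 |
| 2.2 | | | आसभ भह ग ई | 14 | 0.6 |
| 2.3 | | | आसभाॊ भह ग ई | 563 | 0.7 |
| 3 | भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| 3.1 | | | भरशटाचायआज दी | 214 | 0.6 |
| 3.2 | | | बयषटाचाय आज दी | 447 | 0.7 |
| 3.3 | | | भरषटट च यआिादी | 1090000 | 0.9 |
| 3.4 | | | भरशट च य आिादी | 1040 | 0.7 |
| 3.5 | | | बयषटट च यआिादी | 1190 | 0.8 |
| 4 | अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| 4.1 | | | अनन हज य आॊदोरन | 85100 | 0.8 |
| 4.2 | | | अनन हज य क आॉदोरन | 78 | 0.6 |
| 4.3 | | | अनना हिाय क आनद रन | 399 | 0.5 |
| 4.4 | | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| 5 | फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| 5.1 | | | फयोिगायी | 80600 | 1.0 |
| 5.2 | | | फयोिगायी | 170 | 0.7 |
| 5.3 | | | फयोिगायी | 30 | 0.5 |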
Figure 4.2: Precision charts for the phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5)

From the above table and figure, it can be clearly seen that search engines return documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. Every phonetically equivalent keyword in a query produces a variation in the results, i.e. a different set of documents is retrieved with little repetition. From the precision charts, it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. A native Hindi user may not be aware of these phonetic issues in Hindi IR and may miss relevant information.
4.11 Discussion: Word Synonyms

A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण (ornament in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.

Table 4.21 and Figure 4.3 compare the documents returned and the precision values for the three search engines.
Table 4.21: Effect of word synonyms on Hindi IR (documents returned and precision over the top 10 results)

| S. No. | Query | Standard Hindi word | Google: documents | Google: precision | Bing: documents | Bing: precision | Guruji: documents | Guruji: precision |
|---|---|---|---|---|---|---|---|---|
| 1.1 | स न क आब षण | आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0.0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0.0 |
| 2.1 | क र फ दर | फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3.1 | सतर सशिकतकयण | सतर | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4.1 | मसक दयक अ हक य | अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5.1 | व रग ओ | व | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0.0 |
| 6.1 | आ खद न | आ ख | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |
Figure 4.3: Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the above table and figure, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji. In the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Therefore, synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity

From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.
Table 4.22: List of randomly selected ambiguous queries

| S. No. | Query | For keyword as | In English | For keyword as | In English |
|---|---|---|---|---|---|
| 1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep |
| 2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice |
| 3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition |
| 4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head |
| 5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war |

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23: Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आभ | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | पन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आभ | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | फ र | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | पन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8 |
Table 4.25: Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | स न | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आभ | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | फ र | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | पन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | क र | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, or that contain the keywords in some other, unintended context. The results make it clear that all the search engines return documents in different contexts; it can therefore be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सऩ यो क पन (intended context: art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context of art. The results show, however, that Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information retrieval

English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word exercise is written in Hindi as एकसयस इज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26: Keywords along with their phonetic variants written in Hindi (search engine: Google)

| English word | Hindi forms (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results, in the order of the listed forms |
|---|---|---|
| Woman | वोभन व भन व भ न व भ न | 42200, 101000, 3520, 34800 |
| Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220, 246000, 1390, 3710 |
| Cancer | क सय क सय क नसय क नसय | 1880000, 35700, 6490, 1820 |
| Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000, 263200, 2440, 100 |
| Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890, 368000, 1110, 58 |
| Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000, 1040000, 2610000, 537000 |
| University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540, 1070000, 1420, 3270 |
| Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600, 735000, 97000, 8300 |
| Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300, 125000, 2510, 5160 |
| Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240, 38000, 639, 1120 |
| Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143, 1440, 51300, 9820 |
| Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000, 8730, 182, 8 |
| Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609, 395000, 69700, 289000 |

From the above table, it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are likewise retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the first Hindi form in each row (bold in the original) is the keyword obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and the number of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, in both cases without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed for the two levels as:

प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)

Therefore, the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh Bharat mein") supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From a sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
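The two-level transformation can be sketched as a cumulative word replacement. The code below is a simplified illustration, not the software described in the next chapter: the function name `transform_levels` and the lookup dictionary are assumptions, and the example uses the romanized forms quoted above rather than Hindi script.

```python
from typing import Dict, List


def transform_levels(query: str, english_equivalents: Dict[str, str]) -> List[str]:
    """Return one variant per level; level n has the first n replaceable words replaced."""
    words = query.split()
    current = list(words)
    variants: List[str] = []
    for i, word in enumerate(words):
        if word in english_equivalents:
            # Replacements accumulate, so each level builds on the previous one.
            current[i] = english_equivalents[word]
            variants.append(" ".join(current))
    return variants


# Romanized example taken from the text above.
equivalents = {"videshi": "foreign", "nivesh": "investment"}
for level, variant in enumerate(transform_levels("videshi nivesh bharat mein", equivalents), start=1):
    print(level, variant)
```

For the query above this yields "foreign nivesh bharat mein" at level 1 and "foreign investment bharat mein" at level 2, matching the two transformed forms given in the example.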
Table 4.27: Queries transformed into two equivalent senses containing a mixture of English and Hindi (tabular representation; Google search results)

| In English | Hindi query | Level 1 | Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4: Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6, each comparing the Hindi query with Level 1 and Level 2)
A Study of Web Mining Tools for Query Optimization Page 139
From the above table and figure it is evident that documents are returned for
original as well as transformed Hindi query and the quantity of the retrieved
documents is quite considerable In case of search engines the quality of results is
more important than the quantity therefore Table 428 and figure 45 are presented
below for the analysis of precision values Three popular search engines namely
Google Bing and Alta Vista are used for retrieving web results
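Table 4.28: Analysis of precision values (tabular representation). Each search engine cell lists precision at 10 for the Hindi query / Level 1 / Level 2.

| Hindi query | Level 1 | Level 2 | Google | Bing | AltaVista |
|---|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8 |
| रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5 |
| हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9 / 0.8 / 1 | 0.9 / 0.7 / 1 | 0.9 / 0.7 / 1 |
| सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8 |
| विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4 |
| भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1 |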
Figure 4.5: Analysis of inter-precision values
From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined against Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For a Hindi query and its transformed variations, the degree of relevance of the documents is in most cases very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant. The same pattern repeats for the rest of the Hindi queries shown in the table above.

Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variation based on English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
Inter-precision chart (Figure 4.5): HQ = Hindi query, L = level; series for Google, Bing and AltaVista plotted over HQ1 to HQ6 with their Level 1 (L1) and Level 2 (L2) variants.
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, a software tool has been developed (described in detail in the next chapter) which acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of web search in the Hindi language [66][67].
Table 4.15: Problems faced while searching in Hindi: low recall

| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड स नटय आत कव दी हरभ | 0 | 0 | 0 |
| इिनडमन इ िसटचम ट ह लथ एज क शन ऐनड रयसचम | 0 | 0 | 0 |

Table 4.16: Improved recall

| Hindi Query | Google | Yahoo | Guruji |
|---|---|---|---|
| वलडम टर ड सटय आतॊकी हभर | 8820 | 92 | 12 |
| वलडम टर ड सटय आतॊकीअटक | 331 | 10 | 1 |
| इॊडिमनइॊसटटटमटटवाटम शिा औय रयसचम | 708 | 50 | 1 |
| बायतीम सॊटथान टवाटम शिा औय िोध | 37100 | 7400 | 93 |

4.8 Factors affecting performance of Hindi search
4.8.1 Morphological Factors
Hindi is a morphologically rich language with a well-defined morphological structure and grammar. In practice, however, the grammatical and structural standard is rarely followed, for various reasons. One reason is the linguistic diversity of India: including Hindi, about 28 languages are spoken in India, and Hindi, being the official language of India, is influenced by the regional languages, which changes dialects not only in speech but also in writing.

Every language attaches markers to a root word to construct new words: English uses s, es and ing, while Hindi uses maatraas and suffixes such as एँ, याँ and ओं. For example, योजनाओं (yojnaaon) and योजनाएँ (yojnaayein) are morphological variants of the root word योजना (yojnaa, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
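The stemming idea described above can be illustrated with a minimal suffix-stripping sketch. It is not the method used in this study: the rule list `SUFFIX_RULES` and the function name `stem` are assumptions made for illustration, and the rules cover only a few common plural and oblique endings.

```python
# Minimal suffix-stripping stemmer for Hindi nouns (illustrative sketch only).
# SUFFIX_RULES is a hypothetical, deliberately small rule list:
# each entry is (suffix to remove, text to put in its place).
SUFFIX_RULES = [
    ("ियों", "ी"),   # पक्षियों -> पक्षी
    ("ियाँ", "ी"),   # plural of nouns ending in ी
    ("ओं", ""),      # योजनाओं -> योजना
    ("एँ", ""),      # योजनाएँ -> योजना
    ("एं", ""),      # spelling variant of the rule above
    ("ों", ""),      # कीटनाशकों -> कीटनाशक
    ("ें", ""),      # झीलें -> झील
]

MIN_STEM_LENGTH = 2  # avoid stripping very short words down to nothing


def stem(word):
    """Return a canonical root form by applying the first matching rule."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applies: the word is treated as its own root


if __name__ == "__main__":
    for variant in ["योजना", "योजनाओं", "योजनाएँ"]:
        print(variant, "->", stem(variant))
```

Longer suffixes are listed first so that a word such as पक्षियों is matched by the more specific rule before the generic one. A production stemmer would need a much larger rule set and exception handling.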
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, the transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian language alphabets. Together these factors produce several spellings of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.

There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in several different ways. When the standard keyword स्कूल (school) and a non-standard phonetic equivalent are searched, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in a phonetic variety of words: for example, the English word school (स्कूल in Hindi) is pronounced and written as इस्कूल (iskool) by many people in different states of India. For the Hindi word इस्कूल, more than two thousand results are found. Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

Also, no particular standard exists for writing a keyword to fetch Hindi web data. The results vary for every phonetically equivalent keyword in the query, i.e. a different set of documents is retrieved, with little repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his or her use.
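One way to make phonetically equivalent spellings comparable is to reduce each word to a normalized key before matching. The sketch below is an assumption-laden illustration, not the approach evaluated in this chapter: the function names `phonetic_key` and `group_variants` are hypothetical, and the equivalence classes (chandrabindu and anusvara, long and short vowel signs, optional nukta) are a small hand-picked subset.

```python
import unicodedata
from collections import defaultdict

# Hypothetical equivalence table: characters that are often interchanged
# in everyday Hindi spelling are mapped to one representative.
_EQUIVALENTS = str.maketrans({
    "\u0901": "\u0902",  # chandrabindu -> anusvara
    "\u0940": "\u093f",  # long vowel sign II -> short vowel sign I
    "\u0942": "\u0941",  # long vowel sign UU -> short vowel sign U
    "\u093c": None,      # nukta is dropped
})


def phonetic_key(word):
    """Return a normalized key shared by common spelling variants."""
    # NFD splits precomposed nukta letters into base letter + nukta,
    # so both encodings of the same letter are handled alike.
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed.translate(_EQUIVALENTS)


def group_variants(words):
    """Group words whose normalized keys are identical."""
    groups = defaultdict(list)
    for word in words:
        groups[phonetic_key(word)].append(word)
    return list(groups.values())


if __name__ == "__main__":
    print(group_variants(["स्कूल", "स्कुल", "इस्कूल"]))
```

Note that इस्कूल ends up in a separate group: an added initial vowel is a pronunciation variant rather than a spelling variant, and handling it would need additional rules.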
4.8.3 Words Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहने, जेवर and अलंकार.
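The synonym example above suggests a simple form of query expansion: issue the original query together with one variant per synonym. The following is a minimal sketch under stated assumptions; the dictionary `SYNONYMS` holds only the example from the text, and the function name `expand_query` is hypothetical.

```python
# Toy synonym dictionary containing only the example given in the text.
# A real system would load a much larger lexical resource.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym."""
    words = query.split()
    variants = [query]
    for position, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            replaced = words[:position] + [alternative] + words[position + 1:]
            variants.append(" ".join(replaced))
    return variants


if __name__ == "__main__":
    for variant in expand_query("सोने के आभूषण"):
        print(variant)
```

Only one keyword is replaced per variant, so the number of generated queries grows linearly with the number of synonyms rather than combinatorially.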
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:
(In English) (Women like gold)
(In Hindi) (नारी को सोना पसंद है)
In this query the word सोना (gold) is ambiguous, as it also means "to sleep". In the context of the above query सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) (The common people's choice)
(In Hindi) (आम लोगों की पसंद)
Here the word आम is ambiguous. In the above query it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".

Many words are polysemous. Finding the correct sense of a word in a given context is an intricate task, because a word can have more than one meaning and its meaning depends on the context of the sentence. For example, कर means tax (with synonyms such as शुल्क, सूद, महसूल and टैक्स) in one context, hand or arm (हस्त, बाहु) in another context, and "to do" (करना) in a third.
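Why such queries are hard can be seen in a toy sense-selection sketch that looks for cue words around the ambiguous keyword. Everything here is hypothetical: the cue-word sets in `SENSES` and the function name `guess_sense` are invented for illustration and are not taken from the study.

```python
# Hypothetical cue words for each sense of one ambiguous keyword.
SENSES = {
    "सोना": {
        "gold": {"आभूषण", "गहने", "चांदी", "कीमत"},
        "to sleep": {"नींद", "रात", "बिस्तर"},
    },
}


def guess_sense(query, keyword, senses=SENSES):
    """Pick the sense whose cue words overlap most with the query.

    Returns None when no cue word is present or when senses tie,
    i.e. when the query itself gives no way to decide.
    """
    context = set(query.split()) - {keyword}
    scores = {
        sense: len(context & cues) for sense, cues in senses[keyword].items()
    }
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


if __name__ == "__main__":
    print(guess_sense("नारी को सोना पसंद है", "सोना"))
    print(guess_sense("सोना चांदी की कीमत", "सोना"))
```

For the query from the text the sketch returns `None`, because none of the surrounding words points to either sense; this is exactly the situation in which a search engine cannot tell the two readings apart.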
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Speakers are sometimes unable to recall the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by people with no formal education, without awareness of the language those words come from. Most Indians use these words in English rather than in their native language. English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.

The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
A sample set of 50 queries was taken to test the effect of the root word. Table 4.17 lists randomly selected queries from this set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17: List of Hindi queries

| S. No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | भारत वर्षावन | Indian rain forest |
| 1.1 | भारतीय वर्षावनों | Indian rain forests |
| 2 | हवाई दुर्घटना का कारण | Reason for air crash |
| 2.1 | हवाई दुर्घटनाओं का कारण | Reason for air crashes |
| 3 | भारत में बोली जाने वाली भाषा | Language spoken in India |
| 3.1 | भारत में बोली जाने वाली भाषाएँ | Languages spoken in India |
| 3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India |
| 4 | विलुप्त होने पर झील | Lake on the verge of extinction |
| 4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction |
| 5 | प्राकृतिक आपदा | Natural calamity |
| 5.1 | प्राकृतिक आपदाएँ | Natural calamities |
| 6 | पक्षी की प्रजाति | Bird species |
| 6.1 | पक्षियों की प्रजाति | Bird's species |
| 6.2 | पक्षियों की प्रजातियाँ | Bird's species |
| 7 | कृषि समस्या | Agriculture problem |
| 7.1 | कृषि समस्याओं | Agriculture problems |
| 8 | कीटनाशक का इस्तेमाल | Use of pesticide |
| 8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides |
| 9 | मानसिक रोग | Mental illness |
| 9.1 | मानसिक रोगियों | Mental patients |
| 9.2 | मानसिक रोगी | Mental patient |
| 10 | ग्रामीण विकास योजना | Policy for village |
| 10.1 | ग्रामीण विकास योजनाओं | Policies for village |
| 10.2 | ग्रामीण विकास योजनाएँ | Policies for village |
| 11 | प्रमुख कृषि केंद्र | Major agricultural office |
| 11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices |
Table 4.18: Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
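Table 4.19: Precision values of the three search engines (precision at 10)

| S. No. | Query | Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1 |
| 1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1 |
| 2.1 | हवाई दुर्घटना का कारण | 0.5 | 0.3 | 0.1 |
| 2.2 | हवाई दुर्घटनाओं का कारण | 0.3 | 0.3 | 0.1 |
| 3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5 |
| 3.2 | भारत में बोली जाने वाली भाषाएँ | 0.5 | 0.4 | 0.3 |
| 3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2 |
| 4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2 |
| 4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0 |
| 5.1 | प्राकृतिक आपदा | 1 | 0.4 | 0.3 |
| 5.2 | प्राकृतिक आपदाएँ | 0.6 | 0.6 | 0.1 |
| 6.1 | पक्षी की प्रजाति | 1 | 0.7 | 0.2 |
| 6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1 |
| 6.3 | पक्षियों की प्रजातियाँ | 0.9 | 0.6 | 0.1 |
| 7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2 |
| 7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2 |
| 8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2 |
| 9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4 |
| 9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3 |
| 9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4 |
| 10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3 |
| 10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0 |
| 10.3 | ग्रामीण विकास योजनाएँ | 0.8 | 0.6 | 0.2 |
| 11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0 |
| 11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2 |

Figure 4.1: Precision values of the three search engines (precision from 0 to 1 plotted per query number; series: Google, Bing, Guruji)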
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general better results are obtained with keywords in their root form.

It has also been observed that only Google lists the morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.

These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not appear to index in that form, which would explain why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when the keywords are used in their root form.

For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results, the precision value for Google is high in almost all queries. Since Google appears to index keywords in their root form, the relevance of its results is also high in comparison with the other two search engines. This indicates that not only the quantity but also the quality of results is affected by morphological variations in the keywords.
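The precision values reported in this chapter are computed over the top 10 results of each search engine. As a minimal sketch of that calculation, assuming relevance judgments are available as a list of booleans in rank order (the function name `precision_at_k` is an assumption for illustration):

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of booleans in rank order. Missing results
    (fewer than k returned) count as not relevant, so the divisor is k.
    """
    top = relevance[:k]
    return sum(1 for is_relevant in top if is_relevant) / k


if __name__ == "__main__":
    # 9 of the first 10 documents judged relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```

Dividing by k rather than by the number of results returned penalizes an engine that returns fewer than 10 documents, which matches the way an engine with no results receives a precision of 0.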
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are phonetically equivalent to the keywords entered. A user may use any keyword for searching, and search engines should support all phonetically equivalent words.

The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained for each query and its phonetic variations.

Table 4.20: Results of the search engine on the phonetic nature of the Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Figure 4.2: Precision charts for the phonetic nature of the Hindi language (one chart per query, Query No. 1 to Query No. 5)

From the above table and figure it can be clearly seen that the search engine returns documents for each of the phonetically equivalent Hindi queries, in widely varying numbers. It is observed that no particular standard exists for writing a keyword to fetch Hindi web data. The results vary for every phonetically equivalent keyword in the query, i.e. a different set of documents is retrieved, with little repetition. From the precision charts it is clearly observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his or her use.
4.11 Discussion: Words Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहने, जेवर and अलंकार.

Table 4.21 and Figure 4.3, presented below, compare the precision values obtained from three search engines.
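Table 4.21: Effect of word synonyms on Hindi IR (documents returned and precision at 10 per search engine)

| S. No. | Query | Standard Hindi keyword | Google results | Google P@10 | Bing results | Bing P@10 | Guruji results | Guruji P@10 |
|---|---|---|---|---|---|---|---|---|
| 1.1 | सोने के आभूषण | आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | सोने के गहने | आभूषण | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | सोने के जेवर | आभूषण | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | सोने के अलंकार | आभूषण | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | सोने के आभरण | आभूषण | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2.1 | काले बादल | बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | काले मेघ | बादल | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | काले जलधर | बादल | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3.1 | स्त्री सशक्तिकरण | स्त्री | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | नारी सशक्तिकरण | स्त्री | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | महिला सशक्तिकरण | स्त्री | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औरत सशक्तिकरण | स्त्री | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4.1 | सिकंदर का अहंकार | अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | सिकंदर का अभिमान | अहंकार | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | सिकंदर का घमंड | अहंकार | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5.1 | वृक्ष लगाओ | वृक्ष | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | पेड़ लगाओ | वृक्ष | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दरख्त लगाओ | वृक्ष | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6.1 | आँख दान | आँख | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | नेत्र दान | आँख | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | चक्षु दान | आँख | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |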
Figure 4.3: Comparison of precision values against three search engines

From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, the quality of results is of greater interest than the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that by changing a keyword into its synonym, equivalent results can be obtained. Synonyms of keywords therefore play an important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English with respect to the ambiguous keyword in each context.
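Table 4.22: List of randomly selected ambiguous queries

| S. No. | Query | Keyword (first sense) | Query meaning in English | Keyword (second sense) | Query meaning in English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below. In each table, the three "results found" columns count how many of the top 10 documents fall in the first context, the second context and any other context.

Table 4.23: Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found (context 1) | Context 2 | Results found (context 2) | Results found (other context) |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found (context 1) | Context 2 | Results found (context 2) | Results found (other context) |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |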
Table 4.25: Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found (context 1) | Context 2 | Results found (context 2) | Results found (other context) |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the above tables it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In the above tables the last column, labeled "Other Context", holds the number of results which are not relevant to the query supplied, or which contain the keyword in some other, non-required context. The results make clear that all search engines return documents in different contexts; it can therefore be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". From the above results, however, it can be seen that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information retrieval
English influences Hindi not only in speech but also in writing, which becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.

Example: the English word exercise is written in Hindi as एक्सरसाइज, and this word has several phonetic variations in Hindi script. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
English Words
Google Transliteration
Standard Hindi Keywords
Phonetic Equivalents Search Engine Google
Woman वोभन व भन व भ न व भ न 42200 101000 3520 34800
Insurance इनसयाॊस इ शम यस इ शम य स इनशम य नस 1220 246000 1390 3710
Cancer क सय क सय क नसय क नसय 1880000 35700 6490 1820
Hospital हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर 404000 263200 2440 100
Corruption कोररसततओन कयपशन कयऩशन क यपशन 1890 368000 1110 58 Computer कॊ तमटय कमपम टय कमपम टय क पम टय 4450000 1040000 261000
0 537000
University उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी 1540 1070000 1420 3270
Director डियकटय ड मय कटय ड इय कटय ड मय कटय 4600 735000 97000 8300
Accident एसकसिट एकस डट एकस ड नट ऐिकसडट 26300 125000 2510 5160
Parliament ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट
3240 38000 639 1120
specialist टऩशसअशरटट
सऩ मशममरसट
सऩ शमरसट सऩ श मरसट
143 1440 51300 9820
Expert एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम 269000 8730 182 8
Table 4.26: Keywords along with their phonetic variants written in Hindi

From the above table it can be observed that the search engine returns documents for single-keyword queries and also for all phonetic variants of the keywords, often in large numbers. It can also be seen that people have their own way of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script.

In the above table, the column with bold Hindi entries shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be यूनिवर्सिटी, whereas Google transliteration provides a form which is completely wrong. It is evident from the table that 1540 documents have been retrieved for this wrong keyword, and the same holds for other single-word queries such as Insurance, Parliament, Corruption, Policy and Specialist. It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.
Policy ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस 609 395000 69700 289000
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:

An English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:
फॉरेन निवेश भारत में (foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, where the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
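The two-level transformation described above can be sketched as follows. This is only an illustration of the idea and not the software described in the next chapter: the mapping `LOANWORDS` contains just the two keyword pairs from the example, and the function name `transform_query` is hypothetical.

```python
# Hypothetical mapping from Hindi keywords to English equivalents
# written in Hindi script (only the pairs from the example above).
LOANWORDS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_query(query, loanwords=LOANWORDS):
    """Return the Level 1 and (if possible) Level 2 variants of a query.

    Level 1 replaces the first replaceable keyword only.
    Level 2 replaces every replaceable keyword and is produced only
    when more than one keyword can be replaced.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in loanwords]
    if not replaceable:
        return []

    level1 = list(words)
    first = replaceable[0]
    level1[first] = loanwords[words[first]]
    variants = [" ".join(level1)]

    if len(replaceable) > 1:
        level2 = [loanwords.get(word, word) for word in words]
        variants.append(" ".join(level2))
    return variants


if __name__ == "__main__":
    for variant in transform_query("विदेशी निवेश भारत में"):
        print(variant)
```

Words without an entry in the mapping (here भारत and में) are left unchanged, so the meaning of the query is preserved while its vocabulary shifts toward English-influenced Hindi.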
A Study of Web Mining Tools for Query Optimization Page 141
The process of Hindi IR becomes more difficult because of the structure of
Hindi Language Generally people do not follow the actual Hindi writing standard
which widens the gap between Hindi web data and users
The relevant information can be mined out by transforming the Hindi queries
Search engines neither make transformations of the query nor find keyword
equivalents Because they may have the performance and throughput problems if
parameters like Hindi Phonetics synonyms and English equivalent Hindi
keywords are implemented at root level However this problem can be solved at
interface level Therefore to lessen the efforts of a Hindi user to search such
information a software has been developed (a detailed description has been
mentioned in next Chapter) which acts like an interface between user and search
engines With the help of this tool user can widen the scope of search on web in
Hindi language [66][67]
speaking but writing also. Every language uses markers that attach to a root word to construct new words: English uses s, es and ing, and Hindi uses MAATRAAS (vowel signs) and suffixes such as ओं and एँ. For example, Yojnaaon (योजनाओं) and Yojnaayein (योजनाएँ) are morphological variants of the root word Yojnaa (योजना, "planning" in English). It is desirable to combine all the morphological variants of a word into a single canonical form. This process is called word stemming, and the canonical form is called the root word or base word.
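The idea of reducing morphological variants to a root word can be sketched as a light suffix-stripping stemmer. This is only a minimal illustration: the suffix list and the function name `light_stem` are assumptions made for the example, not the stemmer used in this study, and a real Hindi stemmer needs a far larger rule set.

```python
# Plural and oblique endings to strip. This short list is an illustrative
# assumption, not a complete inventory of Hindi suffixes.
SUFFIXES = ["ओं", "एँ", "एं", "ों", "ें"]


def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Return the word with the first matching suffix removed.

    Words that would become shorter than min_stem_len are left untouched.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for variant in ["योजनाओं", "योजनाएँ", "योजना"]:
        print(variant, "->", light_stem(variant))
```

With this rule set the two variants quoted above collapse onto योजना, so a query and a document that use different forms of the word can be matched on the same key.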
4.8.2 Phonetic nature of Hindi Language and Spelling variations
The major reasons for spelling variations can be attributed to the phonetic nature of Indian languages, multiple dialects, transliteration of proper names, words borrowed from regional and foreign languages, and the phonetic variety of the Indian-language alphabets. Together these factors result in several spellings of the same word. For example, the Hindi word अंग्रेज़ी (angrējī, meaning English) has several possible spelling variations.
There are numerous words which are phonetically equivalent but vary in writing. The word school can be written in Hindi in at least three different ways. When information is searched for the standard keyword स्कूल (school) and for a non-standard Hindi phonetic equivalent, Google shows 69 million results for the former and 14 million for the latter. Hindi is also influenced by other regional languages, which results in a phonetic variety of words. For example, the English word school (स्कूल in Hindi) is pronounced and written as ISKOOL (इस्कूल) by a majority of the population in different states of India, and more than two thousand results are found for इस्कूल. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered: a user may use any of these forms, and the search engine should support all of them.
Also, no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with very little repetition. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
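One way to let phonetically equivalent spellings match each other is to map every keyword to a normalized key before comparison. The sketch below is an assumption-laden illustration, not the method of this study: the function name `phonetic_key` and the three folding rules (drop the nukta, unify nasal marks, shorten long ī and ū) are chosen only for the example, and the key is meant for matching, not for display.

```python
import unicodedata

NUKTA = "\u093c"         # dot marking borrowed sounds, as in ज़
CHANDRABINDU = "\u0901"  # nasal mark written above the line
ANUSVARA = "\u0902"      # nasal dot
VIRAMA = "\u094d"        # vowel killer used to form conjuncts
NASALS = "ङञणनम"
LONG_TO_SHORT = {"\u0940": "\u093f", "\u0942": "\u0941"}  # long i, u -> short


def phonetic_key(word: str) -> str:
    """Fold common spelling differences into one comparison key."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", word)
    text = text.replace(NUKTA, "").replace(CHANDRABINDU, ANUSVARA)
    # A half nasal consonant and an anusvara are treated as the same sound.
    for nasal in NASALS:
        text = text.replace(nasal + VIRAMA, ANUSVARA)
    for long_sign, short_sign in LONG_TO_SHORT.items():
        text = text.replace(long_sign, short_sign)
    return text


if __name__ == "__main__":
    for spelling in ["आंदोलन", "आँदोलन", "आन्दोलन"]:
        print(spelling, "->", phonetic_key(spelling))
```

Spellings that share a key can then be grouped and submitted together. Forms such as इस्कूल, which add a whole syllable, are not covered by these rules and would need a separate variant list.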
4.8.3 Words Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.
4.8.4 Ambiguous words
Ambiguous words reduce the relevance of the results. The examples below show this aspect very clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous, as it has another meaning, i.e. to sleep. In the context of the above query the word सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the above query it means common; however, in Hindi it also means mango, so the query can also be interpreted as "mango is people's choice".
Many words are polysemous in nature, and finding the correct sense of a word in a given context is an intricate task: one word has more than one meaning, and the meaning depends on the context of the sentence. For example, कर (tax) has the synonyms ब्याज, शुल्क, सूद, महसूल, टैक्स in one context; in another context कर (hand or arm) has the synonyms हस्त, बाहु, आच, शफय; and in yet another context कर (to do) corresponds to करना.
4.8.5 Influence of English on Hindi Information retrieval
The English language has influenced Indian languages in many ways, including the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find the Hindi equivalent of an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director and department are used even by uneducated Indians, without being aware of the language those words come from. Most Indians use these words in English rather than in their native language. English has its influence over Hindi not only in speaking but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from that set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.

Table 4.17 List of Hindi queries

| S. No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | भारत वर्षावन | Indian rain forest |
| 1.1 | भारतीय वर्षावनों | Indian rain forests |
| 2 | हवाई दुर्घटना का कारण | Reason for air crash |
| 2.1 | हवाई दुर्घटनाओं का कारण | Reason for air crashes |
| 3 | भारत में बोली जाने वाली भाषा | Language spoken in India |
| 3.1 | भारत में बोली जाने वाली भाषाएँ | Languages spoken in India |
| 3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India |
| 4 | विलुप्त होने पर झील | Lake on the verge of extinction |
| 4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction |
| 5 | प्राकृतिक आपदा | Natural calamity |
| 5.1 | प्राकृतिक आपदाएँ | Natural calamities |
| 6 | पक्षी की प्रजाति | Bird species |
| 6.1 | पक्षियों की प्रजाति | Bird's species |
| 6.2 | पक्षियों की प्रजातियाँ | Bird's species |
| 7 | कृषि समस्या | Agriculture problem |
| 7.1 | कृषि समस्याओं | Agriculture problems |
| 8 | कीटनाशक का इस्तेमाल | Use of pesticide |
| 8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides |
| 9 | मानसिक रोग | Mental illness |
| 9.1 | मानसिक रोगियों | Mental patients |
| 9.2 | मानसिक रोगी | Mental patient |
| 10 | ग्रामीण विकास योजना | Policy for village |
| 10.1 | ग्रामीण विकास योजनाओं | Policies for village |
| 10.2 | ग्रामीण विकास योजनाएँ | Policies for village |
| 11 | प्रमुख कृषि केंद्र | Major agricultural office |
| 11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices |
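Table 4.18 Effect of morphological factors on Hindi queries

| S. No. | Root word / morphological variant | Keywords listed by Google | Keywords listed by Bing | Keywords listed by Guruji | Documents returned: Google | Bing | Guruji |
|---|---|---|---|---|---|---|---|
| 1 | भारत वर्षावन (root) | भारत, वर्षावन, वर्षा, वन | वर्षावन, वन | भारत, वर्षावन, वन | | | |
| 1.1 | भारत वर्षावन | | | | 50500 | 4680 | 485 |
| 1.2 | भारतीय वर्षावनों | | | | 40400 | 680 | 61 |
| 2 | दुर्घटना (root) | दुर्घटना, दुर्घटनाओं | दुर्घटना | दुर्घटना | | | |
| 2.1 | दुर्घटना | | | | 133000 | 2410 | 284 |
| 2.2 | दुर्घटनाओं | | | | 117000 | 420 | 23 |
| 3 | भाषा (root) | भाषा, भाषाओं | भाषा | भाषा | | | |
| 3.1 | भाषा | | | | 161000 | 8330 | 961 |
| 3.2 | भाषाएँ | | | | 6200 | 935 | 188 |
| 3.3 | भाषाओं | | | | 6090 | 441 | 356 |
| 4 | झील (root) | झील, झीलों | झील | झील | | | |
| 4.1 | झील | | | | 4740 | 278 | 25 |
| 4.2 | झीलें | | | | 1270 | 28 | 1 |
| 5 | आपदा (root) | आपदा, आपदाओं | आपदा | आपदा | | | |
| 5.1 | आपदा | | | | 102000 | 4030 | 410 |
| 5.2 | आपदाएँ | | | | 1160 | 64 | 20 |
| 6 | पक्षी प्रजाति (root) | पक्षी, पक्षियों, प्रजाति, प्रजातियों, प्रजातियाँ | पक्षी प्रजाति | पक्षी प्रजाति | | | |
| 6.1 | पक्षी प्रजाति | | | | 48200 | 1670 | 98 |
| 6.2 | पक्षियों प्रजाति | | | | 47600 | 1150 | 84 |
| 6.3 | पक्षियों प्रजातियाँ | | | | 33800 | 747 | 25 |
| 7 | समस्या (root) | समस्याएँ, समस्याओं, समस्या | समस्या | समस्या | | | |
| 7.1 | समस्या | | | | 584000 | 30200 | 1889 |
| 7.2 | समस्याओं | | | | 584000 | 7150 | 1356 |
| 8 | कीटनाशक (root) | कीटनाशक, कीटनाशकों | कीटनाशक | कीटनाशक | | | |
| 8.1 | कीटनाशक | | | | 36300 | 1360 | 333 |
| 8.2 | कीटनाशकों | | | | 35800 | 800 | 270 |
| 9 | रोग (root) | रोगों, रोग | रोग | रोग | | | |
| 9.1 | रोग | | | | 205000 | 21600 | 1423 |
| 9.2 | रोगियों | | | | 128000 | 3280 | 239 |
| 9.3 | रोगी | | | | 112000 | 6280 | 647 |
| 10 | योजना (root) | योजनाओं, योजना | योजना | योजना | | | |
| 10.1 | योजना | | | | 673000 | 18500 | 3343 |
| 10.2 | योजनाओं | | | | 669000 | 6020 | 990 |
| 10.3 | योजनाएँ | | | | 673000 | 2860 | 416 |
| 11 | केंद्र (root) | केंद्रीय, केंद्र | केंद्र | केंद्र | | | |
| 11.1 | केंद्र | | | | 261000 | 11300 | 655 |
| 11.2 | केन्द्रों | | | | 29500 | 1850 | 105 |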
Table 4.19 Precision values of the three search engines

| S. No. | Query | Precision@10: Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1 |
| 1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1 |
| 2.1 | हवाई दुर्घटना का कारण | 0.5 | 0.3 | 0.1 |
| 2.2 | हवाई दुर्घटनाओं का कारण | 0.3 | 0.3 | 0.1 |
| 3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5 |
| 3.2 | भारत में बोली जाने वाली भाषाएँ | 0.5 | 0.4 | 0.3 |
| 3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2 |
| 4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2 |
| 4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0 |
| 5.1 | प्राकृतिक आपदा | 1 | 0.4 | 0.3 |
| 5.2 | प्राकृतिक आपदाएँ | 0.6 | 0.6 | 0.1 |
| 6.1 | पक्षी की प्रजाति | 1 | 0.7 | 0.2 |
| 6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1 |
| 6.3 | पक्षियों की प्रजातियाँ | 0.9 | 0.6 | 0.1 |
| 7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2 |
| 7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2 |
| 8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2 |
| 9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4 |
| 9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3 |
| 9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4 |
| 10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3 |
| 10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0 |
| 10.3 | ग्रामीण विकास योजनाएँ | 0.8 | 0.6 | 0.2 |
| 11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0 |
| 11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2 |

Figure 4.1 Precision values of the three search engines (series P Google, P Bing and P Guruji plotted per query; precision axis from 0 to 1)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching with the root word, because in general we get better results with keywords in their root form.
It has also been observed that only Google lists morphological variants of the root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. In the case of search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated from the top 10 results of each search engine. On closely observing the results we can say that the precision of Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also high in comparison to the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of results.
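The precision figure used throughout this chapter, the share of relevant documents among the top 10 results, can be computed as follows. This is a minimal sketch: the function name `precision_at_k` is an assumption, relevance judgments are assumed to be supplied as booleans in rank order, and dividing by k even when fewer than k results come back is one common convention, not something the text specifies.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    `relevance` holds one judgment per result, best-ranked first.
    Missing results (fewer than k returned) count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


if __name__ == "__main__":
    # Nine of the first ten documents judged relevant.
    judgments = [True] * 9 + [False]
    print(precision_at_k(judgments))
```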
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may use any of these forms for searching, and search engines should support all of them.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. Table 4.20 shows the number of results and the precision offered by Google for each phonetic variation.

Table 4.20 Results of the search engine on phonetic nature of Hindi language

| Hindi query (standard keywords in bold in the original) | Phonetic variations of the keywords | Keywords in the Google query | No. of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो, सबज मो, जहयीर, जहरयर | ससजिमो िहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ, भह ग ई, भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य, बयषटट च य, आज दी | भरषटाचाय आज दी | 211000 | 0.8 |
| | | भरशटाचाय आज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च य आिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च य आिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य, आ द रन, आ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी, फय जग यी, फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, with one segment per phonetic variant)
From the above table and figure it can be clearly seen that search engines return documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. Each phonetically equivalent keyword in a query produces different results, i.e. a different set of documents is retrieved with very little repetition. From the precision charts it is observed that the degree of relevance for queries containing phonetically equivalent keywords is nearly equal. The native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
4.11 Discussion: Words Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहने, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three search engines.
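Table 4.21 Effect of word synonyms on Hindi IR
In each group the first query (x.1) uses the standard Hindi word and the remaining queries replace it with a synonym.

| S. No. | Query | Google: documents returned | Google: precision@10 | Bing: documents returned | Bing: precision@10 | Guruji: documents returned | Guruji: precision@10 |
|---|---|---|---|---|---|---|---|
| 1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5.1 | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6.1 | आँख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | नेत्र दान | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |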
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest, possibly because fewer documents are available to it or because of poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, it can be clearly seen that Google and Bing provide better results than Guruji, and on average Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Therefore synonyms of keywords play an important role in the process of Hindi information retrieval.
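Generating the synonym variants of a query, as done for the queries in Table 4.21, can be sketched in a few lines. The dictionary `SYNONYMS` and the function name `expand_with_synonyms` are hypothetical and exist only for this illustration; a real system would draw synonyms from a Hindi lexical resource.

```python
from typing import Dict, List

# Hypothetical synonym dictionary holding the ornament example from the text.
SYNONYMS: Dict[str, List[str]] = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}


def expand_with_synonyms(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by one variant per synonym.

    Only one keyword is substituted at a time, so every variant stays
    close to the wording of the original query.
    """
    tokens = query.split()
    variants = [query]
    for position, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            candidate = " ".join(
                tokens[:position] + [alternative] + tokens[position + 1:]
            )
            if candidate not in variants:
                variants.append(candidate)
    return variants


if __name__ == "__main__":
    for variant in expand_with_synonyms("सोने के आभूषण", SYNONYMS):
        print(variant)
```

Each variant can then be submitted separately and the result lists compared or merged.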
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected queries below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the meaning of the query in English for the respective context of the ambiguous keyword.
Table 4.22 List of randomly selected ambiguous queries

| S. No. | Query | For keyword as | In English | For keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |
Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results which are not relevant to the query supplied or which contain the keyword in some other, non-required context. The results make it clear that all search engines return documents in different contexts; therefore search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars) the "Other context" column contains 5 documents for Google, 8 documents for Bing and all 10 documents for Guruji.
For another query, सपेरों का फन (art of snake charmers), the retrieved documents are expected to be in the context "art" rather than the other context (snake charmer's snake head). The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
4.13 Discussion: Influence of English on Hindi Information retrieval
English has its influence over Hindi not only in speaking but in writing too, and this becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example explains.
Example: the English word exercise is written in Hindi as एक्सरसाइज, and this word has at least four phonetic spelling variations in Hindi. Following this pattern, a variety of popular keywords and queries from various domains have been tested in the experiments.
Table 4.26 shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi (search engine: Google)

| English word | Google transliteration | Results | Standard Hindi keyword | Results | Phonetic equivalent 1 | Results | Phonetic equivalent 2 | Results |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | 42200 | व भन | 101000 | व भ न | 3520 | व भ न | 34800 |
| Insurance | इनसयाॊस | 1220 | इ शम यस | 246000 | इ शम य स | 1390 | इनशम य नस | 3710 |
| Cancer | क सय | 1880000 | क सय | 35700 | क नसय | 6490 | क नसय | 1820 |
| Hospital | हॉसटऩटर | 404000 | ह िसऩटर | 263200 | ह िसऩटर | 2440 | ह सऩ टर | 100 |
| Corruption | कोररसततओन | 1890 | कयपशन | 368000 | कयऩशन | 1110 | क यपशन | 58 |
| Computer | कॊ तमटय | 4450000 | कमपम टय | 1040000 | कमपम टय | 2610000 | क पम टय | 537000 |
| University | उननवशसतम | 1540 | म तनवमसमटी | 1070000 | म तनव मसमटी | 1420 | म तनवसमटी | 3270 |
| Director | डियकटय | 4600 | ड मय कटय | 735000 | ड इय कटय | 97000 | ड मय कटय | 8300 |
| Accident | एसकसिट | 26300 | एकस डट | 125000 | एकस ड नट | 2510 | ऐिकसडट | 5160 |
| Parliament | ऩशरअभट | 3240 | ऩ मरमम भट | 38000 | ऩ मरमभट | 639 | ऩ मरमम भट | 1120 |
| Specialist | टऩशसअशरटट | 143 | सऩ मशममरसट | 1440 | सऩ शमरसट | 51300 | सऩ श मरसट | 9820 |
| Expert | एकसऩट | 269000 | एकसऩटम | 8730 | ऐकसऩटम | 182 | ऐकसऩटम | 8 |
| Policy | ऩोशरटम | 609 | ऩॉमरस | 395000 | ऩ मरस | 69700 | ऩ मरस | 289000 |

From the above table it can be observed that the search engine returns documents for a single-keyword query and also for every phonetic variant of the keyword, often in huge numbers. It can also be seen that people have their own ways of writing these words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The Google Transliteration column of the table (bold entries in the original) shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be concluded that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and the quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.
The queries are converted at two levels. At the first level one Hindi word is replaced by its English equivalent, and at the second level more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:
The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "Foreign" (फॉरेन) and निवेश means "investment" (इनवेस्टमेंट). The query is transformed at the two levels as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इनवेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent senses containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation

| In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google results: Hindi query | Level 1 | Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इनवेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (result counts for Hindi Query 1 to Hindi Query 6, each with its Level 1 and Level 2 variants)
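The two-level transformation can be sketched as a dictionary lookup over the query tokens. The mapping `EQUIVALENTS` and the function name `transform_query` are hypothetical and serve only this illustration; the actual software is described in the next chapter and may work differently.

```python
from typing import Dict, Tuple

# Hypothetical mapping from Hindi keywords to English words in Hindi script.
EQUIVALENTS: Dict[str, str] = {"विदेशी": "फॉरेन", "निवेश": "इनवेस्टमेंट"}


def transform_query(query: str, equivalents: Dict[str, str]) -> Tuple[str, str]:
    """Return the (level 1, level 2) variants of a Hindi query.

    Level 1 replaces only the first keyword that has an English equivalent.
    Level 2 replaces every keyword that has one. A query without any known
    keyword is returned unchanged at both levels.
    """
    tokens = query.split()
    replaceable = [i for i, token in enumerate(tokens) if token in equivalents]
    if not replaceable:
        return query, query
    level1 = list(tokens)
    first = replaceable[0]
    level1[first] = equivalents[tokens[first]]
    level2 = [equivalents.get(token, token) for token in tokens]
    return " ".join(level1), " ".join(level2)


if __name__ == "__main__":
    level1, level2 = transform_query("विदेशी निवेश भारत में", EQUIVALENTS)
    print(level1)
    print(level2)
```

When only one keyword has an equivalent, both levels coincide, so a caller may want to drop the duplicate before querying.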
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. In the case of search engines the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
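Table 4.28 Analysis of precision values: tabular representation

| Query | Form | Precision@10: Google | Bing | AltaVista |
|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | Hindi query | 0.9 | 0.9 | 0.9 |
| हेल्थ और रक्तदान | Level 1 | 0.9 | 0.8 | 0.9 |
| हेल्थ और ब्लड_डोनेशन | Level 2 | 0.9 | 0.8 | 0.8 |
| रक्तचाप का इलाज | Hindi query | 0.8 | 0.8 | 0.8 |
| ब्लडप्रेशर का इलाज | Level 1 | 0.9 | 0.7 | 0.7 |
| ब्लडप्रेशर का ट्रीटमेंट | Level 2 | 0.7 | 0.5 | 0.5 |
| हृदय चिकित्सक | Hindi query | 0.9 | 0.9 | 0.9 |
| हार्ट चिकित्सक | Level 1 | 0.8 | 0.7 | 0.7 |
| कार्डियोलॉजिस्ट | Level 2 | 1 | 1 | 1 |
| सरकार द्वारा रोजगार योजना | Hindi query | 1 | 0.8 | 0.6 |
| सरकार द्वारा रोजगार स्कीम | Level 1 | 0.9 | 0.8 | 0.7 |
| सरकार द्वारा एम्प्लॉयमेंट स्कीम | Level 2 | 0.8 | 0.8 | 0.8 |
| विदेशी निवेश भारत में | Hindi query | 0.9 | 0.9 | 0.9 |
| फॉरेन निवेश भारत में | Level 1 | 0.8 | 0.8 | 0.9 |
| फॉरेन इनवेस्टमेंट भारत में | Level 2 | 0.8 | 0.4 | 0.4 |
| भ्रष्टाचार मुक्त भारत | Hindi query | 1 | 1 | 1 |
| करप्शन मुक्त भारत | Level 1 | 1 | 1 | 1 |
| करप्शन फ्री इंडिया | Level 2 | 1 | 1 | 1 |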
Figure 4.5 Analysis of inter-precision values (inter-precision chart for Google, Bing and AltaVista over HQ1 to HQ6, each with L1 and L2; HQ = Hindi Query, L = Level)
From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For most Hindi queries and their transformed variations the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant. A similar pattern holds for most of the other Hindi queries shown in the table above. Without transformation of Hindi queries the user may miss the chance of retrieving relevant information, as the Hindi user may not be aware of the presence of such information on the web and may be unable to formulate query variations based on the factor of English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
The process of Hindi IR becomes more difficult because of the structure of
Hindi Language Generally people do not follow the actual Hindi writing standard
which widens the gap between Hindi web data and users
The relevant information can be mined out by transforming the Hindi queries
Search engines neither make transformations of the query nor find keyword
equivalents Because they may have the performance and throughput problems if
parameters like Hindi Phonetics synonyms and English equivalent Hindi
keywords are implemented at root level However this problem can be solved at
interface level Therefore to lessen the efforts of a Hindi user to search such
information a software has been developed (a detailed description has been
mentioned in next Chapter) which acts like an interface between user and search
engines With the help of this tool user can widen the scope of search on web in
Hindi language [66][67]
Hindi word ISKOOL (इस्कूल), more than two thousand results are found. Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may type any of these forms, and the search engine should support all of them.
Also, no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved with very little repetition. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his/her needs.
4.8.3 Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
4.8.4 Ambiguous Words
Ambiguous words reduce the relevance of the results. The examples below show this clearly. Consider the following query:
(In English) Women like gold
(In Hindi) नारी को सोना पसंद है
In this query the word सोना (gold) is ambiguous because it also means "to sleep". In the context of the query above सोना means gold, but the query can also be interpreted as "women like to sleep".
Another query:
(In English) The common people's choice
(In Hindi) आम लोगों की पसंद
Here the word आम is ambiguous. In the query above it means "common"; however, in Hindi it also means "mango", so the query can be interpreted as "mango is people's choice".
Many words are polysemous: one word has more than one meaning, and its meaning depends on the context of the sentence. Finding the correct sense of a word in a given context is an intricate task. For example, कर means "tax" in one context (synonyms: ब्याज, शुल्क, सूद, महसूल, टैक्स), "hand or arm" in another (synonyms: हस्त, फ ह, आच शफय) and "to do" (करना) in a third.
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, and it has affected the pronunciation of Hindi words. Many English words have been localized in India, and some of them appear as if they were native Hindi words. Indians are sometimes unable to find a Hindi equivalent for an English word. For instance, words such as road, bus, pen, television, radio, please, rail, email, password, insurance, internet, director, department, etc. are used even by uneducated Indians, who are not aware of the language these words come from. Most Indians use these words in English rather than in their native language. English has influenced Hindi not only in speech but in writing too, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in the following tables and figures.
4.9 Discussion: Morphological Factors
We have taken a sample set of 50 queries to test the effect of the root word. Table 4.17 lists randomly selected queries from that set, which throw light on the effect of the root word on the performance of Hindi language search engines. Table 4.18 shows examples of the effect of morphological factors on Hindi queries.
Table 4.17 List of Hindi queries

| S. No. | Query in Hindi | Meaning in English |
|---|---|---|
| 1 | भारत वर्षावन | Indian rain forest |
| 1.1 | भारतीय वर्षावनों | Indian rain forests |
| 2 | हवाई दुर्घटना के कारण | Reason for air crash |
| 2.1 | हवाई दुर्घटनाओं के कारण | Reason for air crashes |
| 3 | भारत में बोली जाने वाली भाषा | Language spoken in India |
| 3.1 | भारत में बोली जाने वाली भाषाएँ | Languages spoken in India |
| 3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India |
| 4 | विलुप्त होने पर झील | Lake on the verge of extinction |
| 4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction |
| 5 | प्राकृतिक आपदा | Natural calamity |
| 5.1 | प्राकृतिक आपदाएँ | Natural calamities |
| 6 | पक्षी की प्रजाति | Bird species |
| 6.1 | पक्षियों की प्रजाति | Birds' species |
| 6.2 | पक्षियों की प्रजातियाँ | Birds' species |
| 7 | कृषि समस्या | Agriculture problem |
| 7.1 | कृषि समस्याओं | Agriculture problems |
| 8 | कीटनाशक का इस्तेमाल | Use of pesticide |
| 8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides |
| 9 | मानसिक रोग | Mental illness |
| 9.1 | मानसिक रोगियों | Mental patients |
| 9.2 | मानसिक रोगी | Mental patient |
| 10 | ग्रामीण विकास योजना | Policy for village |
| 10.1 | ग्रामीण विकास योजनाओं | Policies for village |
| 10.2 | ग्रामीण विकास योजनाएँ | Policies for village |
| 11 | प्रमुख कृषि केंद्र | Major agricultural office |
| 11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices |
Table 4.18 Effect of morphological factors on Hindi queries

| S. No. | Root word / morphological variant | Listing of keywords (Google; Bing; Guruji) | Documents returned: Google | Bing | Guruji |
|---|---|---|---|---|---|
| 1 | भारत वर्षावन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | |
| 1.1 | भारत वर्षावन | | 50500 | 4680 | 485 |
| 1.2 | भारतीय वर्षावनों | | 40400 | 680 | 61 |
| 2 | दुर्घटना | दुर्घटना, दुर्घटनाओं; दुर्घटना; दुर्घटना | | | |
| 2.1 | दुर्घटना | | 133000 | 2410 | 284 |
| 2.2 | दुर्घटनाओं | | 117000 | 420 | 23 |
| 3 | भाषा | भाषा, भाषाओं; भाषा; भाषा | | | |
| 3.1 | भाषा | | 161000 | 8330 | 961 |
| 3.2 | भाषाएँ | | 6200 | 935 | 188 |
| 3.3 | भाषाओं | | 6090 | 441 | 356 |
| 4 | झील | झील, झीलों; झील; झील | | | |
| 4.1 | झील | | 4740 | 278 | 25 |
| 4.2 | झीलें | | 1270 | 28 | 1 |
| 5 | आपदा | आपदा, आपदाओं; आपदा; आपदा | | | |
| 5.1 | आपदा | | 102000 | 4030 | 410 |
| 5.2 | आपदाएँ | | 1160 | 64 | 20 |
| 6 | पक्षी प्रजाति | पक्षी, पक्षियों, प्रजाति, प्रजातियों, प्रजातियाँ; पक्षी प्रजाति; पक्षी प्रजाति | | | |
| 6.1 | पक्षी प्रजाति | | 48200 | 1670 | 98 |
| 6.2 | पक्षियों प्रजाति | | 47600 | 1150 | 84 |
| 6.3 | पक्षियों प्रजातियाँ | | 33800 | 747 | 25 |
| 7 | समस्या | समस्याएँ, समस्याओं, समस्या; समस्या; समस्या | | | |
| 7.1 | समस्या | | 584000 | 30200 | 1889 |
| 7.2 | समस्याओं | | 584000 | 7150 | 1356 |
| 8 | कीटनाशक | कीटनाशक, कीटनाशकों; कीटनाशक; कीटनाशक | | | |
| 8.1 | कीटनाशक | | 36300 | 1360 | 333 |
| 8.2 | कीटनाशकों | | 35800 | 800 | 270 |
| 9 | रोग | रोगों, रोग; रोग; रोग | | | |
| 9.1 | रोग | | 205000 | 21600 | 1423 |
| 9.2 | रोगियों | | 128000 | 3280 | 239 |
| 9.3 | रोगी | | 112000 | 6280 | 647 |
| 10 | योजना | योजनाओं, योजना; योजना; योजना | | | |
| 10.1 | योजना | | 673000 | 18500 | 3343 |
| 10.2 | योजनाओं | | 669000 | 6020 | 990 |
| 10.3 | योजनाएँ | | 673000 | 2860 | 416 |
| 11 | केंद्र | केंद्रीय, केंद्र; केंद्र; केंद्र | | | |
| 11.1 | केंद्र | | 261000 | 11300 | 655 |
| 11.2 | केन्द्रों | | 29500 | 1850 | 105 |
Table 4.19 Precision values of the three search engines

| S. No. | Query | Precision@10: Google | Bing | Guruji |
|---|---|---|---|---|
| 1.1 | भारत वर्षावन | 0.5 | 0.3 | 0.1 |
| 1.2 | भारतीय वर्षावनों | 0.3 | 0.1 | 0.1 |
| 2.1 | हवाई दुर्घटना के कारण | 0.5 | 0.3 | 0.1 |
| 2.2 | हवाई दुर्घटनाओं के कारण | 0.3 | 0.3 | 0.1 |
| 3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5 |
| 3.2 | भारत में बोली जाने वाली भाषाएँ | 0.5 | 0.4 | 0.3 |
| 3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2 |
| 4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2 |
| 4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0.0 |
| 5.1 | प्राकृतिक आपदा | 1.0 | 0.4 | 0.3 |
| 5.2 | प्राकृतिक आपदाएँ | 0.6 | 0.6 | 0.1 |
| 6.1 | पक्षी की प्रजाति | 1.0 | 0.7 | 0.2 |
| 6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1 |
| 6.3 | पक्षियों की प्रजातियाँ | 0.9 | 0.6 | 0.1 |
| 7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2 |
| 7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटनाशक का इस्तेमाल | 1.0 | 0.8 | 0.2 |
| 8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2 |
| 9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4 |
| 9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3 |
| 9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4 |
| 10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3 |
| 10.2 | ग्रामीण विकास योजनाओं | 1.0 | 0.6 | 0.0 |
| 10.3 | ग्रामीण विकास योजनाएँ | 0.8 | 0.6 | 0.2 |
| 11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0.0 |
| 11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2 |

Figure 4.1 Precision values of the three search engines (series: P Google, P Bing, P Guruji; x-axis: query numbers 1.1 to 11.1; y-axis: precision from 0 to 1)
It has been observed that all three search engines return more documents when the query is submitted with the root word. This justifies searching for documents with the root word, because in general better results are obtained with keywords in their root form.
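The mapping from an inflected keyword back to its root form can be illustrated with a light suffix-stripping routine. The following is only a minimal sketch under stated assumptions, not the procedure used in this study: the function name `light_stem` and the four suffix rules are hypothetical and cover just the plural and oblique endings that occur in the sample queries.

```python
# Hypothetical suffix rules (suffix, replacement), checked longest first.
# They cover only endings seen in the sample queries, not Hindi morphology in general.
SUFFIX_RULES = [
    ("ियों", "ी"),  # रोगियों -> रोगी
    ("ओं", ""),     # भाषाओं -> भाषा
    ("एँ", ""),     # योजनाएँ -> योजना
    ("ों", ""),     # कीटनाशकों -> कीटनाशक
]


def light_stem(word: str) -> str:
    """Strip the first matching suffix; leave very short words untouched."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


def stem_query(query: str) -> str:
    """Apply the light stemmer to every whitespace-separated keyword."""
    return " ".join(light_stem(w) for w in query.split())
```

A rule list this small will conflate some unrelated words and miss many inflections; it is meant only to show how a query could be reduced to root forms before it is submitted.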
It has also been observed that only Google lists morphological variants of the root word, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
These results indicate that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, however, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. Precision is calculated over the top 10 results returned by each search engine. On close observation, the precision for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines. This indicates that morphological variation in the keywords affects not only the quantity but also the quality of the results.
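Precision over the top 10 results, as used in Table 4.19, can be computed as follows. This is a minimal sketch: the function name `precision_at_k` and the representation of relevance judgements as a list of booleans are assumptions made for illustration.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int = 10) -> float:
    """Fraction of the first k ranked results that were judged relevant.

    The denominator is always k, so an engine that returns fewer than
    k documents is penalised for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(relevance)[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k
```

For example, a result list in which 9 of the first 10 documents are relevant gives a precision of 0.9, the form in which the values in the tables of this chapter are reported.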
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered. A user may type any of these forms, and the search engine should support all of them.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. Table 4.20 shows the number of results and the precision at 10 obtained for each.
Table 4.20 Results of the search engine on phonetic nature of Hindi language

| Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query submitted to Google | No. of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |
Figure 4.2 Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, each divided among the phonetic variants of that query)
The table and figure above show that the search engine returns documents for every phonetically equivalent form of a Hindi query. It is observed that no particular standard exists for writing the keywords used to fetch Hindi web data. Each phonetically equivalent keyword in a query yields different results, i.e. a different set of documents is retrieved with very little repetition. The precision chart shows that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss information relevant to his/her needs.
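One way to treat phonetically equivalent spellings as the same keyword is to reduce each spelling to a normalized key before comparison. The sketch below is an illustration under stated assumptions, not the method of this study: the function name `phonetic_key` and the three normalization rules (drop the nukta, fold chandrabindu into anusvara, rewrite a nasal consonant plus halant before another consonant as anusvara) are choices made for the example, and the key is deliberately lossy.

```python
import re
import unicodedata

NUKTA = "\u093C"         # dot written under letters such as ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
HALANT = "\u094D"        # ्

# Nasal consonant (ङ ञ ण न म) + halant, when another consonant (क..ह) follows.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091E\u0923\u0928\u092E]" + HALANT + "(?=[\u0915-\u0939])"
)


def phonetic_key(word: str) -> str:
    """Return a lossy key shared by common spelling variants of a Hindi word."""
    text = unicodedata.normalize("NFD", word)    # split precomposed nukta letters
    text = text.replace(NUKTA, "")               # ज़ and ज become the same letter
    text = text.replace(CHANDRABINDU, ANUSVARA)  # ँ and ं become the same sign
    return NASAL_CLUSTER.sub(ANUSVARA, text)     # न्द and ंद become the same cluster
```

With these rules the spellings आन्दोलन, आंदोलन and आँदोलन share one key, as do आज़ादी and आजादी, so results for all of them could be requested or grouped together.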
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आभूषण (ornament in English) has the following commonly spoken synonyms: गहना, जेवर, अलंकार.
Table 4.21 and Figure 4.3 compare the precision values obtained from the three search engines.
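Table 4.21 Effect of word synonyms on Hindi IR

| S. No. | Query | Standard Hindi word / synonyms | Google: documents | Google: P@10 | Bing: documents | Bing: P@10 | Guruji: documents | Guruji: P@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | सोने के आभूषण | आभूषण, गहने | | | | | | |
| 1.1 | सोने के आभूषण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | सोने के गहने | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | सोने के जेवर | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | सोने के अलंकार | | 9490 | 0.5 | 633 | 0.4 | 70 | 0.0 |
| 1.5 | सोने के आभरण | | 493 | 0.5 | 38 | 0.3 | 1 | 0.0 |
| 2 | काले बादल | बादल | | | | | | |
| 2.1 | काले बादल | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | काले मेघ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | काले जलधर | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | स्त्री सशक्तिकरण | स्त्री, नारी | | | | | | |
| 3.1 | स्त्री सशक्तिकरण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | नारी सशक्तिकरण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | महिला सशक्तिकरण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औरत सशक्तिकरण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | सिकंदर का अहंकार | अहंकार | | | | | | |
| 4.1 | सिकंदर का अहंकार | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | सिकंदर का अभिमान | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | सिकंदर का घमंड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | वृक्ष लगाओ | वृक्ष | | | | | | |
| 5.1 | वृक्ष लगाओ | | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | पेड़ लगाओ | | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दरख्त लगाओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0.0 |
| 6 | आँख दान | आँख | | | | | | |
| 6.1 | आँख दान | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | नेत्र दान | | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9 |
| 6.3 | चक्षु दान | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |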
Figure 4.3 Comparison of precision values against three search engines (x-axis: query numbers 1.1 to 6.3; y-axis: precision from 0 to 1)
From the examples above it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. The use of synonyms affects not only the quantity of documents returned but also their quality.
The table and figure above show that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better data than Guruji, and on average Google still stands first, i.e. its precision values are higher than those of Bing and Guruji. Thus it becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in the Hindi information retrieval process.
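Substituting a keyword with its synonyms can be sketched as a simple query expansion step. The code below is illustrative only: the function name `expand_with_synonyms` and the small synonym dictionary are assumptions, and a real system would also have to handle inflected forms and multi-word keywords.

```python
from typing import Dict, List


def expand_with_synonyms(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by one variant per synonym substitution."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            candidate = " ".join(words[:i] + [alternative] + words[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants


# Hypothetical dictionary entry, mirroring the "women's empowerment" query group.
SAMPLE_SYNONYMS = {"स्त्री": ["नारी", "महिला", "औरत"]}
```

Calling `expand_with_synonyms("स्त्री सशक्तिकरण", SAMPLE_SYNONYMS)` yields the original query and three variants, one per synonym, each of which can then be submitted to a search engine separately.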
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22 the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context and the fifth column holds the same keyword in another context. The fourth and sixth columns give the meaning of the query in English for each context of the ambiguous keyword.
Table 4.22 List of randomly selected ambiguous queries

| S. No. | Query | For keyword as | In English | For keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labelled "Other context", holds the number of results that are not relevant to the query supplied or that contain the keyword in some other, non-required context. The results make clear that every search engine returns documents in several contexts, so search engines underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.
For another query, सपेरों का फन (art of snake charmers), which has the alternative reading "snake charmer's snake head", the retrieved documents are expected to be in the context of art. The results show, however, that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. It is therefore important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
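The counts in Tables 4.23 to 4.25 can be summarized per search engine as the share of judged top-10 results that fall in the first listed sense of the keyword. The sketch below is a worked illustration under stated assumptions: treating the first listed context as the intended one follows the discussion of the snake-charmer query, the name `intended_share` is hypothetical, and the query for which Guruji returned no results is excluded from its average.

```python
from typing import Dict, List, Optional, Tuple

# (first listed sense, second sense, other context) among the top 10 results,
# copied from Tables 4.23, 4.24 and 4.25. None marks "No results found".
Row = Optional[Tuple[int, int, int]]

TOP10_COUNTS: Dict[str, List[Row]] = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def intended_share(rows: List[Row]) -> float:
    """Share of judged results that match the first listed sense."""
    scored = [row for row in rows if row is not None]
    judged = sum(sum(row) for row in scored)
    if judged == 0:
        return 0.0
    return sum(row[0] for row in scored) / judged
```

A low share for every engine is another way of expressing the observation above that none of them separates the senses of an ambiguous keyword.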
4.13 Discussion: Influence of English on Hindi Information Retrieval
English has influenced Hindi not only in speech but in writing too, which is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as एक्सरसाइज. It has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following this pattern of phonetic variation, a variety of popular keywords and queries from various domains were tested in the experiments.
Table 4.26 presents a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi

| English word | Hindi forms as listed (Google transliteration, standard Hindi keyword, phonetic equivalents) | Google results: form 1 | Form 2 | Form 3 | Form 4 |
|---|---|---|---|---|---|
| Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000 |

From the above table it can be observed that the search engine returns documents for single-keyword queries, and that documents are also returned, in large numbers, for all phonetic variants of the keywords. It can also be seen that people have their own ways of writing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. The first Hindi form in each row (set in bold in the original table) is the keyword obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be यूनिवर्सिटी, whereas Google transliteration provides उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance इनस य स, Parliament ऩमरमअभट, Corruption क रम िपतओन, Policy ऩ मरसम, Specialist सऩ मसअमरसट. It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (whose design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed for the two levels as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. Some randomly selected queries from the sample set of one hundred queries are presented below in Table 4.27.
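The two-level transformation described above can be sketched as a dictionary-driven replacement. This is not the software described in the next chapter, only a minimal illustration: the function name `transform_levels`, the choice of the first replaceable keyword for level 1 and the replacement of every replaceable keyword for level 2 are assumptions, as is the small dictionary of English equivalents written in Hindi script.

```python
from typing import Dict


def transform_levels(query: str, english_equivalents: Dict[str, str]) -> Dict[str, str]:
    """Build level-1 and level-2 variants of a Hindi query.

    Level 1 replaces only the first keyword that has an English equivalent;
    level 2 replaces every keyword that has one.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in english_equivalents]

    level1 = list(words)
    if replaceable:
        first = replaceable[0]
        level1[first] = english_equivalents[words[first]]

    level2 = [english_equivalents.get(word, word) for word in words]
    return {
        "original": query,
        "level1": " ".join(level1),
        "level2": " ".join(level2),
    }


# Hypothetical dictionary of English equivalents written in Hindi script.
SAMPLE_EQUIVALENTS = {"विदेशी": "फॉरेन", "निवेश": "इन्वेस्टमेंट"}
```

For the query विदेशी निवेश भारत में this produces फॉरेन निवेश भारत में at level 1 and फॉरेन इन्वेस्टमेंट भारत में at level 2, matching the worked example.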
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation

| In English | Hindi query | Influenced Hindi query: Level 1 | Level 2 | Google search results: Hindi query | Level 1 | Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (one chart per query, Hindi Query 1 to Hindi Query 6; series: Hindi query, Level 1, Level 2)
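From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
Table 4.28 Analysis of precision values: tabular representation

| Query | Variant | Precision@10: Google | Bing | AltaVista |
|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | Hindi query | 0.9 | 0.9 | 0.9 |
| हेल्थ और रक्तदान | Level 1 | 0.9 | 0.8 | 0.9 |
| हेल्थ और ब्लड_डोनेशन | Level 2 | 0.9 | 0.8 | 0.8 |
| रक्तचाप का इलाज | Hindi query | 0.8 | 0.8 | 0.8 |
| ब्लडप्रेशर का इलाज | Level 1 | 0.9 | 0.7 | 0.7 |
| ब्लडप्रेशर का ट्रीटमेंट | Level 2 | 0.7 | 0.5 | 0.5 |
| हृदय चिकित्सक | Hindi query | 0.9 | 0.9 | 0.9 |
| हार्ट चिकित्सक | Level 1 | 0.8 | 0.7 | 0.7 |
| कार्डियोलॉजिस्ट | Level 2 | 1.0 | 1.0 | 1.0 |
| सरकार द्वारा रोजगार योजना | Hindi query | 1.0 | 0.8 | 0.6 |
| सरकार द्वारा रोजगार स्कीम | Level 1 | 0.9 | 0.8 | 0.7 |
| सरकार द्वारा एम्प्लॉयमेंट स्कीम | Level 2 | 0.8 | 0.8 | 0.8 |
| विदेशी निवेश भारत में | Hindi query | 0.9 | 0.9 | 0.9 |
| फॉरेन निवेश भारत में | Level 1 | 0.8 | 0.8 | 0.9 |
| फॉरेन इन्वेस्टमेंट भारत में | Level 2 | 0.8 | 0.4 | 0.4 |
| भ्रष्टाचार मुक्त भारत | Hindi query | 1.0 | 1.0 | 1.0 |
| करप्शन मुक्त भारत | Level 1 | 1.0 | 1.0 | 1.0 |
| करप्शन फ्री इंडिया | Level 2 | 1.0 | 1.0 | 1.0 |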
Figure 4.5 Analysis of inter-precision values (Inter Precision Chart; x-axis: HQ1 to HQ6, each followed by L1 and L2, where HQ = Hindi Query and L = Level; series: Google, Bing, AltaVista)
From the above tables and figures it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns. In most cases the degree of relevance of the documents for a Hindi query and for its transformed variations is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are likewise relevant; a similar pattern holds for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, because a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variations that reflect English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool a user can widen the scope of search on the web in the Hindi language [66][67].
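The idea of handling query variation at the interface level, rather than inside the search engine, can be sketched as a thin layer that submits every variant of a query and merges the ranked lists. This is a hypothetical illustration and not the tool described in the next chapter: the function name `search_all_variants`, the round-robin merge and the `search` callable (a stand-in for whatever engine is queried) are all assumptions.

```python
from typing import Callable, Iterable, List


def search_all_variants(
    variants: Iterable[str],
    search: Callable[[str], List[str]],
    limit: int = 10,
) -> List[str]:
    """Interleave the ranked lists returned for each variant, dropping duplicates.

    `search` is any caller-supplied function that maps a query string to a
    ranked list of document identifiers (for example URLs).
    """
    if limit <= 0:
        return []
    ranked_lists = [search(variant) for variant in variants]
    depth = max((len(results) for results in ranked_lists), default=0)

    merged: List[str] = []
    seen = set()
    for rank in range(depth):
        for results in ranked_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) == limit:
                    return merged
    return merged
```

Because the merge takes one result per variant at each rank, documents that only a transformed query retrieves appear near the top of the combined list instead of being hidden behind the results of the original query.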
![Page 38: Chapter 4 Issues in Information Retrieval for Hindi Language](https://reader034.fdocuments.in/reader034/viewer/2022042707/5890454e1a28abc4618b47bd/html5/thumbnails/38.jpg)
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 123
(In Hindi) (आभ र गो की ऩस द)
Here the word आभ is ambiguous The word आभ in above query means common
However In Hindi it also means mango So the above query can be interpreted as
―mango is peoplelsquos choice
Many words are polysemous in nature Finding the correct sense of the words in a
given context is an intricate task One word has more than one meaning and
meaning of word is depends on context of sentence Exampleकय (Tax) having
synonyms बम ज श लक स द भहस ऱ ट कस in one context and in another context कय
(Hand or arms) हसत फ ह आच शफय and कय (to do) कयन in another context
4.8.5 Influence of English on Hindi Information Retrieval
The English language has influenced Indian languages in many ways, including
the pronunciation of Hindi words. Many English words have been localized in
India, and some appear as if they were native Hindi words. Indians are
sometimes unable to find a Hindi equivalent for an English word. For instance,
words such as road, bus, pen, television, radio, please, rail, email, password,
insurance, internet, director, and department are used even by uneducated
Indians, who are often unaware of the language those words come from. Most
Indians use these words in English rather than in their native language.
English influences Hindi not only in speech but also in writing, which becomes
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval.
The effect of the aforesaid factors on Hindi information retrieval is shown in
the following tables and figures.
4.9 Discussion: Morphological Factors
We took a sample set of 50 queries to test the effect of the root word. Table
4.17 lists randomly selected queries from that set, which throw light on the
effect of the root word on the performance of Hindi-language search engines.
Table 4.18 shows examples of the effect of morphological factors on Hindi
queries.
Table 4.17 List of Hindi queries
S. No. | Query in Hindi | Meaning in English
1 | भारत वर्षा वन | Indian rain forest
1.1 | भारतीय वर्षा वनों | Indian rain forests
2 | हवाई दुर्घटना के कारण | Reason for air crash
2.1 | हवाई दुर्घटनाओं के कारण | Reason for air crashes
3 | भारत में बोली जाने वाली भाषा | Language spoken in India
3.1 | भारत में बोली जाने वाली भाषाएँ | Languages spoken in India
3.2 | भारत में बोली जाने वाली भाषाओं | Languages spoken in India
4 | विलुप्त होने पर झील | Lake on the verge of extinction
4.1 | विलुप्त होने पर झीलें | Lakes on the verge of extinction
5 | प्राकृतिक आपदा | Natural calamity
5.1 | प्राकृतिक आपदाएँ | Natural calamities
6 | पक्षी की प्रजाति | Bird species
6.1 | पक्षियों की प्रजाति | Birds' species
6.2 | पक्षियों की प्रजातियाँ | Birds' species
7 | कृषि समस्या | Agriculture problem
7.1 | कृषि समस्याओं | Agriculture problems
8 | कीटनाशक का इस्तेमाल | Use of pesticide
8.1 | कीटनाशकों का इस्तेमाल | Use of pesticides
9 | मानसिक रोग | Mental illness
9.1 | मानसिक रोगियों | Mental patients
9.2 | मानसिक रोगी | Mental patient
10 | ग्रामीण विकास योजना | Policy for village
10.1 | ग्रामीण विकास योजनाओं | Policies for village
10.2 | ग्रामीण विकास योजनाएँ | Policies for village
11 | प्रमुख कृषि केंद्र | Major agricultural office
11.1 | प्रमुख कृषि केन्द्रों | Major agricultural offices
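The singular and plural query pairs in Table 4.17 differ only in a few plural endings, so reducing a keyword to its root form can be sketched as simple suffix stripping. The suffix list and the function name `to_root` below are illustrative assumptions, not the method used in this study; a rule list this small will mis-handle many words (for example, it cannot tell whether a stem originally ended in a short or a long "i").

```python
# Hypothetical, deliberately small table of Hindi plural endings and the
# ending that replaces each one to give an approximate root form.
PLURAL_ENDINGS = {
    "ियों": "ी",   # रोगियों -> रोगी
    "ियाँ": "ी",
    "ाओं": "ा",    # दुर्घटनाओं -> दुर्घटना
    "ाएँ": "ा",    # भाषाएँ -> भाषा
    "ों": "",      # कीटनाशकों -> कीटनाशक
    "ें": "",      # झीलें -> झील
}


def to_root(word):
    """Strip one known plural ending, trying the longest endings first."""
    for ending in sorted(PLURAL_ENDINGS, key=len, reverse=True):
        # Require at least two characters of stem so short words survive.
        if word.endswith(ending) and len(word) > len(ending) + 1:
            return word[: -len(ending)] + PLURAL_ENDINGS[ending]
    return word


def query_to_root(query):
    """Apply to_root to every whitespace-separated keyword of a query."""
    return " ".join(to_root(w) for w in query.split())
```

An interface layer could submit both the original query and `query_to_root(query)` and compare the document counts, in the way the root and variant rows are compared in the tables.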
Table 4.18 Effect of morphological factors on Hindi queries
S. No. | Root word(s) / morphological variant | Listing of keywords (Google, Bing, Guruji) | Documents returned: Google | Bing | Guruji
1 | भारत, वर्षा वन | भारत वर्षा वन; वर्षा वन; वन; वर्षा वन; वन; भारत वर्षा वन; वन | | |
1.1 | भारत वर्षा वन | | 50500 | 4680 | 485
1.2 | भारतीय वर्षा वनों | | 40400 | 680 | 61
2 | दुर्घटना | दुर्घटना; दुर्घटनाओं; दुर्घटना; दुर्घटना | | |
2.1 | दुर्घटना | | 133000 | 2410 | 284
2.2 | दुर्घटनाओं | | 117000 | 420 | 23
3 | भाषा | भाषा; भाषाओं; भाषा; भाषा | | |
3.1 | भाषा | | 161000 | 8330 | 961
3.2 | भाषाएँ | | 6200 | 935 | 188
3.3 | भाषाओं | | 6090 | 441 | 356
4 | झील | झील; झीलों; झील; झील | | |
4.1 | झील | | 4740 | 278 | 25
4.2 | झीलें | | 1270 | 28 | 1
5 | आपदा | आपदा; आपदाओं; आपदा; आपदा | | |
5.1 | आपदा | | 102000 | 4030 | 410
5.2 | आपदाएँ | | 1160 | 64 | 20
6 | पक्षी, प्रजाति | पक्षी; पक्षियों; प्रजाति; प्रजातियों; प्रजातियाँ; पक्षी प्रजाति; पक्षी प्रजाति | | |
6.1 | पक्षी प्रजाति | | 48200 | 1670 | 98
6.2 | पक्षियों प्रजाति | | 47600 | 1150 | 84
6.3 | पक्षियों प्रजातियाँ | | 33800 | 747 | 25
7 | समस्या | समस्याएँ; समस्याओं; समस्या; समस्या; समस्या | | |
7.1 | समस्या | | 584000 | 30200 | 1889
7.2 | समस्याओं | | 584000 | 7150 | 1356
8 | कीटनाशक | कीटनाशक; कीटनाशकों; कीटनाशक; कीटनाशक | | |
8.1 | कीटनाशक | | 36300 | 1360 | 333
8.2 | कीटनाशकों | | 35800 | 800 | 270
9 | रोग | रोगों; रोग; रोग; रोग | | |
9.1 | रोग | | 205000 | 21600 | 1423
9.2 | रोगियों | | 128000 | 3280 | 239
9.3 | रोगी | | 112000 | 6280 | 647
10 | योजना | योजनाओं; योजना; योजना; योजना | | |
10.1 | योजना | | 673000 | 18500 | 3343
10.2 | योजनाओं | | 669000 | 6020 | 990
10.3 | योजनाएँ | | 673000 | 2860 | 416
11 | केंद्र | केंद्रीय; केंद्र; केंद्र; केंद्र | | |
11.1 | केंद्र | | 261000 | 11300 | 655
11.2 | केन्द्रों | | 29500 | 1850 | 105
Table 4.19 Precision values of the three search engines
S. No. | Query | Precision@10: Google | Bing | Guruji
1.1 | भारत वर्षा वन | 0.5 | 0.3 | 0.1
1.2 | भारतीय वर्षा वनों | 0.3 | 0.1 | 0.1
2.1 | हवाई दुर्घटना के कारण | 0.5 | 0.3 | 0.1
2.2 | हवाई दुर्घटनाओं के कारण | 0.3 | 0.3 | 0.1
3.1 | भारत में बोली जाने वाली भाषा | 0.9 | 0.5 | 0.5
3.2 | भारत में बोली जाने वाली भाषाएँ | 0.5 | 0.4 | 0.3
3.3 | भारत में बोली जाने वाली भाषाओं | 0.5 | 0.2 | 0.2
4.1 | विलुप्त होने पर झील | 0.5 | 0.3 | 0.2
4.2 | विलुप्त होने पर झीलें | 0.5 | 0.3 | 0
5.1 | प्राकृतिक आपदा | 1 | 0.4 | 0.3
5.2 | प्राकृतिक आपदाएँ | 0.6 | 0.6 | 0.1
6.1 | पक्षी की प्रजाति | 1 | 0.7 | 0.2
6.2 | पक्षियों की प्रजाति | 0.9 | 0.6 | 0.1
6.3 | पक्षियों की प्रजातियाँ | 0.9 | 0.6 | 0.1
7.1 | कृषि समस्या | 0.7 | 0.3 | 0.2
7.2 | कृषि समस्याओं | 0.7 | 0.4 | 0.2
8.1 | कीटनाशक का इस्तेमाल | 1 | 0.8 | 0.2
8.2 | कीटनाशकों का इस्तेमाल | 0.9 | 0.6 | 0.2
9.1 | मानसिक रोग | 0.7 | 0.6 | 0.4
9.2 | मानसिक रोगियों | 0.9 | 0.6 | 0.3
9.3 | मानसिक रोगी | 0.9 | 0.5 | 0.4
10.1 | ग्रामीण विकास योजना | 0.9 | 0.7 | 0.3
10.2 | ग्रामीण विकास योजनाओं | 1 | 0.6 | 0
10.3 | ग्रामीण विकास योजनाएँ | 0.8 | 0.6 | 0.2
11.1 | प्रमुख कृषि केंद्र | 0.5 | 0.4 | 0
11.2 | प्रमुख कृषि केन्द्रों | 0.4 | 0.2 | 0.2
Figure 4.1 Precision values of the three search engines (chart: precision from 0 to 1 for queries 1.1 through 11.1; series Google, Bing, Guruji)
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching with root
words, because in general we get better results with keywords in their root
form.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
These results suggest that only Google indexes document keywords in their root
form. Bing and Guruji do not appear to index in that form, which would explain
why they retrieve fewer documents than Google. The overall comparison of
results from the three search engines in the tables above shows that, in
general, the quantity of results retrieved increases when keywords are used in
their root form. For search engines, however, the quality of results is more
important than the quantity. Figure 4.1 and Table 4.19 compare the precision
values of the three search engines. The precision value is calculated from the
top 10 results of each search engine. On closely observing the results, we can
say that Google's precision is high in almost all queries. Since Google appears
to index keywords in their root form, the relevancy of its results is also
higher than that of the other two search engines. This indicates that
morphological variation in the keywords affects not only the quantity but also
the quality of results.
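The precision value used in Table 4.19 can be sketched as follows. The function name and the boolean relevance judgments are illustrative assumptions; the sketch also assumes that a missing result counts as non-relevant, so the denominator stays at 10 even when an engine returns fewer documents.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a list of booleans, one per returned result, in rank
    order. Results beyond rank k are ignored; if fewer than k results
    were returned, the missing ranks count as non-relevant (assumption).
    """
    top = relevance[:k]
    return sum(1 for judged_relevant in top if judged_relevant) / k


def mean_precision(per_query_relevance, k=10):
    """Average precision@k over several queries for one search engine."""
    scores = [precision_at_k(r, k) for r in per_query_relevance]
    return sum(scores) / len(scores)
```

For example, a query whose top 10 results contain five relevant documents gets a precision of 0.5, which is how the 0.5 entries in the table would arise under these assumptions.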
4.10 Discussion: Phonetic Nature of the Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may type any of these
spellings, and search engines should support all phonetically equivalent
words.
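One way to treat phonetically equivalent spellings alike is to map each spelling to a shared key before comparing. The sketch below is an illustration under assumptions, not the method of this study: it covers only three common sources of variation (the nukta dot, chandrabindu versus anusvara, and a half nasal consonant versus anusvara), and the function name `phonetic_key` is hypothetical.

```python
import re
import unicodedata

NUKTA = "\u093c"         # dot below, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
VIRAMA = "\u094d"        # ्
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # ङ ञ ण न म

# A half nasal (nasal + virama) directly before a non-nasal consonant.
HALF_NASAL = re.compile(
    "[" + NASALS + "]" + VIRAMA + "(?![" + NASALS + "])(?=[\u0915-\u0939])"
)


def phonetic_key(text):
    """Collapse a few common Hindi spelling variants onto one key."""
    # NFD splits precomposed nukta letters into base letter + nukta.
    text = unicodedata.normalize("NFD", text)
    text = text.replace(NUKTA, "")
    text = text.replace(CHANDRABINDU, ANUSVARA)
    text = HALF_NASAL.sub(ANUSVARA, text)
    return unicodedata.normalize("NFC", text)


def same_sound(a, b):
    """True when two spellings share the same phonetic key."""
    return phonetic_key(a) == phonetic_key(b)
```

The key is meant only for grouping variants; it is not necessarily a valid spelling to submit to a search engine.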
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. Table 4.20 shows the number of results and the
precision obtained for each query variant.
Table 4.20 Results of the search engine on the phonetic nature of the Hindi language
Hindi query (standard keywords in bold) | Phonetic variations of the keywords | Query variant submitted to Google | No. of results | Precision@10
ससजिमो भ िहयीर ऩद थम | सिबजमो, सबज मो, जहयीर, जहरयर | ससजिमोिहयीर | 97 | 0.9
 | | ससजिमो जहयीर | 632 | 0.9
 | | सिबजमो िहयीर | 194 | 0.9
आसभान छ त भहॊगाई | आसभ, भह ग ई, भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
 | | आसभ न भहॉगाई | 1040 | 1.0
 | | आसभ भह ग ई | 14 | 0.6
 | | आसभाॊ भह ग ई | 563 | 0.7
भरषटाचाय स आिादी | भरशट च य, बयषटट च य, आज दी | भरषटाचायआज दी | 211000 | 0.8
 | | भरशटाचायआज दी | 214 | 0.6
 | | बयषटाचाय आज दी | 447 | 0.7
 | | भरषटट च यआिादी | 1090000 | 0.9
 | | भरशट च य आिादी | 1040 | 0.7
 | | बयषटट च यआिादी | 1190 | 0.8
अननाहिाय क आनदोरन | अनन हज य, आ द रन, आ द रन | अनन हज य आनदोरन | 84700 | 0.3
 | | अनन हज य आॊदोरन | 85100 | 0.8
 | | अनन हज य क आॉदोरन | 78 | 0.6
 | | अनना हिाय क आनद रन | 399 | 0.5
 | | अनन हिाय क आनद रन | 3260000 | 1.0
फयोिगायी सभसम सभ ध न | फ य जग यी, फय जग यी, फ य जा यी | फयोिगायी | 9650 | 0.9
 | | फयोिगायी | 80600 | 1.0
 | | फयोिगायी | 170 | 0.7
 | | फयोिगायी | 30 | 0.5
Figure 4.2 Precision charts for the phonetic nature of the Hindi language (one chart each for Query No. 1 through Query No. 5, one segment per query variant)
From the table and figure above, it can be seen that search engines return
documents for the various phonetically equivalent Hindi queries. No particular
standard exists for writing the keywords used to fetch Hindi web data. For
every phonetically equivalent keyword in the query, the results vary; that is,
a different set of documents is retrieved, with little repetition. The
precision charts show that the degree of relevance for queries containing
phonetically equivalent keywords is nearly equal. A native Hindi user may not
be aware of the phonetic issues in Hindi IR and may therefore miss relevant
information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations, and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आभूषण (ornament in English) has the following commonly
spoken synonyms: गहने, जेवर, अलंकार.
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
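Replacing a keyword with each of its synonyms, one at a time, can be sketched as below. The dictionary holds only the ornament example from the text, and both the dictionary and the function name are illustrative assumptions; a real system would need a full Hindi thesaurus.

```python
# Hypothetical synonym dictionary holding only the example from the text.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query followed by one variant per synonym.

    Only one keyword is replaced in each variant, so the meaning of the
    query stays as close as possible to the original.
    """
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variant = " ".join(words[:i] + [alternative] + words[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants
```

Calling `expand_with_synonyms("सोने के आभूषण")` yields the original query plus three variants, each of which can be submitted separately, as is done for the numbered sub-queries in the synonym table.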
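Table 4.21 Effect of word synonyms on Hindi IR
S. No. | Query (standard Hindi word and its synonyms) | Google: documents | Precision@10 | Bing: documents | Precision@10 | Guruji: documents | Precision@10
1 | सोने के आभूषण (standard word: आभूषण) | | | | | |
1.1 | सोने के आभूषण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | सोने के गहने | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | सोने के जेवर | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | सोने के अलंकार | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | सोने के आभरण | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | काले बादल (standard word: बादल) | | | | | |
2.1 | काले बादल | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | काले मेघ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | काले जलधर | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | स्त्री सशक्तिकरण (standard word: स्त्री; synonym: नारी) | | | | | |
3.1 | स्त्री सशक्तिकरण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | नारी सशक्तिकरण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | महिला सशक्तिकरण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औरत सशक्तिकरण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | सिकंदर का अहंकार (standard word: अहंकार) | | | | | |
4.1 | सिकंदर का अहंकार | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | सिकंदर का अभिमान | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | सिकंदर का घमंड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | वृक्ष लगाओ (standard word: वृक्ष) | | | | | |
5.1 | वृक्ष लगाओ | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | पेड़ लगाओ | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दरख्त लगाओ | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आँख दान (standard word: आँख) | | | | | |
6.1 | आँख दान | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | नेत्रदान | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | चक्षु दान | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1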
Figure 4.3 Comparison of precision values for the three search engines (chart: precision from 0 to 1 for queries 1.1 through 6.3, one series per search engine)
The examples above show that using Hindi keywords together with their synonyms
improves information retrieval for a Hindi query. Synonyms affect not only the
quantity of documents returned but also their quality.
The table and figure above show that Google returns more documents than the
other two search engines and that Guruji returns the fewest, possibly because
fewer documents are available to it or because its indexing is poor. However,
we are interested in the quality of results rather than the quantity. As far as
quality is concerned, Google and Bing clearly provide better results than
Guruji, and on average Google ranks first: its precision values are higher than
those of Bing and Guruji. It thus becomes clear that replacing a keyword with
its synonym yields equivalent results. Synonyms of keywords therefore play an
important role in Hindi information retrieval.
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected
queries below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the
fifth column holds the same keyword in the other context. The fourth and sixth
columns hold the English meaning of the query with respect to the ambiguous
keyword in each context.
Table 4.22 List of randomly selected ambiguous queries
S. No. | Query | Keyword in first context | In English | Keyword in second context | In English
1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep
2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice
3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition
4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head
5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war
The ambiguous queries listed above were tested against three search engines:
Google, Bing, and Guruji. The results are shown in the tables below.
Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आम | 488000 | Common | 3 | Mango | 3 | 4
3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0
4 | फन | 184 | Art | 0 | Snake head | 10 | 0
5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5
Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आम | 17800 | Common | 3 | Mango | 3 | 4
3 | बाल | 4030 | Children | 6 | Hair | 2 | 2
4 | फन | 25 | Art | 0 | Snake head | 9 | 1
5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
The results in the tables above show that all three search engines return
documents without differentiating between the contexts of the keyword in the
query. The last column, labeled "Other context", holds the number of results
that are not relevant to the query supplied or that contain the keyword in
another, unintended context. The results make it clear that all the search
engines return documents in different contexts, so it can be said that search
engines underperform when supplied with ambiguous queries. The numbers in the
"Other context" column signify the deviation from relevance. For example, for
the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context"
column contains 5 documents for Google, 8 for Bing, and all 10 for Guruji.
For another query, सपेरों का फन (art of snake charmers), whose other context is
"snake charmer's snake head", the retrieved documents are expected to be in the
"art" context. However, Google returns all 10 documents in the unintended
context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even
a single document. In such a scenario, it becomes important for search engines
to address ambiguity in keywords in order to obtain better results.
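The per-context counts in the ambiguity tables can be reproduced from hand-labeled results with a small tally. This is only a bookkeeping sketch: the label strings, the function name, and the choice to report everything outside the intended context as "off target" are assumptions made for illustration.

```python
from collections import Counter


def context_breakdown(labels, intended):
    """Summarize hand-assigned context labels for the top results.

    `labels` holds one label per result (for example "aggregate",
    "family", or "other"); `intended` is the context the user meant.
    """
    counts = Counter(labels)
    total = len(labels)
    on_target = counts[intended]
    return {
        "intended": on_target,
        "off_target": total - on_target,
        "share_on_target": on_target / total if total else 0.0,
        "by_context": dict(counts),
    }


# Labels matching the Google counts reported for query 5 (2, 3, and 5).
google_q5 = ["aggregate"] * 2 + ["family"] * 3 + ["other"] * 5
summary = context_breakdown(google_q5, intended="aggregate")
```

Here `summary["intended"]` is 2 and `summary["off_target"]` is 8, which makes the deviation from the intended sense explicit for one query.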
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, which becomes
especially evident in Hindi literature on the web. The influence of English on
Hindi has been observed to be one of the most important parameters for Hindi
information retrieval, as the following example illustrates.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). It has
the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज,
एकसयस इज. Following such phonetic variations, a variety of popular keywords and
queries from various domains were tested.
Table 4.26 shows a sample set of common and popular English keywords along with
their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | standard | equivalent 1 | equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000
Table 4.26 shows that the search engine returns documents for single-keyword
queries, and that documents are also returned, in large numbers, for all
phonetic variants of the keywords. It can also be seen that people have their
own ways of writing Hindi words and that no standard is followed for storing
Hindi data on the web. Documents are retrieved for every phonetic variant of an
English keyword written in Hindi script. The "Google transliteration" column
(bold Hindi entries in the original table) shows the keywords obtained using
the Google transliteration tool. Table 4.26 shows that transliteration does not
provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be म तनवमसमटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. The
table shows that 1540 documents were retrieved for the wrong keyword
उतनव मसमतम (University), and the same holds for other single-word queries:
Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy
(ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website
developers make use of unchecked and non-standard transliteration, which makes
the Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of
English influence on the precision and quantity of documents retrieved. The
Hindi query is transformed into its variants by the software (its design and
working are discussed in the next chapter), which replaces Hindi keywords with
English keywords written in Hindi script without changing the meaning of the
query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by English equivalents, without changing the meaning of the original
Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन)
and निवेश means "investment" (इनवेस्टमेंट). The query is transformed at the two
levels as:
फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इनवेस्टमेंट भारत में (Foreign investment Bharat mein)
Therefore, the original Hindi query "विदेशी निवेश भारत में" ("videshi nivesh
Bharat mein") supplied by the user is transformed into two equivalent senses
containing a mixture of English and Hindi, while the meaning of the query
remains the same. From the sample set of one hundred queries, some randomly
selected queries are presented in Table 4.27.
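The two-level transformation can be sketched as a dictionary lookup. The loanword table below holds only the two keywords from the "Foreign investment in India" example, and the function name and return shape are illustrative assumptions rather than the design of the software described in the next chapter.

```python
# Hypothetical table: Hindi keyword -> English equivalent in Hindi script.
LOANWORDS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इनवेस्टमेंट",
}


def transform_levels(query, loanwords=LOANWORDS):
    """Return (level_1, level_2) lists of transformed queries.

    Level 1 replaces exactly one Hindi keyword with its English
    equivalent; level 2 replaces every keyword found in the table and is
    produced only when more than one keyword can be replaced.
    """
    words = query.split()
    positions = [i for i, w in enumerate(words) if w in loanwords]

    level_1 = [
        " ".join(loanwords[w] if j == i else w for j, w in enumerate(words))
        for i in positions
    ]
    level_2 = []
    if len(positions) > 1:
        level_2.append(" ".join(loanwords.get(w, w) for w in words))
    return level_1, level_2
```

For the example query, level 1 contains one variant per replaceable keyword and level 2 contains the single fully replaced query; the original query is left untouched so that it can still be submitted alongside its variants.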
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query, level 1 | Influenced Hindi query, level 2 | Google search results: Hindi query | Level 1 | Level 2
Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020
Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150
Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770
Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93
Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इनवेस्टमेंट भारत में | 658000 | 527 | 66
Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000
Figure 4.4 Transformed queries into two equivalent senses (one chart each for Hindi Query 1 through Hindi Query 6; segments: Hindi query, Level 1, Level 2)
The table and figure above show that documents are returned for the original as
well as the transformed Hindi queries, and that the quantity of retrieved
documents is considerable. For search engines, the quality of results is more
important than the quantity, so Table 4.28 and Figure 4.5 are presented below
for the analysis of precision values. Three popular search engines, namely
Google, Bing, and AltaVista, were used to retrieve web results.
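Table 4.28 Analysis of precision values (tabular representation)
Query | Form | Query text | Precision@10: Google | Bing | AltaVista
HQ1 | Hindi query | स्वास्थ्य और रक्तदान | 0.9 | 0.9 | 0.9
HQ1 | Level 1 | हेल्थ और रक्तदान | 0.9 | 0.8 | 0.9
HQ1 | Level 2 | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.8 | 0.8
HQ2 | Hindi query | रक्तचाप का इलाज | 0.8 | 0.8 | 0.8
HQ2 | Level 1 | ब्लडप्रेशर का इलाज | 0.9 | 0.7 | 0.7
HQ2 | Level 2 | ब्लडप्रेशर का ट्रीटमेंट | 0.7 | 0.5 | 0.5
HQ3 | Hindi query | हृदय चिकित्सक | 0.9 | 0.9 | 0.9
HQ3 | Level 1 | हार्ट चिकित्सक | 0.8 | 0.7 | 0.7
HQ3 | Level 2 | कार्डियोलॉजिस्ट | 1 | 1 | 1
HQ4 | Hindi query | सरकार द्वारा रोजगार योजना | 1 | 0.8 | 0.6
HQ4 | Level 1 | सरकार द्वारा रोजगार स्कीम | 0.9 | 0.8 | 0.7
HQ4 | Level 2 | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 0.8 | 0.8 | 0.8
HQ5 | Hindi query | विदेशी निवेश भारत में | 0.9 | 0.9 | 0.9
HQ5 | Level 1 | फॉरेन निवेश भारत में | 0.8 | 0.8 | 0.9
HQ5 | Level 2 | फॉरेन इनवेस्टमेंट भारत में | 0.8 | 0.4 | 0.4
HQ6 | Hindi query | भ्रष्टाचार मुक्त भारत | 1 | 1 | 1
HQ6 | Level 1 | करप्शन मुक्त भारत | 1 | 1 | 1
HQ6 | Level 2 | करप्शन फ्री इंडिया | 1 | 1 | 1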
Figure 4.5 Analysis of inter-precision values (inter-precision chart; HQ = Hindi query, L = level; series Google, Bing, AltaVista)
The tables and figures above show that Hindi data of a similar nature can be
mined for Hindi queries by transforming them into variants that include English
keywords written in Hindi script. Transforming the queries increased the amount
of data retrieved. The relevance of the retrieved data can be seen in the
precision columns. For every Hindi query and its transformed variations, the
degree of relevance of the documents is very close, equal, or improved. For
example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents
are relevant, and for the transformed queries of similar nature, हेल्थ और
रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are also relevant.
The same pattern repeats for the rest of the Hindi queries, as shown in the
table above. Without transformation of Hindi queries, the user may miss
relevant information, because a Hindi user may not be aware that such
information is present on the web and may be unable to formulate the query
variations that arise from the influence of English. From the table above it
can be said that English-influenced Hindi information is present on the web and
is increasing day by day. By including English keywords in Hindi script in the
query, the scope of searching in Hindi and of getting relevant information can
be increased.
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
could face performance and throughput problems if parameters such as Hindi
phonetics, synonyms, and English-equivalent Hindi keywords were implemented at
the root level. This problem can, however, be solved at the interface level.
Therefore, to reduce the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the
next chapter) that acts as an interface between the user and the search
engines. With the help of this tool, the user can widen the scope of web search
in the Hindi language [66][67].
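An interface layer of this kind can be sketched as a function that submits every query variant and merges the ranked results. The sketch below is an assumption about how such a layer might work, not a description of the tool in the next chapter: `search` is a hypothetical callable supplied by the caller that returns a ranked list of URLs for one query, and the round-robin merge is just one possible way to combine lists.

```python
from itertools import zip_longest


def merge_results(result_lists, limit=10):
    """Interleave ranked result lists rank by rank, dropping duplicates."""
    merged, seen = [], set()
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]


def search_all_variants(variants, search, limit=10):
    """Submit each query variant through `search` and merge the results.

    `search` is a hypothetical function: it takes a query string and
    returns a ranked list of result URLs from some search engine.
    """
    return merge_results([search(v) for v in variants], limit)
```

Because the transformation happens before the queries reach the search engine, the engine itself needs no changes, which is the point of solving the problem at the interface level.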
![Page 39: Chapter 4 Issues in Information Retrieval for Hindi Language](https://reader034.fdocuments.in/reader034/viewer/2022042707/5890454e1a28abc4618b47bd/html5/thumbnails/39.jpg)
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 124
Table 417 List of Hindi queries
49 Discussion Morphological Factors
We have taken a sample set of 50 queries to test the affect of the root
word Following Table 417 is the set of randomly selected queries from the set
which throw light on the effect of the root word on the performance of Hindi
language search engines Table 418 shows the examples for effect of
morphological factors on Hindi queries
S No Query in Hindi Meaning In English SNO Query in Hindi Meaning In English
1 ब यतवष मवन Indian rain forest 62 ऩकषमोकीपरज ततम Birdlsquos species
11 ब यत मवष मवनो Indian rain forests 7 क षषसभसम Agriculture problem
2 हव ईद घमटन क क यण Reason for air crash 71 क षषसभसम ओ Agriculture problems
21 हव ईद घमटन ओ क क यण Reason for air crashes 8 कीटन शकक इसत भ र Use of pesticide
3 ब यतभफ रीज न व रीब ष Language spoken in India 81 कीटन शकोक इसत भ र Use of pesticides
31 ब यतभफ रीज न व रीब ष ए Languages spoken in India 9 भ नमसकय ग Mental illness
32 ब यतभफ रीज न व रीब ष ओ Languages spoken in India 91 भ नमसकय चगमो Mental patients
4 षवर पतह न ऩयझ र Lake on the verge of
extinction 92 भ नमसकय ग Mental Patient
41 षवर पतह न ऩयझ र Lakes on the verge of
extinction 10 गर भ णषवक सम जन Policy for village
5 पर क ततकआऩद Natural calamity 101 गर भ णषवक सम जन ओ Policies for village
51 पर क ततकआऩद ए Natural calamities 102 गर भ णषवक सम जन ए Policies for village
6 ऩ कीपरज तत Bird species 11 परभ खक षषक दर Major agricultural
office
61 ऩकषमोकीपरज तत Birdlsquos species 111 परभ खक षषक नदरो Major agricultural
offices
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 125
Table 418 Effect of morphological factors on Hindi queries
S No Root
word
s
Listing of Keywords Morphological
variants
Documents Returned
Google Bing Guruji Google Bing Guruji
1
ब यत
वष मवन
ब यतवष मवन वष मवनवन
वष मवनवन ब यतवष मवन
वन
11 ब यत
वष मवन 50500 4680 485
12 ब यत मवष मवनो
40400 680 61
2 द घमटन
द घमटन द घमटन ओ
द घमटन द घमटन 21 द घमटन 133000 2410 284
22 द घमटन ओ 117000 420 23
3 ब ष ब ष ब ष ओ ब ष ब ष
31 ब ष 161000 8330 961
32 ब ष ए 6200 935 188
33 ब ष ओ 6090 441 356
4 झ र झ रझ रो झ र झ र
41 झ र 4740 278 25
42 झ र 1270 28 1
5 आऩद
आऩद आऩद ओ
आऩद आऩद 51 आऩद 102000 4030 410
52 आऩद ए 1160 64 20
6
ऩ परज तत
ऩ ऩकषमोपरज ततपरज ततमोपरज तत
म
ऩ परज तत
ऩ परज तत
61 ऩ परज तत 48200 1670 98
62 ऩकषमोपरज तत
47600 1150 84
63 ऩकषमोपरज ततम
33800 747 25
7 सभसम
सभसम ए सभसम ओ सभ
सम सभसम सभसम
71 सभसम 584000 30200 1889
72 सभसम ओ 584000 7150 1356
8
कीटन शक
कीटन शक
कीटन शको कीटन शक कीटन शक
81 कीटन शक 36300 1360 333
82 कीटन शको 35800 800 270
9 य ग य गोय ग य ग य ग
91 य ग 205000 21600 1423
92 य चगमो 128000 3280 239
93 य ग 112000 6280 647
10 म जन
म जन ओ म जन
म जन म जन
101 म जन 673000 18500 3343
102 म जन ओ 669000 6020 990
103 म जन ए 673000 2860 416
11 क दर क दरीमक दर क दर क दर
111 क दर 261000 11300 655
112 क नदरो 29500 1850 105
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 126
Table 419 precision values of the three search engines
Figure 41 precision values of the three search engines
0
02
04
06
08
1
11 21 31 33 42 52 62 71 81 91 93 102 111
P GoogleP BingP Guruji
S No Query Precision 10 SNO Query Precision 10
Google Bing Guruji Google Bing Guruji
11 ब यतवष मवन 05 03 01 63 ऩकषमोकीपरज ततम
09 06 01
12 ब यत मवष मवनो 03 01 01 71 क षषसभसम 07 03 02
22 हव ईद घमटन क क यण
05 03 01 72 क षषसभसम ओ 07 04 02
22 हव ईद घमटन ओ क क यण
03 03 01 81 कीटन शकक इसत भ र
1 08 02
31 ब यतभफ रीज न व रीब ष
09 05 05 82 कीटन शकोक इसत भ र
09 06 02
32 ब यतभफ रीज न व रीब ष ए
05 04 03 91 भ नमसकय ग 07 06 04
33 ब यतभफ रीज न व रीब ष ओ
05 02 02 92 भ नमसकय चगमो 09 06 03
41 षवर पतह न ऩयझ र 05 03 02 93 भ नमसकय ग 09 05 04
42 षवर पतह न ऩयझ र 05 03 0 101 गर भ णषवक सम जन
09 07 03
51 पर क ततकआऩद 1 04 03 102 गर भ णषवक सम जन ओ
1 06 0
52 पर क ततकआऩद ए 06 06 01 103 गर भ णषवक सम जन ए
08 06 02
61 ऩ कीपरज तत 1 07 02 111 परभ खक षषक दर 05 04 0
62 ऩकषमोकीपरज तत 09 06 01 112 परभ खक षषक नदरो 04 02 02
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 127
It has been observed that documents returned by all three search engines are more
in number when query with root word is submitted This justifies the searching of
documents in the root word because in general we get better results with the
keywords in their root form
It has also been observed that only Google shows listing of morphological
variants of root words where as Bing and Guruji show only listing of root word
supplied in almost all the sample queries listed above in the table
From the above results it is evident that only Google indexes the documents
keyword in their root form Bing and Guruji do not index in that form that is the
reason number of documents retrieved in their case is less in comparison to
Google The overall comparison of results from the three search engines in tables
above show that in general the quantity of results retrieved increased when the
keywords are used in their root form In case of search engines the quality of
results is more important than the quantity Figure 41and table 419 shows the
comparison of the precision values of the three search engines The precision
value is calculated by taking the top 10 results of the search engines On closely
observing the results we can say that precision value in case of Google is high in
almost all queries As mentioned above Google does its indexing in the root form
of keywords it can be said that that relevancy of the results is also high in Google
in comparison to other two search engines which denotes that not only quantity
but the quality of results is also affected by the morphological variations in the
keywords
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any spelling of a
keyword when searching, and search engines should support all phonetically
equivalent words.
The following queries were randomly selected from the set of 50 queries tested on
the Google search engine. The table below shows the number of results and the
precision for each query variant.

Table 4.20: Results of the search engine on the phonetic nature of Hindi language

| Hindi query (standard keywords) | Phonetic variations of the keywords | Query variant submitted to Google | No. of results | Precision@10 |
|---|---|---|---|---|
| ससजिमो भ िहयीर ऩद थम | सिबजमो सबज मो जहयीर जहरयर | ससजिमोिहयीर | 97 | 0.9 |
| | | ससजिमो जहयीर | 632 | 0.9 |
| | | सिबजमो िहयीर | 194 | 0.9 |
| आसभान छ त भहॊगाई | आसभ भह ग ई भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0 |
| | | आसभ न भहॉगाई | 1040 | 1.0 |
| | | आसभ भह ग ई | 14 | 0.6 |
| | | आसभाॊ भह ग ई | 563 | 0.7 |
| भरषटाचाय स आिादी | भरशट च य बयषटट च य आज दी | भरषटाचायआज दी | 211000 | 0.8 |
| | | भरशटाचायआज दी | 214 | 0.6 |
| | | बयषटाचाय आज दी | 447 | 0.7 |
| | | भरषटट च यआिादी | 1090000 | 0.9 |
| | | भरशट च य आिादी | 1040 | 0.7 |
| | | बयषटट च यआिादी | 1190 | 0.8 |
| अननाहिाय क आनदोरन | अनन हज य आ द रनआ द रन | अनन हज य आनदोरन | 84700 | 0.3 |
| | | अनन हज य आॊदोरन | 85100 | 0.8 |
| | | अनन हज य क आॉदोरन | 78 | 0.6 |
| | | अनना हिाय क आनद रन | 399 | 0.5 |
| | | अनन हिाय क आनद रन | 3260000 | 1.0 |
| फयोिगायी सभसम सभ ध न | फ य जग यी फय जग यी फ य जा यी | फयोिगायी | 9650 | 0.9 |
| | | फयोिगायी | 80600 | 1.0 |
| | | फयोिगायी | 170 | 0.7 |
| | | फयोिगायी | 30 | 0.5 |

Figure 4.2: Precision charts for the phonetic nature of Hindi language [one chart each for Query No. 1 to Query No. 5]
From the above table and figure it can be clearly seen that search engines return
documents for each of the phonetically equivalent Hindi queries. It is observed
that no particular standard exists for writing the keywords used to fetch Hindi
web data. For every phonetically equivalent keyword in the query the results vary,
i.e., a different set of documents is retrieved, with minimal repetition. From the
precision chart it is clearly observed that the degree of relevance for queries
containing phonetically equivalent keywords is almost the same. A native Hindi
user may not be aware of the phonetic issues in Hindi IR and may miss relevant
information of his/her use.
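One way to keep a user from missing such documents is to expand the query into all of its spelling variants before searching. The sketch below illustrates the idea under stated assumptions: the variant table `SPELLING_VARIANTS` and the function `expand_query` are hypothetical, and the two spelling pairs (anusvara versus an explicit nasal consonant) are chosen here only as familiar examples, not taken from the experiments.

```python
from itertools import product

# Illustrative spelling variants (anusvara versus explicit nasal consonant).
# A real table would be much larger and curated by hand or learned from data.
SPELLING_VARIANTS = {
    "हिंदी": ["हिंदी", "हिन्दी"],
    "आंदोलन": ["आंदोलन", "आन्दोलन"],
}


def expand_query(query, variants=SPELLING_VARIANTS):
    """Return every query obtained by swapping keywords for known variants."""
    # Tokens without an entry in the table are kept unchanged.
    choices = [variants.get(token, [token]) for token in query.split()]
    return [" ".join(combo) for combo in product(*choices)]


for variant in expand_query("हिंदी आंदोलन"):
    print(variant)
```

The number of variants grows multiplicatively with the number of keywords that have alternatives, so a practical tool would likely cap or rank the expansions.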
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word (आब षण), "ornament" in English, has the following
commonly spoken synonyms: गहन, ज वय, अर क य.
Table 4.21 and Figure 4.3 below show the comparison of precision values across the
three search engines.
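Table 4.21: Effect of word synonyms on Hindi IR

| S. No. | Query | Standard Hindi word / synonyms | Google (documents) | P@10 | Bing (documents) | P@10 | Guruji (documents) | P@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | स न क आब षण | आब षणगहन | | | | | | |
| 1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर | | | | | | |
| 2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | |
| 4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | |
| 5.1 | व रग ओ | | 6960 | 1 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | |
| 6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1 | 3240 | 1 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |
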
Figure 4.3: Comparison of precision values against three search engines [chart of precision on a scale of 0 to 1 for queries 1.1 to 6.3 of Table 4.21]
From the examples above it is observed that using Hindi keywords along with their
synonyms improves information retrieval for a query in the Hindi language. Using
synonyms of Hindi keywords affects not only the quantity of documents returned but
also their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the fewest;
the reason may be the availability of fewer documents or poor indexing. However,
we are interested in the quality of results rather than the quantity. As far as
quality is concerned, it can be clearly seen that Google and Bing provide
better-quality data than Guruji. In the average case Google still stands first,
meaning that its precision values are higher than those of Bing and Guruji. Thus
it becomes clear that by replacing a keyword with its synonym, equivalent results
can be obtained. Therefore it is evident that synonyms of keywords play an
important role in Hindi information retrieval.
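The "average case" comparison can be made concrete by averaging the precision values per engine. The sketch below does this for the first query group of Table 4.21 only (rows 1.1 to 1.5), reading extracted values such as 08 as 0.8; the names `ROWS`, `ENGINES` and `mean_precision` are assumptions introduced for illustration.

```python
from statistics import mean

ENGINES = ("Google", "Bing", "Guruji")

# Precision at 10 for rows 1.1 to 1.5 of Table 4.21, in ENGINES order.
ROWS = {
    "1.1": (0.8, 0.7, 0.5),
    "1.2": (0.8, 0.6, 0.5),
    "1.3": (0.8, 0.7, 0.6),
    "1.4": (0.5, 0.4, 0.0),
    "1.5": (0.5, 0.3, 0.0),
}


def mean_precision(rows):
    """Average the precision values column by column, one column per engine."""
    columns = zip(*rows.values())
    return {engine: mean(column) for engine, column in zip(ENGINES, columns)}


for engine, value in mean_precision(ROWS).items():
    print(f"{engine}: {value:.2f}")
```

For this group the ordering is Google, then Bing, then Guruji, which agrees with the observation in the text; the other query groups could be added to `ROWS` in the same way.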
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are
presented below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the fifth
column holds the same ambiguous keyword in another context. The fourth and sixth
columns give the English meaning of the query for the respective context of the
ambiguous keyword.
Table 4.22: List of randomly selected ambiguous queries

| S. No. | Query | Keyword (first context) | In English | Keyword (second context) | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (Gold) | Women like gold | सोना (To sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (Common) | Common man's choice | आम (Mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (Children) | Child development and nutrition | बाल (Hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (Art) | Art of snake charmers | फन (Snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (Aggregate) | Aggregate destruction in wars | कुल (Family) | Destruction of families in war |

The ambiguous queries listed in Table 4.22 were tested against three search
engines: Google, Bing and Guruji. The results are shown in Tables 4.23 to 4.25.

Table 4.23: Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24: Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |

Table 4.25: Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | First context | Results found | Second context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the above tables it is observed that all three search engines
return documents without differentiating between the contexts of the keyword in
the query. The last column, labeled "Other context", holds the number of results
that are not relevant to the query supplied, or that contain the keyword in some
other, unintended context. From the results it is clear that all the search
engines return documents in different contexts. Therefore it can be said that
search engines underperform when supplied with ambiguous queries. The numbers in
the "Other context" column signify the deviation from relevance. For example, for
the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context"
column contains 5 documents for Google, 8 documents for Bing and all 10 documents
for Guruji.
For another query, सपेरों का फन (art of snake charmers), whose other context is the
snake charmer's snake head, the retrieved documents are expected to be in the
context "art". However, from the above results it can be seen that Google returns
all 10 documents in the unintended context (snake head) and Bing returns 9,
whereas Guruji fails to retrieve even a single document. In such a scenario it
becomes important for search engines to address the issue of ambiguity in keywords
in order to obtain better results.
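The per-context counts in the ambiguity tables can be tallied from hand-assigned labels on the top 10 results. The following sketch assumes such labels exist; the function `context_breakdown`, the label strings and the deviation ratio are hypothetical helpers, with the sample data mirroring query 1 of the Google test (5 results about gold, 2 about sleeping, 3 in other contexts).

```python
from collections import Counter


def context_breakdown(labels, first, second):
    """Count results per context; any label other than the two is 'other'."""
    counts = Counter(labels)
    other = sum(n for label, n in counts.items()
                if label not in (first, second))
    return {"first": counts[first], "second": counts[second], "other": other}


# Hand-assigned context labels for the top 10 results of one query.
labels = ["gold"] * 5 + ["sleep"] * 2 + ["unrelated"] * 3
breakdown = context_breakdown(labels, first="gold", second="sleep")

# Share of the top results that fall outside both expected contexts.
deviation = breakdown["other"] / len(labels)
print(breakdown, deviation)
```

A higher deviation means more of the top results fall outside either sense of the ambiguous keyword.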
4.13 Discussion: Influence of English on Hindi Information retrieval
The English language influences Hindi not only in speech but in writing too. This
becomes especially evident in Hindi literature on the web. The influence of
English on the Hindi language has been observed to be one of the most important
parameters for Hindi information retrieval, as the following example illustrates.
Example: The English word exercise is written in Hindi as (एकसयस इज). The word
exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज,
एकसयस इज, एकस यस इज, एकसयस इज. Based on such phonetic variations, a variety of
popular keywords and queries from various domains have been tested in the
experiments.
Table 4.26 below shows a sample set of common and popular English keywords along
with their phonetic variants written in Hindi.
Table 4.26: Keywords along with their phonetic variants written in Hindi

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (transliteration) | Google results (standard) | Google results (equivalent 1) | Google results (equivalent 2) |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

From the above table it can be observed that the search engine returns documents
for single-keyword queries, and that documents are also returned, in large
numbers, for all phonetic variants of the keywords. It can also be seen that
people have their own ways of writing such words in Hindi and that no standard is
followed for storing Hindi data on the web. Documents are retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above
table, the Google transliteration column shows the keywords obtained using the
Google transliteration tool. Table 4.26 shows that transliteration does not
provide the correct Hindi word in most cases. For example, the correct
transliteration of the word University should be म तनवमसमटी, whereas Google
transliteration provides the word उतनव मसमतम, which is completely wrong. It is
clearly evident from the table above that 1540 documents were retrieved for the
wrong keyword उतनव मसमतम (University), and the same holds for other single-word
queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन),
Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website
developers make use of unchecked and non-standard transliteration, which makes the
Hindi IR process a difficult task.
In the next example, multiword Hindi queries are selected to test the effect of
English influence on the precision and quantity of documents retrieved. Each Hindi
query is transformed into its variants by the software (whose design and working
are discussed in the next chapter) by replacing Hindi keywords with English
keywords written in Hindi script, without changing the meaning of the query.
The queries are converted at two levels. In the first level, one Hindi word is
replaced by its English equivalent; in the second level, more than one word is
replaced by its English equivalent, without changing the meaning of the original
Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न)
and तनव श means "investment" (इनव सटभट). The query is transformed at the two
levels as:
Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat
mein) supplied by the user is transformed into two equivalent queries containing a
mixture of English and Hindi words, while the meaning of the query remains the
same. From the sample set of one hundred queries, some randomly selected queries
are presented below in Table 4.27.
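The two-level transformation can be sketched as a simple keyword substitution. This is an illustration under stated assumptions, not the software described in the next chapter: the lookup table `ENGLISH_EQUIVALENTS` and the function `transform_levels` are hypothetical, the tokens are romanized for readability (the actual queries use Devanagari script throughout), and Level 2 is taken here to replace every keyword that has an equivalent.

```python
# Hypothetical lookup from romanized Hindi keywords to English equivalents.
ENGLISH_EQUIVALENTS = {"videshi": "foreign", "nivesh": "investment"}


def transform_levels(query, equivalents=ENGLISH_EQUIVALENTS):
    """Return (level_1, level_2) variants of a whitespace-tokenized query.

    Level 1 replaces only the first keyword that has an English equivalent.
    Level 2 replaces every keyword that has one.
    """
    tokens = query.split()
    replaceable = [i for i, tok in enumerate(tokens)
                   if tok.lower() in equivalents]

    level_1 = list(tokens)
    if replaceable:
        first = replaceable[0]
        level_1[first] = equivalents[tokens[first].lower()]

    level_2 = [equivalents.get(tok.lower(), tok) for tok in tokens]
    return " ".join(level_1), " ".join(level_2)


level_1, level_2 = transform_levels("videshi nivesh Bharat mein")
print(level_1)
print(level_2)
```

Words without an entry in the table, such as Bharat and mein here, pass through unchanged, so the meaning of the query is preserved.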
Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

| In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2) |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4: Transformed queries into two equivalent senses [one chart each for Hindi Query 1 to Hindi Query 6, comparing the Hindi query with Level 1 and Level 2]
From the above table and figure it is evident that documents are returned for the
original as well as the transformed Hindi queries, and that the quantity of
retrieved documents is considerable. For search engines, the quality of results is
more important than the quantity; therefore Table 4.28 and Figure 4.5 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing and AltaVista, are used for retrieving web results.
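Table 4.28: Analysis of precision values (tabular representation). Precision@10 is given per engine for the Hindi query (HQ), Level 1 (L1) and Level 2 (L2).

| Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Figure 4.5: Analysis of inter-precision values [inter-precision chart for HQ (Hindi Query), L1 and L2 (Level) of HQ1 to HQ6; series: Google, Bing, AltaVista]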
From the above tables and figures it can be clearly seen that Hindi data of a
similar nature can be mined against Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in the data retrieved. The relevance of the
retrieved data can be seen in the precision columns. For every Hindi query and its
transformed variations, the degree of relevance of the documents is very close,
equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the
first 10 documents are relevant, and for the transformed queries of similar
nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents
are also relevant; the same pattern repeats for the rest of the Hindi queries, as
shown in the table above. Without transformation of Hindi queries, the user may
miss the chance of retrieving relevant information, as the Hindi user may not be
aware of the presence of such information on the web and may be unable to
formulate the query variations that reflect English influence. From the above
table it can be said that English-influenced Hindi information is present on the
web and is increasing day by day. By including English keywords in Hindi script in
the query, the scope of searching in Hindi and of getting relevant information can
be increased.
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. Generally people do not follow the actual Hindi writing standard,
which widens the gap between Hindi web data and users.
The relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they may
face performance and throughput problems if parameters such as Hindi phonetics,
synonyms and English-equivalent Hindi keywords are implemented at root level.
However, this problem can be solved at the interface level. Therefore, to lessen
the effort required of a Hindi user to search for such information, a software
tool has been developed (described in detail in the next chapter) that acts as an
interface between the user and search engines. With the help of this tool the user
can widen the scope of search on the web in the Hindi language [66][67].
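The interface-level idea can be illustrated with a thin wrapper that submits every query variant and merges the results. This sketch is not the tool described in the next chapter: `search_all_variants`, the `search` callable and the `FAKE_INDEX` stand-in are assumptions made here so that the example runs without contacting any real search engine.

```python
def search_all_variants(variants, search, top_k=10):
    """Run each query variant through `search` and merge the hits.

    `search` is any callable mapping a query string to a ranked list of URLs.
    Duplicates are dropped, keeping the first occurrence.
    """
    merged, seen = [], set()
    for query in variants:
        for url in search(query)[:top_k]:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Stand-in for a real search engine call, used only for demonstration.
FAKE_INDEX = {
    "videshi nivesh": ["example.org/a", "example.org/b"],
    "foreign nivesh": ["example.org/b", "example.org/c"],
}
results = search_all_variants(FAKE_INDEX, lambda q: FAKE_INDEX.get(q, []))
print(results)
```

Because the expansion happens before the engine is called, the engine itself needs no changes, which is the point of solving the problem at the interface level.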
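Table 4.18: Effect of morphological factors on Hindi queries

| S. No. | Root word / morphological variant | Keywords listed by Google, Bing and Guruji (as extracted) | Documents returned (Google) | Documents returned (Bing) | Documents returned (Guruji) |
|---|---|---|---|---|---|
| 1 | ब यत वष मवन | ब यतवष मवन वष मवनवन वष मवनवन ब यतवष मवन वन | | | |
| 1.1 | ब यत वष मवन | | 50500 | 4680 | 485 |
| 1.2 | ब यत मवष मवनो | | 40400 | 680 | 61 |
| 2 | द घमटन | द घमटन द घमटन ओ द घमटन द घमटन | | | |
| 2.1 | द घमटन | | 133000 | 2410 | 284 |
| 2.2 | द घमटन ओ | | 117000 | 420 | 23 |
| 3 | ब ष | ब ष ब ष ओ ब ष ब ष | | | |
| 3.1 | ब ष | | 161000 | 8330 | 961 |
| 3.2 | ब ष ए | | 6200 | 935 | 188 |
| 3.3 | ब ष ओ | | 6090 | 441 | 356 |
| 4 | झ र | झ रझ रो झ र झ र | | | |
| 4.1 | झ र | | 4740 | 278 | 25 |
| 4.2 | झ र | | 1270 | 28 | 1 |
| 5 | आऩद | आऩद आऩद ओ आऩद आऩद | | | |
| 5.1 | आऩद | | 102000 | 4030 | 410 |
| 5.2 | आऩद ए | | 1160 | 64 | 20 |
| 6 | ऩ परज तत | ऩ ऩकषमोपरज ततपरज ततमोपरज ततम ऩ परज तत ऩ परज तत | | | |
| 6.1 | ऩ परज तत | | 48200 | 1670 | 98 |
| 6.2 | ऩकषमोपरज तत | | 47600 | 1150 | 84 |
| 6.3 | ऩकषमोपरज ततम | | 33800 | 747 | 25 |
| 7 | सभसम | सभसम ए सभसम ओ सभसम सभसम सभसम | | | |
| 7.1 | सभसम | | 584000 | 30200 | 1889 |
| 7.2 | सभसम ओ | | 584000 | 7150 | 1356 |
| 8 | कीटन शक | कीटन शक कीटन शको कीटन शक कीटन शक | | | |
| 8.1 | कीटन शक | | 36300 | 1360 | 333 |
| 8.2 | कीटन शको | | 35800 | 800 | 270 |
| 9 | य ग | य गोय ग य ग य ग | | | |
| 9.1 | य ग | | 205000 | 21600 | 1423 |
| 9.2 | य चगमो | | 128000 | 3280 | 239 |
| 9.3 | य ग | | 112000 | 6280 | 647 |
| 10 | म जन | म जन ओ म जन म जन म जन | | | |
| 10.1 | म जन | | 673000 | 18500 | 3343 |
| 10.2 | म जन ओ | | 669000 | 6020 | 990 |
| 10.3 | म जन ए | | 673000 | 2860 | 416 |
| 11 | क दर | क दरीमक दर क दर क दर | | | |
| 11.1 | क दर | | 261000 | 11300 | 655 |
| 11.2 | क नदरो | | 29500 | 1850 | 105 |
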
Table 4.19: Precision values of the three search engines

| S. No. | Query | Precision@10 (Google) | Precision@10 (Bing) | Precision@10 (Guruji) |
|---|---|---|---|---|
| 1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1 |
| 1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1 |
| 2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1 |
| 2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1 |
| 3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5 |
| 3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3 |
| 3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2 |
| 4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2 |
| 4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0 |
| 5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3 |
| 5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1 |
| 6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2 |
| 6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1 |
| 6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1 |
| 7.1 | क षषसभसम | 0.7 | 0.3 | 0.2 |
| 7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2 |
| 8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2 |
| 8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2 |
| 9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4 |
| 9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3 |
| 9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4 |
| 10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3 |
| 10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0 |
| 10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2 |
| 11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0 |
| 11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2 |

Figure 4.1: Precision values of the three search engines [chart of precision on a scale of 0 to 1 for queries 1.1 to 11.2; series: P Google, P Bing, P Guruji]
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 127
It has been observed that documents returned by all three search engines are more
in number when query with root word is submitted This justifies the searching of
documents in the root word because in general we get better results with the
keywords in their root form
It has also been observed that only Google shows listing of morphological
variants of root words where as Bing and Guruji show only listing of root word
supplied in almost all the sample queries listed above in the table
From the above results it is evident that only Google indexes the documents
keyword in their root form Bing and Guruji do not index in that form that is the
reason number of documents retrieved in their case is less in comparison to
Google The overall comparison of results from the three search engines in tables
above show that in general the quantity of results retrieved increased when the
keywords are used in their root form In case of search engines the quality of
results is more important than the quantity Figure 41and table 419 shows the
comparison of the precision values of the three search engines The precision
value is calculated by taking the top 10 results of the search engines On closely
observing the results we can say that precision value in case of Google is high in
almost all queries As mentioned above Google does its indexing in the root form
of keywords it can be said that that relevancy of the results is also high in Google
in comparison to other two search engines which denotes that not only quantity
but the quality of results is also affected by the morphological variations in the
keywords
410 Discussion Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving the results against
phonetically equivalent words of keywords entered to search User may use any
keyword for searching and search engines should be capable to support all
phonetically equivalent words
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 128
Following are randomly selected queries from the set of 50 queries tested on
Google search engine Tables below show the results and precision offered by
Table 420 results of the search engine on Phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 129
Figure 42 Precision Charts for Phonetic nature of Hindi language
In the above table and figure it can be clearly seen that search engines return a
handful of documents on various Hindi phonetically equivalent queries It is
observed that no particular standard exists for writing the keyword to fetch Hindi
34
33
33
Query No 1
1 11 12
31
30
18
21
Query No 2
2 21 22 23
18
13
1520
16
18
Query No 3
3 31 32
33 34 35
5823
109
Query No 4
1st Qtr 2nd Qtr
3rd Qtr 4th Qtr
29
32
23
16
Query No 5
5 51 52 53
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 130
web data For every phonetically equivalent keywords in the query variation in
the results exist Ie a different set of documents are retrieved with least
repetition From the precision chart it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost same
or nearly equal The native Hindi user may not be aware of the Phonetic issues in
Hindi IR and may miss the relevant information of hisher use
411 Discussion Words Synonyms
A word can express a myriad of implications connotations and attitudes
in addition to its basic ―dictionary meaning And a word often has near
synonyms that differ from it solely in these nuances of meaning Choosing the
right word can be difficult for people as well as for the information retrieval
system For example the word (आब षण) in Hindi (Ornament) in English has
following commonly spoken synonyms गहन ज वयअर क य
Table 421 and Figure 43 have been presented below which shows the
comparison of precision values against three search engines
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 131
Table 421 Effect of word synonyms on Hindi IR
S NO Query Standard
Hindi
Words
Synonyms Documents Returned
Google Per
10
Bing Per
10
Guruj
i
Per
10
1 स न क आब
षण
आब षणगहन
11 स न क आब षण 217000 08 3250 07 381 05
12 स न क गहन 188000 08 2590 06 389 05
13 स न क ज वय 78900 08 1670 07 311 06
14 स न क अर क य 9490 05 633 04 70 0
15 स न क आबयण 493 05 38 03 1 0
2 क र फ दर फ दर
21 क र फ दर 233000 07 7510 07 733 03
22 क र भ घ 40700 09 1500 08 99 06
23 क र जरधय 1570 06 54 06 2 02
3 सतर सशिकतकयण
सतर न यी
31 सतर सशिकतकयण
9950 09 1570 07 760 06
32 न यीसशिकतकयण
29300 09 1910 09 736 04
33 भहहर सशिकतकयण
96300 09 5160 08 1091 03
34 औयतसशिकतकयण
7670 08 680 07 510 02
4 मसक दयक अ हक य
अ हक य
41 मसक दयक अ हक य
1990 04 18 03 60 06
42 मसक दयक अमबभ न
2400 06 304 02 16 01
43 मसक दयक घभ ड
495 05 54 06 9 01
5 व रग ओ व
51 व रग ओ 6960 1 698 08 29 05
52 ऩ डरग ओ 13400 1 1080 09 143 09
53 दयखतरग ओ 481 05 19 05 0 0
6 आ खद न आ ख
61 आ खद न 34000 07 3690 05 312 03
62 न तरद न 77500 1 3240 1 159 09
63 च द न 2450 08 427 01 36 01
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 132
Figure 43 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords with their
synonyms improves the information retrieval against a query in Hindi language
Not only quantity of documents returned is affected but quality is also affected by
using synonyms of Hindi keywords
From the above table and figure it is to be observed that documents returned by
Google are more in quantity than other two search engines and least number of
documents get returned by Guruji search engine the reason behind may be
availability of less documents or poor indexing However we are interested in
quality of results than quantity As far as quality of results is concerned it can be
clearly seen that Google and Bing provide quality data than Guruji And in the
average case Google still stands first in the row that means precision values by
Google are more than that of Bing and Guruji in this case Thus it becomes clear
that by changing a keyword into its synonym equivalent results can be obtained
Therefore it is evident that synonyms of keywords play an important role in the
process of Hindi information retrieval system
412 Discussion Ambiguity
In a sample set of 50 ambiguous queries below we present five randomly
selected ambiguous queries In figure 35 second column contains five queries in
Hindi third column holds the ambiguous keyword in one context and fifth
column holds the same ambiguous keyword in other context Fourth and sixth
columns hold the meaning of queries in English with respect to the ambiguous
keyword in context
0
02
04
06
08
1
11 12 13 14 15 21 22 23 31 32 33 34 41 42 43 51 52 53 61 62 63
Bing
Guruji
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 133
Table 422 List of randomly selected ambiguous queries
Ambiguous queries mentioned above in the figure are tested for results against
three search engines Google Bing and Guruji Results are shown below in tables
Table 423 Ambiguity test for Google
Table 424 Ambiguity test for Bing
SNo Query For keyword
as
In English For keyword as In English
1 न यी क स न ऩस द ह स न (Gold) Women like Gold स न (To sleep)
Women like to
sleep
2 आभ र गो की ऩस द आभ
(common)
Common manlsquos
choice आभ (Mango) Mango is
peoplelsquos choice
3 फ र षवक सऔय ऩ षण फ र (Children) Child Development
and Nutrition फ र(Hair) Hair
Development and
Nutrition
4 सऩ यो क पन पन (Art) Art of snake
charmers पन(Snake head) Snake charmerlsquos
snake head
5 म दध भ क र षवन श क र
(Aggregate)
Aggregate
destruction in wars क र(family) Destruction of
families in war
Query Ambiguous
keyword Documents
returned
Google Other
Context Results Found
Context Context
1 स न 50800 Gold 5 To sleep 2 3
2 आभ 488000 Common 3 Mango 3 4
3 फ र 2900000 Children 7 Hair 3 0
4 पन 184 Art 0 Snake head 10 0
5 क र 17800 Aggregate 2 Family 3 5
Query Ambiguous
keyword Documents
returned
Bing Other
Context Results Found
Context Context
1 स न 2680 Gold 2 To sleep 2 6
2 आभ 17800 Common 3 Mango 3 4
3 फ र 4030 Children 6 Hair 2 2
4 पन 25 Art 0 Snake head 9 1
5 क र 1900 Aggregate 0 Family 2 8
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 134
From the results in the tables, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other Context", holds the number of results that are not relevant to the supplied query or that contain the keyword in some other, non-required context. The results make it clear that all the search engines return documents in different contexts, so it can be said that search engines underperform when supplied with ambiguous queries. The numbers in the "Other Context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other Context" column contains 5 documents for Google, 8 documents for Bing, and all 10 documents for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context "art". However, the results show that Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In this scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.

Table 4.25: Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context | Results found | Context | Results found | Other context
1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10
2 | आम | 6756 | Common | 3 | Mango | 0 | 7
3 | बाल | 635 | Children | 5 | Hair | 2 | 3
4 | फन | No results found | Art | n/a | Snake head | n/a | n/a
5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10
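
The counts in the ambiguity tables can be read as shares of the ten inspected results. The following is only an illustrative sketch: the function name `context_shares` is hypothetical, and treating the first listed context as the intended one is an assumption made here, based on the two worked examples in the text. The numbers are the query 5 rows from Tables 4.23 to 4.25.

```python
def context_shares(intended, alternate, other):
    """Return the share of inspected results that fall in each context.

    intended  -- results in the context the query is assumed to ask for
    alternate -- results in the keyword's second listed sense
    other     -- results in any other context ("Other Context" column)
    """
    total = intended + alternate + other
    if total == 0:
        # No results at all, as for Guruji on query 4
        return None
    return {
        "intended": intended / total,
        "alternate": alternate / total,
        "other": other / total,
    }


# Query 5 rows (Aggregate, Family, Other Context) from the three tables
query5 = {"Google": (2, 3, 5), "Bing": (0, 2, 8), "Guruji": (0, 0, 10)}

for engine, counts in query5.items():
    print(engine, context_shares(*counts))
```

A larger "other" share corresponds to a larger deviation from relevance in the sense used above.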
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but in writing too. This becomes especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.
Example: The English word "exercise" is written in Hindi as (एकसयस इज). The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic variation shown in this example, a variety of popular keywords and queries from various domains were tested experimentally.
The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26: Keywords along with their phonetic variants written in Hindi
English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results (Google transliteration) | Google results (standard keyword) | Google results (equivalent 1) | Google results (equivalent 2)
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

From the table above, it can be observed that the search engine returns documents for a single-keyword query, and that documents are also returned, in large numbers, for all phonetic variants of the keyword. It can also be seen that people have their own ways of representing Hindi words and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the Google transliteration column (bold Hindi entries in the original) shows the keywords obtained using the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR, in terms of the precision and quantity of documents retrieved. Each Hindi query is transformed into its variants by the software (its design and working are discussed in the next chapter) by replacing Hindi keywords with English keywords written in Hindi, without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents, without changing the meaning of the original Hindi query.
Example: The English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as:
Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Thus the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
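
The two-level replacement described above can be sketched as follows. This is not the author's software (which is described in the next chapter); it is a minimal illustration under stated assumptions: the function name `transform_query`, the whitespace tokenization, and the lookup table `equivalents` are all hypothetical. Romanized tokens stand in for Devanagari text, following the romanization given in the example.

```python
from itertools import combinations


def transform_query(tokens, equivalents):
    """Build Level 1 and Level 2 variants of a tokenized query.

    Level 1: exactly one keyword replaced by its English equivalent.
    Level 2: more than one keyword replaced.
    """
    # Positions of tokens that have a known English equivalent
    replaceable = [i for i, tok in enumerate(tokens) if tok in equivalents]
    levels = {1: [], 2: []}
    for k in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, k):
            variant = [
                equivalents[tok] if i in chosen else tok
                for i, tok in enumerate(tokens)
            ]
            levels[1 if k == 1 else 2].append(" ".join(variant))
    return levels


# Romanized stand-ins for the example query; a real run would use Devanagari
tokens = ["videshi", "nivesh", "bharat", "mein"]
equivalents = {"videshi": "foreign", "nivesh": "investment"}

variants = transform_query(tokens, equivalents)
print(variants[1])
print(variants[2])
```

With two replaceable keywords this yields two Level 1 variants and one Level 2 variant; the example in the text shows one variant per level.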
Table 4.27: Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation)
In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results (Hindi query) | Google results (Level 1) | Google results (Level 2)
Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020
Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150
Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770
Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93
Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66
Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000

Figure 4.4: Transformed queries into two equivalent senses (charts of result counts for Hindi Query 1 to Hindi Query 6, each with Level 1 and Level 2)
From the table and figure above, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing, and AltaVista, were used to retrieve web results.

Table 4.28: Analysis of precision values (tabular representation)
Precision @ 10 is listed per search engine in the order Hindi query, Level 1, Level 2.
Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google | Bing | AltaVista
सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.9, 0.8
यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8, 0.9, 0.7 | 0.8, 0.7, 0.5 | 0.8, 0.7, 0.5
रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9, 0.8, 1 | 0.9, 0.7, 1 | 0.9, 0.7, 1
सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1, 0.9, 0.8 | 0.8, 0.8, 0.8 | 0.6, 0.7, 0.8
षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.4 | 0.9, 0.9, 0.4
भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1, 1, 1 | 1, 1, 1 | 1, 1, 1
Figure 4.5: Analysis of inter precision values (inter precision chart for Google, Bing, and AltaVista over HQ1 to HQ6 with L1 and L2; HQ = Hindi Query, L = Level)

From the tables and figures above, it can be seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns: for a Hindi query and its transformed variations, the degree of relevance of the documents is in most cases very close, equal, or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; a similar pattern holds for the rest of the Hindi queries in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, as a Hindi user may not be aware that such information is present on the web and may be unable to formulate the variant query based on the factor of English influence. From the table above, it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms, and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort a Hindi user needs to search for such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and search engines. With the help of this tool, the user can widen the scope of web search in the Hindi language [66][67].
Table 4.19: Precision values of the three search engines
S.No. | Query | Precision @ 10, Google | Precision @ 10, Bing | Precision @ 10, Guruji
1.1 | ब यतवष मवन | 0.5 | 0.3 | 0.1
1.2 | ब यत मवष मवनो | 0.3 | 0.1 | 0.1
2.1 | हव ईद घमटन क क यण | 0.5 | 0.3 | 0.1
2.2 | हव ईद घमटन ओ क क यण | 0.3 | 0.3 | 0.1
3.1 | ब यतभफ रीज न व रीब ष | 0.9 | 0.5 | 0.5
3.2 | ब यतभफ रीज न व रीब ष ए | 0.5 | 0.4 | 0.3
3.3 | ब यतभफ रीज न व रीब ष ओ | 0.5 | 0.2 | 0.2
4.1 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0.2
4.2 | षवर पतह न ऩयझ र | 0.5 | 0.3 | 0
5.1 | पर क ततकआऩद | 1 | 0.4 | 0.3
5.2 | पर क ततकआऩद ए | 0.6 | 0.6 | 0.1
6.1 | ऩ कीपरज तत | 1 | 0.7 | 0.2
6.2 | ऩकषमोकीपरज तत | 0.9 | 0.6 | 0.1
6.3 | ऩकषमोकीपरज ततम | 0.9 | 0.6 | 0.1
7.1 | क षषसभसम | 0.7 | 0.3 | 0.2
7.2 | क षषसभसम ओ | 0.7 | 0.4 | 0.2
8.1 | कीटन शकक इसत भ र | 1 | 0.8 | 0.2
8.2 | कीटन शकोक इसत भ र | 0.9 | 0.6 | 0.2
9.1 | भ नमसकय ग | 0.7 | 0.6 | 0.4
9.2 | भ नमसकय चगमो | 0.9 | 0.6 | 0.3
9.3 | भ नमसकय ग | 0.9 | 0.5 | 0.4
10.1 | गर भ णषवक सम जन | 0.9 | 0.7 | 0.3
10.2 | गर भ णषवक सम जन ओ | 1 | 0.6 | 0
10.3 | गर भ णषवक सम जन ए | 0.8 | 0.6 | 0.2
11.1 | परभ खक षषक दर | 0.5 | 0.4 | 0
11.2 | परभ खक षषक नदरो | 0.4 | 0.2 | 0.2

Figure 4.1: Precision values of the three search engines (precision of Google, Bing, and Guruji per query, on a scale from 0 to 1)
It has been observed that all three search engines return more documents when a query with the root word is submitted. This justifies searching for documents with the root word, because in general we get better results with keywords in their root form.
It has also been observed that only Google lists morphological variants of root words, whereas Bing and Guruji list only the root word supplied, in almost all the sample queries in the table above.
From these results it is evident that only Google indexes document keywords in their root form. Bing and Guruji do not index in that form, which is why they retrieve fewer documents than Google. The overall comparison of results from the three search engines in the tables above shows that, in general, the quantity of results retrieved increases when keywords are used in their root form. For search engines, the quality of results is more important than the quantity. Figure 4.1 and Table 4.19 compare the precision values of the three search engines. The precision value is calculated over the top 10 results of each search engine. On close observation, the precision value for Google is high in almost all queries. Since, as mentioned above, Google indexes keywords in their root form, the relevance of its results is also higher than that of the other two search engines, which indicates that morphological variations in keywords affect not only the quantity but also the quality of results.
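
The precision value used throughout these tables is computed over the top 10 results. A minimal sketch of that calculation is given below; the function name `precision_at_k` and the representation of relevance judgments as a list of booleans are assumptions made for illustration, and dividing by k even when fewer than k results are returned is one common convention, not something the text specifies.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results judged relevant.

    relevance -- list of booleans, one per returned result, best rank first
    k         -- cutoff; 10 matches the tables in this chapter
    """
    top = relevance[:k]
    # Missing results (fewer than k returned) count as non-relevant
    return sum(1 for judged in top if judged) / k


# Nine relevant documents among the first ten
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```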
4.10 Discussion: Phonetic Nature of Hindi Language and Spelling Variations
Search engines should be capable of retrieving results for phonetically equivalent forms of the keywords entered in a search. A user may use any of these forms when searching, so search engines should support all phonetically equivalent words.
The following are randomly selected queries from the set of 50 queries tested on the Google search engine. The table below shows the number of results and the precision obtained for each query.

Table 4.20: Results of the search engine on phonetic nature of Hindi language
Query | Hindi query (standard keywords in bold in the original) | Phonetic variations of the keywords | Google results for query having keywords | No. of results | Precision @ 10
1 | ससजिमो भ िहयीर ऩद थम | सिबजमो, सबज मो; जहयीर, जहरयर | ससजिमोिहयीर | 97 | 0.9
  |  |  | ससजिमो जहयीर | 632 | 0.9
  |  |  | सिबजमो िहयीर | 194 | 0.9
2 | आसभान छ त भहॊगाई | आसभ; भह ग ई, भ ह ग ई | आसभ न भहॊगाई | 35300 | 1.0
  |  |  | आसभ न भहॉगाई | 1040 | 1.0
  |  |  | आसभ भह ग ई | 14 | 0.6
  |  |  | आसभाॊ भह ग ई | 563 | 0.7
3 | भरषटाचाय स आिादी | भरशट च य, बयषटट च य; आज दी | भरषटाचायआज दी | 211000 | 0.8
  |  |  | भरशटाचायआज दी | 214 | 0.6
  |  |  | बयषटाचाय आज दी | 447 | 0.7
  |  |  | भरषटट च यआिादी | 1090000 | 0.9
  |  |  | भरशट च य आिादी | 1040 | 0.7
  |  |  | बयषटट च यआिादी | 1190 | 0.8
4 | अननाहिाय क आनदोरन | अनन हज य; आ द रन, आ द रन | अनन हज य आनदोरन | 84700 | 0.3
  |  |  | अनन हज य आॊदोरन | 85100 | 0.8
  |  |  | अनन हज य क आॉदोरन | 78 | 0.6
  |  |  | अनना हिाय क आनद रन | 399 | 0.5
  |  |  | अनन हिाय क आनद रन | 3260000 | 1.0
5 | फयोिगायी सभसम सभ ध न | फ य जग यी, फय जग यी, फ य जा यी | फयोिगायी | 9650 | 0.9
  |  |  | फयोिगायी | 80600 | 1.0
  |  |  | फयोिगायी | 170 | 0.7
  |  |  | फयोिगायी | 30 | 0.5
Figure 4.2: Precision charts for phonetic nature of Hindi language (one chart per query, Query No. 1 to Query No. 5, covering each query and its phonetic variants)

From the table and figure above, it can be seen that search engines return documents for the various phonetically equivalent Hindi queries. It is observed that no particular standard exists for writing keywords to fetch Hindi web data. For every phonetically equivalent keyword in the query, the results vary, i.e. a different set of documents is retrieved with the least repetition. From the precision chart, it is observed that the degree of relevance for queries containing phonetically equivalent keywords is almost the same. A native Hindi user may not be aware of the phonetic issues in Hindi IR and may miss relevant information of his/her use.
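
One way to treat phonetically equivalent spellings as the same keyword is to map each spelling to a shared grouping key. The sketch below is an illustration only and not the method used in this study: the function name `phonetic_key` and the three rules (drop the nukta, treat chandrabindu as anusvara, treat a half nasal consonant before another consonant as anusvara) are assumptions chosen because they cover common Devanagari spelling variation. The key is meant for grouping, not as a correct spelling.

```python
import unicodedata

NUKTA = "\u093c"         # dot below a consonant
CHANDRABINDU = "\u0901"  # nasalization sign written above a vowel
ANUSVARA = "\u0902"      # nasal dot written above a letter
VIRAMA = "\u094d"        # halant, marks a half consonant
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # the five nasal consonants


def is_consonant(ch):
    """True for Devanagari consonants in the range U+0915 to U+0939."""
    return "\u0915" <= ch <= "\u0939"


def phonetic_key(word):
    """Map spelling variants of a Devanagari word to one grouping key."""
    # NFD splits precomposed nukta letters into base letter + nukta
    chars = unicodedata.normalize("NFD", word)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == NUKTA:
            i += 1  # ignore the nukta
        elif ch == CHANDRABINDU:
            out.append(ANUSVARA)
            i += 1
        elif (ch in NASALS and i + 2 < len(chars)
              and chars[i + 1] == VIRAMA and is_consonant(chars[i + 2])):
            out.append(ANUSVARA)  # half nasal before a consonant
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Three spellings of one word: half nasal, anusvara, chandrabindu
variants = [
    "आ" + "न" + VIRAMA + "दोलन",
    "आ" + ANUSVARA + "दोलन",
    "आ" + CHANDRABINDU + "दोलन",
]
print({phonetic_key(v) for v in variants})
```

An interface layer could use such a key to recognize that several typed spellings refer to the same keyword before issuing one query per variant.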
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations, and attitudes in addition to its basic "dictionary" meaning, and a word often has near-synonyms that differ from it solely in these nuances of meaning. Choosing the right word can be difficult for people as well as for an information retrieval system. For example, the Hindi word आब षण ("ornament" in English) has the following commonly spoken synonyms: गहन, ज वय, अर क य.
Table 4.21 and Figure 4.3, presented below, compare the precision values across the three search engines.
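Table 4.21: Effect of word synonyms on Hindi IR
S.No. | Query | Standard Hindi word | Synonyms | Google, documents returned | Google, Precision @ 10 | Bing, documents returned | Bing, Precision @ 10 | Guruji, documents returned | Guruji, Precision @ 10
1 | स न क आब षण | आब षण | गहन |  |  |  |  |  |
1.1 | स न क आब षण |  |  | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन |  |  | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय |  |  | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य |  |  | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण |  |  | 493 | 0.5 | 38 | 0.3 | 1 | 0
2 | क र फ दर | फ दर |  |  |  |  |  |  |
2.1 | क र फ दर |  |  | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ |  |  | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय |  |  | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2
3 | सतर सशिकतकयण | सतर | न यी |  |  |  |  |  |
3.1 | सतर सशिकतकयण |  |  | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण |  |  | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण |  |  | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण |  |  | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2
4 | मसक दयक अ हक य | अ हक य |  |  |  |  |  |  |
4.1 | मसक दयक अ हक य |  |  | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न |  |  | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड |  |  | 495 | 0.5 | 54 | 0.6 | 9 | 0.1
5 | व रग ओ | व |  |  |  |  |  |  |
5.1 | व रग ओ |  |  | 6960 | 1 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ |  |  | 13400 | 1 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ |  |  | 481 | 0.5 | 19 | 0.5 | 0 | 0
6 | आ खद न | आ ख |  |  |  |  |  |  |
6.1 | आ खद न |  |  | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न |  |  | 77500 | 1 | 3240 | 1 | 159 | 0.9
6.3 | च द न |  |  | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1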
Figure 4.3: Comparison of precision values against three search engines

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.
From the table and figure above, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be the availability of fewer documents or poor indexing. However, we are interested in the quality of results rather than the quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji, and in the average case Google still ranks first, i.e. its precision values are higher than those of Bing and Guruji. It thus becomes clear that equivalent results can be obtained by changing a keyword into its synonym. Synonyms of keywords therefore play an important role in Hindi information retrieval.
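
The "average case" comparison can be checked by averaging the precision columns of Table 4.21. The sketch below is an added cross-check, not a calculation reported in the study: the values are transcribed from the 21 query rows (1.1 to 6.3) of the table, and the use of an unweighted mean is an assumption.

```python
from statistics import mean

# Precision @ 10 columns of Table 4.21, rows 1.1 to 6.3 in order
precision = {
    "Google": [0.8, 0.8, 0.8, 0.5, 0.5, 0.7, 0.9, 0.6, 0.9, 0.9, 0.9,
               0.8, 0.4, 0.6, 0.5, 1.0, 1.0, 0.5, 0.7, 1.0, 0.8],
    "Bing": [0.7, 0.6, 0.7, 0.4, 0.3, 0.7, 0.8, 0.6, 0.7, 0.9, 0.8,
             0.7, 0.3, 0.2, 0.6, 0.8, 0.9, 0.5, 0.5, 1.0, 0.1],
    "Guruji": [0.5, 0.5, 0.6, 0.0, 0.0, 0.3, 0.6, 0.2, 0.6, 0.4, 0.3,
               0.2, 0.6, 0.1, 0.1, 0.5, 0.9, 0.0, 0.3, 0.9, 0.1],
}

# Unweighted mean precision per engine over all query variants
averages = {engine: mean(values) for engine, values in precision.items()}

# Engines ordered from highest to lowest mean precision
ranking = sorted(averages, key=averages.get, reverse=True)

for engine in ranking:
    print(f"{engine}: {averages[engine]:.2f}")
```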
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present five randomly selected ambiguous queries. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same ambiguous keyword in the other context. The fourth and sixth columns hold the English meaning of the query with respect to the ambiguous keyword in each context.
It has been observed that all three search engines return more documents when
the query is submitted with the root word. This justifies searching with root
words, because in general keywords in their root form give better results.
It has also been observed that only Google lists morphological variants of the
root words, whereas Bing and Guruji list only the root word supplied, in almost
all the sample queries in the table above.
From the above results it is evident that only Google indexes document keywords
in their root form. Bing and Guruji do not index in that form, which is why
they retrieve fewer documents than Google. The overall comparison of results
from the three search engines in the tables above shows that, in general, the
quantity of results retrieved increases when keywords are used in their root
form. For search engines, however, the quality of results is more important
than the quantity. Figure 4.1 and Table 4.19 show the comparison of the
precision values of the three search engines. The precision value is calculated
from the top 10 results of each search engine. On closely observing the
results, we can say that the precision value for Google is high in almost all
queries. Since, as mentioned above, Google indexes keywords in their root form,
the relevancy of its results is also high in comparison to the other two search
engines. This indicates that morphological variations in the keywords affect
not only the quantity but also the quality of results.
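The precision values in this chapter are computed over the top 10 results of each engine. As a minimal sketch of that measure (the function name `precision_at_k` and the boolean relevance judgments are illustrative assumptions, not the tooling used in the study), precision at 10 is the fraction of the first ten results judged relevant:

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k ranked results that were judged relevant.

    relevant[i] is the manual judgment for the result at rank i + 1.
    Lists shorter than k are still divided by k; this is an assumption,
    since the source does not say how short result lists were scored.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevant[:k] if r) / k


# Hypothetical judgments for one query: 9 of the first 10 results relevant.
judgments = [True] * 9 + [False]
print(precision_at_k(judgments))
```

Dividing by a fixed k keeps the values comparable across engines, at the cost of penalising an engine that returns fewer than ten documents.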
4.10 Discussion: Phonetic nature of Hindi Language and Spelling variations
Search engines should be capable of retrieving results for words that are
phonetically equivalent to the keywords entered. A user may use any keyword
for searching, and search engines should be able to support all phonetically
equivalent words.
The following are randomly selected queries from the set of 50 queries tested
on the Google search engine. The table below shows the number of results and
the precision offered by the search engine for each variant.
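Table 4.20 Results of the search engine on the phonetic nature of Hindi language
(Each block gives the Hindi query with its standard keywords, the phonetic
variations of the keywords, and the Google results for each query variant.)

Query 1: ससजिमो भ िहयीर ऩद थम
Phonetic variations of the keywords: सिबजमो सबज मो जहयीर जहरयर
Query keywords | No. of results | Precision@10
ससजिमोिहयीर | 97 | 0.9
ससजिमो जहयीर | 632 | 0.9
सिबजमो िहयीर | 194 | 0.9

Query 2: आसभान छ त भहॊगाई
Phonetic variations of the keywords: आसभ भह ग ई भ ह ग ई
Query keywords | No. of results | Precision@10
आसभ न भहॊगाई | 35300 | 1.0
आसभ न भहॉगाई | 1040 | 1.0
आसभ भह ग ई | 14 | 0.6
आसभाॊ भह ग ई | 563 | 0.7

Query 3: भरषटाचाय स आिादी
Phonetic variations of the keywords: भरशट च य बयषटट च य आज दी
Query keywords | No. of results | Precision@10
भरषटाचायआज दी | 211000 | 0.8
भरशटाचायआज दी | 214 | 0.6
बयषटाचाय आज दी | 447 | 0.7
भरषटट च यआिादी | 1090000 | 0.9
भरशट च य आिादी | 1040 | 0.7
बयषटट च यआिादी | 1190 | 0.8

Query 4: अननाहिाय क आनदोरन
Phonetic variations of the keywords: अनन हज य आ द रनआ द रन
Query keywords | No. of results | Precision@10
अनन हज य आनदोरन | 84700 | 0.3
अनन हज य आॊदोरन | 85100 | 0.8
अनन हज य क आॉदोरन | 78 | 0.6
अनना हिाय क आनद रन | 399 | 0.5
अनन हिाय क आनद रन | 3260000 | 1.0

Query 5: फयोिगायी सभसम सभ ध न
Phonetic variations of the keywords: फ य जग यी फय जग यी फ य जा यी
Query keywords | No. of results | Precision@10
फयोिगायी | 9650 | 0.9
फयोिगायी | 80600 | 1.0
फयोिगायी | 170 | 0.7
फयोिगायी | 30 | 0.5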
Figure 4.2 Precision charts for the phonetic nature of Hindi language (one
chart per query, Query No. 1 to Query No. 5)
From the above table and figure it can be clearly seen that search engines
return documents for each of the phonetically equivalent Hindi queries. It is
observed that no particular standard exists for writing the keywords used to
fetch Hindi web data. For every phonetically equivalent keyword in the query
the results vary, i.e. a different set of documents is retrieved, with the
least repetition. From the precision chart it is observed that the degree of
relevance for queries containing phonetically equivalent keywords is nearly
equal. The native Hindi user may not be aware of the phonetic issues in Hindi
IR and may miss relevant information of his/her use.
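One common way to make a system tolerant of such spelling variation is to reduce each keyword to a canonical key before matching. The sketch below is an assumption-laden illustration, not the method used in this study: it folds only three well-known Devanagari variations (nukta, chandrabindu versus anusvara, and a half nasal consonant versus anusvara), and the names `phonetic_key` and `group_variants` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

NUKTA = "\u093C"         # dot below a consonant, as in ज़
CHANDRABINDU = "\u0901"  # ँ
ANUSVARA = "\u0902"      # ं
# A nasal consonant (ङ ञ ण न म) plus virama directly before another consonant.
_HALF_NASAL = re.compile("[\u0919\u091E\u0923\u0928\u092E]\u094D(?=[\u0915-\u0939])")


def phonetic_key(word: str) -> str:
    """Map common spelling variants of a Hindi word to one canonical key."""
    s = unicodedata.normalize("NFD", word)  # split precomposed nukta letters
    s = s.replace(NUKTA, "")                # ज़ -> ज, फ़ -> फ
    s = s.replace(CHANDRABINDU, ANUSVARA)   # ँ -> ं
    return _HALF_NASAL.sub(ANUSVARA, s)     # न्द -> ंद, म्प -> ंप


def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = defaultdict(list)
    for w in words:
        groups[phonetic_key(w)].append(w)
    return dict(groups)


# Illustrative spellings (chosen for this sketch, not taken from Table 4.20).
variants = ["आन्दोलन", "आंदोलन", "आँदोलन", "आज़ादी", "आजादी"]
for key, members in group_variants(variants).items():
    print(key, members)
```

A query keyword can then be expanded to every stored spelling that shares its key. Dropping the nukta also merges letters that some writers keep distinct, so the folding rules would need tuning against real data.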
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right
word can be difficult for people as well as for an information retrieval
system. For example, the Hindi word आब षण (ornament in English) has the
following commonly spoken synonyms: गहन ज वयअर क य
Table 4.21 and Figure 4.3 below compare the precision values of the three
search engines.
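Table 4.21 Effect of word synonyms on Hindi IR
(For each query group, the first line gives the query and its standard Hindi
word with synonyms; the numbered rows give the documents returned and the
precision at 10 for each engine.)

S.No. | Query | Google: documents | Google: Precision@10 | Bing: documents | Bing: Precision@10 | Guruji: documents | Guruji: Precision@10

1 | Query: स न क आब षण | Standard Hindi word / synonyms: आब षणगहन
1.1 | स न क आब षण | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5
1.2 | स न क गहन | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5
1.3 | स न क ज वय | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6
1.4 | स न क अर क य | 9490 | 0.5 | 633 | 0.4 | 70 | 0
1.5 | स न क आबयण | 493 | 0.5 | 38 | 0.3 | 1 | 0

2 | Query: क र फ दर | Standard Hindi word / synonyms: फ दर
2.1 | क र फ दर | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3
2.2 | क र भ घ | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6
2.3 | क र जरधय | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2

3 | Query: सतर सशिकतकयण | Standard Hindi word / synonyms: सतर न यी
3.1 | सतर सशिकतकयण | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6
3.2 | न यीसशिकतकयण | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4
3.3 | भहहर सशिकतकयण | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3
3.4 | औयतसशिकतकयण | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2

4 | Query: मसक दयक अ हक य | Standard Hindi word / synonyms: अ हक य
4.1 | मसक दयक अ हक य | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6
4.2 | मसक दयक अमबभ न | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1
4.3 | मसक दयक घभ ड | 495 | 0.5 | 54 | 0.6 | 9 | 0.1

5 | Query: व रग ओ | Standard Hindi word / synonyms: व
5.1 | व रग ओ | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5
5.2 | ऩ डरग ओ | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9
5.3 | दयखतरग ओ | 481 | 0.5 | 19 | 0.5 | 0 | 0

6 | Query: आ खद न | Standard Hindi word / synonyms: आ ख
6.1 | आ खद न | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3
6.2 | न तरद न | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9
6.3 | च द न | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1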
Figure 4.3 Comparison of precision values against three search engines
From the examples above it is observed that using synonyms of Hindi keywords
improves information retrieval for a query in the Hindi language. Synonyms
affect not only the quantity of documents returned but also their quality.
From the above table and figure it can be observed that Google returns more
documents than the other two search engines and that Guruji returns the
fewest; the reason may be the availability of fewer documents or poor
indexing. However, we are interested in the quality of results rather than the
quantity. As far as quality is concerned, it can be clearly seen that Google
and Bing provide better-quality data than Guruji. In the average case Google
still ranks first, i.e. its precision values are higher than those of Bing and
Guruji. Thus it becomes clear that by changing a keyword into its synonym,
equivalent results can be obtained. Therefore it is evident that synonyms of
keywords play an important role in Hindi information retrieval.
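The synonym substitution described above can be sketched as a small query expander. This is a minimal illustration under stated assumptions: the synonym table is a hypothetical one-entry dictionary, its Devanagari spellings are chosen for this sketch, and `expand_with_synonyms` is not the software described in the next chapter.

```python
from itertools import product


def expand_with_synonyms(query, synonyms):
    """Return every query variant obtained by swapping keywords for synonyms.

    synonyms maps a keyword to its alternatives. The original keyword is
    always the first choice, so the unchanged query is the first variant.
    """
    choices = [[tok] + list(synonyms.get(tok, [])) for tok in query.split()]
    return [" ".join(combo) for combo in product(*choices)]


# Hypothetical synonym table for "ornament" (spellings are illustrative).
SYNONYMS = {"आभूषण": ["गहने", "जेवर", "अलंकार"]}

for variant in expand_with_synonyms("सोने के आभूषण", SYNONYMS):
    print(variant)
```

Each variant can be submitted as a separate query and the result lists compared, which is the kind of comparison Table 4.21 reports.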
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, we present below five randomly
selected queries. In Table 4.22 the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context and the
fifth column holds the same ambiguous keyword in the other context. The fourth
and sixth columns hold the meaning of the queries in English with respect to
the ambiguous keyword in context.

Table 4.22 List of randomly selected ambiguous queries
S.No. | Query | For keyword as | In English | For keyword as | In English
1 | न यी क स न ऩस द ह | स न (Gold) | Women like gold | स न (To sleep) | Women like to sleep
2 | आभ र गो की ऩस द | आभ (Common) | Common man's choice | आभ (Mango) | Mango is people's choice
3 | फ र षवक सऔय ऩ षण | फ र (Children) | Child development and nutrition | फ र (Hair) | Hair development and nutrition
4 | सऩ यो क पन | पन (Art) | Art of snake charmers | पन (Snake head) | Snake charmer's snake head
5 | म दध भ क र षवन श | क र (Aggregate) | Aggregate destruction in wars | क र (Family) | Destruction of families in war

The ambiguous queries listed above were tested against three search engines:
Google, Bing and Guruji. The results are shown in the tables below. In each
table the "Results found" columns count the top 10 documents that fall in
each context.

Table 4.23 Ambiguity test for Google
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 50800 | Gold | 5 | To sleep | 2 | 3
2 | आभ | 488000 | Common | 3 | Mango | 3 | 4
3 | फ र | 2900000 | Children | 7 | Hair | 3 | 0
4 | पन | 184 | Art | 0 | Snake head | 10 | 0
5 | क र | 17800 | Aggregate | 2 | Family | 3 | 5

Table 4.24 Ambiguity test for Bing
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 2680 | Gold | 2 | To sleep | 2 | 6
2 | आभ | 17800 | Common | 3 | Mango | 3 | 4
3 | फ र | 4030 | Children | 6 | Hair | 2 | 2
4 | पन | 25 | Art | 0 | Snake head | 9 | 1
5 | क र | 1900 | Aggregate | 0 | Family | 2 | 8
Table 4.25 Ambiguity test for Guruji
Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context
1 | स न | 109 | Gold | 0 | To sleep | 0 | 10
2 | आभ | 6756 | Common | 3 | Mango | 0 | 7
3 | फ र | 635 | Children | 5 | Hair | 2 | 3
4 | पन | No results found | Art | n/a | Snake head | n/a | n/a
5 | क र | 84 | Aggregate | 0 | Family | 0 | 10

From the results in the above tables it is observed that all three search
engines return documents without differentiating between the contexts of the
keyword in the query. In the above tables the last column, labeled "Other
context", holds the number of results that are not relevant to the query
supplied, or that contain the keywords in some other, non-required context.
From the results it is clear that all search engines return documents in
different contexts. Therefore it can be said that search engines underperform
when supplied with ambiguous queries. The numbers in the "Other context"
column signify the deviation from relevance. For example, for the query
म दध भ क र षवन श (aggregate destruction in wars), the "Other context" column
contains 5 documents for Google, 8 documents for Bing and all 10 documents for
Guruji.
For another query, सऩ यो क पन (art of snake charmers; other context: snake
charmer's snake head), the retrieved documents are expected to be in the
context of art, but from the above results it can be seen that Google returns
all 10 documents in the non-required context (snake head) and Bing returns 9,
whereas Guruji fails to retrieve even a single document. In such a scenario it
becomes important for search engines to address the issue of ambiguity in
keywords in order to obtain better results.
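The context counts in the ambiguity tables can be derived from manual labels assigned to the top 10 results of each query. The following is a rough sketch only: the label names, the `context_breakdown` helper and the ordering of the sample list are assumptions, while the 5/2/3 split reproduces the Google figures reported for query 1.

```python
from collections import Counter


def context_breakdown(labels, intended, k=10):
    """Summarise manual context labels for the top-k results of one query.

    labels holds one label per ranked result, e.g. "gold", "sleep" or
    "other". Returns the per-label counts and the share of the top-k
    results that match the intended sense of the ambiguous keyword.
    """
    top = list(labels)[:k]
    counts = Counter(top)
    share = counts[intended] / len(top) if top else 0.0
    return dict(counts), share


# 5 "gold", 2 "sleep", 3 other: the Google counts for query 1.
# The order of labels within the list is made up for illustration.
google_q1 = ["gold"] * 5 + ["sleep"] * 2 + ["other"] * 3
counts, share = context_breakdown(google_q1, intended="gold")
print(counts, share)
```

The share for the intended sense is a per-sense precision at 10, which makes the deviation caused by ambiguity directly comparable across engines.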
4.13 Discussion: Influence of English on Hindi Information Retrieval
The English language influences Hindi not only in speech but in writing too.
This becomes especially evident in Hindi literature on the web. The influence
of English on Hindi has been observed to be one of the most important
parameters for Hindi information retrieval, as the following example shows.
Example: The English word exercise is written in Hindi as एकसयस इज. The word
exercise (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज,
एकसयस इज, एकस यस इज, एकसयस इज. Following the kind of phonetic variation shown
in this example, a variety of popular keywords and queries from various
domains have been tested.
The following table shows a sample set of common and popular English keywords
along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi
(The last four columns give the number of Google results for the Google
transliteration, the standard Hindi keyword and the two phonetic equivalents,
in that order.)
English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results: transliteration | Results: standard | Results: equivalent 1 | Results: equivalent 2
Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800
Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710
Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820
Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100
Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58
Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000
University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270
Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300
Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160
Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120
Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820
Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8
Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000

From the above table it can be observed that the search engine returns
documents not only for the single-keyword query but also, in large numbers,
for all phonetic variants of the keyword. It can also be seen that people have
their own ways of representing Hindi words and that no standard is followed
for storing Hindi data on the web. Documents are likewise retrieved for every
phonetic variant of an English keyword written in Hindi script. In the above
table the Google transliteration column (the bold Hindi entries) shows the
keywords obtained using the Google transliteration tool. Table 4.26 shows that
transliteration does not provide the correct Hindi word in most cases. For
example, the correct transliteration of the word University should be
म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which
is completely wrong. It is clearly evident from the table above that 1540
documents have been retrieved for the wrong keyword उतनव मसमतम (University),
and the same holds for other single-word queries: Insurance (इनस य स),
Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist
(सऩ मसअमरसट). It can be inferred that Hindi website developers make use of
unchecked and non-standard transliteration, which makes Hindi IR a difficult
task.
In the next example, multiword Hindi queries are selected to test the effect
of English influence on Hindi IR, in terms of the precision and quantity of
documents retrieved. The Hindi query is transformed into its variants by the
software (design and working discussed in the next chapter) by replacing Hindi
keywords with English keywords written in Hindi, without changing the meaning
of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by the English equivalents, without changing the meaning of the
original Hindi query. Example:
An English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ", where the Hindi keyword षवद श means "Foreign" (प य न)
and तनव श means "investment" (इनव सटभट). The query is transformed for the two
levels as follows:
Level 1: प य न तनव श ब यत भ (Foreign nivesh bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" ("videshi nivesh
Bharat mein") supplied by the user is transformed into two equivalent variants
containing a mixture of English and Hindi, while the meaning of the query
remains the same. From the sample set of one hundred queries, some randomly
selected queries are presented below in Table 4.27.
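The two-level transformation can be sketched as follows. This is a minimal illustration, not the software described in the next chapter: it uses romanised stand-ins for the Devanagari keywords, the mapping `equivalents` is hypothetical, and it enumerates every possible Level 1 variant, whereas the example above lists only one.

```python
from itertools import combinations


def transform_query(tokens, english_equivalents):
    """Build Level 1 and Level 2 variants of a query.

    Level 1 replaces exactly one keyword with its English equivalent;
    Level 2 replaces two or more. Tokens without an entry in
    english_equivalents are left untouched.
    """
    replaceable = [i for i, t in enumerate(tokens) if t in english_equivalents]
    levels = {1: [], 2: []}
    for n in range(1, len(replaceable) + 1):
        for chosen in combinations(replaceable, n):
            variant = [english_equivalents[t] if i in chosen else t
                       for i, t in enumerate(tokens)]
            levels[1 if n == 1 else 2].append(" ".join(variant))
    return levels


# Romanised stand-ins for the keywords of the example query.
query = ["videshi", "nivesh", "bharat", "mein"]
equivalents = {"videshi": "foreign", "nivesh": "investment"}
print(transform_query(query, equivalents))
```

In a real system the replacement strings would be the English words written in Devanagari, ideally one entry per commonly used spelling.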
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)
In English | Variant | Query | Google search results
Health and blood donation | Hindi query | सव सथ औय यकतद न | 1520
Health and blood donation | Level 1 | हलथ औय यकतद न | 8360
Health and blood donation | Level 2 | हलथ औय जरि_िोनिन | 1020
Treatment for blood pressure | Hindi query | यकतच ऩ क इर ज | 95800
Treatment for blood pressure | Level 1 | जरिपरिय क इर ज | 37200
Treatment for blood pressure | Level 2 | जरिपरिय क टरीटभट | 2150
Cardiologist | Hindi query | रदम चचककतसक | 241000
Cardiologist | Level 1 | हाट िॉकटय | 99100
Cardiologist | Level 2 | काडिमोरासिटट | 1770
Government employment policy | Hindi query | सयक य दव य य जग य म जन | 1840000
Government employment policy | Level 1 | सयक य दव य य जग य टकीभ | 51600
Government employment policy | Level 2 | सयक य दव य एमपतरॉमभट सकीभ | 93
Foreign investment in India | Hindi query | षवद श तनव श ब यत भ | 658000
Foreign investment in India | Level 1 | पायन तनव श ब यत भ | 527
Foreign investment in India | Level 2 | पायनइनवटटभट ब यत भ | 66
Corruption free India | Hindi query | भरषटट च य भ कत ब यत | 181000
Corruption free India | Level 1 | कयतिन भ कत ब यत | 19700
Corruption free India | Level 2 | कयतिन फरी इॊडिमा | 47000

Figure 4.4 Transformed queries into two equivalent senses (one chart per query,
Hindi Query 1 to 6, showing results for the Hindi query, Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the quantity
of retrieved documents is considerable. For search engines, however, the
quality of results is more important than the quantity; therefore Table 4.28
and Figure 4.5 are presented below for the analysis of precision values. Three
popular search engines, namely Google, Bing and AltaVista, are used for
retrieving web results.
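Table 4.28 Analysis of precision values (tabular representation)
Query | Variant | Query text | Google: Precision@10 | Bing: Precision@10 | AltaVista: Precision@10
1 | Hindi query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9
1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9
1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8
2 | Hindi query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8
2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7
2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5
3 | Hindi query | रदमचचककतसक | 0.9 | 0.9 | 0.9
3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7
3 | Level 2 | क रड मम र िजसट | 1.0 | 1.0 | 1.0
4 | Hindi query | सयक यदव य य जग यम जन | 1.0 | 0.8 | 0.6
4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7
4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8
5 | Hindi query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9
5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9
5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4
6 | Hindi query | भरषटट च यभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 1 | कयपशनभ कतब यत | 1.0 | 1.0 | 1.0
6 | Level 2 | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0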
Figure 4.5 Analysis of inter-precision values (precision chart for each Hindi
query, HQ, and its Level 1 and Level 2 variants, L1 and L2, on Google, Bing
and AltaVista)
From the above tables and figures it can be clearly seen that Hindi data of a
similar nature can be mined for Hindi queries by transforming them into
variants that include English keywords written in Hindi. The transformation of
queries resulted in an increase in retrieved data. The relevance of the
retrieved data can also be seen in the precision columns. For every Hindi
query and its transformed variations, the degree of relevance of the documents
is very close, equal or improved. For example, for the Hindi query
सव सथ औय यकतद न, 9 of the first 10 documents are relevant, and for the
transformed queries of similar nature, ह लथ औय यकत द न and
ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are also relevant; the same
pattern repeats for the rest of the Hindi queries, as shown in the table
above. Without transformation of Hindi queries the user may miss relevant
information, since the Hindi user may not be aware of the presence of such
information on the web and may be unable to formulate the variant query based
on English influence. From the above table it can be said that
English-influenced Hindi information is present on the web and is increasing
day by day. By including English keywords in Hindi script in the query, the
scope of searching in Hindi and of getting relevant information can be
increased.
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
Relevant information can be mined by transforming the Hindi queries. Search
engines neither transform the query nor find keyword equivalents, because they
may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms and English-equivalent Hindi keywords are implemented at
the root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort a Hindi user needs to search for such
information, a software tool has been developed (described in detail in the
next chapter) that acts as an interface between the user and search engines.
With the help of this tool the user can widen the scope of search on the web
in the Hindi language [66][67].
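An interface-level layer of this kind can be pictured as a thin wrapper that expands the query, sends every variant to a search backend and merges the ranked lists. The sketch below is purely illustrative and is not the tool described in the next chapter: `search_with_variants`, the expander callables and the stub index are all hypothetical, and a real deployment would call an actual search engine API instead of the stub.

```python
from typing import Callable, Iterable, List


def search_with_variants(query: str,
                         expanders: Iterable[Callable[[str], List[str]]],
                         search: Callable[[str], List[str]],
                         k: int = 10) -> List[str]:
    """Send a query and its variants to a search backend and merge results.

    expanders are functions producing alternative phrasings (phonetic,
    synonym, English-influenced); search is any callable returning a
    ranked list of URLs. Results are interleaved round-robin across
    variants and de-duplicated, keeping first occurrences.
    """
    variants = [query]
    for expand in expanders:
        for v in expand(query):
            if v not in variants:
                variants.append(v)
    ranked = [search(v)[:k] for v in variants]
    merged, seen = [], set()
    for rank in range(k):
        for results in ranked:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


# Stub backend standing in for a real search engine (hypothetical data).
FAKE_INDEX = {"videshi nivesh": ["u1", "u2"], "foreign nivesh": ["u2", "u3"]}
print(search_with_variants("videshi nivesh",
                           [lambda q: [q.replace("videshi", "foreign")]],
                           lambda q: FAKE_INDEX.get(q, [])))
```

Because expansion happens outside the engine, the engine's own indexing is left untouched, which is the point of solving the problem at the interface level.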
![Page 43: Chapter 4 Issues in Information Retrieval for Hindi Language](https://reader034.fdocuments.in/reader034/viewer/2022042707/5890454e1a28abc4618b47bd/html5/thumbnails/43.jpg)
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 128
Following are randomly selected queries from the set of 50 queries tested on
Google search engine Tables below show the results and precision offered by
Table 420 results of the search engine on Phonetic nature of Hindi language
Hindi Query
With Bold
Standard
Keywords
Phonetic variations of the Keywords Google Results
for query having
keywords
No of
Results
Precision
10
ससजिमो भ िहयीर
ऩद थम
सिबजमो सबज मो
जहयीर जहरयर ससजिमोिहयीर 97 09
ससजिमो जहयीर 632 09
सिबजमो िहयीर 194 09
आसभान छ त भहॊगाई
आसभ भह ग ई भ ह ग ई आसभ न भहॊगाई 35300 10
आसभ न भहॉगाई 1040 10
आसभ भह ग ई 14 06
आसभाॊ भह ग ई 563 07
भरषटाचाय स आिादी
भरशट च य
बयषटट च य आज दी भरषटाचायआज दी 211000 08
भरशटाचायआज दी 214 06
बयषटाचाय आज दी 447 07
भरषटट च यआिादी 1090000 09
भरशट च य आिादी 1040 07
बयषटट च यआिादी 1190 08
अननाहिाय क आनदोरन
अनन हज य आ द रनआ द रन अनन हज य आनदोरन
84700 03
अनन हज य आॊदोरन
85100 08
अनन हज य क आॉदोरन
78 06
अनना हिाय क आनद रन
399 05
अनन हिाय क आनद रन
3260000 10
फयोिगायी सभसम सभ ध न
फ य जग यी फय जग यी फ य जा यी
फयोिगायी 9650 09
फयोिगायी 80600 10
फयोिगायी 170 07
फयोिगायी 30 05
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 129
Figure 42 Precision Charts for Phonetic nature of Hindi language
In the above table and figure it can be clearly seen that search engines return a
handful of documents on various Hindi phonetically equivalent queries It is
observed that no particular standard exists for writing the keyword to fetch Hindi
34
33
33
Query No 1
1 11 12
31
30
18
21
Query No 2
2 21 22 23
18
13
1520
16
18
Query No 3
3 31 32
33 34 35
5823
109
Query No 4
1st Qtr 2nd Qtr
3rd Qtr 4th Qtr
29
32
23
16
Query No 5
5 51 52 53
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 130
web data For every phonetically equivalent keywords in the query variation in
the results exist Ie a different set of documents are retrieved with least
repetition From the precision chart it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost same
or nearly equal The native Hindi user may not be aware of the Phonetic issues in
Hindi IR and may miss the relevant information of hisher use
411 Discussion Words Synonyms
A word can express a myriad of implications connotations and attitudes
in addition to its basic ―dictionary meaning And a word often has near
synonyms that differ from it solely in these nuances of meaning Choosing the
right word can be difficult for people as well as for the information retrieval
system For example the word (आब षण) in Hindi (Ornament) in English has
following commonly spoken synonyms गहन ज वयअर क य
Table 421 and Figure 43 have been presented below which shows the
comparison of precision values against three search engines
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 131
Table 421 Effect of word synonyms on Hindi IR
S NO Query Standard
Hindi
Words
Synonyms Documents Returned
Google Per
10
Bing Per
10
Guruj
i
Per
10
1 स न क आब
षण
आब षणगहन
11 स न क आब षण 217000 08 3250 07 381 05
12 स न क गहन 188000 08 2590 06 389 05
13 स न क ज वय 78900 08 1670 07 311 06
14 स न क अर क य 9490 05 633 04 70 0
15 स न क आबयण 493 05 38 03 1 0
2 क र फ दर फ दर
21 क र फ दर 233000 07 7510 07 733 03
22 क र भ घ 40700 09 1500 08 99 06
23 क र जरधय 1570 06 54 06 2 02
3 सतर सशिकतकयण
सतर न यी
31 सतर सशिकतकयण
9950 09 1570 07 760 06
32 न यीसशिकतकयण
29300 09 1910 09 736 04
33 भहहर सशिकतकयण
96300 09 5160 08 1091 03
34 औयतसशिकतकयण
7670 08 680 07 510 02
4 मसक दयक अ हक य
अ हक य
41 मसक दयक अ हक य
1990 04 18 03 60 06
42 मसक दयक अमबभ न
2400 06 304 02 16 01
43 मसक दयक घभ ड
495 05 54 06 9 01
5 व रग ओ व
51 व रग ओ 6960 1 698 08 29 05
52 ऩ डरग ओ 13400 1 1080 09 143 09
53 दयखतरग ओ 481 05 19 05 0 0
6 आ खद न आ ख
61 आ खद न 34000 07 3690 05 312 03
62 न तरद न 77500 1 3240 1 159 09
63 च द न 2450 08 427 01 36 01
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 132
Figure 43 Comparison of precision values against three search engines
From the examples above it is observed that using Hindi keywords with their
synonyms improves the information retrieval against a query in Hindi language
Not only quantity of documents returned is affected but quality is also affected by
using synonyms of Hindi keywords
From the above table and figure it is to be observed that documents returned by
Google are more in quantity than other two search engines and least number of
documents get returned by Guruji search engine the reason behind may be
availability of less documents or poor indexing However we are interested in
quality of results than quantity As far as quality of results is concerned it can be
clearly seen that Google and Bing provide quality data than Guruji And in the
average case Google still stands first in the row that means precision values by
Google are more than that of Bing and Guruji in this case Thus it becomes clear
that by changing a keyword into its synonym equivalent results can be obtained
Therefore it is evident that synonyms of keywords play an important role in the
process of Hindi information retrieval system
412 Discussion Ambiguity
In a sample set of 50 ambiguous queries below we present five randomly
selected ambiguous queries In figure 35 second column contains five queries in
Hindi third column holds the ambiguous keyword in one context and fifth
column holds the same ambiguous keyword in other context Fourth and sixth
columns hold the meaning of queries in English with respect to the ambiguous
keyword in context
0
02
04
06
08
1
11 12 13 14 15 21 22 23 31 32 33 34 41 42 43 51 52 53 61 62 63
Bing
Guruji
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 133
Table 422 List of randomly selected ambiguous queries
Ambiguous queries mentioned above in the figure are tested for results against
three search engines Google Bing and Guruji Results are shown below in tables
Table 423 Ambiguity test for Google
Table 424 Ambiguity test for Bing
SNo Query For keyword
as
In English For keyword as In English
1 न यी क स न ऩस द ह स न (Gold) Women like Gold स न (To sleep)
Women like to
sleep
2 आभ र गो की ऩस द आभ
(common)
Common manlsquos
choice आभ (Mango) Mango is
peoplelsquos choice
3 फ र षवक सऔय ऩ षण फ र (Children) Child Development
and Nutrition फ र(Hair) Hair
Development and
Nutrition
4 सऩ यो क पन पन (Art) Art of snake
charmers पन(Snake head) Snake charmerlsquos
snake head
5 म दध भ क र षवन श क र
(Aggregate)
Aggregate
destruction in wars क र(family) Destruction of
families in war
Figure 4.2 Precision charts for the phonetic nature of the Hindi language
[Figure: five charts, one each for Query No. 1 to Query No. 5, comparing each query with its phonetic variants (for example 1, 1.1, 1.2).]
The table and figure above show that search engines return documents for each of
the phonetically equivalent forms of a Hindi query. No single standard exists for
spelling a keyword when searching Hindi web data. Each phonetically equivalent
keyword yields a different set of documents, with very little repetition between
the sets. The precision charts show that the degree of relevance is nearly the
same across queries containing phonetically equivalent keywords. A native Hindi
user may be unaware of these phonetic issues in Hindi IR and may therefore miss
relevant information.
4.11 Discussion: Word Synonyms
A word can express a myriad of implications, connotations, and attitudes in
addition to its basic "dictionary" meaning, and a word often has near-synonyms
that differ from it solely in these nuances of meaning. Choosing the right word
can be difficult for people as well as for an information retrieval system. For
example, the Hindi word आब षण (ornament) has the following commonly spoken
synonyms: गहन, ज वय, अर क य.
Table 4.21 and Figure 4.3 below compare the precision values obtained from three
search engines.
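Table 4.21 Effect of word synonyms on Hindi IR

| S.No. | Query | Standard Hindi word / synonyms | Google: documents returned | Google: precision per 10 | Bing: documents returned | Bing: precision per 10 | Guruji: documents returned | Guruji: precision per 10 |
|---|---|---|---|---|---|---|---|---|
| 1 | स न क आब षण | आब षणगहन | | | | | | |
| 1.1 | स न क आब षण | | 217000 | 0.8 | 3250 | 0.7 | 381 | 0.5 |
| 1.2 | स न क गहन | | 188000 | 0.8 | 2590 | 0.6 | 389 | 0.5 |
| 1.3 | स न क ज वय | | 78900 | 0.8 | 1670 | 0.7 | 311 | 0.6 |
| 1.4 | स न क अर क य | | 9490 | 0.5 | 633 | 0.4 | 70 | 0 |
| 1.5 | स न क आबयण | | 493 | 0.5 | 38 | 0.3 | 1 | 0 |
| 2 | क र फ दर | फ दर | | | | | | |
| 2.1 | क र फ दर | | 233000 | 0.7 | 7510 | 0.7 | 733 | 0.3 |
| 2.2 | क र भ घ | | 40700 | 0.9 | 1500 | 0.8 | 99 | 0.6 |
| 2.3 | क र जरधय | | 1570 | 0.6 | 54 | 0.6 | 2 | 0.2 |
| 3 | सतर सशिकतकयण | सतर न यी | | | | | | |
| 3.1 | सतर सशिकतकयण | | 9950 | 0.9 | 1570 | 0.7 | 760 | 0.6 |
| 3.2 | न यीसशिकतकयण | | 29300 | 0.9 | 1910 | 0.9 | 736 | 0.4 |
| 3.3 | भहहर सशिकतकयण | | 96300 | 0.9 | 5160 | 0.8 | 1091 | 0.3 |
| 3.4 | औयतसशिकतकयण | | 7670 | 0.8 | 680 | 0.7 | 510 | 0.2 |
| 4 | मसक दयक अ हक य | अ हक य | | | | | | |
| 4.1 | मसक दयक अ हक य | | 1990 | 0.4 | 18 | 0.3 | 60 | 0.6 |
| 4.2 | मसक दयक अमबभ न | | 2400 | 0.6 | 304 | 0.2 | 16 | 0.1 |
| 4.3 | मसक दयक घभ ड | | 495 | 0.5 | 54 | 0.6 | 9 | 0.1 |
| 5 | व रग ओ | व | | | | | | |
| 5.1 | व रग ओ | | 6960 | 1.0 | 698 | 0.8 | 29 | 0.5 |
| 5.2 | ऩ डरग ओ | | 13400 | 1.0 | 1080 | 0.9 | 143 | 0.9 |
| 5.3 | दयखतरग ओ | | 481 | 0.5 | 19 | 0.5 | 0 | 0 |
| 6 | आ खद न | आ ख | | | | | | |
| 6.1 | आ खद न | | 34000 | 0.7 | 3690 | 0.5 | 312 | 0.3 |
| 6.2 | न तरद न | | 77500 | 1.0 | 3240 | 1.0 | 159 | 0.9 |
| 6.3 | च द न | | 2450 | 0.8 | 427 | 0.1 | 36 | 0.1 |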
Figure 4.3 Comparison of precision values against three search engines
The examples above show that using Hindi keywords together with their synonyms
improves information retrieval for a query in the Hindi language. Synonyms affect
not only the quantity of documents returned but also their quality.
The table and figure also show that Google returns more documents than the other
two search engines and that Guruji returns the fewest, possibly because fewer
documents are available to it or because of poor indexing. Quality, however,
matters more here than quantity. In terms of quality, Google and Bing clearly
return better results than Guruji, and on average Google ranks first: its
precision values are higher than those of Bing and Guruji. Replacing a keyword
with one of its synonyms therefore yields equivalent results, which makes it
evident that keyword synonyms play an important role in Hindi information
retrieval.
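The "per 10" precision figures used in this discussion can be reproduced with a few lines of code. The sketch below is only an illustration, assuming that each of the first ten results has been judged relevant or not relevant by hand; the function name `precision_at_k` and the sample judgments are hypothetical and are not taken from the study.

```python
from typing import Sequence


def precision_at_k(relevant_flags: Sequence[bool], k: int = 10) -> float:
    """Return the fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevant_flags)[:k]
    # If fewer than k results were returned, the missing ones count as
    # non-relevant, so the denominator stays k.
    return sum(1 for flag in top_k if flag) / k


# Hypothetical manual judgments for one query: 8 of the first 10 are relevant.
judgments = [True, True, False, True, True, True, False, True, True, True]
score = precision_at_k(judgments)
print(f"precision per 10 = {score:.1f}")
```

Keeping the denominator fixed at `k` is a design choice: it penalizes an engine that returns fewer than ten documents, which matches the way very small result sets score low in the table.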
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are
presented below. In Table 4.22, the second column contains the five queries in
Hindi, the third column holds the ambiguous keyword in one context, and the fifth
column holds the same ambiguous keyword in another context. The fourth and sixth
columns give the English meaning of the query with respect to the ambiguous
keyword in each context.
Table 4.22 List of randomly selected ambiguous queries

| S.No. | Query | Keyword (first context) | In English | Keyword (second context) | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed above were tested against three search engines:
Google, Bing, and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned (Google) | First context | Results found | Second context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned (Bing) | First context | Results found | Second context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |
Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned (Guruji) | First context | Results found | Second context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

The results in the tables above show that all three search engines return
documents without differentiating between the contexts of the keyword in the
query. The last column, "Other context", holds the number of results that are not
relevant to the query supplied, or that contain the keyword in some other,
unintended context. Because every search engine returns documents in several
different contexts, it can be said that search engines underperform when supplied
with ambiguous queries. The numbers in the "Other context" column signify the
deviation from relevance. For example, for the query युद्ध में कुल विनाश
(aggregate destruction in wars), the "Other context" column contains 5 documents
for Google, 8 documents for Bing, and all 10 documents for Guruji.
For another query, सपेरों का फन (art of snake charmers; in the other context,
snake charmer's snake head), the retrieved documents are expected to be in the
"art" context. Instead, Google returns all 10 documents in the unintended "snake
head" context and Bing returns 9, whereas Guruji fails to retrieve even a single
document. It is therefore important for search engines to address keyword
ambiguity in order to obtain better results.
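The "deviation from relevance" read off the "Other context" column can be expressed as a simple ratio. The following is a minimal sketch using the counts reported for query 5 in Tables 4.23 to 4.25; the class name `ContextCounts`, its fields, and the assumption that the first context (aggregate) is the intended one are illustrative choices, not part of the original study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextCounts:
    """Result counts for one ambiguous query on one search engine."""
    intended: int   # results in the context the user meant (assumed: first context)
    alternate: int  # results in the second listed context
    other: int      # results in neither listed context

    @property
    def total(self) -> int:
        return self.intended + self.alternate + self.other

    def intended_share(self) -> float:
        """Fraction of examined results that match the intended context."""
        return self.intended / self.total if self.total else 0.0

    def other_share(self) -> float:
        """Fraction of examined results that fall outside both contexts."""
        return self.other / self.total if self.total else 0.0


# Query 5 (aggregate vs. family), counts as reported in the ambiguity tables.
query5 = {
    "Google": ContextCounts(intended=2, alternate=3, other=5),
    "Bing": ContextCounts(intended=0, alternate=2, other=8),
    "Guruji": ContextCounts(intended=0, alternate=0, other=10),
}

for engine, counts in query5.items():
    print(engine, counts.intended_share(), counts.other_share())
```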
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, and the
influence is especially evident in Hindi content on the web. It has been observed
to be one of the most important parameters for Hindi information retrieval, as
the following example shows.
Example: the English word "exercise" is written in Hindi as एकसयस इज, and it has
the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज,
एकसयस इज. Following this pattern, a variety of popular keywords and queries from
various domains were tested.
Table 4.26 lists a sample set of common and popular English keywords along with
their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi

| English word | Hindi forms, in order: Google transliteration, standard Hindi keyword, two phonetic equivalents | Google results: Google transliteration | Google results: standard keyword | Google results: phonetic equivalent 1 | Google results: phonetic equivalent 2 |
|---|---|---|---|---|---|
| Woman | वोभन व भन व भ न व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस इ शम यस इ शम य स इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय क सय क नसय क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर ह िसऩटर ह िसऩटर ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन कयपशन कयऩशन क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय कमपम टय कमपम टय क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम म तनवमसमटी म तनव मसमटी म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय ड मय कटय ड इय कटय ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट एकस डट एकस ड नट ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट ऩ मरमम भट ऩ मरमभट ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट सऩ मशममरसट सऩ शमरसट सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट एकसऩटम ऐकसऩटम ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम ऩॉमरस ऩ मरस ऩ मरस | 609 | 395000 | 69700 | 289000 |

Table 4.26 shows that the search engine returns documents not only for the
standard single-keyword query but also for every phonetic variant of the keyword,
and in large numbers. People evidently have their own ways of representing these
words in Hindi, and no standard is followed for storing Hindi data on the web.
Documents are retrieved for every phonetic variant of an English keyword written
in Hindi script. In the table, the first Hindi form in each row (set in bold in
the original) is the keyword obtained from the Google transliteration tool. The
table shows that transliteration does not produce the correct Hindi word in most
cases. For example, the correct transliteration of the word "University" should
be म तनवमसमटी, whereas Google transliteration gives उतनव मसमतम, which is
completely wrong. Even so, 1540 documents were retrieved for the wrong keyword
उतनव मसमतम (University), and the same holds for other single-word queries:
Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy
(ऩ मरसम), and Specialist (सऩ मसअमरसट). This suggests that Hindi website
developers use unchecked, non-standard transliteration, which makes Hindi IR a
difficult task.
In the next example, multiword Hindi queries are selected to test the effect of
English influence on Hindi IR, in terms of both precision and the quantity of
documents retrieved. The Hindi query is transformed into its variants by the
software (whose design and working are discussed in the next chapter) by
replacing Hindi keywords with English keywords written in Hindi, without
changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is
replaced by its English equivalent; at the second level, more than one word is
replaced by its English equivalent. In both cases the meaning of the original
Hindi query is unchanged. Example:
The English query "Foreign investment in India" can be written in Hindi as
"षवद श तनव श ब यत भ" (videshi nivesh Bharat mein), where the Hindi keyword षवद श
means "foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is
transformed at the two levels as:
Level 1: प य न तनव श ब यत भ (foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (foreign investment Bharat mein)
The original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein)
supplied by the user is thus transformed into two equivalent forms that mix
English and Hindi while the meaning of the query remains the same. From the
sample set of one hundred queries, some randomly selected queries are presented
below in Table 4.27.
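The two-level transformation described above can be sketched in a few lines. This is not the software referred to in the text, whose design is not shown here; it is a hypothetical illustration that uses romanized tokens as stand-ins for the Devanagari words and an assumed lookup table `equivalents` from Hindi keywords to their English equivalents.

```python
from itertools import combinations
from typing import Dict, List


def transform_query(tokens: List[str], equivalents: Dict[str, str]) -> Dict[int, List[str]]:
    """Build Level 1 (one word replaced) and Level 2 (several words replaced) variants."""
    # Positions of the tokens for which an English equivalent is known.
    replaceable = [i for i, token in enumerate(tokens) if token in equivalents]
    levels: Dict[int, List[str]] = {1: [], 2: []}
    for size in range(1, len(replaceable) + 1):
        for positions in combinations(replaceable, size):
            variant = [
                equivalents[token] if i in positions else token
                for i, token in enumerate(tokens)
            ]
            # Exactly one replacement is Level 1; two or more is Level 2.
            levels[1 if size == 1 else 2].append(" ".join(variant))
    return levels


# Romanized stand-ins for the example query; the mapping is an assumption.
equivalents = {"videshi": "foreign", "nivesh": "investment"}
levels = transform_query("videshi nivesh bharat mein".split(), equivalents)
print(levels[1])
print(levels[2])
```

Unlike the single Level 1 form shown in the example, this sketch enumerates every one-word replacement; a real tool might keep only the variants that are common on the web.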
Table 4.27 Queries transformed into two equivalent senses containing a mixture
of both English and Hindi (tabular representation)

| In English | Hindi query | Influenced Hindi query, Level 1 | Influenced Hindi query, Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government employment policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

Figure 4.4 Queries transformed into two equivalent senses
[Figure: six charts, one each for Hindi Query 1 to Hindi Query 6, showing the number of results for the original Hindi query, Level 1, and Level 2.]
The table and figure above show that documents are returned for the original as
well as the transformed Hindi queries, and in considerable quantity. For search
engines, however, the quality of results is more important than the quantity, so
Table 4.28 and Figure 4.5 are presented below to analyze precision values. Three
popular search engines, namely Google, Bing, and AltaVista, are used for
retrieving web results.

Table 4.28 Analysis of precision values (tabular representation)
Precision per 10 results; HQ = original Hindi query, L1 = Level 1, L2 = Level 2.

| Hindi query | Influenced query, Level 1 | Influenced query, Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| सव सथऔययकतद न | ह लथऔययकतद न | ह लथऔयबरड_ड न शन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8 |
| यकतच ऩक इर ज | बरडपर शयक इर ज | बरडपर शयक टरीटभट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5 |
| रदमचचककतसक | ह टमचचककतसक | क रड मम र िजसट | 0.9 | 0.8 | 1.0 | 0.9 | 0.7 | 1.0 | 0.9 | 0.7 | 1.0 |
| सयक यदव य य जग यम जन | सयक यदव य य जग यसकीभ | सयक यदव य एमपरॉमभटसकीभ | 1.0 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8 |
| षवद श तनव शब यतभ | प य नतनव शब यतभ | प य नइनव सटभटब यतभ | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4 |
| भरषटट च यभ कतब यत | कयपशनभ कतब यत | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
Figure 4.5 Analysis of inter-precision values
[Figure: "Inter Precision Chart" plotting precision for HQ1 to HQ6 (HQ = Hindi Query) and their Level 1 (L1) and Level 2 (L2) forms, with one series each for Google, Bing, and AltaVista.]
The tables and figures above show that Hindi data of a similar nature can be
mined for a Hindi query by transforming the query into variants that include
English keywords written in Hindi. Transforming the queries increased the amount
of data retrieved, and the precision columns show the relevance of that data. For
a Hindi query and its transformed variants, the degree of relevance is generally
very close, equal, or improved. For example, for the Hindi query सव सथ औय यकतद न,
9 of the first 10 documents returned by Google are relevant, and for the
transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन,
9 of the first 10 documents are again relevant. The same pattern repeats for the
rest of the Hindi queries shown in the table above. Without transformation of
Hindi queries, the user may miss relevant information, because a Hindi user may
not be aware that such information exists on the web and may be unable to
formulate the query variants that reflect English influence. English-influenced
Hindi information is present on the web and is increasing day by day. Including
English keywords written in Hindi script in the query widens the scope of
searching in Hindi and increases the chance of getting relevant information.
Hindi IR is made more difficult by the structure of the Hindi language. People
generally do not follow the standard Hindi writing conventions, which widens the
gap between Hindi web data and its users.
Relevant information can be mined by transforming Hindi queries. Search engines,
however, neither transform the query nor find keyword equivalents, possibly
because handling parameters such as Hindi phonetics, synonyms, and
English-equivalent Hindi keywords at the root level would cause performance and
throughput problems. The problem can instead be solved at the interface level.
To reduce the effort a Hindi user must spend searching for such information, a
software tool has been developed (described in detail in the next chapter) that
acts as an interface between the user and the search engines. With the help of
this tool, the user can widen the scope of web search in the Hindi language
[66][67].
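The idea of solving the problem at the interface level can be illustrated with a small wrapper that sends the original query and its variants to a search engine and merges the results. This is a sketch under stated assumptions, not the tool described in the next chapter: `expanded_search`, `fake_search`, `fake_variants`, and the document identifiers are all hypothetical, and the in-memory index stands in for a real search engine.

```python
from typing import Callable, Iterable, List


def expanded_search(
    query: str,
    make_variants: Callable[[str], Iterable[str]],
    search: Callable[[str], Iterable[str]],
    limit: int = 10,
) -> List[str]:
    """Query the engine with the original query and its variants, then merge."""
    merged: List[str] = []
    seen = set()
    for candidate in [query, *make_variants(query)]:
        for url in search(candidate):
            # Keep the first occurrence only, so documents found by several
            # variants are not reported twice.
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]


# Stand-in backend: a tiny in-memory index keyed by romanized queries.
FAKE_INDEX = {
    "videshi nivesh bharat mein": ["doc-a", "doc-b"],
    "foreign nivesh bharat mein": ["doc-b", "doc-c"],
    "foreign investment bharat mein": ["doc-d"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


def fake_variants(query: str) -> List[str]:
    return ["foreign nivesh bharat mein", "foreign investment bharat mein"]


results = expanded_search("videshi nivesh bharat mein", fake_variants, fake_search)
print(results)
```

Because the expansion happens outside the search engine, the engine itself needs no changes, which is the point of the interface-level approach.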
![Page 45: Chapter 4 Issues in Information Retrieval for Hindi Language](https://reader034.fdocuments.in/reader034/viewer/2022042707/5890454e1a28abc4618b47bd/html5/thumbnails/45.jpg)
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 130
web data For every phonetically equivalent keywords in the query variation in
the results exist Ie a different set of documents are retrieved with least
repetition From the precision chart it is clearly observed that the degree of
relevance for queries containing phonetically equivalent keywords is almost same
or nearly equal The native Hindi user may not be aware of the Phonetic issues in
Hindi IR and may miss the relevant information of hisher use
411 Discussion Words Synonyms
A word can express a myriad of implications connotations and attitudes
in addition to its basic ―dictionary meaning And a word often has near
synonyms that differ from it solely in these nuances of meaning Choosing the
right word can be difficult for people as well as for the information retrieval
system For example the word (आब षण) in Hindi (Ornament) in English has
following commonly spoken synonyms गहन ज वयअर क य
Table 421 and Figure 43 have been presented below which shows the
comparison of precision values against three search engines
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 131
Table 421 Effect of word synonyms on Hindi IR
S NO Query Standard
Hindi
Words
Synonyms Documents Returned
Google Per
10
Bing Per
10
Guruj
i
Per
10
1 स न क आब
षण
आब षणगहन
11 स न क आब षण 217000 08 3250 07 381 05
12 स न क गहन 188000 08 2590 06 389 05
13 स न क ज वय 78900 08 1670 07 311 06
14 स न क अर क य 9490 05 633 04 70 0
15 स न क आबयण 493 05 38 03 1 0
2 क र फ दर फ दर
21 क र फ दर 233000 07 7510 07 733 03
22 क र भ घ 40700 09 1500 08 99 06
23 क र जरधय 1570 06 54 06 2 02
3 सतर सशिकतकयण
सतर न यी
31 सतर सशिकतकयण
9950 09 1570 07 760 06
32 न यीसशिकतकयण
29300 09 1910 09 736 04
33 भहहर सशिकतकयण
96300 09 5160 08 1091 03
34 औयतसशिकतकयण
7670 08 680 07 510 02
4 मसक दयक अ हक य
अ हक य
41 मसक दयक अ हक य
1990 04 18 03 60 06
42 मसक दयक अमबभ न
2400 06 304 02 16 01
43 मसक दयक घभ ड
495 05 54 06 9 01
5 व रग ओ व
51 व रग ओ 6960 1 698 08 29 05
52 ऩ डरग ओ 13400 1 1080 09 143 09
53 दयखतरग ओ 481 05 19 05 0 0
6 आ खद न आ ख
61 आ खद न 34000 07 3690 05 312 03
62 न तरद न 77500 1 3240 1 159 09
63 च द न 2450 08 427 01 36 01
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 132
Figure 43 Comparison of precision values against three search engines
From the examples above, it is observed that supplementing Hindi keywords with their synonyms improves information retrieval for a query in Hindi. Using synonyms of Hindi keywords affects not only the quantity of documents returned but also their quality.

From the above table and figure, Google returns more documents than the other two search engines, and Guruji returns the fewest; the reason may be that Guruji has fewer documents available or indexes them poorly. However, we are more interested in the quality of results than in their quantity. In terms of quality, Google and Bing clearly provide better results than Guruji, and on average Google ranks first: its precision values are higher than those of Bing and Guruji. It is thus clear that, by changing a keyword into its synonym, equivalent results can be obtained. Therefore, synonyms of keywords play an important role in Hindi information retrieval.
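The synonym substitution described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the experiments: the `SYNONYMS` table and the function name `expand_with_synonyms` are hypothetical, and the Devanagari spellings are our reading of the first query group in Table 4.21.

```python
from itertools import product

# Hypothetical synonym table (assumption); a real system would load a Hindi thesaurus.
SYNONYMS = {
    "आभूषण": ["गहने", "जेवर", "अलंकार", "आभरण"],
}


def expand_with_synonyms(query, synonyms=SYNONYMS):
    """Return the original query first, followed by its synonym variants."""
    words = query.split()
    # Each position may keep its word or take any listed synonym.
    options = [[word] + synonyms.get(word, []) for word in words]
    variants = []
    for combo in product(*options):
        candidate = " ".join(combo)
        if candidate not in variants:
            variants.append(candidate)
    return variants


for variant in expand_with_synonyms("सोने के आभूषण"):
    print(variant)
```

Each variant can then be submitted to a search engine separately, which is how the per-variant document counts and precision values in the table would be collected.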
4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In Table 4.22, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query for the respective context of the ambiguous keyword.
Table 4.22 List of randomly selected ambiguous queries

| S. No. | Query | Keyword as | In English | Keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |
Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned | Context 1 | Results found | Context 2 | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | n/a | Snake head | n/a | n/a |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the tables above, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. The last column, labeled "Other context", holds the number of results that are not relevant to the query supplied or that contain the keyword in some other, unintended context. The results show that all the search engines return documents in different contexts; search engines therefore underperform when supplied with ambiguous queries. The numbers in the "Other context" column indicate the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the "art" context. However, Google returns all 10 documents in the unintended context (snake head) and Bing returns 9, whereas Guruji fails to retrieve a single document. In this scenario it becomes important for search engines to address the ambiguity of keywords in order to obtain better results.
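The per-engine comparison for one ambiguous query can be worked through as a short calculation. The sketch below assumes that the first context listed for a query is the intended one and that the three counts cover the ten inspected results; the class name `AmbiguityResult` and its field names are illustrative, and the figures are those reported for query 5 (aggregate destruction in wars).

```python
from dataclasses import dataclass


@dataclass
class AmbiguityResult:
    engine: str
    intended: int   # top results in the intended sense of the keyword
    alternate: int  # top results in the second listed sense
    other: int      # top results counted under "Other context"

    def shares(self):
        """Fraction of inspected results falling into each category."""
        total = self.intended + self.alternate + self.other
        return {
            "intended": self.intended / total,
            "alternate": self.alternate / total,
            "other": self.other / total,
        }


# Counts for query 5 as reported for Google, Bing and Guruji.
QUERY_5 = [
    AmbiguityResult("Google", 2, 3, 5),
    AmbiguityResult("Bing", 0, 2, 8),
    AmbiguityResult("Guruji", 0, 0, 10),
]

for row in QUERY_5:
    s = row.shares()
    print(f"{row.engine}: intended {s['intended']:.1f}, other context {s['other']:.1f}")
```

Read this way, the "Other context" share is the deviation from relevance discussed above: the larger it is, the fewer of the top results match either listed sense of the keyword.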
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example illustrates.

Example: The English word "exercise" is written in Hindi as (एकसयस इज). The word "exercise" (एकसयस इज) has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested in the experiments.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi (search engine: Google)

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Results: Google transliteration | Results: standard keyword | Results: equivalent 1 | Results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

From the above table, it can be observed that the search engine returns documents for a single-keyword query, and that it also returns a large number of documents for every phonetic variant of the keyword. It can also be seen that people have their own ways of writing such words in Hindi and that no standard is followed for storing Hindi data on the web: documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the table, the column of bold Hindi entries, headed Google Transliteration, shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not produce the correct Hindi word in most cases. For example, the correct transliteration of the word "University" should be म तनवमसमटी, whereas Google transliteration produces उतनव मसमतम, which is completely wrong. Nevertheless, 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). This suggests that Hindi website developers use unchecked, non-standard transliteration, which makes Hindi IR a difficult task.

In the next example, multiword Hindi queries are selected to test the effect of English influence on the precision and quantity of documents retrieved. The software (its design and working are discussed in the next chapter) transforms each Hindi query into its variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.
The queries are transformed at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by English equivalents. In both cases the meaning of the original Hindi query is unchanged. Example:

The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore, the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms that mix English and Hindi while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented in Table 4.27.
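The two-level transformation can be sketched as follows. This is a minimal illustration under stated assumptions, not the software described in the next chapter: the `ENGLISH_EQUIVALENTS` dictionary and the function name `transform_levels` are hypothetical, Level 1 is taken to replace the first keyword that has an equivalent, and Level 2 is taken to replace every such keyword.

```python
# Hypothetical mapping from Hindi keywords to English loanwords in Devanagari (assumption).
ENGLISH_EQUIVALENTS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}


def transform_levels(query, equivalents=ENGLISH_EQUIVALENTS):
    """Return (level_1, level_2) variants of a whitespace-tokenised Hindi query."""
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in equivalents]

    # Level 1: replace only the first keyword that has an English equivalent.
    level_1 = list(words)
    for i in replaceable[:1]:
        level_1[i] = equivalents[words[i]]

    # Level 2: replace every keyword that has an English equivalent.
    level_2 = [equivalents.get(word, word) for word in words]

    return " ".join(level_1), " ".join(level_2)


level_1, level_2 = transform_levels("विदेशी निवेश भारत में")
print(level_1)
print(level_2)
```

Words without an entry in the mapping, such as भारत and में here, pass through unchanged, so the word order and meaning of the query are preserved.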
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi (tabular representation; search engine: Google)

| In English | Hindi query | Level 1 | Level 2 | Results: Hindi query | Results: Level 1 | Results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (Google result counts for Hindi Queries 1 to 6, each shown for the original query, Level 1 and Level 2)
From the above table and figure, it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity, so Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, were used to retrieve the web results.
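Table 4.28 Analysis of precision values (tabular representation; precision at 10, HQ = original Hindi query, L1 = Level 1, L2 = Level 2)

| Hindi query | Level 1 | Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8 |
| रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5 |
| हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9 | 0.8 | 1.0 | 0.9 | 0.7 | 1.0 | 0.9 | 0.7 | 1.0 |
| सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1.0 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8 |
| विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4 |
| भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |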
Figure 4.5 Analysis of inter-precision values
From the above tables and figures, it can be seen that Hindi data of a similar nature can be retrieved by transforming Hindi queries into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in the data retrieved. The relevance of the retrieved data can be seen in the precision columns. For most Hindi queries and their transformed variants, the degree of relevance of the documents is very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 Google results are relevant, and for the similar transformed queries हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 are also relevant; a similar pattern holds for most of the other Hindi queries in the table above.

Without transformation of Hindi queries, the user may miss relevant information, because a Hindi user may not be aware that such information exists on the web and may be unable to formulate the query variants that reflect English influence. The above table also indicates that English-influenced Hindi information is present on the web and is increasing day by day. Including English keywords written in Hindi script in the query widens the scope of searching in Hindi and of obtaining relevant information.
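To compare the levels across all six queries rather than row by row, the precision values can be averaged per engine. The sketch below does this for the values reported in Table 4.28; the dictionary name `PRECISION_TENTHS` and the function `mean_precision` are illustrative, and the values are stored as tenths to keep the arithmetic in integers.

```python
# Precision at 10 in tenths, one tuple per query: (original query, Level 1, Level 2).
PRECISION_TENTHS = {
    "Google": [(9, 9, 9), (8, 9, 7), (9, 8, 10), (10, 9, 8), (9, 8, 8), (10, 10, 10)],
    "Bing": [(9, 8, 8), (8, 7, 5), (9, 7, 10), (8, 8, 8), (9, 8, 4), (10, 10, 10)],
    "AltaVista": [(9, 9, 8), (8, 7, 5), (9, 7, 10), (6, 7, 8), (9, 9, 4), (10, 10, 10)],
}


def mean_precision(rows):
    """Mean precision at 10 for the original query, Level 1 and Level 2."""
    n = len(rows)
    return tuple(sum(row[i] for row in rows) / (10 * n) for i in range(3))


for engine, rows in PRECISION_TENTHS.items():
    original, level_1, level_2 = mean_precision(rows)
    print(f"{engine}: original {original:.2f}, level 1 {level_1:.2f}, level 2 {level_2:.2f}")
```

With these values the mean precision on Google is about 0.92 for the original queries, 0.88 at Level 1 and 0.87 at Level 2, so the transformed queries stay close to the original in relevance while reaching additional documents.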
Figure 4.5 legend: inter-precision chart; HQ = Hindi Query (HQ1 to HQ6), L = Level (L1, L2); series: Google, Bing, AltaVista.
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. People generally do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor look for keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to reduce the effort a Hindi user needs to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in Hindi [66][67].
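The idea of handling query variants at the interface level, rather than inside the search engine, can be illustrated with a small sketch. It is not the design of the tool described in the next chapter: the function `search_with_variants`, the stub index and the stub callables are all hypothetical, and a real interface would call an actual search engine in place of `fake_search`.

```python
def search_with_variants(query, make_variants, search, top_k=10):
    """Send the query and its variants to `search` and merge the top results.

    Returns (result, matching_query) pairs without duplicate results.
    """
    queries = [query] + [v for v in make_variants(query) if v != query]
    merged = []
    seen = set()
    for q in queries:
        for result in search(q)[:top_k]:
            if result not in seen:
                seen.add(result)
                merged.append((result, q))
    return merged


# Stub data standing in for a search engine (assumption, for illustration only).
FAKE_INDEX = {
    "विदेशी निवेश भारत में": ["doc-a", "doc-b"],
    "फॉरेन निवेश भारत में": ["doc-b", "doc-c"],
}


def fake_search(q):
    return FAKE_INDEX.get(q, [])


def fake_variants(q):
    return ["फॉरेन निवेश भारत में"] if q == "विदेशी निवेश भारत में" else []


results = search_with_variants("विदेशी निवेश भारत में", fake_variants, fake_search)
for result, matched_query in results:
    print(result, "<-", matched_query)
```

Because the variants are generated and merged outside the engine, the engine itself is queried unchanged, which is the sense in which the problem is solved at the interface level.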
![Page 47: Chapter 4 Issues in Information Retrieval for Hindi Language](https://reader034.fdocuments.in/reader034/viewer/2022042707/5890454e1a28abc4618b47bd/html5/thumbnails/47.jpg)
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 132
Figure 4.3 Comparison of precision values against three search engines
[Chart: precision on the vertical axis (0 to 1); horizontal axis labels 11 through 63; legend entries recovered from the source: Bing, Guruji]

From the examples above, it is observed that using Hindi keywords together with their synonyms improves information retrieval for a query in the Hindi language. Synonyms of Hindi keywords affect not only the quantity of documents returned but also their quality.

From the above table and figure, it can be observed that Google returns more documents than the other two search engines and that Guruji returns the fewest; the reason may be that fewer documents are available to it or that its indexing is poor. However, we are more interested in the quality of results than in their quantity. As far as quality is concerned, Google and Bing clearly provide better results than Guruji. In the average case Google again ranks first, which means that its precision values are higher than those of Bing and Guruji. It thus becomes clear that replacing a keyword with its synonym yields equivalent results. Therefore, synonyms of keywords play an important role in Hindi information retrieval.

4.12 Discussion: Ambiguity
From a sample set of 50 ambiguous queries, five randomly selected queries are presented below. In figure 35, the second column contains the five queries in Hindi, the third column holds the ambiguous keyword in one context, and the fifth column holds the same keyword in the other context. The fourth and sixth columns give the English meaning of the query with respect to the ambiguous keyword in each context.
Table 4.22 List of randomly selected ambiguous queries

| S.No. | Query | Keyword as | In English | Keyword as | In English |
|---|---|---|---|---|---|
| 1 | नारी को सोना पसंद है | सोना (gold) | Women like gold | सोना (to sleep) | Women like to sleep |
| 2 | आम लोगों की पसंद | आम (common) | Common man's choice | आम (mango) | Mango is people's choice |
| 3 | बाल विकास और पोषण | बाल (children) | Child development and nutrition | बाल (hair) | Hair development and nutrition |
| 4 | सपेरों का फन | फन (art) | Art of snake charmers | फन (snake head) | Snake charmer's snake head |
| 5 | युद्ध में कुल विनाश | कुल (aggregate) | Aggregate destruction in wars | कुल (family) | Destruction of families in war |

The ambiguous queries listed above were tested against three search engines: Google, Bing and Guruji. The results are shown in the tables below.

Table 4.23 Ambiguity test for Google

| Query | Ambiguous keyword | Documents returned (Google) | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 50800 | Gold | 5 | To sleep | 2 | 3 |
| 2 | आम | 488000 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 2900000 | Children | 7 | Hair | 3 | 0 |
| 4 | फन | 184 | Art | 0 | Snake head | 10 | 0 |
| 5 | कुल | 17800 | Aggregate | 2 | Family | 3 | 5 |

Table 4.24 Ambiguity test for Bing

| Query | Ambiguous keyword | Documents returned (Bing) | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 2680 | Gold | 2 | To sleep | 2 | 6 |
| 2 | आम | 17800 | Common | 3 | Mango | 3 | 4 |
| 3 | बाल | 4030 | Children | 6 | Hair | 2 | 2 |
| 4 | फन | 25 | Art | 0 | Snake head | 9 | 1 |
| 5 | कुल | 1900 | Aggregate | 0 | Family | 2 | 8 |
Table 4.25 Ambiguity test for Guruji

| Query | Ambiguous keyword | Documents returned (Guruji) | Context | Results found | Context | Results found | Other context |
|---|---|---|---|---|---|---|---|
| 1 | सोना | 109 | Gold | 0 | To sleep | 0 | 10 |
| 2 | आम | 6756 | Common | 3 | Mango | 0 | 7 |
| 3 | बाल | 635 | Children | 5 | Hair | 2 | 3 |
| 4 | फन | No results found | Art | na | Snake head | na | na |
| 5 | कुल | 84 | Aggregate | 0 | Family | 0 | 10 |

From the results in the above tables, it is observed that all three search engines return documents without differentiating between the contexts of the keyword in the query. In each table, the last column, labeled "Other context", holds the number of results that are not relevant to the query supplied, that is, documents that contain the keyword in some other, non-required context. The results make it clear that all the search engines return documents from different contexts, so search engines can be said to underperform when supplied with ambiguous queries. The numbers in the "Other context" column signify the deviation from relevance. For example, for the query युद्ध में कुल विनाश (aggregate destruction in wars), the "Other context" column contains 5 documents for Google, 8 for Bing and all 10 for Guruji.

For another query, सपेरों का फन (art of snake charmers; other context: snake charmer's snake head), the retrieved documents are expected to be in the context of art. However, Google returns all 10 documents in the non-required context (snake head) and Bing returns 9, whereas Guruji fails to retrieve even a single document. In such a scenario it becomes important for search engines to address the issue of ambiguity in keywords in order to obtain better results.
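The "deviation from relevance" described above can be summarised per search engine. The following is a minimal sketch, not the method used in the study: it copies the top-10 counts from Tables 4.23 to 4.25 and computes the share of inspected results that fall in the "Other context" column. The names `RESULTS` and `other_context_share` are illustrative assumptions, and the query for which Guruji returned no results is simply skipped.

```python
# Top-10 result counts copied from Tables 4.23 to 4.25.
# Each tuple is (first context, second context, other context);
# None marks a query for which the engine returned no results.
RESULTS = {
    "Google": [(5, 2, 3), (3, 3, 4), (7, 3, 0), (0, 10, 0), (2, 3, 5)],
    "Bing": [(2, 2, 6), (3, 3, 4), (6, 2, 2), (0, 9, 1), (0, 2, 8)],
    "Guruji": [(0, 0, 10), (3, 0, 7), (5, 2, 3), None, (0, 0, 10)],
}


def other_context_share(rows):
    """Fraction of inspected results counted in the 'Other context' column."""
    scored = [row for row in rows if row is not None]
    inspected = sum(sum(row) for row in scored)
    other = sum(row[2] for row in scored)
    return other / inspected


if __name__ == "__main__":
    for engine, rows in RESULTS.items():
        print(engine, round(other_context_share(rows), 2))
```

A higher share means that more of the inspected results matched neither listed sense of the ambiguous keyword.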
4.13 Discussion: Influence of English on Hindi Information Retrieval
English influences Hindi not only in speech but also in writing, and this is especially evident in Hindi literature on the web. The influence of English on Hindi has been observed to be one of the most important parameters for Hindi information retrieval, as the following example shows.

Example: The English word "exercise" is written in Hindi as एकसयस इज. The word has the following phonetic variations in Hindi: एकसयस इज, एकसयस इज, एकस यस इज, एकसयस इज. Following such phonetic variations, a variety of popular keywords and queries from various domains were tested experimentally.

The following table shows a sample set of common and popular English keywords along with their phonetic variants written in Hindi.
Table 4.26 Keywords along with their phonetic variants written in Hindi

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | Google results: standard | Google results: equivalent 1 | Google results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Woman | वोभन | व भन | व भ न | व भ न | 42200 | 101000 | 3520 | 34800 |
| Insurance | इनसयाॊस | इ शम यस | इ शम य स | इनशम य नस | 1220 | 246000 | 1390 | 3710 |
| Cancer | क सय | क सय | क नसय | क नसय | 1880000 | 35700 | 6490 | 1820 |
| Hospital | हॉसटऩटर | ह िसऩटर | ह िसऩटर | ह सऩ टर | 404000 | 263200 | 2440 | 100 |
| Corruption | कोररसततओन | कयपशन | कयऩशन | क यपशन | 1890 | 368000 | 1110 | 58 |
| Computer | कॊ तमटय | कमपम टय | कमपम टय | क पम टय | 4450000 | 1040000 | 2610000 | 537000 |
| University | उननवशसतम | म तनवमसमटी | म तनव मसमटी | म तनवसमटी | 1540 | 1070000 | 1420 | 3270 |
| Director | डियकटय | ड मय कटय | ड इय कटय | ड मय कटय | 4600 | 735000 | 97000 | 8300 |
| Accident | एसकसिट | एकस डट | एकस ड नट | ऐिकसडट | 26300 | 125000 | 2510 | 5160 |
| Parliament | ऩशरअभट | ऩ मरमम भट | ऩ मरमभट | ऩ मरमम भट | 3240 | 38000 | 639 | 1120 |
| Specialist | टऩशसअशरटट | सऩ मशममरसट | सऩ शमरसट | सऩ श मरसट | 143 | 1440 | 51300 | 9820 |
| Expert | एकसऩट | एकसऩटम | ऐकसऩटम | ऐकसऩटम | 269000 | 8730 | 182 | 8 |
From the above table it can be observed that the search engine returns documents not only for the single-keyword query itself but also, in large numbers, for all phonetic variants of the keyword. It can also be seen that people represent Hindi words in their own ways and that no standard is followed for storing Hindi data on the web. Documents are retrieved for every phonetic variant of an English keyword written in Hindi script. In the above table, the Google transliteration column shows the keywords obtained with the Google transliteration tool. Table 4.26 shows that transliteration does not provide the correct Hindi word in most cases. For example, the correct transliteration of the word University should be म तनवमसमटी, whereas Google transliteration provides the word उतनव मसमतम, which is completely wrong. The table shows that 1540 documents were retrieved for the wrong keyword उतनव मसमतम (University), and the same holds for other single-word queries: Insurance (इनस य स), Parliament (ऩमरमअभट), Corruption (क रम िपतओन), Policy (ऩ मरसम) and Specialist (सऩ मसअमरसट). It can be inferred that Hindi website developers make use of unchecked and non-standard transliteration, which makes Hindi IR a difficult task.
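To give a rough sense of how much a user can miss by searching a single spelling, the counts in Table 4.26 can be compared per keyword. The sketch below is illustrative only: `COUNTS` and `missed_share` are hypothetical names, only four rows of the table are copied, and it assumes that the result sets for different spellings do not overlap, which search-engine result counts do not guarantee.

```python
# Google result counts copied from four rows of Table 4.26, in column order:
# (Google transliteration, standard Hindi keyword,
#  phonetic equivalent 1, phonetic equivalent 2)
COUNTS = {
    "Woman": (42200, 101000, 3520, 34800),
    "Insurance": (1220, 246000, 1390, 3710),
    "University": (1540, 1070000, 1420, 3270),
    "Expert": (269000, 8730, 182, 8),
}


def missed_share(counts, searched_index=1):
    """Share of all counted documents that lie outside the one spelling searched.

    searched_index=1 means the user typed only the standard Hindi keyword.
    Assumes the per-spelling result sets are disjoint (an approximation).
    """
    total = sum(counts)
    return (total - counts[searched_index]) / total


if __name__ == "__main__":
    for word, counts in COUNTS.items():
        print(word, round(missed_share(counts), 3))
```

Under that assumption, a value close to 1 indicates that most documents are indexed under a spelling other than the one searched.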
Table 4.26 (continued)

| English word | Google transliteration | Standard Hindi keyword | Phonetic equivalent 1 | Phonetic equivalent 2 | Google results: transliteration | Google results: standard | Google results: equivalent 1 | Google results: equivalent 2 |
|---|---|---|---|---|---|---|---|---|
| Policy | ऩोशरटम | ऩॉमरस | ऩ मरस | ऩ मरस | 609 | 395000 | 69700 | 289000 |

In the next example, multiword Hindi queries are selected to test the effect of English influence on Hindi IR in terms of precision and the quantity of documents retrieved. The software (whose design and working are discussed in the next chapter) transforms each Hindi query into its variants by replacing Hindi keywords with English keywords written in Hindi script, without changing the meaning of the query.
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by the English equivalents. In both cases the meaning of the original Hindi query is unchanged. Example:

The English query "Foreign investment in India" can be written in Hindi as "विदेशी निवेश भारत में", where the Hindi keyword विदेशी means "foreign" (फॉरेन) and निवेश means "investment" (इन्वेस्टमेंट). The query is transformed at the two levels as follows:

फॉरेन निवेश भारत में (Foreign nivesh Bharat mein)
फॉरेन इन्वेस्टमेंट भारत में (Foreign investment Bharat mein)

Therefore the original Hindi query "विदेशी निवेश भारत में" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent forms containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
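The two-level transformation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the software described in the next chapter: the lookup table `ENGLISH_EQUIVALENTS` and the function `transform_query` are hypothetical, the table holds only the two keywords from the example, and queries are assumed to be split on whitespace.

```python
# Hypothetical lookup table: Hindi keyword -> English equivalent in Devanagari.
# The two entries follow the "Foreign investment in India" example in the text.
ENGLISH_EQUIVALENTS = {
    "विदेशी": "फॉरेन",
    "निवेश": "इन्वेस्टमेंट",
}

QUERY = "विदेशी निवेश भारत में"


def transform_query(query, equivalents=ENGLISH_EQUIVALENTS):
    """Return (level1, level2) variants of a whitespace-separated Hindi query.

    Level 1 replaces only the first keyword that has an English equivalent.
    Level 2 replaces every keyword that has one.
    Words without an entry in the table are left unchanged.
    """
    words = query.split()
    replaceable = [i for i, word in enumerate(words) if word in equivalents]

    level1 = list(words)
    if replaceable:
        first = replaceable[0]
        level1[first] = equivalents[words[first]]

    level2 = [equivalents.get(word, word) for word in words]
    return " ".join(level1), " ".join(level2)


if __name__ == "__main__":
    level1, level2 = transform_query(QUERY)
    print(level1)
    print(level2)
```

If a query contains only one replaceable keyword, both levels come out identical in this sketch; a fuller implementation would need a larger dictionary and some handling of inflected forms.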
Table 4.27 Transformed queries into two equivalent senses containing a mixture of both English and Hindi: tabular representation

| In English | Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google results: Hindi query | Google results: Level 1 | Google results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for blood pressure | रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government employment policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses
[Charts: one per query (Hindi Query 1 to Hindi Query 6), each showing the number of results for the Hindi query, Level 1 and Level 2]
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the quantity of retrieved documents is considerable. For search engines, however, the quality of results is more important than the quantity; Table 4.28 and Figure 4.5 are therefore presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.

Table 4.28 Analysis of precision values: tabular representation

Each precision cell lists Precision@10 for the Hindi query, Level 1 and Level 2, in that order.

| Hindi query | Influenced Hindi query: Level 1 | Influenced Hindi query: Level 2 | Google | Bing | AltaVista |
|---|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9 / 0.9 / 0.9 | 0.9 / 0.8 / 0.8 | 0.9 / 0.9 / 0.8 |
| रक्तचाप का इलाज | ब्लडप्रेशर का इलाज | ब्लडप्रेशर का ट्रीटमेंट | 0.8 / 0.9 / 0.7 | 0.8 / 0.7 / 0.5 | 0.8 / 0.7 / 0.5 |
| हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9 / 0.8 / 1 | 0.9 / 0.7 / 1 | 0.9 / 0.7 / 1 |
| सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1 / 0.9 / 0.8 | 0.8 / 0.8 / 0.8 | 0.6 / 0.7 / 0.8 |
| विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 0.9 / 0.8 / 0.8 | 0.9 / 0.8 / 0.4 | 0.9 / 0.9 / 0.4 |
| भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1 |
Figure 4.5 Analysis of inter-precision values

From the above tables and figures it can be clearly seen that Hindi data of a similar nature can be mined for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries increases the amount of data retrieved, and the relevance of the retrieved data can be seen in the precision columns. For a Hindi query and its transformed variants, the degree of relevance of the documents is in most cases very close, equal or improved. For example, for the Hindi query स्वास्थ्य और रक्तदान, 9 of the first 10 documents are relevant, and for the transformed queries of a similar nature, हेल्थ और रक्तदान and हेल्थ और ब्लड_डोनेशन, 9 of the first 10 documents are again relevant; a similar pattern repeats for the rest of the Hindi queries shown in the table above. Without transformation of Hindi queries, the user may miss relevant information, since a Hindi user may not be aware that such information is present on the web and may be unable to formulate the query variants that arise from English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords written in Hindi script in the query, the scope of searching in Hindi and of obtaining relevant information can be increased.
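The precision figures discussed above are Precision@10 values, that is, the number of relevant documents among the first 10 results divided by 10. The sketch below copies the values from Table 4.28 and averages them per engine and per level; `PRECISION`, `precision_at_10` and `mean_by_level` are illustrative names, and the averaging is an added summary rather than a calculation taken from the study.

```python
# Precision@10 values copied from Table 4.28.
# Per engine: one (Hindi query, Level 1, Level 2) triple for each of six queries.
PRECISION = {
    "Google": [(0.9, 0.9, 0.9), (0.8, 0.9, 0.7), (0.9, 0.8, 1.0),
               (1.0, 0.9, 0.8), (0.9, 0.8, 0.8), (1.0, 1.0, 1.0)],
    "Bing": [(0.9, 0.8, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
             (0.8, 0.8, 0.8), (0.9, 0.8, 0.4), (1.0, 1.0, 1.0)],
    "AltaVista": [(0.9, 0.9, 0.8), (0.8, 0.7, 0.5), (0.9, 0.7, 1.0),
                  (0.6, 0.7, 0.8), (0.9, 0.9, 0.4), (1.0, 1.0, 1.0)],
}


def precision_at_10(relevant_in_top_10):
    """Precision@10: relevant documents among the first 10 results, over 10."""
    return relevant_in_top_10 / 10


def mean_by_level(triples):
    """Mean precision for (Hindi query, Level 1, Level 2), rounded to 3 places."""
    count = len(triples)
    return tuple(round(sum(t[i] for t in triples) / count, 3) for i in range(3))


if __name__ == "__main__":
    for engine, triples in PRECISION.items():
        print(engine, mean_by_level(triples))
```

Comparing the three means per engine gives a compact view of how precision changes between the original query and its Level 1 and Level 2 variants.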
[Chart for Figure 4.5: Inter Precision Chart; horizontal axis: HQ1 to HQ6, each followed by L1 and L2 (HQ = Hindi Query, L = Level); series: Google, Bing, AltaVista]
The process of Hindi IR is made more difficult by the structure of the Hindi language. In general, people do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.

Relevant information can be mined by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because handling parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords at the root level may cause performance and throughput problems. However, this problem can be solved at the interface level. Therefore, to reduce the effort a Hindi user must spend to find such information, a software tool has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
The queries are converted at two levels. At the first level, one Hindi word is replaced by its English equivalent; at the second level, more than one word is replaced by its English equivalent. In both cases the meaning of the original Hindi query is unchanged.
Example: the English query "Foreign investment in India" can be written in Hindi as "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein), where the Hindi keyword षवद श means "Foreign" (प य न) and तनव श means "investment" (इनव सटभट). The query is transformed at the two levels as follows:
Level 1: प य न तनव श ब यत भ (Foreign nivesh Bharat mein)
Level 2: प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query "षवद श तनव श ब यत भ" (videshi nivesh Bharat mein) supplied by the user is transformed into two equivalent queries containing a mixture of English and Hindi, while the meaning of the query remains the same. From the sample set of one hundred queries, some randomly selected queries are presented below in Table 4.27.
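The two-level substitution described above can be illustrated with a minimal sketch. This is not the software discussed in the next chapter: the `LEXICON` mapping, the function name `transform` and the use of romanized tokens (in place of Devanagari strings) are assumptions made only for readability. Unlike the single Level 1 example given in the text, the sketch enumerates every possible one-word substitution.

```python
from itertools import combinations

# Hypothetical lexicon: Hindi keyword -> English equivalent.
# Romanized tokens stand in for the Devanagari strings a real system would use.
LEXICON = {
    "videshi": "foreign",
    "nivesh": "investment",
}


def transform(query, lexicon):
    """Return (level1, level2) lists of query variants.

    Level 1 replaces exactly one Hindi keyword with its English equivalent;
    Level 2 replaces two or more keywords.
    """
    tokens = query.split()
    # Positions of the tokens that have an English equivalent in the lexicon.
    replaceable = [i for i, tok in enumerate(tokens) if tok in lexicon]

    def substitute(positions):
        return " ".join(
            lexicon[tok] if i in positions else tok
            for i, tok in enumerate(tokens)
        )

    level1 = [substitute({i}) for i in replaceable]
    level2 = [
        substitute(set(combo))
        for size in range(2, len(replaceable) + 1)
        for combo in combinations(replaceable, size)
    ]
    return level1, level2


level1, level2 = transform("videshi nivesh bharat mein", LEXICON)
print(level1)
print(level2)
```

Words without an entry in the lexicon (here "bharat" and "mein") are left untouched, so a query with no replaceable keyword yields no variants at either level.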
Table 4.27: Queries transformed into two equivalent senses containing a mixture of both English and Hindi (tabular representation)

| In English | Hindi Query | Influenced Hindi Query: Level 1 | Influenced Hindi Query: Level 2 | Google results: Hindi Query | Google results: Level 1 | Google results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | सव सथ औय यकतद न | हलथ औय यकतद न | हलथ औय जरि_िोनिन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | यकतच ऩ क इर ज | जरिपरिय क इर ज | जरिपरिय क टरीटभट | 95800 | 37200 | 2150 |
| Cardiologist | रदम चचककतसक | हाट िॉकटय | काडिमोरासिटट | 241000 | 99100 | 1770 |
| Government Employment Policy | सयक य दव य य जग य म जन | सयक य दव य य जग य टकीभ | सयक य दव य एमपतरॉमभट सकीभ | 1840000 | 51600 | 93 |
| Foreign investment in India | षवद श तनव श ब यत भ | पायन तनव श ब यत भ | पायनइनवटटभट ब यत भ | 658000 | 527 | 66 |
| Corruption free India | भरषटट च य भ कत ब यत | कयतिन भ कत ब यत | कयतिन फरी इॊडिमा | 181000 | 19700 | 47000 |

[Figure 4.4: Transformed queries into two equivalent senses. Google result counts for Hindi Queries 1 to 6 and their Level 1 and Level 2 variants.]
From the above table and figure it is evident that documents are returned for the original as well as the transformed Hindi queries, and that the number of retrieved documents is considerable. For search engines, the quality of results is more important than the quantity; therefore Table 4.28 and Figure 4.5 are presented below for the analysis of precision values. Three popular search engines, namely Google, Bing and AltaVista, are used for retrieving web results.
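Table 4.28: Analysis of precision values (tabular representation). Precision is measured over the first 10 results.

| Query | Form | Query text | Google | Bing | AltaVista |
|---|---|---|---|---|---|
| HQ1 | Hindi Query | सव सथऔययकतद न | 0.9 | 0.9 | 0.9 |
| HQ1 | Level 1 | ह लथऔययकतद न | 0.9 | 0.8 | 0.9 |
| HQ1 | Level 2 | ह लथऔयबरड_ड न शन | 0.9 | 0.8 | 0.8 |
| HQ2 | Hindi Query | यकतच ऩक इर ज | 0.8 | 0.8 | 0.8 |
| HQ2 | Level 1 | बरडपर शयक इर ज | 0.9 | 0.7 | 0.7 |
| HQ2 | Level 2 | बरडपर शयक टरीटभट | 0.7 | 0.5 | 0.5 |
| HQ3 | Hindi Query | रदमचचककतसक | 0.9 | 0.9 | 0.9 |
| HQ3 | Level 1 | ह टमचचककतसक | 0.8 | 0.7 | 0.7 |
| HQ3 | Level 2 | क रड मम र िजसट | 1.0 | 1.0 | 1.0 |
| HQ4 | Hindi Query | सयक यदव य य जग यम जन | 1.0 | 0.8 | 0.6 |
| HQ4 | Level 1 | सयक यदव य य जग यसकीभ | 0.9 | 0.8 | 0.7 |
| HQ4 | Level 2 | सयक यदव य एमपरॉमभटसकीभ | 0.8 | 0.8 | 0.8 |
| HQ5 | Hindi Query | षवद श तनव शब यतभ | 0.9 | 0.9 | 0.9 |
| HQ5 | Level 1 | प य नतनव शब यतभ | 0.8 | 0.8 | 0.9 |
| HQ5 | Level 2 | प य नइनव सटभटब यतभ | 0.8 | 0.4 | 0.4 |
| HQ6 | Hindi Query | भरषटट च यभ कतब यत | 1.0 | 1.0 | 1.0 |
| HQ6 | Level 1 | कयपशनभ कतब यत | 1.0 | 1.0 | 1.0 |
| HQ6 | Level 2 | कयपशनफरीइ रडम | 1.0 | 1.0 | 1.0 |

Figure 4.5: Analysis of inter-precision values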
From the above tables and figures it can be seen that Hindi data of a similar nature can be retrieved for Hindi queries by transforming them into variants that include English keywords written in Hindi script. The transformation of queries resulted in an increase in retrieved data. The relevance of the retrieved data can be seen in the precision columns. For a Hindi query and its transformed variations, the degree of relevance of the documents is generally close, equal or improved. For example, for the Hindi query सव सथ औय यकतद न, 9 of the first 10 documents returned by Google are relevant, and for the transformed queries of similar nature, ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन, 9 of the first 10 documents are again relevant. A similar pattern holds for the rest of the Hindi queries, as shown in the table above. Without transformation of Hindi queries, the user may miss the chance of retrieving relevant information, because the Hindi user may not be aware that such information exists on the web and may be unable to formulate a query variation based on English influence. From the above table it can be said that English-influenced Hindi information is present on the web and is increasing day by day. By including English keywords in Hindi script in the query, the scope of searching in Hindi and of getting relevant information can be increased.
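The precision values discussed above are fractions of relevant documents among the first 10 results. As a minimal sketch of that calculation, the snippet below computes precision over the first k results from a list of relevance judgments. The function name `precision_at_k` and the sample judgments are illustrative assumptions, not data from the experiment.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k results judged relevant.

    `relevance` is a list of 0/1 judgments in ranked order
    (1 = relevant, 0 = not relevant). If fewer than k results
    were returned, the missing ones count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


# Hypothetical judgments for one query: 9 of the first 10 results are relevant.
judgments = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(precision_at_k(judgments))  # 0.9
```

Applying the same function to the judgments for the original query and for each transformed variant gives values that can be compared row by row, as in Table 4.28.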
[Inter-precision chart: precision values for HQ1 to HQ6 (HQ = Hindi Query) and their Level 1 (L1) and Level 2 (L2) variants on Google, Bing and AltaVista.]
The process of Hindi IR becomes more difficult because of the structure of the Hindi language. Generally, people do not follow the standard Hindi writing conventions, which widens the gap between Hindi web data and users.
The relevant information can be retrieved by transforming the Hindi queries. Search engines neither transform the query nor find keyword equivalents, because they may face performance and throughput problems if parameters such as Hindi phonetics, synonyms and English-equivalent Hindi keywords are implemented at the root level. However, this problem can be solved at the interface level. Therefore, to lessen the effort required of a Hindi user searching for such information, software has been developed (described in detail in the next chapter) that acts as an interface between the user and the search engines. With the help of this tool, the user can widen the scope of search on the web in the Hindi language [66][67].
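The idea of handling query variants at the interface level, rather than inside the search engine, can be sketched as a thin wrapper that submits each variant separately and merges the results. This is an illustration under stated assumptions and not the software described in the next chapter: `search_with_variants`, the `search` callback and the stub engine below are hypothetical names, and no real search-engine API is called.

```python
from typing import Callable, Iterable, List


def search_with_variants(
    variants: Iterable[str],
    search: Callable[[str], List[str]],
    limit: int = 10,
) -> List[str]:
    """Query the engine once per variant and merge the result URLs.

    Only the first `limit` results of each variant are used, and a URL
    returned by several variants is kept once, in first-seen order.
    """
    seen = set()
    merged = []
    for query in variants:
        for url in search(query)[:limit]:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Stub engine standing in for a real search backend (hypothetical data).
FAKE_INDEX = {
    "videshi nivesh bharat mein": ["doc1", "doc2"],
    "foreign nivesh bharat mein": ["doc2", "doc3"],
    "foreign investment bharat mein": ["doc4"],
}


def stub_search(query):
    return FAKE_INDEX.get(query, [])


print(search_with_variants(FAKE_INDEX.keys(), stub_search))
```

Because the wrapper only needs a function that maps a query string to a ranked list, the search engine itself is left unchanged, which is the point of solving the problem at the interface level.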
![Page 52: Chapter 4 Issues in Information Retrieval for Hindi Language](https://reader034.fdocuments.in/reader034/viewer/2022042707/5890454e1a28abc4618b47bd/html5/thumbnails/52.jpg)
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 137
The queries are converted into two levels In first level one Hindi word is replaced
by its English equivalent and in second level more than one words is replaced by
their English equivalent words without changing the meaning of the original
Hindi query Example
An English query ―Foreign investment in India can be written in Hindi as
―षवद श तनव श ब यत भ where Hindi keyword षवद श means ―Foreign ―प य न
and तनव श means ―investment ―इनव सटभट The query for the two levels is
transformed as
प य न तनव श ब यत भ (Foreign nivesh bharat mein)
प य न इनव सटभट ब यत भ (Foreign investment Bharat mein)
Therefore the original Hindi query ―षवद श तनव श ब यत भ ―videshi
nivesh Bharat mein supplied by the user is transformed into two equivalent
senses containing a mixture of both English and Hindi language where meaning
of the query remains same From the sample set of one hundred queries some
randomly selected queries are presented below in Table 427
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 138
Table 427 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi Tabular representation
Figure 44 Transformed Queries into two equivalent senses
1520
8360
1020
Hindi Query 1
Level 1
Level 2
95800
37200
2150
Hindi Query 2
Level 1
Level 2
209000
85800
1660
Hindi Query 3
Level 1
Level 2
112000
0
49000271
Hindi Query 4
Level 1
Level 2
658000
52766
Hindi Query 5
Level 1
Level 2
181000
19700
47000
Hindi Query 6
Level 1
Level 2
Influenced Hindi Query Google
In English Hindi Query Level 1 Level 2 Search Results
Health and blood donation
सव सथ औय यकतद न
हलथ औय यकतद न
हलथ औय जरि_िोनिन
1520 8360 1020
Treatment for Blood pressure
यकतच ऩ क इर ज
जरिपरिय क इर ज
जरिपरिय क टरीटभट
95800 37200 2150
Cardiologist रदम चचककतसक हाट िॉकटय काडिमोरासिटट 241000 99100 1770
Government Employment Policy
सयक य दव य य जग य म जन
सयक य दव य य जग य टकीभ
सयक य दव य एमपतरॉमभट सकीभ
1840000 51600 93
Foreign investment in
India
षवद श तनव श ब यत भ
पायन तनव श ब यत भ
पायनइनवटटभट ब यत भ
658000 527 66
Corruption free India
भरषटट च य भ कत ब यत
कयतिन भ कत ब यत
कयतिन फरी इॊडिमा
181000 19700 47000
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 139
From the above table and figure it is evident that documents are returned for
original as well as transformed Hindi query and the quantity of the retrieved
documents is quite considerable In case of search engines the quality of results is
more important than the quantity therefore Table 428 and figure 45 are presented
below for the analysis of precision values Three popular search engines namely
Google Bing and Alta Vista are used for retrieving web results
Table 428 Analysis of precision values Tabular representation
Hindi
Query
Influenced Hindi Query Precision 10
Level 1 Level 2 Google Bing AltaVista
सव सथऔययकतद न
ह लथऔययकतद न
ह लथऔयबरड_ड न शन
09 09 09 09 08 08 09 09 08
यकतच ऩक इर ज
बरडपर शयक इर ज
बरडपर शयक टरीटभट
08 09 07 08 07 05 08 07 05
रदमचचककतसक
ह टमचचककतसक
क रड मम र िजसट 09 08 1 09 07 1 09 07 1
सयक यदव य य जग यम जन
सयक यदव य य जग यसकीभ
सयक यदव य एमपरॉमभटसकीभ 1 09 08 08 08 08 06 07 08
षवद श तनव शब यतभ
प य नतनव शब यतभ
प य नइनव सटभटब यतभ
09 08 08 09 08 04 09 09 04
भरषटट च यभ कतब यत
कयपशनभ कतब यत
कयपशनफरीइ रडम
1 1 1 1 1 1 1 1 1
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 140
Figure 45 Analysis of Inter precision values
From the above tables and figures it can be clearly seen that Hindi data of similar
nature can be mined out against Hindi queries by transforming them into their
variants by including English keywords written in Hindi The transformation of
queries resulted in an increase of retrieved data The relevance of the retrieved
data can also be seen in the precision column For every Hindi query and its
transformed variations the degree of relevance of documents is very close or
equal or improved eg for the Hindi query सव सथ औय यकतद न 9 out of first 10
documents are relevant and for transformed queries which are of similar nature
ह लथ औय यकत द न and ह लथ औय बर ड_ड न शन 9 of the first 10 documents are
relevant and the same repeats for rest of the Hindi queries as shown in the table
above Without transformation of Hindi queries the user may miss the chance of
retrieving the relevant information as the Hindi user may not be aware of the
presence of such information on web and may be unable to formulate the
variation query based on the factor of English influence From the above table it
can be said that English influenced Hindi information is present and is increasing
day by day on web By the inclusion of the English keywords in Hindi script in
the form of query the scope of searching in Hindi and getting relevant
information can be increased
09 09 09 08 09 07 09 08 1 1 09 08 09 08 08 1 1 1
09 08 08 08 07 0509 07
1 08 08 08 09 08 041 1 1
09 09 08 08 0705
0907
106 07 08 09 09
04
1 1 1
HQ1 L1 L2 HQ2 L1 L2 HQ3 L1 L2 HQ4 L1 L2 HQ5 L1 L2 HQ6 L1 L2
Inter Precession Chart HQ (Hindi Query) L (Level)
Google Bing AltaVista
Chapter 4 Issues in Information Retrieval for Hindi Language
A Study of Web Mining Tools for Query Optimization Page 141
The process of Hindi IR becomes more difficult because of the structure of the
Hindi language. People generally do not follow the standard Hindi writing
conventions, which widens the gap between Hindi web data and users.
The relevant information can be mined out by transforming the Hindi queries.
Search engines neither transform the query nor find keyword equivalents, because
they may face performance and throughput problems if parameters such as Hindi
phonetics, synonyms, and English-equivalent Hindi keywords are implemented at
the root level. However, this problem can be solved at the interface level.
Therefore, to lessen the effort required of a Hindi user searching for such
information, a software tool has been developed (described in detail in the next
chapter) that acts as an interface between the user and the search engines. With
the help of this tool, the user can widen the scope of search on the web in the
Hindi language [66][67].
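As a rough illustration of what an interface-level transformation could look
like, the following Python sketch generates Level 1 and Level 2 variants of a
Hindi query by replacing one more keyword at each level. It is not the tool
described in the next chapter: the lookup table `ENGLISH_IN_HINDI`, the helper
`nfc`, the function `transform_query`, and the left-to-right replacement order
are all assumptions made for this example, and the two dictionary entries are
simply modelled on the first example query of this chapter.

```python
import unicodedata

# Hypothetical lookup table: Hindi keyword -> English keyword written in
# Devanagari. A real tool would need a far larger dictionary.
ENGLISH_IN_HINDI = {
    "स्वास्थ्य": "हेल्थ",        # health
    "रक्तदान": "ब्लड_डोनेशन",   # blood donation
}


def nfc(text):
    """Normalize to NFC so equivalent Devanagari sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def transform_query(query, mapping=ENGLISH_IN_HINDI):
    """Return [original, level 1, level 2, ...].

    Each level replaces one more Hindi keyword, left to right, with its
    English equivalent written in Hindi script.
    """
    lookup = {nfc(k): nfc(v) for k, v in mapping.items()}
    tokens = nfc(query).split()
    variants = [" ".join(tokens)]
    for i, token in enumerate(tokens):
        if token in lookup:
            tokens[i] = lookup[token]
            variants.append(" ".join(tokens))
    return variants


# Print the original query followed by its transformed variants.
for level, variant in enumerate(transform_query("स्वास्थ्य और रक्तदान")):
    print(level, variant)
```

Each variant would then be submitted to the search engine as a separate query,
so the engine itself needs no changes; that is the sense in which the work is
done at the interface level.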
Table 4.27 Transformed queries into two equivalent senses containing a mixture
of both English and Hindi: tabular representation

| In English | Hindi Query | Influenced Hindi Query: Level 1 | Influenced Hindi Query: Level 2 | Google search results: Hindi Query | Google search results: Level 1 | Google search results: Level 2 |
|---|---|---|---|---|---|---|
| Health and blood donation | स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 1520 | 8360 | 1020 |
| Treatment for Blood pressure | रक्तचाप का इलाज | ब्लड प्रेशर का इलाज | ब्लड प्रेशर का ट्रीटमेंट | 95800 | 37200 | 2150 |
| Cardiologist | हृदय चिकित्सक | हार्ट डॉक्टर | कार्डियोलॉजिस्ट | 241000 | 99100 | 1770 |
| Government Employment Policy | सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1840000 | 51600 | 93 |
| Foreign investment in India | विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 658000 | 527 | 66 |
| Corruption free India | भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 181000 | 19700 | 47000 |

Figure 4.4 Transformed queries into two equivalent senses (chart of result
counts for Hindi Query 1 to Hindi Query 6, each with Level 1 and Level 2)
From the above table and figure it is evident that documents are returned for
the original as well as the transformed Hindi queries, and that the number of
retrieved documents is considerable. For search engines, the quality of results
matters more than the quantity; therefore Table 4.28 and Figure 4.5 are
presented below for the analysis of precision values. Three popular search
engines, namely Google, Bing, and AltaVista, are used for retrieving web results.
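Table 4.28 Analysis of precision values: tabular representation
(Precision@10; HQ = Hindi Query, L1 = Level 1, L2 = Level 2)

| Hindi Query | Influenced Hindi Query: Level 1 | Influenced Hindi Query: Level 2 | Google HQ | Google L1 | Google L2 | Bing HQ | Bing L1 | Bing L2 | AltaVista HQ | AltaVista L1 | AltaVista L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| स्वास्थ्य और रक्तदान | हेल्थ और रक्तदान | हेल्थ और ब्लड_डोनेशन | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.9 | 0.9 | 0.8 |
| रक्तचाप का इलाज | ब्लड प्रेशर का इलाज | ब्लड प्रेशर का ट्रीटमेंट | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.5 | 0.8 | 0.7 | 0.5 |
| हृदय चिकित्सक | हार्ट चिकित्सक | कार्डियोलॉजिस्ट | 0.9 | 0.8 | 1 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 1 |
| सरकार द्वारा रोजगार योजना | सरकार द्वारा रोजगार स्कीम | सरकार द्वारा एम्प्लॉयमेंट स्कीम | 1 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.7 | 0.8 |
| विदेशी निवेश भारत में | फॉरेन निवेश भारत में | फॉरेन इन्वेस्टमेंट भारत में | 0.9 | 0.8 | 0.8 | 0.9 | 0.8 | 0.4 | 0.9 | 0.9 | 0.4 |
| भ्रष्टाचार मुक्त भारत | करप्शन मुक्त भारत | करप्शन फ्री इंडिया | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |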
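The precision values above are read here as Precision@10, that is, the fraction
of the first 10 returned documents that are judged relevant. The short Python
sketch below shows that calculation under this assumption; the function name
`precision_at_k` and the sample judgments are hypothetical and are not taken
from the experiments reported in this chapter.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant.

    `relevance` is a ranked list of judgments (truthy = relevant). If fewer
    than k results were returned, the missing ones count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for judged in relevance[:k] if judged) / k


# Hypothetical judgments: 9 of the first 10 documents are relevant.
judgments = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(precision_at_k(judgments))
```

With 9 relevant documents among the first 10, the function returns 0.9, which
matches how a table entry of 0.9 would be obtained.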