Chapter 3 Mathematical Modeling for Letter’s Frequencies...
-
Upload
trinhkhanh -
Category
Documents
-
view
221 -
download
3
Transcript of Chapter 3 Mathematical Modeling for Letter’s Frequencies...
-
Chapter 3
Mathematical Modeling for Letters Mathematical Modeling for Letters Mathematical Modeling for Letters Mathematical Modeling for Letters
Frequencies in Text and in Words Frequencies in Text and in Words Frequencies in Text and in Words Frequencies in Text and in Words
Initials for Hindi LanguageInitials for Hindi LanguageInitials for Hindi LanguageInitials for Hindi Language****
Frequencies of occurrence of letters in natural language texts are not
arbitrarily organized but follow some particular rules. Despite the apparent
freedom a writer has to create any desired sequence of words, written text
tends to follow some very simple set of laws. Letter frequencies enable us
to explain some characteristic features of human language. Words in
human language interact in sentences in non-random ways, and allow
humans to construct an astronomic variety of sentences from a limited
number of discrete units. Present chapter gives an analysis of frequencies
of letters of Hindi language alphabet on the basis of their occurrence in a
* Part of this chapter has been presented in Third Students conference of Linguistics in India, 19th-21st February, 2009, Centre for Linguistics, JNU, New Delhi, in the form of following research paper: Hemlata Pande and H. S. Dhami (2009). Analysis and mathematical model generation for letter frequencies in text and in words initials for Hindi language and measurement of entropies of most frequent consonants. (Abstract). The power point presentation of the paper is available at :
http://www.sconli.org/SCONLI_presentation/Hemlata.ppt
Contents from Chapter 2 and Chapter 3 in the form of a research paper Mathematical Modelling of Occurrence of Letters and Words Initials in Texts of Hindi Language has been accepted for publication in SKASE Journal of Theoretical Linguistics, Vol 7, 2010.
Este
lar
-
Chapter 3
80
text and also for their occurrence as words initial. We have applied Zipfs
law, which states that the frequency of the rth most frequent word type is
proportional to 1/r, to the letters instead of words for obtaining Zipfs order
and rank/frequency profiles of letters for their instances in text and in
beginning of words. Different distribution models have been tested for the
letter frequencies and finally appropriate parametric models have been
obtained for rank frequency distribution of letters in a corpus and for rank
frequency distribution of first letter of words in a Hindi text. Validations of
the models have been done by comparison of observed frequencies and
theoretical frequencies for various types of texts obtained from different
sources, and the variations in the values of parameters have been studied.
Comparison work has been accomplished for the entropies in text for the
vowel mark in Devanagari script written as matra following the consonant
(for the most frequent consonants).
3.1. Prelude
The properties of linguistic elements and their interrelations abide
by universal laws which are formulated in a strict mathematical way, in
analogy to the laws of well known natural sciences and it leads to the
application of some abstract mathematical and statistical tools to the
language theory. Mathematical Linguistics offers a number of tools that
allow abstract theories of language and mental computation to be compared
and contrasted with one another. An account of work in the field of
mathematical methods of linguistics can be had from the book of Partee et
al (1990), while compendium of several rules proposed to understand the
linguistic structure by which language communicates can be seen in the
work of Manning and Schutze (1999).
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
81
The process of formation of language models is quite popular in
linguistics. A linguistic model is a system of data (features, types,
structures, levels, etc.) and rules, which, taken together, can exhibit a
behavior similar to that of the human brain in understanding and
producing speech and texts. Mathematical models for languages had been
proposed by Harris (1982) in his famous book entitled A Grammar of
English on Mathematical Principles. In statistical language modeling,
large amounts of text are used to automatically determine the models
parameters. Bell and Witten (1988) has mentioned that natural-language
text is the result of an enormously complex process and modeling
strategies can be applied successfully to such a sophisticated artifact. A
compendium of different statistical models can be seen in the book of
Manning and Schutze (1999). Naranan and Balasubrahmanyan (1998) have
described the use of power law relations in modeling of languages. A
parsing language model incorporating multiple knowledge sources that is
based upon the concept of constraint Dependency Grammars is presented
in the paper of Wang and Harper (2002). A neural probabilistic language
model has been generated by Bengio et al (2003) while hierarchical
probabilistic neural network language model has been generated by Morin
and Bengio (2005). Different neural network language models can be seen
in the work of Bengio (2008). The book of Levinson (2003) presents
mathematical models of natural spoken language communication. Strauss
et al (2008) in the book Problems in Quantitative Linguistics have
discussed models for phoneme frequencies.
The frequency of letters in text has often been studied for use in
cryptography and frequency analysis. Modern International Morse code
encodes the most frequent letters with the shortest symbols and a similar
idea has been used in modern data-compression techniques also. Linotype
machines are also based on letter frequencies of English language texts.
Este
lar
-
Chapter 3
82
The reference for these applications can be had from Nation Master-
Encyclopedia (Letter frequencies).
No exact letter frequency distribution underlies a given language,
since all writers write slightly differently. More recent analyses show that
letter frequencies, like word frequencies, tend to vary, both by writer and
by subject. One cannot write an essay about x-rays without using frequent
Xs, and the essay will have an especially strange letter frequency if the
essay is about the frequent use of x-rays to treat zebras in Qatar. Different
authors have habits which can be reflected in their use of letters.
Hemingway's writing style, for example, is visibly different from
Faulkner's. Letter, bigram, trigram, word frequencies, word length, and
sentence length can be calculated for specific authors, and used to prove or
disprove authorship of texts, even for authors whose styles aren't so
divergent.
Since the patterns of letters in language dont happen at random, so
letter frequencies can be represented in the form of some rules and the
pattern of letter frequencies can help in discrimination of random text from
that of natural language text. In the area of letter-frequencies, we can cite
the works of Solso and King (1976); of Bell and Witten (1988) and
Eftekhari (2006). According to Sanderson (2007) distribution of letters in a
text can potentially help in the process of language determination. We have
tried to discuss the patterns of Hindi language alphabets on the basis of
their occurrence in texts and as word initials with the application of the
rank frequency approach. The detailed analyses and the modeling done in
this chapter is an attempt in the direction of the determination of the
linguistic structure of Hindi Language (on the basis of its letters). Most of
the previous studies for the letter frequencies have been concreted for
English and other languages.
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
83
3.2. Occurrence of letters in a text
We have initiated the work for present study by counting the
occurrence of letters of Hindi language alphabet in the text Tanav se
mukti (which has been taken as an example text for the study) of author
Shivanand (source has been given in appendix C, C.2.).Following aspects
have been considered in the counting process:
Independent occurrence of each letter (vowel as well as
consonant), as the occurrences of and in .
Each occurrence of consonants followed either by a vowel
mark in Devanagari script written as matra, as occurrence of
in , , , or occurrence in the form of half letters
anywhere in the word. For example we may cite the example
of occurrence of in and as such = + + +
+ + + + contains 2 , 1 , 1 and 1 and (
++ + ++ +) contains four letters , , , once each.
Occurrence of has been considered as the occurrence of vk
and of , vkWa, etc. with , vka etc. Occurrences in the form of ,
etc. have been merged with the occurrences of , [k and so
on.
In our selected text Tanav se mukti the total counted values of the
frequency of occurrence of each letter have been depicted in the table 3.2.1.
For obtaining the value - how many times each letter occurred in the text?,
Find and Find and replace options from Edit menu of Microsoft Word
Este
lar
-
Chapter 3
84
have been used and search option wildcard has also been applied
wherever needed.
Table 3.2.1. Frequencies of occurrence of letters for the text Tanav se
mukti.
Letter Frequency Letter Frequency Letter Frequency
711 0 84
517 661 716
99 194 805
167 848 1921
458 117 1553
17 3 4116
449 294 1128
23 128 1850
54 270 589
322 92 361
16 316 2645
0 3120 3178
2 543 : 125
4468 1260 143
341 490 > 25
679 3236
122 1858 Total 41,114
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
85
If we represent the total occurrence of all the letters of alphabet in
the text by N1, that is, =k
kfN ,1 k = 1, 2.. , where kf is the frequency
of occurrence of kth letter of the alphabet of Hindi language, then its value
( 1N ) for the selected text shall be equal to 41,114.
We have ranked the letters in descending order of their frequency of
occurrence as done by Zipf. The rank and frequency profile for each letter
can be visualized as shown in the following table 3.2.2-
Table 3.2.2. Rank frequency profile for the occurrence of letters.
Rank Frequency Rank Frequency Rank Frequency
1 4468 17 679 33 143
2 4116 18 661 34 128
3 3236 19 589 35 125
4 3178 20 543 36 122
5 3120 21 517 37 117
6 2645 22 490 38 99
7 1921 23 458 39 92
8 1858 24 449 40 84
9 1850 25 361 41 54
10 1553 26 341 42 25
11 1260 27 322 43 23
12 1128 28 316 44 17
13 848 29 294 45 16
14 805 30 270 46 3
15 716 31 194 47 2
16 711 32 167
According to Eftekhari (2006) if the letters are ordered in
accordance with their frequency of occurrence (according to their
Este
lar
-
Chapter 3
86
ascending orders of frequencies), the monotonic changes can be obtained
by the application of the Zipfs law. Such type of arrangement has been
given the name Zipfs order. For our text Tanav se Mukti the Zipfs
order for occurrence of letters in the text has been obtained as shown in the
following figure-
Figure 3.2.1 Zipfs orders for letters and their proportions.
3.3. Occurrence of letters as words initial
For determination of frequencies of occurrence of various words
initials in a text, the software available at:
http://www.seasite.niu.edu/trans/Indonesian/transpgms/latinfreq.aspx, for
obtaining the word frequency list corresponding to the text has been used
and frequencies of words with various initials have been determined from
the obtained list. Results obtained for the text Tanav se Mukti for
occurrence of letters as the initial letter of words in the studied text has
been tabulated in table 3.3.1.
====
KKKK
{k{ k{ k{ k
0.00E+00
2.00E-02
4.00E-02
6.00E-02
8.00E-02
1.00E-01
1.20E-01
0 10 20 30 40 50
Zipf's order of letters in the text(i)
Pro
po
rtio
n o
f let
ters
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
87
Table 3.3.1. Frequencies of different word initials in the text Tanav se
Mukti.
Letter Frequency Letter Frequency Letter Frequency
707 0 52
495 295 527
86 79 501
56 599 963
457 23 158
16 0 372
195 21 391
23 13 555
20 31 191
322 8 2
16 0 1581
0 419 2153
2 23 : 62
2708 672 4
64 133 > 9
155 550
70 1020 Total 16,799
Este
lar
-
Chapter 3
88
Here if N2 represents the total word tokens in the text (where words
formed by numerals 1, 2 etc. have been excluded), then N2 = 16,799. A
table similar to the table 3.2.2 can be constructed for the rank frequency
profile of words initials for which the Zipfs order can be demonstrated
with the help of following figure-
Figure 3.3.1. Zipfs orders for words initials
3.4. Mathematical models for frequencies of letters and
words initials
After obtaining the rank frequency profile of texts for different
letters and for different words initials we have tried to formulate these in
the form of mathematical equations. For this we have tested models
(available in literature for rank frequency distributions)- distributions
specified for rank frequency distribution of word frequencies, Zipfs
distribution and Zipf Mandelbrot distribution as given in the book by
Manning and Schutze (1999); various rank frequency distributions used for
====KKKK
{k{k{k{k
0
0.06
0.12
0.18
0 10 20 30 40 50Zipf's order(i) of word's initials
no
rmal
ized
fre
qu
enci
es
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
89
linguistic components as- Good, Geometric series distribution, Borodovsky
and Gusein Zade equation as cited in the research paper of Tambovtsev and
Martindale (2007) and Yule equation as concluded for the phoneme
frequencies by the authors; negative hypergeometric distribution and
determination of frequencies by Zipfs order. All these
models/distributions have been discussed in detail in section 2.3.1 of
chapter 2 in the form of equations (2.3.1) to equation (2.3.8).
Application of the distributions to the frequencies of different letters,
ranks of letters and their Zipf orders for the text Tanav se Mukti have
enabled us to calculate the theoretical values of the frequencies and the
values of determination coefficient ( 2R ) for each model. An outline of
determination coefficient ( 2R ) has been given by equation (1.4) of
Introductory Chapter 1.
The calculated values of determination coefficient for rank
frequency distribution of letters for their occurrences in the text and in the
starting position of the words in the text Tanav se Mukti have been
shown in the Table 3.4.1. We have used the software Datafit for
determination of parameters for all the equations except (2.3.2). This
software was producing numerical errors for equation (2.3.2) and as such
we were left with no option but to apply the least square method for linear
relation
)log()log()log()log( crBAcrbaFr ++=+= ,
by first setting c = 0 and then increasing it by small steps until the
goodness of fit stops improving.
Este
lar
-
Chapter 3
90
Table 3.4.1. Distributions, parameters and determination coefficients for
the text Tanav se Mukti
Distribution For occurrence of letters
in the text
For occurrence of letters
in the words initial
Det
erm
inat
ion
coef
fici
ent
Par
amet
ers
Det
erm
inat
ion
coef
fici
ent
Par
amet
ers
Zipf 0.836 5683.239,=a
b = 0.685
0.932 3051.361,=a
b = 0.815
Zipf Mandelbrot 0.985 ,10 491.7995144=a974.1142162=b
,107=c
0.925 ,10 571.5531903=a0218.816137=b,106 6=c
Good 0.900 & 0.825 &
Geometric series 0.9891 5007.781,=a b = 0.889
0.937 2806.074,=a b = 0.843
Borodovsky &
Gusein-Zade equation
0.866 & 0.772 &
Yule 0.9894 5072.373,=a b = 0.0382,
c = 0.8958
0.981 2982.119,=a
b = 0.459,
c = 0.936
Negative
hypergeometric
0.939 3.186,=a b = 0.6872
0.976 2.2503,=a
b = 0.337
br iaF = , i be Zipf
order of letter of
rank r
0.978 ,10113.2 4=a365.4=b
0.892 ,10663.3 5=a
b = -4.673
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
91
The table depicts that the highest determination coefficient in both
cases has been obtained for the Yule distribution (0.9894 and 0.981
respectively)-
r
brc
r
aF = (3.4.1),
where rF is the frequency corresponding to the rank r and a, b, c are
parameters. Therefore on the basis of results in the example text taken by
us, it can be assumed that the frequencies of occurrence of letters and
words initials in Hindi text follow Yules distribution. Corresponding
theoretical and empirical frequencies for the text Tanav se Mukti for
different ranks have been shown in the following two figures ( Figure 3.4.1
and 3.4.2) In the next section we shall check whether this distribution is
followed by the frequencies of occurrence in other texts also.
Figure 3.4.1. Observed and theoretical frequencies corresponding to
different ranks of letters.
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
1 6 11 16 21 26 31 36 41 46
Rank
Fre
qu
ency
observed theoretical
Este
lar
-
Chapter 3
92
Figure 3.4.2. Observed and theoretical frequencies corresponding to
different ranks of words initials.
3.5. Validation of model for different texts and its
improvement for frequencies of words initials
We have counted the frequency of letters for various kinds of texts
obtained from different sources1 in the following forms-
1. Text composed of 5 stories.(say, Text 1)
2. Text containing 29 short stories.(Text 2)
3. Text containing 62 articles.(Text 3)
4. Text containing 37 articles.(Text 4)
5. Poetic text containing 180 poems (Text 5), and
6. Text formed as mixture of all the considered texts, from text
1 to text 5 above and the text Tanav se Mukti.(Text 6)
The calculated values of the determination coefficients, that are
greater than 0.98 for letter frequencies in the texts and greater than 0.95 for
1 Sources of the texts have been given in the Appendix C.
0
600
1200
1800
2400
3000
1 6 11 16 21 26 31 36 41Rank
Fre
quen
cy
observed theoretical
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
93
the letter frequencies in the words initial and the obtained Zipfs orders for
the occurrence of letters in text and for occurrence as words initial have
been given in the following tables:
Table 3.5.1(a). Calculated values of determination coefficients (which are
>0.98) for different texts for rank frequency distribution of letters.
Text
Total counted
letters(N1)
Determination
coefficient &
distribution
Determination
coefficient &
distribution
Determination
coefficient &
distribution
Text1 18,162 0.990(Yule) 0.989
(Geometric-
series)
0.981(Zipf -
Mandelbrot)
Text 2 22,375 0.996(Yule) 0.989
(Geometric-
series)
0.988(Zipf -
Mandelbrot)
Text 3 70,622 0.992(Yule) 0.981
(Negative-
hypergeometri
c)
0.980(Geometr
ic- series)
Text 4 82,803 0.992(Yule) 0.984
(Geometric-
series)
0.983(Zipf-
Mandelbrot)
Text 5 90,908 0.991(Yule) 0.990
(Geometric-
series)
&
Text 6 3,25,983 0.992(Yule) 0.990
(Geometric-
series)
0.984(Zipf -
Mandelbrot)
Este
lar
-
Chapter 3
94
Table 3.5.1(b). Zipfs order of letters for their instances in text
Text Zipfs order
Text 1 > :
Text 2 > :
Text 3 > :
Text 4 > :
Text 5 > :
Text 6 > :
Table 3.5.2(a). Zipfs order of letters for their instances in the beginning of
words in the texts
Text Zipfs order
Text 1 > :
Text 2 > :
Text 3 :
Text 4 > :
Text 5 > :
Text 6 > :
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
95
Table 3.5.2(b). Calculated values of determination coefficients (which are
>0.95) for different texts for rank frequency distribution of words initials.
Text
Total
counted
words(N2)
Determination
coefficient &
distribution
Determination
coefficient &
distribution
Determination
coefficient &
distribution
Text 1 8,298 0.978(Yule) 0.973(Negative
hypergeometric)
&
Text 2 9,831 0.960(Yule) 0.953(Negative
hypergeometric)
&
Text 3 29,623 0.975(Yule) 0.963(Negative
hypergeometric)
0.962(Zipf)
Text 4 34,404 0.984(Yule) 0.979( Negative
hypergeometric)
0.963(Zipf)
Text 5 40,418 0.987(Yule) 0.984(Negative
hypergeometric)
0.968(Geometric
series)
Text 6 1,39,373 0.985(Yule) 0.982(Negative
hypergeometric)
&
Tables 3.5.1(a) and 3.5.2(b) depict that the determination coefficient
in both cases is best for the Yule distribution for all the considered texts
and its obtained least value is 0.96, which is adequate to assume it as a
model for the frequencies. In table 3.5.1(b) letters - , , , , , ,
always take any position out of last seven positions in the corresponding
Zipfs order list and in the table 3.5.2(a) , , , , attain any one
position out of the last five positions in the Zipfs order list for the
occurrence of letters as words initial for all texts. Thus these seven letters
and five words initials can be assumed as seven letters of highest Zipfs
Este
lar
-
Chapter 3
96
orders or seven most frequent letters and five words initials of highest
Zipfs order or five most frequent words initials respectively.
The variations in the values of the parameters of the models for
different considered texts have been depicted in the following figures,
Figure 3.5.1 and 3.5.2. As the value of parameter a increases with the
size of text, for comparison we have drawn the values of 1N
a and 2N
a ,
where N1 denotes total counted letters in the text and N2 total counted word
tokens in the texts.
Figure 3.5.1. Parameters of the Yule distribution for different texts for rank
frequency distribution of letters.
Figure 3.5.2. Parameters of the Yule distribution for different texts for rank
frequency distribution of the word initials.
-0.1
0.4
0.9
Tex
t 1
Tex
t 2
Tex
t 3
Tex
t 4
Tex
t 5
Tex
t 6
Tan
av
a/N1 b c
-0.1
0.4
0.9
Tex
t 1
Tex
t 2
Tex
t 3
Tex
t 4
Tex
t 5
Tex
t 6
Ta
nav
a/N2 b cEste
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
97
From these figures, it is evident that although the parameters of the
model for the distributions of letters in the text and in the starting position
of words of the texts lie in slight different ranges but their manner of
change for the two is almost the same.
From the discussions carried out so far, we have observed that , ,
, , , , are seven most frequent letters of Hindi language and , , ,
, are five most frequent words initials. The percentage of the total
letters of the texts formed by these seven letters and percentage of the total
words, starting by these five words initials can be summarized in the form
of following table:
Table 3.5.3. % of total letters and % of total word tokens of the texts
formed by 7 most frequent letters and 5 most frequent words initials.
Tan
av
se m
ukti
Tex
t 1
Tex
t 2
Tex
t 3
Tex
t 4
Tex
t 5
Tex
t 6
% of total letters formed
by 7 most frequent letters 55.2 53.1 53.4 53.9 54.98 52.6 53.9
% of total word-tokens for
the words started by any
one of the 5 most
frequent words initials
50.2 42.4 42.6 46.3 47.9 46.2 46.6
Thus out of all used letters (excluding matras) of Hindi language
alphabet in a text more than 50% are one of , , , , , , and out of
all used words more than 40% words have their initial letter one of , , ,
Este
lar
-
Chapter 3
98
, . In the next section we have compared the entropies of these seven
most frequent letters in texts and five most frequent word initials.
Here we have taken an initiative in order to improve the values of
the determination coefficients in the form of replacement of r by )( r in
the Yule equation ,rbr
cr
aF = to take the form as
r
br
brc
r
kc
r
aF 11)()(
=
= (3.5.1),
where 1c , b1 and k are parameters and is a small number less than 1, for the rank frequency distribution of occurrence of letters as words initial, for
whom the value of determination coefficient for the texts text 1 and text 2
is not greater than 0.98. After this we have calculated the values of the
determination coefficients-
0.983 for text 1 by )( r and =0.9 ,
0.986 for text 2 by )( r and = 0.95,
0.99 for text 3 by )( r and = 0.95 ,
0.989 for text 4 by )( r and = 0.8 and
0.989 for text 6 by )( r and = 0.9; and for the text 5 and the text Tanav se Mukti there is not any considerable improvement. Thus the
model for the rank frequency distribution of letters in the beginning of
words of the text can be formulated as
r
brc
r
aF
)( = (3.5.2),
c, b and a are parameters and is either zero or a number between 0 and 1.
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
99
3.6. Entropies of most frequent consonants
We have so far investigated that all five most frequent word initials
and seven most frequent letters in text are consonants and each consonant
in a text can have three types of instances- as a consonant, as half
consonant(or consonant followed by a halant ) and as consonant
followed by any matra ( /, , , , , , , , , , , ). Thus
each consonant can occur in 14 different ways in text, where the
occurrences of in , ( + + , + + ) etc. have been
supposed as occurrence of followed by matra . In order to calculate
the entropy for the pattern of occurrence of most frequent consonants, we
have analyzed occurrences of each of seven most frequent letters in
different texts and the pattern of the occurrence of five most frequent
initials. The obtained pattern for the occurrence of seven most frequent
letters in the text Tanav se Mukti has been shown in the table 3.6.1,
given in the next page.
As the entropy (or the degree of uncertainty) of the distribution of a
random variable x is:
=Xx
xpxpxH )(log)()( 2 (3.6.1),
where p(x) is the probability of x. We can infer that if the random variable
X for a particular consonant is taken as the matra following that
consonant then X={ none, /, , , , , , , , , , , /, },
where occurrence of a consonant as a half letter has been assumed as the
occurrence followed by .
Este
lar
-
Chapter 3
100
Table 3.6.1. Frequencies for patterns of occurrences of most frequent
letters for the consecutive matra in the text Tanav se Mukti.
Pa
tter
n of
occu
rren
ce
L
ette
rs
Whole letter 1688 2550 577 1278 1053 827 785
With / 710 285 223 650 225 709 362
With 243 133 159 164 59 404 99
With 326 102 414 106 124 163 22
With 82 90 94 127 141 111 49
With 8 16 13 4 12 8 27
With 44 2 14 0 3 2 18
With 535 241 80 516 418 338 444
With 7 3 1113 3 3 7 13
With 493 219 465 43 13 137 19
With 19 0 3 10 15 1 3
With or as half letter 303 454 7 323 425 385 65
With / 3 19 16 3 154 7 14
With 7 2 0 9 0 21 1
Total 4468 4116 3178 3236 2645 3120 1921
The values of entropies obtained by the equation (3.6.1) for , , ,
, , , from the table (3.6.1) for the subsequently followed matras
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
101
(discussed above) are respectively 2.695, 2.008, 2.669, 2.474, 2.607, 2.808,
and 2.401. Similarly entropies of most frequent word initials , , , ,
are respectively 2.811, 1.992, 2.427, 2.666 and 2.114. The calculated
values of entropies for all the considered texts have been shown in the
following figures:
Figure 3.6.1. Entropies of seven most frequent letters for their pattern of
occurrence
Figure 3.6.2. Entropies of most frequent word-initials for their pattern of occurrence
1.5
2
2.5
3
letters
entr
op
y
Text 1
Text 2
Text 3
Text 4
Text 5
Text 6
Tanav seMukti
1.5
2
2.5
3
3.5
letters
entr
opy
Text 1
Text 2
Text 3
Text 4
Text 5
Text 6
Tanavse Mukti
Este
lar
-
Chapter 3
102
On the basis of this study applied to seven texts, we can say that the
letter is much certain (less entropic) than 6 other most frequent letters in
all the considered texts. It is followed by except in the text Tanav se
Mukti in which has less entropy than . In case of words initial the
occurrence pattern is that is much certain in the texts- Tanav se Mukti,
Text formed by 5 stories, the text formed by 29 short stories, the text
formed by poems and in the text formed as the mixture of all the texts. In
the texts formed by 62 articles is less entropic followed by and in the
text formed by 37 articles is much certain followed by and the
entropies of these two are nearly equal thus can be considered as much
certain words initial than other most frequent initials with regard to its
pattern of occurrence.
3.7. Conclusion
In this information age, storage and retrieval of information in a
convenient manner has gained importance. Because of the near-universal
adoption of the WorldWideWeb as a repository of information for
unconstrained and wide dissemination, information is now broadly
available over the internet and is accessible from remote sites.
However, the interaction between the computer and the user is
largely through keyboard and screen-oriented systems. In the current
Indian context, this restricts the usage to a minuscule fraction of the
population, who are both computer-literate and conversant with written
English. In order to enable a wider proportion of the population to benefit
from the internet, there is need to provide some additional interface with
Este
lar
-
Mathematical modeling of frequencies of letters in Hindi texts
103
the machine and hence the utility of such type of researches can be
understood.
The discussions in the present chapter can be concluded as-
, , , , , , are seven most frequent letters of Hindi
language and these form more than 50% of texts in terms of
letters of Hindi language alphabet.
, , , , are five most frequent words initials and more
than 40% of words of the text begin with any one of these letters.
Frequencies of letters in a text and in combinations of texts
follow Yule distribution ,rbr cr
aF = where a , b and c are
parameters.
Frequencies of letters as words initials follow a Yule
distribution ,rbr
cr
aF = which can be more properly expressed
by a distribution of type rbr cr
aF
)( = where is either zero
or lies between o and 1, and a , b and c are parameters.
The letter is much certain than other 6 most frequent letters
and is much certain words initial than other 4 most frequent
words initials in the form of their patterns of occurrence.
Este
lar
-
Este
lar