Chapter 3 Mathematical Modeling for Letter’s Frequencies...

Chapter 3

Mathematical Modeling for Letters Mathematical Modeling for Letters Mathematical Modeling for Letters Mathematical Modeling for Letters

Frequencies in Text and in Words Frequencies in Text and in Words Frequencies in Text and in Words Frequencies in Text and in Words

Initials for Hindi LanguageInitials for Hindi LanguageInitials for Hindi LanguageInitials for Hindi Language****

Frequencies of occurrence of letters in natural language texts are not

arbitrarily organized but follow some particular rules. Despite the apparent

freedom a writer has to create any desired sequence of words, written text

tends to follow some very simple set of laws. Letter frequencies enable us

to explain some characteristic features of human language. Words in

human language interact in sentences in non-random ways, and allow

humans to construct an astronomic variety of sentences from a limited

number of discrete units. Present chapter gives an analysis of frequencies

of letters of Hindi language alphabet on the basis of their occurrence in a

* Part of this chapter has been presented in Third Students conference of Linguistics in India, 19th-21st February, 2009, Centre for Linguistics, JNU, New Delhi, in the form of following research paper: Hemlata Pande and H. S. Dhami (2009). Analysis and mathematical model generation for letter frequencies in text and in words initials for Hindi language and measurement of entropies of most frequent consonants. (Abstract). The power point presentation of the paper is available at :

http://www.sconli.org/SCONLI_presentation/Hemlata.ppt

Contents from Chapter 2 and Chapter 3 in the form of a research paper Mathematical Modelling of Occurrence of Letters and Words Initials in Texts of Hindi Language has been accepted for publication in SKASE Journal of Theoretical Linguistics, Vol 7, 2010.

Este

lar

Chapter 3

80

text and also for their occurrence as words initial. We have applied Zipfs

law, which states that the frequency of the rth most frequent word type is

proportional to 1/r, to the letters instead of words for obtaining Zipfs order

and rank/frequency profiles of letters for their instances in text and in

beginning of words. Different distribution models have been tested for the

letter frequencies and finally appropriate parametric models have been

obtained for rank frequency distribution of letters in a corpus and for rank

frequency distribution of first letter of words in a Hindi text. Validations of

the models have been done by comparison of observed frequencies and

theoretical frequencies for various types of texts obtained from different

sources, and the variations in the values of parameters have been studied.

Comparison work has been accomplished for the entropies in text for the

vowel mark in Devanagari script written as matra following the consonant

(for the most frequent consonants).

3.1. Prelude

The properties of linguistic elements and their interrelations abide

by universal laws which are formulated in a strict mathematical way, in

analogy to the laws of well known natural sciences and it leads to the

application of some abstract mathematical and statistical tools to the

language theory. Mathematical Linguistics offers a number of tools that

allow abstract theories of language and mental computation to be compared

and contrasted with one another. An account of work in the field of

mathematical methods of linguistics can be had from the book of Partee et

al (1990), while compendium of several rules proposed to understand the

linguistic structure by which language communicates can be seen in the

work of Manning and Schutze (1999).

Este

lar

Mathematical modeling of frequencies of letters in Hindi texts

81

The process of formation of language models is quite popular in

linguistics. A linguistic model is a system of data (features, types,

structures, levels, etc.) and rules, which, taken together, can exhibit a

behavior similar to that of the human brain in understanding and

producing speech and texts. Mathematical models for languages had been

proposed by Harris (1982) in his famous book entitled A Grammar of

English on Mathematical Principles. In statistical language modeling,

large amounts of text are used to automatically determine the models

parameters. Bell and Witten (1988) has mentioned that natural-language

text is the result of an enormously complex process and modeling

strategies can be applied successfully to such a sophisticated artifact. A

compendium of different statistical models can be seen in the book of

Manning and Schutze (1999). Naranan and Balasubrahmanyan (1998) have

described the use of power law relations in modeling of languages. A

parsing language model incorporating multiple knowledge sources that is

based upon the concept of constraint Dependency Grammars is presented

in the paper of Wang and Harper (2002). A neural probabilistic language

model has been generated by Bengio et al (2003) while hierarchical

probabilistic neural network language model has been generated by Morin

and Bengio (2005). Different neural network language models can be seen

in the work of Bengio (2008). The book of Levinson (2003) presents

mathematical models of natural spoken language communication. Strauss

et al (2008) in the book Problems in Quantitative Linguistics have

discussed models for phoneme frequencies.

The frequency of letters in text has often been studied for use in

cryptography and frequency analysis. Modern International Morse code

encodes the most frequent letters with the shortest symbols and a similar

idea has been used in modern data-compression techniques also. Linotype

machines are also based on letter frequencies of English language texts.

Este

lar

Chapter 3

82

The reference for these applications can be had from Nation Master-

Encyclopedia (Letter frequencies).

No exact letter frequency distribution underlies a given language,

since all writers write slightly differently. More recent analyses show that

letter frequencies, like word frequencies, tend to vary, both by writer and

by subject. One cannot write an essay about x-rays without using frequent

Xs, and the essay will have an especially strange letter frequency if the

essay is about the frequent use of x-rays to treat zebras in Qatar. Different

authors have habits which can be reflected in their use of letters.

Hemingway's writing style, for example, is visibly different from

Faulkner's. Letter, bigram, trigram, word frequencies, word length, and

sentence length can be calculated for specific authors, and used to prove or

disprove authorship of texts, even for authors whose styles aren't so

divergent.

Since the patterns of letters in language dont happen at random, so

letter frequencies can be represented in the form of some rules and the

pattern of letter frequencies can help in discrimination of random text from

that of natural language text. In the area of letter-frequencies, we can cite

the works of Solso and King (1976); of Bell and Witten (1988) and

Eftekhari (2006). According to Sanderson (2007) distribution of letters in a

text can potentially help in the process of language determination. We have

tried to discuss the patterns of Hindi language alphabets on the basis of

their occurrence in texts and as word initials with the application of the

rank frequency approach. The detailed analyses and the modeling done in

this chapter is an attempt in the direction of the determination of the

linguistic structure of Hindi Language (on the basis of its letters). Most of

the previous studies for the letter frequencies have been concreted for

English and other languages.

Este

lar


83

3.2. Occurrence of letters in a text

We have initiated the work for present study by counting the

occurrence of letters of Hindi language alphabet in the text Tanav se

mukti (which has been taken as an example text for the study) of author

Shivanand (source has been given in appendix C, C.2.).Following aspects

have been considered in the counting process:

Independent occurrence of each letter (vowel as well as

consonant), as the occurrences of and in .

Each occurrence of consonants followed either by a vowel

mark in Devanagari script written as matra, as occurrence of

in , , , or occurrence in the form of half letters

anywhere in the word. For example we may cite the example

of occurrence of in and as such = + + +

+ + + + contains 2 , 1 , 1 and 1 and (

++ + ++ +) contains four letters , , , once each.

Occurrence of has been considered as the occurrence of vk

and of , vkWa, etc. with , vka etc. Occurrences in the form of ,

etc. have been merged with the occurrences of , [k and so

on.

In our selected text Tanav se mukti the total counted values of the

frequency of occurrence of each letter have been depicted in the table 3.2.1.

For obtaining the value - how many times each letter occurred in the text?,

Find and Find and replace options from Edit menu of Microsoft Word

Este

lar

Chapter 3

84

have been used and search option wildcard has also been applied

wherever needed.

Table 3.2.1. Frequencies of occurrence of letters for the text Tanav se

mukti.

Letter Frequency Letter Frequency Letter Frequency

711 0 84

517 661 716

99 194 805

167 848 1921

458 117 1553

17 3 4116

449 294 1128

23 128 1850

54 270 589

322 92 361

16 316 2645

0 3120 3178

2 543 : 125

4468 1260 143

341 490 > 25

679 3236

122 1858 Total 41,114

Este

lar


85

If we represent the total occurrence of all the letters of alphabet in

the text by N1, that is, =k

kfN ,1 k = 1, 2.. , where kf is the frequency

of occurrence of kth letter of the alphabet of Hindi language, then its value

( 1N ) for the selected text shall be equal to 41,114.

We have ranked the letters in descending order of their frequency of

occurrence as done by Zipf. The rank and frequency profile for each letter

can be visualized as shown in the following table 3.2.2-

Table 3.2.2. Rank frequency profile for the occurrence of letters.

Rank Frequency Rank Frequency Rank Frequency

1 4468 17 679 33 143

2 4116 18 661 34 128

3 3236 19 589 35 125

4 3178 20 543 36 122

5 3120 21 517 37 117

6 2645 22 490 38 99

7 1921 23 458 39 92

8 1858 24 449 40 84

9 1850 25 361 41 54

10 1553 26 341 42 25

11 1260 27 322 43 23

12 1128 28 316 44 17

13 848 29 294 45 16

14 805 30 270 46 3

15 716 31 194 47 2

16 711 32 167

According to Eftekhari (2006) if the letters are ordered in

accordance with their frequency of occurrence (according to their

Este

lar

Chapter 3

86

ascending orders of frequencies), the monotonic changes can be obtained

by the application of the Zipfs law. Such type of arrangement has been

given the name Zipfs order. For our text Tanav se Mukti the Zipfs

order for occurrence of letters in the text has been obtained as shown in the

following figure-

Figure 3.2.1 Zipfs orders for letters and their proportions.

3.3. Occurrence of letters as words initial

For determination of frequencies of occurrence of various words

initials in a text, the software available at:

http://www.seasite.niu.edu/trans/Indonesian/transpgms/latinfreq.aspx, for

obtaining the word frequency list corresponding to the text has been used

and frequencies of words with various initials have been determined from

the obtained list. Results obtained for the text Tanav se Mukti for

occurrence of letters as the initial letter of words in the studied text has

been tabulated in table 3.3.1.

====

KKKK

{k{ k{ k{ k

0.00E+00

2.00E-02

4.00E-02

6.00E-02

8.00E-02

1.00E-01

1.20E-01

0 10 20 30 40 50

Zipf's order of letters in the text(i)

Pro

po

rtio

n o

f let

ters

Este

lar


87

Table 3.3.1. Frequencies of different word initials in the text Tanav se

Mukti.

Letter Frequency Letter Frequency Letter Frequency

707 0 52

495 295 527

86 79 501

56 599 963

457 23 158

16 0 372

195 21 391

23 13 555

20 31 191

322 8 2

16 0 1581

0 419 2153

2 23 : 62

2708 672 4

64 133 > 9

155 550

70 1020 Total 16,799

Este

lar

Chapter 3

88

Here if N2 represents the total word tokens in the text (where words

formed by numerals 1, 2 etc. have been excluded), then N2 = 16,799. A

table similar to the table 3.2.2 can be constructed for the rank frequency

profile of words initials for which the Zipfs order can be demonstrated

with the help of following figure-

Figure 3.3.1. Zipfs orders for words initials

3.4. Mathematical models for frequencies of letters and

words initials

After obtaining the rank frequency profile of texts for different

letters and for different words initials we have tried to formulate these in

the form of mathematical equations. For this we have tested models

(available in literature for rank frequency distributions)- distributions

specified for rank frequency distribution of word frequencies, Zipfs

distribution and Zipf Mandelbrot distribution as given in the book by

Manning and Schutze (1999); various rank frequency distributions used for

====KKKK

{k{k{k{k

0

0.06

0.12

0.18

0 10 20 30 40 50Zipf's order(i) of word's initials

no

rmal

ized

fre

qu

enci

es

Este

lar


89

linguistic components as- Good, Geometric series distribution, Borodovsky

and Gusein Zade equation as cited in the research paper of Tambovtsev and

Martindale (2007) and Yule equation as concluded for the phoneme

frequencies by the authors; negative hypergeometric distribution and

determination of frequencies by Zipfs order. All these

models/distributions have been discussed in detail in section 2.3.1 of

chapter 2 in the form of equations (2.3.1) to equation (2.3.8).

Application of the distributions to the frequencies of different letters,

ranks of letters and their Zipf orders for the text Tanav se Mukti have

enabled us to calculate the theoretical values of the frequencies and the

values of determination coefficient ( 2R ) for each model. An outline of

determination coefficient ( 2R ) has been given by equation (1.4) of

Introductory Chapter 1.

The calculated values of determination coefficient for rank

frequency distribution of letters for their occurrences in the text and in the

starting position of the words in the text Tanav se Mukti have been

shown in the Table 3.4.1. We have used the software Datafit for

determination of parameters for all the equations except (2.3.2). This

software was producing numerical errors for equation (2.3.2) and as such

we were left with no option but to apply the least square method for linear

relation

)log()log()log()log( crBAcrbaFr ++=+= ,

by first setting c = 0 and then increasing it by small steps until the

goodness of fit stops improving.

Este

lar

Chapter 3

90

Table 3.4.1. Distributions, parameters and determination coefficients for

the text Tanav se Mukti

Distribution For occurrence of letters

in the text

For occurrence of letters

in the words initial

Det

erm

inat

ion

coef

fici

ent

Par

amet

ers

Det

erm

inat

ion

coef

fici

ent

Par

amet

ers

Zipf 0.836 5683.239,=a

b = 0.685

0.932 3051.361,=a

b = 0.815

Zipf Mandelbrot 0.985 ,10 491.7995144=a974.1142162=b

,107=c

0.925 ,10 571.5531903=a0218.816137=b,106 6=c

Good 0.900 & 0.825 &

Geometric series 0.9891 5007.781,=a b = 0.889

0.937 2806.074,=a b = 0.843

Borodovsky &

Gusein-Zade equation

0.866 & 0.772 &

Yule 0.9894 5072.373,=a b = 0.0382,

c = 0.8958

0.981 2982.119,=a

b = 0.459,

c = 0.936

Negative

hypergeometric

0.939 3.186,=a b = 0.6872

0.976 2.2503,=a

b = 0.337

br iaF = , i be Zipf

order of letter of

rank r

0.978 ,10113.2 4=a365.4=b

0.892 ,10663.3 5=a

b = -4.673

Este

lar


91

The table depicts that the highest determination coefficient in both

cases has been obtained for the Yule distribution (0.9894 and 0.981

respectively)-

r

brc

r

aF = (3.4.1),

where rF is the frequency corresponding to the rank r and a, b, c are

parameters. Therefore on the basis of results in the example text taken by

us, it can be assumed that the frequencies of occurrence of letters and

words initials in Hindi text follow Yules distribution. Corresponding

theoretical and empirical frequencies for the text Tanav se Mukti for

different ranks have been shown in the following two figures ( Figure 3.4.1

and 3.4.2) In the next section we shall check whether this distribution is

followed by the frequencies of occurrence in other texts also.

Figure 3.4.1. Observed and theoretical frequencies corresponding to

different ranks of letters.

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

1 6 11 16 21 26 31 36 41 46

Rank

Fre

qu

ency

observed theoretical

Este

lar

Chapter 3

92

Figure 3.4.2. Observed and theoretical frequencies corresponding to

different ranks of words initials.

3.5. Validation of model for different texts and its

improvement for frequencies of words initials

We have counted the frequency of letters for various kinds of texts

obtained from different sources1 in the following forms-

1. Text composed of 5 stories.(say, Text 1)

2. Text containing 29 short stories.(Text 2)

3. Text containing 62 articles.(Text 3)

4. Text containing 37 articles.(Text 4)

5. Poetic text containing 180 poems (Text 5), and

6. Text formed as mixture of all the considered texts, from text

1 to text 5 above and the text Tanav se Mukti.(Text 6)

The calculated values of the determination coefficients, that are

greater than 0.98 for letter frequencies in the texts and greater than 0.95 for

1 Sources of the texts have been given in the Appendix C.

0

600

1200

1800

2400

3000

1 6 11 16 21 26 31 36 41Rank

Fre

quen

cy

observed theoretical

Este

lar


93

the letter frequencies in the words initial and the obtained Zipfs orders for

the occurrence of letters in text and for occurrence as words initial have

been given in the following tables:

Table 3.5.1(a). Calculated values of determination coefficients (which are

>0.98) for different texts for rank frequency distribution of letters.

Text

Total counted

letters(N1)

Determination

coefficient &

distribution

Determination

coefficient &

distribution

Determination

coefficient &

distribution

Text1 18,162 0.990(Yule) 0.989

(Geometric-

series)

0.981(Zipf -

Mandelbrot)

Text 2 22,375 0.996(Yule) 0.989

(Geometric-

series)

0.988(Zipf -

Mandelbrot)

Text 3 70,622 0.992(Yule) 0.981

(Negative-

hypergeometri

c)

0.980(Geometr

ic- series)

Text 4 82,803 0.992(Yule) 0.984

(Geometric-

series)

0.983(Zipf-

Mandelbrot)

Text 5 90,908 0.991(Yule) 0.990

(Geometric-

series)

&

Text 6 3,25,983 0.992(Yule) 0.990

(Geometric-

series)

0.984(Zipf -

Mandelbrot)

Este

lar

Chapter 3

94

Table 3.5.1(b). Zipfs order of letters for their instances in text

Text Zipfs order

Text 1 > :

Text 2 > :

Text 3 > :

Text 4 > :

Text 5 > :

Text 6 > :

Table 3.5.2(a). Zipfs order of letters for their instances in the beginning of

words in the texts

Text Zipfs order

Text 1 > :

Text 2 > :

Text 3 :

Text 4 > :

Text 5 > :

Text 6 > :

Este

lar


95

Table 3.5.2(b). Calculated values of determination coefficients (which are

>0.95) for different texts for rank frequency distribution of words initials.

Text

Total

counted

words(N2)

Determination

coefficient &

distribution

Determination

coefficient &

distribution

Determination

coefficient &

distribution

Text 1 8,298 0.978(Yule) 0.973(Negative

hypergeometric)

&


hypergeometric)

&


hypergeometric)

0.962(Zipf)

Text 4 34,404 0.984(Yule) 0.979( Negative

hypergeometric)

0.963(Zipf)


hypergeometric)

0.968(Geometric

series)

Text 6 1,39,373 0.985(Yule) 0.982(Negative

hypergeometric)

&

Tables 3.5.1(a) and 3.5.2(b) depict that the determination coefficient

in both cases is best for the Yule distribution for all the considered texts

and its obtained least value is 0.96, which is adequate to assume it as a

model for the frequencies. In table 3.5.1(b) letters - , , , , , ,

always take any position out of last seven positions in the corresponding

Zipfs order list and in the table 3.5.2(a) , , , , attain any one

position out of the last five positions in the Zipfs order list for the

occurrence of letters as words initial for all texts. Thus these seven letters

and five words initials can be assumed as seven letters of highest Zipfs

Este

lar

Chapter 3

96

orders or seven most frequent letters and five words initials of highest

Zipfs order or five most frequent words initials respectively.

The variations in the values of the parameters of the models for

different considered texts have been depicted in the following figures,

Figure 3.5.1 and 3.5.2. As the value of parameter a increases with the

size of text, for comparison we have drawn the values of 1N

a and 2N

a ,

where N1 denotes total counted letters in the text and N2 total counted word

tokens in the texts.

Figure 3.5.1. Parameters of the Yule distribution for different texts for rank

frequency distribution of letters.

Figure 3.5.2. Parameters of the Yule distribution for different texts for rank

frequency distribution of the word initials.

-0.1

0.4

0.9

Tex

t 1

Tex

t 2

Tex

t 3

Tex

t 4

Tex

t 5

Tex

t 6

Tan

av

a/N1 b c

-0.1

0.4

0.9

Tex

t 1

Tex

t 2

Tex

t 3

Tex

t 4

Tex

t 5

Tex

t 6

Ta

nav

a/N2 b cEste

lar


97

From these figures, it is evident that although the parameters of the

model for the distributions of letters in the text and in the starting position

of words of the texts lie in slight different ranges but their manner of

change for the two is almost the same.

From the discussions carried out so far, we have observed that , ,

, , , , are seven most frequent letters of Hindi language and , , ,

, are five most frequent words initials. The percentage of the total

letters of the texts formed by these seven letters and percentage of the total

words, starting by these five words initials can be summarized in the form

of following table:

Table 3.5.3. % of total letters and % of total word tokens of the texts

formed by 7 most frequent letters and 5 most frequent words initials.

Tan

av

se m

ukti

Tex

t 1

Tex

t 2

Tex

t 3

Tex

t 4

Tex

t 5

Tex

t 6

% of total letters formed

by 7 most frequent letters 55.2 53.1 53.4 53.9 54.98 52.6 53.9

% of total word-tokens for

the words started by any

one of the 5 most

frequent words initials

50.2 42.4 42.6 46.3 47.9 46.2 46.6

Thus out of all used letters (excluding matras) of Hindi language

alphabet in a text more than 50% are one of , , , , , , and out of

all used words more than 40% words have their initial letter one of , , ,

Este

lar

Chapter 3

98

, . In the next section we have compared the entropies of these seven

most frequent letters in texts and five most frequent word initials.

Here we have taken an initiative in order to improve the values of

the determination coefficients in the form of replacement of r by )( r in

the Yule equation ,rbr

cr

aF = to take the form as

r

br

brc

r

kc

r

aF 11)()(

=

= (3.5.1),

where 1c , b1 and k are parameters and is a small number less than 1, for the rank frequency distribution of occurrence of letters as words initial, for

whom the value of determination coefficient for the texts text 1 and text 2

is not greater than 0.98. After this we have calculated the values of the

determination coefficients-

0.983 for text 1 by )( r and =0.9 ,

0.986 for text 2 by )( r and = 0.95,

0.99 for text 3 by )( r and = 0.95 ,

0.989 for text 4 by )( r and = 0.8 and

0.989 for text 6 by )( r and = 0.9; and for the text 5 and the text Tanav se Mukti there is not any considerable improvement. Thus the

model for the rank frequency distribution of letters in the beginning of

words of the text can be formulated as

r

brc

r

aF

)( = (3.5.2),

c, b and a are parameters and is either zero or a number between 0 and 1.

Este

lar


99

3.6. Entropies of most frequent consonants

We have so far investigated that all five most frequent word initials

and seven most frequent letters in text are consonants and each consonant

in a text can have three types of instances- as a consonant, as half

consonant(or consonant followed by a halant ) and as consonant

followed by any matra ( /, , , , , , , , , , , ). Thus

each consonant can occur in 14 different ways in text, where the

occurrences of in , ( + + , + + ) etc. have been

supposed as occurrence of followed by matra . In order to calculate

the entropy for the pattern of occurrence of most frequent consonants, we

have analyzed occurrences of each of seven most frequent letters in

different texts and the pattern of the occurrence of five most frequent

initials. The obtained pattern for the occurrence of seven most frequent

letters in the text Tanav se Mukti has been shown in the table 3.6.1,

given in the next page.

As the entropy (or the degree of uncertainty) of the distribution of a

random variable x is:

=Xx

xpxpxH )(log)()( 2 (3.6.1),

where p(x) is the probability of x. We can infer that if the random variable

X for a particular consonant is taken as the matra following that

consonant then X={ none, /, , , , , , , , , , , /, },

where occurrence of a consonant as a half letter has been assumed as the

occurrence followed by .

Este

lar

Chapter 3

100

Table 3.6.1. Frequencies for patterns of occurrences of most frequent

letters for the consecutive matra in the text Tanav se Mukti.

Pa

tter

n of

occu

rren

ce

L

ette

rs

Whole letter 1688 2550 577 1278 1053 827 785

With / 710 285 223 650 225 709 362

With 243 133 159 164 59 404 99

With 326 102 414 106 124 163 22

With 82 90 94 127 141 111 49

With 8 16 13 4 12 8 27

With 44 2 14 0 3 2 18

With 535 241 80 516 418 338 444

With 7 3 1113 3 3 7 13

With 493 219 465 43 13 137 19

With 19 0 3 10 15 1 3

With or as half letter 303 454 7 323 425 385 65

With / 3 19 16 3 154 7 14

With 7 2 0 9 0 21 1

Total 4468 4116 3178 3236 2645 3120 1921

The values of entropies obtained by the equation (3.6.1) for , , ,

, , , from the table (3.6.1) for the subsequently followed matras

Este

lar


101

(discussed above) are respectively 2.695, 2.008, 2.669, 2.474, 2.607, 2.808,

and 2.401. Similarly entropies of most frequent word initials , , , ,

are respectively 2.811, 1.992, 2.427, 2.666 and 2.114. The calculated

values of entropies for all the considered texts have been shown in the

following figures:

Figure 3.6.1. Entropies of seven most frequent letters for their pattern of

occurrence

Figure 3.6.2. Entropies of most frequent word-initials for their pattern of occurrence

1.5

2

2.5

3

letters

entr

op

y

Text 1

Text 2

Text 3

Text 4

Text 5

Text 6

Tanav seMukti

1.5

2

2.5

3

3.5

letters

entr

opy

Text 1

Text 2

Text 3

Text 4

Text 5

Text 6

Tanavse Mukti

Este

lar

Chapter 3

102

On the basis of this study applied to seven texts, we can say that the

letter is much certain (less entropic) than 6 other most frequent letters in

all the considered texts. It is followed by except in the text Tanav se

Mukti in which has less entropy than . In case of words initial the

occurrence pattern is that is much certain in the texts- Tanav se Mukti,

Text formed by 5 stories, the text formed by 29 short stories, the text

formed by poems and in the text formed as the mixture of all the texts. In

the texts formed by 62 articles is less entropic followed by and in the

text formed by 37 articles is much certain followed by and the

entropies of these two are nearly equal thus can be considered as much

certain words initial than other most frequent initials with regard to its

pattern of occurrence.

3.7. Conclusion

In this information age, storage and retrieval of information in a

convenient manner has gained importance. Because of the near-universal

adoption of the WorldWideWeb as a repository of information for

unconstrained and wide dissemination, information is now broadly

available over the internet and is accessible from remote sites.

However, the interaction between the computer and the user is

largely through keyboard and screen-oriented systems. In the current

Indian context, this restricts the usage to a minuscule fraction of the

population, who are both computer-literate and conversant with written

English. In order to enable a wider proportion of the population to benefit

from the internet, there is need to provide some additional interface with

Este

lar


103

the machine and hence the utility of such type of researches can be

understood.

The discussions in the present chapter can be concluded as-

, , , , , , are seven most frequent letters of Hindi

language and these form more than 50% of texts in terms of

letters of Hindi language alphabet.

, , , , are five most frequent words initials and more

than 40% of words of the text begin with any one of these letters.

Frequencies of letters in a text and in combinations of texts

follow Yule distribution ,rbr cr

aF = where a , b and c are

parameters.

Frequencies of letters as words initials follow a Yule

distribution ,rbr

cr

aF = which can be more properly expressed

by a distribution of type rbr cr

aF

)( = where is either zero

or lies between o and 1, and a , b and c are parameters.

The letter is much certain than other 6 most frequent letters

and is much certain words initial than other 4 most frequent

words initials in the form of their patterns of occurrence.

Este

lar

Este

lar

Chapter 3 Mathematical Modeling for Letter’s Frequencies...

Documents

Transcript of Chapter 3 Mathematical Modeling for Letter’s Frequencies...