Chapter 3 Mathematical Modeling for Letter’s Frequencies...

26
Chapter 3 Mathematical Modeling for Letter’s Mathematical Modeling for Letter’s Mathematical Modeling for Letter’s Mathematical Modeling for Letter’s Frequencies in Text and in Words’ Frequencies in Text and in Words’ Frequencies in Text and in Words’ Frequencies in Text and in Words’ Initials for Hindi Language Initials for Hindi Language Initials for Hindi Language Initials for Hindi Language * Frequencies of occurrence of letters in natural language texts are not arbitrarily organized but follow some particular rules. Despite the apparent freedom a writer has to create any desired sequence of words, written text tends to follow some very simple set of laws. Letter frequencies enable us to explain some characteristic features of human language. Words in human language interact in sentences in non-random ways, and allow humans to construct an astronomic variety of sentences from a limited number of discrete units. Present chapter gives an analysis of frequencies of letters of Hindi language alphabet on the basis of their occurrence in a * Part of this chapter has been presented in Third Students’ conference of Linguistics in India, 19 th -21 st February, 2009, Centre for Linguistics, JNU, New Delhi, in the form of following research paper: Hemlata Pande and H. S. Dhami (2009). Analysis and mathematical model generation for letter frequencies in text and in word’s initials for Hindi language and measurement of entropies of most frequent consonants. (Abstract). The power point presentation of the paper is available at : http://www.sconli.org/SCONLI_presentation/Hemlata.ppt Contents from Chapter 2 and Chapter 3 in the form of a research paper “Mathematical Modelling of Occurrence of Letters and Word’s Initials in Texts of Hindi Language” has been accepted for publication in SKASE Journal of Theoretical Linguistics, Vol 7, 2010. Estelar

Transcript of Chapter 3 Mathematical Modeling for Letter’s Frequencies...

  • Chapter 3

    Mathematical Modeling for Letters Mathematical Modeling for Letters Mathematical Modeling for Letters Mathematical Modeling for Letters

    Frequencies in Text and in Words Frequencies in Text and in Words Frequencies in Text and in Words Frequencies in Text and in Words

    Initials for Hindi LanguageInitials for Hindi LanguageInitials for Hindi LanguageInitials for Hindi Language****

    Frequencies of occurrence of letters in natural language texts are not

    arbitrarily organized but follow some particular rules. Despite the apparent

    freedom a writer has to create any desired sequence of words, written text

    tends to follow some very simple set of laws. Letter frequencies enable us

    to explain some characteristic features of human language. Words in

    human language interact in sentences in non-random ways, and allow

    humans to construct an astronomic variety of sentences from a limited

    number of discrete units. Present chapter gives an analysis of frequencies

    of letters of Hindi language alphabet on the basis of their occurrence in a

    * Part of this chapter has been presented in Third Students conference of Linguistics in India, 19th-21st February, 2009, Centre for Linguistics, JNU, New Delhi, in the form of following research paper: Hemlata Pande and H. S. Dhami (2009). Analysis and mathematical model generation for letter frequencies in text and in words initials for Hindi language and measurement of entropies of most frequent consonants. (Abstract). The power point presentation of the paper is available at :

    http://www.sconli.org/SCONLI_presentation/Hemlata.ppt

    Contents from Chapter 2 and Chapter 3 in the form of a research paper Mathematical Modelling of Occurrence of Letters and Words Initials in Texts of Hindi Language has been accepted for publication in SKASE Journal of Theoretical Linguistics, Vol 7, 2010.

    Este

    lar

  • Chapter 3

    80

    text and also for their occurrence as words initial. We have applied Zipfs

    law, which states that the frequency of the rth most frequent word type is

    proportional to 1/r, to the letters instead of words for obtaining Zipfs order

    and rank/frequency profiles of letters for their instances in text and in

    beginning of words. Different distribution models have been tested for the

    letter frequencies and finally appropriate parametric models have been

    obtained for rank frequency distribution of letters in a corpus and for rank

    frequency distribution of first letter of words in a Hindi text. Validations of

    the models have been done by comparison of observed frequencies and

    theoretical frequencies for various types of texts obtained from different

    sources, and the variations in the values of parameters have been studied.

    Comparison work has been accomplished for the entropies in text for the

    vowel mark in Devanagari script written as matra following the consonant

    (for the most frequent consonants).

    3.1. Prelude

    The properties of linguistic elements and their interrelations abide

    by universal laws which are formulated in a strict mathematical way, in

    analogy to the laws of well known natural sciences and it leads to the

    application of some abstract mathematical and statistical tools to the

    language theory. Mathematical Linguistics offers a number of tools that

    allow abstract theories of language and mental computation to be compared

    and contrasted with one another. An account of work in the field of

    mathematical methods of linguistics can be had from the book of Partee et

    al (1990), while compendium of several rules proposed to understand the

    linguistic structure by which language communicates can be seen in the

    work of Manning and Schutze (1999).

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    81

    The process of formation of language models is quite popular in

    linguistics. A linguistic model is a system of data (features, types,

    structures, levels, etc.) and rules, which, taken together, can exhibit a

    behavior similar to that of the human brain in understanding and

    producing speech and texts. Mathematical models for languages had been

    proposed by Harris (1982) in his famous book entitled A Grammar of

    English on Mathematical Principles. In statistical language modeling,

    large amounts of text are used to automatically determine the models

    parameters. Bell and Witten (1988) has mentioned that natural-language

    text is the result of an enormously complex process and modeling

    strategies can be applied successfully to such a sophisticated artifact. A

    compendium of different statistical models can be seen in the book of

    Manning and Schutze (1999). Naranan and Balasubrahmanyan (1998) have

    described the use of power law relations in modeling of languages. A

    parsing language model incorporating multiple knowledge sources that is

    based upon the concept of constraint Dependency Grammars is presented

    in the paper of Wang and Harper (2002). A neural probabilistic language

    model has been generated by Bengio et al (2003) while hierarchical

    probabilistic neural network language model has been generated by Morin

    and Bengio (2005). Different neural network language models can be seen

    in the work of Bengio (2008). The book of Levinson (2003) presents

    mathematical models of natural spoken language communication. Strauss

    et al (2008) in the book Problems in Quantitative Linguistics have

    discussed models for phoneme frequencies.

    The frequency of letters in text has often been studied for use in

    cryptography and frequency analysis. Modern International Morse code

    encodes the most frequent letters with the shortest symbols and a similar

    idea has been used in modern data-compression techniques also. Linotype

    machines are also based on letter frequencies of English language texts.

    Este

    lar

  • Chapter 3

    82

    The reference for these applications can be had from Nation Master-

    Encyclopedia (Letter frequencies).

    No exact letter frequency distribution underlies a given language,

    since all writers write slightly differently. More recent analyses show that

    letter frequencies, like word frequencies, tend to vary, both by writer and

    by subject. One cannot write an essay about x-rays without using frequent

    Xs, and the essay will have an especially strange letter frequency if the

    essay is about the frequent use of x-rays to treat zebras in Qatar. Different

    authors have habits which can be reflected in their use of letters.

    Hemingway's writing style, for example, is visibly different from

    Faulkner's. Letter, bigram, trigram, word frequencies, word length, and

    sentence length can be calculated for specific authors, and used to prove or

    disprove authorship of texts, even for authors whose styles aren't so

    divergent.

    Since the patterns of letters in language dont happen at random, so

    letter frequencies can be represented in the form of some rules and the

    pattern of letter frequencies can help in discrimination of random text from

    that of natural language text. In the area of letter-frequencies, we can cite

    the works of Solso and King (1976); of Bell and Witten (1988) and

    Eftekhari (2006). According to Sanderson (2007) distribution of letters in a

    text can potentially help in the process of language determination. We have

    tried to discuss the patterns of Hindi language alphabets on the basis of

    their occurrence in texts and as word initials with the application of the

    rank frequency approach. The detailed analyses and the modeling done in

    this chapter is an attempt in the direction of the determination of the

    linguistic structure of Hindi Language (on the basis of its letters). Most of

    the previous studies for the letter frequencies have been concreted for

    English and other languages.

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    83

    3.2. Occurrence of letters in a text

    We have initiated the work for present study by counting the

    occurrence of letters of Hindi language alphabet in the text Tanav se

    mukti (which has been taken as an example text for the study) of author

    Shivanand (source has been given in appendix C, C.2.).Following aspects

    have been considered in the counting process:

    Independent occurrence of each letter (vowel as well as

    consonant), as the occurrences of and in .

    Each occurrence of consonants followed either by a vowel

    mark in Devanagari script written as matra, as occurrence of

    in , , , or occurrence in the form of half letters

    anywhere in the word. For example we may cite the example

    of occurrence of in and as such = + + +

    + + + + contains 2 , 1 , 1 and 1 and (

    ++ + ++ +) contains four letters , , , once each.

    Occurrence of has been considered as the occurrence of vk

    and of , vkWa, etc. with , vka etc. Occurrences in the form of ,

    etc. have been merged with the occurrences of , [k and so

    on.

    In our selected text Tanav se mukti the total counted values of the

    frequency of occurrence of each letter have been depicted in the table 3.2.1.

    For obtaining the value - how many times each letter occurred in the text?,

    Find and Find and replace options from Edit menu of Microsoft Word

    Este

    lar

  • Chapter 3

    84

    have been used and search option wildcard has also been applied

    wherever needed.

    Table 3.2.1. Frequencies of occurrence of letters for the text Tanav se

    mukti.

    Letter Frequency Letter Frequency Letter Frequency

    711 0 84

    517 661 716

    99 194 805

    167 848 1921

    458 117 1553

    17 3 4116

    449 294 1128

    23 128 1850

    54 270 589

    322 92 361

    16 316 2645

    0 3120 3178

    2 543 : 125

    4468 1260 143

    341 490 > 25

    679 3236

    122 1858 Total 41,114

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    85

    If we represent the total occurrence of all the letters of alphabet in

    the text by N1, that is, =k

    kfN ,1 k = 1, 2.. , where kf is the frequency

    of occurrence of kth letter of the alphabet of Hindi language, then its value

    ( 1N ) for the selected text shall be equal to 41,114.

    We have ranked the letters in descending order of their frequency of

    occurrence as done by Zipf. The rank and frequency profile for each letter

    can be visualized as shown in the following table 3.2.2-

    Table 3.2.2. Rank frequency profile for the occurrence of letters.

    Rank Frequency Rank Frequency Rank Frequency

    1 4468 17 679 33 143

    2 4116 18 661 34 128

    3 3236 19 589 35 125

    4 3178 20 543 36 122

    5 3120 21 517 37 117

    6 2645 22 490 38 99

    7 1921 23 458 39 92

    8 1858 24 449 40 84

    9 1850 25 361 41 54

    10 1553 26 341 42 25

    11 1260 27 322 43 23

    12 1128 28 316 44 17

    13 848 29 294 45 16

    14 805 30 270 46 3

    15 716 31 194 47 2

    16 711 32 167

    According to Eftekhari (2006) if the letters are ordered in

    accordance with their frequency of occurrence (according to their

    Este

    lar

  • Chapter 3

    86

    ascending orders of frequencies), the monotonic changes can be obtained

    by the application of the Zipfs law. Such type of arrangement has been

    given the name Zipfs order. For our text Tanav se Mukti the Zipfs

    order for occurrence of letters in the text has been obtained as shown in the

    following figure-

    Figure 3.2.1 Zipfs orders for letters and their proportions.

    3.3. Occurrence of letters as words initial

    For determination of frequencies of occurrence of various words

    initials in a text, the software available at:

    http://www.seasite.niu.edu/trans/Indonesian/transpgms/latinfreq.aspx, for

    obtaining the word frequency list corresponding to the text has been used

    and frequencies of words with various initials have been determined from

    the obtained list. Results obtained for the text Tanav se Mukti for

    occurrence of letters as the initial letter of words in the studied text has

    been tabulated in table 3.3.1.

    ====

    KKKK

    {k{ k{ k{ k

    0.00E+00

    2.00E-02

    4.00E-02

    6.00E-02

    8.00E-02

    1.00E-01

    1.20E-01

    0 10 20 30 40 50

    Zipf's order of letters in the text(i)

    Pro

    po

    rtio

    n o

    f let

    ters

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    87

    Table 3.3.1. Frequencies of different word initials in the text Tanav se

    Mukti.

    Letter Frequency Letter Frequency Letter Frequency

    707 0 52

    495 295 527

    86 79 501

    56 599 963

    457 23 158

    16 0 372

    195 21 391

    23 13 555

    20 31 191

    322 8 2

    16 0 1581

    0 419 2153

    2 23 : 62

    2708 672 4

    64 133 > 9

    155 550

    70 1020 Total 16,799

    Este

    lar

  • Chapter 3

    88

    Here if N2 represents the total word tokens in the text (where words

    formed by numerals 1, 2 etc. have been excluded), then N2 = 16,799. A

    table similar to the table 3.2.2 can be constructed for the rank frequency

    profile of words initials for which the Zipfs order can be demonstrated

    with the help of following figure-

    Figure 3.3.1. Zipfs orders for words initials

    3.4. Mathematical models for frequencies of letters and

    words initials

    After obtaining the rank frequency profile of texts for different

    letters and for different words initials we have tried to formulate these in

    the form of mathematical equations. For this we have tested models

    (available in literature for rank frequency distributions)- distributions

    specified for rank frequency distribution of word frequencies, Zipfs

    distribution and Zipf Mandelbrot distribution as given in the book by

    Manning and Schutze (1999); various rank frequency distributions used for

    ====KKKK

    {k{k{k{k

    0

    0.06

    0.12

    0.18

    0 10 20 30 40 50Zipf's order(i) of word's initials

    no

    rmal

    ized

    fre

    qu

    enci

    es

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    89

    linguistic components as- Good, Geometric series distribution, Borodovsky

    and Gusein Zade equation as cited in the research paper of Tambovtsev and

    Martindale (2007) and Yule equation as concluded for the phoneme

    frequencies by the authors; negative hypergeometric distribution and

    determination of frequencies by Zipfs order. All these

    models/distributions have been discussed in detail in section 2.3.1 of

    chapter 2 in the form of equations (2.3.1) to equation (2.3.8).

    Application of the distributions to the frequencies of different letters,

    ranks of letters and their Zipf orders for the text Tanav se Mukti have

    enabled us to calculate the theoretical values of the frequencies and the

    values of determination coefficient ( 2R ) for each model. An outline of

    determination coefficient ( 2R ) has been given by equation (1.4) of

    Introductory Chapter 1.

    The calculated values of determination coefficient for rank

    frequency distribution of letters for their occurrences in the text and in the

    starting position of the words in the text Tanav se Mukti have been

    shown in the Table 3.4.1. We have used the software Datafit for

    determination of parameters for all the equations except (2.3.2). This

    software was producing numerical errors for equation (2.3.2) and as such

    we were left with no option but to apply the least square method for linear

    relation

    )log()log()log()log( crBAcrbaFr ++=+= ,

    by first setting c = 0 and then increasing it by small steps until the

    goodness of fit stops improving.

    Este

    lar

  • Chapter 3

    90

    Table 3.4.1. Distributions, parameters and determination coefficients for

    the text Tanav se Mukti

    Distribution For occurrence of letters

    in the text

    For occurrence of letters

    in the words initial

    Det

    erm

    inat

    ion

    coef

    fici

    ent

    Par

    amet

    ers

    Det

    erm

    inat

    ion

    coef

    fici

    ent

    Par

    amet

    ers

    Zipf 0.836 5683.239,=a

    b = 0.685

    0.932 3051.361,=a

    b = 0.815

    Zipf Mandelbrot 0.985 ,10 491.7995144=a974.1142162=b

    ,107=c

    0.925 ,10 571.5531903=a0218.816137=b,106 6=c

    Good 0.900 & 0.825 &

    Geometric series 0.9891 5007.781,=a b = 0.889

    0.937 2806.074,=a b = 0.843

    Borodovsky &

    Gusein-Zade equation

    0.866 & 0.772 &

    Yule 0.9894 5072.373,=a b = 0.0382,

    c = 0.8958

    0.981 2982.119,=a

    b = 0.459,

    c = 0.936

    Negative

    hypergeometric

    0.939 3.186,=a b = 0.6872

    0.976 2.2503,=a

    b = 0.337

    br iaF = , i be Zipf

    order of letter of

    rank r

    0.978 ,10113.2 4=a365.4=b

    0.892 ,10663.3 5=a

    b = -4.673

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    91

    The table depicts that the highest determination coefficient in both

    cases has been obtained for the Yule distribution (0.9894 and 0.981

    respectively)-

    r

    brc

    r

    aF = (3.4.1),

    where rF is the frequency corresponding to the rank r and a, b, c are

    parameters. Therefore on the basis of results in the example text taken by

    us, it can be assumed that the frequencies of occurrence of letters and

    words initials in Hindi text follow Yules distribution. Corresponding

    theoretical and empirical frequencies for the text Tanav se Mukti for

    different ranks have been shown in the following two figures ( Figure 3.4.1

    and 3.4.2) In the next section we shall check whether this distribution is

    followed by the frequencies of occurrence in other texts also.

    Figure 3.4.1. Observed and theoretical frequencies corresponding to

    different ranks of letters.

    0

    500

    1000

    1500

    2000

    2500

    3000

    3500

    4000

    4500

    5000

    1 6 11 16 21 26 31 36 41 46

    Rank

    Fre

    qu

    ency

    observed theoretical

    Este

    lar

  • Chapter 3

    92

    Figure 3.4.2. Observed and theoretical frequencies corresponding to

    different ranks of words initials.

    3.5. Validation of model for different texts and its

    improvement for frequencies of words initials

    We have counted the frequency of letters for various kinds of texts

    obtained from different sources1 in the following forms-

    1. Text composed of 5 stories.(say, Text 1)

    2. Text containing 29 short stories.(Text 2)

    3. Text containing 62 articles.(Text 3)

    4. Text containing 37 articles.(Text 4)

    5. Poetic text containing 180 poems (Text 5), and

    6. Text formed as mixture of all the considered texts, from text

    1 to text 5 above and the text Tanav se Mukti.(Text 6)

    The calculated values of the determination coefficients, that are

    greater than 0.98 for letter frequencies in the texts and greater than 0.95 for

    1 Sources of the texts have been given in the Appendix C.

    0

    600

    1200

    1800

    2400

    3000

    1 6 11 16 21 26 31 36 41Rank

    Fre

    quen

    cy

    observed theoretical

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    93

    the letter frequencies in the words initial and the obtained Zipfs orders for

    the occurrence of letters in text and for occurrence as words initial have

    been given in the following tables:

    Table 3.5.1(a). Calculated values of determination coefficients (which are

    >0.98) for different texts for rank frequency distribution of letters.

    Text

    Total counted

    letters(N1)

    Determination

    coefficient &

    distribution

    Determination

    coefficient &

    distribution

    Determination

    coefficient &

    distribution

    Text1 18,162 0.990(Yule) 0.989

    (Geometric-

    series)

    0.981(Zipf -

    Mandelbrot)

    Text 2 22,375 0.996(Yule) 0.989

    (Geometric-

    series)

    0.988(Zipf -

    Mandelbrot)

    Text 3 70,622 0.992(Yule) 0.981

    (Negative-

    hypergeometri

    c)

    0.980(Geometr

    ic- series)

    Text 4 82,803 0.992(Yule) 0.984

    (Geometric-

    series)

    0.983(Zipf-

    Mandelbrot)

    Text 5 90,908 0.991(Yule) 0.990

    (Geometric-

    series)

    &

    Text 6 3,25,983 0.992(Yule) 0.990

    (Geometric-

    series)

    0.984(Zipf -

    Mandelbrot)

    Este

    lar

  • Chapter 3

    94

    Table 3.5.1(b). Zipfs order of letters for their instances in text

    Text Zipfs order

    Text 1 > :

    Text 2 > :

    Text 3 > :

    Text 4 > :

    Text 5 > :

    Text 6 > :

    Table 3.5.2(a). Zipfs order of letters for their instances in the beginning of

    words in the texts

    Text Zipfs order

    Text 1 > :

    Text 2 > :

    Text 3 :

    Text 4 > :

    Text 5 > :

    Text 6 > :

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    95

    Table 3.5.2(b). Calculated values of determination coefficients (which are

    >0.95) for different texts for rank frequency distribution of words initials.

    Text

    Total

    counted

    words(N2)

    Determination

    coefficient &

    distribution

    Determination

    coefficient &

    distribution

    Determination

    coefficient &

    distribution

    Text 1 8,298 0.978(Yule) 0.973(Negative

    hypergeometric)

    &

    Text 2 9,831 0.960(Yule) 0.953(Negative

    hypergeometric)

    &

    Text 3 29,623 0.975(Yule) 0.963(Negative

    hypergeometric)

    0.962(Zipf)

    Text 4 34,404 0.984(Yule) 0.979( Negative

    hypergeometric)

    0.963(Zipf)

    Text 5 40,418 0.987(Yule) 0.984(Negative

    hypergeometric)

    0.968(Geometric

    series)

    Text 6 1,39,373 0.985(Yule) 0.982(Negative

    hypergeometric)

    &

    Tables 3.5.1(a) and 3.5.2(b) depict that the determination coefficient

    in both cases is best for the Yule distribution for all the considered texts

    and its obtained least value is 0.96, which is adequate to assume it as a

    model for the frequencies. In table 3.5.1(b) letters - , , , , , ,

    always take any position out of last seven positions in the corresponding

    Zipfs order list and in the table 3.5.2(a) , , , , attain any one

    position out of the last five positions in the Zipfs order list for the

    occurrence of letters as words initial for all texts. Thus these seven letters

    and five words initials can be assumed as seven letters of highest Zipfs

    Este

    lar

  • Chapter 3

    96

    orders or seven most frequent letters and five words initials of highest

    Zipfs order or five most frequent words initials respectively.

    The variations in the values of the parameters of the models for

    different considered texts have been depicted in the following figures,

    Figure 3.5.1 and 3.5.2. As the value of parameter a increases with the

    size of text, for comparison we have drawn the values of 1N

    a and 2N

    a ,

    where N1 denotes total counted letters in the text and N2 total counted word

    tokens in the texts.

    Figure 3.5.1. Parameters of the Yule distribution for different texts for rank

    frequency distribution of letters.

    Figure 3.5.2. Parameters of the Yule distribution for different texts for rank

    frequency distribution of the word initials.

    -0.1

    0.4

    0.9

    Tex

    t 1

    Tex

    t 2

    Tex

    t 3

    Tex

    t 4

    Tex

    t 5

    Tex

    t 6

    Tan

    av

    a/N1 b c

    -0.1

    0.4

    0.9

    Tex

    t 1

    Tex

    t 2

    Tex

    t 3

    Tex

    t 4

    Tex

    t 5

    Tex

    t 6

    Ta

    nav

    a/N2 b cEste

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    97

    From these figures, it is evident that although the parameters of the

    model for the distributions of letters in the text and in the starting position

    of words of the texts lie in slight different ranges but their manner of

    change for the two is almost the same.

    From the discussions carried out so far, we have observed that , ,

    , , , , are seven most frequent letters of Hindi language and , , ,

    , are five most frequent words initials. The percentage of the total

    letters of the texts formed by these seven letters and percentage of the total

    words, starting by these five words initials can be summarized in the form

    of following table:

    Table 3.5.3. % of total letters and % of total word tokens of the texts

    formed by 7 most frequent letters and 5 most frequent words initials.

    Tan

    av

    se m

    ukti

    Tex

    t 1

    Tex

    t 2

    Tex

    t 3

    Tex

    t 4

    Tex

    t 5

    Tex

    t 6

    % of total letters formed

    by 7 most frequent letters 55.2 53.1 53.4 53.9 54.98 52.6 53.9

    % of total word-tokens for

    the words started by any

    one of the 5 most

    frequent words initials

    50.2 42.4 42.6 46.3 47.9 46.2 46.6

    Thus out of all used letters (excluding matras) of Hindi language

    alphabet in a text more than 50% are one of , , , , , , and out of

    all used words more than 40% words have their initial letter one of , , ,

    Este

    lar

  • Chapter 3

    98

    , . In the next section we have compared the entropies of these seven

    most frequent letters in texts and five most frequent word initials.

    Here we have taken an initiative in order to improve the values of

    the determination coefficients in the form of replacement of r by )( r in

    the Yule equation ,rbr

    cr

    aF = to take the form as

    r

    br

    brc

    r

    kc

    r

    aF 11)()(

    =

    = (3.5.1),

    where 1c , b1 and k are parameters and is a small number less than 1, for the rank frequency distribution of occurrence of letters as words initial, for

    whom the value of determination coefficient for the texts text 1 and text 2

    is not greater than 0.98. After this we have calculated the values of the

    determination coefficients-

    0.983 for text 1 by )( r and =0.9 ,

    0.986 for text 2 by )( r and = 0.95,

    0.99 for text 3 by )( r and = 0.95 ,

    0.989 for text 4 by )( r and = 0.8 and

    0.989 for text 6 by )( r and = 0.9; and for the text 5 and the text Tanav se Mukti there is not any considerable improvement. Thus the

    model for the rank frequency distribution of letters in the beginning of

    words of the text can be formulated as

    r

    brc

    r

    aF

    )( = (3.5.2),

    c, b and a are parameters and is either zero or a number between 0 and 1.

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    99

    3.6. Entropies of most frequent consonants

    We have so far investigated that all five most frequent word initials

    and seven most frequent letters in text are consonants and each consonant

    in a text can have three types of instances- as a consonant, as half

    consonant(or consonant followed by a halant ) and as consonant

    followed by any matra ( /, , , , , , , , , , , ). Thus

    each consonant can occur in 14 different ways in text, where the

    occurrences of in , ( + + , + + ) etc. have been

    supposed as occurrence of followed by matra . In order to calculate

    the entropy for the pattern of occurrence of most frequent consonants, we

    have analyzed occurrences of each of seven most frequent letters in

    different texts and the pattern of the occurrence of five most frequent

    initials. The obtained pattern for the occurrence of seven most frequent

    letters in the text Tanav se Mukti has been shown in the table 3.6.1,

    given in the next page.

    As the entropy (or the degree of uncertainty) of the distribution of a

    random variable x is:

    =Xx

    xpxpxH )(log)()( 2 (3.6.1),

    where p(x) is the probability of x. We can infer that if the random variable

    X for a particular consonant is taken as the matra following that

    consonant then X={ none, /, , , , , , , , , , , /, },

    where occurrence of a consonant as a half letter has been assumed as the

    occurrence followed by .

    Este

    lar

  • Chapter 3

    100

    Table 3.6.1. Frequencies for patterns of occurrences of most frequent

    letters for the consecutive matra in the text Tanav se Mukti.

    Pa

    tter

    n of

    occu

    rren

    ce

    L

    ette

    rs

    Whole letter 1688 2550 577 1278 1053 827 785

    With / 710 285 223 650 225 709 362

    With 243 133 159 164 59 404 99

    With 326 102 414 106 124 163 22

    With 82 90 94 127 141 111 49

    With 8 16 13 4 12 8 27

    With 44 2 14 0 3 2 18

    With 535 241 80 516 418 338 444

    With 7 3 1113 3 3 7 13

    With 493 219 465 43 13 137 19

    With 19 0 3 10 15 1 3

    With or as half letter 303 454 7 323 425 385 65

    With / 3 19 16 3 154 7 14

    With 7 2 0 9 0 21 1

    Total 4468 4116 3178 3236 2645 3120 1921

    The values of entropies obtained by the equation (3.6.1) for , , ,

    , , , from the table (3.6.1) for the subsequently followed matras

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    101

    (discussed above) are respectively 2.695, 2.008, 2.669, 2.474, 2.607, 2.808,

    and 2.401. Similarly entropies of most frequent word initials , , , ,

    are respectively 2.811, 1.992, 2.427, 2.666 and 2.114. The calculated

    values of entropies for all the considered texts have been shown in the

    following figures:

    Figure 3.6.1. Entropies of seven most frequent letters for their pattern of

    occurrence

    Figure 3.6.2. Entropies of most frequent word-initials for their pattern of occurrence

    1.5

    2

    2.5

    3

    letters

    entr

    op

    y

    Text 1

    Text 2

    Text 3

    Text 4

    Text 5

    Text 6

    Tanav seMukti

    1.5

    2

    2.5

    3

    3.5

    letters

    entr

    opy

    Text 1

    Text 2

    Text 3

    Text 4

    Text 5

    Text 6

    Tanavse Mukti

    Este

    lar

  • Chapter 3

    102

    On the basis of this study applied to seven texts, we can say that the

    letter is much certain (less entropic) than 6 other most frequent letters in

    all the considered texts. It is followed by except in the text Tanav se

    Mukti in which has less entropy than . In case of words initial the

    occurrence pattern is that is much certain in the texts- Tanav se Mukti,

    Text formed by 5 stories, the text formed by 29 short stories, the text

    formed by poems and in the text formed as the mixture of all the texts. In

    the texts formed by 62 articles is less entropic followed by and in the

    text formed by 37 articles is much certain followed by and the

    entropies of these two are nearly equal thus can be considered as much

    certain words initial than other most frequent initials with regard to its

    pattern of occurrence.

    3.7. Conclusion

    In this information age, storage and retrieval of information in a

    convenient manner has gained importance. Because of the near-universal

    adoption of the WorldWideWeb as a repository of information for

    unconstrained and wide dissemination, information is now broadly

    available over the internet and is accessible from remote sites.

    However, the interaction between the computer and the user is

    largely through keyboard and screen-oriented systems. In the current

    Indian context, this restricts the usage to a minuscule fraction of the

    population, who are both computer-literate and conversant with written

    English. In order to enable a wider proportion of the population to benefit

    from the internet, there is need to provide some additional interface with

    Este

    lar

  • Mathematical modeling of frequencies of letters in Hindi texts

    103

    the machine and hence the utility of such type of researches can be

    understood.

    The discussions in the present chapter can be concluded as-

    , , , , , , are seven most frequent letters of Hindi

    language and these form more than 50% of texts in terms of

    letters of Hindi language alphabet.

    , , , , are five most frequent words initials and more

    than 40% of words of the text begin with any one of these letters.

    Frequencies of letters in a text and in combinations of texts

    follow Yule distribution ,rbr cr

    aF = where a , b and c are

    parameters.

    Frequencies of letters as words initials follow a Yule

    distribution ,rbr

    cr

    aF = which can be more properly expressed

    by a distribution of type rbr cr

    aF

    )( = where is either zero

    or lies between o and 1, and a , b and c are parameters.

    The letter is much certain than other 6 most frequent letters

    and is much certain words initial than other 4 most frequent

    words initials in the form of their patterns of occurrence.

    Este

    lar

  • Este

    lar