This is an accepted manuscript of an article published by Taylor and Francis in Journal of Quantitative linguistics on 17/12/2014, available online: http://dx.doi.org/10.1080/09296174.2014.974457


    Analysis and mathematical modeling of the pattern of occurrence of various Devanagari letter symbols according to the Phonological Inventory of Indic script in Hindi language

    Authors: Hemlata Pande¹ and Hoshiyar S. Dhami

    Department of Mathematics, Kumaun University, S. S. J. Campus Almora, Uttarakhand, INDIA

    Email: [email protected], [email protected]

    ¹ Author for correspondence


    ABSTRACT

    The present paper analyzes the pattern of occurrence of the different letters of the Hindi alphabet, or varnamala, in text and corpora of Hindi. One text has been selected and nine corpora have been formed for the study by compiling texts or corpora from diverse sources. The relative proportions of the vowels and of the members of different groups of Devanagari symbols, grouped according to the Phonological Inventory of Indic script, have been assessed by the rank frequency approach, and the Zipf's orders of the groups have been discussed. The entropic measures of the various groups have also been compared. Moreover, characteristic curves for Hindi text and corpora, based on the proportions of the various groups of symbols and consonants, have been presented. Using linear regression with one and two independent variables, a suitable model for the frequencies of the different groups has been determined.

    Key words: Zipf's order, entropy, model, characteristic curve, regression.

    Introduction

    Bolshakov and Gelbukh (2004) have mentioned that in recent years interest in the empirical approach to linguistic research has grown, and that the empirical approach is based on numerous statistical observations gathered purely automatically. Goyal (2011) has noted that statistical analysis of a language is a vital part of natural language processing. Numerous analyses and models of various languages, corresponding to diverse components of language, have been presented by different researchers. For Hindi and some other Indian languages, we can cite the work of Bharati et al. (2002) on statistical analyses of corpora of ten Indian languages based on unigram frequencies, bigram frequencies, syllable frequencies etc., and the works of Jayaram and Vidya (2006, 2008) on the word length distribution for Hindi and some other Indian languages and on the word frequency distribution of Hindi and three other Indian languages, respectively. In our earlier work (Pande and Dhami, 2010) we presented analyses and a model for the pattern of occurrence of different letters in Hindi texts and in the initial position of the words of those texts. Goyal (2011) has carried out a statistical analysis of Hindi and compared it with that of Punjabi. In another work (Pande and Dhami, 2013) we presented the analysis, model and various measures for the words occurring in different corpora of Hindi.

    Sanderson (2007) has pointed out that the distribution of letters in a text can help in the process of language determination. Murthy and Kumar (2006) have mentioned that the Devanagari script is used for various languages, and they have emphasized the need to identify a language irrespective of the font or script being used. Letter/grapheme frequencies and their corresponding models have been studied by various investigators; in this context the works of Good (1969), Solso and King (1976), Bell and Witten (1988), Grzybek and Kelih (2005), Eftekhari (2006) and Sanderson (2007) can be cited, to mention only a few. Most of these works deal with English and languages other than Hindi; for the analysis and modeling of the pattern of occurrence of letters in Hindi, our earlier study (Pande and Dhami, 2010) can be mentioned.

    Singh (2006) has noted that most of the major Indian languages use scripts that developed from the ancient Brahmi script. Devanagari is a Brahmi-derived script (Gupta, 2008), and Hindi is mostly written in Devanagari. Languages such as Sanskrit, Nepali and Marathi are also written in Devanagari. Owing to the phonetic nature of the Devanagari alphabet, vowels and consonants are separated, and the consonants themselves are separated on the basis of phonetic features (Singh, 2006). In the alphabetical array of letters in Hindi, the vowels are listed first and are followed by the consonants [i].

    A letter in Hindi is called a varna, and the alphabetic order chart of the letters of Hindi is called the varnamala. In the varnamala the letters are arranged according to the manner and place of articulation. In the most general traditional form of the varnamala [ii] (according to which school-children learn the letters of Hindi), the letters are arranged as follows (the arrangement is shown in the Appendix):

    First the 13 vowels are placed, followed by 33 consonants arranged in seven lines. These consonants have been referred to as the basic consonants (Goyal and Lehal, 2008), pure consonants (Dhore et al., 2012) and also as main consonants [iii]. After these come three consonants termed Consonant Conjuncts (Goyal and Lehal, 2008) or conjunct characters [iv]. These characters are given in the last row of consonants in the Appendix.

    The same arrangement, except for the exclusion of the letter (), has been mentioned by Singh (2006) as the core part of the alphabet for all scripts of Brahmi origin. The occurrence rate of this letter in Hindi language texts is generally low: the highest value of its proportion with respect to the total (independent) occurrence of the symbols listed in the Appendix is 0.038% for our considered texts and corpora (discussed in the next section).

    However, there is no unanimity in defining the alphabet of the Devanagari script. For example, Kumar (2006) has defined only 33 consonants (the basic consonants) and has also defined the seldom used characters , , (, , ) as vowels, while in some works only the first 11 vowels [v] (Appendix) are defined and the last two, taken in their dependent forms (adjoined to other vowels or consonants), are mentioned as sound modifiers [v]. Similarly, for the Devanagari alphabet, Joshi et al. (2004) have included a symbol that is not used in Hindi, have excluded () from the list of the most frequently used vowels, and have included (shra) together with the three above-mentioned conjuncts (a similar inclusion of shra is also found in Singh (2006) and Goyal and Lehal (2008)). Conjunct consonants in the Devanagari writing system are written by adjoining consonants with the symbol called halant, e.g. = + + . There are also conjuncts in Hindi language texts other than the four discussed above, but in the traditional varnamala only three are defined.

    The aim of the present paper is not to discuss the different forms of the alphabet, but to characterize the language and to identify the pattern of occurrence of symbols in different Hindi language texts once a particular arrangement of the Devanagari alphabet is selected. We have chosen the collections of symbols given in the paper of Gupta (2008) and have attempted to show that, for this particular collection, text and corpora of Hindi follow specific properties which can be used to characterize Hindi on the basis of the occurrences of different symbols.

    Besides this, consonants in Hindi language texts sometimes also occur with a dot below them, for example , (qa, za) etc.; these modified forms can be regarded as variants of their base characters. Vowels occur in Hindi in two forms, independent and dependent. In the independent form the vowel occurs as a whole letter, as listed in the Appendix, while in the dependent form a mark (called a matra) corresponding to the vowel is attached after other vowels, consonants or matras.

    Gupta (2008) used the term Indic script for Devanagari; according to the author, in Indic script the symbols are systematically organized on the basis of articulatory phonetics, and vowels and consonants are arranged by place and manner of articulation. To characterize Hindi on the basis of the presence of different letters in one text and nine corpora, we have analyzed the pattern of occurrence of the letters (aksharas) of the varnamala by considering different classes. We have also compared the entropic measurements of the different groups in the considered text and corpora.

    The classification of the Devanagari letter symbols according to the Phonological Inventory of Indic script has been taken from the paper of Gupta (2008), where the author presents a chart of the arrangement of sounds for an Indic script together with the Devanagari symbols used for those sounds. According to this arrangement, the 13 vowels and 33 basic consonants are arranged in different rows. Two rows are for vowels: the first row contains the first 7 vowels of the Appendix, and the remaining 6 vowels are placed sequentially in the second row. The vowels of these two rows are captioned Primary vowels and Secondary vowels respectively (Gupta, 2008). The first 25 consonants of the Appendix are termed consonants by the author and are placed consecutively, five each, in 5 rows (in the same pattern as in the Appendix). They are categorized into five subclasses in two ways: rowwise into Velar, Palatal, Retroflex, Dental and Labial, and columnwise into Voiceless Plosives (Unaspirated and Aspirated), Voiced Plosives (Unaspirated and Aspirated) and Nasal. The remaining 8 consonants are placed in three lines: the first for Semi-Vowels, the second for Sibilants and the last for Glottal, containing 4, 3 and 1 consonants of the Appendix respectively.

    In the present study, characteristic curves for the frequencies of the above-mentioned groups of symbols and groups of consonants have been presented, and a suitable model for the rank frequency relation of the various groups of symbols has been discussed. Section 2 deals with the pattern of occurrence of the members of the different groups with regard to their Zipf's orders and their presence in the text and corpora, while the entropies of the various groups are studied in Section 3. Sections 4 and 5 present the characteristic curves and the mathematical model for the rank frequency relation, respectively.

    1. Methodology and data

    One text has been selected and nine written corpora have been formed for the current study.

    The first object (T1) is the text Tanav se Mukti by Shivanand. The second (T2) is a narrative corpus formed by compiling five stories (namely Cheelein by Bhisham Sahni, Kabuliwala by Rabindranath Tagore, Haar Ki Jeet by Sudershan, and Vairagya and Vardan by Munshi Premchand). For the third, also a narrative corpus (T3), we have compiled 30 short stories from the Katha-Sagar section of Navbharat Times. Similarly, 62 articles from the Sampadakeey section and 37 articles from the Nazaria section of Navbharat Times² have been collected to form the fourth (T4) and fifth (T5) written corpora respectively. These two can be considered written corpora of Hindi newspaper articles.

    ² The articles for corpora T3, T4 and T5, selected from the website of Navbharat Times, are from the period 30 Oct. 08 to 5 Dec. 08. Katha is the Hindi word for story, and Katha-Sagar refers to a collection of short stories. Similarly, editorials have been selected from the Sampadakeey (Hindi for editorial) section and perspective articles by different writers from the Nazaria (viewpoint) section.

    The sixth corpus (T6) is a verse corpus formed from 180 poems from Kaavyaalaya: The House of Hindi Poetry (55 poems from the navkusum section and 125 poems from the yugvani section). The text and corpora T1 to T6 were also studied in our earlier work for the pattern of occurrence of different letters and word initials and for the pattern of different words (Pande and Dhami, 2010, 2013); the texts were selected from the link mentioned in Pande and Dhami (2010).

    Three literary corpora, T7, T8 and T9, have been formed from 45 texts of the novel category, 20 texts of the essay category and 28 texts of the myth category of the written Hindi monolingual corpus section of the corpus ELRA-W0037³, acquired from ELDA (Evaluations and Language resources Distribution Agency)⁴.

    All the text/corpora T1 to T9 have been compiled to form the written corpus for Hindi, T_tot (containing 825,505 independent vowels and basic consonants in all).

    The frequencies of occurrence of the different letters of the alphabet have been calculated by the same rules as in our previous study (Pande and Dhami, 2010), except for the exclusion of conjuncts. A letter has been counted in each of the following cases:

    all occurrences of a letter as a whole letter without any matra; in other words, only the independent forms of vowels have been considered;

    occurrence of a basic consonant followed by a matra (a Hindi word for the symbols that correspond to vowels and are adjoined after other letters, for example , , etc.), or occurrence of a consonant in the form of a half letter anywhere in the word (a consonant occurs as a half letter when it is followed by the halant symbol);

    occurrence of has been regarded as the occurrence of ; similarly the occurrences of , , etc. have been regarded as , etc., and the occurrences of , , etc. have been taken as the occurrence of followed by a matra, and similarly for other consonants;

    incidences of consonants in the form of , etc. have been merged with the occurrences of , and so on.

    As only the occurrences of basic consonants and vowels have been considered, we first removed all occurrences of the conjunct consonants (ksha), (tra) and (gya), so that the basic consonants forming these conjuncts are not counted.
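    The counting rules above can be sketched in Python as follows. This is only an illustrative sketch, not the procedure actually used for the study: it assumes the text is Unicode Devanagari, identifies letters by code point, merges dotted (nukta) consonants with their base letters through NFD normalization, and counts only 11 independent vowel letters (the two remaining vowels of the 13-vowel list are not handled). The names count_letters, VOWELS, CONSONANTS and CONJUNCTS are hypothetical.

```python
import unicodedata
from collections import Counter

# Assumption: 11 independent vowel letters, U+0905..U+0914 without the
# letters that are not part of the traditional Hindi list.
VOWELS = [chr(c) for c in range(0x0905, 0x0915)
          if c not in (0x090C, 0x090D, 0x090E, 0x0911, 0x0912)]
# Assumption: 33 basic consonants, U+0915..U+0939 without NNNA, RRA, LLA, LLLA.
CONSONANTS = [chr(c) for c in range(0x0915, 0x093A)
              if c not in (0x0929, 0x0931, 0x0933, 0x0934)]
# Conjuncts removed before counting: ksha, tra, gya (consonant + halant + consonant).
CONJUNCTS = ("\u0915\u094d\u0937", "\u0924\u094d\u0930", "\u091c\u094d\u091e")


def count_letters(text):
    """Count independent vowels and basic consonants in Devanagari text."""
    # NFD splits precomposed dotted consonants into base letter + nukta,
    # so ignoring the nukta merges them with their base characters.
    text = unicodedata.normalize("NFD", text)
    for conjunct in CONJUNCTS:
        text = text.replace(conjunct, "")
    letters = set(VOWELS) | set(CONSONANTS)
    # Matras, halant and nukta are not in `letters`, so a consonant is counted
    # once whether it is bare, carries a matra, or is a half letter.
    return Counter(ch for ch in text if ch in letters)
```

    For instance, count_letters applied to a word containing the conjunct ksha followed by ma counts only ma, because the conjunct is stripped before the basic consonants are tallied.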

    2. Zipf's order and relative proportion of different groups

    After counting the occurrences of the letters, we have determined the frequencies of the Primary Vowels and Secondary Vowels and the total frequencies of the consonants of the groups Velars, Palatals, Retroflexes, Dentals, Labials, Semivowels, Sibilants and Glottal. For this purpose, we have considered the classification of letters into ten groups, listed as S. No. 1 to S. No. 10 in the table below. These groups of symbols are those given as the Phonological Inventory of an Indic script in the work of Gupta (2008). For example, for the text Tanav se Mukti the frequencies of the different groups of vowels and consonants are as given in Table 1, where the occurrences have been counted after merging the characters as discussed above ( with , with etc.).

    ³ ELRA catalogue (http://catalog.elra.info), The EMILLE/CIIL Corpus, catalogue reference: ELRA-W0037.

    ⁴ In section miscellaneous, tagged as literature-novel, literature-essay and literature-myths respectively.

    Table 1. Frequencies of various groups of alphabets in the text T1.

    S. No.   Group              Members   Total frequency of occurrence of members of group
    1        Primary Vowels     //////    1971
    2        Secondary Vowels   /////     864
    3        Velars             ////      5610
    4        Pre-palatals       ////      1823
    5        Retroflexes        ////      1100
    6        Dentals            ////      8649
    7        Labials            ////      5384
    8        Semivowels         ///       8647
    9        Sibilants          //        3595
    10       Glottal                      3178
             Total                        40821

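    The Zipf's orders, as discussed by Eftekhari (2006), of the above-mentioned ten groups have been determined for the objects T1 to T9 and T_tot. The Zipf's order of a group is its position when the groups are arranged in ascending order of frequency, so that order 1 belongs to the least frequent group. The obtained Zipf's orders are shown in the table below: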

    Table 2. Zipf's orders of different groups of alphabets.

    Text/Corpus   Zipf's order of different groups of vowels and consonants (order 1 to order 10, left to right)
    T1      Secondary vowels; Retroflexes; Pre-palatals; Primary vowels; Glottal; Sibilants; Labials; Velars; Semivowels; Dentals
    T2      Secondary vowels; Retroflexes; Pre-palatals; Primary vowels; Sibilants; Glottal; Velars; Labials; Dentals; Semivowels
    T3      Retroflexes; Secondary vowels; Pre-palatals; Primary vowels; Glottal; Sibilants; Labials; Velars; Dentals; Semivowels
    T4      Secondary vowels; Retroflexes; Pre-palatals; Primary vowels; Glottal; Sibilants; Labials; Velars; Dentals; Semivowels
    T5      Secondary vowels; Retroflexes; Pre-palatals; Primary vowels; Glottal; Sibilants; Labials; Velars; Dentals; Semivowels
    T6      Secondary vowels; Retroflexes; Primary vowels; Pre-palatals; Glottal; Sibilants; Velars; Labials; Dentals; Semivowels
    T7      Retroflexes; Secondary vowels; Pre-palatals; Primary vowels; Glottal; Sibilants; Labials; Velars; Dentals; Semivowels
    T8      Secondary vowels; Retroflexes; Pre-palatals; Primary vowels; Sibilants; Glottal; Velars; Labials; Dentals; Semivowels
    T9      Secondary vowels; Retroflexes; Pre-palatals; Primary vowels; Glottal; Sibilants; Labials; Velars; Dentals; Semivowels
    T_tot   Secondary vowels; Retroflexes; Pre-palatals; Primary vowels; Glottal; Sibilants; Labials; Velars; Dentals; Semivowels

    Table 2 shows that the occurrence of the members of the different groups of the Phonological Inventory of an Indic script follows a clear pattern across the corpora, in the sense that: Semivowels and Dentals always have Zipf's order nine or ten; Velars and Labials always have Zipf's order seven or eight; Sibilants and Glottal always have Zipf's order five or six; Pre-palatals and Primary vowels always have Zipf's order three or four; Secondary vowels and Retroflexes always have Zipf's order one or two.
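    As a small illustration of how a Zipf's order can be obtained, the following Python sketch ranks the ten groups of text T1 in ascending order of frequency, so that order 1 is the least frequent group. The frequencies are those of Table 1; the function name zipf_orders is hypothetical, and ties (which do not occur in this data) are simply left in input order.

```python
# Group frequencies for text T1, copied from Table 1.
T1_FREQUENCIES = {
    "Primary vowels": 1971,
    "Secondary vowels": 864,
    "Velars": 5610,
    "Pre-palatals": 1823,
    "Retroflexes": 1100,
    "Dentals": 8649,
    "Labials": 5384,
    "Semivowels": 8647,
    "Sibilants": 3595,
    "Glottal": 3178,
}


def zipf_orders(frequencies):
    """Map each group to its Zipf's order (1 = least frequent group)."""
    ascending = sorted(frequencies, key=frequencies.get)
    return {group: order for order, group in enumerate(ascending, start=1)}


orders_t1 = zipf_orders(T1_FREQUENCIES)
```

    The resulting orders for T1 can be compared directly with the first row of Table 2.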

    The relative frequencies of occurrence of the members of the different groups in the considered text and corpora are shown in the following column graph:

    Figure 1. Relative proportion of various groups of alphabets

    In this figure, for each text or corpus the columns correspond to the relative (normalized) frequencies of the groups Primary Vowels, Secondary Vowels, Velars, Pre-palatals, Retroflexes, Dentals, Labials, Semivowels, Sibilants and Glottal respectively.

    The figure shows that the groups, taken in pairs, are present in the considered text/corpora in particular non-overlapping ranges of relative frequency. The ranges for the studied objects are shown in the following table (Table 3).

    Table 3. Ranges of relative proportion of various groups

    Group of symbols              Range of presence proportion in different corpora
    Semivowel/dental              0.172185-0.228829
    Velar/labial                  0.127952-0.156119
    Sibilant/glottal              0.064997-0.088226
    Pre-palatal/primary vowel     0.044658-0.060537
    Secondary vowel/retroflex     0.014877-0.036172
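    The relative proportions behind Table 3 can be illustrated for the text T1 with a short Python sketch. The frequencies are taken from Table 1 and the ranges from Table 3; the function names and the small tolerance (added because the published bounds are rounded to six decimals) are assumptions made for this illustration.

```python
# Group frequencies for text T1 (Table 1).
T1_FREQUENCIES = {
    "Primary vowels": 1971, "Secondary vowels": 864, "Velars": 5610,
    "Pre-palatals": 1823, "Retroflexes": 1100, "Dentals": 8649,
    "Labials": 5384, "Semivowels": 8647, "Sibilants": 3595, "Glottal": 3178,
}

# Ranges of relative proportion for each pair of groups (Table 3).
PAIR_RANGES = {
    ("Semivowels", "Dentals"): (0.172185, 0.228829),
    ("Velars", "Labials"): (0.127952, 0.156119),
    ("Sibilants", "Glottal"): (0.064997, 0.088226),
    ("Pre-palatals", "Primary vowels"): (0.044658, 0.060537),
    ("Secondary vowels", "Retroflexes"): (0.014877, 0.036172),
}


def relative_proportions(frequencies):
    """Divide every group frequency by the total frequency."""
    total = sum(frequencies.values())
    return {group: count / total for group, count in frequencies.items()}


def groups_outside_ranges(proportions, pair_ranges, tolerance=5e-7):
    """Return the groups whose proportion falls outside the range of its pair."""
    outside = []
    for pair, (low, high) in pair_ranges.items():
        for group in pair:
            if not (low - tolerance <= proportions[group] <= high + tolerance):
                outside.append(group)
    return outside


proportions_t1 = relative_proportions(T1_FREQUENCIES)
outside_t1 = groups_outside_ranges(proportions_t1, PAIR_RANGES)
```

    An empty outside_t1 list means that every group of T1 lies inside the range reported for its pair.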

    The pattern of occurrence of the consonants has also been analyzed by considering five groups (the first 25 consonants of the Appendix, taken columnwise), as given in the paper of Gupta (2008). The table below presents the frequencies of these five groups in the text Tanav se Mukti (Text T1):

    Table 4. Frequencies of different groups of consonants in Text T1

    S. No.   Group of consonants             Members   Total frequency of occurrence of members of group
    1        Unaspirated Voiceless Plosives  ////      10401
    2        Aspirated Voiceless Plosives    ////      1290
    3        Unaspirated Voiced Plosives     ////      3773
    4        Aspirated Voiced Plosives       ////      1626
    5        Nasals                          ////      5476
             Total                                     22566

    The Zipf's orders obtained for these groups in the various objects are given in the table below:

    Table 5. Zipf's orders of groups of consonants

    Text/Corpus   Zipf's order (order 1 to order 5, left to right)
    T1      Asp. voiceless plo.; Asp. voiced plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T2      Asp. voiced plo.; Asp. voiceless plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T3      Asp. voiced plo.; Asp. voiceless plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T4      Asp. voiced plo.; Asp. voiceless plo.; Nasal; Unasp. voiced plo.; Unasp. voiceless plo.
    T5      Asp. voiced plo.; Asp. voiceless plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T6      Asp. voiceless plo.; Asp. voiced plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T7      Asp. voiceless plo.; Asp. voiced plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T8      Asp. voiced plo.; Asp. voiceless plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T9      Asp. voiceless plo.; Asp. voiced plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.
    T_tot   Asp. voiced plo.; Asp. voiceless plo.; Unasp. voiced plo.; Nasal; Unasp. voiceless plo.

    From the table of Zipf's orders it can be concluded that there is no uncertainty about the Zipf's order of the group Unaspirated Voiceless Plosives: this group always has Zipf's order five, i.e. it is the most frequent group of consonants. The groups Aspirated Voiceless Plosives and Aspirated Voiced Plosives have Zipf's order one or two, and the groups Nasals and Unaspirated Voiced Plosives have Zipf's order three or four.

    The relative proportions of the five groups of consonants in the ten text/corpora are shown in the figure below:

    Figure 2. Relative proportion of groups of consonants

    The columns in the above figure show the relative (normalized) frequencies of the groups Unaspirated Voiceless Plosives, Aspirated Voiceless Plosives, Unaspirated Voiced Plosives, Aspirated Voiced Plosives and Nasals respectively. The figure shows that the relative frequencies of (a) Unaspirated Voiceless Plosives, (b) Nasals and Unaspirated Voiced Plosives and (c) Aspirated Voiceless Plosives and Aspirated Voiced Plosives lie in the non-overlapping regions 0.383716-0.474439, 0.165966-0.25718 and 0.048308-0.088869 respectively. These non-overlapping ranges of relative frequencies can help in characterizing Hindi on the basis of the presence of the different consonants alone.

    3. Entropic measurement of different groups

    We have made entropic measurements of the various groups of Devanagari symbols and of the different groups of consonants. The entropy of a group has been calculated by the formula

    $$\text{entropy } H(x) = -\sum_{x \in X} p(x)\,\log_2\big(p(x)\big) \qquad (1)$$

    for the random variable x, where p(x) is the probability of occurrence of the member x of the group. For example, for the group Primary Vowels, p(x) is the probability of occurrence of a member x of the set X of the seven primary vowels.

    If x can take $n_0$ values in total for a group, then the maximum value of the entropy for the group is $\log_2(n_0)$. The values of $n_0$ for the groups are: Primary Vowels 7; Secondary Vowels 6; Velars, Pre-palatals, Retroflexes, Dentals and Labials 5 each; Semivowels 4; Sibilants 3; Glottal 1. For each of the five groups of consonants (Unaspirated Voiceless Plosives, Unaspirated Voiced Plosives, Aspirated Voiceless Plosives, Aspirated Voiced Plosives and Nasals) the value of $n_0$ is 5. As the random variable can take only one value for the group Glottal, the entropy of this group is zero for every corpus. To examine the regions in which the entropies of the different groups lie, we have plotted the values of $\mathrm{entropy}/\log_2(n_0)$ (the entropy of a group relative to the maximum entropy of that group) for these groups. The plots for the different groups of Devanagari symbols and groups of consonants are shown in the following two figures:

    Figure 3. Values of entropy of different groups of symbols relative to the maximum entropy of the group

    Figure 4. Values of entropy of different groups of consonants relative to the maximum entropy of the group

    In Figure 3 the columns represent the values for the groups Primary Vowels, Secondary Vowels, Velars, Pre-palatals, Retroflexes, Dentals, Labials, Semivowels and Sibilants, and in Figure 4 those for Unaspirated Voiceless Plosives, Aspirated Voiceless Plosives, Unaspirated Voiced Plosives, Aspirated Voiced Plosives and Nasals respectively. The figures show that the values of $\mathrm{entropy}/\log_2(n_0)$ for the groups Semivowels and Dentals, and similarly for the group Unaspirated Voiced Plosives, are nearly stable and lie in the regions 0.889862-0.930919, 0.820588-0.859543 and 0.930647-0.980046 respectively, while for the other groups of symbols and consonants there are relatively large variations in the values. For the group Nasals the relative entropy is much lower than for the other groups of consonants, and its value for the studied objects lies within the range 0.480724-0.545141.
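    The entropy of a group relative to its maximum entropy can be sketched in Python as below. The sketch assumes base-2 logarithms throughout and takes the group size to be the number of counts passed in; the function names are hypothetical, and the per-letter counts of the study are not reproduced here, so any input data would have to be supplied by the reader.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of the members of one group, from raw counts."""
    total = sum(counts)
    # Members with zero count contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def relative_entropy(counts):
    """Entropy divided by its maximum log2(n0), n0 being the group size."""
    n0 = len(counts)
    if n0 < 2:
        # A single-member group (such as Glottal) has zero entropy.
        return 0.0
    return entropy(counts) / math.log2(n0)
```

    A group whose members occur equally often gives a relative entropy of 1, while a group dominated by one member gives a value close to 0.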

    4. Characteristic Curves

    The frequencies of presence (relative proportions) of the different groups, arranged in the alphabetical (rowwise) order of the varnamala for vowels and basic consonants and in the (columnwise) order for the groups of consonants, have been drawn in the form of charts. The characteristic curves (showing the pattern of variation in the frequencies of the groups) for the different groups of symbols and the different groups of consonants are shown in the figures below:

    Figure 5. Characteristic curve for Hindi regarding the presence of various groups of symbols

    Figure 6. Characteristic curve for Hindi regarding the presence of various groups of consonants

    The two figures show a similar pattern of variation for all the considered objects. Thus, for the various groups of Devanagari symbols as well as for the different groups of consonants, the frequencies of presence of these groups in various corpora of Hindi follow a particular kind of curve. These characteristic curves can be used to identify a specific property of Hindi language corpora.

    5. Model for the frequencies of different groups in corpora

    To determine the model for the different groups of symbols, we have arranged the frequencies of the groups in descending order and obtained the rank frequency profile by assigning rank 1 to the most frequent group, rank 2 to the second most frequent one, and so on, as proposed in Zipf's law (Zipf, 1949). For example, for the text T1 (data given in Table 1 of Section 2), the rank frequency profile is presented in the following table:

    Table 6. Rank frequency profile of the text T1 for different groups of symbols

    Rank   Frequency     Rank   Frequency
    1      8649          6      3178
    2      8647          7      1971
    3      5610          8      1823
    4      5384          9      1100
    5      3595          10     864

    After determining the rank frequency profile, we have tried to determine a suitable two- or three-parameter model for the data. For this purpose we have applied linear regression to eight two- or three-parameter functional relations, as mentioned in Li and Miramontes (2011), written as linear relations between $f$ or $\log(f)$ and $\varphi_1(r)$, $\varphi_2(r)$ (functions of the rank r). The eight functions (applied by Li and Miramontes (2011) to the frequencies of letters of English and Spanish) are: power law, exponential, logarithmic, Weibull, quadratic logarithmic, Yule, inverse gamma and Beta. For the power law, exponential, Yule, inverse gamma and Beta functions the regression has been applied in logarithmic scale. The values of the parameters have been determined with the help of the relations given in the article Regression with Two Independent Variables (available online). The equations used for the regression, in linear form, for the various relations are:

    Power law: $\log(f) = a + b\,\log(r)$ (2)

    Exponential equation: $\log(f) = a + b\,r$ (3)

    Logarithmic equation: $f = a + b\,\log(r)$ (4)

    Weibull: $\log(f) = a + b\,\log\!\left(\log\dfrac{n+1}{r}\right)$ (5)

    Quadratic logarithmic: $f = a + b\,\log(r) + c\,(\log(r))^{2}$ (6)

    Yule: $\log(f) = a + b\,r + c\,\log(r)$ (7)

    Inverse gamma: $\log(f) = a + \dfrac{b}{r} + c\,\log(r)$ (8)

    Beta: $\log(f) = a + b\,\log(n + 1 - r) + c\,\log(r)$ (9),

    where f represents the frequency and r the rank of the different groups of symbols (as in Table 6 for Text T1), and n is the maximum rank that a group can have; in the present case its value is 10.

    Besides these relations we have also tried the following equation for regression,

    $f = a + b\,r$ (10),

    the linear relation between rank and frequency, and the relation

    $\log(f) = a + b\,\log(n + 1 - r)$ (11)

    Equation (11) corresponds to the relation $f = a\,i^{b}$ mentioned by Eftekhari (2006) between the Zipf's order i and the frequency f, since the Zipf's orders have been obtained by that author by arranging the terms in ascending order of their frequencies, so that $i = n + 1 - r$.
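    As an illustration of regression with two independent variables, the sketch below fits Yule's relation in logarithmic scale, $\log(f) = a + b\,r + c\,\log(r)$, by ordinary least squares, using the rank r and $\log(r)$ as the two regressors. It is a minimal sketch under stated assumptions rather than the computation used for the paper: natural logarithms are used, the normal equations are solved by plain Gaussian elimination, and the fitted values are reported as A, B and C of the form $f = A\,B^{r}/r^{C}$ (so $A = e^{a}$, $B = e^{b}$, $C = -c$). The function names are hypothetical.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_yule(frequencies):
    """Least-squares fit of log(f) = a + b*r + c*log(r); returns (A, B, C)."""
    f = sorted(frequencies, reverse=True)          # rank 1 = most frequent
    rows = [[1.0, float(r), math.log(r)] for r in range(1, len(f) + 1)]
    y = [math.log(value) for value in f]
    # Normal equations (X^T X) beta = X^T y for the three coefficients.
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(3)]
    a, b, c = solve_linear_system(xtx, xty)
    return math.exp(a), math.exp(b), -c
```

    Calling fit_yule on the ten frequencies of a rank frequency profile returns the three parameters, which can then be compared with published values of A, B and C.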

    To select the appropriate model from the considered two- and three-parameter models, we first determined the theoretical frequencies ($\hat f$) with the help of equations (2) to (11) and then determined the value of the Akaike information criterion (AIC) for each relation, followed by the AIC score (the difference between the value of AIC for each model and the smallest value of AIC). The value of AIC (Li et al., 2010) for a particular model has been calculated from the expected frequencies $\hat f$ by the relation

    $\mathrm{AIC} = n\,\log\!\left(\dfrac{SSE}{n}\right) + 2\,(\text{number of parameters in the model})$ (12),

    where SSE is the sum of squared errors. The three lowest values of the AIC score, and the corresponding equations among (2) to (11), for the ten considered text/corpora are presented in the following table:

    Table 7. Three lowest values of AIC for expected and observed frequencies and corresponding equation for linear regression

    Text/Corpus   Equation and AIC   Equation and AIC   Equation and AIC
    T1            (7) 0              (6) 1.963894       (8) 3.180949
    T2            (7) 0              (6) 1.10865        (10) 3.003501
    T3            (7) 0              (6) 2.242977       (5) 5.389931
    T4            (7) 0              (6) 1.248638       (3) 3.178002
    T5            (7) 0              (6) 1.817901       (3) 4.095938
    T6            (7) 0              (6) 0.579677       (9) 2.812777
    T7            (7) 0              (6) 2.551609       (3) 4.398999
    T8            (7) 0              (6) 1.218657       (5) 3.210592
    T9            (7) 0              (6) 2.422365       (3) 2.92761
    T_tot         (7) 0              (6) 1.508989       (5) 3.153547

    Thus, for all the considered objects the value of AIC is minimum when the frequencies are determined with the help of equation (7), i.e. when the relation between rank and frequency is taken in the form

    $f = \dfrac{A\,B^{r}}{r^{C}}$ (13)

    Therefore this model can be taken as the best of all the considered models for the relation between the ranks and the corresponding frequencies of occurrence of the different groups of symbols.
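    The model comparison by AIC can be sketched in Python as follows. The sketch assumes the form $\mathrm{AIC} = n\,\log(SSE/n) + 2k$, with n the number of ranks, k the number of parameters of the model and a natural logarithm; the base of the logarithm and the function names aic and aic_scores are assumptions made for this illustration. A perfect fit (SSE equal to zero) is not handled.

```python
import math


def aic(observed, expected, n_params):
    """AIC = n * log(SSE / n) + 2 * (number of parameters)."""
    n = len(observed)
    sse = sum((o - e) ** 2 for o, e in zip(observed, expected))
    return n * math.log(sse / n) + 2 * n_params


def aic_scores(aic_by_model):
    """Difference between each model's AIC and the smallest AIC."""
    best = min(aic_by_model.values())
    return {model: value - best for model, value in aic_by_model.items()}
```

    The model whose score is 0 is the one preferred by the criterion; the scores of the remaining models show how far they fall behind it.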

    The graph of the observed frequencies and the theoretical frequencies given by Yule's relation (equation (13)) for the text T1 is shown below:

    Figure 7. Observed and theoretical frequencies (obtained with the help of equation (7)) of different groups arranged according to their ranks for the text T1.

    To check whether this equation satisfies the condition of goodness of fit, the values of the discrepancy coefficient have been determined by the relation

    $$\text{discrepancy coefficient} = \frac{\chi^2}{N},$$

    where $N = \sum_{r=1}^{n} f_r$ is the total frequency of all the groups. The calculated values of the discrepancy coefficient and the values of the parameters A, B and C (in equation (13)) for the considered text and the different corpora are given in the following table:

    Table 8

    Text/Corpus   χ²/N       N        A          B          C
    T1            0.007335   40821    12410.82   0.707508   -0.346341
    T2            0.00877    18108    4834.596   0.72907    -0.36219
    T3            0.005525   22202    6229.405   0.712974   -0.395811
    T4            0.005006   70308    19551.36   0.757188   -0.195518
    T5            0.004643   82236    23618.52   0.750016   -0.198605
    T6            0.009067   90473    27825.19   0.674372   -0.503968
    T7            0.005782   105727   32099.43   0.728962   -0.24489
    T8            0.009366   224083   61989.09   0.719185   -0.37781
    T9            0.005989   171547   53391.38   0.738399   -0.177087
    T_tot         0.007488   825505   240698.3   0.728941   -0.283204

    Thus the value of the discrepancy coefficient corresponding to Yule's equation satisfies the condition of very good fit, $\chi^2/N < 0.01$, for each text or corpus (the average value of the discrepancy coefficient for the considered text and corpora is 0.0069). It can therefore be concluded that in different corpora of the Hindi language the frequencies of occurrence of the different groups of symbols follow Yule's relation.

    We have also applied the different distributions available in the Altmann Fitter to the rank-frequency data of the groups for the considered text and corpora. The ordering of the distributions by discrepancy coefficient varies between texts/corpora. For example, the Consul-Mittal-binomial distribution with 3 parameters and the doubly truncated negative binomial distribution are among the three distributions with the minimum discrepancy coefficient for 8 and 7 texts/corpora (out of the 10 considered), respectively. These models satisfy the condition of very good fit in 8 and 9 cases respectively (giving a good fit in the others), and their average discrepancy coefficients over the considered text and corpora are 0.0073 and 0.0076, respectively. Since both averages exceed the 0.0069 obtained for Yule's relation, we can assume that our model is a proper choice for the rank-frequency distribution of the different groups.
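
    As an illustration of this goodness-of-fit check, the sketch below evaluates equation (13) with the T1 parameters from Table 8 and computes a discrepancy coefficient. Several points are assumptions: χ² is taken as the usual Pearson statistic (the sum of (observed - expected)² / expected), the number of ranks is set to 10, and the "observed" counts are made up by perturbing the theoretical values by 3%, since the observed group frequencies are not reproduced in this section.

```python
def yule_frequency(r, a, b, c):
    """Theoretical frequency at rank r under equation (13): f_r = a * b**r / r**c."""
    return a * b ** r / r ** c

def discrepancy_coefficient(observed, expected):
    """Pearson chi-square divided by the total observed frequency N."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2 / sum(observed)

# Parameters reported for text T1 in Table 8.
A, B, C = 12410.82, 0.707508, -0.346341

# Assumption: 10 ranked groups.
expected = [yule_frequency(r, A, B, C) for r in range(1, 11)]

# Made-up "observed" counts: theoretical values shifted by +3% / -3% alternately.
observed = [e * (1.03 if r % 2 else 0.97) for r, e in enumerate(expected, start=1)]

coefficient = discrepancy_coefficient(observed, expected)
very_good_fit = coefficient < 0.01
print(f"sum of theoretical frequencies = {sum(expected):.1f}")
print(f"discrepancy coefficient = {coefficient:.6f}, very good fit: {very_good_fit}")
```

    Under the 10-rank assumption the theoretical frequencies for T1 add up to within about one percent of the N reported in Table 8, which is a quick consistency check on the reading of equation (13).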

    6. Conclusion

    An analysis has been made of the various groups of Devanagari symbols, according to the Phonological Inventory of Indic script, and of the groups of consonants mentioned in Gupta (2008), with the help of Zipf's orders, proportions and entropic measurements. It has been concluded that:

    Groups of symbols are present in the different texts/corpora within particular non-overlapping ranges (defined for pairs of groups) of relative frequency.

    10 groups have specific values of the Zipf's order.

    Similarly, the group Unaspirated Voiceless Plosives has a higher proportion in the different corpora than the other groups of consonants, and the frequencies of the remaining four groups of consonants (in pairs) take values in specific ranges.

    Specific Zipf's orders have also been defined for the five groups of consonants.

    As far as the values of entropy / log n for the groups are concerned (the entropy of a group relative to its maximum entropy), the values for the groups semivowels and dentals among all symbols, and similarly for the group Unaspirated Voiced Plosives, are nearly stable and lie in specific ranges, and the group Nasals of consonants has lower entropy than the other groups.

    The trend of the relative frequencies of the various groups of symbols and of consonants (when arranged row-wise) in a text or corpus takes the form of a characteristic curve similar to the curves given in Figure 5 and Figure 6, respectively.

    The frequencies of occurrence of the different groups (with respect to their ranks) in Hindi language corpora follow Yule's equation $f_r = \frac{a b^r}{r^c}$.
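
    The ratio entropy / log n used in the conclusions above can be computed as follows. This is a minimal sketch assuming the Shannon entropy of the relative frequencies of the n symbols within a group; the logarithm base cancels in the ratio. The counts in the example are invented.

```python
import math

def normalized_entropy(frequencies):
    """Shannon entropy of the relative frequencies divided by its maximum, log n."""
    n = len(frequencies)
    if n < 2:
        raise ValueError("at least two symbols are needed, since log(1) = 0")
    total = sum(frequencies)
    probabilities = [f / total for f in frequencies if f > 0]
    entropy = -sum(p * math.log(p) for p in probabilities)
    return entropy / math.log(n)

# Invented counts for a group of five symbols.
uniform_group = [200, 200, 200, 200, 200]
skewed_group = [700, 150, 100, 40, 10]

print(normalized_entropy(uniform_group))
print(normalized_entropy(skewed_group))
```

    A value close to 1 means the symbols of the group are used almost equally often; a lower value means the group is dominated by a few symbols.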

    Acknowledgement

    Authors are grateful to the University Grants Commission (UGC), New Delhi, INDIA for providing financial assistance in the form of Post doctoral fellowship [F.4-2/2006(BSR)/13-770/2012(BSR)] to the first author. The research has been sponsored by the UGC under the UGC Dr. D. S. Kothari Post Doctoral fellowship scheme.

    References

    Bell, Timothy C. and Witten, Ian H. (1988). Source models for natural language. http://hdl.handle.net/1880/46172.

    Bharati, A., Rao K, P., Sangal R. and Bendre, S. M. (2002). Basic statistical analysis of corpus and cross comparison among corpora. In Proceedings of 2002 International Conference on Natural Language Processing, Mumbai, India. Available at: http://ltrc.iiit.ac.in/MachineTrans/publications/technicalReports/tr022/camera-187.pdf

    Bolshakov, I. A. and Gelbukh, A. (2004). Computational Linguistics: Models, Resources, Applications. Ciencia de la Computación, Mexico. Available at: http://www.gelbukh.com/clbook/Computational-Linguistics.htm

    Dhore, M. L., Dixit, S. K. and Dhore, R. M. (2012). Optimizing Transliteration for Hindi/Marathi to English Using only Two Weights. Proceedings of the First International Workshop on Optimization Techniques for Human Language Technology, COLING 2012, Mumbai, pages 31-48.

    Eftekhari, A. (2006). Fractal Geometry of Texts: An initial Application to the works of Shakespeare. Journal of Quantitative Linguistics, 13(2-3), 177-193.

    Good, I. J. (1969). Statistics of language. Encyclopaedia of information, linguistics and control. Oxford: Pergamon, 567-581.

    Goyal, V. and Lehal, G. S. (2008). Comparative study of Hindi and Punjabi language scripts. Nepalese Linguistics, Vol. 23, 2008, pp. 67-82.

    Goyal, L. (2011). Comparative Analysis of Printed Hindi and Punjabi Text Based on Statistical Parameters. Information Systems for Indian Languages, Communications in Computer and Information Science, 2011, Volume 139, Part 2, 209-213.

    Grzybek, P. and Kelih, E. (2005). Towards a General Model of Grapheme Frequencies in Slavic Languages. In: Garabík, R. (ed.), Computer Treatment of Slavic and East European Languages. Bratislava: Veda, 73-87.

    Gupta, R. (2008). Initial literacy in Devanagari: What Matters to Learners. South Asia Language Pedagogy and Technology, Vol 1

    Jayaram, B. D. and Vidya, M. N. (2006). Word length distribution in Indian Languages, Glottometrics 12, 16-38.

    Jayaram, B. D. and Vidya, M. N. (2008). Zipf's law for Indian languages. Journal of Quantitative Linguistics, 15(4), 293-317.

    Joshi, A., Ganu, A., Chand, A., Parmar, V. and Mathur, G. (2004). Keylekh: A Keyboard for Text Entry in Indic Scripts. CHI '04 Extended Abstracts on Human Factors in Computing Systems. Pages: 928-942.

    Kumar, V. (2006). A statistical approach towards the recognition of Hindi language words. inria-00114544, version 1, pp. 15. Available at http://hal.archives-ouvertes.fr/docs/00/11/45/44/PDF/Hindi_Language_Recognition.pdf. Retrieved on 15/03/2014.

    Li, W., Miramontes, P. and Cocho, G. (2010). Fitting Ranked Linguistic Data with Two-Parameter Functions. Entropy, 12, 1743-1764.

    Li, W. and Miramontes, P. (2011). Fitting Ranked English and Spanish Letter Frequency Distribution in US and Mexican Presidential Speeches. Journal of Quantitative Linguistics, 18(4), 359-380.

    Murthy, K. N. and Kumar, G. B. (2006). Language Identification from Small Text Samples. Journal of Quantitative Linguistics, 13(1), pp. 57-80.

    Pande, Hemlata and Dhami, H. S. (2010). Mathematical Modelling of Occurrence of Letters and Words Initials in Texts of Hindi Language. SKASE Journal of Theoretical Linguistics, Vol. 7, No. 2, 19-38.

    Pande, Hemlata and Dhami, H. S. (2013). Mathematical modeling of the pattern of occurrence of words in different corpora of Hindi language. Journal of Quantitative Linguistics 20(1), 1-12.

    Sanderson, R. (2007). COMP527: Data Mining. Retrieved December 25, 2007, from www.csc.liv.ac.uk/*azaroth/courses/current/comp527/lectures/comp52728.pdf

    Singh, A. K. (2006). A computational phonetic model for Indian language scripts. In Constraints on Spelling Changes: Fifth International Workshop on Writing Systems, Nijmegen, The Netherlands, 2006. Retrieved on 15/03/2014.

    Solso, R. L. and King, J. F. (1976). Frequency and versatility of letters in the English language. Behavior Research Methods and Instrumentation, 8, 283-286.

    Zipf, G. K. (1949). Human Behaviour and the Principle of Least Effort. Addison-Wesley Press.

    Appendix

    Alphabetic order of letters in the Hindi alphabet (varṇamālā) and their corresponding transliteration (iv)

    Vowels

    अ (a) आ (ā) इ (i) ई (ī) उ (u) ऊ (ū) ऋ (ṛ) ए (e) ऐ (ai) ओ (o) औ (au) अं (aṃ) अः (aḥ)

    Consonants

    क (ka) ख (kha) ग (ga) घ (gha) ङ (ṅa)
    च (ca) छ (cha) ज (ja) झ (jha) ञ (ña)
    ट (ṭa) ठ (ṭha) ड (ḍa) ढ (ḍha) ण (ṇa)
    त (ta) थ (tha) द (da) ध (dha) न (na)
    प (pa) फ (pha) ब (ba) भ (bha) म (ma)
    य (ya) र (ra) ल (la) व (va)
    श (śa) ष (ṣa) स (sa) ह (ha)
    क्ष (kṣa) त्र (tra) ज्ञ (gya)

    Web references (Retrieved on 15/03/2014):
    i From
    ii http://www.isamaj.com/kidzcorner/hindi/Varnmala.html
       http://www.indif.com/kids/learn_hindi/hindi_alphabets.aspx
       http://iteachhindi.blogspot.in/2010/05/hindi-varnamala-hindi-alphabets.html
    iii http://www.learning-hindi.com/post/1573331422/lesson-72-ipa-and-hindi
    iv www.bodhgayanews.net/hindi/HIN11_Script_Intro.pdf
    v