
    A Type-Token Identity in the Simon-Yule Model of Text

Ye-Sho Chen, Department of Quantitative Business Analysis, Louisiana State University, Baton Rouge, LA 70803
Ferdinand F. Leimkuhler, School of Industrial Engineering, Purdue University, West Lafayette, IN 47907

There are three significant results in this paper. First, we establish a type-token identity relating the type-token ratio and the bilogarithmic type-token ratio. The plays of Shakespeare and other interesting texts serve as demonstrative examples. Second, the Simon-Yule model of Zipf's law is used to derive the type-token identity and provide a promising statistical model of text generation. Third, a realistic refinement of the Simon-Yule model is made to allow for a decreasing entry rate of new words. Simulation methods are used to show that the type-token identity is preserved with this change in assumptions.

1. Introduction

Linguistics is a relatively old discipline that has taken on new life because of recent developments in natural-language understanding and artificial intelligence. Edmundson [4, 5] classified the field of linguistics into three subfields: linguistic metatheory, which is the study of the universal properties of natural languages; computational linguistics, in which the use of a computer is indispensable; and mathematical linguistics, in which the use of mathematical methods and theories is paramount. Within mathematical linguistics, Edmundson followed Y. Bar-Hillel's dichotomy of algebraic linguistics and statistical linguistics, which cover deterministic and stochastic models of language, respectively. Edmundson argued that deterministic models and stochastic models deserve equal attention, noting that models of linguistic competence have typically been deterministic, while models of linguistic performance have been stochastic. Researchers such as Miller and Chomsky [14] have also stressed this difference and have indicated their preferences and reasons.

Received July 22, 1986; revised December 10, 1986; accepted December 19, 1986.
© 1989 by John Wiley & Sons, Inc.

Deterministic models of text include grammatical models and semantic models. Stochastic models include models of text generation, models of sentence structure, models of vocabulary size, models of rank-frequency relations, and models of type-token relations. Type-token models are concerned with relationships between the number of different words (types) and the total number of words (tokens) in a literary text. Usually, each word form is regarded as a distinct type, but sometimes inflected word forms are counted with the root word; see Yule [25] and Thorndike [23]. Recently, Edmundson [5] classified models of type-token relations as: the type-token ratio [8], the bilogarithmic type-token ratio [2, 8], and indices of vocabulary richness [6, 7] and vocabulary concentration [25].

In this paper, we identify an interesting identity relating the type-token ratio and the bilogarithmic type-token ratio. We give a theoretical justification of the identity based on the stochastic model of text proposed by Simon in 1955 [16]. The model is called the Simon-Yule model because Simon derived the same equations that Yule [24] had obtained in a study of biological problems. The model is useful for its ability to describe the rank-frequency relations observed by Zipf [26], the so-called Zipf's law. Subsequently, a basic assumption underlying the Simon-Yule model is relaxed to increase realism. The modified model is tested and also shown to support the type-token identity. Further refinements relating to computational models of text generation are discussed.

2. The Type-Token Identity

Define t as the total number of words, or tokens, used in a text, and V_t as the number of different words, or types, found in the same text. Compared with the bilogarithmic type-token ratio ln V_t/ln t, the type-token ratio V_t/t is not as stable. However, when the two ratios are summed, the following identity is approximately true:

\[
\frac{V_t}{t} + \frac{\ln V_t}{\ln t} = 1. \tag{1}
\]
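As a quick illustration of Eq. (1), the sketch below (ours, not part of the original study) computes the two ratios and their sum for a plain-text file; the regex tokenizer and the input file name are illustrative assumptions, not the procedure behind Tables 1-3.

```python
# Minimal sketch: compute V_t/t, ln V_t/ln t, and their sum (Eq. 1).
import math
import re

def type_token_identity(text: str):
    tokens = re.findall(r"[a-z']+", text.lower())  # crude word tokenizer (assumption)
    t = len(tokens)                       # tokens: running words
    v = len(set(tokens))                  # types: distinct word forms
    ttr = v / t                           # type-token ratio V_t/t
    bilog = math.log(v) / math.log(t)     # bilogarithmic ratio ln V_t/ln t
    return t, v, ttr, bilog, ttr + bilog

if __name__ == "__main__":
    with open("hamlet.txt", encoding="utf-8") as fh:   # hypothetical input file
        print(type_token_identity(fh.read()))          # sum is close to 1 for long texts
```

For texts of Shakespearean length (t of roughly 15,000-30,000 tokens), the sum typically lands near 1.0, as the last columns of Tables 1 and 2 illustrate.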


The last column of Table 1 and Table 2 provides several examples. Table 1 shows the fourteen comedies, the fourteen tragedies, and the ten historical plays of Shakespeare [19]. Table 2 lists examples from texts written in Czech by different authors [21]; the texts come from diverse fields, including fiction, popular, and scientific literature. Elsewhere, Herdan [8] showed that the bilogarithmic type-token ratio provided linear fits for sixty-one works of Chaucer and two works of Pushkin. In Tables 1 and 2, a bilogarithmic ratio of about 0.8 shows up in almost all texts.

In 1957, the Dutch mathematician Devooght [3] derived the constancy of the bilogarithmic type-token ratio from the generalized Zipf's law [12] under the assumption that the vocabulary size V_t approaches infinity as does the text length t. Herdan [8] (p. 26) pointed out:

It appears, however, that to use the Zipf-Mandelbrot law as the starting point for the derivation of the bilogarithmic type/token relation makes matters unnecessarily complicated, and even unsatisfactory from the mathematical point of view. Since, in order to arrive at the relation, Devooght has to make certain assumptions which are not in accordance with reality, e.g. to let the vocabulary approach infinity.

As an alternative, Herdan assumed that the growth rate of the vocabulary is proportional simultaneously to (a) increasing text length, (b) the size of the vocabulary already accumulated, and (c) the particular conditions of writing, such as style, content, etc.

TABLE 1. The 14 comedies, the 14 tragedies, and the 10 historical plays of Shakespeare [19].

Play                              t        V_t    V_t/t   ln V_t/ln t   V_t/t + ln V_t/ln t

COMEDY
1.  Comedy of Errors              14,369   2522   0.175   0.818   0.994
2.  The Tempest                   16,036   3149   0.196   0.832   1.028
3.  A Midsummer Night's Dream     16,087   2984   0.185   0.826   1.012
4.  Two Gentlemen of Verona       16,883   2718   0.161   0.812   0.973
5.  Twelfth Night                 19,401   3096   0.160   0.814   0.974
6.  The Taming of the Shrew       20,411   3240   0.159   0.815   0.973
7.  Much Ado About Nothing        20,768   2954   0.142   0.804   0.946
8.  Merchant of Venice            20,921   3265   0.156   0.813   0.969
9.  Love's Labour's Lost          21,033   3772   0.179   0.827   1.007
10. Merry Wives of Windsor        21,119   3267   0.155   0.813   0.967
11. Measure for Measure           21,269   3325   0.156   0.814   0.970
12. As You Like It                21,305   3248   0.152   0.811   0.964
13. All's Well That Ends Well     22,550   3513   0.156   0.815   0.970
14. A Winter's Tale               24,543   3913   0.159   0.818   0.978

TRAGEDY
1.  Macbeth                       16,436   3306   0.201   0.835   1.036
2.  Pericles                      17,723   3270   0.185   0.827   1.012
3.  Timon of Athens               17,748   3269   0.184   0.827   1.011
4.  Julius Caesar                 19,110   2867   0.150   0.808   0.958
5.  Titus Andronicus              19,790   3397   0.172   0.822   0.994
6.  Two Noble Kinsmen             23,403   3895   0.166   0.822   0.988
7.  Antony and Cleopatra          23,742   3906   0.165   0.821   0.986
8.  Romeo and Juliet              23,913   3707   0.155   0.815   0.970
9.  King Lear                     25,221   4166   0.165   0.822   0.987
10. Troilus and Cressida          25,516   4251   0.167   0.823   0.990
11. Othello                       25,887   3783   0.146   0.811   0.957
12. Coriolanus                    26,579   4015   0.151   0.814   0.965
13. Cymbeline                     26,778   4260   0.159   0.820   0.979
14. Hamlet                        29,551   4700   0.159   0.821   0.980

HISTORICAL PLAY
1.  King John                     20,386   3576   0.175   0.825   1.000
2.  Henry VI Part 1               20,515   3812   0.186   0.830   1.015
3.  Richard II                    21,809   3671   0.168   0.822   0.990
4.  Henry VI Part 3               23,295   3581   0.154   0.814   0.968
5.  Henry VIII                    23,295   3558   0.153   0.813   0.966
6.  Henry IV Part 1               23,955   3817   0.159   0.818   0.977
7.  Henry VI Part 2               24,450   4058   0.166   0.822   0.988
8.  Henry V                       25,577   4562   0.178   0.830   1.009
9.  Henry IV Part 2               25,706   4122   0.160   0.820   0.980
10. Richard III                   28,309   4092   0.145   0.811   0.956


TABLE 2. Examples of type-token relationships in Czech [21].

Text    t        V_t    V_t/t   ln V_t/ln t   V_t/t + ln V_t/ln t
1       8,374    1337   0.160   0.797   0.957
2       8,729    2116   0.242   0.844   1.086
3       9,290    2360   0.254   0.850   1.104
4       10,103   2831   0.280   0.862   1.142
5       14,714   2790   0.190   0.827   1.017
6       17,249   1831   0.106   0.770   0.876
7       18,085   2372   0.131   0.792   0.923
8       18,448   3088   0.167   0.818   0.985
9       20,340   3870   0.190   0.833   1.023
10      20,603   3502   0.170   0.822   0.992
11      21,640   5006   0.231   0.853   1.085
12      21,963   4145   0.189   0.833   1.022
13      23,231   3970   0.171   0.824   0.995
14      23,802   3381   0.142   0.806   0.948
15      24,353   4858   0.199   0.840   1.040
16      25,658   4308   0.175   0.827   1.002
17      26,908   3577   0.133   0.802   0.935
18      28,084   6111   0.218   0.851   1.069
19      29,803   5559   0.187   0.837   1.024
20      29,813   4516   0.151   0.817   0.968
21      30,145   6498   0.216   0.851   1.067
22      30,281   5469   0.181   0.834   1.015
23      31,195   4188   0.134   0.806   0.940
24      31,250   4366   0.140   0.810   0.950
25      31,655   3388   0.107   0.784   0.891
26      32,972   4750   0.144   0.814   0.958
27      33,700   5916   0.176   0.833   1.009
28      33,774   6265   0.185   0.838   1.024
29      35,187   6927   0.197   0.845   1.042
30      35,273   6939   0.197   0.845   1.041
31      29,360   5539   0.189   0.838   1.027
32      47,542   8673   0.182   0.842   1.024
33      55,164   8675   0.157   0.831   0.988

Treating V_t and t as continuous variables, he established two differential equations from these assumptions and derived the bilogarithmic type-token relation. Mandelbrot [12] also attempted to explain the type-token relationship by Zipf's law. First, he assumed [13] (p. 216) that the uses of successive words are independent, as in the multinomial urn model. Second, he treated V_t and t as continuous variables. Finally, he approximated the sum of certain terms by integration.

The type-token identity in Eq. (1) is quite stable except when t is relatively small or large. Table 3 shows that when t ≤ 1,100 or t ≥ 884,647, the equation does not hold well. So we ask: why does Equation (1) remain stable, and under what conditions is it true? We pursue these questions in the following sections after introducing the Simon-Yule model of text generation.

3. The Simon-Yule Model of Text Generation

According to Simon, the stochastic process by which words are chosen to be included in written text is a twofold process. Words are selected by an author by processes of association (i.e., sampling earlier segments of his own word sequences) and by imitation (i.e., sampling from other works, by himself or by other authors). Simon's selection processes are stated in the following assumptions, where f(n, t) is the number of different words that have occurred exactly n times in the first t words.

Assumption I: There is a constant probability, α, that the (t + 1)-st word is a new word, that is, a word that has not occurred in the first t words.

Assumption II: The probability that the (t + 1)-st word is a word that has appeared n times is proportional to nf(n, t), that is, to the total number of occurrences of all the words that have appeared exactly n times.
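The two assumptions translate directly into a simple generative procedure. The sketch below is our own illustration, not Simon's program: with probability α the next token is a new word (Assumption I); otherwise a uniformly chosen earlier token is repeated, which selects an old word with probability proportional to the number of times it has already occurred (Assumption II).

```python
# Sketch of a Simon-Yule text generator under Assumptions I and II
# (illustrative reconstruction; names and parameters are ours).
import math
import random

def simon_yule(total_tokens: int, alpha: float, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    tokens = [0]        # V_1 = 1: word 0 serves as the first "old" word
    next_new = 1
    while len(tokens) < total_tokens:
        if rng.random() < alpha:
            tokens.append(next_new)            # Assumption I: new word
            next_new += 1
        else:
            tokens.append(rng.choice(tokens))  # Assumption II: P(word) proportional to n f(n, t)
    return tokens

if __name__ == "__main__":
    toks = simon_yule(20_000, alpha=0.16)
    t, v = len(toks), len(set(toks))
    print(v / t, math.log(v) / math.log(t))    # roughly 0.16 and roughly 0.81
```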

A theoretical justification of Eq. (1) based on these two assumptions is given in Section 4.

With regard to the first assumption of a constant probability α, Simon [9, p. 41] noted: "We cannot conclude from this that the theory should be rejected; the only valid conclusion to be drawn is that the theory is only a first approximation and that the next step in the investigation is to look for an additional mechanism that should be incorporated in the theory so as to give a better second approximation." This problem is discussed further in Section 5.


TABLE 3. Examples of type-token relationship [15].

Text                                                  t         V_t      V_t/t   ln V_t/ln t   V_t/t + ln V_t/ln t
1.  La Corse by Rousseau (French)                     300       157      0.523   0.886   1.410
2.  La Corse                                          1,100     446      0.405   0.871   1.277
3.  I-III John (Greek)                                2,599     623      0.240   0.818   1.058
4.  Language of Peiping (Chinese)                     13,248    3332     0.252   0.855   1.106
5.  La Corse                                          19,400    3223     0.166   0.818   0.984
6.  The Captain's Daughter by Pushkin (Russian)       29,345    4783     0.163   0.824   0.987
7.  Plautus (Latin)                                   33,094    8437     0.255   0.869   1.124
8.  Epistles of Paul (Greek)                          37,327    7281     0.195   0.845   1.040
9.  American Newspaper (English) (compiled
    by Eldridge)                                      43,989    6002     0.136   0.814   0.950
10. Moby Dick by Melville (English)                   136,800   14,172   0.104   0.808   0.912
11. Ulysses by Joyce (English)                        260,430   29,899   0.115   0.826   0.941
12. Shakespeare (English)                             884,647   31,534   0.036   0.757   0.792
13. Shakespeare (Taylor poem)*                        429       258      0.601   0.916   1.518

*This poem was recently discovered by the Shakespeare scholar Gary Taylor on November 14, 1985 [11, 20]. In their interesting paper, Thisted and Efron [22] used a nonparametric empirical Bayes model [6] to examine the poem and found that it was actually written by Shakespeare. The poem was named the Taylor poem after its discoverer.

Assumption II in the Simon-Yule model is intended to incorporate imitation and association as the basis for a stochastic model of text generation. Simon [9, p. 97] argued:

The rationale . . . (is) as follows: Writing and speaking involve both imitative and associative processes. They involve imitation because any given piece of writing or speech is simply a segment from the whole stream of communication in the language. The subjects of communication depend, in large measure, upon what subjects have been previously communicated about in the language, and are being communicated about contemporaneously. Vocabulary choice depends sensitively on the choices of other writers and speakers (e.g. the choice, in American English, among "auto," "car," "automobile," and "motor vehicle").

Simon continued as follows:

Communication also involves association, because the associative processes in the communicator's memory are used by him in generating the sentences he writes or utters. Both imitation and association will tend to evoke any particular word with a probability somewhat proportional to its frequency of occurrence in the language, and to its frequency of use by the communicator.

Successive refinements of the second assumption are discussed in Section 6.

4. A Theoretical Justification of the Identity Based on the Simon-Yule Model

Let us define V_1 = 1, which serves as the first old word for the generation of text. Without loss of generality, and for the convenience of mathematical treatment, we do not include this word in the type-token study. The following lemma shows that V_t/t ≈ α when t is large, as it is in any written text, where t is usually over 10,000 words.

Lemma 1: For t = 1, 2, ..., we have

(1) \[
\lim_{t \to \infty} \frac{V_t}{t} = \alpha \quad \text{with probability one;}
\]

(2) \[
\operatorname{Var}\!\left(\frac{V_t}{t}\right) = \frac{\alpha(1-\alpha)}{t}.
\]

Proof: Define X_t = 1 if the t-th word is a new word, and X_t = 0 if the t-th word is an old word. Then P(X_t = 1) = α and P(X_t = 0) = 1 − α; that is, X_t has a Bernoulli distribution with parameter α. Moreover, the random variables X_1, X_2, ..., X_t are independent. Thus V_t = X_1 + X_2 + ... + X_t has a binomial distribution with parameters t and α.


(1) From the strong law of large numbers, we have
\[
\lim_{t \to \infty} \frac{X_1 + \cdots + X_t}{t} = \lim_{t \to \infty} \frac{V_t}{t} = \alpha \quad \text{with probability one.}
\]

(2) Var(V_t) = tα(1 − α) implies
\[
\operatorname{Var}\!\left(\frac{V_t}{t}\right) = \frac{\alpha(1-\alpha)}{t}.
\]

Theorem 1: For t = 1, 2, ...,
\[
\frac{\ln V_t}{\ln t} = 1 + \frac{\ln \alpha}{\ln t} + \text{term of smaller order},
\]
with probability one.

Proof: For t = 1, 2, ...,
\[
\frac{\ln V_t}{\ln t} = \frac{\ln t + \ln(V_t/t)}{\ln t} = 1 + \frac{\ln(V_t/t)}{\ln t}.
\]
By the Mean Value Theorem, we have
\[
f\!\left(\frac{V_t}{t}\right) - f(\alpha) = f'(\alpha^*)\left(\frac{V_t}{t} - \alpha\right),
\]
where f(x) is continuous for V_t/t ≤ x ≤ α (or V_t/t ≥ x ≥ α), possesses a derivative at each x for V_t/t < x < α (or V_t/t > x > α), and α* is a number between V_t/t and α. Let f(x) = ln x, x > 0; then
\[
\ln \frac{V_t}{t} = \ln \alpha + \frac{1}{\alpha^*}\left(\frac{V_t}{t} - \alpha\right),
\]
and
\[
\frac{\ln V_t}{\ln t} = 1 + \frac{\ln \alpha}{\ln t} + \frac{1}{\alpha^* \ln t}\left(\frac{V_t}{t} - \alpha\right).
\]
From Lemma 1, one has
\[
\frac{\ln V_t}{\ln t} = 1 + \frac{\ln \alpha}{\ln t} + \text{term of smaller order},
\]
with probability one.
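As a numerical illustration (ours), take α = 0.16 and t = 20,000, roughly the values for Twelfth Night in Table 1. Theorem 1 then predicts
\[
\frac{\ln V_t}{\ln t} \approx 1 + \frac{\ln 0.16}{\ln 20{,}000} = 1 - \frac{1.833}{9.904} \approx 0.815,
\]
which is close to the observed bilogarithmic ratio of 0.814 for that play.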

When t is as large as the size of a written text, the term of smaller order indicated above may be neglected. In such cases, we can bound the possible range of ln V_t/ln t as in the following corollary.

Corollary 1: Let 0 < α_min ≤ α ≤ α_max < 1 and t_min ≤ t ≤ t_max. Then
\[
1 + \frac{\ln \alpha_{\min}}{\ln t_{\min}} \le \frac{\ln V_t}{\ln t} \le 1 + \frac{\ln \alpha_{\max}}{\ln t_{\max}}.
\]

Proof: Since α_min, α, and α_max are all less than one and greater than zero, their logarithms are negative and ln α_min ≤ ln α ≤ ln α_max. Also, ln t_min ≤ ln t ≤ ln t_max, so
\[
\frac{\ln \alpha_{\min}}{\ln t_{\min}} \le \frac{\ln \alpha}{\ln t} \le \frac{\ln \alpha_{\max}}{\ln t_{\max}}.
\]
From Theorem 1, we obtain
\[
1 + \frac{\ln \alpha_{\min}}{\ln t_{\min}} \le \frac{\ln V_t}{\ln t} \le 1 + \frac{\ln \alpha_{\max}}{\ln t_{\max}}.
\]

Corollary 1 shows that, given the values (or the ranges) of α and t, we can predict the value (or the range) of ln V_t/ln t. Two examples are given below. In Table 1, the last two columns are very stable: the column ln V_t/ln t has numbers around 0.8, and the column V_t/t + ln V_t/ln t has numbers around 1.00. The following corollary explains this interesting phenomenon.

Corollary 2: If 0.145 ≤ α ≤ 0.201 and 16,436 ≤ t ≤ 29,551, then
\[
0.801 \le \frac{\ln V_t}{\ln t} \le 0.844
\]
and
\[
0.946 \le \frac{V_t}{t} + \frac{\ln V_t}{\ln t} \le 1.045.
\]

Proof: Using Corollary 1 (together with V_t/t ≈ α from Lemma 1), the proof is immediate.
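The arithmetic behind Corollary 2 (and Corollary 3 below) is easy to check; the helper below is an illustrative sketch of ours, not part of the original paper.

```python
# Sketch: evaluate the Corollary 1 bounds on ln V_t / ln t for given
# ranges of alpha and t (function name and rounding are ours).
import math

def bilog_bounds(alpha_min, alpha_max, t_min, t_max):
    lower = 1 + math.log(alpha_min) / math.log(t_min)
    upper = 1 + math.log(alpha_max) / math.log(t_max)
    return lower, upper

# Corollary 2 (Table 1): 0.145 <= alpha <= 0.201, 16,436 <= t <= 29,551
lo, hi = bilog_bounds(0.145, 0.201, 16_436, 29_551)
print(round(lo, 3), round(hi, 3))                    # 0.801 0.844
print(round(0.145 + lo, 3), round(0.201 + hi, 3))    # 0.946 1.045

# Corollary 3 (Table 2): 0.106 <= alpha <= 0.280, 8,374 <= t <= 55,164
lo, hi = bilog_bounds(0.106, 0.280, 8_374, 55_164)
print(round(lo, 3), round(hi, 3))                    # 0.752 0.883
print(round(0.106 + lo, 3), round(0.280 + hi, 3))    # 0.858 1.163
```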

As we can see, the numbers in the ln V_t/ln t column of Table 1 are all within 0.801-0.844, and the numbers in the last column are all within 0.946-1.045. In Table 2, we also notice stable patterns in the last two columns: the column ln V_t/ln t has numbers around 0.8 and the column V_t/t + ln V_t/ln t has numbers around 1.00. The following corollary explains this phenomenon.

Corollary 3: If 0.106 ≤ α ≤ 0.280 and 8,374 ≤ t ≤ 55,164, then
\[
0.752 \le \frac{\ln V_t}{\ln t} \le 0.883
\]
and
\[
0.858 \le \frac{V_t}{t} + \frac{\ln V_t}{\ln t} \le 1.163.
\]

Proof: Using Corollary 1, the proof is immediate.

As we can see, the numbers in the ln V_t/ln t column of Table 2 are all within 0.752-0.883, and the numbers in the last column are all within 0.858-1.163.

5. A Refinement of the Simon-Yule Model

Simon recommended further refinement of his model by modifying the assumptions so as to better represent the real world. This process of successive approximation is demonstrated in his study of the size distribution of business firms [9], where he was able to give a significant economic explanation of the two assumptions and to show the effect of public policy on the size of firms. Furthermore, he explained the concavity to the origin of the log-log plot of Zipf's law by allowing for mergers and acquisitions, autocorrelated growth of firms, and a decreasing entry rate for new firms. Since the latter modification is directly applicable to text, it is considered below in greater detail.

In this section, the first assumption of the Simon-Yule model is modified [18] so that the entry rate for new words is a decreasing function of the length of the text. That is,


Assumption I: There is a decreasing probability function α(t), 0 ≤ α(t) ≤ 1, that the (t + 1)-st word is a new word, that is, a word that has not occurred in the first t words.

Even with this slight modification, the problem is analytically intractable. We cannot extend the proofs of Lemma 1 and Theorem 1 to examine Equation (1), since V_t no longer has a binomial distribution.

A good way to examine the type-token relationship is by simulation. Simon used computer simulation methods to examine the fit of the stochastic model to word-frequency data under relaxation of the assumption of a constant rate of entry of new words. We perform three experiments by choosing:

(1) α(t) = 0.5,   1 ≤ t ≤ 100;
    α(t) = 0.179, 100 < t ≤ 10,000.

(2) α(t) = 0.386, 1 ≤ t ≤ 1,000;
    α(t) = 0.217, 1,000 < t ≤ 2,000;
    α(t) = 0.160, 2,000 < t ≤ 3,000;
    α(t) = 0.240, 3,000 < t ≤ 4,000;
    α(t) = 0.160, 4,000 < t ≤ 5,000;
    α(t) = 0.139, 5,000 < t ≤ 10,557.

(3) α(t) = 0.217, 1 ≤ t ≤ 26,600;
    α(t) = 0.160, 26,600 < t ≤ 45,900;
    α(t) = 0.089, 45,900 < t ≤ 84,000;
    α(t) = 0.093, 84,000 < t ≤ 109,400;
    α(t) = 0.089, 109,400 < t ≤ 134,200;
    α(t) = 0.075, 134,200 < t ≤ 160,600;
    α(t) = 0.065, 160,600 < t ≤ 186,800;
    α(t) = 0.072, 186,800 < t ≤ 213,400;
    α(t) = 0.078, 213,400 < t ≤ 234,100.

These three functions were used in Simon's simulations of Zipf's law [18]. The rationale for the first function is that certain common function words come into a text at a very early stage, say t = 100 or 200; thus the entry rate is initially quite high, drops off rapidly, and then maintains itself at a relatively constant low level. The second function was estimated from a piece of continuous prose, 10,557 words in length, written by a schizophrenic, Jackson M. The last function comes from a very large sample of text from a Russian physics journal, 234,096 words in length.
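The following sketch shows how such an experiment can be run. It is our reconstruction of the procedure, not Johnson's original simulation program [10]: α is a step function of t, and old words are again drawn by copying a uniformly chosen earlier token (Assumption II).

```python
# Sketch: Simon-Yule generation with a decreasing entry rate alpha(t)
# (our reconstruction; function names, checkpoints, and seed are ours).
import math
import random

def alpha_case_1(t: int) -> float:
    """First entry-rate function above: 0.5 up to t = 100, then 0.179."""
    return 0.5 if t <= 100 else 0.179

def simulate(alpha_of_t, total_tokens: int, report_every: int = 1000, seed: int = 0):
    rng = random.Random(seed)
    tokens = [0]          # V_1 = 1, the seed "old" word
    next_new = 1
    while len(tokens) < total_tokens:
        t = len(tokens)
        if rng.random() < alpha_of_t(t):
            tokens.append(next_new)            # new word, entered at rate alpha(t)
            next_new += 1
        else:
            tokens.append(rng.choice(tokens))  # repeat an earlier token (Assumption II)
        if len(tokens) % report_every == 0:
            t, v = len(tokens), len(set(tokens))
            print(t, v,
                  round(v / t, 5),
                  round(math.log(v) / math.log(t), 5),
                  round(v / t + math.log(v) / math.log(t), 5))

if __name__ == "__main__":
    simulate(alpha_case_1, total_tokens=10_000)   # compare with Table 4, Simulation I
```

Despite the non-constant entry rate, the reported sums stay close to 1.00, in line with Tables 4-6.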

The simulation program, originally written by Brien Johnson [10], is modified and tested here. Tables 4 to 6 list the simulation results for each entry-rate function α(t). The interesting observation is that, in every case, ln V_t/ln t ≈ 0.80 and V_t/t + ln V_t/ln t ≈ 1.00.

TABLE 4. Type-token relationship with decreasing entry rates for new words: Case I.

                t      V_t    V_t/t     ln V_t/ln t   V_t/t + ln V_t/ln t

Simulation I
              4,750    931   0.19600   0.80751   1.00351
              5,612   1099   0.19583   0.81112   1.00695
              6,474   1264   0.19524   0.81386   1.00910
              7,336   1411   0.19234   0.81479   1.00713
              7,767   1486   0.19132   0.81538   1.00670
              8,198   1546   0.18858   0.81488   1.00346
              8,629   1609   0.18646   0.81468   1.00115
              9,060   1673   0.18466   0.81460   0.99926
              9,491   1741   0.18344   0.81482   0.99826
              9,922   1820   0.18343   0.81571   0.99914

Simulation II
              2,879    594   0.20632   0.80185   1.00817
              3,598    730   0.20289   0.80520   1.00809
              4,317    864   0.20014   0.80780   1.00794
              5,036    988   0.19619   0.80894   1.00513
              5,755   1127   0.19583   0.81167   1.00750
              6,474   1264   0.19524   0.81386   1.00910
              7,193   1390   0.19324   0.81490   1.00815
              7,912   1504   0.19009   0.81504   1.00513
              8,631   1609   0.18642   0.81466   1.00108
              9,350   1715   0.18342   0.81451   0.99793

Simulation III
              2,784    569   0.20438   0.79982   1.00420
              3,894    790   0.20288   0.80705   1.00993
              4,449    886   0.19915   0.80790   1.00705
              5,559   1086   0.19536   0.81064   1.00600
              6,669   1290   0.19343   0.81343   1.00686
              7,779   1488   0.19128   0.81539   1.00667
              8,334   1570   0.18838   0.81510   1.00349
              8,889   1647   0.18529   0.81459   0.99987
              9,444   1735   0.18371   0.81489   0.99860
              9,999   1832   0.18322   0.81574   0.99896


TABLE 5. Type-token relationship with decreasing entry rates for new words: Jackson data.

                t        V_t     V_t/t     ln V_t/ln t   V_t/t + ln V_t/ln t

Simulation I
              3,163      825    0.26083   0.83325   1.09408
              3,952      993    0.25127   0.83322   1.08449
              4,741     1114    0.23497   0.82889   1.06386
              5,530     1245    0.22514   0.82698   1.05212
              6,319     1357    0.21475   0.82422   1.03897
              7,108     1475    0.20751   0.82269   1.03020
              7,897     1574    0.19931   0.82028   1.01959
              8,686     1660    0.19111   0.81753   1.00864
              9,475     1755    0.18522   0.81585   1.00107
             10,264     1860    0.18122   0.81507   0.99629

Simulation II
              4,936     1153    0.23359   0.82901   1.06260
              5,876     1293    0.22005   0.82556   1.04561
              6,816     1426    0.20921   0.82277   1.03199
              7,756     1560    0.20113   0.82093   1.02207
              8,461     1637    0.19348   0.81836   1.01184
              8,931     1686    0.18878   0.81674   1.00552
              9,166     1717    0.18732   0.81641   1.00373
              9,401     1747    0.18584   0.81605   1.00188
              9,871     1807    0.18306   0.81539   0.99845
             10,341     1869    0.18074   0.81494   0.99567

Simulation III
             33,338     5091    0.15271   0.81956   0.97226
             55,560     8246    0.14842   0.82538   0.97380
             66,671     9828    0.14741   0.82764   0.97505
             88,893   12,845    0.14450   0.83024   0.97474
            133,337   18,916    0.14187   0.83451   0.97638
            155,559   21,972    0.14125   0.83628   0.97752
            166,670   23,568    0.14141   0.83731   0.97872
            188,892   26,628    0.14097   0.83873   0.97970
            222,225   31,341    0.14103   0.84090   0.98193
            233,336   32,900    0.14100   0.84151   0.98251

6. Conclusions

In this paper we make three significant contributions:

(1) We establish a type-token identity relating the type-token ratio and the bilogarithmic type-token ratio. The plays of Shakespeare and other interesting texts serve as demonstrative examples.

(2) The type-token identity is derived from the Simon-Yule model, which is useful in explaining Zipf's law. An important implication of this result is that it provides further support for the use of the Simon-Yule model as a promising statistical model of text generation.

(3) A realistic refinement of the Simon-Yule model is made by considering a decreasing entry rate for new words in the generation of text. Simulation methods are used to show that the type-token identity is preserved under this assumption.

Further refinements of the Simon-Yule model of text are possible and should be based on a deeper theoretical understanding of the nature of text generation. A promising way of doing this is to effect a closer relationship between statistical and computational models of text. Statistical models provide a powerful descriptive approach based on empirical observation. Computational models, on the other hand, take a constructive approach and attempt to create text which is similar to human writing through a deeper understanding of linguistic processes. Current computational approaches, however, do not take advantage of the inherent statistical properties of text generation. The authors believe that further refinement of the Simon-Yule model based on computational theory is a promising way to bridge the gap between the deterministic and stochastic approaches and to develop better models of text.

Acknowledgment

This research was supported in part by the National Science Foundation under Grant IST-7911893A1 and by the Council on Research of Louisiana State University.


TABLE 6. Type-token relationship with decreasing entry rates for new words: Russian data.

                t        V_t     V_t/t     ln V_t/ln t   V_t/t + ln V_t/ln t

Simulation I
             30,008     6235    0.20778   0.84758   1.05536
             50,008     9163    0.18323   0.84316   1.02639
             70,008   10,932    0.15615   0.83356   0.98971
             80,008   11,793    0.14740   0.83041   0.97781
             90,008   12,690    0.14099   0.82827   0.96925
            110,008   14,496    0.13177   0.82541   0.95718
            130,008   16,263    0.12509   0.82347   0.94856
            150,008   17,811    0.11873   0.82121   0.93995
            180,008   19,878    0.11043   0.81791   0.92834
            220,008   22,789    0.10358   0.81568   0.91926

Simulation II
             11,385     2423    0.21282   0.83434   1.04716
             56,893     9794    0.17215   0.83931   1.01146
             68,270   10,797    0.15815   0.83432   0.99247
             79,647   11,753    0.14756   0.83044   0.97801
            102,401   13,774    0.13451   0.82611   0.96062
            113,778   14,812    0.13018   0.82487   0.95506
            136,532   16,836    0.12331   0.82299   0.94630
            159,286   18,502    0.11616   0.82028   0.93643
            204,794   21,581    0.10538   0.81601   0.92139
            227,548   23,396    0.10282   0.81558   0.91840

Simulation III
             55,560     9671    0.17406   0.83997   1.01404
             66,671   10,664    0.15995   0.83499   0.99494
             77,782   11,595    0.14907   0.83099   0.98006
             88,893   12,594    0.14168   0.82851   0.97018
            100,004   13,580    0.13579   0.82658   0.96237
            111,115   14,589    0.13130   0.82525   0.95655
            133,337   16,548    0.12411   0.82318   0.94728
            155,559   18,211    0.11707   0.82057   0.93764
            177,781   19,726    0.11096   0.81812   0.92908
            233,336   23,856    0.10224   0.81550   0.91774

References

1. Chen, Y. S. Statistical Models of Text: A System Theory Approach. Ph.D. dissertation, Purdue University; 1985.
2. Chotlos, J. Studies in Language Behavior. Psychology Monograph. V56; 1944.
3. Devooght, J. Sur la loi de Zipf-Mandelbrot. Bull. Cl. Sci. Acad. Roy. Belg. 4; 1957.
4. Edmundson, H. P. Statistical Inference in Mathematical and Computational Linguistics. International Journal of Computer and Information Sciences. 6(2): 95-129; 1977.
5. Edmundson, H. P. Mathematical Models of Text. Information Processing & Management. 20(1-2): 261-268; 1984.
6. Efron, B.; Thisted, R. Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know? Biometrika. 63: 435-447; 1976.
7. Guiraud, P. Les Caractères Statistiques du Vocabulaire. Paris: Presses Universitaires de France; 1954.
8. Herdan, G. Type-Token Mathematics: A Textbook of Mathematical Linguistics. The Hague: Mouton & Co.; 1960.
9. Ijiri, Y.; Simon, H. A. Skew Distributions and the Sizes of Business Firms. Amsterdam: North-Holland Publishing Company; 1977.
10. Johnson, B. D. Analysis and Simulation of the Information Productivity of Scientific Journals. Master's thesis, School of Industrial Engineering, Purdue University; 1983.
11. Lelyveld, J. A Scholar's Find: Shakespearean Lyric. The New York Times, (November 24, 1985); 1-12. With corrections in Editors' Note, (November 25, 1985); 2.
12. Mandelbrot, B. An Information Theory of the Statistical Structure of Language. In: Proceedings of the Symposium on Applications of Communication Theory, London, September 1952. London: Butterworths; 1953: 486-500.
13. Mandelbrot, B. Final Note on a Class of Skew Distribution Functions: Analysis and Critique of a Model Due to H. A. Simon. Information and Control. 4: 198-216; 1961.
14. Miller, G.; Chomsky, N. Finitary Models of Language Users. In: Luce, R.; Bush, R.; Galanter, E., eds. Handbook of Mathematical Psychology, Vol. II. New York: Wiley; 1963: 419-491.
15. Parunak, A. Graphical Analysis of Ranked Counts (of Words). Journal of the American Statistical Association. 74(365): 25-30; 1979.
16. Simon, H. A. On a Class of Skew Distribution Functions. Biometrika. 42: 425-440; 1955.
17. Simon, H. A. Some Further Notes on a Class of Skew Distribution Functions. Information and Control. 3: 80-88; 1960.
18. Simon, H. A. Some Monte Carlo Estimates of the Yule Distribution. Behavioral Science. 8: 203-210; 1963.
19. Spevack, M. A Complete and Systematic Concordance to the Works of Shakespeare. Vols. I-IV. Hildesheim: Georg Olms; 1968.


20. Taylor, G. Shakespeare's New Poem: A Scholar's Clues and Conclusions. New York Times Book Review, (December 15, 1985); 11-14.
21. Tesitelova, M. On the So-Called Vocabulary Richness. Prague Studies in Mathematical Linguistics. 103-120; 1971.
22. Thisted, R.; Efron, B. Did Shakespeare Write a Newly-Discovered Poem? Technical Report No. 244, Department of Statistics, Stanford University; April 1986.
23. Thorndike, E. L. Book Review: National Unity and Disunity by G. K. Zipf. Science. 94: 19; July 4, 1941.
24. Yule, G. U. A Mathematical Theory of Evolution, Based on the Conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London, Series B. 213: 21-87; 1924.
25. Yule, G. U. The Statistical Study of Literary Vocabulary. Cambridge, England: Cambridge University Press; 1944.
26. Zipf, G. K. Human Behavior and the Principle of Least Effort. Reading, MA: Addison-Wesley; 1949.
