Applying GLMMs to Word Counts to Analyze the Literary Style of Detective Short Stories by A. Conan...


  • 8/3/2019 Applying GLMMs to Word Counts to Analyze the Literary Style of Detective Short Stories by A. Conan Doyle, G. K. Chesterton, and E. W. Hornung to Detective Short Stories


    Applying Generalized Linear Mixed Models to Word Counts to Analyze the Literary Style of Detective Short Stories by A. Conan Doyle, G. K. Chesterton, and E. W. Hornung

    Work by Roger Bilisoly and Krishna Saha

    Department of Mathematical Sciences

    Central Connecticut State University

    New Britain, Connecticut

    Joint Statistical Meetings

    Miami, Florida

    August 3, 2011


    Stylometry

    Studying literary style goes back to the ancient Greeks: Aristotle's Rhetoric dates to the 4th century B.C.

    The idea of using quantitative measures to study style is about 150 years old.

    Augustus de Morgan suggested in a letter that studying word length may be useful in comparing authors (Holmes and Kardos (2003)).

    Mendenhall (1887) gives the results of his analysis of word length by author. He compares 200,000 words of Francis Bacon and 400,000 words of William Shakespeare, with little success according to Williams (1956).

    First big stylometry success by statisticians: Mosteller and Wallace's work on the Federalist Papers in the early 1960s. See Mosteller and Wallace (1984).

    Currently an active area of research in several disciplines: natural language processing, statistics, text mining, information retrieval, and physics.


    The Early Detective Short Story Corpus

    This talk considers the following ~1,000,000-word corpus:

    The 56 canonical Sherlock Holmes short stories by Arthur Conan Doyle.

    The 50 Father Brown short stories by G. K. Chesterton.

    The 26 A. J. Raffles short stories by E. W. Hornung.

    These stories were published between 1891 and 1935 and are (mostly) out of copyright.

    All are available at Project Gutenberg (www.gutenberg.org).

    Doyle wrote four longer Sherlock Holmes stories, and Hornung wrote one A. J. Raffles novel, but these are not considered here.

    A. J. Raffles is an amateur cracksman, but Hornung acknowledges Doyle's influence on his writing.

    In 1893, Hornung married Constance Doyle, Arthur Conan Doyle's sister.

    The dedication of his first Raffles book was "To A. C. D. This Form of Flattery."

    All three authors also wrote in other genres.


    Function vs. Content Words

    The idea of a word seems obvious, but is hard to pin down. Is "Holmes" a word? Is "1888" a word? Is "knee-caps" one or two words? Etc.

    Function words are used for syntactic purposes and their classes are closed: prepositions, pronouns, particles, some adverbs.

    For example, "up" is a particle in the following phrasal verbs: mix up, throw up, sit up, screw up, make up, hit up, ...

    Content words denote one or more specific things or ideas. Verbs, nouns, adjectives, and adverbs are usually content words.

    The use of function words is believed to be an unconscious habit, whereas content words are used intentionally.

    Mosteller and Wallace (1984) studied preposition use to analyze The Federalist Papers because they thought the three potential authors (Hamilton, Madison, Jay) would have distinctive rates of use.

    Detective story writers, such as Doyle, Chesterton, and Hornung, intentionally use words related to crime and violence.
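
    The questions about "Holmes", "1888", and "knee-caps" become concrete once a tokenizer has to be written. The following is a minimal Python sketch, not the software used for the talk; the function name tokenize and its two switches are hypothetical and only show that each answer is an explicit choice.

```python
import re

def tokenize(text, keep_numbers=True, split_hyphens=False):
    """Split text into lowercase tokens under explicit, adjustable rules."""
    if split_hyphens:
        # Treat "knee-caps" as two words instead of one.
        text = text.replace("-", " ")
    # Letters with optional internal apostrophes or hyphens, or runs of digits.
    pattern = r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+"
    tokens = [t.lower() for t in re.findall(pattern, text)]
    if not keep_numbers:
        # Decide that "1888" is not a word.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

sample = "In 1888 Holmes hurt his knee-caps."
print(tokenize(sample))
print(tokenize(sample, keep_numbers=False, split_hyphens=True))
```

    The two calls give six tokens each but different ones, so every word count and rate downstream depends on the rules chosen here.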


    Example: PCA of Personal Pronouns (Function words)

    Eigenvectors

    Pronoun          Prin1     Prin2     Prin3     Prin4
    I             0.428270  0.120691  -.030862  0.292536
    me            0.438173  0.083333  -.064091  0.228027
    you           0.336905  0.192433  -.200798  0.257253
    he            -.261557  0.340602  -.270750  -.342274
    him           0.073556  0.374285  -.323832  -.037269
    she           0.105492  -.581133  -.169591  -.040532
    her           0.127503  -.579041  -.157171  -.037176
    it            0.279381  0.070345  0.112264  0.217366
    we            0.324365  0.053332  0.376130  -.447410
    us            0.323786  0.057149  0.375081  -.461859
    they          -.322364  0.026483  0.302032  0.365480
    them          -.133618  0.024416  0.580514  0.283493
    % Explained      32.4%     19.3%     13.4%     10.8%

    Prin1 has 3 negative weights (he, they, them).

    Prin2 contrasts 3rd person male vs. 3rd person female.

    Prin3 contrasts plural forms with 2nd and 3rd person singular (except for "it").

    Hence some grammatical patterns are detectable with the bag-of-words approach.
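
    To see why Prin2 reads as a male vs. female contrast, recall that a component score is a weighted sum of the variables. The sketch below uses the Prin2 weights from the table; it assumes the PCA was run on standardized pronoun rates, and the two example stories are hypothetical z-scores, not corpus values.

```python
# Prin2 weights copied from the eigenvector table above.
PRIN2 = {
    "I": 0.120691, "me": 0.083333, "you": 0.192433,
    "he": 0.340602, "him": 0.374285,
    "she": -0.581133, "her": -0.579041,
    "it": 0.070345, "we": 0.053332, "us": 0.057149,
    "they": 0.026483, "them": 0.024416,
}

def component_score(weights, standardized_rates):
    """Weighted sum of standardized rates; pronouns not listed count as 0."""
    return sum(w * standardized_rates.get(p, 0.0) for p, w in weights.items())

# Hypothetical stories: one standard deviation above average on two pronouns.
story_with_women = {"she": 1.0, "her": 1.0}
story_with_men = {"he": 1.0, "him": 1.0}

print(component_score(PRIN2, story_with_women))
print(component_score(PRIN2, story_with_men))
```

    The first score is negative and the second positive, which is the contrast described for Prin2.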


    Example: Names of Characters (Content words)

    For fiction it is not hard to distinguish series of stories based on recurring character(s): just use names, which are content words.

    D = Doyle, H = Hornung, C = Chesterton

    StoryID    D1  D2  D3  D4  D5  D6  D7  D8  D9 D10  H1  H2  H3  H4  H5  H6  H7  H8  H9 H10  C1  C2  C3  C4  C5  C6  C7  C8  C9 C10
    Sherlock   11  10   7  10  10  10  10   9   5   7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    Holmes     48  53  46  47  25  29  38  56  14  34   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    John        4   6   0   3   9   1   3   0   0   0   0   2   0   0   0   1   0   0   0   0   0   0   0   2   2   0   0   0   0   0
    Watson      6  10   4   4   4  16   9  12   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    Arthur      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
    Raffles     0   0   0   0   0   0   0   0   0   0  47  35  47  22  41  50  48  74  42  51   0   0   0   0   0   0   0   0   0   0
    Harry       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0
    Bunny       0   0   0   0   0   0   0   0   0   0  19  15  20  20  11  10  16  27  20  25   0   0   0   0   0   0   0   0   0   0
    Manders     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    Father      0   0  18  33  13   3   1   0   1   8   0   0   2   1   0   1   0   0   0   0  12  17  22  15  11  20  43  30  16  27
    Brown       2   2   2   1   0   2   2   3   0   0   0   0   0   0   2   0   0   3   0   1  17  26  26  18  14  29  49  33  17  27
    Hercule     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
    Flambeau    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  27   0   1   7  22  32  34  34   0  43

    Bunny's last name is not used at all in any of the Raffles stories (it appears in the novel Mr. Justice Raffles). His first name appears in exactly one story.

    Watson's name does not appear in D9 although he narrates it.


    The Statistical Model

    Response variable = word proportions

    Explanatory variables = author, word, word group, date (nuisance parameter?), plot indicator variables (future work)

    Non-normal error distribution: the data consist of word counts.

    Errors are dependent due to repeated measures.

    Random effects are plausible: for example, both Doyle and Chesterton grew tired of writing Sherlock Holmes and Father Brown stories, but did so later in their careers to make money.

    The above suggests using Generalized Linear Mixed Models (GLMMs).

    Both SAS and R fit GLMMs to data.


    SAS's PROC GLIMMIX for GLMMs

    Let Y = response vector (proportions), X = design matrix of the fixed effects, and Z = design matrix of the random effects.

    The basic model is $g(E[Y \mid \gamma]) = X\beta + Z\gamma$, where $g$ is the link function, $\beta$ is the vector of fixed effects, and $\gamma$ is the vector of random effects.

    SAS has G-side and R-side random effects.

    We assume that $\gamma \sim \mathrm{Normal}(0, G)$ and the residuals $\epsilon \sim \mathrm{Normal}(0, R)$ are independent.

    G models random effects, e.g., date of publication.

    R models correlated residuals, e.g., word correlations.

    Both G and R are modeled using the RANDOM statement (unlike the REPEATED statement of PROC GENMOD and PROC MIXED).

    R also has functions that fit GLMMs: glmmPQL (PQL = penalized quasi-likelihood).

    The above PROC GLIMMIX information comes from the SAS documentation at http://support.sas.com/rnd/app/papers/glimmix.pdf.
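
    As a rough illustration of what the basic model says for one observation, the sketch below inverts the link to get an expected word proportion from the linear predictor. It assumes a logit link, which the slides do not state; the function name and all numbers are hypothetical.

```python
import math

def expected_proportion(x, beta, z, gamma):
    """Invert an assumed logit link for one observation.

    x, z  : rows of the fixed- and random-effects design matrices
    beta  : fixed effects; gamma : one draw of the random effects
    """
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    eta += sum(zi * gi for zi, gi in zip(z, gamma))
    return 1.0 / (1.0 + math.exp(-eta))

# Made-up values: an intercept plus one author indicator, one random intercept.
x = [1.0, 1.0]
beta = [-7.0, 0.5]
z = [1.0]

print(expected_proportion(x, beta, z, [0.0]))  # typical level of the random effect
print(expected_proportion(x, beta, z, [0.3]))  # a level with a higher rate
```

    A positive random effect raises the expected proportion for every observation that shares it, which is how the model allows observations to be dependent.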


    Example: Color Word Analysis Data

    Each row represents one story.

    For example, row 1 is Doyle's first Sherlock Holmes short story.

    The columns are as follows:

    A = word class

    B = word

    C = # times word appears in story

    D = # words in story

    E = Story ID

    F = Author

    G = Year of book publication
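
    A data set with columns A through G could be assembled along the following lines. This is only a sketch: it assumes one row per story for each word of interest, and the names Row and story_rows, the story ID "X1", and the example text are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple

class Row(NamedTuple):
    word_class: str   # column A
    word: str         # column B
    count: int        # column C
    total_words: int  # column D
    story_id: str     # column E
    author: str       # column F
    year: int         # column G

def story_rows(tokens, word_classes, story_id, author, year):
    """Build one Row per word of interest for a single tokenized story."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        Row(word_class, word, counts[word], total, story_id, author, year)
        for word_class, words in word_classes.items()
        for word in words
    ]

# Made-up story text and metadata, for illustration only.
tokens = "the black cat sat on the white mat near the black door".split()
for row in story_rows(tokens, {"BWG": ["black", "white"]}, "X1", "Example Author", 1900):
    print(row)
```

    Keeping the story's total word count on every row makes it straightforward to model the count as a proportion.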


    Color Word Rates

    BWG combines the counts of the words black, white, gray, and grey.

    RBGY combines the counts of the words red, blue, green, and yellow.

    Chesterton clearly uses more color words than the other two. Note that Chesterton went to the Slade School of Art (1893-96).
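
    The group rates can be computed directly from a tokenized story. The sketch below uses the BWG and RBGY word lists given above and expresses the result per 1000 words; the function name and the sample sentence are hypothetical.

```python
BWG = {"black", "white", "gray", "grey"}
RBGY = {"red", "blue", "green", "yellow"}

def rate_per_1000(tokens, group):
    """Occurrences of any word in `group` per 1000 tokens of the story."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in group)
    return 1000.0 * hits / len(tokens)

# Made-up sentence; real stories run to thousands of words.
tokens = "a red sun set behind grey hills and white walls".split()
print(rate_per_1000(tokens, BWG))
print(rate_per_1000(tokens, RBGY))
```

    Dividing by story length puts stories of different sizes on a common scale, although rates can still depend on sample size.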


    PROC GLIMMIX Output

    Parameter Estimates

    Effect     author  Estimate  Standard Error    DF  t Value  Pr > |t|
    Intercept           -8.9077          0.1260  1053   -70.71


    Example: Rates (per 1000 words) of "are" and "is"

    There is good separation among Hornung, Chesterton, and Doyle.

    Concordancing can give insight into how these differences arise: see next slide.


    Concordancing to Analyze "are" and "is" Usage (function words)

    Given a text pattern, all instances of this pattern are found and the results are displayed so that the matches are lined up.

    Regular expressions (regexes) are a popular tool for specifying text patterns.

    The resulting matches can be sorted in different ways.

    Part of the matches of "is" found in Doyle's stories. The lines are sorted by the first word on the left.

    Regex used: /\b(is)\b/
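
    A concordance of this kind can be sketched in a few lines of Python. This is not the software used for the talk; the function names, the context width, and the sample text are hypothetical, and only the regex \b(is)\b comes from the slide.

```python
import re

def concordance(text, pattern, width=30):
    """Return (left context, match, right context) for every match of pattern."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

def sort_by_left_word(lines):
    """Sort concordance lines by the word immediately left of the match."""
    def key(line):
        words = line[0].split()
        return words[-1].lower() if words else ""
    return sorted(lines, key=key)

def show(lines, width=30):
    """Print the lines with the matches aligned in one column."""
    for left, match, right in lines:
        print(f"{left:>{width}} {match} {right}")

# Made-up text; note that \b keeps the "is" inside "This" from matching.
text = "This is it. There is more. He is here."
show(sort_by_left_word(concordance(text, r"\b(is)\b")))
```

    Sorting on the left-hand word groups lines such as "he is", "there is", and "this is" together, which is what makes phrase counts easy to read off.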


    Some Results of Concordancing

    Phrases with "are"

    Author      Phrase     RawCount  AdjCount
    Doyle       there are       182      68.5
    Doyle       they are        115      43.3
    Doyle       we are          169      63.6
    Doyle       you are         377     141.8
    Chesterton  there are        93      45.6
    Chesterton  they are         72      35.3
    Chesterton  we are           57      27.9
    Chesterton  you are         136      66.6
    Hornung     there are        23      23.0
    Hornung     they are         22      22.0
    Hornung     we are           27      27.0
    Hornung     you are          65      65.0

    Phrases with "is"

    Author      Phrase     RawCount  AdjCount
    Doyle       he is           330     124.1
    Doyle       it is          1209     454.7
    Doyle       that is         308     115.8
    Doyle       there is        457     171.9
    Chesterton  he is           194      95.0
    Chesterton  it is           394     193.0
    Chesterton  that is         192      94.0
    Chesterton  there is        189      92.6
    Hornung     he is            38      38.0
    Hornung     it is           100     100.0
    Hornung     that is          17      17.0
    Hornung     there is         18      18.0

    AdjCount rescales each raw count to Hornung's total of 170878 words, the smallest of the three: Doyle's stories have 454340 words and Chesterton's have 348879 words.

    Stylistic differences are strong: e.g., Doyle vs. Hornung for "there is".

    Software used is from Sections 3.7 and 6.5 of Bilisoly (2008).
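
    The AdjCount column is consistent with scaling each raw count by the ratio of Hornung's total word count to the author's total. The sketch below checks that reading against a few table entries; the function name is hypothetical and the scaling rule is inferred from the numbers rather than stated on the slide.

```python
# Total words per author, as given on the slide.
TOTAL_WORDS = {"Doyle": 454340, "Chesterton": 348879, "Hornung": 170878}

def adjusted_count(raw_count, author, reference="Hornung"):
    """Rescale a raw count to the size of the reference author's corpus."""
    return raw_count * TOTAL_WORDS[reference] / TOTAL_WORDS[author]

for author, phrase, raw in [
    ("Doyle", "there is", 457),
    ("Chesterton", "there is", 189),
    ("Hornung", "there is", 18),
]:
    print(author, phrase, raw, round(adjusted_count(raw, author), 1))
```

    After adjustment, Doyle's "there is" count is still roughly nine to ten times Hornung's, so the difference is not an artifact of corpus size.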


    Ongoing Work

    It is easy to have models where PROC GLIMMIX fails to converge or has memory problems.

    Find out which models work best with this data.

    Many other word classes are ready to analyze.

    For example, words related in meaning from Roget's thesaurus.

    Create a data set of plot-device indicators.

    For example, has a dead body turned up?

    Does the story involve a theft?


    References

    Fahrmeir, Ludwig and Gerhard Tutz (1994). Multivariate Statistical Modelling Based on Generalized Linear Models, Springer.

    Holmes, David and Judit Kardos (2003). Who Was the Author? An Introduction to Stylometry. Chance, 16, 5-8.

    Littell, Ramon, George Milliken, Walter Stroup, and Russell Wolfinger (1996). SAS System for Mixed Models, SAS Institute.

    McCulloch, Charles, Shayle Searle, and John Neuhaus (2008). Generalized, Linear, and Mixed Models, Wiley.

    Mosteller, Frederick and David Wallace (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers, Springer.

    Pinheiro, Jose and Douglas Bates (2000). Mixed-Effects Models in S and S-Plus, Springer.

    Raudenbush, Stephen and Anthony Bryk (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage.

    SAS Institute (2006). The GLIMMIX Procedure. Available at http://support.sas.com/rnd/app/papers/glimmix.pdf.

    Shoukri, Mohamed and Mohammad Chaudhary (2007). Analysis of Correlated Data with SAS and R, 3rd Edition, Chapman & Hall.

    Williams, C. B. (1956). Studies in the History of Probability and Statistics: IV. A Note on an Early Statistical Study of Literary Style. Biometrika, 43, 248-256.


    Example of Language Complication: Many word measures are functions of sample size, even rates.

    "We have seen that lexical measures such as mean frequency and vocabulary size, as well as [model parameter estimates] all depend on the sample size." (p. 24 of Word Frequency Distributions by R. Harald Baayen)

    One solution used in corpus linguistics: take equal-sized samples from texts.

    However, stories are whole entities.

    This is the reason the novellas are left out.

    An approximate solution is to consider sets of stories close in size. Unfortunately, the detective stories range from 2208 to 13593 total words, which is about a factor of 6.
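
    The dependence on sample size is easy to see by measuring the same text at several prefix lengths. The sketch below counts distinct word types and the type/token ratio for growing samples; the function names and the toy text are hypothetical.

```python
def vocabulary_growth(tokens, sizes):
    """Number of distinct word types among the first n tokens, for each n."""
    return {n: len(set(tokens[:n])) for n in sizes}

def type_token_ratio(tokens, n):
    """Distinct types divided by tokens in the first n tokens."""
    sample = tokens[:n]
    return len(set(sample)) / len(sample)

# Made-up 15-token text, for illustration only.
toy = "the man saw the dog and the dog saw the man and the man ran".split()

print(vocabulary_growth(toy, [5, 10, 15]))
for n in (5, 10, 15):
    print(n, type_token_ratio(toy, n))
```

    Vocabulary size keeps rising while the type/token ratio falls as the sample grows, so comparing a 2208-word story with a 13593-word story on such measures confounds style with length.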