
Automatic Authorship Identification (Part II)

Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis

Acknowledgements

• Support
– U.S. National Science Foundation

• DIMACS REU 2004

• Knowledge Discovery and Dissemination Program

• Disclaimer
– The views expressed in this talk are those of the authors, and not of any other individuals or organizations.

Outline

I. Recap

II. New Federalist Paper Results

III. New E-mail Data Results

IV. Conclusions and Future Work

The Authorship Problem

• Given:
– A piece of text with unknown author
– A list of possible authors
– A sample of their writing

• Problem:
– Can we automatically determine which person wrote the text?

• Approach:
– Use style markers to identify the author

The Federalist Papers

• 85 Total

• 12 Disputed

Previous Work: Mosteller and Wallace (1964)

• Function Words

Upon Also An

By Of On

There This To

Although Both Enough

While Whilst Always

Though Commonly Consequently

Considerable(ly) According Apt

Direction Innovation(s) Language

Vigor(ous) Kind Matter(s)

Particularly Probability Work(s)

Our Previous Work: Trials with the Federalist Papers

• Wrote scripts in Perl and Python to compute
– Sentence length frequencies
– Word length frequencies
– Ratios of 3-letter words to 2-letter words

• Analyzed our data with graphing and statistics software.
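
The three style markers listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' original scripts: the function names, the naive sentence splitter, and the letters-only definition of a word are all assumptions made for the example.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(text):
    # Assumption: a word is a maximal run of ASCII letters.
    return re.findall(r"[A-Za-z]+", text)


def sentence_length_frequencies(text):
    # Maps sentence length (in words) to the number of sentences of that length.
    return Counter(len(words(s)) for s in split_sentences(text))


def word_length_frequencies(text):
    # Maps word length (in letters) to the number of words of that length.
    return Counter(len(w) for w in words(text))


def three_to_two_ratio(text):
    # Ratio of 3-letter words to 2-letter words; NaN when there are no 2-letter words.
    freq = word_length_frequencies(text)
    return freq[3] / freq[2] if freq[2] else float("nan")


sample = "It is so. The cat sat on the mat."
sentence_freq = sentence_length_frequencies(sample)
word_freq = word_length_frequencies(sample)
ratio = three_to_two_ratio(sample)
```

A real run would read each paper from a file and would probably need a more careful tokenizer (abbreviations, hyphenated words, numerals).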

Previous Conclusions

• Not too helpful… but there is hope!
– Try more features
– Try different features

Feature Selection

• Which features work best?

• One way to rank features:
– Make a contingency table for each feature F
– Compute abs ( log ( ad / bc ) )
– Rank the log values

            F    Not F
Madison     a    b
Hamilton    c    d
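
The ranking step can be sketched as follows. The slide does not say what the cells a, b, c, d count; the example assumes they are numbers of papers by each author that do or do not exhibit feature F. The 0.5 added to every cell is not from the talk: it is a common smoothing device included here only so that a zero cell does not break the logarithm. All counts in the example are made up for illustration.

```python
import math


def abs_log_odds_ratio(a, b, c, d, smoothing=0.5):
    # |log(ad / bc)| for the 2x2 table [[a, b], [c, d]].
    # smoothing is an assumption of this sketch, added to avoid division by zero.
    a, b, c, d = (v + smoothing for v in (a, b, c, d))
    return abs(math.log((a * d) / (b * c)))


def rank_features(tables, smoothing=0.5):
    # tables maps a feature name to its (a, b, c, d) cells.
    scores = {
        name: abs_log_odds_ratio(*cells, smoothing=smoothing)
        for name, cells in tables.items()
    }
    # Largest absolute log odds ratio first: the most discriminating feature.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts, NOT taken from the Federalist Papers.
example_tables = {
    "there": (8, 6, 30, 20),
    "upon": (1, 13, 45, 5),
    "by": (10, 4, 20, 30),
}
ranking = rank_features(example_tables)
ranked_names = [name for name, _ in ranking]
```

Because of the absolute value, the score is unchanged if the two authors (rows) or the two columns are swapped, so the orientation of the table does not affect the ranking.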

49 Ranked Features

Linear Discriminant Analysis

• A technique for classifying data

• Available in the R statistics package

• Input:
– Table of training data
– Table of test data

• Output:
– Classification of test data

Linear Discriminant Analysis: example

Input training data:

upon 2-letter 3-letter

M 0.000 206.943 194.927

M 0.000 212.915 194.665

M 0.369 202.583 190.775

M 0.000 201.891 213.712

M 0.000 236.943 206.221

H 3.015 235.176 187.940

H 2.458 226.647 201.082

H 4.955 232.432 192.793

H 2.377 232.937 186.078

H 3.788 224.116 196.338

Input test data:

upon 2-letter 3-letter

0.000 226.277 203.163

0.908 205.268 181.653

0.000 225.536 182.627

0.000 217.273 183.053

1.003 232.581 184.962

Output: m m m m h
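
For readers who want to see what a two-class linear discriminant does with tables like these, here is a minimal pure-Python sketch. It is not R's `lda` and may differ from it in details such as scaling and numerical method, so its predictions are not guaranteed to match the output shown on the slide. The function names are hypothetical; the data are the training and test rows from the example.

```python
import math


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_covariance(groups):
    # Within-class covariance, pooled over all classes.
    dim = len(groups[0][0])
    cov = [[0.0] * dim for _ in range(dim)]
    for group in groups:
        mu = mean_vector(group)
        for row in group:
            dev = [x - m for x, m in zip(row, mu)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += dev[i] * dev[j]
    denom = sum(len(g) for g in groups) - len(groups)
    return [[v / denom for v in row] for row in cov]


def solve(a, b):
    # Solve a x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_lda(class0, class1):
    # Direction w = S^-1 (mu1 - mu0); priors taken from class proportions.
    mu0, mu1 = mean_vector(class0), mean_vector(class1)
    w = solve(pooled_covariance([class0, class1]), [b - a for a, b in zip(mu0, mu1)])
    midpoint = [(a + b) / 2 for a, b in zip(mu0, mu1)]
    bias = math.log(len(class1) / len(class0)) - sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def predict(model, x, labels=("m", "h")):
    w, bias = model
    score = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return labels[1] if score > 0 else labels[0]


# Columns: upon, 2-letter, 3-letter (values copied from the slide).
madison = [[0.000, 206.943, 194.927], [0.000, 212.915, 194.665], [0.369, 202.583, 190.775],
           [0.000, 201.891, 213.712], [0.000, 236.943, 206.221]]
hamilton = [[3.015, 235.176, 187.940], [2.458, 226.647, 201.082], [4.955, 232.432, 192.793],
            [2.377, 232.937, 186.078], [3.788, 224.116, 196.338]]
test_rows = [[0.000, 226.277, 203.163], [0.908, 205.268, 181.653], [0.000, 225.536, 182.627],
             [0.000, 217.273, 183.053], [1.003, 232.581, 184.962]]

model = fit_lda(madison, hamilton)
predictions = [predict(model, row) for row in test_rows]
```

With only five training rows per author the pooled covariance estimate is noisy, which is one reason small changes in the feature set can flip individual classifications.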

Some more LDA results

• 12 to Madison:
– upon, 1-letter, 2-letter
– upon, enough, there
– upon, there

• 11 to Madison:
– upon, 2-letter, 3-letter

• < 6 to Madison:
– 2-letter, 3-letter
– there, 1-letter, 2-letter

Some more LDA results

Class   Output of lda             Features tested
12 M    m m m m m m m m m m m m   upon apt 9 2
12 M    m m m m m m m m m m m m   to upon 2 3
11 M    m m m m m m h m m m m m   on there 2 13
11 M    h m m m m m m m m m m m   an by 5 10
10 M    m m m m m m h m m m h m   particularly probability 3 9
8 M     m m m m m m h h h m h m   also of 1 4
8 M     m m m h m m h h m m h m   always of 1 3
7 M     h m m h m h h m h m m m   of work 5 2
6 M     m m h m m m h h m h h h   there language 1 8
5 M     m h m h h m h h h m m h   consequently direction 5 11

Feature Selection Part II

• Which combinations of features are best for LDA?

• Are the features independent?

• We did some random sampling:
– Choose features a, b, c, d
– Compute x = log a + log b + log c + log d
– Compute y = log (a+b+c+d)
– Plot x versus y

Selecting more features

• What happens when more than 4 features are used for the lda?

• Greedy approach
– Add features one at a time from two lists
– Perform lda on all features chosen so far

• Is overfitting a problem?
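
One way the greedy procedure could look in code is sketched below. The slide does not say how the next feature is picked from the two lists; this sketch assumes that at each step the next unused feature from each list is tried and the one giving the better score is kept. `greedy_select` and the `score` callback are hypothetical names, and the toy scoring function is only a stand-in for "run lda on the chosen features and measure how well it classifies".

```python
def greedy_select(list_a, list_b, score, max_features=None):
    # list_a, list_b: two ranked feature lists (e.g. function words, word lengths).
    # score: callable taking a list of feature names and returning a number
    #        (higher is better); in practice this would wrap an LDA run.
    remaining_a, remaining_b = list(list_a), list(list_b)
    chosen, history = [], []
    while remaining_a or remaining_b:
        if max_features is not None and len(chosen) >= max_features:
            break
        candidates = []
        if remaining_a:
            candidates.append((score(chosen + [remaining_a[0]]), remaining_a))
        if remaining_b:
            candidates.append((score(chosen + [remaining_b[0]]), remaining_b))
        # On a tie, max() keeps the first candidate, i.e. the one from list_a.
        best_score, source = max(candidates, key=lambda c: c[0])
        chosen.append(source.pop(0))
        history.append((list(chosen), best_score))
    return history


# Toy stand-in for an LDA-based score: made-up weights, not real results.
toy_weights = {"upon": 5, "there": 3, "2-letter": 4, "1-letter": 1}


def toy_score(features):
    return sum(toy_weights[f] for f in features)


history = greedy_select(["upon", "there"], ["2-letter", "1-letter"], toy_score)
final_features, final_score = history[-1]
```

To probe the overfitting question, `score` can be computed on papers held out from training rather than on the training papers themselves; a score that keeps rising on training data while falling on held-out data is the usual warning sign.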

First few greedy iterations

Class        Output of lda             Feature added
6 M 6 H      h m h h m h m m h m h m   2-letter words
12 M 0 H     m m m m m m m m m m m m   upon
12 M 0 H     m m m m m m m m m m m m   1-letter words
11 M 1 H     m m m m m h m m m m m m   4-letter words
12 M 0 H     m m m m m m m m m m m m   there
12 M 0 H     m m m m m m m m m m m m   enough
11 M 1 H     m m m m m m h m m m m m   whilst
11 M 1 H     m m m m m m h m m m m m   15-letter words

Listserv Data

• 70 Listserv archives

• Over 1 million e-mail messages

• Data was gathered by Andrei Anghelescu
– http://mms-02.rutgers.edu/ListServ/

Our Data

• One Listserv, “CINEMA-L”

• 992 authors, 41263 messages

• We look at 3 authors
– sstone: 1077 messages
– thea70: 1253 messages
– jmiles_2: 1481 messages

Frustration

Feature Selection

• How do we find “good” features?

More Frustration

A Measure of Variance

Summary of LDA Results

• Ran LDA using “I”, “is”, and “think”

• Trained on 80%, tested on 20%

• Correctly classified 122/186 documents
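
The evaluation protocol can be sketched as a hold-out split followed by an accuracy count. The slide does not say how the 80/20 split was drawn; the random shuffle and the helper names below are assumptions of this example. The reported figure of 122 correct out of 186 test documents works out to roughly 65.6% accuracy.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    # Shuffle a copy with a fixed seed, then hold out test_fraction of the items.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    # Fraction of documents whose predicted author matches the true author.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Figure from the slide: 122 of 186 test documents classified correctly.
reported_accuracy = 122 / 186

train_ids, test_ids = train_test_split(range(10))
```

For e-mail data it may also be worth splitting by thread or by date, since quoted replies can leak one author's text into another author's messages.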

Future Work

• Finish our 3 author experiment

• Use more and different features
– Structural
– E-mail specific features

• Analyze the relationship among features

• Other authorship id problems
– Many authors
– Odd-man-out

Thanks!!!

