Data Mining Final Project
Nick Foti, Eric Kee

Topic: Author Identification

Author Identification
- Given writing samples, can we determine who wrote them?
- This is a well-studied field
  - See also: stylometry
- This has been applied to works such as
  - The Bible
  - Shakespeare
  - Modern texts as well

Corpus Design
- A corpus is a body of text used for linguistic analysis
- Used Project Gutenberg to create the corpus
- The corpus was designed as follows
  - Four authors of varying similarity
    - Anne Brontë
    - Charlotte Brontë
    - Charles Dickens
    - Upton Sinclair
  - Multiple books per author
  - Corpus size: 90,000 lines of text

Dataset Design
- Extracted features common in literature
  - Word length
  - Frequency of glue words
  - See Appendix A and [1,2] for the list of glue words
- Note: corpus was processed using C#, Matlab, Python
- Data set parameters are
  - Number of dimensions: 309
    - Word length and 308 glue words
  - Number of observations: 3,000
    - Each observation is 30 lines of text from a book
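The feature extraction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `extract_features`, the short `GLUE_WORDS` list, the tokenization rule, and the choice of relative frequency (count divided by total words) are all assumptions made here.

```python
import re

# Hypothetical short glue-word list; the study used the 308 words in Appendix A.
GLUE_WORDS = ["a", "and", "but", "of", "the", "to"]

def extract_features(text, glue_words=GLUE_WORDS):
    """Return [average word length, relative frequency of each glue word]."""
    # Lowercase and keep letters and apostrophes so contractions stay whole.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return [0.0] * (1 + len(glue_words))
    avg_len = sum(len(w) for w in words) / len(words)
    # Assumption: frequency means count divided by the number of words.
    freqs = [words.count(g) / len(words) for g in glue_words]
    return [avg_len] + freqs

sample = "The cat and the dog ran to the end of the road"
features = extract_features(sample)
```

With the full glue-word list, each 30-line observation would yield one vector of 309 values: the average word length followed by 308 frequencies.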

Classifier Testing and Analysis
- Tested classifier with test data
  - Used testing and training data sets
    - 70% for training, 30% for testing
  - Used cross-validation
- Analyzed classifier performance
  - Used ROC plots
  - Used confusion matrices

- Used a common plotting scheme
[Figure: example plot. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Example values: 78%, 22%, 55%, 45%. Red dots indicate true-positive cases.]
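A 70/30 split and a confusion matrix can be sketched as below. The helper names, the fixed random seed, and the toy labels are assumptions for illustration only; the project's own splitting procedure is not shown in the slides.

```python
import random

def train_test_split(observations, train_fraction=0.7, seed=0):
    """Shuffle and split observations into training and testing lists."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(true_labels, predicted_labels, classes):
    """Count outcomes; rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

train, test = train_test_split(range(10))
# Toy labels: AB = Anne Bronte, CB = Charlotte Bronte.
cm = confusion_matrix(["AB", "AB", "CB", "CB"], ["AB", "CB", "CB", "CB"], ["AB", "CB"])
```

Dividing each row of the matrix by its row total gives per-class percentages of the kind reported in the plots.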

Binary Classification

Word Length Classification
- Calculated average word length for each observation
- Computed Gaussian kernel density from word length samples
- Used ROC curve to calculate cutoff
  - Optimized sensitivity and specificity with equal importance
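One way to weight sensitivity and specificity equally is to pick the cutoff that maximizes their sum. The sketch below assumes that reading and works directly on sample values rather than on a kernel density estimate; the function name, the direction of the rule (larger values count as positive), and the numbers are made up for illustration.

```python
def best_cutoff(positive_values, negative_values):
    """Pick the cutoff that maximizes sensitivity + specificity.

    An observation is called positive when its value is >= the cutoff.
    """
    candidates = sorted(set(positive_values) | set(negative_values))
    best = None
    for cutoff in candidates:
        sensitivity = sum(v >= cutoff for v in positive_values) / len(positive_values)
        specificity = sum(v < cutoff for v in negative_values) / len(negative_values)
        score = sensitivity + specificity
        # Keep the first cutoff that reaches the highest score.
        if best is None or score > best[0]:
            best = (score, cutoff, sensitivity, specificity)
    return best[1], best[2], best[3]

# Made-up average word lengths for two hypothetical authors.
author_a = [4.6, 4.7, 4.8, 4.9]
author_b = [4.1, 4.2, 4.3, 4.65]
cutoff, sens, spec = best_cutoff(author_a, author_b)
```

Each candidate cutoff corresponds to one point on the ROC curve, so scanning the candidates is equivalent to walking along the curve.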

Word Length: Anne B. vs. Upton S.
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Values shown: 100%, 0%, 100%, 0%.]

Word Length: Brontë vs. Brontë
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Values shown: 100%, 0%, 78.1%, 21.9%.]

Principal Component Analysis
- Used PCA to find a better axis
- Notice: distribution is similar to the word length distribution
- Is word length the only useful dimension?
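Finding the first principal axis can be sketched with power iteration on the sample covariance matrix. This is only an illustration under stated assumptions: the function name and the toy data are invented here, and the project most likely used a library routine rather than this hand-rolled loop.

```python
def first_principal_component(rows, iterations=200):
    """Estimate the leading principal axis by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    axis = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        axis = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = sum(a * a for a in axis) ** 0.5
        axis = [a / norm for a in axis]
    # Project each centered observation onto the estimated axis.
    scores = [sum(c[j] * axis[j] for j in range(d)) for c in centered]
    return axis, scores

# Toy data lying close to the direction (1, 2).
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
axis, scores = first_principal_component(data)
```

The density of `scores` for each author is what a PCA density plot would compare against the word length density.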

[Figure: Anne Brontë vs. Upton Sinclair. Panels: Word Length Density, PCA Density.]

Principal Component Analysis

- It appears that word length is the most useful axis
  - We'll come back to this
[Figure: PCA density without word length, Anne Brontë vs. Upton Sinclair.]

K-Means
- Used K-means to find dominant patterns
  - Unnormalized
  - Normalized
- Trained K-means on training set
- To classify observations in test set
  - Calculate distance of observation to each class mean
  - Assign observation to the closest class
- Performed cross-validation to estimate performance
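The classification step described above (distance to each class mean, then nearest class) can be sketched as follows. The sketch covers only that step and assumes Euclidean distance and class means computed from labeled training observations; the function names and feature values are hypothetical.

```python
import math

def class_means(training):
    """training maps a class label to a list of feature vectors."""
    means = {}
    for label, rows in training.items():
        d = len(rows[0])
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return means

def classify(observation, means):
    """Assign the observation to the class with the nearest mean (Euclidean distance)."""
    return min(means, key=lambda label: math.dist(observation, means[label]))

# Made-up two-feature observations for two authors.
training = {
    "Anne B.": [[4.2, 0.30], [4.3, 0.32]],
    "Upton S.": [[4.8, 0.25], [4.9, 0.27]],
}
means = class_means(training)
label = classify([4.25, 0.31], means)
```

Because Euclidean distance is dominated by whichever dimensions have the largest scale, normalizing the features first can change which class mean is closest.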

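Unnormalized K-means: Anne Brontë vs. Upton Sinclair
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Values shown: 98.1%, 1.9%, 92.1%, 7.9%.]

Unnormalized K-means: Anne Brontë vs. Charlotte Brontë
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Values shown: 95.7%, 4.3%, 74.7%, 25.3%.]

Normalized K-means: Anne Brontë vs. Upton Sinclair
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Values shown: 53.3%, 46.7%, 49.4%, 50.6%.]

Normalized K-means: Anne Brontë vs. Charlotte Brontë
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Values shown: 15.8%, 84.2%, 86.7%, 13.3%.]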

Discriminant Analysis
- Performed discriminant analysis
  - Computed with equal covariance matrices
    - Used average Omega of class pairs
  - Computed with unequal covariance matrices
    - Quadratic discrimination fails because covariance matrices have 0 determinant (see equation below)
- Computed theoretical misclassification probability
- To perform quadratic discriminant analysis
  - Compute Equation 1 for each class
  - Choose class with minimum value
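For reference, a standard form of the quadratic discriminant score is given below. It is an assumption that Equation 1 matches this form; the symbols $\mu_k$ (class mean), $\Omega_k$ (class covariance matrix), and $\pi_k$ (class prior) are chosen here for illustration, and the prior term may not have been part of the original.

```latex
\begin{equation}
d_k(x) = (x - \mu_k)^{\top} \Omega_k^{-1} (x - \mu_k) + \ln \lvert \Omega_k \rvert - 2 \ln \pi_k
\end{equation}
```

An observation $x$ is assigned to the class with the smallest $d_k(x)$. When $\lvert \Omega_k \rvert = 0$, neither $\Omega_k^{-1}$ nor $\ln \lvert \Omega_k \rvert$ is defined, which is consistent with the failure noted above. With a single shared covariance matrix, the log-determinant term is the same for every class and the rule becomes linear in $x$.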

Discriminant Analysis: Anne Brontë vs. Upton Sinclair
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Upton Sinclair TP, Upton Sinclair FP. Values shown: 92.2%, 3.8%, 96.2%, 7.8%.]
- Theoretical P(err) = 0.149
- Empirical P(err) = 0.116

Discriminant Analysis: Anne Brontë vs. Charlotte Brontë
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Values shown: 92.7%, 7.3%, 89.2%, 10.8%.]
- Theoretical P(err) = 0.181
- Empirical P(err) = 0.152

Logistic Regression
- Fit linear model to training data on all dimensions
  - Threw out singular dimensions
  - Left with 298 coefficients + intercept
- Projected training data onto synthetic variable
  - Found threshold by minimizing error of misclassification
- Projected testing data onto synthetic variable
  - Used threshold to classify points
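The projection and thresholding steps above can be sketched as follows. The model fit itself is not shown; the coefficients, intercept, and data are hypothetical, and the choice of candidate thresholds (midpoints between neighbouring training scores) is an assumption made for this sketch.

```python
def project(rows, coefficients, intercept):
    """Project each observation onto the fitted linear predictor."""
    return [intercept + sum(c * x for c, x in zip(coefficients, r)) for r in rows]

def best_threshold(scores, labels):
    """Return the threshold with the fewest training misclassifications.

    labels are 0 or 1; a score >= threshold is classified as 1.
    """
    ordered = sorted(set(scores))
    # Candidate thresholds: midpoints between neighbouring scores, plus both extremes.
    candidates = [ordered[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)

    def errors(t):
        return sum((s >= t) != bool(y) for s, y in zip(scores, labels))

    return min(candidates, key=errors)

# Hypothetical coefficients; in the study they came from the fitted model.
coefficients, intercept = [2.0, -1.0], 0.5
train_rows = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [1.2, 0.0]]
train_labels = [0, 0, 1, 1]
threshold = best_threshold(project(train_rows, coefficients, intercept), train_labels)
test_scores = project([[0.1, 0.9], [1.1, 0.2]], coefficients, intercept)
predictions = [int(s >= threshold) for s in test_scores]
```

The threshold is chosen on training scores only and then reused unchanged on the test scores, which mirrors the order of the steps listed above.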

Logistic Regression: Anne Brontë vs. Charlotte Brontë
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Charlotte B. TP, Charlotte B. FP. Values shown: 89.5%, 10.5%, 8%, 92%.]

Logistic Regression: Anne Brontë vs. Upton Sinclair
[Figure: classification plot. Panels: Anne B. TP, Anne B. FP, Upton S. TP, Upton S. FP. Values shown: 98%, 2%, 99%, 2%.]

4-Class Classification

4-Class K-means
- Used K-means to find patterns among all classes
  - Unnormalized
  - Normalized
- Trained using a training set
- Tested performance as in 2-class K-means
- Performed cross-validation to estimate performance

Unnormalized K-Means
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US); cells are marked TP for correct classifications and FP otherwise. Values shown: 22%, 54%, 87%, 34%, 59%, 88%.]

Normalized K-Means
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US); cells are marked TP for correct classifications and FP otherwise. Values shown: 20%, 67%, 26%, 67%, 70%, 67%, 27%.]

Additional K-means testing
- Also tested K-means without word length
  - Recall that we had perfect classification with 1D word length (see plot below)
  - Is K-means using only 1 dimension to classify?
- Note: perfect classification only occurs between Anne B. and Sinclair
[Figure: Anne Brontë vs. Upton Sinclair.]

Unnormalized K-Means (No Word Length)
- K-means can classify without word length
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US); cells are marked TP for correct classifications and FP otherwise. Values shown: 35%, 29%, 44%, 33%, 35%, 43%, 72%.]

Multinomial Regression
- Multinomial distribution
  - Extension of binomial distribution
  - Random variable is allowed to take on n values
- Used multinom() to fit log-linear model for training
  - Used 248 dimensions (max limit on computer)
  - Returns 3 coefficients per dimension and 3 intercepts
- Found probability that each observation belongs to each class

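Multinomial Regression
- The multinomial logit function, in its standard baseline-category form, is

  $$\Pr(y_i = j) = \frac{\exp(c_j + \beta_j^{\top} x_i)}{1 + \sum_{k=1}^{3} \exp(c_k + \beta_k^{\top} x_i)}, \quad j = 1, 2, 3$$

  where $x_i$ is the feature vector of observation $i$, $\beta_j$ are the coefficients, and $c_j$ are the intercepts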

To classify
- Compute probabilities Pr(y_i = Dickens), Pr(y_i = Anne B.), ...
- Choose the class with maximum probability
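The classification rule above can be sketched as follows, assuming a baseline-category parameterization in which one reference class has its linear predictor fixed at zero and each of the other three classes has its own intercept and coefficient vector. Which author served as the reference class is not stated in the slides; putting Dickens first is an assumption, and all numeric values are hypothetical.

```python
import math

def class_probabilities(x, coefficients, intercepts, classes):
    """Baseline-category multinomial logit probabilities.

    classes[0] is the reference class with linear predictor fixed at 0;
    coefficients[j] and intercepts[j] belong to classes[j + 1].
    """
    predictors = [0.0] + [
        c + sum(b * xi for b, xi in zip(beta, x))
        for beta, c in zip(coefficients, intercepts)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(predictors)
    weights = [math.exp(p - top) for p in predictors]
    total = sum(weights)
    return {cls: w / total for cls, w in zip(classes, weights)}

classes = ["Dickens", "Anne B.", "Charlotte B.", "Sinclair"]
# Hypothetical coefficients for two features; one row per non-reference class.
coefficients = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
intercepts = [0.1, -0.2, 0.0]
probs = class_probabilities([0.2, 0.9], coefficients, intercepts, classes)
predicted = max(probs, key=probs.get)
```

Three coefficient rows and three intercepts for four classes match the "3 coefficients per dimension and 3 intercepts" noted earlier.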

Multinomial Regression
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US); cells are marked TP for correct classifications and FP otherwise. Values shown: 78%, 86%, 83%, 93%.]

Multinomial Regression

- Multinomial regression does not require word length
[Figure: 4-class confusion matrix (without word length) for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US); cells are marked TP for correct classifications and FP otherwise. Values shown: 76%, 79%, 79%, 91%.]

Appendix A: Glue Words
I a aboard about above across after again against ago ahead all almost along alongside already also although always am amid amidst among amongst an and another any anybody anyone anything anywhere apart are aren't around as aside at away back backward backwards be because been before beforehand behind being below between beyond both but by can can't cannot could couldn't dare daren't despite did didn't directly do does doesn't doing don't done down during each either else elsewhere enough even ever evermore every everybody everyone everything everywhere except fairly farther few fewer for forever forward from further furthermore had hadn't half hardly has hasn't have haven't having he hence her here hers herself him himself his how however if in indeed inner inside instead into is isn't it its itself just keep kept later least less lest like likewise little low lower many may mayn't me might mightn't mine minus more moreover most much must mustn't my myself near need needn't neither never nevertheless next no no-one nobody none nor not nothing notwithstanding now nowhere of off often on once one ones only onto opposite or other others otherwise ought oughtn't our ours ourselves out outside over own past per perhaps please plus provided quite rather really round same self selves several shall shan't she should shouldn't since so some somebody someday someone something sometimes somewhat still such than that the their theirs them themselves then there therefore these they thing things this those though through throughout thus till to together too towards under underneath undoing unless unlike until up upon upwards us versus very via was wasn't way we well were weren't what whatever when whence whenever where whereas whereby wherein wherever whether which whichever while whilst whither who whoever whom with whose within why without will won't would wouldn't yet you your yours yourself yourselves

Conclusions
- Authors can be identified by their word usage frequencies
  - Word length may be used to distinguish between the Brontë sisters
  - Word length does not, however, extend to all authors (see Appendix C)
  - The glue words describe genuine differences between all four authors
- K-means distinguishes the same patterns that multinomial regression classifies
  - This indicates that supervised training finds legitimate patterns, rather than artifacts
- The Brontë sisters are the most similar authors
- Upton Sinclair is the most different author

Appendix B: Code
See attached .R files

Appendix C: Single Dimension 4-Author Classification
- Classification using multinomial regression
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US); cells are marked TP for correct classifications and FP otherwise. Values shown: 22%, 46%, 94%, 6%, 11%, 54%, 96%, 3%.]

References

[1] Argamon, Saric, Stein, Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results, SIGKDD 2003.

[2] Mitton, Spelling checkers, spelling correctors and the misspellings of poor spellers, Information Processing and Management, 1987.