Data Mining Final Project
Nick Foti, Eric Kee
Topic: Author Identification
Author Identification
- Given writing samples, can we determine who wrote them?
- This is a well-studied field
  - See also: stylometry
- This has been applied to works such as
  - The Bible
  - Shakespeare
  - Modern texts as well
Corpus Design
- A corpus is a body of text used for linguistic analysis
- Used Project Gutenberg to create the corpus
- The corpus was designed as follows
  - Four authors of varying similarity
    - Anne Brontë
    - Charlotte Brontë
    - Charles Dickens
    - Upton Sinclair
  - Multiple books per author
  - Corpus size: 90,000 lines of text
Dataset Design
- Extracted features common in literature
  - Word length
  - Frequency of glue words
    - See Appendix A and [1,2] for the list of glue words
- Note: the corpus was processed using C#, Matlab, and Python
- Data set parameters are
  - Number of dimensions: 309 (word length and 308 glue words)
  - Number of observations: 3,000
    - Each observation is 30 lines of text from a book
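The feature extraction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the tokenizer, the use of relative (rather than raw) frequencies, the abbreviated `GLUE_WORDS` list, and all function names are assumptions made for the example. The full glue-word list is in Appendix A.

```python
import re
from collections import Counter

# Hypothetical, abbreviated glue-word list; the full list is in Appendix A.
GLUE_WORDS = ["a", "about", "and", "but", "of", "the", "to"]

LINES_PER_OBSERVATION = 30  # from the slides: each observation is 30 lines


def tokenize(text):
    """Lowercase the text and return words, keeping internal apostrophes and hyphens."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def extract_features(lines):
    """Return [average word length, relative frequency of each glue word...]."""
    words = tokenize(" ".join(lines))
    if not words:
        return [0.0] * (1 + len(GLUE_WORDS))
    counts = Counter(words)
    avg_len = sum(len(w) for w in words) / len(words)
    # Assumption: frequency means occurrences divided by total words.
    return [avg_len] + [counts[g] / len(words) for g in GLUE_WORDS]


def observations(book_lines, size=LINES_PER_OBSERVATION):
    """Split a book into consecutive blocks of `size` lines, dropping any short tail."""
    n_full = len(book_lines) // size
    return [book_lines[i * size:(i + 1) * size] for i in range(n_full)]
```

With the full 308-word list, `extract_features` would return the 309 dimensions listed above, one row per 30-line observation.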
Classifier Testing and Analysis
- Tested classifier with test data
  - Used testing and training data sets: 70% for training, 30% for testing
  - Used cross-validation
- Analyzed classifier performance
  - Used ROC plots
  - Used confusion matrices
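A minimal sketch of the 70/30 split and of a row-normalized confusion matrix, written in Python for illustration. The function names, the fixed random seed, and the convention that rows are true authors and columns are predicted authors are assumptions, not details taken from the project.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_rates(true_labels, predicted_labels):
    """Row-normalized confusion matrix: rates[t][p] = share of class t predicted as p."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    rates = {}
    for t in classes:
        total = sum(counts[t].values())
        rates[t] = {p: (counts[t][p] / total if total else 0.0) for p in classes}
    return rates
```

Under this convention, the diagonal entry `rates[a][a]` plays the role of an author's true-positive rate, and each row sums to 100%.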
- Used a common plotting scheme (example below)
[Figure (example): Anne B. TP 78%, FP 22%; Charlotte B. TP 55%, FP 45%. Red dots indicate true-positive cases.]
Binary Classification
Word Length Classification
- Calculated average word length for each observation
- Computed Gaussian kernel density from word length samples
- Used ROC curve to calculate cutoff
  - Optimized sensitivity and specificity with equal importance
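One way to weight sensitivity and specificity equally is to pick the cutoff that maximizes their sum. The Python sketch below does this directly on the raw samples; it skips the kernel density step, assumes the "positive" author has the larger word lengths, and uses a hypothetical function name, so it illustrates the idea rather than the exact procedure used.

```python
def best_cutoff(positive_values, negative_values):
    """Pick the cutoff maximizing sensitivity + specificity.

    An observation is called positive when its value is >= cutoff.
    """
    candidates = sorted(set(positive_values) | set(negative_values))
    best = None
    for c in candidates:
        sensitivity = sum(v >= c for v in positive_values) / len(positive_values)
        specificity = sum(v < c for v in negative_values) / len(negative_values)
        score = sensitivity + specificity
        # Keep the first cutoff that reaches the highest score.
        if best is None or score > best[0]:
            best = (score, c, sensitivity, specificity)
    _, cutoff, sensitivity, specificity = best
    return cutoff, sensitivity, specificity
```

When the two authors' word lengths do not overlap, the returned sensitivity and specificity are both 1.0, which corresponds to a perfectly separating cutoff.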
Word Length: Anne B. vs. Upton S.
[Figure: Anne B. TP 100%, FP 0%; Upton Sinclair TP 100%, FP 0%.]
Word Length: Brontë vs. Brontë
[Figure: Anne B. TP 100%, FP 0%; Charlotte B. TP 78.1%, FP 21.9%.]
Principal Component Analysis
- Used PCA to find a better axis
- Notice: the distribution is similar to the word length distribution
- Is word length the only useful dimension?
[Figure: Anne Brontë vs. Upton Sinclair. Panels: Word Length Density, PCA Density.]
Principal Component Analysis
- It appears that word length is the most useful axis
  - We'll come back to this
[Figure: Anne Brontë vs. Upton Sinclair, PCA Density without word length.]
K-Means
- Used K-means to find dominant patterns
  - Unnormalized
  - Normalized
- Trained K-means on training set
- To classify observations in test set
  - Calculate distance of observation to each class mean
  - Assign observation to the closest class
- Performed cross-validation to estimate performance
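The training and assignment steps above can be sketched with Lloyd's algorithm in plain Python. The initial centroids, Euclidean distance, and function names are assumptions for the example. How each cluster is matched to an author is not stated in the slides and is left out here.

```python
import math


def mean(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest(row, centroids):
    """Index of the centroid closest to `row` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(row, centroids[i]))


def kmeans(rows, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for row in rows:
            clusters[nearest(row, centroids)].append(row)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else old for c, old in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

A test observation is then classified with `nearest(observation, centroids)`. For a normalized variant, each dimension would be rescaled before calling `kmeans`; the exact normalization used in the project is not given.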
Unnormalized K-means: Anne Brontë vs. Upton Sinclair
[Figure: Anne B. TP 98.1%, FP 1.9%; Upton Sinclair TP 92.1%, FP 7.9%.]
Unnormalized K-means: Anne Brontë vs. Charlotte Brontë
[Figure: Anne B. TP 95.7%, FP 4.3%; Charlotte B. TP 74.7%, FP 25.3%.]
Normalized K-means: Anne Brontë vs. Upton Sinclair
[Figure: Anne B. TP 53.3%, FP 46.7%; Upton Sinclair TP 49.4%, FP 50.6%.]
Normalized K-means: Anne Brontë vs. Charlotte Brontë
[Figure: Anne B. TP 15.8%, FP 84.2%; Charlotte B. TP 86.7%, FP 13.3%.]
Discriminant Analysis
- Performed discriminant analysis
  - Computed with equal covariance matrices
    - Used average Omega of class pairs
  - Computed with unequal covariance matrices
    - Quadratic discrimination fails because covariance matrices have 0 determinant (see equation below)
- Computed theoretical misclassification probability
- To perform quadratic discriminant analysis
  - Compute Equation 1 for each class
  - Choose class with minimum value
[Equation 1]
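For reference, the textbook quadratic discriminant score for class k is shown below. It may differ in form from the Equation 1 used on the slide. The symbols are assumptions: x is an observation, \mu_k the class mean, \Omega_k the class covariance matrix (following the slide's use of Omega), and \pi_k the class prior.

```latex
d_k(x) = (x - \mu_k)^{\top} \Omega_k^{-1} (x - \mu_k) + \ln\lvert\Omega_k\rvert - 2\ln\pi_k,
\qquad
\hat{k}(x) = \arg\min_k \, d_k(x)
```

In this form the failure is easy to see: when \lvert\Omega_k\rvert = 0, the inverse \Omega_k^{-1} does not exist and \ln\lvert\Omega_k\rvert is undefined, so the score cannot be evaluated.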
Discriminant Analysis: Anne Brontë vs. Upton Sinclair
[Figure: Anne B. TP 92.2%, FP 7.8%; Upton Sinclair TP 96.2%, FP 3.8%.]
- Theoretical P(err) = 0.149
- Empirical P(err) = 0.116
Discriminant Analysis: Anne Brontë vs. Charlotte Brontë
[Figure: Anne B. TP 92.7%, FP 7.3%; Charlotte B. TP 89.2%, FP 10.8%.]
- Theoretical P(err) = 0.181
- Empirical P(err) = 0.152
Logistic Regression
- Fit linear model to training data on all dimensions
  - Threw out singular dimensions
  - Left with 298 coefficients + intercept
- Projected training data onto synthetic variable
  - Found threshold by minimizing error of misclassification
- Projected testing data onto synthetic variable
  - Used threshold to classify points
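The projection and threshold search can be sketched in Python as below. It assumes the synthetic variable is the fitted linear predictor (intercept plus weighted sum of features) and that candidate thresholds are midpoints between neighbouring training scores. Both choices, and the function names, are illustrative assumptions.

```python
def linear_score(row, coefficients, intercept):
    """Project an observation onto the synthetic variable (the linear predictor)."""
    return intercept + sum(c * x for c, x in zip(coefficients, row))


def best_threshold(scores, labels):
    """Find the threshold on 1-D scores minimizing training misclassifications.

    `labels` are 0/1; an observation is predicted 1 when its score > threshold.
    """
    ordered = sorted(set(scores))
    # Candidate thresholds: below all scores, midpoints between neighbours, above all.
    candidates = [ordered[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)

    def errors(threshold):
        return sum((s > threshold) != bool(y) for s, y in zip(scores, labels))

    # On ties, min() keeps the lowest candidate threshold.
    return min(candidates, key=errors)
```

Test observations are then scored with `linear_score` and compared against the threshold found on the training scores.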
Logistic Regression: Anne Brontë vs. Charlotte Brontë
[Figure: Anne B. TP 89.5%, FP 10.5%; Charlotte B. TP 92%, FP 8%.]
Logistic Regression: Anne Brontë vs. Upton Sinclair
[Figure: Anne B. TP 98%, FP 2%; Upton S. TP 99%, FP 2%.]
4-Class Classification
4-Class K-means
- Used K-means to find patterns among all classes
  - Unnormalized
  - Normalized
- Trained using a training set
- Tested performance as in 2-class K-means
- Performed cross-validation to estimate performance
Unnormalized K-Means
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US). Percentages shown: 22%, 54%, 87%, 34%, 59%, 88%.]
Normalized K-Means
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US). Percentages shown: 20%, 67%, 26%, 67%, 70%, 67%, 27%.]
Additional K-means Testing
- Also tested K-means without word length
  - Recall that we had perfect classification with 1D word length (see plot below)
  - Is K-means using only 1 dimension to classify?
- Note: perfect classification only occurs between Anne B. and Sinclair
[Figure: Anne Brontë vs. Upton Sinclair.]
Unnormalized K-Means (No Word Length)
- K-means can classify without word length
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US). Percentages shown: 35%, 29%, 44%, 33%, 35%, 43%, 72%.]
Multinomial Regression
- Multinomial distribution
  - Extension of binomial distribution
  - Random variable is allowed to take on n values
- Used multinom() to fit log-linear model for training
  - Used 248 dimensions (max limit on computer)
  - Returns 3 coefficients per dimension and 3 intercepts
- Found probability that each observation belongs to each class
Multinomial Regression
- The multinomial logit function is
  [Equation]
  where β_j are the coefficients and c_j are the intercepts
- To classify
  - Compute probabilities Pr(y_i = Dickens), Pr(y_i = Anne B.), ...
  - Choose class with maximum probability
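A Python sketch of the classification step, assuming a baseline-category multinomial logit in which one class has its linear predictor fixed at 0. That form is consistent with the 3 sets of coefficients and 3 intercepts reported for 4 classes, but the choice of baseline class and all names below are assumptions. Under it, Pr(y_i = j) is proportional to exp(c_j + β_j · x_i), and the baseline class is proportional to 1.

```python
import math


def class_probabilities(row, coefficients, intercepts, baseline="baseline"):
    """Baseline-category multinomial logit probabilities.

    `coefficients[j]` and `intercepts[j]` hold beta_j and c_j for every
    non-baseline class j; the baseline class has linear predictor 0.
    """
    scores = {baseline: 0.0}
    for j in coefficients:
        scores[j] = intercepts[j] + sum(b * x for b, x in zip(coefficients[j], row))
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores.values())
    weights = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def classify(row, coefficients, intercepts, baseline="baseline"):
    """Choose the class with maximum probability."""
    probs = class_probabilities(row, coefficients, intercepts, baseline)
    return max(probs, key=probs.get)
```

Subtracting the largest score does not change the probabilities; it only avoids overflow when a linear predictor is large.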
Multinomial Regression
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US). Percentages shown: 78%, 86%, 83%, 93%.]
Multinomial Regression
- Multinomial regression does not require word length
[Figure: 4-class confusion matrix (without word length) for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US). Percentages shown: 76%, 79%, 79%, 91%.]
Appendix A: Glue Words
I a aboard about above across after again against ago ahead all almost along alongside already also although always am amid amidst among amongst an and another any anybody anyone anything anywhere apart are aren't around as aside at away back backward backwards be because been before beforehand behind being below between beyond both but by can can't cannot could couldn't dare daren't despite did didn't directly do does doesn't doing don't done down during each either else elsewhere enough even ever evermore every everybody everyone everything everywhere except fairly farther few fewer for forever forward from further furthermore had hadn't half hardly has hasn't have haven't having he hence her here hers herself him himself his how however if in indeed inner inside instead into is isn't it its itself just keep kept later least less lest like likewise little low lower many may mayn't me might mightn't mine minus more moreover most much must mustn't my myself near need needn't neither never nevertheless next no no-one nobody none nor not nothing notwithstanding now nowhere of off often on once one ones only onto opposite or other others otherwise ought oughtn't our ours ourselves out outside over own past per perhaps please plus provided quite rather really round same self selves several shall shan't she should shouldn't since so some somebody someday someone something sometimes somewhat still such than that the their theirs them themselves then there therefore these they thing things this those though through throughout thus till to together too towards under underneath undoing unless unlike until up upon upwards us versus very via was wasn't way we well were weren't what whatever when whence whenever where whereas whereby wherein wherever whether which whichever while whilst whither who whoever whom with whose within why without will won't would wouldn't yet you your yours yourself yourselves
Conclusions
- Authors can be identified by their word usage frequencies
- Word length may be used to distinguish between the Brontë sisters
  - Word length does not, however, extend to all authors (see Appendix C)
- The glue words describe genuine differences between all four authors
- K-means distinguishes the same patterns that multinomial regression classifies
  - This indicates that supervised training finds legitimate patterns, rather than artifacts
- The Brontë sisters are the most similar authors
- Upton Sinclair is the most different author
Appendix B: Code
See attached .R files
Appendix C: Single Dimension 4-Author Classification
Classification using multinomial regression
[Figure: 4-class confusion matrix for Charles Dickens (CD), Anne Brontë (AB), Charlotte Brontë (CB), and Upton Sinclair (US). Percentages shown: 22%, 46%, 94%, 6%, 11%, 54%, 96%, 3%.]
References
[1] Argamon, Saric, Stein, Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results, SIGKDD 2003.
[2] Mitton, Spelling checkers, spelling correctors and the misspellings of poor spellers, Information Processing and Management, 1987.