
Classify, but Verify: Breaking the Closed-World Assumption in Stylometric Authorship Attribution

Ariel Stolerman, Rebekah Overdorf, Sadia Afroz and Rachel Greenstadt
The Privacy, Security and Automation Lab

Drexel University, Philadelphia, PA

stolerman,rjo43,sa499,[email protected]

Abstract

Forensic stylometry is a form of authorship attribution that relies on the linguistic information found in a document. While there has been significant work in stylometry, most research focuses on the closed-world problem, where the document's author is in a known suspect set. For open-world problems, where the author may not be in the suspect set, traditional classification methods are ineffective. We propose the Classify-Verify method, which augments classification with a binary verification step; it is evaluated on stylometric datasets but can be generalized to any domain. We suggest augmentations to an existing distance-based authorship verification method by adding per-feature standard deviations and per-author threshold normalization. The Classify-Verify method significantly outperforms traditional classifiers in open-world settings (p-val < 0.01) and attains an F1-score of 0.87, comparable to traditional classifier performance in closed-world settings. Moreover, Classify-Verify successfully detects adversarial documents where authors deliberately change their style, a setting in which closed-world classifiers fail.


Keywords. Forensic Stylometry, Authorship Attribution, Authorship Verification, Abstaining Classification, Machine Learning.

1 Introduction

The web is full of anonymous communication with high value for digital forensics. For this purpose, forensic stylometry is used to analyze anonymous communication in order to “de-anonymize” it. Classic stylometric analysis requires an exact set of suspects in order to perform reliable authorship attribution, a setting that is often not met in real-world problems. In this paper we break the closed-world assumption and explore a novel method for forensic stylometry that copes with the possibility that the true author is not in the set of suspects.

Stylometry is a form of authorship recognition that relies on the linguistic information found in a document. While stylometry existed before computers and artificial intelligence, the field is currently dominated by AI techniques such as neural networks and statistical pattern recognition. State-of-the-art stylometry approaches can identify individuals in sets of 50 authors with over 90% accuracy (Abbasi and Chen, 2008), and have even scaled to over 100,000 authors (Narayanan et al., 2012). Stylometry is currently used in intelligence analysis and forensics, with increasing interest in digital communication analysis (Wayman et al., 2009). The 2009 Technology Assessment for the State of the Art Biometrics Excellence Roadmap (SABER) commissioned by the FBI stated that, “As non-handwritten communications become more prevalent, such as blogging, text messaging and emails, there is a growing need to identify writers not by their written script, but by analysis of the typed content (Wayman et al., 2009).” With the continuous increase in rigor, accuracy and scale of stylometric techniques, law practitioners turn to stylometry for forensic evidence (Chaski, 2013; Juola, 2013), although it is at times considered controversial and nontrivial to admit in court (Clark, 2011).

The effectiveness of stylometry has considerable implications for anonymous and pseudonymous speech. Recent work has exposed limits on stylometry through active circumvention (Brennan et al., 2012; McDonald et al., 2012). Stylometry has thus far focused on limited, closed-world models. In the classic stylometry problem, there are relatively few authors (usually fewer than 20, nearly always fewer than 100), the set of possible authors is known, every author has a large training set and all the text is from the same genre. However, problems faced in the real world often do not conform to these restrictions.

Controversial, pseudonymous documents that are published on the Internet often have an unbounded suspect list. Even if the list is known with certainty, training data may not exist for all suspects. Nonetheless, classical stylometry requires a fixed list and training data for each suspect, and an author is always selected from this list. This is problematic both for forensic analysts, who have no way of knowing when widening their suspect pool is required, and for Internet activists, who may appear in these suspect lists and be falsely accused of writing certain documents.

We explore a mixed closed-world and open-world authorship attribution problem where we have a known set of suspect authors, but with some probability (known or unknown) that the author we seek is not in that set. The key contributions of this paper are:

1) We present the Classify-Verify method. This novel method augments authorship classification with a verification step, and obtains accuracy on open-world problems similar to that of traditional classifiers on closed-world problems. Even in the closed-world case, Classify-Verify can improve results by replacing wrongly identified authors with “unknown.” Our method can be tuned to different levels of rigidity, to achieve desired false positive and false negative error rates. Alternatively, it can be tuned automatically, whether or not we know the expected proportion of documents by authors in the suspect list versus those that are absent.

2) Classify-Verify performs better in adversarial settings than traditional classification. Previous work has shown that traditional classification performs near random chance when faced with writers who change their style. Classify-Verify filters out most of the attacks in the Extended-Brennan-Greenstadt Adversarial corpus (Brennan et al., 2012), an improvement over previous work, which requires training on adversarial data for attack detection (Afroz et al., 2012).

In addition we present the Sigma Verification method. This method is based on Noecker and Ryan's (Noecker and Ryan, 2012) distractorless verification method, which measures the distance between an author and a document. Sigma Verification incorporates pairwise distances within the author's documents and the standard deviations of the author's features; although it is not shown to statistically outperform the distractorless method in all cases, it is a better-suited alternative for datasets with certain characteristics.

In Sec. 2 we present a formal definition of the closed-world, open-world, verification, and classify-verify problems. Sec. 3 contextualizes our contribution in terms of related work. Sec. 4 describes the datasets we worked with. In Sec. 5 and Sec. 6 we discuss the closed-world classification and binary verification methodologies used later by our Classify-Verify method, presented in Sec. 7. Sec. 8 presents the evaluation of the Classify-Verify method on standard and adversarial datasets. We conclude with a discussion of interesting results (Sec. 9), how Classify-Verify can be applied in other security and privacy domains, and directions for future work (Sec. 10).

2 Problem Statement

2.1 Definitions

In order to understand the problem we explore in this paper, we first describe the pure closed-world and open-world stylometry problems.

The closed-world stylometry problem, namely authorship attribution, is: given a document D of unknown authorship and documents by a set of known authors A = {A1, ..., An}, determine the author Ai ∈ A of D. This problem assumes D's author is in A. The open-world stylometry problem is: given a document D, identify who its author is. Authorship verification is a slightly relaxed (yet still very hard) version: given a document D and an author A, determine whether D is written by A.

Finally, the problem we explore is a mixture of the two above: given a document D of unknown authorship and documents by a set of known authors A, determine the author Ai ∈ A of D, or determine that D's author is not in A. This problem is similar to the attribution problem, with the addition of the class “unknown.” An extended definition also includes p = Pr[AD ∈ A], the probability that D's author is in the set of candidates.

For the rest of the paper, we look at test documents in two settings: when the authors of these documents are in the set of suspects, denoted in-set, and when these documents are by an author outside the suspect set, denoted not-in-set.

2.2 Hypothetical Scenario

Consider Bob's workplace, which he shares with n−1 other employees under the management of Alice. Bob gets up to get a cup of coffee and incautiously forgets to lock his computer. When he returns to his desk he discovers that a vicious (and sufficiently long) email has been sent in his name to Alice! He quickly goes to Alice in order to explain, and Alice decides to check the authorship of the email to assert Bob's innocence (or refute it). Luckily, Alice has access to the company's email database, so she can model the writing style of her n employees. Unluckily, the security guard at the door tends to doze off every once in a while, resulting in unauthorized people wandering the company's halls, such that the portion of authorized people in the office at any given time is p.

A closed-world system would only be able to consider the n employees and identify one of them as the culprit. This would be problematic if the email was written by one of the unauthorized entrants. A Classify-Verify approach would be able to consider this possibility.

2.3 Problems with Closed-World Models

Applying closed-world stylometry in open-world settings suffers from a fundamental flaw: a closed-world classifier will always output some author in the suspect set. If it outputs an author, it merely means the document in question is written in a style more similar to that author's style than to the others', and the probability estimates of the classifier reflect only who is the least-worst choice. Meanwhile, the absence of the document's author from the set of suspects remains unknown. If we relax the precision of our results to k-accuracy (Narayanan et al., 2012), i.e. aim to narrow down our set of suspects to k rather than just one, the problem is not solved: all k options will still be wrong.

This problem becomes especially prominent in online domains, where the number of potential suspects can be virtually unbounded. Failing to address the limitations of closed-world models may result in falsely attributed authors, with consequences for both the forensic analyst and the innocent Internet user.

3 Related Work

3.1 Open-World Classification

Open-world classification deals with scenarios in which the set of classes is not known in advance. Approaches include unsupervised, semi-supervised and abstaining classification. Unsupervised stylometry clusters instances based on their feature vector distances (Abbasi and Chen, 2008; Koppel et al., 2011a). Semi-supervised methods have been proposed to identify clusters (Sorio et al., 2010) which are later used in supervised classification. Abstaining classifiers refrain from classification to improve the classifier's reliability in certain situations, for example, to minimize the misclassification rate by rejecting the results when the classifier's confidence is low (Pietraszek, 2005; Chow, 1970; Herbei and Wegkamp, 2006). The Classify-Verify method is an abstaining classifier that rejects or accepts an underlying classifier's output using a verification step based on the distances between the test author and the predicted author. The primary novelty of this work is that, unlike other stylometric techniques, Classify-Verify considers the open-world situation where the author may not be in the suspect set.

Another approach is to build a model of the closed world and reject everything that does not fit it. Although approaches like this have been criticized for network intrusion detection (Sommer and Paxson, 2010), distance-based methods for anomaly detection work well in biometric authentication systems (Araujo et al., 2005; Killourhy, 2012; Lee and Cho, 2007).

3.2 Authorship Classification

In authorship classification, one of the authors in a fixed suspect set is attributed to the test document. Current stylometry methods achieve over 80% accuracy with 100 authors (Abbasi and Chen, 2008), over 30% accuracy with 10,000 authors (Koppel et al., 2011b), and over 20% precision with 100,000 authors (Narayanan et al., 2012). None consider the case where the true author is missing. Though stylometric techniques for classification work well, they can be easily circumvented by imitating another person or by deliberate obfuscation (Brennan and Greenstadt, 2009).

3.3 Authorship Verification

In authorship verification we aim to determine whether or not a document D is written by an author A. This problem is harder than the closed-world stylometry problem discussed above. Authorship verification is essentially a one-class classification problem, on which a reasonable amount of research has been done, mostly with support vector machines (Tax, 2001; Scholkopf et al., 2001; Manevitz and Yousef, 2007). However, little research has been done in the domain of stylometry.

Most previous work addresses verification for plagiarism detection (van Halteren, 2004; Clough, 2000; Meyer zu Eissen et al., 2007). The unmasking algorithm (Koppel et al., 2007) is an example of a general approach to verification that relies on measuring the “depth-of-difference” between document and author models. It reaches 99% accuracy with similar false positive and false negative rates; however, it is limited to tasks with large training data (Sanderson and Guenter, 2006).

Noecker and Ryan (Noecker and Ryan, 2012) propose the distractorless verification method, which avoids using negative samples to model the class “not the author.” They use simplified feature sets constructed only of character or word n-grams, a normalized dot-product (cosine) distance and an acceptance threshold. They evaluate their approach on two corpora (Juola, 2004; Potthast et al., 2011), and report up to 88% and 92% accuracy. Their work provides a verification framework robust across different types of writing (language-, genre- and length-independent). However, their results also suffer from low F-score measurements (up to 47% and 51%), which suggests a skew in the test data (testing more non-matching document-author pairs than matching ones). A closer look at this method along with its error rates is given in Sec. 6.2.

The main novelty of this work is in utilizing verification to elevate closed-world attribution to open-world, which, to the best of our knowledge, has not been done before.

4 Corpora

We experiment with two corpora: the Extended-Brennan-Greenstadt (EBG) Adversarial corpus (Brennan et al., 2012) and the ICWSM 2009 Spinn3r Blog dataset (blog corpus) (Burton et al., 2009).

The EBG corpus contains writings of 45 different authors, with at least 6,500 words per author. It also contains adversarial documents, in which the authors change their writing style either by hiding it (obfuscation attack) or by imitating another author (imitation attack). Most of the evaluations in this paper are done using the EBG corpus.

The Spinn3r blog corpus, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The posts include the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. This dataset has been previously used in Internet-scale authorship attribution (Narayanan et al., 2012). We use a subcorpus of 50 blogs with at least 7,500 words each as our blog corpus. We use the blog corpus as a control, evaluated under the same settings as the EBG corpus, in order to avoid overfitting configurations to the latter and to generalize our conclusions.

5 Closed-World Setup

Throughout the paper we utilize a closed-world classifier, both for baseline results to evaluate different methods and as the underlying classifier for the Classify-Verify method presented in Sec. 7. We use Weka's (Hall et al., 2009) linear-kernel sequential minimal optimization support vector machine (Platt, 1998) (SMO SVM) with complexity parameter C = 1. SVMs are chosen due to their proven effectiveness for stylometry (Juola, 2008), specifically for the EBG corpus (Brennan et al., 2012).

In addition to the classifier selection, another important part of a stylometric analysis algorithm is the feature set used to quantify the documents prior to learning and classification. The EBG corpus was originally quantified using the Writeprints feature set (Brennan et al., 2012), based on the Writeprints algorithm (Abbasi and Chen, 2008), which has been shown accurate on a high number of authors (over 90% accuracy for 50 authors). Writeprints uses a complex feature set that quantifies different linguistic levels of the text, including lexical, syntactic, and content-related features (see the original paper for details); however, for simplicity we choose to use a feature set that consists of only one type of feature. We evaluated the EBG corpus using 10-fold cross-validation¹ with the k most common word n-grams or character n-grams, with k from 50 to 1000 in steps of 50, and n from 1 to 5 in steps of 1. The most-common feature selection heuristic is commonly used in stylometry (Abbasi and Chen, 2008; Koppel and Schler, 2004; Noecker and Ryan, 2012) to improve performance and avoid over-fitting, as are the chosen ranges of k and n. F1-score results for character n-grams are illustrated in Fig. 1.
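To make the feature-quantification step concrete, here is a minimal sketch (not the JStylo/JGAAP implementation) of building the k most common character n-gram features over a training corpus and representing a document by their relative frequencies; the function names and the toy corpus are ours.

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character n-grams of a document."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_k_features(train_docs, n=2, k=500):
    """The k most common character n-grams over the training corpus."""
    counts = Counter()
    for doc in train_docs:
        counts.update(char_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(k)]

def relative_freq_vector(doc, features, n=2):
    """Relative frequencies of the chosen features, normalized by the
    document's total n-gram count to remove length effects."""
    counts = Counter(char_ngrams(doc, n))
    total = sum(counts.values()) or 1
    return [counts[f] / total for f in features]

# Toy usage, roughly mirroring the <500, 2>-chars feature set:
train_docs = ["the quick brown fox", "a slow brown dog"]
features = top_k_features(train_docs, n=2, k=500)
vector = relative_freq_vector("an unseen test document", features, n=2)
```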

Of word and character n-grams, the latter outperform the former, with the best F1-score results attained with character bigrams at ≈ 0.93 (for k = 400 and above), compared to the best score of 0.879 for words, attained using n = 1 and k = 1000. Both feature sets outperform the original EBG evaluation with Writeprints, at an F1-score of 0.832 (Brennan et al., 2012). Finally, we choose the 500 most common character bigrams as our feature set (F1-score of 0.928), denoted 〈500, 2〉-chars, used throughout all of our experiments. It is chosen for its simplicity, performance and effectiveness.

¹In k-fold cross-validation the dataset is randomly divided into k equally-sized subsets (folds), where each subset is tested against models trained on the other subsets, and the final result is the weighted average of all subset results.

Figure 1: F1-scores for evaluation of the EBG corpus using different character n-grams with varying limits of the feature set size. [Chart: F1-score vs. number of most common n-grams (0 to 1000), curves for char 1-grams through 5-grams and Writeprints.]

As a control, we evaluated the effectiveness of using 〈500, 2〉-chars compared to the Writeprints feature set on the blog corpus, via 10-fold cross-validation with an SMO SVM classifier. Although both results were lower than those obtained for the EBG corpus, 〈500, 2〉-chars outperformed Writeprints with an F1-score of 0.64 versus 0.509, respectively.

Feature extraction for all experiments in this paper is done using the JStylo² and JGAAP³ authorship attribution framework APIs.

²https://github.com/psal
³http://evllabs.com/jgaap/w

6 Verification

In authorship verification we aim to determine whether or not a document D is written by an author A, a problem for which two naïve approaches suggest themselves. The first and most intuitive is to reduce the problem to closed-world settings by creating a model for not-A (simply from documents not written by A) and training a binary classifier. This method suffers from a fundamental flaw: if D is attributed to A, it merely means D's style is less distant from A than it is from not-A [however, the opposite direction, when D is attributed to not-A, is useful (Koppel et al., 2007)]. Another approach is to train a binary model of D versus A, and test the model on itself using cross-validation; if D is written by A, we expect the accuracy to be close to random, due to the indistinguishability of the models. However, this method does not work well, and requires D to contain a substantial amount of text for cross-validation, an uncommon privilege in real-world scenarios.

In the next sections we discuss and evaluate several verification methods. The first family of methods is classifier-induced verifiers, which require an underlying (closed-world) classifier and utilize its class-probability output for verification.

The second family of methods is standalone verifiers, which rely on a model built using the author's training data, independent of other authors or classifiers. We evaluate two verification methods: the first is the distractorless verification method, denoted V, developed by Noecker and Ryan (Noecker and Ryan, 2012). It is presented as a baseline, as it is a straightforward verification method, proven robust across different domains, and does not use a distractor set (a model of “not-A”). Then we present the novel Sigma Verification method, which applies adjustments to V for increased accuracy: adding per-feature standard deviation normalization (denoted Vσ) and adding per-author threshold normalization (denoted Vᵃ; the method with both adjustments combined is denoted Vᵃσ). Finally, we evaluate and compare V with its new variants.

6.1 Classifier-Induced Verification

One promising aspect of the closed-world model that can be used in open-world scenarios is the confidence in the solution given by distance-based classifiers. A higher confidence in an author may, naturally, indicate that the author is in the suspect set, while a lower confidence may indicate that he is not and that this problem is, in fact, an open-world situation.

Following classification, verification can be formulated simply by setting an acceptance threshold t, measuring the confidence of the classifier in its classification, and accepting the classification if and only if it is above t.

Next we discuss several verification schemes based on the classification probabilities output by closed-world classifiers. For each test document D with suspect authors A = {A1, ..., An}, a classifier produces a list of probabilities PA1, ..., PAn, where PAi is, according to the classifier, the probability that D is written by Ai (and ΣPAi = 1). We denote by P1, ..., Pn the reverse order statistic of the PAi, i.e. P1 is the highest probability given to some author (the chosen author), P2 the second highest, and so on.

These methods are obviously limited to classify-verify scenarios, as verification is dependent on classification results (thus they are not evaluated in this section, but rather in Sec. 8 as part of the Classify-Verify evaluation). For this purpose, and in order to extract the probability measurements required by the following methods, we use SMO SVM classifiers with the 〈500, 2〉-chars feature set for all of our experiments in Sec. 8. We fit logistic regression models to the SVM outputs to obtain proper probability estimates. The classifier-induced verification methods we evaluate are the following:

1) P1: The first measurement we look at is simply the classifier's probability output for the chosen author, namely P1. The hypothesis behind this measurement is that as the likelihood that the top author is the true author increases, relative to all others, so does its corresponding probability.

2) P1-P2-Diff: Another measurement we consider is the difference between the classifier's probability outputs for the chosen and second-to-chosen authors, i.e. P1 − P2, denoted the P1-P2-Diff method.

3) Gap-Conf: The last classifier-induced method we consider is the gap confidence (Paskov, 2010). Here we do not train one SVM classifier; instead, for all n authors, we train n corresponding one-versus-all SVMs. For a given document D, each classifier i in turn produces 2 probabilities: the probability that D is written by Ai and the probability that it is not (i.e. the probability it is written by an author other than Ai). For each i, denote the probability that D is written by Ai as pi(Yes|D); then the gap confidence, denoted Gap-Conf, is the difference between the highest and second-highest pi(Yes|D). The hypothesis is similar to P1-P2-Diff: the probability of the true author should be much higher than that of the second-best choice.
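As a rough illustration (our sketch, not the paper's code), the three classifier-induced measures can be computed from calibrated class probabilities as follows; the probability arrays below are assumed to come from the SMO SVM with fitted logistic regression, and the example values are made up.

```python
import numpy as np

def p1(probs):
    """P1: the classifier's probability for the chosen (top) author."""
    return float(np.max(probs))

def p1_p2_diff(probs):
    """P1-P2-Diff: difference between the two highest class probabilities."""
    top_two = np.sort(probs)[-2:]
    return float(top_two[1] - top_two[0])

def gap_conf(one_vs_all_yes_probs):
    """Gap-Conf: difference between the highest and second-highest
    p_i(Yes|D) produced by n one-versus-all classifiers."""
    top_two = np.sort(one_vs_all_yes_probs)[-2:]
    return float(top_two[1] - top_two[0])

# Made-up example: probabilities over 4 suspects from one classifier,
# and 'yes' probabilities from 4 one-vs-all classifiers.
probs = np.array([0.55, 0.25, 0.15, 0.05])
ova_yes = np.array([0.80, 0.35, 0.20, 0.10])
print(p1(probs), p1_p2_diff(probs), gap_conf(ova_yes))
```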

6.2 Standalone Verification

6.2.1 V: Distractorless Verification

In this section we describe the distractorless verification method, denoted V, proposed by Noecker and Ryan (Noecker and Ryan, 2012). As discussed in Sec. 3, V uses straightforward distance combined with a threshold: set an acceptance threshold t, model document D and author A as feature vectors, measure the distance between them, and determine that D is written by A if the distance is below t.

The algorithm begins with preprocessing of the documents (D and A's), in which whitespace and character case are standardized. The authors use character n-grams and word n-grams as feature sets, extracted using a sliding-window technique, due to their known high performance in stylometry, simplicity, fast calculation and robustness against errors that are found in more complex feature extractors, like part-of-speech taggers. Let n denote the size of the chosen feature set.

Next, a model M = 〈m1, m2, ..., mn〉 is built from the centroid of the feature vectors of A's documents. For each i, mi is the average relative frequency of feature i across A's documents, where relative frequency is used to eliminate the effect of document length variation. In addition, a feature vector F = 〈f1, f2, ..., fn〉 is extracted from D, where fi corresponds to feature i's relative frequency in D.

Finally, a distance function δ and a threshold t are set, such that if δ(x, y) < δ(x, z), x is considered to be closer to y than to z. The authors use the normalized dot-product (cosine distance), defined as:

δ(M, F) = (M · F) / (‖M‖ ‖F‖) = Σⁿᵢ₌₁ mᵢfᵢ / ( √(Σⁿᵢ₌₁ mᵢ²) · √(Σⁿᵢ₌₁ fᵢ²) )

as it has been shown effective for stylometry (Noecker and Juola, 2009) and efficient for large-scale datasets. Note that the authors define “closer to” using > instead of <, which is consistent with the normalized dot-product (where 1 is a perfect match). However, we use < as the more intuitive direction (according to which a smaller distance means a better match), and accordingly adjust the cosine distance δ in the equation above to 1 − δ.

The threshold t is set such that we determine that D is written by A when δ(M, F) < t. Ideally, it is empirically determined by analysis of the average δ between the author's training documents. However, although this is mentioned in their paper, they evaluate their method using a hard-coded threshold that does not take these author-wise δ's into account (which Vᵃ does, as seen next).
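The following is a minimal sketch of V as described above, assuming documents are already represented as relative-frequency feature vectors; the 1 − cosine adjustment reflects the “smaller is closer” convention used here, and the function names are ours.

```python
import numpy as np

def author_centroid(author_vectors):
    """Model M: centroid of the author's relative-frequency feature vectors."""
    return np.mean(np.asarray(author_vectors, dtype=float), axis=0)

def cosine_distance(m, f):
    """1 minus the normalized dot product, so a smaller value means a closer match."""
    denom = float(np.linalg.norm(m)) * float(np.linalg.norm(f))
    if denom == 0.0:
        return 1.0  # maximal distance for an all-zero vector
    return 1.0 - float(np.dot(m, f)) / denom

def verify_V(doc_vector, author_vectors, t):
    """Distractorless verification: accept authorship iff delta(M, F) < t."""
    doc = np.asarray(doc_vector, dtype=float)
    return cosine_distance(author_centroid(author_vectors), doc) < t
```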

6.2.2 Vσ, Vᵃ, Vᵃσ: Sigma Verification

We apply two adjustments to V:

1) Vσ: Per-Feature SD Normalization – The first improvement is based on the variance of the author's writing style. If an author has a rather unvaried style, we want a tighter bound for verification, whereas for a more varied style we can loosen the model to be more accepting. To do so we use the standard deviation of the author, denoted SD, on a per-feature basis. For each author, we determine the SD of each of his features. When computing the distance between an author and a document, we divide each feature-distance by its SD, so if the SD is smaller, A and D move closer together; otherwise they move farther apart. This idea is applied in (Araujo et al., 2005) for authentication through typing biometrics; however, to the best of our knowledge, this is the first use of this method for stylometric verification.

2) Vᵃ: Per-Author Threshold Normalization – The second improvement we offer is to adjust the verification threshold t on a per-author basis, based on the average pairwise distance between all of the author's documents, denoted δA. V does not take this into account and instead uses a hard threshold. Using δA to determine the threshold is, intuitively, an improvement because it accounts for how spread out the documents of an author are. This allows the model to relax if the author has a more varied style. Similarly to V, this “varying” threshold is still applied by setting a fixed threshold t across all authors, determined empirically over the training set; however, for Vᵃ every author-document distance measurement δ is adjusted by subtracting δA prior to being compared with t, thus allowing per-author thresholds while still requiring the user to set only one fixed threshold value.

Tab. 1 details the differences in distance calculations and threshold tests among V, Vσ and Vᵃ. We denote by δD,A the overall distance measured by some distance metric δ between the feature vector of document D and the centroid vector of author A across all his documents, denoted C(A). In addition, we denote the respective feature-level representation of δ as follows: δD,A = ∆(Di, C(A)i) for i = 1, ..., n, where n is the number of features (the dimension of D and C(A)). Finally, we define σ(A) as the standard deviation vector of author A's features, and δA as the average pairwise distance between all of A's documents.

Note that using Vᵃ may produce nonintuitive thresholds (e.g. negative thresholds when using cosine distance, which normally produces values in [0, 1]). However, this is only to adjust for the shift from the distance δD,A (used in V) to δD,A − δA used by Vᵃ, i.e. it is a byproduct of the per-author threshold normalization.


Distance \ Test                                       |  δ < t  |  δ − δA < t
δD,A  = ∆(Di, C(A)i), i = 1, ..., n                   |    V    |     Vᵃ
δσD,A = ∆(Di/σ(A)i, C(A)i/σ(A)i), i = 1, ..., n       |    Vσ   |     Vᵃσ

Table 1: Differences in distance calculation and t-threshold test for V, Vσ and Vᵃ.
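A sketch of the two adjustments summarized in Tab. 1 (our simplified rendering, not the authors' implementation): Vσ divides both the document and centroid features by the author's per-feature standard deviation before measuring the distance, and Vᵃ subtracts the author's average pairwise distance δA before the threshold test. The small epsilon guarding against zero standard deviations is our own addition, and cosine_distance is the same helper as in the previous sketch.

```python
import numpy as np
from itertools import combinations

EPS = 1e-9  # our guard against zero per-feature SD (not part of the paper)

def cosine_distance(m, f):
    """1 minus the normalized dot product (as in the previous sketch)."""
    denom = float(np.linalg.norm(m)) * float(np.linalg.norm(f))
    return 1.0 if denom == 0.0 else 1.0 - float(np.dot(m, f)) / denom

def sigma_distance(doc_vec, author_vectors):
    """V_sigma distance: per-feature SD normalization of document and centroid."""
    A = np.asarray(author_vectors, dtype=float)
    centroid, sd = A.mean(axis=0), A.std(axis=0) + EPS
    return cosine_distance(centroid / sd, np.asarray(doc_vec, dtype=float) / sd)

def avg_pairwise_distance(author_vectors):
    """delta_A: average pairwise distance among the author's own documents."""
    pairs = list(combinations(author_vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(np.asarray(x, float), np.asarray(y, float))
               for x, y in pairs) / len(pairs)

def verify_Va_sigma(doc_vec, author_vectors, t):
    """V^a_sigma: accept iff (sigma-normalized distance - delta_A) < t."""
    return sigma_distance(doc_vec, author_vectors) - avg_pairwise_distance(author_vectors) < t
```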

6.3 Standalone Verification: Evaluation

We evaluate the above methods on the EBG corpus, with the blog corpus as a control. The evaluation is done by examining false positive and false negative error rates, visualized via ROC curves. The EBG corpus is evaluated only on the non-adversarial documents, and the blog corpus is evaluated in its entirety. The evaluation uses 10-fold cross-validation with the 〈500, 2〉-chars feature set, where author models are built using the training documents. In each fold, every test document is tested against every one of the author models, including its own (which is, of course, trained on other documents by that author).

ROC curves for the evaluation of V, Vσ and Vᵃσ are shown in Fig. 2 and Fig. 3 for the EBG and blog corpora, respectively.

Figure 2: ROC curves for V, Vσ and Vᵃσ evaluation on the EBG corpus. [Chart: TPR vs. FPR, curves for V, Vσ and Vᵃσ.]

Figure 3: ROC curves for V, Vσ and Vᵃσ evaluation on the blog corpus. [Chart: TPR vs. FPR, curves for V, Vσ and Vᵃσ.]

On the EBG corpus Vσ significantly outperforms V (seen clearly from FP = 0.05 and above), at a confidence level of p-val < 0.01. V appears to outperform Vᵃσ up to FP = 0.114, at which point Vᵃσ begins to outperform V, at a confidence level of p-val < 0.01. However, these results change for the blog control corpus, where V significantly outperforms both Vσ and Vᵃσ. One way to explain the differences is by the characteristics of the corpora: EBG is a “cleaner” and more stylistically consistent corpus, consisting entirely of formal English writing samples (essays written originally for business or academic purposes), whereas the blog dataset contains less structured and formal language, which may reduce the distinguishable effects of style-variance normalization. This is also supported by the increased performance on EBG compared to the blog corpus (larger AUC). Clearly the results suggest that no one method is preferable over the others, and selecting a verifier for a problem should rely on empirical testing over stylistically similar training data.

As for the effect of adding the per-author threshold adjustment, for both corpora Vσ outperforms Vᵃσ at low FPR until they intersect (at FP = 0.27 and FP = 0.22 for the EBG and blog corpora, respectively), at which point Vᵃσ begins to outperform Vσ. These properties allow various verification approaches to be used as needed, depending on the false positive and false negative error rate constraints the problem at hand may impose.

7 Classify-Verify

The main novelty and contribution of this paper is introducing the Classify-Verify method. This method is an abstaining classifier (Chow, 1970), i.e. a classifier that refrains from classification in certain cases to reduce misclassifications. Classify-Verify combines classification with verification to expand closed-world authorship problems to open-world ones, essentially adding another class: “unknown”. Another aspect of the novelty of our approach is the utilization of abstaining classification methods to upgrade from closed-world to open-world: we evaluate how methods for thwarting wrongly classified instances apply to misclassifications that originate from the true author being outside the assumed suspect set, rather than simply from missing the true suspect.

First, closed-world classification is applied to the document in question, D, and the author suspect set A = {A1, ..., An} (with their sample documents). Then, the output of the classifier, Ai ∈ A, is given to the verifier to determine the final output. Feeding only the classifier's result into the verifier utilizes the high accuracy attainable by classifiers, which outperform verifiers in closed-world settings, thus focusing the verifier only on the top-choice author in A (or the least-worst choice of all). The verifier determines whether to accept Ai or reject it by returning ⊥, based on a verification threshold t. Classify-Verify is essentially a classifier over the suspect set A ∪ {⊥}.

The threshold t selection process can be automated with respect to varying expected portions of in-set and not-in-set documents. We denote the likelihood of D's author being in A, the expected in-set document fraction, as p = Pr[AD ∈ A] (making the expected not-in-set fraction 1 − p). In addition, we use the notation p-〈measure〉, which refers to that measure's weighted average with respect to p. For instance, p-F1 is the weighted F1-score, weighted over the F1-scores of the p expected in-set documents and the 1 − p expected not-in-set documents. t can thus be determined in several ways:

1) Manual: The threshold t can be manually set by the user. The threshold determines the sensitivity of the verifier, so setting t manually allows adjusting it from strict to relaxed, where the stricter it is, the less likely it is to accept the classifier's output. This allows tuning the algorithm to different settings, imposing the desired rigidity (expressed in limiting false positive and false negative error rates).

2) p-Induced Threshold: The threshold can be set empirically over the training set to the one that maximizes the target measurement, e.g. F1-score, in an automated process. In case p is given, the algorithm applies cross-validation on the training data alone using the range of all relevant manually set thresholds (at some preset decimal-precision increments) and chooses the threshold that yields the best target measurement. This is essentially applying Classify-Verify recursively on the training data, one level deeper, with a range of manual thresholds. The relevant threshold search range is determined automatically by the minimum and maximum distances observed in the verify phase of Classify-Verify.
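As a rough sketch of this search (under our own simplifying assumptions), the snippet below scans a grid of candidate thresholds between the minimum and maximum observed verification distances and keeps the one that maximizes the target measurement; p_f1_on_training_folds is an assumed helper that runs the recursive Classify-Verify cross-validation on the training data for a given threshold and p.

```python
import numpy as np

def choose_p_induced_threshold(min_dist, max_dist, p, p_f1_on_training_folds, steps=100):
    """Return the candidate threshold maximizing p-F1 on the training data.
    p_f1_on_training_folds(t, p) is assumed to evaluate Classify-Verify by
    cross-validation on the training set with threshold t and in-set portion p."""
    candidates = np.linspace(min_dist, max_dist, steps)
    scores = [p_f1_on_training_folds(t, p) for t in candidates]
    return float(candidates[int(np.argmax(scores))])
```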

Figure 4: 0.5-F1 results for Classify-Verify on EBG using SMO SVM and Vσ with varying manually-set thresholds. [Chart: F1-score vs. verification threshold (0 to 0.4), curves for in-set, not-in-set and 0.5-F1.]

Fig. 4 illustrates the automated threshold tuning process on the EBG corpus for p = 0.5 (i.e. 50% expected in-set documents). In this example, Classify-Verify uses SMO SVM as the classifier, Vσ as the verifier and manually-set thresholds. The target measurement is F1-score, i.e. the automated process would have chosen the threshold that maximizes 0.5-F1 (here at t = 0.15). The base assumption is that the threshold that maximizes the target measurement on the training set is also the best choice for the testing phase.

An observation that should be noted is that the maximum 0.5-F1 exceeds the F1-score at the intersection of the in-set and not-in-set curves. This is due to calculating 0.5-F1 in a “micro” rather than a “macro” fashion: the confusion matrices⁴ of in-set and not-in-set are aggregated in a weighted sum, and only then is the weighted 0.5-F1 calculated across all authors (and ⊥), rather than weighing the in-set and not-in-set F1-scores in a simple weighted average.

⁴A table in which each column represents the documents in a predicted class and each row represents the documents in an actual class.


3) in-set/not-in-set-Robust: If the expected in-set and not-in-set document proportion is unknown, we can apply the same idea as the previously described threshold. If we examine the Classify-Verify F1-score curve for some p along a range of thresholds, then as p increases, it favors smaller (more accepting) thresholds; therefore the curve behaves differently for different values of p. However, all curves intersect at one t: the threshold at which the in-set and not-in-set curves intersect. This is illustrated in Fig. 5, which presents the same results as Fig. 4, only for varying values of p (0.1 to 0.9, in steps of 0.1).

Figure 5: p-F1 results for Classify-Verify on EBG using SMO SVM and Vσ with varying manually-set thresholds and varying values of p. [Chart: F1-score vs. verification threshold (0 to 0.4), curves for 0.1-F1 through 0.9-F1.]

This can be utilized to automatically obtain a threshold robust to any value of p by taking the threshold that minimizes the difference between the p-F1 and q-F1 curves, for arbitrary p, q ∈ (0, 1) (for simplicity we use 0.3 and 0.7; minimizing the difference is an approximation of the true intersection). As illustrated in Fig. 5, the robust threshold does not guarantee the highest measurement; it does, however, guarantee a relatively high expected value of that measure, independent of p, and is thus robust for any open-world setting. We denote this measurement as p-〈measure〉R (for Robust), e.g. p-F1R.
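Under the same assumptions as the previous sketch, the robust threshold can be approximated by picking the candidate that minimizes the gap between two p-F1 curves (0.3 and 0.7 here, as in the paper); p_f1_on_training_folds is again the assumed cross-validation helper.

```python
import numpy as np

def choose_robust_threshold(min_dist, max_dist, p_f1_on_training_folds,
                            p_low=0.3, p_high=0.7, steps=100):
    """Approximate the intersection of the p-F1 curves by minimizing
    |p_low-F1(t) - p_high-F1(t)| over a grid of candidate thresholds."""
    candidates = np.linspace(min_dist, max_dist, steps)
    gaps = [abs(p_f1_on_training_folds(t, p_low) - p_f1_on_training_folds(t, p_high))
            for t in candidates]
    return float(candidates[int(np.argmin(gaps))])
```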

Finally, the entire Classify-Verify algorithm is described in Alg. 1, and the flow of the algorithm is illustrated in Fig. 6.

Figure 6: The flow of the Classify-Verify method on a test document D and a suspect set A, with optional threshold t and in-set portion p. [Diagram: the training set A is used to train the classifier CA and the per-author verifiers VA1, ..., VAn; D is classified to some Ai, whose verifier then accepts Ai or rejects it, given t (and optionally p).]

8 Evaluation and Results

8.1 Evaluation Methodology

8.1.1 Main

In our main experiment we evaluate the Classify-Verify method on the EBG corpus, excluding the adversarial documents, and on the blog corpus as a control. We evaluate the datasets in two settings: when the authors of the documents under test are in the set of suspects (in-set), and when they are not (not-in-set).

Each classification over n authors A = {A1, ..., An} can result in one of n + 1 outputs: an author Ai ∈ A or ⊥ (meaning “unknown”). Therefore, when the verifier accepts, the final result is the author Ai chosen by the classifier, and when it rejects, the final result is ⊥.

In the evaluation process we credit the Classify-Verify algorithm when the verification step thwarts misclassifications in in-set settings. For instance, if D is written by A but classified as B, and the verifier replaces B with ⊥, we count the result as correct. This approach for abstaining classifiers (Herbei and Wegkamp, 2006) relies on the fact that we would rather be truly told “unknown” than get a wrong author.

We evaluate the overall performance with 10-fold cross-validation. For each fold experiment, with 9 of the folds as the training set and 1 fold as the test set, we evaluate every test document twice: once as in-set and once as not-in-set.


Algorithm 1 Classify-Verify

Input: document D, suspect author set A = {A1, ..., An}, target measurement µ
Optional: in-set portion p, manual threshold t
Output: AD if AD ∈ A, and ⊥ otherwise

  CA ← classifier trained on A
  VA = {VA1, ..., VAn} ← verifiers trained on A
  if t, p not set then
      t ← threshold maximizing p-µR of Classify-Verify cross-validation on A
  else if t not set then
      t ← threshold maximizing p-µ of Classify-Verify cross-validation on A
  end if
  Ai ← CA(D)
  if VAi(D, t) = True then
      return Ai
  else
      return ⊥
  end if
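For concreteness, here is a compact Python rendering of Alg. 1's decision step (a sketch under simplifying assumptions, not the JStylo implementation): a trained closed-world classifier proposes an author and that author's verifier either accepts the attribution or returns ⊥. The classifier and verifier objects and their methods are placeholders, and the threshold is assumed to have been chosen by one of the schemes above.

```python
UNKNOWN = "⊥"  # the abstention class

def classify_verify(doc, classifier, verifiers, threshold):
    """Classify, then verify the top choice.

    classifier.predict(doc)         -> assumed: returns the most likely suspect
    verifiers[author].distance(doc) -> assumed: returns a verification distance
    """
    candidate = classifier.predict(doc)            # closed-world classification
    distance = verifiers[candidate].distance(doc)  # verify only the chosen author
    return candidate if distance < threshold else UNKNOWN
```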

For the classification phase of Classify-Verify over n authors, we train n (n − 1)-class classifiers, where each classifier Ci is trained on all authors except Ai. A test document by some author Ai is then classified as in-set using one of the n − 1 classifiers that were trained on Ai's training data; for simplicity we choose Ci+1 (and C1 for An). For the not-in-set classification, we simply use the classifier not trained on Ai, i.e. Ci.

For the verification phase of Classify-Verify, we evaluate several methods: one standalone method (Vσ for the EBG corpus and V for the blog corpus), Gap-Conf, P1 and P1-P2-Diff. We use Vσ for EBG and V for the blog corpus because these methods outperform the other standalone methods evaluated on the respective corpus, as discussed in Sec. 6.

The more the verifiers reject, the higher the precision (as bad classifications are thrown away), but recall decreases (as good classifications are thrown away as well); vice versa, higher acceptance increases recall but decreases precision. Therefore we measure overall performance using the F1-score, since it provides a balanced measurement of precision and recall:

F1-score = 2 × precision × recall / (precision + recall)

precision = tp / (tp + fp),   recall = tp / (tp + fn)

We use the two automatic verification-threshold selection methods discussed in Sec. 7. For the scenario in which the proportion of in-set and not-in-set documents is known, with in-set proportion p = 0.5 (and thus not-in-set proportion 1 − p), we use the p-induced threshold that maximizes the F1-score on the training set; for the scenario in which p is unknown, we use the robust threshold configured as described in Sec. 7. In order to calculate the F1-score when evaluating the test set, we combine the confusion matrices produced by the in-set and not-in-set evaluations into a p-weighted average matrix, from which the weighted F1-score is calculated. We denote p-induced F1-scores as p-F1, and robust-threshold-induced F1-scores evaluated at some p as p-F1R.
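The p-weighted combination can be sketched as follows (our interpretation of the “micro” aggregation described in Sec. 7, with an assumed matrix layout: rows are actual classes, columns are predicted classes, both including ⊥): the in-set and not-in-set confusion matrices are merged with weights p and 1 − p, and a support-weighted F1-score is then computed from the combined matrix.

```python
import numpy as np

def weighted_f1_from_confusion(conf):
    """Per-class F1 from a confusion matrix (rows = actual, cols = predicted),
    averaged with weights proportional to each class's support."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    precision = tp / np.maximum(conf.sum(axis=0), 1e-12)
    recall = tp / np.maximum(conf.sum(axis=1), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    support = np.maximum(conf.sum(axis=1), 1e-12)
    return float(np.average(f1, weights=support))

def p_weighted_f1(conf_in_set, conf_not_in_set, p):
    """Combine in-set and not-in-set confusion matrices with weights p and 1 - p,
    then compute the weighted F1 on the combined matrix."""
    combined = (p * np.asarray(conf_in_set, dtype=float)
                + (1 - p) * np.asarray(conf_not_in_set, dtype=float))
    return weighted_f1_from_confusion(combined)
```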

The threshold optimization phase of the Classify-Verify method discussed in Sec. 7 is done using 9-fold cross-validation with the same experimental settings as the main experiment. Since F1-score is used to evaluate the overall performance, it is also used as the target measurement to maximize in the automatic threshold optimization phase. When p is known, the threshold that maximizes p-F1 is selected; when it is unknown, the robust threshold is selected as the one at which the F1-scores for different p's intersect (arbitrarily set to 0.3-F1 and 0.7-F1).

As a baseline we compare F1-scores with 10-fold cross-validation results of closed-world classification using the underlying classifier, an SMO SVM with the 〈500, 2〉-chars feature set. We denote p-Base as the baseline F1-score of the closed-world classifier where the in-set proportion is p. It follows that 1-Base is the performance in pure closed-world settings (i.e. only in-set documents), and for any p ∈ [0, 1], p-Base = p · 1-Base (since for the not-in-set documents, the classifier is always wrong).

8.1.2 Adversarial Settings

To evaluate the Classify-Verify method in adversarial settings, we train our models on the non-adversarial documents in the EBG corpus, and test them on the imitation and obfuscation attack documents to measure how well Classify-Verify thwarts attacks (by returning ⊥ instead of a wrong author). In this context, ⊥ can be considered as either “unknown” or “possible attack”. We measure 0.5-F1, i.e. how well Classify-Verify performs on attack documents in an open-world scenario, where the verification threshold is set independently of a possible attack, tuned only to maximize performance on expected in-set and not-in-set document portions of 50% each.

As a baseline we compare results with standard classification using an SMO SVM with the 〈500, 2〉-chars feature set.

8.2 Results

Tab. 2 summarizes the F1-score terminology used throughout this section.

Term     | Meaning
p        | Portion of in-set documents
p-F1     | F1-score of Classify-Verify for a p portion of in-set documents
p-F1R    | F1-score of Classify-Verify for a p portion of in-set documents, using robust thresholds
p-Base   | F1-score of the closed-world classifier for a p portion of in-set documents;
         | 1-Base means pure closed-world settings; for each p ∈ [0, 1], p-Base = p · 1-Base

Table 2: F1-score references terminology.

8.2.1 Main

For the EBG corpus, the baseline closed-world classifier attains a 1-Base of 0.928 in perfect in-set settings, from which it follows that 0.5-Base = 0.464. For the blog corpus, 1-Base = 0.64, from which it follows that 0.5-Base = 0.32. Fig. 7 illustrates 0.5-F1 results of the Classify-Verify method on the EBG corpus, where the authors are equally likely to be in-set or not-in-set (p = 0.5) and verification thresholds are automatically selected to maximize 0.5-F1. A similar illustration of 0.5-F1 results for the blog corpus is shown in Fig. 8.

Figure 7: 0.5-F1 results of the Classify-Verify method on the EBG corpus, where the expected portions of in-set and not-in-set documents are equal (50%). Classify-Verify attains 0.5-F1 that outperforms 0.5-Base and even comes close to 1-Base. [Chart: 0.5-F1 of 0.749 (Vσ), 0.799 (Gap-Conf), 0.868 (P1) and 0.873 (P1-P2-Diff), against 0.5-Base and 1-Base reference lines.]

Figure 8: 0.5-F1 results of the Classify-Verify method on the blog corpus, where the expected portions of in-set and not-in-set documents are equal (50%). Classify-Verify attains 0.5-F1 that outperforms both 0.5-Base and 1-Base. [Chart: 0.5-F1 of 0.869 (V), 0.819 (Gap-Conf), 0.88 (P1) and 0.869 (P1-P2-Diff), against 0.5-Base and 1-Base reference lines.]

For both the EBG and blog corpora, Classify-Verify 0.5-F1 results significantly outperform 0.5-Base (the dashed lines), using any of the underlying verification methods, at a confidence level of p-val < 0.01.

Furthermore, the results are not only better than the obviously poor 0.5-Base, but are similar to 1-Base, giving overall 0.5-F1 in open-world settings of up to ≈ 0.87. For the EBG corpus, moving to open-world settings only slightly decreases the F1-score compared to the closed-world classifier's performance in closed-world settings (the dotted line), which is a reasonable penalty for upgrading to open-world settings. However, on the blog corpus, where the initial 1-Base is low (at 0.64), Classify-Verify manages both to upgrade to open-world settings and to outperform 1-Base. These results suggest that among the in-set documents, many misclassifications were thwarted by the underlying verifiers, leading to an overall increase in F1-score.

Next, we evaluate the robust threshold selection scheme. In this scenario, the portion of in-set documents p is unknown in advance. Fig. 9 illustrates p-F1R results for the EBG corpus, where different p scenarios are “thrown” at the Classify-Verify classifier, which uses robust verification thresholds. A similar illustration for the blog corpus is given in Fig. 10.

Figure 9: p-F1R results of the Classify-Verify method on the EBG corpus, where the expected portion of in-set documents p varies from 10% to 90% and is assumed to be unknown. Robust p-independent thresholds are used for the underlying verifiers. Classify-Verify attains p-F1R that outperforms the respective p-Base. [Chart: F1-score vs. p for Vσ, Gap-Conf, P1, P1-P2-Diff and p-Base.]

In the robust-thresholds scenario on the EBG corpus, Classify-Verify still significantly outperforms the respective closed-world classifier (p-Base results) for p < 0.7 with any of the underlying verifiers, at a confidence level of p-val < 0.01. For the blog corpus, Classify-Verify significantly outperforms p-Base using any of the classifier-induced verifiers for all p, at a confidence level of p-val < 0.01.

Moreover, the robust threshold selection hypothesis holds true, and for both corpora all methods (with the exception of V on the blog corpus) manage to guarantee a high F1-score, at ≈ 0.7 and above, for almost all values of p. For EBG, at p ≥ 0.7 the in-set portion is large enough that the overall p-Base becomes similar to p-F1R. For the blog corpus, using V fails and performs similarly to 0.5-Base.

Of all the verification methods, P1-P2-Diff proves the most preferable verifier to use, since it consistently outperforms the other methods across almost all values of p for both corpora, which implies it is robust to domain variation.

8.2.2 Adversarial Settings

Evaluated on the EBG corpus under imitation and obfuscation attacks, the baseline closed-world classifier attains F1-scores of 0 and 0.044 for the imitation and obfuscation attack documents, respectively. These results mean that the closed-world classifier is highly vulnerable to these types of attacks. Fig. 11 illustrates F1-scores for Classify-Verify on the attack documents. Note that all attack documents are written by in-set authors, and are thus treated as in-set documents.

Figure 10: p-F1R results of the Classify-Verify method on the blog corpus, where the expected portion of in-set documents p varies from 10% to 90% and is assumed to be unknown. Robust p-independent thresholds are used for the underlying verifiers. Classify-Verify attains p-F1R that outperforms the respective p-Base. [Chart: F1-score vs. p for V, Gap-Conf, P1, P1-P2-Diff and p-Base.]

The results suggest that Classify-Verify successfully thwarts the majority of the attacks, with F1-scores of up to 0.874 and 0.826 for the obfuscation and imitation attacks, respectively. These results are very close to the deception detection results reported in (Afroz et al., 2012), with F1-scores of 0.85 for obfuscation and 0.895 for imitation attacks. A major difference is that here these results are obtained in open-world settings, with a threshold configuration that does not take inside attacks into consideration. Moreover, as opposed to the methods applied in (Afroz et al., 2012), no attack documents were used as training data.

Interestingly, the results above are obtained for a standard p = 0.5 open-world scenario, without possible attacks in mind, yet the overall results are harmed little, if at all, depending on the underlying verifier. For instance, when using Gap-Conf, 0.5-F1 is 0.799 in non-attack scenarios, and F1-scores are 0.784–0.826 when under attack.

9 Discussion

In this section we discuss interesting observations from the results.

Figure 11: F1-score results of the Classify-Verify method on the EBG corpus under imitation and obfuscation attacks, where the expected portions of in-set and not-in-set documents are equal (50%). Classify-Verify successfully thwarts most of the attacks. [Chart: F1-scores for Vσ, Gap-Conf, P1 and P1-P2-Diff; imitation: 0.367, 0.826, 0.612, 0.693; obfuscation: 0.809, 0.784, 0.874, 0.874.]

In Sec. 6 we examined two families of verifiers, classifier-induced and standalone, later used by Classify-Verify. The results suggest that classifier-induced verifiers consistently outperform the standalone ones; however, this trend may be limited to large datasets with many suspect authors in the underlying classifier, like those evaluated in this paper, on which classifier-induced verification relies. It may be the case that on small author sets standalone verifiers will perform better, and this direction should be considered in future work. Moreover, the standalone verifiers presented here provide reasonable accuracy, and can be used in pure one-class settings, where no training data exists except that of the true author (a scenario in which classifier-induced methods are useless).

The Classify-Verify 0.5-F1 results on both the EBG and blog corpora, illustrated in Fig. 7 and Fig. 8, suggest that using P1 or P1-P2-Diff as the underlying verification method provides domain-independent results, at a 0.5-F1 of ≈ 0.87. The superiority of P1-P2-Diff is emphasized by the p-F1R results illustrated in Fig. 9 and Fig. 10, where p-F1R over 0.7 is obtained for both corpora, independent of p. Therefore P1-P2-Diff proves to be a robust verification method to use with Classify-Verify, independent of domain and of the in-set/not-in-set proportion.

Finally, Classify-Verify is shown to be effective in adversarial settings, where it outperformed the traditional closed-world classifier without requiring training on adversarial data, as required in (Afroz et al., 2012). Furthermore, no special threshold tuning is needed to achieve this protection, i.e., we can use the standard threshold selection schemes for non-adversarial settings and still thwart most attacks. It follows that results in adversarial settings can potentially be improved if p is tuned not to the likelihood of in-set documents, but to the likelihood of an attack.
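As a rough illustration of how such p-dependent tuning could look, the sketch below selects, from a grid of candidate thresholds, the one that maximizes a p-weighted mix of true accepts on held-out in-set documents and true rejects on held-out not-in-set documents. This is only a simplified surrogate for the paper's threshold-selection and p-F1 machinery, and the held-out scores shown are invented.

```python
import numpy as np

def choose_threshold(in_set_scores, not_in_set_scores, p, candidates):
    """Pick the verification threshold maximizing a p-weighted surrogate of
    correctness on held-out data: with probability p the true author is in the
    suspect set (we want to accept the classifier's choice), and with
    probability 1 - p they are not (we want to reject)."""
    in_set = np.asarray(in_set_scores, dtype=float)
    out_set = np.asarray(not_in_set_scores, dtype=float)
    best_t, best_score = None, -1.0
    for t in candidates:
        score = p * np.mean(in_set >= t) + (1 - p) * np.mean(out_set < t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical held-out verification scores for in-set and not-in-set documents
print(choose_threshold(in_set_scores=[0.32, 0.41, 0.28, 0.55],
                       not_in_set_scores=[0.05, 0.12, 0.09, 0.22],
                       p=0.5, candidates=np.linspace(0.0, 1.0, 101)))
```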

10 Conclusion

From a forensics perspective, the possibility of authors outside the suspect set makes the use of closed-world classifiers unreliable. In addition, whether linguistic authorship attribution can handle open-world scenarios has important privacy implications for both the authors of anonymous texts and those likely to be falsely implicated by faults in such systems. This research shows that when the closed-world assumption is violated, traditional stylometric approaches fail ungracefully.

The Classify-Verify method proposed in this work is not only able to handle open-world settings where the author of a document may not be in the training set, but can also improve results in closed-world settings by abstaining from low-confidence classification decisions. Furthermore, this method is able to filter out attacks, as demonstrated on the adversarial samples in the EBG corpus. In all these settings, the method is able to replace wrong assertions with more honest and useful statements of "unknown."
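A minimal sketch of the overall pipeline, under assumed interfaces rather than the paper's actual implementation, may help fix ideas: a closed-world classifier proposes a suspect, a verifier scores the match, and the method abstains whenever that score is below an acceptance threshold.

```python
def classify_verify(document, classify, verify, threshold):
    """Schematic Classify-Verify pipeline: `classify` maps a document to the
    most likely suspect in the training set, `verify` scores how well the
    document actually matches that suspect, and the pipeline abstains
    (returns None, standing in for the "unknown" symbol) whenever the
    verification score falls below the acceptance threshold."""
    suspect = classify(document)
    return suspect if verify(document, suspect) >= threshold else None

# Toy usage with stand-in functions (assumptions, not the paper's models)
classify = lambda doc: "author_A"      # closed-world classifier stub
verify = lambda doc, author: 0.42      # verification-score stub
print(classify_verify("questioned text", classify, verify, threshold=0.5))  # None
```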

We conclude that the Classify-Verify method is preferable to the standard underlying closed-world classifier. This holds regardless of the expected in-set/not-in-set ratio of the data, and in adversarial settings as well. Since the Classify-Verify algorithm is general, it can be applied with any set of stylometric classifiers and verifiers. In the forensics context, when satisfactorily rigorous techniques are used, the Classify-Verify method is an essential tool for forensic analysts facing the common case of open-world problems.

We propose several directions for future work:

1) Other classification-related forensics and privacy applications. An important characteristic of the Classify-Verify algorithm is its generality, which makes it applicable to other problems. For instance, it can be used for behavioral biometrics, e.g. in authentication systems that have a set of potential users (such as employees in an office) but should also consider outside attacks; the Active Linguistic Authentication dataset (Juola et al., 2013) is a perfect candidate dataset for such an application.

2) Fusion of verification methods. Decisions from various verification methodologies can be combined, for instance using the Chair-Varshney optimal fusion rule (Chair and Varshney, 1986), into a centralized verification method that outperforms its individual components. This approach is motivated by (Ali and Pazzani, 1995), which shows that a greater reduction in error rates is achieved when the combined learners are distinctly different – here to be applied by using both standalone and classifier-induced verifiers.
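For concreteness, a small sketch of the Chair-Varshney rule applied to verifier decisions follows; the per-verifier error rates and the example votes are invented, and in practice the rates would be estimated on held-out data.

```python
import math

def chair_varshney_fuse(decisions, p_false_accept, p_false_reject, prior_same=0.5):
    """Chair-Varshney optimal fusion of binary verifier decisions.
    decisions[i] is True if verifier i accepts authorship; p_false_accept[i]
    and p_false_reject[i] are that verifier's estimated error rates.  Each
    verifier contributes a log-likelihood-ratio weight, and the fused decision
    accepts iff the weighted sum (plus the prior term) is positive."""
    f = math.log(prior_same / (1.0 - prior_same))
    for u, pfa, pfr in zip(decisions, p_false_accept, p_false_reject):
        if u:   # verifier votes "same author"
            f += math.log((1.0 - pfr) / pfa)
        else:   # verifier votes "different author"
            f += math.log(pfr / (1.0 - pfa))
    return f > 0.0

# Hypothetical example: three verifiers, two accept and one rejects
print(chair_varshney_fuse(decisions=[True, True, False],
                          p_false_accept=[0.10, 0.20, 0.15],
                          p_false_reject=[0.15, 0.10, 0.25]))  # True
```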

3) Utilization of Classify-Verify for scalability. In cases where the analyst is faced with a closed-world yet large problem, Classify-Verify may be used in a divide-and-conquer approach to increase the potential accuracy of the classifier at hand. Instead of building one model trained over all suspect authors, many small Classify-Verify models can be trained. Each test document is classified by all models, and all suspects whose model returns ⊥ are immediately omitted. The authors returned by the remaining models then compete in a final round, decided by a Classify-Verify model (or a simple closed-world one) trained on all of them. Since each phase of this formulation is much smaller than one big model, this approach is more likely to succeed, and can potentially increase the accuracy attainable by forensic analysts in large problem domains.
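The following sketch outlines this divide-and-conquer procedure under assumed interfaces (the build_model factory and the toy stub are illustrative, not part of the paper's implementation).

```python
def scalable_classify_verify(document, author_groups, build_model):
    """Divide-and-conquer use of Classify-Verify on a large suspect set, as
    outlined above.  `build_model(authors)` is a hypothetical factory returning
    a callable that maps a document to either one of `authors` or None (the
    "unknown" symbol).  Groups whose small model abstains are dropped, and the
    surviving candidate authors compete in one final round."""
    candidates = [build_model(group)(document) for group in author_groups]
    candidates = [a for a in candidates if a is not None]
    if not candidates:
        return None
    return build_model(candidates)(document)

# Toy stub: a "model" that always returns the first author of its group
build_model = lambda authors: (lambda doc: authors[0] if authors else None)
print(scalable_classify_verify("text", [["A", "B"], ["C", "D"]], build_model))  # "A"
```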

References

[Abbasi and Chen, 2008] Abbasi, A. and Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Trans. Inf. Syst., 26(2):1–29.

[Afroz et al., 2012] Afroz, S., Brennan, M., and Greenstadt, R. (2012). Detecting hoaxes, frauds, and deception in writing style online. In Proceedings of the 33rd conference on IEEE Symposium on Security and Privacy. IEEE.

[Ali and Pazzani, 1995] Ali, K. M. and Pazzani, M. J. (1995). On the link between error correlation and error reduction in decision tree ensembles. Citeseer.

[Araujo et al., 2005] Araujo, L. C., Sucupira Jr., L. H., Lizarraga, M. G., Ling, L. L., and Yabu-Uti, J. B. (2005). User authentication through typing biometrics features. Signal Processing, IEEE Transactions on, 53(2):851–855.

[Brennan et al., 2012] Brennan, M., Afroz, S., and Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Trans. Inf. Syst. Secur., 15(3):12:1–12:22.

[Brennan and Greenstadt, 2009] Brennan, M. and Greenstadt, R. (2009). Practical attacks against authorship recognition techniques. In Proceedings of the Twenty-First Innovative Applications of Artificial Intelligence Conference.

[Burton et al., 2009] Burton, K., Java, A., and Soboroff, I. (2009). The ICWSM 2009 Spinn3r dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM 2009), San Jose, CA.

[Chair and Varshney, 1986] Chair, Z. and Varshney, P. (1986). Optimal data fusion in multiple sensor detection systems. Aerospace and Electronic Systems, IEEE Transactions on, AES-22(1):98–101.

[Chaski, 2013] Chaski, C. E. (2013). Best practices and admissibility of forensic author identification. JL & Pol'y, 21:333–725.

[Chow, 1970] Chow, C. (1970). On optimum recognition error and reject tradeoff. Information Theory, IEEE Transactions on, 16(1):41–46.

[Clark, 2011] Clark, A. (2011). Forensic stylometric authorship analysis under the Daubert standard. Available at SSRN 2039824.

[Clough, 2000] Clough, P. (2000). Plagiarism in natural and programming languages: an overview of current tools and technologies.

[Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

[Herbei and Wegkamp, 2006] Herbei, R. and Wegkamp, M. H. (2006). Classification with reject option. Canadian Journal of Statistics, 34(4):709–721.

[Juola, 2004] Juola, P. (2004). Ad-hoc authorship attribution competition. In Proc. 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH 2004), Goteborg, Sweden.

[Juola, 2008] Juola, P. (2008). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334.

[Juola, 2013] Juola, P. (2013). Stylometry and immigration: A case study. JL & Pol'y, 21:287–725.

[Juola et al., 2013] Juola, P., Noecker, Jr., J., Stolerman, A., Ryan, M. V., Brennan, P., and Greenstadt, R. (2013). A dataset for active linguistic authentication. In Proceedings of the Ninth Annual IFIP WG 11.9 International Conference on Digital Forensics, Orlando, Florida, USA. National Center for Forensic Science.

[Killourhy, 2012] Killourhy, K. S. (2012). A scientific understanding of keystroke dynamics. Technical report, DTIC Document.

[Koppel et al., 2011a] Koppel, M., Akiva, N., Dershowitz, I., and Dershowitz, N. (2011a). Unsupervised decomposition of a document into authorial components. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1356–1364. Association for Computational Linguistics.

[Koppel and Schler, 2004] Koppel, M. and Schler, J. (2004). Authorship verification as a one-class classification problem. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, pages 62–, New York, NY, USA. ACM.

[Koppel et al., 2011b] Koppel, M., Schler, J., and Argamon, S. (2011b). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[Koppel et al., 2007] Koppel, M., Schler, J., and Bonchek-Dokow, E. (2007). Measuring differentiability: Unmasking pseudonymous authors. J. Mach. Learn. Res., 8:1261–1276.

[Lee and Cho, 2007] Lee, H.-j. and Cho, S. (2007). Retraining a keystroke dynamics-based authenticator with impostor patterns. Computers & Security, 26(4):300–310.

[Manevitz and Yousef, 2007] Manevitz, L. and Yousef, M. (2007). One-class document classification via neural networks. Neurocomputing, 70(7):1466–1481.

[McDonald et al., 2012] McDonald, A., Afroz, S., Caliskan, A., Stolerman, A., and Greenstadt, R. (2012). Use fewer instances of the letter "i": Toward writing style anonymization. In Privacy Enhancing Technologies Symposium (PETS).

[Meyer zu Eissen et al., 2007] Meyer zu Eissen, S., Stein, B., and Kulig, M. (2007). Plagiarism detection without reference collections. In Decker, R. and Lenz, H.-J., editors, Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, pages 359–366. Springer Berlin Heidelberg.

[Narayanan et al., 2012] Narayanan, A., Paskov, H., Gong, N., Bethencourt, J., Stefanov, E., Shin, R., and Song, D. (2012). On the feasibility of internet-scale author identification. In Proceedings of the 33rd conference on IEEE Symposium on Security and Privacy. IEEE.

[Noecker and Juola, 2009] Noecker, Jr., J. and Juola, P. (2009). Cosine distance nearest-neighbor classification for authorship attribution. In Digital Humanities 2009, College Park, MD.

[Noecker and Ryan, 2012] Noecker, Jr., J. and Ryan, M. (2012). Distractorless authorship verification. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

[Paskov, 2010] Paskov, H. S. (2010). A regularization framework for active learning from imbalanced data. PhD thesis, Massachusetts Institute of Technology.

[Pietraszek, 2005] Pietraszek, T. (2005). Optimizing abstaining classifiers using ROC analysis. In Proceedings of the 22nd International Conference on Machine Learning, pages 665–672. ACM.

[Platt, 1998] Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In Schoelkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning. MIT Press.

[Potthast et al., 2011] Potthast, M. et al. (2011). PAN 2011 lab: Uncovering plagiarism, authorship, and social software misuse. Conference CFP at http://pan.webis.de/.

[Sanderson and Guenter, 2006] Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 482–491, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Scholkopf et al., 2001] Scholkopf, B., Platt, J. C., Shawe-Taylor, J. C., Smola, A. J., and Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Comput., 13(7):1443–1471.

[Sommer and Paxson, 2010] Sommer, R. and Paxson, V. (2010). Outside the closed world: On using machine learning for network intrusion detection. In Security and Privacy (SP), 2010 IEEE Symposium on, pages 305–316. IEEE.

[Sorio et al., 2010] Sorio, E., Bartoli, A., Davanzo, G., and Medvet, E. (2010). Open world classification of printed invoices. In Proceedings of the 10th ACM Symposium on Document Engineering, pages 187–190. ACM.

[Tax, 2001] Tax, D. M. (2001). One-class classification. PhD thesis.

[van Halteren, 2004] van Halteren, H. (2004). Linguistic profiling for authorship recognition and verification. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 199–206, Barcelona, Spain.

[Wayman et al., 2009] Wayman, J., Orlans, N., Hu, Q., Goodman, F., Ulrich, A., and Valencia, V. (2009). Technology assessment for the state of the art biometrics excellence roadmap. MITRE Technical Report Vol. 2, FBI.