08/24/15

2015 ISSA INTERNATIONAL CONFERENCE

N-Gram Analysis in Suspect Author Identification of Anonymous Email

Copyright 2015, eVestigations Inc., All Rights Reserved

Paul Herrmann, CISSP, CISA, ENCE, CPP

• Problem Description

• Computational Linguistics as a Possible Solution

• N-Grams

• Solution Development

• Testing and Verification

• Problem Resolution

• Appropriate Usage

• Future Work

• Summary

Outline

I HATE XXXXX with a passion. I dislike everything about the company. The work…my pay, my hours, my co workers and my STUPID supervisor. I am exhausted EVERY single day. And on top of this I have to work a SECOND job at Emory. I do not have a fucking life and this is pissing me off. I have been doing this tired shit for YEARS! I did not go to college in Mississippi to move to Georgia and settle for this bullshit.

Anyway the reason for this message is concerning the work ethic at Xxxxx. I work with the most disrespectful people I have ever met in my life. They are very loud, very rude, and ignorant. In addition, they are very unprofessional on ALL levels. They do things like xxxx xxxxxxx everyday and hide or throw them away to cover up themselves.

But what gets to me is the constant nagging by my co workers. It has gotten to the point that I dread the WORD Xxxxx. I have been thinking about confronting several employees telling them to be respectful to me when they speak but they are so rude, it wouldn’t seem to work.

So I have decided that I will just bring my Gun to work next week. Look like it’s gonna be another Columbine very SOON if these motherfuckers dont behave.

Large Company - Anonymous Compliance Portal - Threat

Problem Description

• Five communications of similar length

• Over a two-week period

• Specific threat

• Various sources including anonymous portal and anonymous email

• Communications contained misinformation

• IP tracing led to dead ends

Threat Communications

Problem Description

• Suspect familiar with organization

• Aware of the anonymous employee compliance portal

• Suspect knowledgeable about IP tracing and obscuring techniques

Knowns

Problem Description

• Threat Emails

• Reasonable Suspicion that Author is an Employee

• MS Exchange Server of Employee Emails

Have

Given an unknown writing, identify its author from a fixed population of writings.

Problem Description

• Claude Shannon's master's thesis: all information can be represented as 1's and 0's. Awarded the Alfred Noble Prize by the American Society of Civil Engineers in 1940.

• “A Mathematical Theory of Communication”, 1948, Bell Laboratories, Claude Shannon (first published use of the term “bit”, which Shannon credited to John Tukey)

• “Prediction and Entropy of Printed English”, 1951, The Bell System Technical Journal

• Information entropy and N-grams, core components of computational linguistics, emerged in the 1950s after failed attempts at computer language translation

• Computational linguistics became a subdivision of artificial intelligence in the 1960s, with emphasis on speech recognition and machine comprehension, and soon disappeared from the forefront of research …

Computational Linguistics

Computational Linguistics as a Possible Solution

Claude Shannon (1916-2001), the Father of Information Theory

• N stands for the number of consecutive lexical units (words, letters, etc.) whose frequencies are calculated from a known body of texts (the “training texts”)

• 3-gram example (“Now is the time that…”)

Now is the        92
is the time       13
the time that     32

What is an N-Gram Model?

Computational Linguistics as a Possible Solution
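
A minimal Python sketch of the counting step behind an N-gram model. It assumes whitespace tokenization and lowercasing, neither of which the slide specifies, and the name `word_ngrams` is illustrative only.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Count overlapping word n-grams in a text.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    # Slide a window of n tokens across the text, one token at a time.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


counts = word_ngrams("Now is the time that now is the time", n=3)
for gram, freq in counts.most_common(3):
    print(gram, freq)
```

In this toy sentence the 3-grams "now is the" and "is the time" each occur twice. Real frequency tables, such as the counts on the slide, come from a much larger training corpus.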

3‐gram data – Google Web N‐Gram Corpus

ceramics collected by             52
ceramics collectible pottery      50
ceramics collectibles cooking     45
ceramics collection ,            144
ceramics collection .            247
ceramics collection </S>         120
ceramics collection and           43
ceramics collection at            52
ceramics collection is            68
ceramics collection of            76
ceramics collection |             59
ceramics collections ,            66
ceramics collections .            60
ceramics combined with            46

• In the late 1950s, N-gram models (as well as other statistical models) were heavily criticized by Noam Chomsky, considered by many to be the father of modern linguistics.

• Formal grammar: finite sets of production rules, nonterminal symbols, terminal symbols and a starting symbol.

• Chomsky hierarchy: four types of grammars.

• By the 1960s, N-grams had all but disappeared from linguistic research

The Battle

Computational Linguistics as a Possible Solution

Noam Chomsky, the Father of Modern Linguistics

• An N-gram model predicts the likelihood of the next letter (or word) based only on the sequence of the preceding N-1 letters (or words), without regard to the “rules of grammar”

• N-grams returned in the mid-1980s

• By the 1990s, well entrenched (Google, search engines)

• N-gram models have been extremely effective in language-modeling applications

The Issue

Computational Linguistics as a Possible Solution
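
The prediction step described on this slide can be sketched as follows. The sketch assumes word tokens and raw relative frequencies with no smoothing. `train` and `next_word_probs` are hypothetical names, not part of any library or of the presenter's tool.

```python
from collections import Counter, defaultdict


def train(tokens, n=3):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def next_word_probs(model, context):
    """Estimate P(next word | context) as a relative frequency.

    Returns an empty dict for a context never seen in training.
    """
    counts = model.get(tuple(context), Counter())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


tokens = "the cat sat on the mat the cat ran".split()
model = train(tokens, n=3)
print(next_word_probs(model, ["the", "cat"]))
```

No grammar rule is consulted anywhere: the estimate depends only on which words followed the same two-word context in the training text.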

• Search optimization

• Find likely candidates for the correct spelling of a misspelled word

• Improve compression in compression algorithms

• Assess the probability of a given word sequence in speech recognition and optical character recognition software

• Identify similar documents

• Improve hashing and information retrieval performance

• Identify the language of a text or create random language-like text

• Identify species from DNA

• Associate text of unknown authorship to an author

Applications

N-Grams

Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

Google Web N-Gram Corpus: “All Our N-gram are Belong to You”

N-Grams

The Lineup

Solution Development

Can we develop a system that uses N-grams to pick out, from a known population of texts, writing similar to the sample texts?

In other words, can we use N-grams to match the writing style of the threat letters to a known author contained in the MS Exchange database?

“Author Identification Using Imbalanced and Limited Training Texts”, Efstathios Stamatatos, University of the Aegean

• Calculate the N-Grams for the unattributed texts

• Calculate the N-Grams for known author texts

• Calculate the “distance” of the N-Grams from one another

• Suggests that N = 3 to 5 yields the best results

• Issue: the “distance” measure is sensitive to text size; the texts being compared need to be of relatively similar length

Research

Solution Development
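
A sketch of the profile-and-distance idea in Python. The slide does not give the distance formula. As an assumption, this sketch uses the relative-difference dissimilarity from the Common N-Grams approach (Kešelj et al.) over character n-grams, and the function names are illustrative.

```python
from collections import Counter


def profile(text, n=3):
    """Normalized frequency of each character n-gram in the text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


def dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumed formula (CNG-style): sum of (2 * (f1 - f2) / (f1 + f2)) ** 2.
    A value of 0 means identical profiles; larger means less similar.
    """
    total = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        total += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return total


threat = profile("I dread the word. I dread the work.")
known = profile("I dread Mondays and I dread the commute.")
print(round(dissimilarity(threat, known), 2))
```

Every n-gram that appears in only one profile adds the maximum penalty of 4. A long text therefore contributes many unmatched n-grams against a short one, which is one way to see the text-size sensitivity noted in the last bullet.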

“Effective Identification of Source Code Authors Using Byte-Level Information”, Frantzeskou, Stamatatos, Gritzalis, Katsikas, University of the Aegean

• Deals with limited text sizes

• Replaces “distance” calculation with a count of matched N-grams

• Accuracy reported between 94% and 100%

Research

Solution Development
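
The matched-N-gram count can be sketched in a few lines of Python. This sketch works on characters rather than raw bytes and keeps only the most frequent n-grams of each text. The `limit` value and the function names are assumptions for illustration, not parameters taken from the paper or from the presenter's tool.

```python
from collections import Counter


def top_ngrams(text, n=5, limit=2000):
    """Set of the `limit` most frequent character n-grams in the text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram for gram, _ in grams.most_common(limit)}


def matched_ngrams(text_a, text_b, n=5, limit=2000):
    """Similarity score: number of n-grams the two profiles have in common."""
    return len(top_ngrams(text_a, n, limit) & top_ngrams(text_b, n, limit))


print(matched_ngrams("abcdefg", "abcdefz", n=5))
```

Because the score is a simple intersection size, no frequencies are divided or normalized, which sidesteps the text-size sensitivity of a distance measure. Higher scores mean more similar texts.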

• Normally, one would compare the N-grams of the unknown-author text against a composite of all of an author's writing (“Who writes like this?”)

• Instead, compare the N-grams of the threat text against the individual texts of each author (“Who wrote an email like this?”)

Design Issue #1 – Writing is Intentionally Stylized

Solution Development

• Retrieve email from “Sent” Folder

• Parse off previous non-author email text (Forwarded chains)

• Parse off signatures and boilerplate

• If the remaining text is below a minimum length, discard it

Design Issue #2 – Email Contains Non-Author Text

Solution Development
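
A heuristic Python sketch of the cleansing steps listed on this slide. The marker patterns, the `min_chars` threshold, and the name `author_text` are all assumptions for illustration. Real mailboxes need patterns tuned to the mail clients and signature styles actually in use.

```python
import re

# Assumed markers for where non-author text starts: reply and forward
# banners, a quoted header block, or the conventional "-- " signature line.
CUT_MARKERS = re.compile(
    r"^(-{2,}\s*(Original Message|Forwarded by).*|On .+ wrote:|From:\s.+|-- ?)$",
    re.IGNORECASE,
)


def author_text(body, min_chars=200):
    """Return the part of an email body likely written by the sender.

    Returns None when the remaining text is shorter than min_chars.
    """
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if CUT_MARKERS.match(stripped):
            break  # everything from here on is treated as non-author text
        if stripped.startswith(">"):
            continue  # drop inline quoted lines
        kept.append(line)
    text = "\n".join(kept).strip()
    return text if len(text) >= min_chars else None


sample = "Sounds good to me.\n-----Original Message-----\nFrom: Someone Else\nOld text"
print(author_text(sample, min_chars=5))
```

The sketch errs on the side of cutting too much: a false cut only shortens a sample, while leftover quoted text would attribute someone else's N-grams to the sender.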

• Develop function to load and parse email archives

• Develop function to eliminate non-author text

• Develop (compiler-like) Lexical Analysis routine generating tokens

• Develop N-Gram calculation routine

• Develop N-Gram comparison routine

• Develop testing routine

Development Steps

Solution Development
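
As an illustration of the lexical analysis step, a small regex tokenizer in Python. The token classes (words with an optional apostrophe part, digit runs, single punctuation marks) and the choice to preserve case by default are assumptions. The slides do not describe the presenter's actual token rules.

```python
import re

# Words (optionally with an apostrophe part), digit runs, or one punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]")


def tokenize(text, lowercase=False):
    """Split text into word, number and punctuation tokens."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens


def ngrams(tokens, n=5):
    """List of overlapping n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("I HATE it, don't you?")
print(tokens)
print(ngrams(tokens, n=3)[:2])
```

Keeping case and punctuation as tokens is a design choice: habits such as capitalizing whole words for emphasis may themselves be part of an author's style.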

• Load all Enron PST files

• Dedupe and cleanse non-author text

• Calculate 5-grams for all emails

• Perform 20 sets of 100 trials, each trial as follows:

    • Randomly pick one email author (mailbox)

    • Randomly pick an email from the mailbox

    • Compare the 5-grams of all emails to those of the selected email

    • If the email with the most 5-grams in common is from the same author, record “success”; otherwise record “fail”

Testing Protocol

Solution Development
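
The protocol above can be sketched in Python as follows. Several details are assumptions: the corpus is a list of `(author, text)` pairs that are already cleansed, tokens are whitespace-separated words, the selected email is excluded from its own comparison, and ties go to the first best match. All names are hypothetical.

```python
import random
import statistics


def ngram_set(text, n=5):
    """Set of word n-grams in a text (whitespace tokens assumed)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def run_set(emails, trials, n, rng):
    """Fraction of trials where the best-matching email has the same author."""
    grams = [(author, ngram_set(text, n)) for author, text in emails]
    by_author = {}
    for idx, (author, _) in enumerate(grams):
        by_author.setdefault(author, []).append(idx)
    hits = 0
    for _ in range(trials):
        author = rng.choice(sorted(by_author))   # random mailbox
        probe = rng.choice(by_author[author])    # random email from it
        # Email (other than the probe itself) sharing the most n-grams.
        best = max(
            (i for i in range(len(grams)) if i != probe),
            key=lambda i: len(grams[i][1] & grams[probe][1]),
        )
        hits += grams[best][0] == author
    return hits / trials


def evaluate(emails, sets=20, trials=100, n=5, seed=0):
    """Mean and sample standard deviation of accuracy across sets."""
    rng = random.Random(seed)
    accuracies = [run_set(emails, trials, n, rng) for _ in range(sets)]
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Calling `evaluate(emails)` on a real corpus returns the two figures of the kind reported for this test: average accuracy and its standard deviation across the 20 sets. The corpus needs at least two emails, and a mailbox with only one email can never be matched correctly.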

79.2% Accuracy with Standard Deviation of 4.2%

Testing and Verification

• Investigated suspect

• Poor performance report

• Former employer reported similar threats prior to employee’s separation

• Employee confessed and resigned

Identified Suspect

Problem Resolution

1. Suspect Pool Identification

2. Trial Evidence

    A. Current testing results clearly place the methodology in the “more likely than not” category at best.

    B. Has been used in British courts. In the British and Australian court systems, it is the expert rather than the method that is recognized (see [1]).

    C. U.S. courts: Daubert barrier

N-Gram Anonymous Author Usage Cases

Appropriate Usage

1. Empirical testing: whether the theory or technique is falsifiable, refutable, and/or testable.

2. Whether it has been subjected to peer review and publication.

3. The known or potential error rate.

4. The existence and maintenance of standards and controls concerning its operation.

5. The degree to which the theory and technique are generally accepted by a relevant scientific community.

Daubert Criteria

Appropriate Usage

• Coulthard [1] argues that the N-gram model, as well as other statistical methods, meets the Daubert criteria.

• Tiersma and Solan [5] argue that in most cases expert testimony is excluded but the documents in question are admitted, leaving the jury to discern their meaning and authorship without expert guidance

Linguists Argue Daubert Criteria Are Adequately Met

Appropriate Usage

• United States v. Van Wyk (83 F. Supp. 2d 515 (D. N.J. 2000))

“Although Fitzgerald [the FBI agent offered as a stylistics expert] employed a particular methodology that may be subject to testing, neither Fitzgerald nor the Government has been able to identify a known rate of error, establish what amount of samples is necessary for an expert to be able to reach a conclusion as to probability of authorship, or pinpoint any meaningful peer review. Additionally, as Defendant argues, there is no universally recognized standard for certifying an individual as an expert in forensic stylistics. (83 F.Supp.2d at 522)” [5]

Daubert Criteria Challenge

Appropriate Usage

• JonBenét Ramsey ransom note (1996) (see [5])

• Professor Donald Foster, using stylistic analysis, first attributed the note to someone who did not write it.

• Later, Professor Foster changed his position and determined that Mrs. Ramsey had written the note.

• “Such incidents help to justify the law's concern about methodology.” [5]

Daubert Criteria Challenge

Appropriate Usage

• More case studies

• Better testing to isolate best statistical approach and parameters

• Implement Linguistic Analysis techniques to determine:

    • Nationality and regional affiliation of author

    • Sex of author

    • Ethnicity of author

    • Education level of author

    • Hostility level of author

• Utilize MIT Simile project http://simile.mit.edu/wiki/NGram

Future Work

• N-gram analysis of anonymous email authorship has been successfully applied

• The methodology is testable

• The current error rate creates a “reasonable doubt” as to authorship

• The methodology can be an aid in identifying suspects

Summary

[1] M. Coulthard, “Author Identification, Idiolect and Linguistic Uniqueness”, Applied Linguistics, 2004, 13-14

[2] E. Stamatatos, “Author Identification Using Imbalanced and Limited Training Texts”, University of the Aegean, pp. 1-7

[3] C. Chaski, “Empirical Evaluations of Language-based Author Identification Techniques”, Forensic Linguistics, 2001, pp. 1-65

[4] S. Banerjee, T. Pedersen, “The Design, Implementation and Use of the Ngram Statistics Package”, Carnegie Mellon University, pp. 1-12

[5] P. Tiersma, L. Solan, “The Linguist on the Witness Stand: Forensic Linguistics in American Courts”, Language, Vol. 78, No. 2 (Jun 2002), pp. 221-239

[6] C. Shannon, “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol. 27, pp. 379-423, 633-656, July, October 1948

[7] C. Shannon, “Prediction and Entropy of Printed English”, The Bell System Technical Journal, Vol. 30, pp. 50-64, January 1951

References

Paul Herrmann, EnCE, CISSP, CISA, CPP

[email protected]

Contact
