
Understanding email traffic

David Graus, University of Amsterdam [email protected] @dvdgrs

Dec. 12, 2014 - Frontiers of Forensic Science 2

Some background…

• PhD candidate at ILPS
  • Information Extraction & Retrieval
• Project in NWO’s Forensic Science program
  • Semantic Search in E-Discovery




Information Retrieval?


• Finding material of an unstructured nature in large collections


Information Extraction?

• Text mining
• Discovering patterns in text data


Semantic Search in E-Discovery?


Semantic Search?


E-Discovery?

• Retrieving and securing digital forensic evidence




Semantic Search in E-Discovery

• Supporting search for digital forensic evidence
  • from emails, hard drives, mobile phones, etc.
  • not the open web
  • (Google won’t help us here)


Search in E-Discovery

• Finding out who knew what, from whom, and when
• We don’t know what we’re looking for
• What we’re looking for might be deliberately hidden
• Communication might be very domain-specific, contextualized or incomplete


Approach

• Generic search is not the answer
  • Google: high-precision search
  • E-Discovery: high-recall & exploratory search
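The precision/recall contrast can be made concrete with a small sketch. The document IDs below are hypothetical toy data, not results from the talk.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection with four relevant documents.
relevant = ["d1", "d2", "d3", "d4"]

# A short, clean result list: everything shown is relevant, but half is missed.
web_style = precision_recall(["d1", "d2"], relevant)

# A long result list: nothing is missed, at the cost of irrelevant hits.
ediscovery_style = precision_recall(
    ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant
)
```

In this toy case the short list scores precision 1.0 and recall 0.5, while the long list scores precision 0.5 and recall 1.0, which is the trade-off E-Discovery accepts.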


Tasks

• Support iterative search
• Support (re)formulating questions and hypotheses
• Retrieve all relevant traces


Recipient recommendation

• Given a sender, an email, and all possible recipients (in an enterprise);
• Predict which recipient(s) are most likely to receive the email


Why?

• Understanding the communication in, and the structure of, an enterprise
• Finding “unexpected” communication
• Applications in:
  • enterprise search
  • expert finding
  • community detection
  • spam classification
  • anomaly detection


How?

• Gmail
  • Who do you frequently “co-address”
  • ego network
• Related work
  • Social Network Analysis (SNA)
  • Email content
• Us
  • SNA + email content


Part 1: Social Network Analysis?

[email protected] [email protected]

[email protected]


image by Calvinius - Creative Commons Attribution-Share Alike 3.0


SNA for predicting recipients?

1. Importance of a node in the network
   Prior probability
   More important people are more likely to be recipients of an(y) email

2. Connection strength between two nodes
   Conditional probability
   Given the sender, candidates who are strongly associated with that sender are more likely to be recipients


Part 2: Email content

• Statistical Language Models (LMs)
  • Assign a probability to [a sequence of] words
  • By counting words
• Used in lots of places:
  • Web Search
  • Machine Translation
  • Speech Recognition
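"Assigning a probability by counting words" can be sketched as a unigram model. This is a minimal illustration with no smoothing and whitespace tokenization, both of which are simplifying assumptions; the example sentences are made up.

```python
from collections import Counter

def unigram_lm(texts):
    """Maximum-likelihood unigram model: P(word) = count(word) / total words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def sequence_probability(lm, text):
    """Probability of a word sequence, treating words as independent."""
    prob = 1.0
    for word in text.lower().split():
        prob *= lm.get(word, 0.0)  # unseen words get probability 0 (no smoothing)
    return prob

# Hypothetical two-email "collection".
lm = unigram_lm(["the meeting is at noon", "the report is late"])
p = sequence_probability(lm, "the report")
```

With nine words in total, "the" occurs twice and "report" once, so the sequence "the report" gets probability 2/9 × 1/9.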


Language Models

• Language models as communication “profiles”
  1. Incoming LM (how people talk to user)
  2. Outgoing LM (how user talks to people)
  3. Interpersonal LM (how node1 talks with node2)
  4. Corpus LM (how everyone talks)
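A sketch of how the four profiles could be collected from a mail archive. The `(sender, recipients, text)` tuples are hypothetical toy data, and treating the interpersonal profile as one shared profile per unordered pair is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical toy archive: (sender, recipients, text)
emails = [
    ("ann", ["bob"], "budget draft attached"),
    ("bob", ["ann", "cat"], "thanks for the budget"),
    ("cat", ["ann"], "lunch tomorrow"),
]

def build_profiles(emails):
    incoming = defaultdict(Counter)       # words each user receives
    outgoing = defaultdict(Counter)       # words each user sends
    interpersonal = defaultdict(Counter)  # words exchanged within each pair
    corpus = Counter()                    # words over the whole archive
    for sender, recipients, text in emails:
        words = text.lower().split()
        outgoing[sender].update(words)
        corpus.update(words)
        for recipient in recipients:
            incoming[recipient].update(words)
            # Assumption: one shared profile per unordered pair of users.
            interpersonal[frozenset((sender, recipient))].update(words)
    return incoming, outgoing, interpersonal, corpus

def to_lm(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

incoming, outgoing, interpersonal, corpus = build_profiles(emails)
ann_bob_lm = to_lm(interpersonal[frozenset(("ann", "bob"))])
```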


Why language models?

• Comparisons between communication profiles:
  • Find nodes with most similar communication
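The slides do not name a similarity measure. As one possible choice, and purely as an assumption, two profiles can be compared with the Jensen-Shannon divergence (0 for identical distributions, 1 for distributions with no shared words when using base-2 logs). The profiles below are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mix(dist):
        return sum(pw * math.log2(pw / mix[w]) for w, pw in dist.items() if pw > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)

def most_similar(target, profiles):
    """Name of the profile with the smallest divergence from `target`."""
    return min(profiles, key=lambda name: js_divergence(target, profiles[name]))

# Hypothetical profiles, already normalized to probabilities.
profiles = {
    "bob": {"budget": 0.5, "draft": 0.5},
    "cat": {"lunch": 1.0},
}
closest = most_similar({"budget": 1.0}, profiles)
```

Here `closest` is `"bob"`, since the target shares no words with the `"cat"` profile.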


Model

• Given sender and email, predict recipients
• Ranking function:
  • Email likelihood P(E|R,S): estimated using language modeling
  • Sender likelihood P(S|R): using SNA to estimate closeness of R and S
  • Recipient likelihood P(R): using SNA to estimate importance of R
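One way to write a ranking function consistent with these three terms, assuming they combine through Bayes' rule (R is the candidate recipient, S the sender, E the email). The denominator does not depend on R, so it can be dropped when ranking candidates:

```latex
\begin{align}
P(R \mid S, E) &= \frac{P(E \mid R, S)\, P(S \mid R)\, P(R)}{P(S, E)} \\
               &\propto P(E \mid R, S)\, P(S \mid R)\, P(R)
\end{align}
```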


Email likelihood

• P(word|R,S): interpersonal LM
• P(word|R): recipient LM
• P(word): corpus LM
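A sketch of how the three word probabilities might be combined into an email likelihood. Linear interpolation per word is an assumption, as is the mapping of the default weights onto the 60%-20%-20% split reported under Findings; the function and parameter names are illustrative.

```python
import math

def email_log_likelihood(words, interpersonal_lm, recipient_lm, corpus_lm,
                         weights=(0.6, 0.2, 0.2)):
    """log P(E|R,S), with each word scored by a weighted mix of three LMs."""
    w_inter, w_recip, w_corpus = weights
    total = 0.0
    for word in words:
        p = (w_inter * interpersonal_lm.get(word, 0.0)
             + w_recip * recipient_lm.get(word, 0.0)
             + w_corpus * corpus_lm.get(word, 0.0))
        if p == 0.0:
            return float("-inf")  # word unseen in all three models
        total += math.log(p)
    return total

# Hypothetical word probabilities for one sender-candidate pair.
inter = {"budget": 0.5}
recip = {"budget": 0.25}
corpus_lm = {"budget": 0.1, "lunch": 0.1}
score = email_log_likelihood(["budget", "lunch"], inter, recip, corpus_lm)
```

Working in log space avoids numerical underflow on long emails; the corpus LM acts as a fallback for words the pair has never exchanged.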


Sender Likelihood P(S|R): strength of connection between two nodes

1. Number of emails sent between nodes
2. Number of times two nodes are addressed together

Recipient Likelihood P(R): importance of node

1. Number of emails received
2. PageRank score
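A sketch of how these four signals could be computed from a mail log. The log is hypothetical toy data; counting directed sender-to-recipient emails, normalizing counts into probabilities, and the plain power-iteration PageRank (damping 0.85) are all assumptions rather than the settings used in the talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy log: (sender, recipients)
log = [("ann", ["bob", "cat"]), ("ann", ["bob"]), ("bob", ["ann"]), ("cat", ["bob"])]

def sna_counts(log):
    received = Counter()      # importance: emails received per node
    sent_between = Counter()  # connection: emails from sender to recipient
    co_addressed = Counter()  # connection: times two nodes share a recipient list
    for sender, recipients in log:
        for r in recipients:
            received[r] += 1
            sent_between[(sender, r)] += 1
        for a, b in combinations(sorted(set(recipients)), 2):
            co_addressed[(a, b)] += 1
    return received, sent_between, co_addressed

def recipient_prior(received):
    """P(R) as each node's share of all received emails."""
    total = sum(received.values())
    return {r: n / total for r, n in received.items()}

def sender_likelihood(sent_between, sender, recipient, received):
    """P(S|R) as the fraction of R's incoming emails that came from S."""
    if received[recipient] == 0:
        return 0.0
    return sent_between[(sender, recipient)] / received[recipient]

def pagerank(sent_between, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the email graph."""
    nodes = sorted({n for edge in sent_between for n in edge})
    out_weight = Counter()
    for (s, _), n in sent_between.items():
        out_weight[s] += n
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes that never send is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for (s, r), n in sent_between.items():
            new[r] += damping * rank[s] * n / out_weight[s]
        rank = new
    return rank

received, sent_between, co_addressed = sna_counts(log)
prior = recipient_prior(received)
ranks = pagerank(sent_between)
```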


SNA

1. Importance of a node in the network

2. Strength of connection between nodes

Email Content

1. Interpersonal LM
2. Recipient LM
3. Corpus LM


Approach: time-based

Timeline (time →):

1. Training period: build models (SNA + LM)
2. Testing period: predict recipients
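The time-based split can be sketched as follows. The email dictionaries, their `date` field, and the cutoff are hypothetical; the only point is that the models never see emails from the testing period.

```python
from datetime import datetime

def time_split(emails, cutoff):
    """Emails before `cutoff` are for training; emails at or after it are for testing."""
    ordered = sorted(emails, key=lambda e: e["date"])
    train = [e for e in ordered if e["date"] < cutoff]
    test = [e for e in ordered if e["date"] >= cutoff]
    return train, test

# Hypothetical archive and cutoff date.
archive = [
    {"id": 1, "date": datetime(2001, 5, 1)},
    {"id": 2, "date": datetime(2001, 1, 1)},
    {"id": 3, "date": datetime(2001, 9, 1)},
]
train, test = time_split(archive, datetime(2001, 6, 1))
```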


Testing

• Remove recipients from email
• Rank all nodes in the network, by computing:

1. P(E|R,S): Similarity between sender and candidate LMs

2. P(S|R): Strength of connection between sender and candidate

3. P(R): Importance of candidate
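A sketch of the ranking step, assuming the three scores are multiplied (summed in log space) per candidate. The candidate names and score values are hypothetical.

```python
import math

def rank_candidates(candidates, email_likelihood, sender_likelihood, recipient_prior):
    """Sort candidates by log P(E|R,S) + log P(S|R) + log P(R), highest first."""
    def score(r):
        parts = (email_likelihood[r], sender_likelihood[r], recipient_prior[r])
        if min(parts) <= 0.0:
            return float("-inf")  # any zero term rules the candidate out
        return sum(math.log(p) for p in parts)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical scores for three candidate recipients of one email.
ranking = rank_candidates(
    ["cat", "dan", "bob"],
    email_likelihood={"bob": 0.02, "cat": 0.05, "dan": 0.0},
    sender_likelihood={"bob": 0.5, "cat": 0.1, "dan": 0.3},
    recipient_prior={"bob": 0.4, "cat": 0.2, "dan": 0.4},
)
```

Here `ranking` is `["bob", "cat", "dan"]`: the strong sender connection outweighs the lower email likelihood for "bob", and "dan" drops to the bottom because one term is zero.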



Findings: What works?

• Importance of node:
  • Number of received emails of node
  • PageRank
• Strength of connection:
  • Number of emails between nodes
  • Number of times co-addressed
• LM similarity: Interpersonal LM is most important (60%-20%-20%)


Analysis: SNA vs email content

• SNA:
  • SNA signals deteriorate over time
  • SNA signals are most informative for highly active users
• Email content:
  • LM signal improves over time
  • LM signal does worse with highly active users


Finally

• Combining Social Network Analysis with Language Modeling works better than using either alone.


Future work

• Consider structure of network in more detail
  • Departments?
  • Friends/family?
• Include ‘time decay’
• Dynamically weight LM/SNA?


Applications in E-Discovery/Digital Forensics

• Anomaly detection
  • Given a working prediction model, identify “unexpected” communication
• Language models for communication
  • For a node, find the most different interpersonal communication
    • Friends/family vs colleagues?
  • Find communication that differs from the corpus-based communication
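One way the anomaly-detection idea could be operationalized: flag actual recipients that a working prediction model did not rank near the top. The `top_k` threshold and the example names are hypothetical.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=3):
    """Actual recipients that the model did not place in its top_k predictions."""
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]

# Hypothetical model output for one email, best candidate first.
ranking = ["bob", "cat", "dan", "eve", "fay"]
flagged = unexpected_recipients(ranking, ["bob", "fay"], top_k=3)
```

In this toy case only "fay" is flagged, marking that email as a candidate for closer inspection rather than as evidence in itself.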


Fin

• Questions?