Understanding email traffic
David Graus, University of Amsterdam
[email protected] @dvdgrs
Dec. 12, 2014 - Frontiers of Forensic Science 2
Some background…
• PhD candidate at ILPS
    • Information Extraction & Retrieval
• Project in NWO’s Forensic Science program
    • Semantic Search in E-Discovery
Information Retrieval?
• Finding material of an unstructured nature in large collections
Information Extraction?
• Text mining
• Discovering patterns in text data
E-Discovery?
• Retrieving and securing digital forensic evidence
Semantic Search in E-Discovery
• Supporting search for digital forensic evidence
    • from emails, hard drives, mobile phones, etc.
    • not the open web
    • (Google won’t help us here)
Search in E-Discovery
• Finding out who knew what, from whom, and when
• We don’t know what we’re looking for
• What we’re looking for might be deliberately hidden
• Communication might be very domain-specific, contextualized, or incomplete
Approach
• Generic search is not the answer
    • Google: high-precision search
    • E-Discovery: high-recall and exploratory search
Tasks
• Support iterative search
• Support (re)formulating questions and hypotheses
• Retrieve all relevant traces
Recipient recommendation
• Given a sender, an email, and all possible recipients (in an enterprise);
• Predict which recipient(s) are most likely to receive the email
Why?
• Understanding communication in, and the structure of, an enterprise
• Finding “unexpected” communication
• Applications in:
    • enterprise search
    • expert finding
    • community detection
    • spam classification
    • anomaly detection
How?
• Gmail
    • Who do you frequently “co-address”?
    • Ego network
• Related work
    • Social Network Analysis (SNA)
    • Email content
• Us
    • SNA + email content
Part 1: Social Network Analysis?
image by Calvinius - Creative Commons Attribution-Share Alike 3.0
SNA for predicting recipients?
1. Importance of a node in the network
    • Prior probability
    • More important people are more likely to be recipients of an(y) email
2. Connection strength between two nodes
    • Conditional probability
    • Given the sender, recipients who are strongly associated with the sender are more likely to receive the email
Part 2: Email content
• Statistical Language Models (LMs)
    • Assign a probability to [a sequence of] words
    • By counting words
• Used in lots of places:
    • Web Search
    • Machine Translation
    • Speech Recognition
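To make "assigning a probability by counting words" concrete, here is a minimal sketch of a unigram language model. The toy emails, the whitespace tokenization, and the add-one smoothing are all assumptions made for illustration; the slides do not say which estimator or smoothing method was actually used.

```python
from collections import Counter
import math

def train_unigram_lm(texts):
    """Count word occurrences over a list of email bodies."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())  # naive whitespace tokenization (assumption)
    return counts

def log_prob(words, counts, vocab_size):
    """Log-probability of a word sequence under an add-one smoothed unigram model."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)

# Hypothetical toy data, not taken from the talk.
emails = ["meeting moved to friday", "please review the contract before friday"]
lm = train_unigram_lm(emails)
vocab_size = len(lm) + 1  # one extra slot reserves probability mass for unseen words

seen = log_prob(["friday"], lm, vocab_size)
unseen = log_prob(["lunch"], lm, vocab_size)
```

A word that occurs in the training emails ("friday") receives a higher probability than one that never occurs ("lunch"), which is the property the later slides rely on when comparing profiles.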
Language Models
• Language models as communication “profiles”
    1. Incoming LM (how people talk to user)
    2. Outgoing LM (how user talks to people)
    3. Interpersonal LM (how node1 talks with node2)
    4. Corpus LM (how everyone talks)
Why language models?
• Comparisons between communication profiles:
    • Find nodes with most similar communication
Model
• Given sender and email, predict recipients
• Ranking function:
• Email likelihood: estimated using language modeling
• Sender likelihood: using SNA to estimate the closeness of R and S
• Recipient likelihood: using SNA to estimate the importance of R
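The formula on the original slide did not survive extraction. The following is a reconstruction from the three named components, assuming the standard Bayesian factorization with R a candidate recipient, S the sender, and E the email; the exact form shown in the talk may differ.

```latex
P(R \mid S, E) \;\propto\; P(E \mid R, S) \cdot P(S \mid R) \cdot P(R)
```

Here P(E | R, S) is the email likelihood, P(S | R) the sender likelihood, and P(R) the recipient likelihood. The normalizing term P(S, E) is the same for every candidate, so it can be dropped when ranking.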
Sender likelihood P(S|R): strength of connection between two nodes
1. Number of emails sent between the nodes
2. Number of times the two nodes are addressed together
Recipient likelihood P(R): importance of a node
1. Number of emails received
2. PageRank score
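As an illustration of how the count-based signals could be turned into probabilities, here is a small sketch that estimates P(R) from the number of emails received and P(S|R) from the number of emails sent between two nodes. The toy email log, the directed counting, and the normalization are assumptions; the slides name the signals but not how they are normalized.

```python
from collections import Counter

# Hypothetical toy log: (sender, [recipients]) pairs from the training period.
emails = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("dave", ["bob"]),
    ("bob", ["alice"]),
]

received = Counter()      # number of emails received per node
sent_between = Counter()  # number of emails from a given sender to a given recipient
for sender, recipients in emails:
    for r in recipients:
        received[r] += 1
        sent_between[(sender, r)] += 1

def recipient_likelihood(r):
    """P(R): share of all received emails that went to r (assumed normalization)."""
    return received[r] / sum(received.values())

def sender_likelihood(s, r):
    """P(S|R): share of r's received emails that came from s (assumed normalization)."""
    return sent_between[(s, r)] / received[r] if received[r] else 0.0

p_bob = recipient_likelihood("bob")
p_alice_given_bob = sender_likelihood("alice", "bob")
```

The co-addressing count and the PageRank score could be plugged into the same two functions in place of the raw counts.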
SNA
1. Importance of a node in the network
2. Strength of connection between nodes
Email Content
1. Interpersonal LM
2. Recipient LM
3. Corpus LM
Approach: time-based
[Figure: timeline with a training period and a testing period]
• Training period: build models (SNA + LM)
• Testing period: predict recipients
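A time-based split can be sketched as follows. The record layout, the example dates, and the cutoff are hypothetical; the slides only state that models are built on an earlier period and evaluated on a later one.

```python
from datetime import datetime

# Hypothetical records: (timestamp, sender, recipients).
emails = [
    (datetime(2001, 1, 5), "alice", ["bob"]),
    (datetime(2001, 3, 9), "bob", ["carol"]),
    (datetime(2001, 6, 2), "alice", ["carol"]),
]

def split_by_time(emails, cutoff):
    """Emails before the cutoff train the models; the rest are held out for testing."""
    train = [e for e in emails if e[0] < cutoff]
    test = [e for e in emails if e[0] >= cutoff]
    return train, test

train, test = split_by_time(emails, datetime(2001, 4, 1))
```

Splitting on time rather than at random keeps the evaluation realistic: the models never see emails sent after the ones they are asked to predict.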
Testing
• Remove recipients from email
• Rank all nodes in the network, by computing:
1. P(E|R,S): Similarity between sender and candidate LMs
2. P(S|R): Strength of connection between sender and candidate
3. P(R): Importance of candidate
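The ranking step can be sketched as summing the three components in log space and sorting the candidates. The component values below are made up, and the small epsilon used to avoid log(0) is an assumption; the slides do not describe how zero probabilities are handled.

```python
import math

def rank_recipients(candidates, email_loglik, sender_lik, recipient_lik):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R), best first."""
    eps = 1e-12  # guards against log(0); an assumption, not from the talk
    scores = {}
    for r in candidates:
        scores[r] = (email_loglik[r]
                     + math.log(sender_lik[r] + eps)
                     + math.log(recipient_lik[r] + eps))
    return sorted(candidates, key=lambda r: scores[r], reverse=True)

# Hypothetical component values for three candidate recipients.
email_loglik = {"bob": -11.0, "carol": -9.5, "dave": -15.0}   # log P(E|R,S)
sender_lik = {"bob": 0.6, "carol": 0.1, "dave": 0.0}          # P(S|R)
recipient_lik = {"bob": 0.5, "carol": 0.3, "dave": 0.2}       # P(R)

ranking = rank_recipients(["bob", "carol", "dave"],
                          email_loglik, sender_lik, recipient_lik)
```

In this toy example carol has the best content match, but bob's much stronger connection to the sender puts him first, which shows how the SNA and content signals trade off.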
Findings: What works?
• Importance of node:
    • Number of emails received by the node
    • PageRank
• Strength of connection:
    • Number of emails between nodes
    • Number of times co-addressed
• LM similarity: Interpersonal LM is most important (60%-20%-20%)
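One way to read the 60%-20%-20% figures is as linear interpolation weights over the interpersonal, recipient, and corpus language models. That reading is an assumption, as are the toy word distributions below; the sketch only shows what such a mixture would look like.

```python
def mixture_prob(word, interpersonal, recipient, corpus, weights=(0.6, 0.2, 0.2)):
    """Interpolate a word's probability across three language models."""
    w_ip, w_rec, w_corp = weights
    return (w_ip * interpersonal.get(word, 0.0)
            + w_rec * recipient.get(word, 0.0)
            + w_corp * corpus.get(word, 0.0))

# Hypothetical word probabilities under each model.
interpersonal = {"contract": 0.5, "lunch": 0.1}
recipient = {"contract": 0.2, "lunch": 0.3}
corpus = {"contract": 0.1, "lunch": 0.1}

p_contract = mixture_prob("contract", interpersonal, recipient, corpus)
```

Because the corpus model covers every word seen in training, giving it some weight keeps the mixture from assigning zero probability to words the sender and candidate have never exchanged.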
Analysis: SNA vs email content
• SNA:
    • SNA signals deteriorate over time
    • SNA signals are most informative on highly active users
• Email content:
    • LM signal improves over time
    • LM signal does worse with highly active users
Finally
• Combining Social Network Analysis with Language Modeling works better than using either alone.
Future work
• Consider structure of network in more detail
    • Departments?
    • Friends/family?
• Include ‘time decay’
• Dynamically weight LM/SNA?
Applications in E-Discovery/Digital Forensics
• Anomaly detection
    • Given a working prediction model, identify “unexpected” communication
• Language models for communication
    • For a node, find the most different interpersonal communication
        • Friends/family vs colleagues?
    • Find communication that differs from the corpus-based communication
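The anomaly-detection idea could be sketched as flagging actual recipients that a trained model ranks low. The function name, the `top_k` threshold, and the example ranking are hypothetical; the slides propose the application but do not specify a decision rule.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=3):
    """Return the actual recipients that the model did not place in its top_k."""
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]

# Hypothetical model output (best first) and the email's real recipients.
ranking = ["bob", "carol", "dave", "erin"]
flagged = unexpected_recipients(ranking, ["bob", "erin"], top_k=2)
```

Flagged recipients are not necessarily suspicious; they are candidates for closer review in an exploratory search setting.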