Beyond Keyword Filtering for Message and Conversation Detection David Skillicorn School of...


Page 1: Beyond Keyword Filtering for Message and Conversation Detection David Skillicorn School of Computing, Queen’s University Math and CS, Royal Military College.

Beyond Keyword Filtering for Message and Conversation Detection

David Skillicorn

School of Computing, Queen’s University

Math and CS, Royal Military College

[email protected]@cs.queensu.ca


The problem:

Pick out the most ‘interesting’ intercepted messages when conventional markers (sender/receiver identities, etc.) are missing.

The solution:

Look for correlated use of words that are used with the “wrong” frequency, caused by substitution to evade keyword filtering.

The technique:

Use singular value decomposition and independent component analysis applied to noun frequency profiles; suspicious related messages appear as outliers.

Messages with ordinary word frequencies, and lone eccentrics, do not show up, so the technique can be applied to large sets of messages to select the interesting few.

THE PROBLEM

Many governments collect and analyze message traffic (e.g. Echelon) – email, file traffic/web, cellphone traffic, radio.

There are 3 levels of analysis:

1. Match the content of individual messages against a watch list of words that suggest the message is suspicious.

German Federal Intelligence Service: nuclear proliferation (2000 terms), arms trade (1000), terrorism (500), drugs (400), as of 2000 (certainly changed now).

Countermeasures: use a speech code (hard in realtime) or use locutions (“the package is ready”).

Main benefit: Changes behavior of those who DON’T want their messages intercepted.
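As a minimal sketch of this first level of analysis, the Python below matches a message against a watch list. The list contents and the name `flag_message` are illustrative assumptions, not entries from any real list. The second call shows why a locution such as “the package is ready” slips through.

```python
import re

# Hypothetical watch list; real lists are far larger and not public.
WATCH_LIST = {"bomb", "detonator", "plutonium"}

def flag_message(text, watch_list=WATCH_LIST):
    """Return the sorted watch-list words that occur in the message text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words & watch_list)

print(flag_message("The bomb is ready"))     # ['bomb']
print(flag_message("The package is ready"))  # []
```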


2. Look for sets of messages that are connected, that form a conversation, based on some of their properties: sender/receiver identities, time of transmission, specialized word use, etc. (social network analysis).

Countermeasures: conceal the connections between the messages by making sure they share no obvious attributes:

* use temporary email addresses, stolen cell phones
* decouple by using intermediaries
* smear time factors, e.g. by using web sites

In general, hide in the background noise.


3. Look for sets of messages that are connected in more subtle ways because of correlation among their properties.

Workable countermeasures are hard to find because:

* conversations are about something, so that correlation in their content arises naturally
* sensitivity to watch list surveillance alters the way words are used

We hypothesize that related messages among a threat group in the context of watch list surveillance will be characterized by correlated word use; but that the words will be used with the “wrong” frequencies.

Common words will be used as if they were uncommon; uncommon words will be used as if they were common.

THE DATA

The frequency of words in English (and many other languages) follows a Zipf distribution – frequent words are very frequent, and frequency drops off very quickly with rank.

We restrict our attention to nouns.

In English:

Most common noun – time

3262nd most common noun – quantum

We assume that messages are reduced to a frequency histogram of their nouns (this can be done reliably with a tagger).


A message-frequency matrix has a row corresponding to each message, and a column corresponding to each noun. The (i, j)th entry is the frequency of noun j in message i.

The matrix is very sparse.
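A small sketch of how such a matrix might be assembled, assuming each message has already been reduced to its list of nouns by a tagger. The function name and the three sample messages are hypothetical, and a dense list of lists is used only for readability; with a realistic vocabulary a sparse representation would be the natural choice.

```python
from collections import Counter

def message_frequency_matrix(noun_lists):
    """Build a message-frequency matrix from per-message noun lists.

    Returns (vocab, rows): vocab is the sorted list of distinct nouns, and
    rows[i][j] is the number of times vocab[j] occurs in message i.
    """
    vocab = sorted({noun for nouns in noun_lists for noun in nouns})
    rows = []
    for nouns in noun_lists:
        counts = Counter(nouns)  # missing nouns count as 0
        rows.append([counts[noun] for noun in vocab])
    return vocab, rows

# Hypothetical tagger output for three short messages.
messages = [
    ["time", "meeting", "time"],
    ["package", "time"],
    ["meeting"],
]
vocab, matrix = message_frequency_matrix(messages)
print(vocab)   # ['meeting', 'package', 'time']
print(matrix)  # [[1, 0, 2], [0, 1, 1], [1, 0, 0]]
```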

We generate artificial datasets using a Poisson distribution with mean f * 1/(j+1) for column j, where f models the base frequency.

We add 10 extra rows representing the correlated threat messages, using a block of 6 columns, uniformly randomly 0s and 1s, added at columns 301-306.
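The generation procedure described above could be sketched with NumPy as follows. The number of background messages, the number of nouns, the value of f, and the seed are not given on the slide and are assumptions here. It is also an assumption that j counts columns from 1 and that the threat rows share the Poisson background, with the 0/1 block added on top.

```python
import numpy as np

def make_dataset(n_messages=1000, n_nouns=400, f=3.0, n_threat=10,
                 block_start=300, block_width=6, seed=0):
    """Synthetic message-frequency matrix with planted threat rows at the end."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, n_nouns + 1)   # assumed 1-based column index
    means = f / (j + 1)             # Poisson mean f * 1/(j+1) per column
    # Background counts for all rows, including the rows that become threats.
    A = rng.poisson(means, size=(n_messages + n_threat, n_nouns))
    # Uniformly random 0/1 block; 0-based columns 300..305 are columns 301-306.
    block = rng.integers(0, 2, size=(n_threat, block_width))
    A[n_messages:, block_start:block_start + block_width] += block
    return A

A = make_dataset()
print(A.shape)  # (1010, 400)
```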


A message-rank matrix has a row corresponding to each message; the jth column holds the rank, in English, of the jth most frequent noun in the message.

Message-rank matrices have many fewer columns, which makes them easier and faster to work with (e.g. Enron email dataset: 200,000+ ‘words’ but average number of nouns per message <200).

Message-frequency matrices have been extensively studied in IR, but message-rank matrices not at all.

Message-rank matrices are robust against countermeasures such as using words with almost the right frequency.
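One way to build a message-rank row, as a sketch under stated assumptions: the rank table `english_rank` is hypothetical apart from the two ranks quoted earlier (time = 1, quantum = 3262), ties in within-message frequency are broken by English rank, and short messages are padded with 0. None of these choices is specified on the slides.

```python
from collections import Counter

def message_rank_matrix(noun_lists, english_rank, width):
    """Row i, column j holds the English rank of the j-th most frequent noun in message i."""
    rows = []
    for nouns in noun_lists:
        counts = Counter(nouns)
        # Most frequent noun first; ties broken by English rank (an assumption).
        ordered = sorted(counts, key=lambda n: (-counts[n], english_rank[n]))
        ranks = [english_rank[n] for n in ordered][:width]
        rows.append(ranks + [0] * (width - len(ranks)))  # pad with 0 (an assumption)
    return rows

# Ranks for "meeting" and "package" are made up for illustration.
english_rank = {"time": 1, "quantum": 3262, "meeting": 250, "package": 900}
messages = [
    ["time", "meeting", "time"],
    ["quantum", "package", "quantum"],
]
rank_matrix = message_rank_matrix(messages, english_rank, width=3)
print(rank_matrix)  # [[1, 250, 0], [3262, 900, 0]]
```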

[Figure: message-frequency matrix, with one row per message and one column per noun.]

[Figure: message-rank matrix, with one row per message; column j holds the rank of the jth noun in the message.]

THE TECHNIQUES

Matrix decompositions.

The basic idea:

* Treat the dataset as a matrix, A, with n rows and m columns;
* Factor A into the product of two matrices, C and F: A = C F, where C is n x r, F is r x m, and r is smaller than m.

Think of F as a set of underlying ‘real’ somethings and C as a way of ‘mixing’ these somethings together to get the observed attribute values. Choosing r smaller than m forces the decomposition to somehow represent the data more compactly.

[Figure: A = C F]
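The shapes involved can be illustrated with a truncated SVD in NumPy. This is a sketch on a small random matrix (the sizes, r = 2, and the seed are arbitrary assumptions); note that once r is smaller than the rank of A, the product C F only approximates A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 5)).astype(float)  # n = 8 rows, m = 5 columns

r = 2  # number of factors kept, smaller than m
U, s, Vt = np.linalg.svd(A, full_matrices=False)
C = U[:, :r] * s[:r]   # n x r: how each row mixes the factors
F = Vt[:r, :]          # r x m: the underlying factors
A_approx = C @ F       # best rank-r approximation of A in the least-squares sense

print(C.shape, F.shape)              # (8, 2) (2, 5)
print(np.linalg.norm(A - A_approx))  # error left by the discarded factors
```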


Two matrix decompositions are useful:

Singular value decomposition (SVD) – the rows of F are orthogonal axes such that the maximum possible variation in the data lies along the first axis; the maximum of what remains along the second, and so on. The rows of C are coordinates in this space.

Independent component analysis (ICA) – the rows of F are statistically independent factors. The rows of C describe how to mix these factors to produce the original data.

Strictly speaking, for ICA the rows of C are not coordinates, but we can plot them to get some idea of structure.
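For the SVD case, the coordinates of each message along the first few axes can be computed as below. Ordering rows by distance from the origin is one simple outlier score and is an assumption of this sketch; the slides rely on visual inspection of plots instead. The toy matrix is invented so that the last row is obviously unlike the others.

```python
import numpy as np

def svd_coordinates(A, k=3):
    """Coordinates of each row of A along the first k SVD axes."""
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    return U[:, :k] * s[:k]

def rank_by_distance(coords):
    """Row indices ordered from farthest to nearest the origin."""
    return np.argsort(-np.linalg.norm(coords, axis=1))

# Toy data: three identical rows and one row that uses different columns heavily.
A = np.array([[1, 0, 0],
              [1, 0, 0],
              [1, 0, 0],
              [0, 5, 5]])
coords = svd_coordinates(A, k=3)
print(rank_by_distance(coords)[0])  # 3, the unusual row
```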


First 3 dimensions – SVD

The messages with correlated unusual word usage are marked with red circles.


First 3 dimensions – ICA


(Fortunately) both unusual word use and correlated word use are necessary to make such messages detectable.

Correlation with proper word frequencies (SVD)

So ordinary conversations don’t show up as false positives!!


Correlation with proper word frequencies (ICA)


Uncorrelated with unusual word frequencies (SVD)

Conversations about unusual things don’t show up as false positives either!!


Uncorrelated with unusual word frequencies (ICA)


This trick permits a new level of sophistication in connecting related messages into conversations when the usual indicators are not available.

It does exactly the right thing – ignoring conversations about ordinary topics, and conversations about unusual topics, but homing in on conversations about unusual topics using inappropriate words.

Because the dataset is sparse, SVD takes time linear in the number of messages. The complexity of ICA is less clear but there are direct hardware implementations.


Message-rank matrices are useful because they defend against the countermeasure of rules like “use the word 5 ranks below the one you want to use”.

Such rules are easy to apply with access to the internet, for example the site www.fabrica.it/wordcount/main.php.

However, this isn’t so easy in real-time communication.
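The substitution rule quoted above can be sketched in a few lines. The ranked list here is a hypothetical stand-in for a real frequency-ordered noun list, and clamping at the end of the list is an assumption made to keep the example total.

```python
def shift_word(word, ranked_nouns, offset=5):
    """Replace a word by the one `offset` ranks below it in a frequency-ranked list."""
    i = ranked_nouns.index(word)
    # Clamp at the last entry so the rule is defined for every word (an assumption).
    return ranked_nouns[min(i + offset, len(ranked_nouns) - 1)]

# Hypothetical ranking, most frequent first.
ranked = ["time", "year", "people", "way", "man", "day", "thing", "child"]
print(shift_word("time", ranked))   # day
print(shift_word("thing", ranked))  # child (clamped)
```

Because every substituted word moves by the same number of ranks, the rule leaves word frequencies nearly unchanged while shifting the ranks in a consistent way.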


SVD of message-rank matrix has a fan shape.

Points are labelled with the length of each message.


Same plot with messages labelled by the average rank of the nouns they contain.

Length of message and average rank are correlated – partly because of opportunity, but it’s not clear that this is the whole story.


Replacing words with those, say, five positions down the list does not show up in the SVD of a message-frequency matrix:


But it’s very clear in the SVD of a message-rank matrix:


We have been applying these techniques to the Enron email dataset, which is a good surrogate for intercepted communications:

* about 500,000 emails
* about 1500 people
* partially known ‘command and control’ structure

Early results from several groups were presented at the Workshop on Link Analysis, Counterterrorism and Security:

www.cs.queensu.ca/home/skill/siamworkshop.html

also

New York Times Week in Review this weekend


?