Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung.

Post on 20-Dec-2015


Alias Detection in Link Data Sets

Master’s Thesis

Paul Hsiung

Alias Definition

Aliases of names
– Dubya = G.W. Bush
– Usama = Osama
– G.W. Bush = the President
– Osama bin Laden = the Emir, the Prince

Misspelled words
– Unintentional (typos)
– Intentional: mortgage = m0rtg@ge (spam)

In What Context Do Aliases Occur?

Newspaper articles
Web pages
Spam emails
Any collection of text

Link Data Set

A way to represent the context
Composed of a set of names and a set of links
– Names are extracted from the text
– Different names can refer to the same entity (“Dubya” and “G.W. Bush”)
– Links are collections of names; each link represents a relationship between its names

Example

Wanted al-Qaeda terror network chief Osama bin Laden and his top aide, Ayman al-Zawahri, have moved out of Pakistan and are believed to have crossed the mountainous border back into Afghanistan.

Links extracted:
(Osama bin Laden, Ayman al-Zawahri, al-Qaeda)
(Pakistan, Osama bin Laden)
(Afghanistan, Osama bin Laden)
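The three links in this example can be stored directly as sets of names. The sketch below is one possible in-memory representation, not the thesis's own data format; the helper name `build_graph` is hypothetical. It derives, for each name, the names it shares at least one link with.

```python
from collections import defaultdict

# Links taken from the example: each link is a set of names.
links = [
    {"Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"},
    {"Pakistan", "Osama bin Laden"},
    {"Afghanistan", "Osama bin Laden"},
]

def build_graph(links):
    """Map each name to the set of names it shares at least one link with."""
    neighbors = defaultdict(set)
    for link in links:
        for name in link:
            neighbors[name] |= link - {name}
    return dict(neighbors)

graph = build_graph(links)
print(sorted(graph["Osama bin Laden"]))
```

Storing each link as a set keeps the original groupings, so the co-occurrence counts used later can still be computed from `links`.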

Graph Representation

[Figure: graph with nodes Osama, al-Qaeda, Ayman, Pakistan, Afghanistan]

Advantages

A link data set is easily processed by computers

It mimics the way intelligence communities gather data

Alias Detection

Given two names in a link data set, are they aliases (i.e., do they refer to the same entity)?

How do we measure their alias-ness?
Semi-supervised learning

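Orthographic Measures

String edit distance (SED)
– Minimum number of insertions, deletions, and substitutions required to transform one name into the other
– SED(Osama, Usama) = 2
– SED(Osama, Bush) = 7
– Intuitive measure

These two values are consistent with charging 2 units for a substitution (one deletion plus one insertion); with unit-cost substitutions the distances would be 1 and 5.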
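A minimal dynamic-programming sketch of string edit distance follows. The `sub_cost` parameter is an assumption introduced here, not something the slides define: with `sub_cost=2` the function reproduces the slide's values (2 and 7), while `sub_cost=1` gives the classic Levenshtein distance.

```python
def string_edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(string_edit_distance("Osama", "Usama", sub_cost=2))  # 2
print(string_edit_distance("Osama", "Bush", sub_cost=2))   # 7
```

The comparison is case-sensitive; lowercasing both names first would be a reasonable variation.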

Some Orthographic Measures

String edit distance
Normalized string edit distance
Discretized string edit distance

Semantic Measures

But what about aliases such as the Prince and Osama?

Define the friends of Osama as the names that occur in the same links as Osama

From the link data set, the number of co-occurrences with each friend can be counted

Intuition: the friends of the Prince look like the friends of Osama

Treat each friends list as a probability vector
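The friends idea can be sketched as a co-occurrence count followed by normalization. The toy `links` list and both function names are hypothetical, chosen only to illustrate the steps; the slides do not specify how the counts are normalized, so dividing by the total is an assumption.

```python
from collections import Counter

def friend_counts(links, name):
    """Count how often each other name co-occurs with `name` in a link."""
    counts = Counter()
    for link in links:
        if name in link:
            counts.update(n for n in link if n != name)
    return counts

def to_probabilities(counts):
    """Assumption: turn counts into a probability vector by dividing by the total."""
    total = sum(counts.values())
    return {friend: c / total for friend, c in counts.items()} if total else {}

# Hypothetical toy links, not data from the thesis.
links = [
    ("Osama", "al-Qaeda", "Islam"),
    ("Osama", "al-Qaeda"),
    ("Osama", "CNN"),
    ("The Prince", "CNN"),
]
counts = friend_counts(links, "Osama")
probs = to_probabilities(counts)
print(counts, probs)
```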

Example of Friends

[Figure: Osama connected to its friends, labeled with occurrence counts: al-Qaeda 10, Islam 5, CNN 2]

Comparing Two Friends Lists

Friend      Osama   The Prince
al-Qaeda    10      2
Islam       5       -
Music       -       50
CNN         2       8

Some Semantic Measures

Dot product: 10 * 2 + 2 * 8 = 36
Normalized dot product
Common friends: 2 (CNN, al-Qaeda)
KL distance: [formula not preserved in the transcript]
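The measures listed here can be sketched on the two friends lists from the comparison slide. The dot product and common-friends count follow the slide's arithmetic. The other two are assumptions: the normalized dot product is taken to be cosine similarity, and the KL distance is the standard Kullback-Leibler divergence with a small additive smoothing term `eps` so that friends missing from one list do not cause a division by zero. The thesis may define them differently.

```python
import math

osama = {"al-Qaeda": 10, "Islam": 5, "CNN": 2}
prince = {"al-Qaeda": 2, "Music": 50, "CNN": 8}

def dot(u, v):
    """Sum of count products over friends present in both lists."""
    return sum(c * v[f] for f, c in u.items() if f in v)

def normalized_dot(u, v):
    # Assumption: normalize by Euclidean length (cosine similarity).
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def common_friends(u, v):
    return len(u.keys() & v.keys())

def kl_distance(u, v, eps=1e-6):
    # Assumption: smooth both vectors over the union of friends, then normalize.
    friends = u.keys() | v.keys()
    pu = {f: u.get(f, 0) + eps for f in friends}
    pv = {f: v.get(f, 0) + eps for f in friends}
    su, sv = sum(pu.values()), sum(pv.values())
    return sum((pu[f] / su) * math.log((pu[f] / su) / (pv[f] / sv)) for f in friends)

print(dot(osama, prince))             # 36
print(common_friends(osama, prince))  # 2
```

Note that KL divergence is asymmetric, so `kl_distance(osama, prince)` and `kl_distance(prince, osama)` generally differ.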

Classifier

So we have a link data set
We have some measures of what aliases are
We can easily hand-pick some examples of aliases
Let’s build a classifier!

Classifier Training Set

Positive examples: hand-picked pairs of names in the link data set that are known aliases

Negative examples: randomly picked pairs of names from the same link data set

Calculate the measures for every pair and insert them as attributes into the training set
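The training-set construction can be sketched as follows. The function name, the toy names, and the two toy measures are hypothetical. Skipping a random pair that happens to be a known alias is an assumption; the slides only say negatives are picked at random.

```python
import random

def build_training_set(names, alias_pairs, measures, n_negative, seed=0):
    """Return (attribute_vector, label) rows: label 1 for known aliases, 0 for random pairs."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in alias_pairs}
    rows = [([m(a, b) for m in measures], 1) for a, b in alias_pairs]
    negatives = 0
    while negatives < n_negative:
        a, b = rng.sample(names, 2)
        if frozenset((a, b)) in known:
            continue  # Assumption: do not label a known alias pair as negative.
        rows.append(([m(a, b) for m in measures], 0))
        negatives += 1
    return rows

# Hypothetical toy data and measures, for illustration only.
names = ["Osama", "Usama", "The Prince", "Bob", "Sid", "John"]
alias_pairs = [("Osama", "Usama"), ("Osama", "The Prince")]
measures = [
    lambda a, b: abs(len(a) - len(b)),                  # length difference
    lambda a, b: len(set(a.lower()) & set(b.lower())),  # shared letters
]
rows = build_training_set(names, alias_pairs, measures, n_negative=2)
```

The loop assumes enough non-alias pairs exist to supply `n_negative` examples; otherwise it would not terminate.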

Classifier Example

Classifier: Cross-Validation

Experimented with Decision Trees, k-Nearest Neighbors, Naïve Bayes, Support Vector Machines, and Logistic Regression

Logistic Regression performed the best
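For readers unfamiliar with logistic regression, here is a minimal pure-Python sketch trained by batch gradient descent. It is not the implementation used in the thesis; the learning rate, epoch count, and the four toy rows (edit distance, normalized dot product) are made-up assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    """Estimated probability that attribute vector x belongs to an alias pair."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def train_logistic_regression(rows, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(rows[0][0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in rows:
            err = score(w, x) - y
            grad[0] += err
            for i, xi in enumerate(x, start=1):
                grad[i] += err * xi
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Hypothetical rows: ([edit distance, normalized dot product], label).
rows = [([1.0, 0.9], 1), ([2.0, 0.8], 1), ([6.0, 0.1], 0), ([7.0, 0.2], 0)]
w = train_logistic_regression(rows)
print(score(w, [1.0, 0.9]), score(w, [7.0, 0.2]))
```

The output of `score` is what the later prediction step would use to rank candidate pairs.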

Prediction

Given a query name in the link data set with known aliases:

Pair the query name with ALL other names
Calculate attributes for all pairs
Run each pair through the classifier and obtain a score (how likely are they to be aliases?)
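The prediction steps can be sketched as a scoring loop over every other name. `rank_candidates`, `attributes`, and `score` are hypothetical names; the toy attribute (shared letters) and toy score stand in for the real measures and the trained classifier.

```python
def rank_candidates(query, names, attributes, score):
    """Score the query against every other name; highest score first."""
    scored = [(score(attributes(query, other)), other)
              for other in names if other != query]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored]

# Hypothetical stand-ins for the attribute calculation and the classifier.
names = ["Osama", "Usama", "Bob", "Sid"]
attributes = lambda a, b: [len(set(a.lower()) & set(b.lower()))]
score = lambda x: x[0] / 10

ranked = rank_candidates("Osama", names, attributes, score)
print(ranked[0])  # Usama
```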

Example

Prediction

Use the score to sort the pairs from most likely to be an alias to least likely

See where the true aliases lie in the sorted list and produce a ROC curve

Evaluate classifier based on ROC curve

Summary

[Figure: pipeline diagram]
Training: true alias pairs (no query name) + random pairs → Calc Attributes → Train Logistic Regression
Prediction: query name → Calc Attributes → Run Classifier → ROC curve

ROC Curve

Start from (0,0) on the graph
Go down the sorted list
If the name on the list is a true alias, move y by one unit
If the name on the list is not a true alias, move x by one unit
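The walk described here translates almost line for line into code. In this sketch the ranked list is represented as a list of booleans (True for a true alias), which is a representation chosen for illustration.

```python
def roc_points(is_alias_sorted):
    """Walk the ranked list: a true alias moves up, anything else moves right."""
    x = y = 0
    points = []
    for is_alias in is_alias_sorted:
        if is_alias:
            y += 1
        else:
            x += 1
        points.append((x, y))
    return points

# All true aliases ranked first: the curve goes straight up, then right.
print(roc_points([True, True, True, False, False, False]))
# [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
```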

Perfect ROC Example

[Figure: ROC plot, both axes running from 0 to 3]

name1   name2        true alias?   Position
Osama   The Prince   Yes           (0,1)
Osama   Usama        Yes           (0,2)
Osama   The Emir     Yes           (0,3)
Osama   Sid          No            (1,3)
Osama   Bob          No            (2,3)
Osama   John         No            (3,3)

ROC Example

[Figure: ROC plot, both axes running from 0 to 3]

name1   name2        true alias?   Position
Osama   The Prince   Yes           (0,1)
Osama   Bob          No            (1,1)
Osama   Usama        Yes           (1,2)
Osama   Sid          No            (2,2)
Osama   John         No            (3,2)
Osama   The Emir     Yes           (3,3)

ROC: Normalize

[Figure: ROC plot with both axes normalized to run from 0 to 1 (ticks at 0.3, 0.6, 1)]

Balances positive and negative examples

Area under curve (AUC) = 5/9

Able to average multiple curves
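The 5/9 figure can be checked with a short calculation. The sketch assumes the AUC is the area under the step curve after dividing each axis by its maximum; applied to the ranking from the ROC Example slide (Yes, No, Yes, No, No, Yes), it gives exactly 5/9.

```python
from fractions import Fraction

def normalized_auc(is_alias_sorted):
    """Area under the step ROC curve with both axes scaled to [0, 1]."""
    positives = sum(1 for flag in is_alias_sorted if flag)
    negatives = len(is_alias_sorted) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one alias and one non-alias")
    area = 0
    height = 0
    for is_alias in is_alias_sorted:
        if is_alias:
            height += 1     # step up
        else:
            area += height  # step right: add a unit-wide column of the current height
    return Fraction(area, positives * negatives)

print(normalized_auc([True, False, True, False, False, True]))  # 5/9
```

A perfect ranking yields an AUC of 1, and the value can be read as the fraction of (alias, non-alias) pairs ranked in the right order.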

Empirical Results

Tested on one web page link data set and two spam link data sets

Aliases were hand-picked for each set

Empirical Results

Choose an alias from the set of hand-picked aliases as the query name

Build the classifier from the other aliases, i.e. those that are not aliases of the query name

Run prediction and obtain a ROC curve
Repeat for each alias in the set of hand-picked aliases
Average all ROC curves on the normalized axes
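Averaging on normalized axes can be sketched by sampling each normalized curve on a shared grid of x values and averaging the heights. The grid-based scheme, the function names, and the choice to take the highest y reached at each x are assumptions; the slides do not say how the averaging was done.

```python
def normalized_curve(is_alias_sorted):
    """ROC points with x and y each divided by their maximum."""
    pos = sum(1 for flag in is_alias_sorted if flag)
    neg = len(is_alias_sorted) - pos  # assumes pos > 0 and neg > 0
    x = y = 0
    points = [(0.0, 0.0)]
    for is_alias in is_alias_sorted:
        if is_alias:
            y += 1
        else:
            x += 1
        points.append((x / neg, y / pos))
    return points

def height_at(points, x):
    # Assumption: use the highest y the curve reaches at or before this x.
    return max(py for px, py in points if px <= x)

def average_curves(rankings, grid_size=11):
    curves = [normalized_curve(r) for r in rankings]
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return [(x, sum(height_at(c, x) for c in curves) / len(curves)) for x in grid]

rankings = [
    [True, True, True, False, False, False],
    [True, False, True, False, False, True],
]
averaged = average_curves(rankings, grid_size=4)
print(averaged)
```

Because every curve is rescaled to the unit square first, queries with different numbers of aliases or candidates contribute equally to the average.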

Evaluation

We want to know how significant each group of attributes is

Train one classifier with just the orthographic attributes

Train another with just the semantic attributes
Train a third with both sets of attributes
Compare the ROC curves and the area under the curve (AUC)

Terrorist Data Set

Manually extracted from public web pages
News and articles related to terrorism
Names mentioned in the articles are subjectively linked
Used 919 alias pairs for training

Web Page Chart

Spam Data Set

Collection of spam emails
HTML tags are filtered out
All words are converted to tokens, with white space as the boundary
Common tokens are filtered out (e.g. “the”, “a”)
Each email represents a link
Each link contains the tokens from the corresponding email

Example

Subject:Mortgage rates as low as 2.95%Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day to as low as2.<sppyjukbywvbqc>95% Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of

dol<scarqdscpvibyw>l<sklhxmxbvdr>ars or b<skaavzibaenix>uy the <br>ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur dr<sxjsfyvhhejoldl>eams!<br>

Filtered to:
(mortgage, rates, low, refinance, today, save, thousands, dollars, home, dreams)
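The spam preprocessing can be sketched with a regular expression and a stop list. The stop list, the punctuation trimming, and the function name `email_to_link` are assumptions; the slides only say that tags are removed, text is split on white space, and common tokens are dropped. The sample string reuses a fragment of the obfuscated email above.

```python
import re

STOP_WORDS = {"the", "a", "of", "to", "as", "or", "your"}  # hypothetical stop list

def email_to_link(text, stop_words=STOP_WORDS):
    """Strip HTML-style tags, split on whitespace, and drop common tokens."""
    no_tags = re.sub(r"<[^>]*>", "", text)  # removing tags rejoins split words
    tokens = []
    for raw in no_tags.lower().split():
        token = raw.strip(".,!?:;%")  # Assumption: trim surrounding punctuation.
        if token and token not in stop_words:
            tokens.append(token)
    return tuple(tokens)

email = ("Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of "
         "dol<scarqdscpvibyw>l<sklhxmxbvdr>ars")
print(email_to_link(email))  # ('save', 'thousands', 'dollars')
```

Deleting the tags rather than replacing them with spaces is what lets the spammer's split words rejoin into ordinary tokens.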

Spam I Chart

Spam II Chart

Conclusion

Orthographic measures work well
Semantic measures are sometimes better and sometimes worse than orthographic ones
Combining them produces the best results
Future work includes adding other measures, such as phonetic string edit distance
Larger question: many aliases to many names