An Empirical Study on Selective Sampling in Active...

An Empirical Study on Selective Sampling in Active Learning for Splog Detection. Taichi Katayama (1), Takehito Utsuro (1), Yuuki Sato (2), Takayuki Yoshinaka (3), Yasuhide Kawada (4), Tomohiro Fukuhara (5). (1) University of Tsukuba, (2) Konami Corporation, (3) Tokyo Denki University, (4) Navix Co., Ltd., (5) University of Tokyo. AIRWeb 2009, April 21st, 2009, Madrid, Spain (WWW2009).


Page 1: An Empirical Study on Selective Sampling in Active ...airweb.cse.lehigh.edu/2009/slides/Katayam-active... · 1 An Empirical Study on Selective Sampling in Active Learning for Splog

An Empirical Study on Selective Sampling in Active Learning for Splog Detection

Taichi Katayama (1), Takehito Utsuro (1), Yuuki Sato (2), Takayuki Yoshinaka (3), Yasuhide Kawada (4), Tomohiro Fukuhara (5)

(1) University of Tsukuba, (2) Konami Corporation, (3) Tokyo Denki University, (4) Navix Co., Ltd., (5) University of Tokyo

AIRWeb 2009, April 21st, 2009, Madrid, Spain (WWW2009)

Background

• Opinion mining from blogs

• Splogs are serious noise in opinion mining
  – e.g., larger-scale statistics (Mar. 2008)
    • 40% of Japanese blog articles in BuzzPulse, nifty were splogs, Oct. 2007 to Feb. 2008

• Automatic detection is strongly needed.

[Figure: keyword-stuffed blog]

[Figure: blog snippet retrieved with “FC Tokyo”, showing a rumor of “FC Tokyo” (a football team in Japan)]

[Figure: blog snippet retrieved with “LOUIS VUITTON Key case”. Annotation: pop-up advertisement automatically inserted by the blog host system]

$50 Software Package for Massive Splog Creation

Featuring
• SEO
• Affiliate Program

[Figure: satellite sites with in-links to a main site]


Previous studies on splog detection

• [P.Kolari 2007]
  – Words
  – URLs
  – Anchor texts
  – Links
  – HTML meta tags

• [Y.-R.Lin 2007]
  – Temporal self-similarities of
    • Posting time
    • Posting contents
    • Affiliated links

• [G.Mishne 2005]
  – Language models among the blog post, the comment, and pages linked by the comments

Evaluation with two data sets: “Do splogs change over time?”

1. Years 2007-2008 (720 sites)
2. Years 2008-2009 (720 sites)

[Figure: Recall/precision curves with confidence measure. Two panels: “Splog detection” and “Authentic blog detection”. x-axis: Recall (%), y-axis: Precision (%). Curves: Train 07-08 (720 sites); Train 07-08 (360 sites) + 08-09 (360 sites). Test: 08-09 (40 sites).]

Purpose of This Research (1)

• The splog/authentic blog data sets need to be updated continuously, year by year

• How can human supervision be reduced?

• Can an active learning framework work?

Purpose of This Research (2)

• Optimal strategies for selective sampling in active learning

• Guided by a certain confidence measure

[Figure labels: random samples; samples balanced with a confidence measure; samples with the least confidence]

Outline

1. Definition of splog sites
2. Splog detection by machine learning
   – SVM
   – Confidence measure
   – Features
3. Active learning
4. Evaluation
5. Future works

Definition of splog sites

• If one of the following holds for a given blog site, then it is in most cases a splog:
  – originally written text is not included
  – originally written text is included, but many
    • “links to affiliated sites” or
    • “advertisement articles” or
    • “articles with adult content”
    are included (judged individually by considering the contents of each blog)

• Otherwise, the given blog site is an authentic blog

Splog Detection by SVMs

• Tool: TinySVM

• Kernel function: 2nd order linear

• Confidence measure: the distance from the separating hyperplane to each test instance
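As a rough illustration of the confidence measure named on this slide, the sketch below computes the distance from an instance to a separating hyperplane. It assumes a plain linear decision function w·x + b for simplicity; the weight vector, bias, and function names are hypothetical and are not taken from TinySVM or from the slides.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b of one instance (linear form, assumed for illustration)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def confidence(w, b, x):
    """Distance from x to the hyperplane w.x + b = 0, used here as the confidence."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hypothetical hyperplane and test instance.
w, b = [3.0, 4.0], -5.0
print(confidence(w, b, [3.0, 4.0]))  # (9 + 16 - 5) / 5 = 4.0
```

The sign of `decision_value` tells which side of the hyperplane the instance falls on; the distance says how far from the boundary it is.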

A Confidence Measure

[Figure: splog and authentic blog instances on either side of the separating hyperplane, with “Lower Bound (splog)” and “Lower Bound (authentic blog)” marked.]

Features for splog detection

1. Total frequency of URLs not linked from splogs
2. Co-occurrence between noun phrases and splogs
   • Sum of $\phi^2(\mathrm{splog}, \text{noun phrase } w)$
3. Noun phrases in anchor texts and linked URLs
   • Total frequency of anchor text noun phrases
     • in splogs
     • out-linked to splog URLs and Blacklist URLs
   • Total frequency of anchor text noun phrases
     • in splogs
     • out-linked to authentic blog URLs and Whitelist URLs

Feature 1: URLs not linked from splogs

[Figure: URLs linked from splogs and from authentic blogs.]

• Blacklist: defined as the URLs that have more than one inward link from splogs and are included only in splogs
• Whitelist: defined as the URLs that have more than one inward link from authentic blogs and are included only in authentic blogs

Value of the Whitelist URLs feature

$$\sum_{u} \log\Bigl(\text{total frequency of } u \text{ in the whole training instances of authentic blog homepages}\Bigr) \times \Bigl(\text{total frequency of } u \text{ in the test instance}\Bigr)$$

u: Whitelist URLs
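A minimal sketch of how such a Whitelist URLs feature value could be computed. It assumes the value is a sum over Whitelist URLs u of log(total frequency of u in the training authentic blog homepages) multiplied by the total frequency of u in the test instance; that reading of the formula, the dictionary layout, and the function name are all assumptions for illustration.

```python
import math

def whitelist_feature(train_freq, test_freq, whitelist):
    """Sum over whitelist URLs of log(training frequency) * test frequency.

    train_freq: URL -> total frequency in training authentic blog homepages
    test_freq:  URL -> total frequency in the test instance
    (both mappings are hypothetical inputs)
    """
    value = 0.0
    for u in whitelist:
        f_train = train_freq.get(u, 0)
        f_test = test_freq.get(u, 0)
        # Skip URLs unseen on either side: the log would be undefined
        # or the product would be zero.
        if f_train > 0 and f_test > 0:
            value += math.log(f_train) * f_test
    return value

train = {"http://a.example/": 8, "http://b.example/": 1}
test = {"http://a.example/": 2, "http://c.example/": 5}
print(whitelist_feature(train, test, {"http://a.example/", "http://b.example/"}))
```

Only `http://a.example/` contributes in this toy input, since it is the only Whitelist URL present in both the training counts and the test instance.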


Feature 2: Noun Phrases

w: a noun phrase

Counts over the training set:

                  contains w                     does not contain w (¬w)
splog             freq(splog, w) = a             freq(splog, ¬w) = b
authentic blog    freq(authentic blog, w) = c    freq(authentic blog, ¬w) = d

Value of the splog noun phrase feature

$$\phi^2(\mathrm{splog}, w) = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

The feature value is a sum over noun phrases $w$ that combines $\log$, $\phi^2(\mathrm{splog}, w)$, and the total frequency of $w$ in the test instance.
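The co-occurrence statistic between a noun phrase w and the splog class can be sketched as below, using the counts a, b, c, d from the training set (splog with w, splog without w, authentic blog with w, authentic blog without w). The function name and the choice to return 0.0 for an empty margin are assumptions made for this example.

```python
def phi_squared(a, b, c, d):
    """(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) for a 2x2 contingency table."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # One row or column is empty; treated as no association (assumption).
        return 0.0
    return (a * d - b * c) ** 2 / denom

print(phi_squared(10, 0, 0, 10))  # w occurs only in splogs -> 1.0
print(phi_squared(5, 5, 5, 5))    # w is independent of the class -> 0.0
```

Values near 1 indicate that the noun phrase occurs almost exclusively in one class; values near 0 indicate little association.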


Feature 3: Noun Phrases in Anchor Texts and linked URLs

[Figure: anchor texts of the form <a href=...>... w ...</a> in a splog site s, linking out to Blacklist URLs and splog URLs, to Whitelist URLs and authentic blog URLs, and to other URLs.]

w: a noun phrase in anchor text

AncfB(w,s) = freq of w in anchor texts of s out-linked to Blacklist URLs and splog URLs

AncfW(w,s) = freq of w in anchor texts of s out-linked to Whitelist URLs and authentic blog URLs

Noun Phrases in Anchor Texts and linked URLs: two features

The value of a feature named anchor text noun phrase out-linked to Blacklist URLs, for a test instance blog homepage:

$$\sum_{w} \log\Bigl(\sum_{s} \mathrm{AncfB}(w, s)\Bigr) \times \mathrm{AncfB}(w, t)$$

The value of a feature named anchor text noun phrase out-linked to Whitelist URLs, for a test instance blog homepage:

$$\sum_{w} \log\Bigl(\sum_{s} \mathrm{AncfW}(w, s)\Bigr) \times \mathrm{AncfW}(w, t)$$

w: a noun phrase
s: a training splog homepage
t: a test instance blog homepage

Framework of Active learning

• Pool of unlabeled instances: initial size of 3504 (1296 splogs and 2208 authentic blogs)
• Training set: initial size of 10 (4 splogs and 6 authentic blogs)
• Each cycle:
  1. Train an SVM classifier on the training set
  2. Selective sampling in active learning picks 4 unlabeled sites from the pool
  3. Human supervision labels the 4 sites
  4. The 4 labeled sites are added to the training set
• 250 cycles, up to 1010 training instances
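The cycle on this slide can be sketched as a generic pool-based loop. This is only an outline under stated assumptions: `train`, `select`, and `oracle` are hypothetical callables standing in for SVM training, the selective sampling strategy, and human labeling, and the toy inputs below are not the real data.

```python
import random

def active_learning(pool, initial, train, select, oracle, cycles=250, batch=4):
    """Pool-based active learning loop; all callables are hypothetical stand-ins."""
    training = list(initial)   # labeled (instance, label) pairs
    pool = list(pool)          # unlabeled instances
    model = train(training)
    for _ in range(cycles):
        if len(pool) < batch:
            break
        picked = select(model, pool, batch)    # selective sampling
        for x in picked:
            pool.remove(x)
            training.append((x, oracle(x)))    # human supervision supplies the label
        model = train(training)                # retrain on the enlarged training set
    return model, training

def random_select(model, pool, k):
    """Baseline strategy: ignore the model and draw k instances at random."""
    return random.sample(pool, k)

# Toy run with the sizes from the slide: pool of 3504, initial training set of 10.
pool = list(range(3504))
initial = [(-i, i % 2) for i in range(1, 11)]
model, training = active_learning(
    pool, initial,
    train=lambda tr: len(tr),      # dummy "model": just the training set size
    select=random_select,
    oracle=lambda x: x % 2,        # dummy labeler
)
print(len(training))  # 10 + 250 * 4 = 1010
```

Swapping `random_select` for a confidence-based selector is the only change needed to compare sampling strategies within the same loop.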

Statistics of Splog/Authentic Blogs Data Set

Data Sets          # of splogs    # of authentic blogs    total
Years 2008-2009    1445           2459                    3904

Strategies of selective sampling (1/2)

[Figure: four panels, “Low”, “High”, “High/Low”, and “Balanced”, each showing the sampled splog and authentic blog instances relative to the separating hyperplane.]
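One possible reading of the four strategy names is sketched below. The slides do not spell out the selection rules in text, so the interpretations here are assumptions: “Low” takes the least confident instances, “High” the most confident, “High/Low” half from each end, and “Balanced” instances spread evenly over the confidence ranking. All function names are hypothetical.

```python
def _ranked(scored):
    """Sort (instance, confidence) pairs by ascending confidence."""
    return sorted(scored, key=lambda pair: pair[1])

def select_low(scored, k):
    """Assumed 'Low': the k least confident instances."""
    return [x for x, _ in _ranked(scored)[:k]]

def select_high(scored, k):
    """Assumed 'High': the k most confident instances."""
    return [x for x, _ in reversed(_ranked(scored)[-k:])]

def select_high_low(scored, k):
    """Assumed 'High/Low': half from the low end, half from the high end."""
    ranked = _ranked(scored)
    n_high = k // 2
    n_low = k - n_high
    return [x for x, _ in ranked[:n_low]] + [x for x, _ in ranked[len(ranked) - n_high:]]

def select_balanced(scored, k):
    """Assumed 'Balanced': k instances evenly spaced over the confidence ranking."""
    ranked = _ranked(scored)
    n = len(ranked)
    if k == 1:
        return [ranked[n // 2][0]]
    return [ranked[i * (n - 1) // (k - 1)][0] for i in range(k)]

scored = [("a", 0.1), ("b", 0.5), ("c", 0.9), ("d", 1.5), ("e", 2.0), ("f", 3.0)]
print(select_low(scored, 2))       # ['a', 'b']
print(select_high(scored, 2))      # ['f', 'e']
print(select_high_low(scored, 4))  # ['a', 'b', 'e', 'f']
print(select_balanced(scored, 3))  # ['a', 'c', 'f']
```

Each function assumes the pool holds at least k scored instances.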

Strategies of selective sampling (2/2)

[Figure: four panels, “Low-Sp/Au”, “High-Au”, “High/Low-Au”, and “Balanced-Sp/Au”, each showing the sampled splog and authentic blog instances relative to the separating hyperplane.]


Measure for Performance evaluation after active learning cycles

• Recall/Precision
  – Splog detection

$$\mathrm{recall} = \frac{|Ts(\mathrm{splog}) \cap Ts(LBD_s)|}{|Ts(\mathrm{splog})|}, \qquad \mathrm{precision} = \frac{|Ts(\mathrm{splog}) \cap Ts(LBD_s)|}{|Ts(LBD_s)|}$$

  – Authentic blog detection is considered in a similar fashion

• “|Tr| = 3500”, “Random”
  – “|Tr| = 3500” indicates a classifier trained with the whole 3504 instances in the pool
  – “Random” indicates a classifier trained with randomly selected training instances
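A short sketch of recall and precision for splog detection at a given lower bound of the confidence measure. It assumes that Ts(LBD_s) is the set of test sites whose splog-side confidence is at least the lower bound LBD_s; that interpretation, the score dictionary, and the function name are assumptions for illustration.

```python
def recall_precision(scores, reference_splogs, lbd):
    """Recall and precision of splog detection at lower bound lbd.

    scores: site -> signed confidence, positive on the splog side (assumed layout)
    reference_splogs: set of sites labeled as splogs in the reference, Ts(splog)
    """
    detected = {site for site, s in scores.items() if s >= lbd}   # assumed Ts(LBD_s)
    hits = detected & reference_splogs
    recall = len(hits) / len(reference_splogs) if reference_splogs else 0.0
    precision = len(hits) / len(detected) if detected else 0.0
    return recall, precision

scores = {"s1": 2.0, "s2": 0.5, "a1": 1.0, "a2": -1.0}
reference = {"s1", "s2"}
print(recall_precision(scores, reference, 0.8))  # (0.5, 0.5)
```

Sweeping `lbd` from high to low traces out a recall/precision curve: a stricter bound keeps fewer, more confident detections.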

Lower Bound of the Confidence Measure

[Figure: splog and authentic blog instances around the separating hyperplane, with “Lower Bound (splog)” and the set $Ts(LBD_s)$ marked.]

Ts(splog): the set of reference splog sites


[Figure: Recall/precision curve of splog detection. x-axis: Recall (%), y-axis: Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]

[Figure: Recall/precision curve of authentic blog detection. x-axis: Recall (%), y-axis: Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]

Evaluation results: comparison of strategies for selective sampling

[Figure: comparison of strategies, contrasting “Previous studies of active learning for text classification tasks” with “Splog/authentic blog detection”. Strategy labels shown: |Tr|=3500, Random, High/Low, Balance, High, Low.]

Support Vectors

• Only the support vectors affect the position of the separating hyperplane
• The number of support vectors can be regarded as the complexity of the learning task

[Figure: separating hyperplane with the support vectors marked.]

[Figure: Changes in # of support vectors. x-axis: # of training instances, y-axis: # of support vectors. Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au. Annotated groups: Random, High/Low, Balance; Low, High.]

Evaluation result: # of support vectors

• The number of support vectors increases linearly
• The performance of splog/authentic blog detection increases much more slowly
• About 20% of training instances are constantly selected as support vectors

• In this task, more effective features should be added.

[Figure: Change in maximum precision with recall at 30% for splog detection. x-axis: # of training instances, y-axis: Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]

[Figure: Change in maximum precision with recall at 30% for authentic blog detection. x-axis: # of training instances, y-axis: Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]


Future works

• Incorporating other features
  – Post time and intervals
  – HTML structures

• Manual examination of support vectors


Thanks for your attention