Feature Selection with Linked Data in Social Media (cse.msu.edu/~tangjili/publication/SDM12.pdf)


Data Mining and Machine Learning Lab

Feature Selection with Linked Data in Social Media

Jiliang Tang and Huan Liu

Computer Science and Engineering

Arizona State University

April 26-28, 2012 SDM2012


Social Media

• The explosion of social media generates massive data at an unprecedented rate

- 200 million Tweets per day

- 3,000 photos on Flickr per minute

- 153 million blogs posted per year


Social Media Data

• Massive and high-dimensional social media data poses challenges to data mining tasks

- Scalability

- Curse of dimensionality

• Feature selection is an effective way to prepare large-scale, high-dimensional data for effective data mining


Feature Selection

• Traditional feature selection algorithms work with "flat" data (attribute-value data)

- Assumed to be independent and identically distributed (i.i.d.)

• Social media data differs from attribute-value data

- Inherently linked


An Example of Social Media Data

[Figure: four users (u1 to u4) and eight posts (p1 to p8), built up in four steps: users, posts, user-post relations, and user-user following.]


Representation for Attribute-Value Data

[Figure: the posts p1 to p8 described by features f1 … fm and labels c1 … ck; the three builds highlight posts, features, and labels in turn.]


Representation for Social Media Data

[Figure: the attribute-value representation (posts p1 to p8, features f1 … fm, labels c1 … ck) extended with the social context: a 0/1 user-post relation matrix and a 0/1 user-user relation matrix over users u1 to u4.]


Problem Statement

• Given labeled data X with its label indicator matrix Y, the whole dataset F, and its social context (user-user following relationships S and user-post relationships P), we aim to select the K most relevant features from the m features of F, using the social context S and P.
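To make the inputs of the problem statement concrete, the following is a minimal sketch of the five objects X, Y, F, S, and P as NumPy arrays. The sizes, the authorship assignment, the following edges, and the choice of storing one post per column are all assumptions made for illustration; the slides only name the symbols.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 8 posts, 4 users, m = 5 features, k = 2 classes.
n_posts, n_users, m, k = 8, 4, 5, 2

# F: the whole dataset, stored here with one column per post (m x n_posts).
F = rng.random((m, n_posts))

# X: the labeled part of F; Y: its label indicator (one-hot) matrix.
labeled = np.array([0, 2, 5, 7])          # assumed labeled posts
labels = np.array([0, 1, 1, 0])           # assumed class of each labeled post
X = F[:, labeled]                         # m x n_labeled
Y = np.eye(k)[labels]                     # n_labeled x k

# P: user-post relations, P[u, p] = 1 if user u wrote post p.
author = np.array([0, 0, 1, 1, 1, 2, 2, 3])   # assumed authorship
P = np.zeros((n_users, n_posts), dtype=int)
P[author, np.arange(n_posts)] = 1

# S: user-user following, S[i, j] = 1 if user i follows user j (assumed edges).
S = np.zeros((n_users, n_users), dtype=int)
for i, j in [(0, 1), (0, 3), (2, 3), (3, 0), (3, 1)]:
    S[i, j] = 1

K = 3  # number of features to select out of the m available
print(X.shape, Y.shape, F.shape, P.shape, S.shape)
```

The goal is then to return K row indices of F (features), chosen using both the labeled pair (X, Y) and the link structure in S and P.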


Two Fundamental Problems

• Relation extraction

- What distinctive relations can be extracted from linked data?

• Mathematical representation

- How can these relations be used in the feature selection formulation?


Relation Extraction

[Figure: the example network of users u1 to u4 and posts p1 to p8.]


coPost

• A user can have multiple posts


coFollowing

• Two users follow a third user

[Figure: users u1, u3, and u4 with their posts p1, p2, p6, p7, and p8.]


coFollowed

• Two users are followed by a third user

[Figure: users u1, u2, and u4 with their posts p1 to p5 and p8.]


Following

• A user follows another user

[Figure: users u1 and u2 with their posts p1 to p5.]
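The four relation types (coPost, coFollowing, coFollowed, Following) can be sketched as matrix operations on the user-post matrix and the following matrix. The function name, the matrix conventions (`P[u, p] = 1` if user u wrote post p, `S[i, j] = 1` if user i follows user j), and the toy edges below are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def post_post_relations(P, S):
    """Return four post x post 0/1 matrices, one per relation type.

    P[u, p] = 1 if user u wrote post p; S[i, j] = 1 if user i follows user j.
    """
    P = np.asarray(P, dtype=int)
    S = np.asarray(S, dtype=int)
    n_users, n_posts = P.shape
    off_diag = 1 - np.eye(n_users, dtype=int)

    # User-level relations (a user is never related to itself).
    co_following = ((S @ S.T) > 0).astype(int) * off_diag  # both follow a common user
    co_followed = ((S.T @ S) > 0).astype(int) * off_diag   # both have a common follower
    following = ((S + S.T) > 0).astype(int) * off_diag     # one follows the other

    def lift(U):
        # Posts a and b are related when their authors are related under U.
        return ((P.T @ U @ P) > 0).astype(int)

    # coPost: same author, excluding a post paired with itself.
    co_post = lift(np.eye(n_users, dtype=int)) * (1 - np.eye(n_posts, dtype=int))
    return {
        "coPost": co_post,
        "coFollowing": lift(co_following),
        "coFollowed": lift(co_followed),
        "following": lift(following),
    }

# Assumed toy network: u1 wrote p1, p2; u2 wrote p3, p4, p5; u3 wrote p6, p7; u4 wrote p8.
author = np.array([0, 0, 1, 1, 1, 2, 2, 3])
P = np.zeros((4, 8), dtype=int)
P[author, np.arange(8)] = 1
S = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 3), (2, 3), (3, 0), (3, 1)]:   # assumed following edges
    S[i, j] = 1

rel = post_post_relations(P, S)
```

Each returned matrix links posts rather than users, which is the form the later hypotheses are stated in.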


Post-Post relations

• What do these relations suggest for posts?


Social Correlation Theories

• Homophily

- People with similar interests are more likely to be linked

• Social influence

- People who are linked are more likely to have similar interests


CoPost Hypothesis

• CoPost Hypothesis

- Posts by the same user are more likely to be on similar topics

[Figure: user u2 with posts p3, p4, and p5.]


CoFollowing Hypothesis

• CoFollowing Hypothesis

- If two users follow the same user, their posts are likely to be on similar topics

[Figure: users u1, u3, and u4 with their posts p1, p2, p6, p7, and p8.]


CoFollowed Hypothesis

• CoFollowed Hypothesis

- If two users are followed by the same user, their posts are likely to be on similar topics

[Figure: users u1, u2, and u4 with their posts p1 to p5 and p8.]


Following Hypothesis

• Following Hypothesis

- If one user follows another, their posts are more likely to be on similar topics

[Figure: users u1 and u2 with their posts p1 to p5.]


Modeling CoFollowing Relation

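• Users' topic interests

$$\hat{T}(u_k) = \frac{\sum_{f_i \in F_k} T(f_i)}{|F_k|} = \frac{\sum_{f_i \in F_k} W^T f_i}{|F_k|}$$

• Two co-following users have similar topics of interest

$$\min_W \; \|X^T W - Y\|_F^2 + \alpha \|W\|_{2,1} + \beta \sum_{u} \sum_{u_i, u_j \in N_u} \|\hat{T}(u_i) - \hat{T}(u_j)\|_2^2$$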
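As a rough illustration of how the co-following objective could be evaluated, the sketch below computes each user's topic interest as the average of W^T f_i over that user's posts and then sums the three terms. It assumes that F_k is the set of posts written by u_k, that N_u is the set of users who follow u, that every user has at least one post, and that alpha and beta are the two regularization weights; the function names and the toy data are hypothetical.

```python
import numpy as np

def user_topic_interests(W, F, P):
    """Column k is T_hat(u_k): the mean of W^T f_i over the posts of user k."""
    counts = P.sum(axis=1)              # |F_k|, assumed > 0 for every user
    H = P.T / counts                    # n_posts x n_users, H[i, k] = 1/|F_k|
    return W.T @ F @ H                  # n_classes x n_users

def cofollowing_objective(W, X, Y, F, P, S, alpha, beta):
    fit = np.linalg.norm(X.T @ W - Y, "fro") ** 2
    l21 = np.linalg.norm(W, axis=1).sum()        # sum of row 2-norms of W
    T_hat = user_topic_interests(W, F, P)
    reg = 0.0
    for u in range(S.shape[0]):
        N_u = np.flatnonzero(S[:, u])            # assumed: users who follow u
        for i in N_u:
            for j in N_u:                        # ordered pairs; i == j adds 0
                reg += np.sum((T_hat[:, i] - T_hat[:, j]) ** 2)
    return fit + alpha * l21 + beta * reg

# Hypothetical toy data: 5 features, 8 posts, 4 users, 2 classes.
rng = np.random.default_rng(0)
F = rng.random((5, 8))
X, Y = F[:, [0, 2, 5, 7]], np.eye(2)[[0, 1, 1, 0]]
author = np.array([0, 0, 1, 1, 1, 2, 2, 3])
P = np.zeros((4, 8))
P[author, np.arange(8)] = 1
S = np.zeros((4, 4))
for i, j in [(0, 1), (0, 3), (2, 3), (3, 0), (3, 1)]:
    S[i, j] = 1

W = rng.standard_normal((5, 2))
value = cofollowing_objective(W, X, Y, F, P, S, alpha=0.1, beta=0.5)
```

Because the inner loop visits ordered pairs, each unordered pair is counted twice; that constant factor can be folded into beta.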


A Reformulation of CoFollowing Relation

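• It is equivalent to

$$\min_W \; \mathrm{Tr}(W^T B W - 2EW) + \alpha \|W\|_{2,1}$$

where

$$B = XX^T + \beta F H L_{FI} H^T F^T, \qquad E = Y^T X^T$$

$$H(i,j) = \begin{cases} \dfrac{1}{|F_j|} & \text{if } u_j \text{ is the author of } p_i \\ 0 & \text{otherwise} \end{cases}$$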
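The equivalence can be checked numerically on a small example. The sketch below assumes that FI is a symmetric 0/1 co-following matrix between users, that L_FI is its graph Laplacian (degree matrix minus FI), and that the pairwise term is taken with a factor of 1/2 over ordered pairs; under those assumptions the two objectives differ only by the constant ||Y||_F^2, which does not depend on W. All data here is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n_posts, n_users = 5, 2, 8, 4
F = rng.random((m, n_posts))               # all posts, one column each
X = F[:, :4]                               # assumed labeled posts
Y = np.eye(k)[[0, 1, 1, 0]]                # assumed one-hot labels
W = rng.standard_normal((m, k))
alpha, beta = 0.1, 0.5

# H(i, j) = 1/|F_j| if user j wrote post i, else 0.
author = np.array([0, 0, 1, 1, 1, 2, 2, 3])
counts = np.bincount(author, minlength=n_users)
H = np.zeros((n_posts, n_users))
H[np.arange(n_posts), author] = 1.0 / counts[author]

# Assumed co-following matrix FI and its Laplacian L_FI.
S = np.zeros((n_users, n_users))
for i, j in [(0, 1), (0, 3), (2, 3), (3, 0), (3, 1)]:
    S[i, j] = 1
FI = ((S @ S.T) > 0).astype(float)
np.fill_diagonal(FI, 0.0)
L_FI = np.diag(FI.sum(axis=1)) - FI

# Trace form.
B = X @ X.T + beta * (F @ H @ L_FI @ H.T @ F.T)
E = Y.T @ X.T
l21 = np.linalg.norm(W, axis=1).sum()
trace_form = np.trace(W.T @ B @ W - 2 * E @ W) + alpha * l21

# Direct form with explicit pairwise differences of user topic interests.
T_hat = W.T @ F @ H
pairwise = 0.5 * sum(
    FI[i, j] * np.sum((T_hat[:, i] - T_hat[:, j]) ** 2)
    for i in range(n_users) for j in range(n_users)
)
direct = np.linalg.norm(X.T @ W - Y, "fro") ** 2 + alpha * l21 + beta * pairwise

gap = direct - trace_form                  # expected: the constant ||Y||_F^2
```

The check rests on the standard identity that, for a symmetric weight matrix A with Laplacian L, the weighted sum of squared differences between columns of T equals 2 Tr(T L T^T).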


A Unique Problem for LinkedFS

• The LinkedFS framework is designed to solve the following optimization problem

$$\min_W \; \mathrm{Tr}(W^T B W - 2EW) + \alpha \|W\|_{2,1}$$
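The slides do not show the solver, so the following is only a sketch of one common way to handle an ℓ2,1-regularized problem of this shape: iteratively reweighted least squares, where the non-smooth term is replaced by a diagonal matrix D with entries 1/(2‖w_i‖) and W is re-solved in closed form. The function names, iteration count, and synthetic data are assumptions; features are ranked by the row norms of W and the top K are kept.

```python
import numpy as np

def solve_l21_trace(B, E, alpha, n_iter=50, eps=1e-8):
    """Approximately minimize Tr(W^T B W - 2 E W) + alpha * ||W||_{2,1}.

    Setting the gradient to zero with D fixed gives (B + alpha * D) W = E^T,
    where D is diagonal with D[i, i] = 1 / (2 * ||w_i||_2).
    """
    m = B.shape[0]
    D = np.eye(m)                                  # initial weights
    for _ in range(n_iter):
        W = np.linalg.solve(B + alpha * D, E.T)
        row_norms = np.linalg.norm(W, axis=1)
        D = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))  # eps avoids 1/0
    return W

def top_k_features(W, K):
    """Rank features by the 2-norm of their row in W, largest first."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(-scores)[:K]

# Synthetic check: only features 0 and 1 determine the label.
rng = np.random.default_rng(0)
m, n = 6, 500
X = rng.standard_normal((m, n))                    # features x samples
labels = (X[0] + X[1] > 0).astype(int)
Y = np.eye(2)[labels]                              # label indicator matrix
B = X @ X.T                                        # no social term in this toy case
E = Y.T @ X.T
W = solve_l21_trace(B, E, alpha=1.0)
selected = sorted(int(i) for i in top_k_features(W, 2))
```

In this toy case B contains only the data term; with linked data, the social regularizer would be added to B before calling the solver.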


LinkedFS


Datasets

• BlogCatalog

- Undirected following

http://dmml.asu.edu/users/xufei/datasets.html

• Digg

- Directed following

http://www.public.asu.edu/~ylin56/kdd09sup.html


Data Characteristics


Experiment Setting

• Metric

- Classification accuracy

- Classifier: LibSVM

• Baseline methods

- t-test (TT)

- Information Gain (IG)

- Fisher Score (FS)

- Joint ℓ2,1-norms (RFS)
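For readers unfamiliar with the baselines, here is a minimal sketch of one of them, the Fisher score, using its standard textbook definition: between-class scatter of a feature divided by its within-class scatter. The slides do not say which implementation was used, so the function below is illustrative only, and it takes samples as rows (the transpose of the features-by-samples layout used for X elsewhere in the talk).

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature. X: n_samples x n_features, y: class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / np.maximum(within, 1e-12)   # guard against zero variance

# Feature 0 separates the two classes; feature 1 does not.
X_demo = np.array([[0.0, 1.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.0]])
y_demo = np.array([0, 0, 1, 1])
scores = fisher_score(X_demo, y_demo)
ranking = np.argsort(-scores)                    # best feature first
```

Unlike the linked formulation, this score looks at each feature independently and ignores the social context entirely.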


Training and Testing

• Testing (50%) and training (50%)

• Subsample 5%, 25%, and 50% of the training data to construct another three training sets

• Numbers of selected features

- 50, 100, 200, 300
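The split protocol can be sketched as follows. The slides give only the percentages and the feature counts, so the random seed, the rounding rule, and the function name are assumptions.

```python
import numpy as np

def make_training_sets(n_samples, fractions=(0.05, 0.25, 0.50), seed=0):
    """Split indices 50/50 into test and training, then subsample the training half."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    test_idx, train_idx = perm[:half], perm[half:]
    subsets = {}
    for frac in fractions:
        size = max(1, int(round(frac * len(train_idx))))
        # Sample without replacement from the training half only.
        subsets[frac] = rng.choice(train_idx, size=size, replace=False)
    return test_idx, train_idx, subsets

feature_counts = (50, 100, 200, 300)   # numbers of selected features to evaluate

test_idx, train_idx, subsets = make_training_sets(1000)
sizes = {frac: len(idx) for frac, idx in subsets.items()}
```

Each (training set, feature count) pair then yields one accuracy figure per feature selection method, always measured on the same held-out test half.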


Results on Digg


Results on Digg


Performance Improvement


Conclusions

• Investigate a new problem of feature selection for social media data

• Provide a way to capture link information guided by social correlation theories

• Propose an effective framework, LinkedFS, for social media feature selection


Future Work

• More sophisticated ways to exploit social context

• Lack of label information (unsupervised)

• Noisy and incomplete social media data

• The strength of social ties (strong and weak ties mixed)


Acknowledgments

This work is sponsored in part by the National Science Foundation via grant #0812551. Comments and suggestions from DMML members and reviewers are greatly appreciated.


Questions