SIMULTANEOUSLY MODELING SEMANTICS AND STRUCTURE OF THREADED DISCUSSIONS: A SPARSE CODING APPROACH AND ITS APPLICATIONS

Chen LIN *, Jiang-Ming YANG +, Rui CAI +, Xin-jing WANG +, Wei WANG *, Lei ZHANG +

* Fudan University, + Microsoft Research Asia

OUTLINE

Motivation
Challenges
Model
Application
    Reply reconstruction
    Junk post detection
    Expert finding
Experiments
Conclusion

THREADED DISCUSSIONS

Mailing lists

Chat rooms

IMs

Web forums

[Figure: thread with a root post and its replies]

IMPORTANT DATA SOURCES

MINING SEMANTICS & STRUCTURE

Junk Identification

Expert Search

Measure post quality

…

CHALLENGE

Semantics & Structure

SEMANTICS & STRUCTURE

Semantics: topics

Structure: who replies to whom

CHALLENGE

Junk Post

JUNK POST

CHALLENGE

Post Quality

POST QUALITY

valuable post

MODEL

Purpose: simultaneously modeling
    Semantics
    Structure

Methodology
    Intuitive
    Matrix based
    Sparse coding

[Figure: thread with a root post and its replies]

INTUITION

A THREAD HAS SEVERAL TOPICS

SEMANTIC REPRESENTATION OF THREAD

D ≈ X Θ

Minimize: reconstruction error between D and XΘ

D: words (word1 … wordV) × posts (post1 … postL)
X: words (word1 … wordV) × topics (topic1 … topicT)
Θ: topics (topic1 … topicT) × posts (post1 … postL)

Project posts to topic space
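The factorization D ≈ XΘ can be illustrated with a minimal NumPy sketch. It assumes the topic matrix X is already known and projects each post onto the topic space by ordinary least squares, which is a simplification and not necessarily the optimization the authors use. The function names and toy values are hypothetical.

```python
import numpy as np

def reconstruction_error(D, X, Theta):
    """Squared Frobenius norm of the residual D - X @ Theta."""
    R = D - X @ Theta
    return float(np.sum(R * R))

def project_posts(D, X):
    """Least-squares projection of each post (column of D) onto the topics (columns of X)."""
    Theta, *_ = np.linalg.lstsq(X, D, rcond=None)
    return Theta

# Toy thread (illustrative values): V = 4 words, T = 2 topics, L = 3 posts.
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
Theta_true = np.array([[2.0, 0.0, 1.0],
                       [0.0, 3.0, 1.0]])
D = X @ Theta_true  # word-by-post matrix

Theta = project_posts(D, X)
err = reconstruction_error(D, X, Theta)
```

Here each column of `Theta` is one post expressed in topic space; the third post mixes both topics.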

A POST IS RELATED TO PREVIOUS POSTS

Minimize: error of approximating each post by previous posts

Θ: topics (topic1 … topicT) × posts (post1 … postL)

b: approximate each post as a linear combination of previous posts
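As a sketch of the structural idea, the weights b for one post can be fitted by least squares against the topic vectors of all earlier posts. This assumes Θ is fixed and ignores any regularization; `fit_reply_weights` is a hypothetical helper, not a name from the slides.

```python
import numpy as np

def fit_reply_weights(Theta, i):
    """Weights b such that Theta[:, i] is approximately Theta[:, :i] @ b."""
    if i == 0:
        return np.zeros(0)  # the root post has no previous posts
    b, *_ = np.linalg.lstsq(Theta[:, :i], Theta[:, i], rcond=None)
    return b

# Toy topic-by-post matrix: post 2 is twice post 0 and unrelated to post 1.
Theta = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]])
b = fit_reply_weights(Theta, 2)
```

A large entry in `b` suggests that the corresponding earlier post is the one being replied to.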

A POST IS RELATED TO A FEW TOPICS

[Figure: example with labels "government" and "cobol"]

SPARSE SEMANTICS OF POST

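D ≈ X Θ, with Θ sparse

Minimize: reconstruction error between D and XΘ, plus a sparsity penalty on Θ

D: words (word1 … wordV) × posts (post1 … postL)
X: words (word1 … wordV) × topics (topic1 … topicT)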
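One common way to obtain sparse codes is an L1 penalty solved by iterative soft-thresholding (ISTA). The sketch below assumes X is fixed, uses the conventional 0.5 factor on the squared error, and treats `lam` as a hypothetical penalty weight; the slides do not state which solver or penalty form the authors use.

```python
import numpy as np

def soft_threshold(Z, t):
    """Shrink every entry of Z toward zero by t."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_codes(D, X, lam=0.1, n_iter=200):
    """ISTA for min_Theta 0.5 * ||D - X Theta||_F^2 + lam * ||Theta||_1, X fixed."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    Theta = np.zeros((X.shape[1], D.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ Theta - D)
        Theta = soft_threshold(Theta - step * grad, step * lam)
    return Theta

# Toy case with orthonormal topics: small weights are driven exactly to zero.
X = np.eye(3)
D = np.array([[1.0], [0.05], [0.0]])
Theta = sparse_codes(D, X, lam=0.1)
```

With these toy values the post keeps only its dominant topic, matching the intuition that a post is related to a few topics.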

A POST IS RELATED TO A FEW POSTS

Minimize: error of approximating each post by previous posts, with b sparse

Θ

b (sparse): approximate each post as a linear combination of a few previous posts

OPTIMIZE THEM TOGETHER

Model semantics

Model structure
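Putting the four intuitions together, one plausible way to write the joint objective is shown below. The weights λ1, λ2, λ3, the column notation θ_i for post i, and the vector b^(i) of weights over earlier posts are assumed notation for illustration; the exact form in the original work may differ.

```latex
\min_{X,\,\Theta,\,b}\;
\|D - X\Theta\|_F^2
+ \lambda_1 \sum_{i=2}^{L} \Bigl\| \theta_i - \sum_{k=1}^{i-1} b^{(i)}_k \theta_k \Bigr\|_2^2
+ \lambda_2 \sum_{i=1}^{L} \|\theta_i\|_1
+ \lambda_3 \sum_{i=2}^{L} \|b^{(i)}\|_1
```

The first and third terms model semantics (topics, and few topics per post); the second and fourth model structure (replies, and few related posts per post).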

APPLICATIONS

Reply reconstruction: capability of recognizing structure

Junk identification: capability of capturing semantics

Expert finding: capability of measuring post quality

REPLY RECONSTRUCTION

Document similarity

Topic similarity

Structure similarity
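A hedged sketch of how such similarities might be combined to pick a parent post: score every earlier post by a mix of topic similarity (cosine between topic vectors) and structure similarity (the magnitude of its weight in b), then take the best. The mixing weight `alpha` and the scoring rule are assumptions for illustration; the slide does not give the actual formula.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu > 0 and nv > 0 else 0.0

def predict_parent(Theta, i, b_i=None, alpha=0.5):
    """Index of the earlier post that post i most plausibly replies to."""
    if i == 0:
        return None  # the root post replies to nothing
    scores = []
    for k in range(i):
        topic_sim = cosine(Theta[:, i], Theta[:, k])
        struct_sim = abs(b_i[k]) if b_i is not None else 0.0
        scores.append(alpha * topic_sim + (1.0 - alpha) * struct_sim)
    return int(np.argmax(scores))

# Toy topic-by-post matrix: post 2 is topically close to post 0.
Theta = np.array([[1.0, 0.0, 0.9],
                  [0.0, 1.0, 0.1]])
parent_by_topic = predict_parent(Theta, 2)
parent_with_structure = predict_parent(Theta, 2, b_i=np.array([0.0, 5.0]))
```

In the second call a strong structural weight on post 1 overrides the topical match.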

DATA SET

                     Slashdot   Apple discussion
No. threads          1154       4488
No. posts            203210     80008
Avg. thread length   176.09     17.84
Avg. words/post      73.53      78.36
Avg. posts/user      15.32      4.69

BASELINES

NP: Reply to Nearest Post

RR: Reply to Root

DS: Document Similarity

LDA: Latent Dirichlet Allocation; projects documents to topic space

SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk topic space
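The two simplest baselines can be sketched in a few lines. This assumes posts are indexed in chronological order, that "nearest" means the immediately preceding post, and that quality is scored as the fraction of non-root posts with a correctly predicted parent; the slides do not name the metric, so `accuracy` is a hypothetical stand-in.

```python
def reply_to_nearest_post(n_posts):
    """NP: every post replies to the post immediately before it."""
    return [None] + [i - 1 for i in range(1, n_posts)]

def reply_to_root(n_posts):
    """RR: every post replies to the root (index 0)."""
    return [None] + [0] * (n_posts - 1)

def accuracy(predicted, actual):
    """Fraction of non-root posts whose predicted parent matches the actual one."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if a is not None]
    return sum(p == a for p, a in pairs) / len(pairs)

# Toy thread of 4 posts: post 1 replies to the root, posts 2 and 3 reply to post 1.
actual = [None, 0, 1, 1]
np_acc = accuracy(reply_to_nearest_post(4), actual)
rr_acc = accuracy(reply_to_root(4), actual)
```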

EVALUATION

Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772

JUNK IDENTIFICATION

D = words (word1 … wordV) × posts (post1 … postL)

X = words (word1 … wordV) × topics (topic1 … topicT, topicbg)

Θ = topics (topic1 … topicT, topicbg) × posts (post1 … postL)

Probability of junk
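One way to read a junk score off this layout is the share of a post's topic weight that falls on the background topic. This is an assumption made for illustration: the slide shows a background topic topicbg and the label "Probability of junk" but does not spell out the formula, and `junk_scores` is a hypothetical helper.

```python
import numpy as np

def junk_scores(Theta):
    """Share of each post's absolute topic weight on the background topic (last row)."""
    weights = np.abs(Theta)
    totals = weights.sum(axis=0)
    totals = np.where(totals > 0, totals, 1.0)  # avoid dividing by zero for empty posts
    return weights[-1] / totals

# Toy matrix: rows are topic1, topic2, topicbg; columns are two posts.
Theta = np.array([[0.9, 0.0],
                  [0.0, 0.1],
                  [0.1, 0.9]])
scores = junk_scores(Theta)
```

The second post puts most of its weight on the background topic and would be flagged first.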

DATA SET

Slashdot, Apple discussion

BASELINES

DF

SVM: classify posts as junk posts and non-junk posts

SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk topic space

EVALUATION

Method   Precision   Recall   F-measure
SWB      0.48        0.22     0.30
SVM      0.37        0.24     0.20
DF       0.34        0.40     0.36
SMSS     0.38        0.45     0.41

EXPERT FINDING

Methods

HITS

PageRank

…

BASELINES

LM: Formal Models for Expert Finding in Enterprise Corpora. SIGIR ’06. Achieves stable performance in the expert finding task using a language model

PageRank: benchmark nodal ranking method

HITS: finds hub nodes and authority nodes

EABIF: Personalized Recommendation Driven by Information Flow. SIGIR ’06. Finds the most influential node
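For reference, HITS on a user reply graph can be sketched with a short power iteration. The edge direction (replier points to the user replied to) and the fixed iteration count are assumptions for illustration; the slides do not describe how the graph is built.

```python
import numpy as np

def hits(A, n_iter=50):
    """A[i, j] = 1 if user i replied to user j. Returns (hub, authority) scores."""
    n = A.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(n_iter):
        auth = A.T @ hub                      # good authorities are replied to by good hubs
        auth /= np.linalg.norm(auth) or 1.0   # normalize, guarding against an all-zero vector
        hub = A @ auth                        # good hubs reply to good authorities
        hub /= np.linalg.norm(hub) or 1.0
    return hub, auth

# Toy graph: users 0 and 1 both reply to user 2.
A = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
hub, auth = hits(A)
```

User 2 receives all replies and ends up with the highest authority score, which is the quantity an expert ranking would sort by.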

EVALUATION

Bayesian estimate

Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori.)      0.674   0.362   0.243
EABIF(rec.)      0.742   0.318   0.281
PageRank(ori.)   0.675   0.377   0.263
PageRank(rec.)   0.743   0.321   0.266
HITS(ori.)       0.906   0.832   0.900
HITS(rec.)       0.938   0.822   0.906
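The three metrics in the table follow standard definitions, sketched below for a single query. MRR and MAP are the means of the reciprocal rank and the average precision over all queries. The user identifiers are made-up toy data.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k ranked items that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy ranking of users for one query; u1 and u4 are the true experts.
ranked = ["u3", "u1", "u2", "u4"]
relevant = {"u1", "u4"}
rr = reciprocal_rank(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
```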

DISCUSSION

Parameters vs. Model Complexity
    Linear regression
    SMSS model

Though the number of parameters increases, prior knowledge shrinks the projection space.

[Figure label: Prior knowledge]

CONCLUSION

Purpose
    Mine the semantics
    Mine the structure

Highlight
    Simultaneously model the semantics and structure

Applications are designed to evaluate the model
    Reply reconstruction
    Junk identification
    Expert finding