SIMULTANEOUSLY MODELING SEMANTICS AND STRUCTURE OF THREADED DISCUSSIONS: A SPARSE CODING APPROACH AND ITS APPLICATIONS
Chen LIN *, Jiang-Ming YANG +, Rui CAI +, Xin-jing WANG +, Wei WANG *, Lei ZHANG +
* Fudan University
+ Microsoft Research Asia
OUTLINE
Motivation
Challenges
Model
Application: Reply reconstruction, Junk post detection, Expert finding
Experiments
Conclusion
THREADED DISCUSSIONS
Mailing lists
Chat rooms
IMs
Web forums
root
reply
IMPORTANT DATA SOURCES
MINING SEMANTICS & STRUCTURE
Junk Identification
Expert Search
Measure post quality
…
CHALLENGE
Semantics & Structure
SEMANTICS & STRUCTURE
Semantics: topics
Structure: who replies to whom
CHALLENGE
Junk Post
JUNK POST
CHALLENGE
Post Quality
POST QUALITY
valuable post
MODEL
Purpose: simultaneously model semantics and structure
Methodology: intuitive, matrix-based, sparse coding
root
reply
INTUITION
A THREAD HAS SEVERAL TOPICS
SEMANTIC REPRESENTATION OF THREAD
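D ≈ X Θ
D: word-by-post matrix (rows word1 … wordV, columns post1 … postL)
X: word-by-topic matrix (rows word1 … wordV, columns topic1 … topicT)
Θ: topic-by-post matrix (rows topic1 … topicT, columns post1 … postL)
Minimize: the reconstruction error between D and XΘ
Project posts to topic space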
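The projection of posts into a topic space can be illustrated with a small factorization. This is a minimal sketch, not the authors' solver: it swaps in nonnegative matrix factorization with multiplicative updates, which assumes nonnegative entries (the slides do not state this), and the toy matrix `D`, the helper names, and `n_topics` are all hypothetical.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(D, X, Theta):
    """Squared Frobenius norm of D - X Theta."""
    R = matmul(X, Theta)
    return sum((d - r) ** 2 for d_row, r_row in zip(D, R) for d, r in zip(d_row, r_row))

def scale(M, num, den, eps=1e-9):
    """Elementwise M * num / den, the multiplicative update step."""
    return [[m * n / (d + eps) for m, n, d in zip(mr, nr, dr)]
            for mr, nr, dr in zip(M, num, den)]

def factorize(D, n_topics, iters=200, seed=0):
    """Approximate D (words x posts) by X (words x topics) times Theta (topics x posts)."""
    rng = random.Random(seed)
    V, L = len(D), len(D[0])
    X = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(V)]
    Theta = [[rng.random() + 0.1 for _ in range(L)] for _ in range(n_topics)]
    errors = [squared_error(D, X, Theta)]
    for _ in range(iters):
        Xt = transpose(X)
        Theta = scale(Theta, matmul(Xt, D), matmul(matmul(Xt, X), Theta))
        Tt = transpose(Theta)
        X = scale(X, matmul(D, Tt), matmul(X, matmul(Theta, Tt)))
        errors.append(squared_error(D, X, Theta))
    return X, Theta, errors

# Hypothetical toy data: 4 words (rows) by 3 posts (columns).
D = [[2, 0, 1],
     [1, 0, 1],
     [0, 3, 1],
     [0, 2, 0]]
X, Theta, errors = factorize(D, n_topics=2)
print(round(errors[0], 3), "->", round(errors[-1], 3))
```

Each column of `Theta` is then the topic-space representation of one post.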
A POST IS RELATED TO PREVIOUS POSTS
Θ: topic-by-post matrix (rows topic1 … topicT, columns post1 … postL)
b: approximate each post as a linear combination of previous posts
Minimize: the error of that approximation
A POST IS RELATED TO A FEW TOPICS
Example topics: government, cobol
SPARSE SEMANTICS OF POST
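D ≈ X Θ
D: word-by-post matrix (rows word1 … wordV, columns post1 … postL)
X: word-by-topic matrix (rows word1 … wordV, columns topic1 … topicT)
Θ: topic-by-post matrix (rows topic1 … topicT, columns post1 … postL)
Minimize: the reconstruction error between D and XΘ, while keeping each post's topic vector sparse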
A POST IS RELATED TO A FEW POSTS
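Θ: topic-by-post matrix (rows topic1 … topicT, columns post1 … postL)
b (sparse): approximate each post as a linear combination of a few previous posts
Minimize: the error of that approximation, while keeping b sparse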
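A sparse set of weights over previous posts can be sketched with iterative soft-thresholding (ISTA) on an L1-penalized least-squares problem. This is an illustration under assumptions, not the authors' optimizer: the L1 penalty, the value of `lam`, the function names, and the toy topic vectors are hypothetical.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sparse_reply_weights(prev, target, lam=0.3, iters=500):
    """ISTA for: min_b ||target - sum_k b[k] * prev[k]||^2 + lam * ||b||_1."""
    n = len(prev)
    frob2 = sum(v * v for p in prev for v in p)
    if n == 0 or frob2 == 0:
        return [0.0] * n
    # The gradient's Lipschitz constant is at most 2 * frob2, so this step is safe.
    step = 1.0 / (2.0 * frob2)
    b = [0.0] * n
    for _ in range(iters):
        approx = [sum(b[k] * prev[k][j] for k in range(n)) for j in range(len(target))]
        resid = [a - t for a, t in zip(approx, target)]
        grad = [2.0 * sum(r * pk for r, pk in zip(resid, prev[k])) for k in range(n)]
        b = [soft_threshold(b[k] - step * grad[k], step * lam) for k in range(n)]
    return b

# Hypothetical topic vectors of three earlier posts and of the current post.
previous = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
current = [0.1, 0.9, 0.0]
b = sparse_reply_weights(previous, current)
likely_parent = max(range(len(b)), key=lambda k: b[k])
print(b, likely_parent)
```

Only the second earlier post keeps a nonzero weight here, which is the sense in which a post is related to a few posts.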
OPTIMIZE THEM TOGETHER
Model semantics
Model structure
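The slide formulas did not survive extraction, so the following is only one plausible way to write the joint objective that the preceding slides describe in words. The trade-off weights \lambda_1, \lambda_2, \lambda_3, the use of L1 norms for sparsity, and the notation \theta_i (column i of Θ) and b^{(i)} (weights of post i over earlier posts) are assumptions.

```latex
\min_{X,\;\Theta,\;\{b^{(i)}\}}\quad
\underbrace{\lVert D - X\Theta \rVert_F^2}_{\text{semantics}}
\;+\; \lambda_1 \underbrace{\sum_{i=1}^{L} \Bigl\lVert \theta_i - \sum_{k<i} b^{(i)}_k\,\theta_k \Bigr\rVert_2^2}_{\text{structure}}
\;+\; \lambda_2 \sum_{i=1}^{L} \lVert \theta_i \rVert_1
\;+\; \lambda_3 \sum_{i=1}^{L} \lVert b^{(i)} \rVert_1
```

The first term models semantics, the second models structure, and the last two encode that a post is related to a few topics and to a few previous posts.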
APPLICATIONS
Reply reconstruction: capability of recognizing structure
Junk identification: capability of capturing semantics
Expert finding: capability of measuring post quality
REPLY RECONSTRUCTION
Document similarity
Topic similarity
Structure similarity
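One way to picture how the three similarities could be combined to pick a parent post is sketched below. The weighted sum, the equal default weights, and the toy vectors are assumptions; the slide only names the three signals.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_parent(i, docs, thetas, b, weights=(1.0, 1.0, 1.0)):
    """Return the earlier post j < i with the highest combined similarity to post i.

    docs[j]   : word vector of post j     (document similarity)
    thetas[j] : topic vector of post j    (topic similarity)
    b[j]      : weight of post j in the linear combination for post i (structure similarity)
    """
    w_doc, w_topic, w_struct = weights
    def score(j):
        return (w_doc * cosine(docs[i], docs[j])
                + w_topic * cosine(thetas[i], thetas[j])
                + w_struct * b[j])
    return max(range(i), key=score)

# Hypothetical three-post thread; we ask which earlier post the third one replies to.
docs = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 2]]
thetas = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
b = [0.0, 0.8]
parent = predict_parent(2, docs, thetas, b)
print(parent)
```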
DATA SET
                   Slashdot   Apple discussion
No. threads        1154       4488
No. posts          203210     80008
Avg. thread len.   176.09     17.84
Avg. words/post    73.53      78.36
Avg. posts/user    15.32      4.69
BASELINES
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Project documents to topic space
SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
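The three simplest baselines can be sketched in a few lines. The reading of "nearest post" as the immediately preceding post, the use of cosine similarity for DS, and the toy word vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def reply_to_nearest_post(i):
    """NP: assume post i replies to the post right before it."""
    return i - 1

def reply_to_root(i):
    """RR: assume every reply answers the root post (index 0)."""
    return 0

def reply_by_document_similarity(i, docs):
    """DS: assume post i replies to the earlier post with the most similar words."""
    return max(range(i), key=lambda j: cosine(docs[i], docs[j]))

# Hypothetical word vectors for a four-post thread; post 0 is the root.
docs = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0.5]]
i = 3
predictions = {
    "NP": reply_to_nearest_post(i),
    "RR": reply_to_root(i),
    "DS": reply_by_document_similarity(i, docs),
}
print(predictions)
```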
EVALUATION
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
JUNK IDENTIFICATION
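D = X Θ, with a background topic (topicbg) added
D: word-by-post matrix (rows word1 … wordV, columns post1 … postL)
X: word-by-topic matrix (rows word1 … wordV, columns topic1 … topicT, topicbg)
Θ: topic-by-post matrix (rows topic1 … topicT, topicbg; columns post1 … postL)
Probability of junk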
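The slide does not spell out how the probability of junk is computed. One plausible reading, shown here purely as an assumption, is to take the share of a post's topic weight that falls on the background topic. The function name and the toy vectors are hypothetical.

```python
def junk_probability(theta_post):
    """Share of a post's topic weight assigned to the background topic.

    theta_post lists the weights for topic1 .. topicT followed by topicbg (last entry).
    This normalization is an assumed reading, not a formula from the slides.
    """
    total = sum(abs(w) for w in theta_post)
    return abs(theta_post[-1]) / total if total else 0.0

# Hypothetical posts: the first is mostly background words, the second is on topic.
chatter = junk_probability([0.1, 0.1, 0.8])
on_topic = junk_probability([0.5, 0.5, 0.0])
print(chatter, on_topic)
```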
DATA SET
Slashdot Apple discussion
BASELINES
DF
SVM: Classify posts as junk posts and non-junk posts
SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
EVALUATION
Method   Precision   Recall   F-measure
SWB      0.48        0.22     0.30
SVM      0.37        0.24     0.20
DF       0.34        0.40     0.36
SMSS     0.38        0.45     0.41
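For reference, the standard balanced F-measure is the harmonic mean of precision and recall. The sketch below assumes that definition; the slide does not say how its F-measure column was computed or averaged.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

swb = f_measure(0.48, 0.22)
smss = f_measure(0.38, 0.45)
print(round(swb, 2), round(smss, 2))
```

Applied to the SWB row (0.48, 0.22) this gives about 0.30, and to the SMSS row (0.38, 0.45) about 0.41.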
EXPERT FINDING
Methods: HITS, PageRank, …
BASELINES
LM: Formal Models for Expert Finding in Enterprise Corpora. SIGIR '06. Achieves stable performance in expert finding task using a language model
PageRank: Benchmark nodal ranking method
HITS: Find hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow. SIGIR '06. Find most influential node
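To show how a link-analysis method turns reply structure into an expert ranking, here is a minimal HITS power iteration on a reply graph. The edge direction (replier points to the author being replied to), the user names, and the iteration count are assumptions made for illustration.

```python
import math

def hits(edges, iters=50):
    """Plain HITS power iteration; edges are (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority: sum of hub scores of the nodes pointing at you.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub: sum of authority scores of the nodes you point at.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Hypothetical reply graph: (replier, author of the post being replied to).
replies = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"), ("alice", "bob")]
hub, auth = hits(replies)
top_expert = max(auth, key=auth.get)
print(top_expert)
```

The same function can be run on either an original or a reconstructed reply graph, since it only needs a list of edges.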
EVALUATION
Bayesian estimate
Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori.)      0.674   0.362   0.243
EABIF(rec.)      0.742   0.318   0.281
PageRank(ori.)   0.675   0.377   0.263
PageRank(rec.)   0.743   0.321   0.266
HITS(ori.)       0.906   0.832   0.900
HITS(rec.)       0.938   0.822   0.906
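The three ranking metrics in the table can be computed per query as sketched below, using their standard definitions. MRR and MAP are then the means of the per-query reciprocal rank and average precision. The ranked list and the relevant set are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is retrieved."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values taken at each relevant item's position."""
    hits, total = 0, 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Hypothetical query: four ranked users, two of whom are true experts.
ranked = ["a", "b", "c", "d"]
experts = {"a", "c"}
rr = reciprocal_rank(ranked, experts)
ap = average_precision(ranked, experts)
p2 = precision_at_k(ranked, experts, k=2)
print(rr, ap, p2)
```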
DISCUSSION
Parameters vs. Model Complexity: Linear regression, SMSS model
Though the number of parameters is increased, the projection space is shrunk by the prior knowledge.
Prior knowledge
CONCLUSION
Purpose: Mine the semantics, Mine the structure
Highlight: Simultaneously model the semantics and structure
Applications are designed to evaluate the model: Reply reconstruction, Junk identification, Expert finding