SDOW (ISWC2011)

DIGITAL Institute for Information and Communication Technologies Pragmatic metadata matters: How data about the usage of data affects semantic user models Claudia Wagner, Markus Strohmaier, Yulan He Sunday, October 23, 2011


http://sdow.semanticweb.org/2011/

Transcript of SDOW (ISWC2011)

Page 1: SDOW (ISWC2011)

DIGITAL Institute for Information and Communication Technologies

Pragmatic metadata matters: How data about the usage of data affects semantic user models

Claudia Wagner, Markus Strohmaier, Yulan He

Sunday, October 23, 2011

Page 2: SDOW (ISWC2011)

Example: Semantic Metadata

[Figure: RDF graph. A post (rdf:type sioc:Post) has sioc:content and sioc:has_creator; its creator is a user account (rdf:type sioc:UserAccount) with a sioc:name; the account is linked via sioc:account_of to a foaf:Person.]
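As a rough illustration, the graph on this slide can be written down as subject-predicate-object triples. The sketch below uses plain Python tuples rather than an RDF library; the instance identifiers (`ex:post1`, `ex:account1`, `ex:person1`), the literal values and the helpers `objects` and `person_behind_post` are hypothetical and do not come from the slides.

```python
# Hypothetical instance data following the SIOC/FOAF pattern on the slide.
TRIPLES = [
    ("ex:post1", "rdf:type", "sioc:Post"),
    ("ex:post1", "sioc:content", "Which iMac should I buy?"),
    ("ex:post1", "sioc:has_creator", "ex:account1"),
    ("ex:account1", "rdf:type", "sioc:UserAccount"),
    ("ex:account1", "sioc:name", "alice"),
    ("ex:account1", "sioc:account_of", "ex:person1"),
    ("ex:person1", "rdf:type", "foaf:Person"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def person_behind_post(post, triples=TRIPLES):
    """Follow sioc:has_creator, then sioc:account_of, from a post to a person."""
    for account in objects(post, "sioc:has_creator", triples):
        for person in objects(account, "sioc:account_of", triples):
            return person
    return None
```

Calling `person_behind_post("ex:post1")` walks from the post to its account and on to the person, which is the path along which an inferred post topic could later be attached to a person's interests.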

Page 3: SDOW (ISWC2011)

Example: Pragmatic Metadata

Page 8: SDOW (ISWC2011)

Aim

[Figure: the RDF graph from the semantic metadata example (sioc:Post, sioc:UserAccount, foaf:Person), extended with two properties marked "?": sioc:topic on the post and foaf:interest on the person.]

Can pragmatic metadata support the generation of semantic metadata, and if yes, how?

Page 9: SDOW (ISWC2011)

Experimental Setup

§ Methodology
  § Topic modeling algorithms to learn topics (probability distributions over words) and annotate users and posts with topics
  § Incorporated different types of pragmatic metadata into the topic models
  § Compared different models via their predictive performance
§ Dataset
  § Boards.ie
    § Forums, posts and users
    § Users' authoring and replying behavior
  § Training dataset: first and last week of February 2006
  § Test dataset: 3 future posts of each user
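The train/test split described on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: posts are represented as hypothetical `(user, day, text)` tuples, the two training weeks are assumed to be February 1 to 7 and February 22 to 28, and "future" is assumed to mean any post after the last training day. None of these details are given in the slides.

```python
from collections import defaultdict
from datetime import date

# Assumed training windows; the slides only say
# "first and last week of February 2006".
TRAIN_WINDOWS = [(date(2006, 2, 1), date(2006, 2, 7)),
                 (date(2006, 2, 22), date(2006, 2, 28))]


def split_posts(posts, windows=TRAIN_WINDOWS, n_future=3):
    """Split (user, day, text) posts into training posts and per-user test posts."""
    train_end = max(end for _, end in windows)
    # Training set: every post that falls inside one of the windows.
    train = [p for p in posts
             if any(start <= p[1] <= end for start, end in windows)]
    # Test set: the first n_future posts of each user after the training period.
    future = defaultdict(list)
    for user, day, text in sorted(posts, key=lambda p: p[1]):
        if day > train_end and len(future[user]) < n_future:
            future[user].append((user, day, text))
    return train, dict(future)
```

Users without any post after the training period simply do not appear in the returned test dictionary.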

Page 10: SDOW (ISWC2011)

Evaluation

§ Compare different models by testing their predictive performance on held-out posts.

§ Assumption: a better user topic model is less perplexed by future posts authored by a user and needs fewer training samples.

[Equation not preserved in the transcript; its annotations:]

Sum over all words in a user's future post

Log likelihood of a word of the user's future post given the model we learned
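The two annotations on this slide (a sum over all words of a future post, and the log likelihood of each word under the learned model) match the usual definition of perplexity, the exponential of the negative average log likelihood per word. The sketch below assumes that standard definition, since the slide's own formula is not preserved; `word_prob` is a hypothetical interface that returns the model's probability for a word.

```python
import math


def perplexity(words, word_prob):
    """Perplexity of a word sequence under a model.

    word_prob maps a word to its probability under the learned model
    (hypothetical interface). Lower perplexity means the model is less
    surprised by the held-out words.
    """
    if not words:
        raise ValueError("need at least one word")
    # Sum over all words of the log likelihood of each word.
    log_likelihood = sum(math.log(word_prob(w)) for w in words)
    # Normalize by the number of words and exponentiate.
    return math.exp(-log_likelihood / len(words))
```

As a sanity check, a model that assigns every word the probability 0.25 has a perplexity of 4: it is as uncertain as a uniform choice among four words.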

Page 11: SDOW (ISWC2011)

Methodology: LDA

§ How to learn topics and annotate users with topics?

§ Latent Dirichlet Allocation (LDA) (Blei et al., 2003)

[Figure: a text annotated with the topics T1, T2 and T3; each topic is a probability distribution over words, e.g. T1: mac: 0.3, iMac: 0.13, PC: 0.03, computer: 0.04, ...]
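In LDA, a document (here a post or a user's aggregated posts) is a mixture of topics, and the probability of a word is the topic-weighted sum of its per-topic probabilities. The sketch below illustrates only this mixture step, not the inference of the topics themselves. The T1 values are the ones shown on the slide; topic T2, the mixture weights and the function name `word_probability` are made up for illustration.

```python
# Topic-word probabilities. T1 uses the values shown on the slide;
# T2 and the topic mixture below are made-up illustration values.
PHI = {
    "T1": {"mac": 0.3, "iMac": 0.13, "PC": 0.03, "computer": 0.04},
    "T2": {"mac": 0.01, "iMac": 0.01, "PC": 0.2, "computer": 0.1},
}


def word_probability(word, theta, phi=PHI):
    """p(word | document) = sum over topics t of theta[t] * phi[t][word]."""
    return sum(weight * phi[topic].get(word, 0.0)
               for topic, weight in theta.items())


# Hypothetical user whose posts are 80% about T1 and 20% about T2.
theta_user = {"T1": 0.8, "T2": 0.2}
p_mac = word_probability("mac", theta_user)  # 0.8 * 0.3 + 0.2 * 0.01
```

A per-word probability of this kind is what a perplexity-based evaluation on held-out posts would consume.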

Page 12: SDOW (ISWC2011)

Methodology: DMR

§ How to incorporate metadata into topic models?

§ Dirichlet Multinomial Regression (DMR) Topic Models (Mimno et al., 2008)
  § Observe feature vector x per document
  § Draw a “fresh” alpha for each document, which depends on the observed features x and the feature distribution per topic λt

$$\alpha_{dt} = \exp(\lambda_t^{\top} x_d)$$
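The relation on this slide sets the Dirichlet parameter of document d for topic t to the exponential of the dot product between the topic's feature weights and the document's observed feature vector. A minimal sketch of just that computation follows; learning the weights is not shown, and the feature layout and all numbers are hypothetical.

```python
import math


def dmr_alpha(x, lambdas):
    """Per-document Dirichlet parameters alpha_t = exp(lambda_t . x).

    x is the document's observed feature vector; lambdas[t] holds the
    feature weights of topic t. Both are plain lists of floats.
    """
    return [math.exp(sum(w * xi for w, xi in zip(lam, x))) for lam in lambdas]


# Hypothetical example: 3 binary features (e.g. "user A replied",
# "user B replied", a constant bias) and 2 topics.
x_doc = [1.0, 0.0, 1.0]
lambdas = [[0.5, -1.0, 0.0],   # topic 0
           [-0.5, 2.0, 0.0]]   # topic 1
alpha = dmr_alpha(x_doc, lambdas)
```

Here the first feature raises the prior weight of topic 0 and lowers that of topic 1, so documents carrying that feature are expected to favor topic 0 before any words are observed.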

Page 13: SDOW (ISWC2011)

Methodology

ID   Alg   Doc    Metadata
M1   LDA   Post   -
M2   LDA   User   -
M3   DMR   Post   author
M4   DMR   User   author
M5   DMR   Post   reply-user
M6   DMR   User   reply-user
M7   DMR   Post   related-user
M8   DMR   User   related-user

[Figure: User 1 and User 2 with their "authored" and "replies to" links to Post 1 through Post 6 (past); Post 7 is a future post.]

Page 14: SDOW (ISWC2011)

§ Different user activities performed on content

Baseline LDA (M1 and M2)

Post training scheme (M3, M5 and M7)

Models which take user replies into account (M6 and M8)

Page 15: SDOW (ISWC2011)

Results

ID   Alg   Doc    Metadata
M1   LDA   Post   -
M2   LDA   User   -
M3   DMR   Post   author
M4   DMR   User   author
M5   DMR   Post   reply-user
M6   DMR   User   reply-user
M7   DMR   Post   related-user
M8   DMR   User   related-user

[Figure: User 1 and User 2 with their "authored" and "replies to" links to Post 1 through Post 6 (past); Post 7 is a future post.]

Page 21: SDOW (ISWC2011)

Results

§ The topics of users who reply to a user are also likely for this user
  § Therefore, if 2 users get replies from the same users, then they are more likely to talk about the same topics
§ Topic models which incorporate pragmatic metadata per user can indeed improve models
§ Topic models which incorporate pragmatic metadata per post often over-fit the data
  § Model assumptions are too strict!
§ Idea: Incorporate behavioral user similarities
  § Intuition: users who are similar are more likely to talk about the same topics
  § How to measure behavioral similarity?
    § forum usage
    § communication behavior

Page 22: SDOW (ISWC2011)

Methodology

ID    Alg   Doc    Metadata
M9    DMR   Post   top 10 forums
M10   DMR   User   top 10 forums
M11   DMR   Post   top 10 communication partners
M12   DMR   User   top 10 communication partners

[Figure: User 1 authored Post 1, Post 2 and Post 3 (past), with top 10 forums f1 f2 f3 f4 f5 f6 f7 f8 f9 f10; User 2 authored Post 4, Post 5 and Post 6 (past), with top 10 forums f15 f20 f31 f12 f5 f6 f17 f18 f19 f10; Post 7 is a future post.]
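The "top 10 forums" metadata on this slide could be derived and encoded roughly as sketched below. The helper names are hypothetical, and the Jaccard overlap is an assumed, purely illustrative way to quantify how similar two users' forum usage is; the slides do not say how (or whether) an explicit similarity score is computed, only that the top 10 lists are fed to DMR as metadata.

```python
from collections import Counter


def top_k(items, k=10):
    """Return the k most frequent items, e.g. the forums a user posts to most."""
    return [item for item, _ in Counter(items).most_common(k)]


def binary_features(top_items, vocabulary):
    """Encode a user's top items as a 0/1 vector over a fixed vocabulary."""
    chosen = set(top_items)
    return [1 if v in chosen else 0 for v in vocabulary]


def jaccard(a, b):
    """Overlap of two item sets; an assumed similarity measure, not from the slides."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Top forums of the two users as listed on the slide.
user1 = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10"]
user2 = ["f15", "f20", "f31", "f12", "f5", "f6", "f17", "f18", "f19", "f10"]
shared = jaccard(user1, user2)  # 3 shared forums out of 17 distinct ones
```

The same encoding would apply to the top 10 communication partners by replacing forum identifiers with user identifiers.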

Page 23: SDOW (ISWC2011)

Baseline LDA (M1 and M2)

Post training scheme (M3, M9 and M11)

User training scheme (M4, M10 and M12)

Model M12 incorporates user similarities based on their communication behavior

Page 24: SDOW (ISWC2011)

Results

§ Topic models seem to benefit from taking behavioral user similarities into account

§ Users who behave similarly (regarding their forum usage and communication behavior) are likely to talk about the same topics

§ Common communication partners seem to be more predictive of common topics than common forums

Page 25: SDOW (ISWC2011)

Conclusions

§ Pragmatic metadata may help to learn better semantic user models
§ But pragmatic metadata observed on a post level often over-fits the data
§ Pragmatic metadata on a user level seems to improve the predictive performance of topic models
  § If posts of 2 users are “used” in a similar way, then they are more likely to talk about the same topics
  § If 2 users behave similarly (tend to post to the same forums or tend to talk to the same users), they are more likely to talk about the same topics
§ Common communication partners seem to be more predictive of common topics than common forums

Page 26: SDOW (ISWC2011)

Limitations and Future Work

§ Perplexity and semantic interpretability of topics do not necessarily correlate (Chang et al., 2009)
  § Separate evaluation of semantic coherence of topics
§ Analyzing different types of behavior- and usage-related metadata and exploring to what extent they may reveal information about the semantics of data
  § behavior on social streams such as Twitter
  § tagging behavior
  § navigation behavior

Page 27: SDOW (ISWC2011)

References

§ Blei, D. M., Ng, A. and Jordan, M. Latent Dirichlet Allocation. JMLR 3 (2003), pp. 993-1022

§ Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C. and Blei, D. Reading Tea Leaves: How Humans Interpret Topic Models. In Neural Information Processing Systems, NIPS (2009)

§ Mimno, D. M. and McCallum, A. Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression. In Proceedings of UAI (2008), pp. 411-418