Post on 11-Aug-2014
Steffen Staab, staab@uni-koblenz.de
WeST: Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Modelling the Web: Examples of Modelling Text, Knowledge Networks and Physical-Social Systems
Steffen Staab
What do people want from the Web?
Web as storage: library, memory
Web as tool: search, transaction
Web as social medium: communication, cooperation
Web as mirror of self: identification, outreach
What are some of the footprints people leave?
My Agenda in the Large
Web Content: discovering patterns, building tools, understanding
Web Interaction: monitoring, exploiting, guiding, understanding
Web Evolution: monitoring, predicting, guiding, understanding
My Agenda for Today
Web Content, Web Interaction, Web Evolution
1. Modelling Text
2. Modelling Network Evolution
3. Modelling Physical-social Data
1. Modelling Text
Autocompletion of queries
„UK is“?
Language Models
What follows „UK is“?
Conditional probability:
where
Issue: Long word sequences can rarely be observed
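A minimal sketch of why long sequences are a problem, assuming the conditional probability is estimated by maximum likelihood from raw n-gram counts. The toy corpus and the function names `ngram_counts` and `mle_probability` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_probability(tokens, history, word):
    """Maximum-likelihood estimate of P(word | history) from raw counts."""
    history = tuple(history)
    counts = ngram_counts(tokens, len(history) + 1)
    # Occurrences of the history that are followed by some word.
    context = sum(c for gram, c in counts.items() if gram[:-1] == history)
    if context == 0:
        return 0.0  # history never observed: no estimate is possible
    return counts[history + (word,)] / context


# Hypothetical toy corpus, for illustration only.
corpus = "uk is a country . uk is an island . uk is a kingdom".split()
p_seen = mle_probability(corpus, ["uk", "is"], "a")
p_unseen = mle_probability(corpus, ["uk", "is"], "great")
```

Any continuation that never occurs in the training text receives probability zero, and the longer the history, the more often this happens. Smoothing methods address exactly this.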
Modified Kneser-Ney Smoothing of n-grams
If a sequence is hard to observe, then approximate it recursively by observing the marginal frequencies of shorter sequences.
First recursion step: drop the first word of the sequence.
Problem: If the last word in the sequence is rare, the overall sequence will be rare, and the approximation will be of low quality.
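The recursion can be illustrated for the smallest case. The sketch below is interpolated Kneser-Ney smoothing for bigrams with a single fixed discount, which is a simplification: the modified variant named on the slide uses several count-dependent discounts and recurses over longer n-grams. The discount value 0.75, the toy corpus and all function names are assumptions.

```python
from collections import Counter, defaultdict


def kneser_ney_bigram(tokens, discount=0.75):
    """Return a function prob(word, prev) estimating P(word | prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()        # how often prev is followed by any word
    followers = defaultdict(set)     # distinct words seen after prev
    predecessors = defaultdict(set)  # distinct words seen before word
    for (prev, word), count in bigrams.items():
        context_total[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    bigram_types = len(bigrams)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does word occur?
        p_cont = len(predecessors.get(word, ())) / bigram_types
        total = context_total[prev]
        if total == 0:
            return p_cont  # unseen context: fall back to the lower order
        discounted = max(bigrams[(prev, word)] - discount, 0) / total
        # Probability mass removed by discounting, redistributed via p_cont.
        weight = discount * len(followers.get(prev, ())) / total
        return discounted + weight * p_cont

    return prob


# Hypothetical toy corpus, for illustration only.
corpus = "uk is a country . uk is an island . uk is a kingdom".split()
p_kn = kneser_ney_bigram(corpus)
total_mass = sum(p_kn(w, "is") for w in set(corpus))
```

The unseen pair "is country" now receives a small positive probability, while the probabilities of all words following "is" still sum to one.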
Generalized Language Models [ACL14]
If a sequence is too hard to observe, then approximate it recursively based on the marginal probabilities of sequences in which single words are skipped.
Core idea of formal solution: recursively applicable, commutative skip operators
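A minimal sketch of what a recursively applicable, commutative skip operator could look like. The wildcard symbol `_`, the function names, and the choice to skip only history words (never the last word) are assumptions made for illustration; the formal definition is given in [ACL14].

```python
from itertools import combinations

SKIP = "_"  # hypothetical wildcard symbol standing for "any word"


def skip(sequence, position):
    """Replace the word at the given position by the wildcard."""
    words = list(sequence)
    words[position] = SKIP
    return tuple(words)


def skipped_variants(sequence, n_skips):
    """All variants of the sequence with exactly n_skips history words skipped."""
    history_positions = range(len(sequence) - 1)  # the last word is kept
    variants = []
    for chosen in combinations(history_positions, n_skips):
        variant = tuple(sequence)
        for position in chosen:
            variant = skip(variant, position)
        variants.append(variant)
    return variants


phrase = ("the", "wild", "black", "cat")
one_skip = skipped_variants(phrase, 1)
two_skips = skipped_variants(phrase, 2)
```

Because the order in which positions are skipped does not matter, the variants with two skips can be reached from the variants with one skip by applying the operator once more.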
Improvement of GLMs [ACL14]
Evaluation measure: Perplexity
Data set: English Wikipedia, different sample sizes
Relative improvement: 2.6% (most training data, smallest model) to 13.9% (least training data, largest model)
[Figure: perplexity (normalized)]
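For reference, a short sketch of how perplexity is commonly computed from the probabilities a model assigns to each word of a test sequence. The function name and the example values are assumptions; lower perplexity means the model is less surprised by the test data.

```python
import math


def perplexity(word_probabilities):
    """Perplexity of a test sequence, given the model probability of each word."""
    if not word_probabilities:
        raise ValueError("need at least one probability")
    if any(p <= 0 for p in word_probabilities):
        raise ValueError("a zero probability gives infinite perplexity")
    avg_log2 = sum(math.log2(p) for p in word_probabilities) / len(word_probabilities)
    return 2 ** (-avg_log2)


# A model that always spreads its mass uniformly over 4 words has perplexity 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```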
Outlook for Generalized Language Models
Correcting mistakes that are made in all tools
Lack of appropriate models
Other operators (example: „the wild black cat“)
• Delete: „the black cat“
• Part-of-speech: „the adj adj cat“
Application: e.g. next word prediction
Other data structures
• Tree-like data
• Graph data
proposal for Google
current focus
Semantic Web
2. Modelling Network Evolution
Evolution of Networks [ICWSM 2013]
[Figure labels: Additions, Removals, Training]
Link Prediction Problem
Unlink Prediction Problem
Markov assumption: history irrelevant
Related Work in Brief
Prediction feature f assigns a score to node pair (i, j)
f(i, j) > f(i, k) implies (i, j) to be ranked above (i, k)
• Link Prediction: edge likelier to be added
• Unlink Prediction: edge likelier to be removed
Related Work in Brief
Static features: degree, common-neighbours, path3, local-clustering-coefficient/embeddedness, ...
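A small sketch of how a static prediction feature ranks node pairs. It uses the common-neighbours count, plus a degree product as one possible way to turn node degrees into a pair score (that choice is an assumption). The toy graph, the candidate pairs and all function names are hypothetical.

```python
def common_neighbours(adjacency, i, j):
    """Number of nodes adjacent to both i and j."""
    return len(adjacency[i] & adjacency[j])


def degree_product(adjacency, i, j):
    """Product of the two node degrees (one assumed way to score by degree)."""
    return len(adjacency[i]) * len(adjacency[j])


def rank_pairs(adjacency, pairs, feature):
    """Sort node pairs by descending feature score; a higher score ranks first."""
    return sorted(pairs, key=lambda pair: feature(adjacency, *pair), reverse=True)


# Hypothetical undirected toy graph as an adjacency mapping.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
    "e": set(),
}
candidates = [("a", "e"), ("a", "d"), ("d", "e")]
ranking = rank_pairs(adjacency, candidates, common_neighbours)
```

For link prediction the top-ranked pairs are the edges considered most likely to be added. For unlink prediction the same scheme is applied to existing edges, with the score expressing how likely an edge is to be removed.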
Unlink prediction is much more difficult than link prediction
The Snapshot View
Link and unlink prediction
(ICWSM 2013)
Related Work in Brief
Advantage: General Model
Disadvantage: General Model
Idea: Keep generality, improve prediction
Our Approach - 1
Hypothesis: Temporal information generally improves prediction
Idea:
1. Nodes concerned
2. Neighbourhood
Our Approach - 2
Dynamic features:
+ recency
+ longevity
Extrapolation for temporal preferential attachment:
Evaluation & Discussion (excerpt)
Temporal link prediction significantly better, but only slightly
Temporal unlink prediction always significantly improved
Temporal preferential attachment best
[Figure: AUC for baseline, qualitative, quantitative and extrapolation features]
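A short sketch of the AUC measure used in this comparison, in its standard pairwise form: the probability that a randomly chosen positive pair (an edge that was actually added or removed) receives a higher score than a randomly chosen negative pair. The function name and the example scores are assumptions.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical feature scores, for illustration only.
example_auc = auc([0.9, 0.8], [0.1, 0.8])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a feature is only useful to the extent that it exceeds 0.5.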
Outlook for Evolution of Networks
Temporal dynamics still underexplored
Lack of datasets! Next experiments:
• Twitter followers
• Xing.de
Unlinks lead to link recommendation
• new Wikipedia link (reorganization of Wikipedia pages!)
• new job
• new friend
3. Modelling Physical-social Data
[Figure: map of photos with example tags: fish, rice, seafood, shrimp, lobster, salmon, wine, coffee, pizza, pasta]
Tagged photos with geo-coordinates from Flickr
[Figure: map of tagged photos grouped into two topics]
Topic: seafood, fish, lobster, shrimp, crab, wine, salmon
Topic: wine, pizza, coffee, italian, pasta
Tasks: Discovering topics, finding clusters
Cultural areas, country borders, geographical features and other geographical observations exhibit complex spatial distributions
wikipedia.org
Challenge
A. Ahmed, L. Hong and A. Smola, 2013 (following (Yin et al 2011; Sizov 2010))
Existing approaches: Gaussian regions
MGTM 1: Global Topic Clustering
MGTM 2: Determining Neighbourhoods
Cluster adjacency: dependencies of document-specific topic distributions
Exchange of topic information between clusters
MGTM 3: Derived Topic Model
Exchange of topic information between clusters
MGTM 4: Exchange of Topic Information
Evaluation: Anecdotal, Perplexity, Gaming
Gaming study: intrusion detection
Precision, 8 topics (avg / median)
LGTA         0.60 / 0.58
Basic model  0.64 / 0.58
MGTM         0.78 / 0.75
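A sketch of how precision in an intrusion detection study can be summarized, assuming the usual setup in which participants see the top words of a topic plus one intruder word and must pick the intruder. The responses below are hypothetical and are not the study data; the function name is also an assumption.

```python
from statistics import mean, median


def intrusion_precision(answers, intruder):
    """Fraction of participants who correctly picked the intruder word."""
    return sum(answer == intruder for answer in answers) / len(answers)


# Hypothetical responses: topic -> (participant answers, true intruder).
rounds = {
    "seafood topic": (["pizza", "pizza", "crab", "pizza"], "pizza"),
    "italian topic": (["salmon", "coffee", "salmon", "salmon"], "salmon"),
}
per_topic = [intrusion_precision(answers, intruder)
             for answers, intruder in rounds.values()]
summary = (mean(per_topic), median(per_topic))  # reported as avg / median
```

A coherent topic makes the intruder easy to spot, so higher precision indicates topics that are easier for people to interpret.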
Outlook for LDA with structure
Texts + social network structures: scientometry, xing.de
Web pages + user visits: chefkoch.de
Future: Knowledge about social aspects needed
Future: CS style models for social sciences
References
[ACL14] R. Pickhardt, T. Gottron, M. Körner, P. G. Wagner, T. Speicher, S. Staab. A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser Ney Smoothing. In: Proc. of ACL-2014 - The 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, June 22-27, 2014.
[WSDM14] C. Kling, J. Kunegis, S. Sizov, S. Staab. Detecting Non-Gaussian Geographical Topics in Tagged Photo Collections. In: Proc. of the 7th ACM Conference on Web Search and Data Mining (WSDM 2014), New York, US, February 24-28, 2014.
[ICWSM13] J. Preusse, J. Kunegis, M. Thimm, T. Gottron, S. Staab. Structural Changes in Collaborative Knowledge Networks. In: Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM 2013), Boston, July 8-10, 2013.
Semantic Web
Social Web & Web Retrieval
Interactive Web & Human Computing
Web & Economy
Software & Services
Web Science & Technologies Team & Research
Computational Social Science
Thank You!
Maslow's pyramid of needs