Discovering the multifaceted information hidden within large user-generated text streams
![Page 1: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/1.jpg)
Discovering the multifaceted
information hidden within large
user-generated text streams
Daniel Preotiuc-Pietro
23.04.2014
![Page 2: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/2.jpg)
Context
• vast increase in user-generated content
• Online Social Networks: most time-consuming activity on the Internet
• multiple modalities: text, time, location, user info, images, etc.
• social network structure
• Challenges:
  • Engineering: data volume
  • Algorithmic: restricted information, grounded in context, streaming, noise
![Page 3: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/3.jpg)
Motivation
Assumption: Text has different use conditioned on factors such as time, location, etc.
Aim: Build models which incorporate these factors
Tasks:
• Supervised prediction applications
  • internal, external
• Study the effect of these factors in text use
• Improve performance of downstream applications
![Page 4: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/4.jpg)
Outline
i. Introduction
ii. Data processing
iii. Temporal patterns
iv. Text forecasting real-world outcomes
v. Spatio-temporal clustering
vi. User level properties
![Page 5: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/5.jpg)
TrendMiner project
• 'Large scale, cross-lingual trend mining and summarization of real time media streams'
• 6+4 organisations; we work with University of Southampton and SORA on machine learning
• application to predicting political polls and aiding political analysts to make sense of social media data
www.trendminer-project.eu
![Page 6: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/6.jpg)
Text Processing
RT @MediaScotland greeeat!!!lvly speech by cameron on scott's indy :) #indyref
• unorthodox capitalisation
• OOV words
• creative spellings
• shortenings
• new conventions
• lack of context
![Page 7: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/7.jpg)
Processing Architecture
• Fast: real time processing, Hadoop MapReduce (I/O bound), online and batch processing
• Scalable: adding more machines
• Modular: easy to add new modules
• Pipeline: the user specifies his needs
• Extensible: different sources of data (USMF format)
• Data consistency: JSON format, append to ‘analysis’
• Reusable: open-source
(ICWSM 2012)
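The modular design and the "append to 'analysis'" convention can be illustrated with a small sketch. It is not the Trendminer code: the module names, the record layout, and the `run_pipeline` helper are hypothetical, and only the idea that each module adds its output under an `analysis` key of a JSON record is taken from the slide.

```python
import json

def tokenise(record):
    """Hypothetical module: whitespace tokeniser."""
    return {"tokens": record["text"].split()}

def count_hashtags(record):
    """Hypothetical module: counts tokens that start with '#'."""
    return {"hashtags": sum(tok.startswith("#") for tok in record["text"].split())}

def run_pipeline(line, modules):
    """Parse one JSON record, run each module, and append results under 'analysis'."""
    record = json.loads(line)
    analysis = record.setdefault("analysis", {})
    for module in modules:
        # Each module only adds keys; the original fields stay untouched.
        analysis.update(module(record))
    return json.dumps(record)

line = json.dumps({"text": "lvly speech by cameron #indyref"})
result = json.loads(run_pipeline(line, [tokenise, count_hashtags]))
print(result["analysis"])
```

Because every module reads a record and only appends to `analysis`, modules can be reordered or added without changing the input format, which is the property the "Modular" and "Data consistency" bullets describe.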
![Page 8: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/8.jpg)
Components
![Page 9: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/9.jpg)
Gaussian Processes
(EMNLP 2013)
Task: Forecast hashtag frequency in Social Media
• identify and categorise complex temporal patterns
Non-parametric Bayesian framework
• kernelised
• probabilistic formulation
• propagation of uncertainty
• exact posterior inference for regression
• Non-parametric extension of Bayesian regression
• very good results, but hardly used in NLP
![Page 10: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/10.jpg)
Gaussian Processes
Define prior over functions Compute posterior
(ACL 2014 Tutorial)
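The two steps named on this slide, defining a prior over functions and computing the posterior, can be sketched in plain Python. This is a minimal illustration under stated assumptions: a zero-mean prior, a squared exponential (SE) kernel with unit hyperparameters, and Gaussian observation noise. The function names and toy data are hypothetical and are not taken from the talk.

```python
import math

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential (SE) covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_star, noise=1e-2):
    """Posterior mean and variance at x_star for a zero-mean GP prior."""
    # Prior covariance of the training inputs plus noise variance on the diagonal.
    K = [[se_kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [se_kernel(x_star, a) for a in xs]
    alpha = solve(K, ys)       # K^{-1} y
    v = solve(K, k_star)       # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = se_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

xs, ys = [0.0, 1.0, 2.0], [0.0, 0.8, 0.9]
near_mean, near_var = gp_posterior(xs, ys, 1.0)    # at a training input
far_mean, far_var = gp_posterior(xs, ys, 10.0)     # far from all data
print(near_mean, near_var, far_mean, far_var)
```

Far from the observations the posterior reverts to the prior (mean near zero, variance near the prior variance), which is the "propagation of uncertainty" that makes extrapolation behaviour depend on the kernel choice.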
![Page 11: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/11.jpg)
Extrapolation
![Page 12: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/12.jpg)
Examples of time series
[Figure: example time series for #FAIL, #RAW, #SNOW and #FYI; label: SE]
![Page 13: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/13.jpg)
Experimental results
![Page 14: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/14.jpg)
Experimental results
Compared to Mean prediction
![Page 15: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/15.jpg)
Text classification
Task: Assign the hashtag to a given tweet
• Most frequent (MF)
• Naive Bayes model (NB-E)
• Naive Bayes with GP forecast as prior (NB-P)

| Metric | MF | NB-E | NB-P |
|---|---|---|---|
| Match@1 | 7.28% | 16.04% | 17.39% |
| Match@5 | 19.90% | 29.51% | 31.91% |
| Match@50 | 44.92% | 59.17% | 60.85% |
| MRR | 0.144 | 0.237 | 0.252 |
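A toy sketch of how a forecast can enter Naive Bayes as the class prior, together with the Match@k and reciprocal-rank measures reported on this slide. It assumes that NB-E uses an empirical prior and NB-P a forecast prior; the word probabilities, priors, and function names are hypothetical illustration values, not the experimental setup.

```python
import math

def rank_hashtags(tokens, word_probs, prior):
    """Rank hashtags by log prior plus summed log word likelihoods (Naive Bayes)."""
    scores = {}
    for tag, probs in word_probs.items():
        # Unseen words get a small floor probability instead of zero.
        log_lik = sum(math.log(probs.get(tok, 1e-6)) for tok in tokens)
        scores[tag] = math.log(prior[tag]) + log_lik
    return sorted(scores, key=scores.get, reverse=True)

def match_at_k(ranking, gold, k):
    """1.0 if the gold hashtag appears in the top k, else 0.0."""
    return 1.0 if gold in ranking[:k] else 0.0

def reciprocal_rank(ranking, gold):
    """1 / rank of the gold hashtag; averaging this over tweets gives MRR."""
    return 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0

tokens = ["cold", "today"]
word_probs = {
    "#snow": {"cold": 0.3, "today": 0.1},
    "#fail": {"cold": 0.1, "today": 0.2},
}
empirical_prior = {"#snow": 0.2, "#fail": 0.8}   # assumed: overall frequency
forecast_prior = {"#snow": 0.7, "#fail": 0.3}    # assumed: forecast for this hour

ranking_e = rank_hashtags(tokens, word_probs, empirical_prior)
ranking_p = rank_hashtags(tokens, word_probs, forecast_prior)
print(ranking_e, ranking_p)
```

In this toy case the text evidence is the same in both runs; only the prior changes, and it is enough to move the gold hashtag `#snow` from rank 2 to rank 1.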
![Page 16: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/16.jpg)
User behaviour
Task: Predict venue check-in frequencies
• Modelled using GPs
• Compared to Mean
[Figure: comparison across kernels Linear, SE, PER, PS and Select; axis range -150 to 100]
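As a point of reference for the kernel names above, here is a sketch of a periodic covariance function. It assumes that PER denotes the standard periodic kernel; the exact parameterisation used in the experiments is not given on the slide, and the hyperparameter values below are hypothetical.

```python
import math

def periodic_kernel(a, b, period, lengthscale=1.0, variance=1.0):
    """Standard periodic kernel: inputs a whole number of periods apart covary fully."""
    s = math.sin(math.pi * abs(a - b) / period)
    return variance * math.exp(-2.0 * s ** 2 / lengthscale ** 2)

same_phase = periodic_kernel(0.0, 24.0, period=24.0)      # one full period apart
opposite_phase = periodic_kernel(0.0, 12.0, period=24.0)  # half a period apart
print(same_phase, opposite_phase)
```

With hourly counts and a 24-hour period, the same hour on consecutive days is treated as strongly related, while hours half a day apart are only weakly related.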
![Page 17: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/17.jpg)
Individual user behaviour
Task: Predict venue type of user check-in
• highly periodic
• compared to standard Markov predictors

| Method | Accuracy |
|---|---|
| Random | 11.11% |
| M.Freq Categ. | 35.21% |
| Markov-1 | 36.13% |
| Markov-2 | 34.21% |
| Daily period | 38.92% |
| Weekly period | 40.65% |

(WebScience 2013)
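For readers unfamiliar with the baselines, a first-order Markov predictor and the most-frequent-category baseline can be sketched as follows. The check-in history and function names are made up for illustration; this is not the evaluation code behind the accuracies reported here.

```python
from collections import Counter, defaultdict

def train_markov1(sequence):
    """Count transitions between consecutive venue categories."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        transitions[prev][nxt] += 1
    return transitions

def predict_next(transitions, current, fallback):
    """Most frequent successor of `current`; use the fallback if it was never seen."""
    if current in transitions:
        return transitions[current].most_common(1)[0][0]
    return fallback

# Hypothetical check-in history of one user.
history = ["Home", "Work", "Food", "Work", "Home", "Work", "Food", "Work"]
model = train_markov1(history)
most_frequent = Counter(history).most_common(1)[0][0]

after_work = predict_next(model, "Work", most_frequent)
after_gym = predict_next(model, "Gym", most_frequent)
print(most_frequent, after_work, after_gym)
```

A Markov-1 predictor conditions only on the previous check-in, so it cannot use the time of day or day of week, which is the information the daily and weekly period models exploit.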
![Page 18: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/18.jpg)
(ACL 2013)
Text based forecasting
Task: predicting real world outcomes
Aim: replace expensive polls with streaming text
• predict political voting intention (not elections!)
• based on social media (Twitter) text
• strong baselines (last day, mean)
• 2 different use cases (UK and Austria)
• UK: 42k users, 60m tweets, 3 parties, 2 years
![Page 19: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/19.jpg)
Linear regression
$$w x_t + \beta = y_t$$
![Page 20: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/20.jpg)
Linear regression
$$\{w, \beta\} = \arg\min_{w, \beta} \sum_{i=1}^{n} (w x_i + \beta - y_i)^2$$
![Page 21: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/21.jpg)
Linear regression
$$\{w, \beta\} = \arg\min_{w, \beta} \sum_{i=1}^{n} (w x_i + \beta - y_i)^2 + \psi_{el}(w, \rho)$$
LEN – Elastic Net
![Page 22: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/22.jpg)
Bilinear regression
• main issue is noise: many non-informative users
• we look for a model of sparse words & sparse users
• bi-convex optimisation problem
• solved by alternately fixing each set of weights and iterating until convergence
![Page 23: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/23.jpg)
Bilinear regression
$$u X_t w^T + \beta = y_t$$
![Page 24: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/24.jpg)
Bilinear regression
$$\{w, u, \beta\} = \arg\min_{w, u, \beta} \sum_{i=1}^{n} (u X_i w^T + \beta - y_i)^2$$
![Page 25: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/25.jpg)
Bilinear regression
$$\{w, u, \beta\} = \arg\min_{w, u, \beta} \sum_{i=1}^{n} (u X_i w^T + \beta - y_i)^2 + \psi_{el}(w, \rho_1) + \psi_{el}(u, \rho_2)$$
BEN – Bilinear Elastic Net
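The alternating scheme for the bi-convex problem can be sketched in plain Python. To keep the example short it swaps the elastic net penalty for a plain ridge (L2) penalty, so it illustrates the alternation only and is not BEN itself. All names, the synthetic data, and the hyperparameters are hypothetical.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge(features, targets, lam):
    """Minimise sum (f . a - t)^2 + lam * |a|^2 via the normal equations."""
    d = len(features[0])
    A = [[sum(f[j] * f[k] for f in features) + (lam if j == k else 0.0)
          for k in range(d)] for j in range(d)]
    b = [sum(f[j] * t for f, t in zip(features, targets)) for j in range(d)]
    return solve(A, b)

def predict(u, X, w, beta):
    """Bilinear prediction u X w^T + beta for one users-by-words matrix X."""
    return sum(u[p] * X[p][q] * w[q]
               for p in range(len(u)) for q in range(len(w))) + beta

def objective(u, w, beta, Xs, ys, lam):
    """Squared error plus the ridge penalty on both weight vectors."""
    err = sum((predict(u, X, w, beta) - y) ** 2 for X, y in zip(Xs, ys))
    return err + lam * (sum(a * a for a in u) + sum(a * a for a in w))

def fit_bilinear(Xs, ys, lam=0.1, iters=20):
    n_users, n_words = len(Xs[0]), len(Xs[0][0])
    u, w, beta = [1.0] * n_users, [1.0] * n_words, 0.0
    history = []
    for _ in range(iters):
        # Fix w: X_i w is a per-user feature vector, so u is a ridge fit.
        feats = [[sum(X[p][q] * w[q] for q in range(n_words))
                  for p in range(n_users)] for X in Xs]
        u = ridge(feats, [y - beta for y in ys], lam)
        # Fix u: u X_i is a per-word feature vector, so w is a ridge fit.
        feats = [[sum(u[p] * X[p][q] for p in range(n_users))
                  for q in range(n_words)] for X in Xs]
        w = ridge(feats, [y - beta for y in ys], lam)
        # Bias: mean residual with u and w fixed.
        beta = sum(y - predict(u, X, w, 0.0) for X, y in zip(Xs, ys)) / len(ys)
        history.append(objective(u, w, beta, Xs, ys, lam))
    return u, w, beta, history

random.seed(0)
true_u, true_w = [1.0, 0.0, 2.0], [0.5, 0.0, -1.0, 0.0]
Xs = [[[random.random() for _ in range(4)] for _ in range(3)] for _ in range(30)]
ys = [predict(true_u, X, true_w, 0.5) for X in Xs]
u, w, beta, history = fit_bilinear(Xs, ys)
print(history[0], history[-1])
```

Each of the three updates solves a convex subproblem exactly, so the objective never increases from one sweep to the next. As with any alternating method on a bi-convex problem, the result can depend on the initialisation.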
![Page 26: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/26.jpg)
Bilinear regression
$$\{w_t, u_t, \beta\} = \arg\min \sum_{i=1}^{n} (u_t X_i w_t + \beta - y_{ti})^2 + \psi_{el}(w_t, \rho_1) + \psi_{el}(u_t, \rho_2)$$
![Page 29: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/29.jpg)
Bilinear regression
$$\{w, u, \beta\} = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t + \beta - y_{ti})^2 + \psi_{\ell_1 \ell_2}(w, \rho_1) + \psi_{\ell_1 \ell_2}(u, \rho_2)$$
BGL – Bilinear Group LASSO
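The group penalty can be made concrete with a small sketch. It assumes that the ℓ1/ℓ2 regulariser is the usual group-lasso form: ρ times the sum, over features, of the Euclidean norm of that feature's weights across the τ tasks. The slide does not spell out the definition, so treat the layout (rows are words or users, columns are tasks) and the function name as assumptions.

```python
import math

def l1_l2_penalty(W, rho):
    """rho times the sum over rows of each row's Euclidean norm."""
    return rho * sum(math.sqrt(sum(v * v for v in row)) for row in W)

# Hypothetical weights for 3 words (rows) across 3 tasks, e.g. parties (columns).
W = [
    [3.0, 4.0, 0.0],   # word used by two tasks: row norm 5
    [0.0, 0.0, 0.0],   # word dropped for all tasks at once: row norm 0
    [0.0, 0.0, 1.0],   # word used by one task: row norm 1
]
penalty = l1_l2_penalty(W, rho=0.5)
print(penalty)
```

Because the outer sum is an ℓ1 norm over row norms, whole rows are pushed to zero together, so a word or user tends to be selected or discarded jointly for all tasks.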
![Page 30: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/30.jpg)
Quantitative results
Root Mean Squared Error (RMSE) forecasting results over 50 testing polls (in VI %)
[Figure panels: BGL, BEN, Polls]
![Page 31: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/31.jpg)
Quantitative results
| Party | Tweet | Score | Author |
|---|---|---|---|
| CON | PM in friendly chat with top EU mate, Sweden’s Fredrik Reinfeldt, before family photo | 1.334 | Journalist |
| CON | Have Liberal Democrats broken electoral rules? Blog on Labour complaint to cabinet secretary | -0.991 | Journalist |
| LAB | Blog Post Liverpool: City of Radicals Website now Live <link> #liverpool #art | 1.954 | Art Fanzine |
| LAB | I am so pleased to head Paul Savage who worked for the Labour group has been Appointed the Marketing manager for the baths hall GREAT NEWS | -0.552 | Politician (Labour) |
| LBD | RT @user: Must be awful for TV bosses to keep getting knocked back by all the women they ask to host election night (via @user) | 0.874 | LibDem MP |
| LBD | Blog Post Liverpool: City of Radicals 2011 – More Details Announced #liverpool #art | -0.521 | Art Fanzine |
![Page 32: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/32.jpg)
User features
• The real-world outcome and users share:
  i. region info: London (L), South England (S), Midlands & Wales (MW), North (N), Scotland (Sc) - observed
  ii. gender: Male (M), Female (F) - inferred using a statistical text-based classifier
  iii. age: 18-24, 25-39, 40-59, 60+ - unknown
![Page 33: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/33.jpg)
Recap: Bilinear regression
$$\{w, u, \beta\} = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t + \beta - y_{ti})^2 + \psi_{\ell_1 \ell_2}(w, \rho_1) + \psi_{\ell_1 \ell_2}(u, \rho_2)$$
BGL – Bilinear Group LASSO
![Page 34: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/34.jpg)
Region & Demographics
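$$\{w, u, \beta\} = \arg\min \sum_{t=1}^{\tau} \sum_{r=1}^{\partial} \sum_{i=1}^{n} (u_{tr} X_{ir} w_{tr} + \beta_{tr} - y_{tir})^2 + \sum_{r=1}^{\partial} \left[ \psi_{\ell_1 \ell_2}(w_r, \rho_1) + \psi_{\ell_1 \ell_2}(w_t, \rho_1) + \psi_{\ell_1 \ell_2}(u_r, \rho_2) \right]$$
BGGR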
![Page 35: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/35.jpg)
Region & Demographics
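Regional model

| Model | S | L | MW | N | Sc | $\mu$ |
|---|---|---|---|---|---|---|
| $B_{\mu}$ | 2.9 | 3.9 | 3.2 | 3.2 | 3.8 | 3.4 |
| $B_{last}$ | 3.0 | 4.9 | 4.3 | 4.0 | 5.3 | 4.3 |
| BGGR | 2.6 | 3.9 | 3.2 | 3.0 | 3.7 | 3.3 |

Gender model

| Model | M | F | $\mu$ |
|---|---|---|---|
| $B_{\mu}$ | 2.6 | 2.1 | 2.4 |
| $B_{last}$ | 2.6 | 2.4 | 2.5 |
| BGGR | 2.1 | 2.1 | 2.1 |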
![Page 36: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/36.jpg)
Region & Demographics
London Predictions
Female Predictions
![Page 37: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/37.jpg)
Region & Demographics
Conservatives, Positive London
![Page 38: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/38.jpg)
NewsSummaries dataset
(LACSS 2014)
Task: Predict socioeconomic EU indicators
Dataset:
• News summaries from Open Europe think tank
• Daily summaries of EU and member states related news together with their news source
• Feb 2006 – Nov 2013; 1,913 days; 94 months
• 296 news outlets (with >10 summaries)
• Features: unigrams + bigrams
![Page 39: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/39.jpg)
Predictions
[Figure panels: ESI (Economic Sentiment Indicator); Unemployment]

| Model | ESI | Unemployment |
|---|---|---|
| LEN | 9.253 (9.89%) | 0.9275 (8.75%) |
| BEN | 8.209 (8.77%) | 0.9047 (8.52%) |
![Page 40: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/40.jpg)
Economic Sentiment Indicator
![Page 41: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/41.jpg)
Unemployment
![Page 42: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/42.jpg)
Deep linguistic features
• Unigrams (8,912) (cameron)
• Bigrams (33,206) (david__cameron)
• POS (10,277): Unigrams together with their part-of-speech (cameron/NNP)
• NE (1,013): Entities - Location, Person or Organisation (Person:David_Cameron)
• Annotations (3,392): Link entities to DBpedia, e.g. political party (Org:Conservative_Party), office held (Office:Prime_minister)
![Page 43: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/43.jpg)
Deep linguistic features
| Features | ESI | Unempl. |
|---|---|---|
| Unigrams | 8.21 | 1.27 |
| Bigrams | 9.66 | 1.61 |
| Unigrams + Bigrams | 8.91 | 1.47 |
| POS | 7.87 | 1.14 |
| Entities | 9.59 | 1.45 |
| POS + NE | 8.09 | 1.12 |
| NE + Annotations | 12.67 | 1.62 |
| POS + NE + Annotations | 10.50 | 1.31 |
| Unigrams + NE + Annotations | 10.92 | 1.31 |
| Unigrams + Bigrams + NE + Annotations | 10.81 | 1.53 |
![Page 44: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/44.jpg)
Clustering
Dimensionality reduction is used to aid browsing large data collections
Topic models:
• find 'topics' in a collection of documents
• 'topic' = a set of semantically coherent words
• each document is assigned to a few 'topics'
• each word is assigned with a probability to each 'topic' (soft clustering)
• extra factors can be accommodated, e.g. spatio-temporal dependencies and evolution
![Page 45: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/45.jpg)
Temporal topic models
[Figure panels: Latent Dirichlet Allocation (LDA); Dirichlet Multinomial Regression (DMR)]
![Page 46: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/46.jpg)
Temporal & Regional models
• LDA: Documents analysed over time, no temporal conditioning
• Temporal DMR (MId): Documents authored in the same interval share similar topics
• Temporal DMR (TimeRBF): Neighbouring time intervals influence each other
• Regional DMR (OutletId): Documents with similar news source share similar topics
• Regional DMR (DomainId): Documents with similar domain name share similar topics
![Page 47: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/47.jpg)
Spatio-temporal experiments
| Method | Perplexity |
|---|---|
| LDA | 4,597 |
| DMR MId | 4,575 |
| DMR TimeRBF | 4,262 |
| DMR TimeRBF+OutletId | 4,086 |
| DMR TimeRBF+OutletId+DomainId | 4,036 |
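Perplexity, the measure compared here, is the exponential of the negative average log-likelihood per held-out token, so lower is better. The sketch below shows only that definition. How the per-token likelihoods were estimated for the topic models is not stated on the slide, and the function name and toy inputs are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per held-out token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability uniformly over 4 words is "4 ways confused".
uniform = perplexity([math.log(0.25)] * 10)
# A model that gives every held-out token probability 0.5 does better.
sharper = perplexity([math.log(0.5)] * 10)
print(uniform, sharper)
```

Read this way, a drop from 4,597 to 4,036 means the model conditioned on time, outlet, and domain is on average less uncertain about each held-out word than plain LDA.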
![Page 48: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/48.jpg)
Experiments: temporal & regional
Top domains: .it 3.44, .fr 0.09, .tv 0.08, .ee 0.06, .ir 0.05
Top outlets: ft.com 0.79, corriere.it 0.68, repubblica.it 0.49, elpais.com 0.45
![Page 49: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/49.jpg)
Experiments: temporal & regional
![Page 50: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/50.jpg)
Experiments: temporal & regional
Top domains: .fr 0.27, .org 0.10, .es 0.08, .ca 0.06, .ch 0.03
Top outlets: guardian.co.uk 0.61, diplomatie.gouv.fr 0.60, bluesstatedigital.com 0.55, dw-world.de 0.49
![Page 51: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/51.jpg)
Experiments: temporal & regional
![Page 52: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/52.jpg)
User-level properties
• User-level properties:
age, gender, location, social grade, impact
• Aim: understand text use in context of these features - 'profile' users
• Task:
• build a model with good predictive value on held-out users
• interpret the features of this model
![Page 53: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/53.jpg)
User impact
Impact score:
$$\ln\left(\frac{\text{listings} \cdot \text{followers}^2}{\text{followees}}\right)$$
Data: 38k UK users, 48m deduplicated messages, all tweets from 1 year
Features:
profile info and text
under the user’s control
(EACL 2014)
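The impact score can be computed directly from three account counts. In the sketch below the smoothing constant `theta`, added to each count so that the logarithm stays defined for accounts with zero listings or followees, is an assumption of this example and does not appear on the slide. Setting `theta=0` gives the formula exactly as written above.

```python
import math

def impact_score(followers, followees, listings, theta=1.0):
    """ln(listings * followers^2 / followees), with each count smoothed by theta."""
    return math.log(
        (listings + theta) * (followers + theta) ** 2 / (followees + theta)
    )

# Hypothetical accounts.
unsmoothed = impact_score(followers=10, followees=10, listings=10, theta=0.0)
never_listed = impact_score(followers=99, followees=9, listings=0)
print(unsmoothed, never_listed)
```

Followers enter squared, so the score rewards a large audience more strongly than it penalises following many accounts.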
![Page 54: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/54.jpg)
User impact
• Models:
  • Linear Regression (LIN)
  • Gaussian processes (GP) with ARD kernel
• Features:
  • User account (18)
  • Topics from user text (100): derived using spectral clustering on word co-occurrence matrix
[Figure: Pearson correlation]
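To show what an ARD (automatic relevance determination) kernel does, here is a sketch of a squared exponential kernel with one lengthscale per feature. It assumes the SE form and made-up lengthscales; the slide only says that the GP uses an ARD kernel, so the exact kernel and the fitted values are not reproduced here.

```python
import math

def ard_se_kernel(x, y, lengthscales, variance=1.0):
    """SE kernel with a separate lengthscale for every input dimension."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return variance * math.exp(-0.5 * sq)

origin = [0.0, 0.0]
lengthscales = [0.5, 100.0]   # short scale on feature 0, very long on feature 1

k_move_feature0 = ard_se_kernel(origin, [1.0, 0.0], lengthscales)
k_move_feature1 = ard_se_kernel(origin, [0.0, 1.0], lengthscales)
print(k_move_feature0, k_move_feature1)
```

Changing a feature with a long lengthscale barely changes the covariance, so the model effectively ignores it. Learned lengthscales can therefore be read as a relevance ranking of the input features.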
![Page 55: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/55.jpg)
User impact
| Feature | Importance |
|---|---|
| Using default profile image | 0.73 |
| Total number of tweets (entire history) | 1.32 |
| Number of unique @-mentions in tweets | 2.31 |
| Number of tweets (in dataset) | 3.47 |
| Links ratio in tweets | 3.57 |
| T1 (Weather): mph, humidity, barometer, gust, winds | 3.73 |
| T2 (Healthcare, Housing): nursing, nurse, rn, registered, bedroom, clinical, #news, estate, #hospital | 5.44 |
| T3 (Politics): senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections | 6.07 |
| Proportion of days with non-zero tweets | 6.96 |
| Proportion of tweets with @-replies | 7.10 |
![Page 56: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/56.jpg)
User impact
Impact distribution for users with high (H) values of this feature as opposed to low (L). Red line is the mean impact score.
[Figure panels: Number of tweets; Number of unique @-mentions]
![Page 57: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/57.jpg)
User impact
damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer
senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections
Impact distribution for users with high (H) values of this feature. Red line is the mean impact score.
![Page 58: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/58.jpg)
User impact
User scenario:
1. high number of tweets
2. talk about T3 (showbiz)
3. talk about T4 (politics)
4. use links (L)
5. do not use links (NL)
![Page 59: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/59.jpg)
Collaborators
• Vasileios Lampos, UCL, www.lampos.net
• Trevor Cohn, Melbourne, http://dcs.shef.ac.uk/~tcohn/
• Sina Samangooei, Southampton, www.sinjax.net
• Nikos Aletras, Sheffield, http://dcs.shef.ac.uk/~nikos/
![Page 60: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/60.jpg)
References
(ICWSM 2012) Trendminer: An Architecture for Real Time Analysis of Social Media Text.
D. Preotiuc-Pietro, S. Samangooei, T. Cohn, N. Gibbins, M. Niranjan
(HT 2013) Where’s @wally: A classification approach to Geolocating users based on their social ties.
D. Rout, D. Preotiuc-Pietro, K. Bontcheva, T. Cohn ('Ted Nelson' award)
(WebScience 2013) Mining User Behaviours: A study of check-in patterns in Location Based Social Networks.
D. Preotiuc-Pietro, T. Cohn
(ACL 2013) A user-centric model of voting intention from Social Media.
V. Lampos, D. Preotiuc-Pietro, T. Cohn
(EMNLP 2013) A temporal model of text periodicities using Gaussian Processes.
D. Preotiuc-Pietro, T. Cohn
(EACL 2014) Predicting and Characterising User Impact on Twitter.
V. Lampos, N. Aletras, D. Preotiuc-Pietro, T. Cohn
(LACSS 2014) Extracting Socioeconomic Patterns from the News: Modelling Text and Outlet Importance Jointly.
V. Lampos, D. Preotiuc-Pietro, S. Samangooei, D. Gelling, T. Cohn
![Page 61: Discovering the multifaceted information hidden within largedanielpr/files/upenn14-slides.pdf · 2018-01-27 · Discovering the multifaceted information hidden within large user-generated](https://reader035.fdocuments.in/reader035/viewer/2022070723/5f0208137e708231d4023c74/html5/thumbnails/61.jpg)
Thank you!