SocialSensor: Sensing User Generated Input for Improved Media Discovery and Experience
Social Multimedia Crawling & Mining
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media Streams
Dr. Yiannis Kompatsiaris, Project CoordinatorSamos 2013 Summit on Digital Innovation for Government, Business and Society
#2
Overview
• Motivation• Objectives • Architecture• Use Cases and Requirements• News
– Social Multimedia Crawling & Mining• Infotainment
– EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media Streams
• Conclusions
#3
What is SocialSensor?
• 3-year FP7 European Integrated Project– http://www.socialsensor.eu
• Members: CERTH, ATC (Greece), Deutsche Welle, University Koblenz, Research Center for Artificial Intelligence (Germany), The City University London, Alcatel – Lucent Bell Labs, JCP Consult (France), University of Klagenfurt (Austria), IBM Israel, Yahoo Iberia + Robert Gordon University Aberdeen (UK)
• 1.5 years into the project (Development of user requirements, use case scenarios, architecture and implementation and first R&D components and prototypes. Currently: Evaluation and 2nd round)
Motivation: Social Networks as Sensors
• Social Networks is a data source with an extremely dynamic nature that reflects events and the evolution of community focus (user’s interests)
• Transform individually rare but collectively frequent media to meaningful topics, events, points of interest, emotional states and social connections
• Mine the data and their relations and exploit them in the right context
• Scalable mining and indexing approaches taking into account the content and social context of social networks
Relevant ApplicationsXin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han. The wisdom of social multimedia: using flickr for prediction and forecast, International conference on Multimedia (MM '10). ACM.
Federal Emergency Management Agency plans to engage the public more in disaster response by sharing data and leveraging reports from mobile phones and social media
5
“…if you're more than 100 km away from the epicenter [of an earthquake] you can read about the quake on twitter before it hits you…”
Objective
SocialSensor quickly surfaces trusted and relevant material from social media – with context
DySCODySCO
behaviour
location
timecontent
usage
social context
Massive social mediaand unstructured web
Social media miningAggregation & indexing
News - InfotainmentPersonalised access
Ad-hoc P2P networks
#7
The SocialSensor Vision
SocialSensor quickly surfaces trusted and relevant material from social media – with context.
•“quickly”: in real time•“surfaces”: automatically discovers, clusters and searches •“trusted”: automatic support in verification process•“relevant”: to the users, personalized•“material”: any material (text, image, audio, video = multimedia), aggregated with other sources (e.g. web)•“social media”: across many relevant social media platforms•“with context”: location, time, sentiment, influence, trust
#10
Conceptual Architecture and Main components
SEMANTIC MIDDLEWARE
Public Data
In-project Data
SEARCH & RECOMMENDATION
USER MODELLING & PRESENTATION
INDEXINGMINING
STORAGE
DATA COLLECTION / CRAWLING
• Real time dynamic topic and event clustering
• Trend, popularity and sentiment analysis
• Calculate trust/influence scores around people
• Personalized search, access & presentation based on social network interactions
• Semantic enrichment and discovery of services
DySCO concept
• Integrate social content mining, search and intelligent presentation in a personalized, context and network-aware way, through the new concept of Dynamic Social COntainers (DySCOs)• Composite objects containing a number of items (e.g. articles,
tweets, images, videos)• Focused on a particular topic of interest (e.g. an event, a story)• Contain all available information about the topic• Metadata can be added dynamically• Ability to search for DySCOS by matching “DySCO features” with
“search features”• Ability to check and recommend similar DySCOs
recommendations
#11
#13
Use Cases: News
#14
“It has changed the way we do news”(MSN)
“Social media is the key place for emerging stories – internationally, nationally, locally” (BBC)
“Social media is transforming the way we do journalism”(New York Times)
Source: picture alliance / dpa
#15
#16
Source: Getty Images
“It’s really hard to find the nuggets of useful stuff in an ocean of content” (BBC)
“Things that aren’t relevant crowd out the content you are looking for” (MSN)
“The filters aren’t configurable enough” (CNN)
Verification was simpler in the past...
Source: Frank Grätz
#17
#18
An example: BBC Verification Procedure: Arab Spring Coverage• Referencing locations against maps and existing images
from, in particular, geo-located ones.• Working with our colleagues in BBC Arabic and BBC
Monitoring to ascertain that accents and language are correct for the location.
• Searching for the original source of the upload/sequences as an indicator of date.
• Examining weather reports and shadows to confirm that the conditions shown fit with the claimed date and time.
• Maintaining lists of previously verified material to act as reference for colleagues covering the stories.
• Checking scenery, weaponry, vehicles and licence plates against those known for the given country.
News application• Real-time Search/browse news items crawled from different social media • Automatically discovered trending topics• Web analytics• Sentiment scores for topics
#19
Alethiometer• Measuring the degree of truth behind tweets• Overall trust score for a tweet• Various Contributor, Content, Context validity metrics
#20
#21
Social Multimedia Crawling & MiningCase study: #OccupyGeziE. Schinas, S. Papadopoulos, I. Tsampoulatidis, K. Iliakopoulou, Y. Kompatsiaris
#22
Multimedia crawling & mining
• Monitor/query multiple sources for shared media content: Twitter, Facebook, Flickr, YouTube, etc.
• Multiple indexing schemes:– text-based (Solr)– visual content-based (SURF+VLAD for feature extraction,
ADC for similarity-based indexing)• Clustering
– geo-spatial (BIRCH)– visual (SCAN)
• Web-based presentation of results
#23
Crawling & mining system deploymentStreamManager
Twitter Facebook Flickr YouTube RSS Instagram160.xx.xx.207
MongoDBWrapper160.xx.xx.207
TextIndexer (Solr)160.xx.xx.207
160.xx.xx.207
MediaFetcher, FeatureExtractor (HDFS)160.xx.xx.58 160.xx.xx.107
Social Focused Crawler (HDFS)160.xx.xx.187
Nutch
Nutch VLAD
FeatureIndexer (HDFS)160.xx.xx.207
IVFADC
Data Mining160.xx.xx.191
Visual Clust. Geo Clust. Statistics
Web server160.xx.xx.116
API (3)API (4)
API (1) API (2)
#24
#OccupyGezi
• Monitors: Keywords: gezipark, taksimgezipark, Taksim, Taksim Gezi ParkLocation: Istanbul
• Current statistics:
#25
Geographical spread of event
• Seems like it is not a localized event (as several official Turkish news sources claimed), but spreads all over Turkey and even in major cities abroad
#26
Different granularities
#27
Trending media by use of clustering
#28
Visual Memes
#29
Statistics
#31
Use Cases: Infotainment
#32
Capturing & mining large-scale events
• Large-scale events attended by thousands of people captured by mobile devices in the form of status updates, photos, ratings, etc.
• Challenge: – Organize information around
entities of interest– Extract meaningful insights,
obtain informative summaries
• EventSense framework
#33
Infotainment• Thessaloniki
International Film Festival – 80,000 viewers / 100,000
visitors in 10 days– 150 films, 350 screenings
• Fete de la Musique Berlin– 100,000 visitors every
year– 5,000 musicians
#35
ThessFestThessFest• Thessaloniki
International Film Festival
• Support twitter/comment usage within the app
• Ratings and comments per film
• Feedback aggregation– Votes– Tweets
• Real-time feedback to the organisation and visitors
#36
ThessFest
• Gather “realistic” user requirements• Early showcase and evaluation of SocialSensor
technologies in real-world event scale• Engage users and create an informed user basis• TDF14, 15 + TIFF53
– 1400+ users – 40K+ user sessions– positive response to social media
• Next version– Updated features bases on SocialSensor prototype
Fête de la Musique Berlin app• FETEberlin in App Store and Google Play• More than 100K visitors• About 5K musicians• More than 5K app downloads
App features•Browse and filter detailed program•Interactive maps and routing •Social Sharing•Artists’ and Stages Details•Social MonitoringMain benefits for attendants•Visitors can browse through maps and don’t get lost as stages are numerous•Event schedule is available always and per stage
– Very useful when the server was down and there was no access to the online schedule
#37
Fête de la Musique Berlin app
FETEberlin Facts & User FeedbackUnique Users Sessions Frequency of Use
App Store 2904 13751 2,5 sessions per day
Google Play 2210 12097 2,9 sessions per day
Total 5114 25848 Avg: 2,7 sessions per day
Future Plans•Enhanced Event and Visitor Engagement•Send Last minute updates•Create buzz around the event and make users the event ambassadors•Gain insightful knowledge on the impact that the event made via social media analysis
– Organize better events
#39
#40
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media StreamsCase study: Thessaloniki International Film Festival
E. Schinas, S. Papadopoulos, S. Diplaris, Y. Kompatsiaris, Y. Mass, J. Herzig, L. Boudakidis
#41
Entity Detection
• Entities are defined as lists of properties:– a film consists of a title, description, names of
director(s)/actors• Matching status updates (tweets) to entities relies on
representing both as vectors, using cosine similarity, and thresholding:
m: message (tweet), f: feature (term), M: set of all event messagesboost(f): boosting factor when f is a named entity
#42
Topic detection
• For each new message, find the Nearest Neighbour (NN) using Locality Sensitive Hashing (LSH)
• If similarity exceeds an empirically selected threshold, assign to the topic of NN, otherwise create new topic
• Clusters of one messages are discarded as outliers• Cluster-merging is conducted as a post-processing
step to compensate for topic over-segmentation• Similar approach to (Petrovic et al., 2010)S. Petrovic, M. Osborne, V. Lavrenko. Streaming first story detection with application to twitter. In Human Language Technologies, HLT ’10, pages 181–189, Stroudsburg, PA, USA, 2010. ACL
#43
Sentiment detection (1/2)
• Build positive/negative sentiment classifiers using emoticons
• Build neutral classifier using positive/negative classifiers
• Feature extraction: – Remove stop words, emoticons, terms occurring only
once, trim repeated letters– Negation terms (“not”, “isn’t”) are attached to subsequent
terms to form new unigrams (e.g. “nothappy”)– Treat user mentions, URLs, punctuation, repeated letters
and all-caps words as additional features
#44
Sentiment detection (2/2)
• Naïve Bayes classifier per sentiment• P(f|c) estimate using ML and Laplace correction
• Probability estimate for special features (e.g. user mentions) using Bernoulli model and Laplace correction
• Classification using maximum log-likelihood• Neutral messages:
– Mutual Information (MI)– MI of features– Sentiment intensity of message– Use thresholding for decision
#45
Evaluation
• Case study: 53rd Thessaloniki International Film Festival (TIFF53), Nov 2-11, 2012
• 168 films included in the program (titles, descriptions in Greek and English)
• 3974 tweets using #tiff53• Manual annotation regarding:
– film– sentiment (pos/neg/neut)
• Additional data using ThessFest mobile app:– #bookmarks per film (number of times a user added the film to their
schedule)– #ratings + avg. rating per film
#46
Tweet-film matching
• film = <title, description, directors, actors>• Multiple entity representations using Greek/English/both, uni-/bi-grams• Similarity threshold sensitivity analysis
Pooling multiple representationsthreshold (0.1, 0.3)
#47
Topic analysis
• Top-10 topics• Manual inspection
of clusters:– 53.8% of topic titles
considered informative
– 98.5% of clusters were found to be “clean”
• Topics in time
#48
Sentiment analysis
• Training (using emoticons and Twitter API)– 800K positive & negative tweets for English– 12K positive & negative tweets for Greek
• Tuning (for threshold)– Manually annotated dataset from Thessaloniki Documentary Festival
(similar event)– 325/73/553 in English and 781/216/781 in Greek
• Testing– 324/33/724 in English and 901/315/1667 in Greek
– Best accuracy (English) ~ 0.75– Performance in Greek much poorer
compared to English need for richer training corpus
pos neg neut
#49
Aggregation & summarization (1/2)
#T: number of tweetsPol: polarity of film tweetsSubj: subjectivity of film tweetsR: average rating#R: number of ratings#F: number of times the film was bookmarked
• Films with positive polarity are rated higher. • Films that are tweeted a lot are also more likely to be rated. • Films that are tweet a lot are also more likely to be added to the users’ bookmarks.
#50
Aggregation & summarization (2/2)
Most active & influential Twitter accounts (+sentiment per user)
Most shared photos (+number of retweets)
Conclusions
• Great interest in both use cases• In news social media have transformed both news generation and
consumption• Social media data mining can provide interesting results in
many applications• Not all data always available (e.g. User queries, fb)
– Infrastructure, Policy issues• Technical challenges
– Fusion (multi-modality, context), real-time, noise, big data, aggregation (web, Linked Open Data)
• Applications challenges• User engagement, visualization, become part of existing workflows,
privacy, copyright, commercialization
Thank you!
Top Related