Research frontiers in online social media studies
Understanding content, user behaviors and information diffusion
Emilio Ferrara
Center for Complex Networks and Systems Research
School of Informatics and Computing
Indiana University Bloomington
August 8, 2013 Summer Workshop on Algorithms and Cyberinfrastructure for large scale optimization/AI
Data Collection
• Twitter Streaming API (10% sample of total traffic)
• August 2010 – present
• ~5 TB compressed
• Real-time access to data from the last 9 months related to 3 themes: US Politics, Social Movements, News
Detecting early signatures of persuasion in information cascades
Scope of the project
• Data acquisition in a streaming scenario from social media (Twitter, FB)
• Extraction of information tokens, so-called memes
• Clustering of memes
• Classification of meme clusters
Architecture
Problem statement
• Goal: clustering a large volume of tweets, in a streaming scenario, into topics based on their similarity.
• Challenge: tweet text is too sparse for classification, so we need to exploit further features:
o Network structure
o Temporal signature
o Meta-data
Meme definition
• @Mention: the user addresses another user by mentioning their username (Twitter syntax: @)
• #Hashtag: the user tags their message with a “concept” (syntax: #)
• URL: a message can include one or more URLs in extended or shortened format
• Phrase: whatever remains after removing mentions, hashtags and URLs, stemming verbs/nouns, and removing stop-words and punctuation.
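The four meme types above can be illustrated with a minimal Python sketch. The regular expressions, the tiny stop-word list and the function name `extract_memes` are assumptions made for illustration, not the extraction rules used in the project, and stemming is left out to keep the sketch dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "at"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_memes(text):
    """Split one tweet into its four meme types (stemming is omitted here)."""
    urls = URL_RE.findall(text)
    rest = URL_RE.sub(" ", text)
    mentions = [m.lower() for m in MENTION_RE.findall(rest)]
    rest = MENTION_RE.sub(" ", rest)
    hashtags = [h.lower() for h in HASHTAG_RE.findall(rest)]
    rest = HASHTAG_RE.sub(" ", rest)
    # What is left is the phrase: lowercase words minus stop-words and punctuation.
    words = re.findall(r"[a-z0-9']+", rest.lower())
    phrase = " ".join(w for w in words if w not in STOP_WORDS)
    return {"mentions": mentions, "hashtags": hashtags, "urls": urls, "phrase": phrase}


memes = extract_memes("@alice check the new poll results at http://t.co/abc #Election #news")
```

For this made-up tweet the sketch yields one mention, two hashtags, one URL and the phrase "check new poll results", so the tweet belongs to five memes at once.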
Advantages of using memes
• More granularity: each tweet is assigned to one or more memes
• Efficiency in a real-time scenario: each incoming tweet is directly assigned to its meme(s) without additional overhead
• Memes can be aggregated with each other, forming clusters of topics related by content/structure similarity
• We define a set of similarity measures:
o Common user similarity
o Common tweet similarity
o Common document similarity
Network, memes & content relations
• Social network
• Memes
• Content
Meme similarity measures
• Common user similarity
• Common tweet similarity
• Content similarity
CUS: Common User Similarity
Measure: cosine similarity
Example result: CUS = 0.77
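As a rough illustration of a cosine-based common user similarity, the sketch below compares two memes through per-user tweet counts. Representing each meme as a vector of per-user counts is an assumption, and the user names and counts are made up; they are not the data behind the example on the slide.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical data: how many tweets each user posted with each meme.
users_meme1 = Counter({"ann": 2, "bob": 1, "eve": 1})
users_meme2 = Counter({"ann": 1, "bob": 1, "dan": 2})
cus = cosine(users_meme1, users_meme2)
```

With these counts the dot product is 3 and both norms are sqrt(6), so `cus` comes out at 0.5.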
CTS: Common Tweet Similarity
Example:
Pmeme1 = {tweet1, tweet2, tweet5}
Pmeme2 = {tweet1, tweet2, tweet5, tweet6}
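The two tweet sets from the example can be compared with a short Python sketch. It assumes that CTS is the cosine similarity of the two sets treated as binary vectors, which is a guess based on the other measures rather than a formula shown in the transcript; the function name `set_cosine` is hypothetical.

```python
from math import sqrt


def set_cosine(a, b):
    """Cosine similarity of two sets viewed as binary membership vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


p_meme1 = {"tweet1", "tweet2", "tweet5"}
p_meme2 = {"tweet1", "tweet2", "tweet5", "tweet6"}
cts = set_cosine(p_meme1, p_meme2)  # 3 / sqrt(3 * 4), roughly 0.87
```

Under this assumption the two memes share three tweets out of sets of size three and four, which gives a similarity of about 0.87.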
CDS: Common Document Similarity
We again use cosine similarity, this time computed on the TF-IDF matrix.
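A small, dependency-free sketch of cosine similarity over TF-IDF vectors follows. The smoothed IDF formula, the helper names and the three toy documents are assumptions chosen for illustration; the talk does not specify which TF-IDF variant was used.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Assumed variant: relative term frequency times a smoothed IDF.
            term: (count / len(doc)) * (log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents, one per meme, already tokenized.
docs = [
    ["election", "poll", "results"],
    ["election", "poll", "debate"],
    ["football", "match", "results"],
]
vecs = tfidf_vectors(docs)
cds_01 = cosine(vecs[0], vecs[1])
cds_02 = cosine(vecs[0], vecs[2])
```

The first two documents share two terms and the first and third share only one, so `cds_01` is larger than `cds_02`.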
Linear Combination
Linear combination similarity score: L(i,j) = w1 * CTS(i,j) + w2 * CUS(i,j) + w3 * CDS(i,j)
Example: CTS(i,j) = 0.8, CUS(i,j) = 0.7, CDS(i,j) = 0.9
Different weighting schemes:
• L(i,j) = 0.5 * 0.8 + 0.5 * 0.9 = 0.85
• L(i,j) = 0.33 * 0.8 + 0.33 * 0.7 + 0.33 * 0.9 = 0.79
• L(i,j) = 1 * 0.9 = 0.9 (MAX)
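The three weighting schemes in the example can be reproduced with a few lines of Python. The function name and the dictionary layout are illustrative assumptions; only the scores and weights come from the example.

```python
def linear_combination(scores, weights):
    """Weighted sum of similarity scores; measures without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


scores = {"CTS": 0.8, "CUS": 0.7, "CDS": 0.9}

# Equal weight on CTS and CDS only.
two_way = linear_combination(scores, {"CTS": 0.5, "CDS": 0.5})
# Roughly uniform weight on all three measures.
uniform = linear_combination(scores, {"CTS": 0.33, "CUS": 0.33, "CDS": 0.33})
# MAX scheme: all the weight goes to the largest score.
max_only = linear_combination(scores, {max(scores, key=scores.get): 1.0})
```

Up to floating-point rounding, the three results are 0.85, 0.792 and 0.9, matching the values in the example.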
Evaluation method
Building a ground-truth dataset:
• Thematic, hand-picked keywords (Twitter trends)
• ~2K tweets with keywords classified as trending
• All tweets collected from a given day
• The dataset contains 9 different classes (of imbalanced size)
Goals
1. Determining how correct our clustering solution is compared to the ground truth.
2. Determining the best trade-off in the clustering algorithm configuration, considering:
• Quality of the obtained partitioning
• Number of obtained clusters
• Size of obtained clusters
Evaluation metrics
• Mutual information: I(X;Y) = H(X) + H(Y) - H(X,Y)
• Normalized mutual information (NMI): mutual information rescaled to the range [0, 1]
• Individual entropies: H(X), H(Y)
• Joint entropy: H(X,Y)
• Conditional entropies: H(X|Y), H(Y|X)
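The entropy-based quantities above can be computed directly from two label assignments, as in the sketch below. Several normalizations of mutual information are in common use; this sketch assumes NMI = 2 * I(X;Y) / (H(X) + H(Y)), which may differ from the one used in the talk, and the label lists are made up.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with H(X,Y) taken over label pairs."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def normalized_mutual_information(x, y):
    """Assumed normalization: 2 * I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0  # both partitions are a single cluster, hence identical
    return 2 * mutual_information(x, y) / (hx + hy)


truth = [0, 0, 1, 1, 2, 2]          # hypothetical ground-truth classes
relabelled = [1, 1, 0, 0, 2, 2]     # same partition, different cluster ids
one_cluster = [0, 0, 0, 0, 0, 0]    # everything lumped together

nmi_same = normalized_mutual_information(truth, relabelled)
nmi_lumped = normalized_mutual_information(truth, one_cluster)
```

Because NMI ignores how clusters are named, the relabelled partition scores 1, while lumping everything into one cluster carries no information about the classes and scores 0.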
Clustering algorithms
• We investigated different clustering algorithms:
o Static hierarchical clustering
o Hierarchical stream-clustering
o Online K-means clustering
• The algorithms were evaluated against each other to determine:
o Tweet clustering vs. meme clustering
o Best clustering algorithm
o Content vs. network features
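To give a feel for clustering in a streaming setting, here is a minimal sketch of sequential (online) k-means, which updates one centroid per incoming point. It is a textbook variant operating on plain numeric vectors, not the implementation evaluated in the talk; seeding the centroids with the first k points and the class name `OnlineKMeans` are assumptions.

```python
from math import dist  # available from Python 3.8


class OnlineKMeans:
    """Sequential k-means: each point updates only its nearest centroid."""

    def __init__(self, k):
        self.k = k
        self.centroids = []
        self.counts = []

    def partial_fit(self, point):
        """Assign one incoming point to a cluster and update that centroid."""
        if len(self.centroids) < self.k:
            # Assumed seeding: the first k points become the initial centroids.
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1
        idx = min(range(self.k), key=lambda i: dist(point, self.centroids[i]))
        self.counts[idx] += 1
        rate = 1.0 / self.counts[idx]
        # Move the centroid toward the point so it stays the running mean.
        self.centroids[idx] = [c + rate * (p - c)
                               for c, p in zip(self.centroids[idx], point)]
        return idx


# Hypothetical 2-D stream with two well-separated groups.
stream = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1), (9.8, 10.1), (0.1, 0.3)]
model = OnlineKMeans(k=2)
labels = [model.partial_fit(p) for p in stream]
```

Each point is processed once and then discarded, which is what makes this family of algorithms suitable for a stream; the price is sensitivity to the order of arrival and to the initial seeds.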
Stream-clustering quality
Cluster Number/Size
Summary
• Efficient clustering must exploit meta-data, content and network information
• Introducing memes (hashtags, mentions, URLs, phrases) as a ‘pre-clustering’ step improves performance
• Efficient, scalable, robust clustering algorithm that can be adapted to work in a streaming scenario
• Room for performance improvement by adding further features and tuning the parameters of the similarity measures
Current/Future Work
• Design of a distributed architecture supporting:
o Distributed MapReduce-based storage with replication; currently evaluating:
• Hadoop/HBase
• Riak
• Cassandra/Solandra
o Large data storage required:
• ~50M tweets/day (increasing), ~300 GB/day of uncompressed data
• Data compression (gz, lzw), support for JSON
o Low-latency data storage, access and analysis in a close-to-real-time scenario
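As one possible reading of the compression and JSON requirements, the sketch below stores tweets as gzip-compressed, line-delimited JSON using only the Python standard library. The file layout, the helper names and the two sample records are assumptions; the records are simplified and do not follow the full Twitter schema.

```python
import gzip
import json
import os
import tempfile


def write_tweets(path, tweets):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for tweet in tweets:
            fh.write(json.dumps(tweet) + "\n")


def read_tweets(path):
    """Stream the tweets back one at a time without loading the whole file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


# Hypothetical, heavily simplified tweet records.
tweets = [{"id": 1, "text": "hello #world"}, {"id": 2, "text": "@bob hi"}]
path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl.gz")
write_tweets(path, tweets)
restored = list(read_tweets(path))
```

Line-delimited JSON keeps each tweet independently parseable, so a reader can stream through a large compressed file record by record.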