Research frontiers in online social media...

Research frontiers in online social media studies Understanding content, user behaviors and information diffusion Emilio Ferrara Center for Complex Networks and Systems Research School of Informatics and Computing Indiana University Bloomington August 8, 2013 Summer Workshop on Algorithms and Cyberinfrastructure for large scale optimization/AI


Research frontiers in online social media studies

Understanding content, user behaviors and information diffusion

Emilio Ferrara Center for Complex Networks and Systems Research

School of Informatics and Computing Indiana University Bloomington

August 8, 2013 Summer Workshop on Algorithms and Cyberinfrastructure for large scale optimization/AI


Data Collection

• Twitter Streaming API (10% sample of total traffic)

• August 2010 – present

• ~5 TB compressed

• Real-time access to data from the last 9 months related to 3 themes: US Politics, Social Movements, News


Detecting early signatures of persuasion in information cascades


Scope of the project

• Data acquisition in a streaming scenario from social media (Twitter, FB)

• Extraction of information tokens, so-called memes

• Clustering of memes

• Classification of meme clusters


Architecture


Problem statement

• Goal: clustering a large volume of tweets, in a streaming scenario, into topics based on their similarity.

• Challenge: tweet text is too sparse for classification, so we need to exploit further features:

  • Network structure

  • Temporal signature

  • Meta-data


Meme definition

• @Mention: the user addresses another user by mentioning their username (Twitter syntax: @)

• #Hashtag: the user tags their message with a “concept” (syntax: #)

• URL: a message can include one or more URLs in extended or shortened format

• Phrase: whatever remains after removing mentions, hashtags and URLs, stemming verbs/nouns, and removing stop-words and punctuation.
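The four meme types above can be sketched with a few regular expressions. This is only an illustration, not the pipeline used in the talk: the function name `extract_memes`, the tiny `STOP_WORDS` set and the regular expressions are assumptions, and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on", "for", "at", "rt"}

MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_memes(text):
    """Split one tweet into mentions, hashtags, URLs and the remaining phrase."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that '#' or '@' inside a URL is not mistaken for a meme.
    rest = URL_RE.sub(" ", text)
    mentions = [m.lower() for m in MENTION_RE.findall(rest)]
    hashtags = [h.lower() for h in HASHTAG_RE.findall(rest)]
    rest = MENTION_RE.sub(" ", rest)
    rest = HASHTAG_RE.sub(" ", rest)
    # Drop punctuation and stop-words; stemming is omitted in this sketch.
    words = re.findall(r"[a-z0-9']+", rest.lower())
    phrase = " ".join(w for w in words if w not in STOP_WORDS)
    return {"mentions": mentions, "hashtags": hashtags, "urls": urls, "phrase": phrase}


memes = extract_memes("RT @alice: Voting is open! #Election2012 http://t.co/abc123")
```

Lower-casing mentions and hashtags is a design choice made here so that `#Election2012` and `#election2012` map to the same meme.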


Advantages of using memes

• More granularity: each tweet is assigned to one or more memes

• Efficiency in a real-time scenario: each incoming tweet is directly assigned to its meme(s) without additional overhead

• Memes can be aggregated with each other, forming clusters of topics related by content or structural similarity

• We define a set of similarity measures:

  • Common user similarity

  • Common tweet similarity

  • Common document similarity
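One way to picture the "direct assignment" of incoming tweets is an in-memory index keyed by meme. The sketch below is an assumption about how such bookkeeping could look; the class name `MemeIndex` and its fields are hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict


class MemeIndex:
    """Keeps, for every meme, the tweets and the users that carried it."""

    def __init__(self):
        self.tweets = defaultdict(set)      # meme -> set of tweet ids
        self.users = defaultdict(Counter)   # meme -> Counter of user -> number of tweets

    def add(self, tweet_id, user, memes):
        """Assign one incoming tweet to each of its memes (one dict update per meme)."""
        for meme in memes:
            self.tweets[meme].add(tweet_id)
            self.users[meme][user] += 1


index = MemeIndex()
index.add(1, "u1", ["#election", "@alice"])
index.add(2, "u2", ["#election"])
```

The per-meme tweet sets and user counts stored here are the kind of raw material that set-based or vector-based similarity measures between memes could be computed from.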


Network, memes & content relations

• Social network

• Memes

• Content


Meme similarity measures

• Common users similarity

• Common tweet similarity

• Content similarity


CUS: Common User Similarity

Computed as a cosine similarity.

Example result: 0.77
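The formula and the example vectors of this slide did not survive in the transcript, so the following is only a generic sketch of cosine similarity applied to per-meme user counts. The representation (a dict mapping each user to the number of tweets they posted with the meme) and the sample values are assumptions, and the sample does not reproduce the 0.77 figure from the slide.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as dicts key -> weight."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical user counts for two memes.
users_meme1 = {"u1": 2, "u2": 1}
users_meme2 = {"u1": 1, "u3": 1}
cus = cosine_similarity(users_meme1, users_meme2)
```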


CTS: Common Tweet Similarity

Example:

Pmeme1 = {tweet1, tweet2, tweet5}

Pmeme2 = {tweet1, tweet2, tweet5, tweet6}
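The CTS formula itself is missing from the transcript. As an assumption, the sketch below treats each meme as a binary vector over tweets and takes the cosine similarity, which for sets reduces to the size of the intersection divided by the geometric mean of the set sizes. The function name `common_tweet_similarity` is hypothetical; the two sets are the ones from the slide's example.

```python
import math


def common_tweet_similarity(tweets_a, tweets_b):
    """Cosine similarity of two tweet sets viewed as binary indicator vectors."""
    if not tweets_a or not tweets_b:
        return 0.0
    return len(tweets_a & tweets_b) / math.sqrt(len(tweets_a) * len(tweets_b))


p_meme1 = {"tweet1", "tweet2", "tweet5"}
p_meme2 = {"tweet1", "tweet2", "tweet5", "tweet6"}
cts = common_tweet_similarity(p_meme1, p_meme2)  # 3 / sqrt(3 * 4), about 0.866
```

Other set-overlap measures (Jaccard, for instance, would give 3/4 on the same sets) are equally plausible readings of the slide.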


CDS: Common Document Similarity

We again use cosine similarity, this time computed on the TF-IDF matrix.
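A minimal, dependency-free sketch of cosine similarity over TF-IDF vectors follows. The slide does not say which TF-IDF variant was used, so the weighting here (raw term count times log(N/df)) is an assumption, as are the function names and the toy documents; each "document" is assumed to be the token list associated with one meme.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse TF-IDF dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical token lists, one per meme.
docs = [
    ["vote", "election", "poll"],
    ["election", "poll", "result"],
    ["cat", "video"],
]
vectors = tfidf_vectors(docs)
cds_related = cosine(vectors[0], vectors[1])    # share "election" and "poll"
cds_unrelated = cosine(vectors[0], vectors[2])  # no shared terms
```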


Linear Combination

The similarity score L(i,j) is a linear combination of CTS, CUS and CDS.

Example: CTS(i,j) = 0.8, CUS(i,j) = 0.7, CDS(i,j) = 0.9

Different weighting schemes:

• L(i,j) = 0.5 * 0.8 + 0.5 * 0.9 = 0.85

• L(i,j) = 0.33 * 0.8 + 0.33 * 0.7 + 0.33 * 0.9 = 0.79

• L(i,j) = 1 * 0.9 = 0.9 (MAX)
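The three weighting schemes in the example can be reproduced with a few lines of code. The function name `linear_combination` and the dict-based interface are illustrative assumptions; the scores and weights are the ones given on the slide.

```python
def linear_combination(scores, weights):
    """Weighted sum of similarity scores; measures without a weight are ignored."""
    return sum(weight * scores[name] for name, weight in weights.items())


scores = {"CTS": 0.8, "CUS": 0.7, "CDS": 0.9}

l_two = linear_combination(scores, {"CTS": 0.5, "CDS": 0.5})                  # 0.85
l_three = linear_combination(scores, {"CTS": 0.33, "CUS": 0.33, "CDS": 0.33})  # 0.792, shown as 0.79
l_max = linear_combination(scores, {"CDS": 1.0})                               # 0.9
```

In the last scheme all the weight goes to the largest of the three scores, which is presumably what the "(MAX)" label on the slide refers to.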


Evaluation method

Building a ground-truth dataset:

• Thematic, hand-picked keywords (Twitter trends)

• ~2K tweets with keywords classified as trending

• All tweets collected from a given day

• The dataset contains 9 different classes (of imbalanced size)


Goals

1. Determining how correct our clustering solution is compared to the ground truth.

2. Determining the best trade-off in the clustering algorithm configuration, considering:

  • Quality of the obtained partitioning

  • Number of obtained clusters

  • Size of obtained clusters


Evaluation metrics

• Mutual information

• Normalized mutual information

• Individual entropies: H(X), H(Y)

• Joint entropy: H(X,Y)

• Conditional entropies: H(X|Y), H(Y|X)
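The formulas on this slide were not preserved, so the sketch below relies on the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y) to compare a clustering against ground-truth labels. Several normalizations of mutual information exist and the slides do not say which one was used; dividing by the mean of the two entropies is an assumption here, as are the function names and the toy labels.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with the joint taken over label pairs."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def normalized_mutual_information(x, y):
    """MI divided by the mean of H(X) and H(Y); one of several common normalizations."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0  # both partitions consist of a single cluster
    return 2 * mutual_information(x, y) / (hx + hy)


truth = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]   # same grouping, different label names
independent = [0, 1, 0, 1]      # carries no information about truth
nmi_same = normalized_mutual_information(truth, same_partition)
nmi_indep = normalized_mutual_information(truth, independent)
```

With this normalization a score near 1 means the clustering matches the ground truth up to a renaming of clusters, and a score near 0 means the two partitions are unrelated.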


Clustering algorithms

• We investigated different clustering algorithms:

  • Static hierarchical clustering

  • Hierarchical stream-clustering

  • Online K-means clustering

• Algorithms have been evaluated against each other to determine:

  • Tweet clustering vs. meme clustering

  • Best clustering algorithm

  • Content vs. network features
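The slides do not describe the online K-means variant that was evaluated. For orientation only, here is a sketch of the textbook sequential k-means update, in which each arriving point moves its nearest centroid by a running-mean step. Seeding with the first k points, the use of plain numeric tuples as features and the function name `online_kmeans` are all assumptions.

```python
def online_kmeans(points, k):
    """Sequential k-means: a single pass over the stream, one point at a time."""
    centroids, counts, labels = [], [], []
    for p in points:
        if len(centroids) < k:
            # Seed the first k centroids with the first k points (assumed choice).
            centroids.append(list(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
            continue
        # Assign the point to the nearest centroid (squared Euclidean distance).
        j = min(range(k), key=lambda i: sum((c - x) ** 2 for c, x in zip(centroids[i], p)))
        counts[j] += 1
        # Running-mean update: move the centroid towards the new point.
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], p)]
        labels.append(j)
    return centroids, labels


stream = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0)]
centroids, labels = online_kmeans(stream, k=2)
```

Because each point is seen once and only one centroid is touched per update, memory and per-item cost stay constant, which is what makes this family of algorithms attractive in a streaming setting.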


Stream-clustering quality


Cluster Number/Size


Summary

• Efficient clustering must exploit meta-data, content and network information

• Introducing memes (hashtags, mentions, URLs, phrases) as a ‘pre-clustering’ step improves performance

• Efficient, scalable, robust clustering algorithm, adaptable to a streaming scenario

• Room for performance improvement by adding further features and tuning the parameters of the similarity measures


Current/Future Work

• Design of a distributed architecture supporting:

  o Distributed MapReduce-based storage with replication; currently evaluating:

    • Hadoop/HBase

    • Riak

    • Cassandra/Solandra

  o Large data storage required:

    • ~50M tweets/day (increasing), ~300 GB/day of uncompressed data

    • Data compression (gz, lzw), support for JSON

  o Low latency: data storage, access and analysis in a close-to-real-time scenario
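As a small illustration of the "gz plus JSON" requirement, tweets can be stored as gzip-compressed, newline-delimited JSON using only the Python standard library. This is a sketch of the storage format, not of the distributed system under evaluation; the function names and the sample records are hypothetical.

```python
import gzip
import json


def pack_tweets(tweets):
    """Serialize tweets as newline-delimited JSON and gzip the result."""
    lines = "\n".join(json.dumps(t) for t in tweets)
    return gzip.compress(lines.encode("utf-8"))


def unpack_tweets(blob):
    """Inverse of pack_tweets: gunzip and parse one JSON object per line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Hypothetical records; real tweets carry many more fields.
sample = [{"id": i, "user": "u%d" % (i % 10), "text": "Voting is open! #election"}
          for i in range(1000)]
blob = pack_tweets(sample)
raw_size = len("\n".join(json.dumps(t) for t in sample).encode("utf-8"))
restored = unpack_tweets(blob)
```

One JSON object per line keeps the files appendable and easy to split across workers, at the cost of having to decompress a file before seeking into it.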
