Post on 24-Feb-2016
FINDING EVENT-SPECIFIC INFLUENCERS IN DYNAMIC SOCIAL NETWORKS
Master's Thesis – Chris Schenk
December 1st, 2010
OUTLINE
  Problem overview
    Influencers, reputation, validation and security
    Summary of analysis methods
  Boulder fire data
    Twitter Data API, formats, collection and data limitations
    Statistics
  Finding event-specific influencers – Rankings
    Stats
    Hyperlink-Induced Topic Search (HITS)
    Context-specific in-degree (original work)
  Conclusions and Future Work
PROBLEM OVERVIEW
INFLUENCERS
  Social dynamics vs online social dynamics
  Social network features
    Search, friends, re-tweets
  Influencers and sheep
    What is meant by influence?
  Understanding the data
    Sampling and baseline statistics
    Similarity measures, clustering
    Semantics, intent (NLP)
  Baseline activity
INFLUENCERS – NETWORK STRUCTURE
  Betweenness/Closeness centrality
  PageRank/TwitterRank/TunkRank
  Local/Global hierarchical clustering
  K-core decomposition
  K-clique percolation
  Nearest Neighbor Networks
  Assortative mixing
  HITS Activity Network
TWITTER DATA STATS – BOULDER FIRE
  Tweets
    First day – September 6th, 2010 10:00am to September 7th, 2010 10:00am, Mountain time
    First week – September 6th, 2010 10:00am to September 13th, 2010 10:00am, Mountain time
  Social graph
    Five one-day snapshots beginning September 7th, 2010 12:40pm, Mountain time
  Tweet example
    RT @garytx: Article on Twitter's use during #eqnz, #boulderfire, and #sanbrunofire: http://bit.ly/cwI1fi
    kate30_CU - 2010-09-13 15:29:24+00:00
  Keywords: boulder, boulderfire, fourmilefire, fourmilecanyon, 4milefire
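As an illustration of how the keyword list can select event tweets, here is a minimal sketch. The matching rule (case-insensitive substring match) and the function name `matches_event` are assumptions made for this example; the slides do not say how matching was actually performed.

```python
# Keywords as listed on the slide.
KEYWORDS = ["boulder", "boulderfire", "fourmilefire", "fourmilecanyon", "4milefire"]


def matches_event(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the tweet text, ignoring case.

    Assumption: plain substring matching, which is only one possible rule.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# The example tweet shown on the slide.
example = ("RT @garytx: Article on Twitter's use during #eqnz, "
           "#boulderfire, and #sanbrunofire: http://bit.ly/cwI1fi")

print(matches_event(example))
print(matches_event("Snow expected in Denver tonight"))
```

Note that under substring matching, "boulder" already covers "boulderfire", so the rule is deliberately broad.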
QUALITATIVELY INFLUENTIAL USERS
  Sixteen users gathered by Jo White
  Used as “ground truth” data for ranking comparison
    epiccolorado, laurasrecipes, HumaneBoulder, fishnette, suzanbond, CampSteve, ConnectColorado, Org9
    metroseen, palen, sophiabliu, Mediamum, Tanukun, eadvocate, kate30_CU, BoulderChannel1
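The slides do not state how a ranking was compared against the ground-truth list. One simple option, assumed here purely for illustration, is to count how many ground-truth users appear in the top k positions of a ranking. The set below is only a subset of the sixteen names, and the sample ranking and `hits_at_k` are hypothetical.

```python
# Subset of the ground-truth usernames from the slide, lowercased for comparison.
GROUND_TRUTH = {
    "epiccolorado", "laurasrecipes", "humaneboulder", "fishnette",
    "metroseen", "palen", "sophiabliu", "eadvocate", "kate30_cu",
}


def hits_at_k(ranking, ground_truth, k):
    """Count ground-truth users among the first k entries of a ranking."""
    top_k = [user.lower() for user in ranking[:k]]
    return sum(1 for user in top_k if user in ground_truth)


# Hypothetical ranking; "user_a" and "user_b" are made-up placeholders.
sample_ranking = ["user_a", "fishnette", "kate30_CU", "user_b", "palen"]

print(hits_at_k(sample_ranking, GROUND_TRUTH, 3))
```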
TWITTER API AND DATA COLLECTION
  Search+Track+REST
    Unique users for a given event
  Profiles
    Periodic collection
  Friends/Followers
    Periodic collection
  Tweets
    One-time collection
  Limitations
    Rate limits, multi-threading
    Improper SQL query
TWEET STATS
Stat                        First Day        First Week
# Tweets (total)            12,147           2,314,700
# Users                     398              13,955
Avg. Tweets/user            30.5             165.9
Med. Tweets/user            9.0              38.0
# Hashtags (total)          7,422            756,785
# Hashtags (unique)         895              66,765
Avg. Hashtag occurrence     8.3              11.3
Med. Hashtag occurrence     1.0              1.0
# Mentions (total)          7,877            1,224,851
Avg. Mentions/User          19.9             87.8
Med. Mentions/User          1.0              1.0
# Users mentioning others   308 (77.39%)     11,036 (79.08%)
TWEET STATS (CONT.)
Stat                          First Day        First Week
# Addressed Msgs.             2,291 (18.85%)   368,047 (15.90%)
# Users addressing msgs.      227 (57.04%)     8,404 (60.22%)
# Re-tweet Msgs.              3,994 (32.88%)   504,836 (21.81%)
# Users re-tweeted (global)   1,456            134,204
# Users re-tweeted (fire)     356 (24.45%)     2,085 (1.55%)
# URLs (unique)               4,105            1,200,927
# Source applications         85               1,026
# Users giving location       30 (7.53%)       858 (6.14%)
# Tweets with location        172 (1.42%)      17,093 (0.77%)
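Statistics of this kind can be derived from raw tweet text. The sketch below is one way to do it, under stated assumptions: a re-tweet is taken to be a message starting with "RT @", an addressed message one starting with "@", and a mention any "@name" token. These definitions, and the names `classify` and `summarize`, are assumptions for this example and may differ from the ones used in the thesis.

```python
import re
from statistics import mean, median

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def classify(text):
    """Extract basic features from one tweet (assumed definitions)."""
    return {
        "retweet": text.startswith("RT @"),
        "addressed": text.startswith("@"),
        "mentions": MENTION.findall(text),
        "hashtags": [tag.lower() for tag in HASHTAG.findall(text)],
    }


def summarize(tweets):
    """Summarize a non-empty list of (username, text) pairs."""
    per_user = {}
    retweets = 0
    addressed = 0
    for user, text in tweets:
        per_user[user] = per_user.get(user, 0) + 1
        info = classify(text)
        retweets += int(info["retweet"])
        addressed += int(info["addressed"])
    total = len(tweets)
    counts = list(per_user.values())
    return {
        "tweets": total,
        "users": len(per_user),
        "avg_per_user": mean(counts),
        "median_per_user": median(counts),
        "retweet_pct": 100.0 * retweets / total,
        "addressed_pct": 100.0 * addressed / total,
    }


# Made-up sample data, apart from the first text, which is the slide's example tweet.
sample = [
    ("a", "RT @garytx: Article on Twitter's use during #eqnz, "
          "#boulderfire, and #sanbrunofire: http://bit.ly/cwI1fi"),
    ("a", "@b thanks for the update"),
    ("b", "Smoke visible from downtown #BoulderFire"),
    ("c", "hello"),
]
stats = summarize(sample)
print(stats)
```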
GRAPH STATS
Timezone: Mountain
Snapshot              Users (fire)   Users (all)   Edges (fire)   Edges (all)
2010-09-07 12:40:01   448            821,609       3,142          1,510,036
2010-09-08 12:40:01   1,631          2,292,929     25,193         5,361,650
2010-09-09 12:40:01   1,623          2,295,885     25,484         5,370,451
2010-09-10 12:40:01   1,622          2,300,838     25,664         5,372,597
2010-09-11 15:10:01   4,093          4,075,573     87,539         30,458,948
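The slide does not say how the "fire" rows were derived from the full graph. One plausible reading, assumed here, is that "Edges (fire)" counts follow edges whose two endpoints are both event users. A minimal sketch of that restriction, with in-degree computed on the result, could look like this; `event_subgraph`, `in_degree`, and the sample data are hypothetical.

```python
from collections import Counter


def event_subgraph(edges, event_users):
    """Keep only (follower, followed) edges where both users are event users.

    Assumption: this is what the "(fire)" rows measure.
    """
    return [(src, dst) for src, dst in edges
            if src in event_users and dst in event_users]


def in_degree(edges):
    """Count followers per user from (follower, followed) pairs."""
    return Counter(dst for _, dst in edges)


# Made-up sample: "x" is a user outside the event.
all_edges = [("a", "b"), ("c", "b"), ("x", "b"), ("a", "x")]
fire_users = {"a", "b", "c"}

fire_edges = event_subgraph(all_edges, fire_users)
print(len(fire_edges), in_degree(fire_edges).most_common(1))
```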
LOCATION DATA – U.S.
LOCATION DATA – DENVER METRO
LOCATION DATA – BOULDER, LONGMONT, BROOMFIELD
USER “FISHNETTE” DATA - AGGREGATE HOURLY TWEET COUNTS
USER “FISHNETTE” DATA – AGGREGATE MONTHLY TWEET COUNTS
HASHTAG COUNTS
ADDRESSED MESSAGES
RE-TWEETS
FINDING INFLUENCERS - RANKINGS
  Tweets
    Number of tweets
    Username mentions
    Number of re-tweets
  Graph
    In-degree
    HITS
      all users (sorted by frequency)
      active users
      Mentions
      addressed messages (replies)
    Context-specific in-degree
      Global followers count
      Active edges (pre-existing network)
      New Edges
RANKINGS - NUMBER OF TWEETS
RANKINGS – USERNAME MENTIONS
RANKINGS – RE-TWEETS
RANKINGS – IN-DEGREE (FOLLOWERS)
HYPERLINK-INDUCED TOPIC SEARCH (HITS)
  Hubs
    Those that link to many authorities
  Authorities
    Those that are linked to by many hubs
  Process
    Calculate the principal eigenvector of two matrices
      Followers adjacency matrix (authorities)
      Friends adjacency matrix (hubs)
    Iterative
    Rankings by highest value descending in eigenvectors
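The iterative process can be sketched with the standard HITS update, in which authority scores are summed from incoming hub scores and hub scores from outgoing authority scores, with normalization after each round. This is a minimal pure-Python sketch of the textbook algorithm, not the thesis implementation; the edge direction (follower, followed), the fixed iteration count, and the sample graph are assumptions.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {node: value / norm for node, value in scores.items()}


def hits(edges, iterations=50):
    """Return (hubs, authorities) dicts for (follower, followed) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of users linking to this user.
        new_auths = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auths[dst] += hubs[src]
        # Hub score: sum of authority scores of the users this user links to.
        new_hubs = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hubs[src] += new_auths[dst]
        auths = normalize(new_auths)
        hubs = normalize(new_hubs)
    return hubs, auths


# Made-up graph: "a" follows "c" and "d"; "b" follows "c".
sample_edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub_scores, auth_scores = hits(sample_edges)

# Rank by highest value, descending.
authority_ranking = sorted(auth_scores, key=auth_scores.get, reverse=True)
print(authority_ranking[0])
```

Repeating this update converges toward the principal eigenvectors of the two matrix products built from the adjacency matrix, which is why an iterative loop can stand in for an explicit eigenvector computation.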
RANKINGS – HITS – ALL USERS
RANKINGS – HITS – ACTIVE USERS
RANKINGS – HITS – MENTIONS
RANKINGS – HITS – ADDRESSED MSGS.
CONTEXT-SPECIFIC IN-DEGREE RANKING
  Global followers count
    Periodically download user profiles
    Calculate change in followers count for each snapshot
    Rank based on overall change, descending
  Active edges (includes pre-existing edges)
    Periodically download friend/follower lists
    Calculate change in followers count for each snapshot
    Rank based on overall change, descending
  New Edges
    Periodically download friend/follower lists
    Calculate change in followers count for each snapshot
    Do not count edges that existed prior to the start of the event
    Rank based on overall change, descending
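Two of the variants above can be sketched as follows. This is a hedged illustration, not the thesis code: the data layout (profile snapshots as dicts, follow edges as (follower, followed) pairs), the availability of a pre-event edge set, and the function names are all assumptions.

```python
from collections import Counter


def rank_by_follower_change(snapshots):
    """Rank users by overall change in global followers count.

    snapshots: list of {user: followers_count} dicts in time order.
    """
    first, last = snapshots[0], snapshots[-1]
    change = {user: last[user] - first[user] for user in last if user in first}
    return sorted(change, key=lambda user: change[user], reverse=True)


def rank_by_new_edges(pre_event_edges, latest_edges, event_users):
    """Rank users by followers gained from event users since the event began.

    Edges that already existed before the event are not counted.
    """
    new_edges = set(latest_edges) - set(pre_event_edges)
    gained = Counter(dst for src, dst in new_edges
                     if src in event_users and dst in event_users)
    return [user for user, _ in gained.most_common()]


# Made-up data: "big" is globally popular, "local" gains followers during the event.
profile_snapshots = [{"big": 1000, "local": 50}, {"big": 1005, "local": 90}]
before = {("a", "big")}
after = {("a", "big"), ("b", "big"), ("a", "local"), ("b", "local"), ("c", "local")}
users = {"a", "b", "c", "big", "local"}

print(rank_by_follower_change(profile_snapshots))
print(rank_by_new_edges(before, after, users))
```

In this toy data the locally relevant account outranks the globally popular one in both variants, which is the effect the context-specific ranking is meant to capture.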
RANKINGS – GLOBAL FOLLOWERS COUNT
RANKINGS – ACTIVE EDGES
RANKINGS – NEW EDGES
LIMITATIONS AND MODIFICATIONS
  On-going influence
    Can only measure when a user becomes influential
  Global popularity masking local influence
    User “andrewhyde”
  News and bot activity
    Extra data needed to ignore these users
  Large events
    Data collection limitations
  How important is a de-follow?
    Can identify individual user activity
  Identifying the sheep
    Can equivalently count friends (out-links) created
CONCLUSIONS
  Notions of influence and interaction are heavily dependent on social network features
  No agreement on definitions
  Influence measured by features not 100% in use
  Or features not used in the same way by everyone
  Composability problem
  HITS ranking no better than global in-degree
  Context-specific in-degree ranking good!
  Needs to be tested on multiple events of varying sizes
FUTURE WORK
  Understanding “baseline” behavior
    For users active (using keywords) during an event
    Calculate all given statistics for a user (Klout.com?)
  Lots of ways to cut the data
    Composable factors/measures/attributes
  Explaining new links created
    Models for searching, re-tweeting, hashtags, #ff, etc.
  Incorporating blogs, forums, news websites
    Real-time vs not
  Informing algorithms with other techniques
    NLP and more automation
    Qualitative analysis (crowdsourcing?)
THANKS! QUESTIONS?
REPUTATION
  Definitions?
  Scores
    Composability
  Explicit reputation
    Ratings, votes
  Implicit reputation
    Client
    Server
VALIDATION
  Ground truth
    Authorities
    Armies of grad students
    Crowd-sourcing?
  More data
    Cross-referencing
    News websites
    Blogs
    Public health and safety (or other)
SECURITY
  Malicious users
    Inflation of reputation
    Sybil attacks
  Reporting
    Audience?
    Anonymization