Extracting Relevant & Trustworthy Information from Microblogs

49
Extracting Relevant & Trustworthy Information from Microblogs Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly, Fabricio Benevenuto MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil

description

Extracting Relevant & Trustworthy Information from Microblogs. Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly, Fabricio Benevenuto MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil. My research: The big picture. - PowerPoint PPT Presentation

Transcript of Extracting Relevant & Trustworthy Information from Microblogs

Page 1: Extracting Relevant & Trustworthy Information from Microblogs

Extracting Relevant & Trustworthy Information from Microblogs

Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly,Fabricio Benevenuto

MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil

Page 2: Extracting Relevant & Trustworthy Information from Microblogs

My research: The big pictureThree fundamental trends & challenges in social Web

1. User-generated content sharing can we protect privacy of users sharing personal data?

2. Word-of-mouth based content exchange can we understand & leverage word-of-mouth better??

3. Crowd-sourcing content rating and ranking can we find trustworthy & relevant content sources?

Page 3: Extracting Relevant & Trustworthy Information from Microblogs

Twitter microblogging site An important source for real-time Web content

500 million active users posting 400 million tweets daily

Quality of tweets / content vary widely Any one can post tweets

Celebrities, politicians, news media, academics, spammers

Challenge: Finding relevant & trustworthy content Trustworthy: Thwart spammers and their spam Relevance: Identify authoritative experts on specific

topics

Page 4: Extracting Relevant & Trustworthy Information from Microblogs

Thwarting Spammers in Twitter[WWW 2012]

Part 1

Page 5: Extracting Relevant & Trustworthy Information from Microblogs

Background: How spammers operate Twitter spammers try to gain lots of followers

To promote spam directly To gain influence in the network

Search engines rank tweets based on how influential the user is Most metrics depend on user’s network

connectivity More followers help a user to gain influenceIncentivizes spammers to acquire links to gain influence

Page 6: Extracting Relevant & Trustworthy Information from Microblogs

Acquiring followers via link farming Unrelated users exchange links with each

other To gain more influence based on network

connectivity

Alice

Bob

Charlie

David

Influence based on Influence based on connectivity is improvedconnectivity is improved

Page 7: Extracting Relevant & Trustworthy Information from Microblogs

To thwart spammers We need to

1. Understand link farming activity in Twitter 2. Combat link farming activity in Twitter

Prior works: Focused on detecting spammers Via their characteristics, e.g., follower to following

ratios Rat-race between spammers and spam fighters

We focus on the spammer support network

Page 8: Extracting Relevant & Trustworthy Information from Microblogs

Identifying spammers Used Twitter network gathered from previous

study [ICWSM’10]

Data collected in August 2009 54M nodes, 1.9B links, 1.7B Tweets

Identified accounts suspended by Twitter Account could be suspended for various reasons

Found suspended users that posted blacklisted URLs Includes 41,352 such spammers

Page 9: Extracting Relevant & Trustworthy Information from Microblogs

Spammers farm links at large-scale Spam-targets: 27% of all users followed by at

least one of ~40,000 spammers!

Spam-followers: 82% of all followers have been targeted

Spammers have more followers than random users Avg follower count for Spammers: 234, Random

users: 36

Page 10: Extracting Relevant & Trustworthy Information from Microblogs

Who responds to links from spammers? Small number of followers respond most of

the time

Top 100k followers exhibit high Top 100k followers exhibit high reciprocation of 0.8 on avg.reciprocation of 0.8 on avg.

Top 100k users account for Top 100k users account for 60% of all links to spammers60% of all links to spammers

We call these users We call these users link farmerslink farmers

Page 11: Extracting Relevant & Trustworthy Information from Microblogs

Are link farmers real users or spammers?To find out if they are spammers or real users, we

1. Checked if they were suspended by Twitter 76% users not suspended, 235 of them verified by Twitter

2. Manually verified 100 random users 86% users are real with legitimate links in their Tweets

3. Analyzed their profiles More active in updating their profiles than random users

Page 12: Extracting Relevant & Trustworthy Information from Microblogs

Are link farmers lay or popular users?

Conventional wisdom: Lay users more likely to follow back due to social etiquette Popular users might be more conservative in following others

Probability increases Probability increases with user popularitywith user popularity

Link farmers are popular users with lots of followers

Page 13: Extracting Relevant & Trustworthy Information from Microblogs

Are link farmers lay or popular users? Top 5 link farmers according to Pagerank:

1. Barack Obama: Obama 2012 campaign staff 2. Britney Spears 3. NPR Politics: Political coverage and conversation 4. UK Prime Minister: PM’s office 5: JetBlue Airways

Link farmers include legitimate, popular users & organizations

Page 14: Extracting Relevant & Trustworthy Information from Microblogs

What possibly motivates link farmers? One explanation:

Link farmers have similar incentives as spammers They seek to amass social capital & influence in

the network

Link farmers rank among top 5% influential Twitter users In terms of various metrics like Pagerank &

Followerrank

Page 15: Extracting Relevant & Trustworthy Information from Microblogs

Combating link farming Key challenge:

Real, popular and active users are involved in link farming

Detecting and suspending spammers alone will not help

Insight: Discourage users from following others carelessly Penalize users following anyone found to be bad

Lower the influence scores of users following spammersIncentivizes users to be more careful about who they link to

Page 16: Extracting Relevant & Trustworthy Information from Microblogs

Collusionrank

Borrows ideas from spam defense strategies for Web [WWW’05]

Low Collusionrank score for a user indicates heavy linking to spammers or spam-followers

Requires a seed set of known spammers Twitter operator periodically identifies and updates

spammers

Page 17: Extracting Relevant & Trustworthy Information from Microblogs

Collusionrank

Algorithm:1. Negatively bias the initial scores to the set of spammers

2. In Pagerank style, iteratively penalize userswho follow spammers or those who follow spam-followers

Collusionrank is based on the score of followings of a user

Because user is penalized based on who he follows

Page 18: Extracting Relevant & Trustworthy Information from Microblogs

Evaluating Collusionrank Goal:

To penalize spammers and spam-followers Should not penalize users who are not following

spammers

Used a small subset of 600 spammers as seed set

Compare ranks between Pagerank Pagerank + Collusionrank

Measures influence after accounting for link farming activity

Page 19: Extracting Relevant & Trustworthy Information from Microblogs

Effect of Collusionrank on spammers

40% of spammers 40% of spammers appear in top 20% appear in top 20% according to according to PagerankPagerank

Most of the Most of the spammers get spammers get pushed to last 10% pushed to last 10% positions based on positions based on CollusionrankCollusionrank

Page 20: Extracting Relevant & Trustworthy Information from Microblogs

Effect on link farmers

87% of link farmers in top 87% of link farmers in top 2% users according to 2% users according to Pagerank Pagerank

98% of the link 98% of the link farmers get pushed to farmers get pushed to last 10% positions last 10% positions based on based on CollusionrankCollusionrank

Page 21: Extracting Relevant & Trustworthy Information from Microblogs

Effect on normal users Focus on top 100,000 users according to

Pagerank Analyze the percentile difference in ranks between

Pagerank (P) & Pagerank + Collusionrank (PC) Percentile Difference = ( |PC-P|/N ) x 100

Only 20% of users Only 20% of users get demoted get demoted heavilyheavily

Heavily demoted Heavily demoted users follow many users follow many more spammers more spammers than othersthan others

Collusion rank selectively filters out spammers and spam-followers

Page 22: Extracting Relevant & Trustworthy Information from Microblogs

Summary: Thwarting spammers Spammers infiltrate the Twitter network by farming links

Link farming helps them gain influence to promote spam Search involves ranking users based on connectivity & influence

Analyzed link farming in Twitter by studying spammers Top link farmers are real, active and popular users

Proposed an algorithm Collusionrank to limit link farming Incentivizes users to be careful about who they connect with

Page 23: Extracting Relevant & Trustworthy Information from Microblogs

Finding Topic Experts in Twitter[WOSN 2012] [SIGIR 2012]

Part 2

Page 24: Extracting Relevant & Trustworthy Information from Microblogs

Topic experts in Twitter Twitter is now an important source of current

news 500 million users post 400 million tweets daily

Quality of tweets posted by different users vary widely News, pointless babble, conversational tweets, spam,

Challenge: to find topic experts Sources of authoritative information on specific topics

Page 25: Extracting Relevant & Trustworthy Information from Microblogs

Identifying topic experts in Twitter Existing approaches

Research studies: Pal [WSDM 11], Weng [WSDM 10] Application systems: Twitter Who-To-Follow, Wefollow,

Existing approaches primarily rely on information provided by the user herself Bio, contents of tweets, network features e.g. #followers

We rely on “wisdom of the Twitter crowd” How do others describe a user?

Page 26: Extracting Relevant & Trustworthy Information from Microblogs

Twitter Lists A feature to organize tweets received from

the people whom a user is following

Create a List, add name & description, add Twitter users to the list List meta-data offers cues for who-is-who

Tweets from all listed users will be available as a separate List stream

Page 27: Extracting Relevant & Trustworthy Information from Microblogs
Page 28: Extracting Relevant & Trustworthy Information from Microblogs

Mining Lists to infer expertise Collect Lists containing a given

user U

Identify U’s topics from List meta-data Basic NLP techniques Extract nouns and adjectives

Extracted words collected to obtain a topic document for user

[movies tv hollywood stars entertainment celebrity hollywood …]

Page 29: Extracting Relevant & Trustworthy Information from Microblogs

Lists vs. other features

Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono

Most common words from tweets

celeb, funny, humor, music, movies, laugh, comics, television, entertainers

Most common words from Lists

Profile bio

Page 30: Extracting Relevant & Trustworthy Information from Microblogs

Dataset Collected Lists of 55 million Twitter users who

joined before or in 2009

Our analysis infers topics for 1.3 million users who are included in 10 or more Lists

Page 31: Extracting Relevant & Trustworthy Information from Microblogs

Evaluating inference quality

Quality metrics Is the inference accurate? Is the inference informative?

Evaluation of popular users Celebrities, News media sources, US Senators

Using user feedback

Page 32: Extracting Relevant & Trustworthy Information from Microblogs

Popular users set 1: Celebrities

Biographical Tags

Topics of ExpertisePopular

Perception

government, president, USA,

democratpolitics, government

celebs, leader, famous, current

events

sports, cyclist, athlete

tdf, triathlon, cancercelebs, influential, famous, inspiration

The inferred attributes accurately capture Biographical information Topics of expertise Popular perception about the user

Page 33: Extracting Relevant & Trustworthy Information from Microblogs

Popular users set 2: News media sources The inferred attributes indicate

Primary topics of the media source Perceived political bias (Verified using ADA scores)

MediaBiographical

TagsTopics of Expertise

Popular Perception

CNNmedia, journalist,

bloggerspolitics, sports, tech,

weather, currentinfluential outlets

The Nationmedia, journalist, magazines, blogs

politics, governmentprogressive,

liberal

Townhall.com

media, bloggers, commentary,

journalistspolitics

conservative, republican

GuardianFilm

journalists, reviews

movies, cinema, actors, theatre,

hollywoodfilm critics

Page 34: Extracting Relevant & Trustworthy Information from Microblogs

Popular users set 3: US Senators

Out of the 100 US senators, 84 have Twitter accounts

The inferred attributes correctly infer Their political party The state represented by them Their gender

‘Female’ or ‘Women’ for all 15 female senators Their political ideology

progressive/liberal/conservative/tea-party The senate committees to which they belong

Page 35: Extracting Relevant & Trustworthy Information from Microblogs

Popular users set 3: US Senators

Biographical TagsSenate

CommitteesPerception

Chuck Grassley

politics, senator, republican, iowa,

gop

health, food, agriculture

conservative

Claire McCaskill

politics, democrats, missouri, women

tech, security, power, health,

commerce

progressive, liberal

Jim Inhofepolitics, congress,

oklahoma, republican

army, energy, climate, foreign

conservative

John Kerrypolitics, senate,

democrats, bostonhealth, climate, tech progressive

Page 36: Extracting Relevant & Trustworthy Information from Microblogs

User feedback

Accurate

Informative

Total Evaluations

345 342

Response: Yes 274 277

Response: No 18 20

Can’t tell 53 45

Ignoring can’t tell responses,

• Accuracy – 94 %

• Informative – 93 %

Page 37: Extracting Relevant & Trustworthy Information from Microblogs

Evaluating inference coverage

What fraction of Twitter can our method of inference be applied to?

A large fraction of popular Twitter users are covered

Page 38: Extracting Relevant & Trustworthy Information from Microblogs

Evaluating inference coverage We could also infer attributes of less popular users

6% of users with Follower Ranks between 1 and 10 Million They are often experts on niche topics

User: Twitter bioFollowers

Listed

Inferred Attributes

spacespin: news on robotic space exploration

56 11science, space exploration, nasa, astronomy, planets

laithm: Al-jazeera network battle cameraman

201 16jounalists, photographer, al-jazeera, media

HumphreysLab: Stem Cell, Regenrative Biology of Kidney

119 17science, stem cell, genetics, cancer, physicians, biotech, nephrologist

Page 39: Extracting Relevant & Trustworthy Information from Microblogs

Cognos Search system for topic experts in Twitter

Given a query (topic) Identify experts on the topic using Lists Rank identified experts

Page 40: Extracting Relevant & Trustworthy Information from Microblogs

Ranking experts Used a ranking scheme solely based on Lists

Two components of ranking user U w.r.t. query Q Relevance of user to query – cover density ranking

between topic document TU of user and Q Popularity of user – number of Lists including the

user

Topic relevance(TU, Q) × log(#Lists including U)

Page 41: Extracting Relevant & Trustworthy Information from Microblogs

Cognos results for “stem cell”

Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/

Page 42: Extracting Relevant & Trustworthy Information from Microblogs

Evaluation of Cognos

System deployed and evaluated ‘in-the-wild’

Evaluators were students & researchers from the three home institutes of authors

Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/

Page 43: Extracting Relevant & Trustworthy Information from Microblogs

Cognos vs. Twitter Who-To-Follow

Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/

Page 44: Extracting Relevant & Trustworthy Information from Microblogs

Cognos vs. Twitter Who-To-Follow Considering 27 distinct queries asked at least twice Judgment by majority voting

Cognos judged better on 12 queries Computer science, Linux, Mac, Apple, Ipad, Internet,

Windows phone, photography, political journalist, …

Twitter Who-To-Follow judged better on 11 queries Music, Sachin Tendulkar, Anjelina Jolie, Harry Potter,

metallica, cloud computing, IIT Kharagpur, …

Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/

Page 45: Extracting Relevant & Trustworthy Information from Microblogs

Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/

Results for query music

Page 46: Extracting Relevant & Trustworthy Information from Microblogs

Summary: Finding topic experts in Twitter Developed and deployed Cognos

Uses Lists to infer topics of expertise and rank users Competes favorably with Twitter Who-To-Follow

Lists vital in searching for topic experts in Twitter

Future work Make the inference methodology robust against List

spam Key insight: Unlike follow-links, experts do not List

non-expert users Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/

Page 47: Extracting Relevant & Trustworthy Information from Microblogs

Twitter microblogging site An important source for real-time Web content

500 million active users posting 400 million tweets daily

Quality of tweets / content vary widely Any one can post tweets

Celebrities, politicians, news media, academics, spammers

Challenge: Finding relevant & trustworthy content Trustworthy: Thwart spammers and their spam Relevance: Identify authoritative experts on specific

topics

Page 48: Extracting Relevant & Trustworthy Information from Microblogs

Higher-level take away Links mean different things in different real-

world social networks In fact, every social network offers different types of

links

They are backed by different social interactions Many links are implicit

Important to differentiate and leverage domain-specific usage of social links

Page 49: Extracting Relevant & Trustworthy Information from Microblogs

Thank You

You can try Cognos at:http://twitter-app.mpi-sws.org/whom-to-follow/http://twitter-app.mpi-sws.org/who-is-who/