Using Propagation of Distrust to find Untrustworthy Web Neighborhoods Panagiotis Takis Metaxas Computer Science Department Wellesley College, USA ICIW2009 – Venice, Italy (May)

Outline of the Talk

The role of Search Engines in Web experience

What is Web Spam

Why do Search Engines evolve?

Web Graph vs. Societal Trust

Evolution of Search Engines: 1993-2003

Backward Propagation of Distrust

Have you used the Web…

to get informed?

to help you make decisions?

Financial Medical Political Religious…

The Web is huge

> 1 trillion (!?) static pages publicly available, … and growing every day

Much larger, if you count the “deep web”

Infinite, if you count pages created on-the-fly

We depend on search engines to find information

Web information can be unreliable

Anyone can be an author on the web!

You know e-mail Spam…

The Web has Spam too!

Search results for the steroid drug HGH (human growth hormone)

Any controversial issue will be spammed

Search results for mental disease ADHD (attention-deficit/hyperactivity disorder)

Political issues will be spammed

Search results for Senatorial candidate John N. Kennedy, 2008 USA Elections

… whether you like it or not!

But Google is usually so good at finding info… Why does it do that?

Famous search results for “miserable failure”

Why?

Web Spam: Attempt to modify the web (its structure and contents),

and thus influence search engine results in ways beneficial to web spammers

How Google (and the other search engines) Work

[Diagram labels: the web, crawl the web, create inverted index, inverted index, search engine servers, document IDs, user query, rank results]
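The crawl, index, and query steps named in the diagram can be sketched in a few lines. This is only an illustration under stated assumptions: the `pages` dictionary stands in for crawled pages, the function names are hypothetical, and real engines add ranking on top of this lookup.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every query word (unranked)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical "crawled" pages: document ID -> page text.
pages = {
    1: "growth hormone research news",
    2: "buy growth hormone online",
    3: "election news and results",
}
index = build_inverted_index(pages)
print(sorted(search(index, "growth hormone")))  # [1, 2]
```

The lookup only decides which documents match; the order in which they are shown is the ranking problem that the rest of the talk is about.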

A Brief History of Search Engines

1st Generation (ca 1994): AltaVista, Excite, Infoseek…
Ranking based on Content: Pure Information Retrieval

2nd Generation (ca 1996): Lycos
Ranking based on Content + Structure: Site Popularity

3rd Generation (ca 1998): Google, Teoma, Yahoo
Ranking based on Content + Structure + Value: Page Reputation

In the Works
Ranking based on “the user’s need behind the query”

1st Generation: Content Similarity

Content Similarity Ranking: The more rare words two documents share, the more similar they are

Documents are treated as “bags of words” (no effort to “understand” the contents)

Similarity is measured by vector angles

Query Results are ranked by sorting the angles between query and documents

How To Spam?

[Figure: document vectors d1 and d2 in a space with term axes t1, t2, t3, separated by angle θ]
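Ranking by vector angle can be sketched as follows. This is a simplified illustration, not the engines' actual code: it uses raw word counts, whereas favoring rare words would need a weighting such as TF-IDF, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Sort document names by decreasing cosine, i.e. by increasing angle."""
    return sorted(documents,
                  key=lambda name: cosine_similarity(query, documents[name]),
                  reverse=True)

# Hypothetical documents.
documents = {
    "d1": "casino poker games",
    "d2": "garden tools and seeds",
}
print(rank("casino games", documents))  # ['d1', 'd2']
```

A larger cosine means a smaller angle θ, so sorting by decreasing cosine is the same as sorting the angles in increasing order.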

1st Generation: How to Spam

“Keyword stuffing”: Add keywords, text, to increase content similarity

Page stuffed with casino-related keywords
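The effect of keyword stuffing on a content-only ranking can be shown with a small experiment. This is a sketch assuming raw word counts and cosine similarity as the content measure; the query and page texts are made up for illustration.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norms = (math.sqrt(sum(c * c for c in a.values()))
             * math.sqrt(sum(c * c for c in b.values())))
    return dot / norms if norms else 0.0

query = "casino poker"
honest_page = "garden tools and seeds for sale"
# The spammer appends the query keywords several times to the same page.
stuffed_page = honest_page + " casino poker" * 5

print(cosine(query, honest_page))                                # 0.0
print(cosine(query, stuffed_page) > cosine(query, honest_page))  # True
```

Nothing about the page's real subject changed; only its word counts did, which is all a content-only ranking looks at.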

2nd Generation: Add Popularity

A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B

Rank similar documents according to popularity

How To Spam?

[Figure: link graph of sites labeled with their popularity votes: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]
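Counting popularity votes amounts to counting the distinct sites that link to each site. The sketch below assumes a list of (source, target) site pairs; the edges are hypothetical, chosen only so that the counts match the numbers shown in the figure, and the `farm*.example` sites are invented.

```python
from collections import defaultdict

def site_popularity(links):
    """Count, for every site, how many distinct other sites link to it."""
    voters = defaultdict(set)
    sites = set()
    for source, target in links:
        sites.update((source, target))
        if source != target:  # a site does not vote for itself
            voters[target].add(source)
    return {site: len(voters[site]) for site in sites}

# Hypothetical edges (source site, target site).
links = [
    ("www.zz.com", "www.aa.com"),
    ("www.zz.com", "www.bb.com"),
    ("www.aa.com", "www.bb.com"),
    ("www.bb.com", "www.cc.com"),
    ("www.aa.com", "www.dd.com"),
    ("www.cc.com", "www.dd.com"),
]
popularity = site_popularity(links)
print(popularity["www.bb.com"], popularity["www.zz.com"])  # 2 0

# A spammer who owns extra sites can simply add votes for www.zz.com.
farm = [(f"farm{i}.example", "www.zz.com") for i in range(3)]
spammed = site_popularity(links + farm)
print(spammed["www.zz.com"])  # 3
```

Because every vote counts the same regardless of who casts it, creating more sites is enough to raise the count.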

2nd Generation: How to Spam

Create “Link Farms”: Heavily interconnected owned sites spam popularity

Interconnected sites owned by vespro.com promote main site

3rd Generation: Add Reputation…

The reputation (“PageRank”) of a page Pi = the sum of a fraction of the reputations of all pages Pj that point to Pi

Idea similar to academic co-citations

Beautiful Math behind it

PR = principal eigenvector of the web’s link matrix

PR equivalent to the chance of randomly surfing to the page

HITS algorithm tries to recognize “authorities” and “hubs”
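The PageRank idea, that each page passes a fraction of its reputation to the pages it points to, can be sketched with a simple power iteration. This is an illustration under assumptions not stated on the slide: a damping factor of 0.85 (the random surfer occasionally jumps to a random page), a fixed number of iterations, and a tiny made-up link graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict page -> list of pages it points to. Returns page -> score."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's reputation among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its reputation evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical graph: A and B both point to C, and C points back to A.
scores = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(max(scores, key=scores.get))  # C
```

The scores sum to 1 and can be read as the chance that the random surfer is on each page; B, which nobody links to, keeps only the random-jump share.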

How To Spam?

3rd Generation: How to Spam

Organize Mutual Admiration Societies: “link farms” of irrelevant reputable sites

Mutual Admiration Societies via Link Exchange

An Industry is Born

“Search Engine Optimization” Companies

Advertisement Consultants

Conferences

3rd Generation: Reputation & Anchor Text

Anchor text tells you what the reputation is about

How To Spam?

[Figure: Page A links to Page B through anchor text. Examples pointing to www.ibm.com: “Armonk, NY-based computer giant IBM announced today”; “Joe’s computer hardware links: Compaq, HP, IBM”; “Big Blue today announced record profits for the quarter”]
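Using anchor text means a page can be found under words that appear only in links pointing to it. The sketch below is a simplified model with hypothetical names and made-up `.example` sites: it files each target under the anchor phrase of its incoming links and returns the most-linked target for a phrase.

```python
from collections import Counter, defaultdict

def anchor_text_index(links):
    """links: iterable of (source_page, anchor_text, target_url) triples."""
    index = defaultdict(Counter)
    for _source, anchor, target in links:
        index[anchor.lower()][target] += 1
    return index

def top_result(index, phrase):
    """Return the target most often linked with this anchor phrase, or None."""
    counts = index.get(phrase.lower())
    return counts.most_common(1)[0][0] if counts else None

links = [
    ("joes-hardware.example", "IBM", "www.ibm.com"),
    ("news.example", "Big Blue", "www.ibm.com"),
    ("honest.example", "some phrase", "relevant.example"),
]
# Hypothetical coordinated campaign: many pages use the same anchor phrase.
links += [(f"blog{i}.example", "some phrase", "victim.example") for i in range(20)]

index = anchor_text_index(links)
print(top_result(index, "big blue"))     # www.ibm.com
print(top_result(index, "some phrase"))  # victim.example
```

The second lookup shows why anchor text invites abuse: the target never has to mention the phrase, it only has to be linked with it often enough.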

“Google-bombs” spam Anchor Text…

Business weapons: “more evil than satan”

Political weapon in pre-election season: “miserable failure”, “waffles”, “Clay Shaw” (+ 50 Republicans)

Misinformation: Promote steroids, Discredit AD/HD research

Activism / online protest: “Egypt”, “Jew”

Other uses we do not know?

“views expressed by the sites in your results are not in any way endorsed by Google…”

… mostly for political purposes

Activists openly collaborating to Google-bomb search results of political opponents in 2006

“miserable failure” hits Obama in January 2009

Search Engines vs Web Spam

Search Engine’s Action | Web Spammers’ Reaction

1st Generation: Similarity (Content) | Add keywords so as to increase content similarity

2nd Generation: + Popularity (Content + Structure) | + Create “link farms” of heavily interconnected sites

3rd Generation: + Reputation + Anchor Text (Content + Structure + Value) | + Organize “mutual admiration societies” of irrelevant reputable sites, + Googlebombs

4th Generation (in the Works): Ranking based on the user’s “need behind the query” | ??

Is there a pattern in how to spam?

Can you guess what they will do?

And Now For Something Completely(?) Different

Propaganda: Attempt to modify human behavior,

and thus influence people’s actions in ways beneficial to propagandists

Theory of Propaganda

Developed by the Institute for Propaganda Analysis, 1938-42

Propagandistic Techniques (and ways of detecting propaganda)

Word games - associate good/bad concept with social entity (Glittering Generalities, Name Calling)

Transfer - use special privileges (e.g., office) to breach trust

Testimonial - famous non-experts’ claims

Plain Folk - people like us think this way

Bandwagon - everybody’s doing it, jump on the wagon

Card Stacking - use of bad logic

Societal Trust is (also) a Graph

Propaganda: Attempt to modify the Societal Trust Graph and thus influence people in ways beneficial to propagandist

Web Spam: Attempt to modify the Web Graph, and thus influence users through search engine results in ways beneficial to web spammers

Web Spammers as Propagandists

Web Spammers can be seen as employing propagandistic techniques

in order to modify the Web Graph

There is a pattern in how to spam!

Propaganda in Graph Terms

Propaganda technique | Graph operation

Word Games | Modify Node weights

Name Calling | Decrease node weight

Glittering Generalities | Increase node weight

Transfer | Modify Node content + keep weights

Testimonial | Insert Arcs b/w irrelevant nodes

Plain Folk | Modify Arcs

Card stacking | Mislabel Arcs

Bandwagon | Modify Arcs & generate nodes

Anti-Propagandistic Lessons for Web

How do you deal with propaganda in real life?

Backwards propagation of distrust

The recommender of an untrustworthy message becomes untrustworthy

Can you transfer this technique to the web?

An Anti-Propagandistic Algorithm

Start from untrustworthy site s

S = {s}

Using BFS for depth D do:

Find the set U of sites linking to sites in S (using the Google API for up to B backlinks per site)

Ignore blogs, directories, edu’s

S = S + U

Find the bi-connected component BCC of U that includes s

BCC shows multiple paths to boost the reputation of s
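The backward breadth-first exploration can be sketched as follows. This is not the authors' implementation: the `backlinks` dictionary is a hypothetical stand-in for the backlink queries made through the Google API, `ignore` is an assumed predicate for filtering blogs, directories and .edu sites, and all site names are invented.

```python
from collections import deque

def backward_neighborhood(start, backlinks, depth, max_backlinks,
                          ignore=lambda site: False):
    """Follow links backwards from `start` for `depth` levels.

    Returns (sites, edges): the explored set S and the (linker, linked) arcs.
    """
    explored = {start}
    edges = set()
    frontier = deque([(start, 0)])
    while frontier:
        site, level = frontier.popleft()
        if level == depth:
            continue  # do not expand beyond depth D
        # Consider at most B backlinks per site.
        for linker in backlinks.get(site, [])[:max_backlinks]:
            if ignore(linker):
                continue
            edges.add((linker, site))
            if linker not in explored:
                explored.add(linker)
                frontier.append((linker, level + 1))
    return explored, edges

# Hypothetical backlink data: site -> sites that link to it.
backlinks = {
    "s.example": ["a.example", "b.example", "uni.edu"],
    "a.example": ["b.example", "c.example"],
    "c.example": ["d.example"],
}
sites, edges = backward_neighborhood(
    "s.example", backlinks, depth=2, max_backlinks=10,
    ignore=lambda site: site.endswith(".edu"))
print(sorted(sites))  # ['a.example', 'b.example', 'c.example', 's.example']
```

With depth 2, d.example is not reached because it sits three backward steps from the starting site.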

BCC vs Periphery

Since the BCC reveals multiple paths to boost the reputation of s, we expect it to contain a higher percentage of untrustworthy sites

The Periphery of the BCC, on the other hand, should have a significantly lower percentage of untrustworthy sites

[Figure: explored neighborhoods, with the BCC and the Periphery marked]
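Splitting an explored neighborhood into the BCC and the Periphery can be sketched with a standard depth-first biconnected-components routine. The assumptions here are mine, not the talk's: links are treated as undirected, the largest component containing the starting site is chosen if it belongs to several, and the edge list is invented. The recursion suits small graphs only.

```python
from collections import defaultdict

def biconnected_components(edges):
    """Return the vertex sets of the biconnected components (undirected)."""
    graph = defaultdict(set)
    for u, v in edges:
        if u != v:
            graph[u].add(v)
            graph[v].add(u)
    depth, low, stack, components = {}, {}, [], []

    def visit(node, parent):
        depth[node] = low[node] = len(depth)  # discovery order
        for nbr in graph[node]:
            if nbr == parent:
                continue
            if nbr not in depth:
                stack.append((node, nbr))
                visit(nbr, node)
                low[node] = min(low[node], low[nbr])
                if low[nbr] >= depth[node]:
                    # `node` separates this subtree: pop one whole component.
                    component = set()
                    while True:
                        edge = stack.pop()
                        component.update(edge)
                        if edge == (node, nbr):
                            break
                    components.append(component)
            elif depth[nbr] < depth[node]:
                stack.append((node, nbr))  # back edge to an ancestor
                low[node] = min(low[node], depth[nbr])

    for node in list(graph):
        if node not in depth:
            visit(node, None)
    return components

def split_bcc_and_periphery(start, edges):
    """Return (BCC containing `start`, all remaining explored sites)."""
    candidates = [c for c in biconnected_components(edges) if start in c]
    bcc = max(candidates, key=len) if candidates else {start}
    nodes = {site for edge in edges for site in edge}
    return bcc, nodes - bcc

# Hypothetical explored neighborhood as (linker, linked) pairs.
edges = [("a", "s"), ("b", "s"), ("b", "a"), ("c", "a"), ("d", "c")]
bcc, periphery = split_bcc_and_periphery("s", edges)
print(sorted(bcc), sorted(periphery))  # ['a', 'b', 's'] ['c', 'd']
```

Sites a and b reach s along more than one path, so they land in the BCC; c and d hang off a single path and fall into the Periphery.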

Evaluated Experimental Results

The trustworthiness of the starting site is a very good predictor for the trustworthiness of BCC sites

The BCC is significantly more predictive of untrustworthiness than the Periphery

[Figure labels: BCC, Periphery]

Living in Cyberspace

Critical Thinking, Education
Realize how we know what we know
“Of course it’s true; I saw it on the Internet!”

Cyber-social Structures that mimic Societal ones
Know why to trust or distrust
Who do you trust on a particular subject?

A Search Engine per Browser
Easier to fool one search engine than to fool millions of readers
Enable the reader to keep track of her trust network
Tools of cyber trust
How would you avoid the Emulex hoax?

Thank You!

http://cs.wellesley.edu/~pmetaxas

How (not) To Solve The Problem

Link Farms vs MAS