PhD Thesis presentation

DETECTION OF DISHONEST BEHAVIORS IN ON-LINE NETWORKS USING GRAPH-BASED RANKING TECHNIQUES

Francisco Javier Ortega Rodríguez

Supervised by

Prof. Dr. José Antonio Troyano Jiménez

Motivation

WWW: Web Search A new business model

Advertisements on the web pages More web traffic More visits to (or views of)

the ads

Search Engine Optimization (SEO) is born

White Hat SEO

Black Hat SEOWeb Spam!

Social Networks Reputation of users similar to relevance of web

Higher reputation can imply some benefits

Malicious users manipulate the TRS’s On-line marketplaces: money Social news sites: slant the contents of the web site Simply for “trolling” (for pleasure)

Motivation

Motivation5

Motivation

Hypothesis

The detection of dishonest behaviors in on-line networks can be carried out with graph-based techniques, flexible enough to include in their schemes specific information (in the form of features of the elements in a graph) about the network to be processed and the concrete task to be solved.

Detection of

Dishonest

Behaviors

Web Spam

Detection

Trust &

Reputation

in Social

Networks

State-of-

Conclusio

PolaritySp

State-of-

PolarityTru

Roadmap

Web Spam Detection

Web spam mechanisms try to increase the web traffic to specific web sites

Reach the top positions of a web search engine Relatedness: similarity to the user query

Changing the content of the web page

Visibility: relevance in the collection Getting a high number of references

Web Spam Detection

Content-based methods: self promotion Hidden HTML code Keyword stuffing

Web Spam Detection

Link-based methods: mutual promotion Link-farms PR-sculpting

Web Spam

Detection

State-of-

Roadmap

methods

Content-

methods

Hybrid

methods

Web Spam Detection

Relevant web spam detection methods: Link-based approaches

PageRank-based

Adaptations: Truncated PageRank [Castillo et al. 2007] TrustRank [Gyongy et al. 2004]

Web Spam Detection

Relevant web spam detection methods: Link-based approaches

Pros: Tackle the link-based spam methods The ranking can be directly used as the result of an

user query

Cons: Do not take into account the content of the web

pages Need human intervention in some specific parts

Web Spam Detection

Relevant web spam detection methods: Content-based approaches

WP’s

Compressibility

Avg. word lenght

1 … … … …

2 … … … …

… … … … …

Classifier

SpamNot

Database

Web Spam Detection

Relevant web spam detection methods: Content-based approaches

Pros: Deal with content-based spam methods Binary classification methods

Cons: Very slow in comparison to the link-based methods Based on user-specified features Do not take into account the topology of the web

Web Spam Detection

Relevant web spam detection methods: Hybrid approaches

WP’s Size

% In-links

Compressibility

Out-links / In-links

Avg. word lenght

1 … … … … … …

2 … … … … … …

… … … … … … …

Link-based metrics

Database

Web Spam Detection

Relevant web spam detection methods: Hybrid approaches

Pros: Combine the pros of link and content-based

methods. Really effective in the classification of web pages

Cons: Need user-specified features for both the content

and the link-based heuristics.

Opportunity: Do not take advantage of the global topology of

the web graph

Web Spam

Detection

PolaritySp

Roadmap

Content

Evaluation

Selection

sources

Evaluatio

Propagati

algorithm

PolaritySpam

Intuition Include content-based knowledge in a link-

based system.

Selection of sources

Content Evaluation

Propagation

algorithm

Ranking

Database

PolaritySpam

Content Evaluation

Database

PolaritySpam

Content Evaluation Acquire useful knowledge from the textual

content

Content-based heuristics Adequate for spam detection Easy to compute Highest discriminative ability

A-priori spam likelihood of a web page

PolaritySpam

Content Evaluation Small set of heuristics [Ntoulas et al., 2006]

Compressibility

Average length of words

A high value of the metrics implies an a-priori high spam likelihood of a web page

PolaritySpam

Selection of Sources

Selection of sources

Database

PolaritySpam

},...,,{ 21 ijiii mmmM

Selection of Sources Automatically pick a set of a-priori spam and

not-spam web pages, Sources- and Sources+, respectively

Take into account the content-based heuristics Given a web page wpi with metrics:

PolaritySpam

Selection of Sources Most Spamy/Not-Spamy sources (S-NS)

Content-based S-NS (CS-NS)

Content-based Graph Sources (C-GS)

Sources

PolaritySpam

Propagation algorithm

Propagation

algorithm

Ranking

PolaritySpam

Propagation algorithm: PageRank-based algorithm

Idea: propagate a-priori information from a specific set of web pages, Sources

A-priori scores for the Sources

PolaritySpam

Propagation algorithm: Two scores for each web page, vi:

Set of a-priori non-spam web pages

Set of a-priori spam web pages

PolaritySpam

Propagation

algorithm

Ranking

PolaritySpam

Evaluation: Dataset

Baseline

Evaluation Methods

Results

PolaritySpam

Evaluation: Dataset

WEBSPAM-UK 2006 (Università degli Studi di Milano)

Metrics: 98 million pages 11,400 hosts manually labeled

7,423 hosts are labeled as spam About 10 million web pages are labeled as spam

Processed with Terrier IR Platform http://terrier.org

PolaritySpam

Evaluation: Baseline: TrustRank [Gyongy et al., 2004]

Link-based web spam detection method

Personalized PageRank equation

Propagation from a set of hand-picked web pages

PolaritySpam

Evaluation: Evaluation methods: PR-Buckets

Bucket 1

Bucket 2

Bucket N

PolaritySpam

Evaluation metric: number of spam web pages in each bucket

Bucket 1

Bucket 2

Bucket N

PolaritySpam

Evaluation metric: number of spam web pages in each bucket

Bucket 1

Bucket 2

Bucket N

PolaritySpam

Evaluation: Normalized Discounted Cumulative Gain

(nDCG): Global metric: measures the demotion of spam

web pages

Sum the “relevance” scores of not-spam web pages

PolaritySpam

web pages

PolaritySpam

web pages

PolaritySpam

Evaluation: PR-Buckets evaluation

S-NS and CS-

NSUsing the 5% of

web pages with

highest and

lowest Spaminess

PolaritySpam

Evaluation: nDCG evaluation

TrustRank

0.7381

S-NS 0.4230

CS-NS 0.8621

C-GS 0.8648

Our proposals

CS-NS and C-

GS outperforms

TrustRank in

the demotion

of spam web

1 2 3 4 5 6 7 8 9 101

AverageLength Compressibility AllMetrics TrustRank PolaritySpam

Buckets

a ge s

PolaritySpam

Evaluation: Content-based heuristics

The content-

based metrics

do not achieve

good results by

themselves

Detection of

Dishonest

Behaviors

Web Spam

Detection

Trust &

Reputation

in Social

Networks

State-of-

Conclusio

PolaritySp

State-of-

PolarityTru

Roadmap

Trust & Reputation in Social Networks

Trust and reputation are key concepts in social networks Similar to the relevance of web pages in

the WWW

Reputation: assessment of the trustworthiness of a user in a social network, according to his behavior and the opinions of the other users.

Example: On-line marketplaces

Trustworthiness as determining as the price Higher reputation implies more sales Positive and negative opinions

Main goal: gain high reputation Obtain positive feedbacks from the customers

Sell some bargains Special offers

Give negative opinions for sellers that can be competitors.

Obtain false positive opinions from other accounts

(not necessarily other users).

Dishonest behaviors!

Roadmap

Trust &

Reputation

in Social

Networks

State-of-

TRS’s in

real World

Threats

for TRS’s

Transitivit

Distrust

TRS’s in real world Moderators

Special set of users with specific responsibilities

Example: Slashdot.org A hierarchy of moderators A special user, No_More_Trolls, maintains a list of

known trolls

Drawbacks: Scalability Subjectivity

TRS’s in real world Unsupervised TRS’s

Users rate the contents of the system (and also other users)

Scalability problem: rely on the users Subjectivity problem: decentralized

Examples: Digg.com, eBay.com

Drawbacks: Unsupervised!

Transivity of Trust and Distrust [Guha et al., 2004] Multiplicative distrust

The enemy of my enemy is my friend

Additive distrust Don’t trust someone not trusted by someone you don’t trust

Neutral distrust Don’t take into account your enemies’ opinions

Threats of TRS’sOrchestrated attacks

Camouflage behind good behavior

Malicious Spies

Camouflage behind judgments

Threats of TRS’s Orchestrated attacks: obtaining positive

opinions from other accounts (not necessarily other users)

Threats of TRS’s Camouflage behind good behavior:

feigning good behavior in order to obtain positive feedback from others.

Threats of TRS’s Malicious spies: using an “honest” account to

provide positive opinions to malicious users.

Threats of TRS’s Camouflage behind judgments: giving

negative feedback to users who can be competitors.

Trust &

Reputation

in Social

Networks

PolarityTru

Roadmap

Algorithm

Negative

Propagatio

Evaluatio

Action-

Reaction

Propagation

Intuition Compute a ranking of the users in a social

network according to their trustworthiness

Take into account both positive and negative feedback

Graph-based ranking algorithm to obtain two scores for each node: : positive reputation of user i : negative reputation of user i

PolarityTrust

)( ivPT

Intuition Propagation algorithm for the opinions of the users

Given a set of trustworthy users Their PT+ and PT- scores are propagated to their

neighbors … and so on

PolarityTrust

Algorithm Propagation schema of the opinions of the

users Different behavior depending on the type of

relation between the users: positive or negative

PolarityTrust

fPT⁺(a) PT⁻(d)

PT⁺ (b) ↑

PT⁻(c) ↑ PT⁺(f) ↑

PT⁻(e) ↑

Algorithm The scores of the nodes influence the

scores of their neighbors

PolarityTrust

)( ivPT

PolarityTrust

Set of sources

dedvPT ii )1()(

PolarityTrust

Direct relation with the PT+ of positively voting users

)1()(iInj

vOutkjk

ijii vPT

pdedvPT

PolarityTrust

)1()(iInj

vOutkjk

ijii vPT

pdedvPT

Inverse relation with the PT- of negatively voting users

PolarityTrust

)1()(iInj

vOutkjk

ijii vPT

pdedvPT

)1()(iInj

vOutkjk

ijii vPT

pdedvPT

Non-Negative Propagation Problems caused by negative opinions from

malicious users

Solution: dynamically avoid the propagation of these opinions from malicious users

PolarityTrust

PR⁻(a)

PR⁺(c) ↑

PR⁻(b) ↑

Action-Reaction Propagation Problems caused by dishonest voting

attacks Positive votes to malicious users

Orchestrated attacks, malicious spies… Negative votes to good users

Camouflage behind bad judgments

React against bad actions: dishonest voting Penalize users who performs these actions Proportional to the trustworthiness of the nodes

been affected

PolarityTrust

Action-Reaction Propagation Computation:

Relation between the number of dishonest votes and the total number of votes

Applied after each iteration of the ranking algorithm

PolarityTrust

Complete Formulation

PolarityTrust

Evaluation Datasets

Baselines

Results

PolarityTrust

Evaluation Datasets

Barabasi & Albert Preferential attachment property

Randomly generated attacks

Metrics of the dataset 104nodes per graph 103 malicious users 100 malicious spies

PolarityTrust

Evaluation Datasets

Slashdot Zoo Graph of users in Slashdot.org Friend and Foe relationships Gold Standard = list of Foes of the special user

No_More_Trolls

Metrics of the dataset 71,500 users in total 24% negative edges 96 known trolls Source set: CmdrTaco and his friends 6 users in

PolarityTrust

Evaluation Baselines

EigenTrust [Kamvar et al. 2003] It does not take into account negative opinions

Fans Minus Freaks (Number of friends – Number of foes)

Signed Spectral Ranking [Kunegis et al. 2009]

Negative Ranking [Kunegis et al. 2009]

PolarityTrust

Evaluation Results: Randomly generated datasets

PolarityTrust

Threats

ET FmF SR NR PTNN PTAR PT

A 0.833 0.843

0.876 0.906 0.987

AB 0.833 0.844

0.876 0.906 0.987

ABC 0.842 0.719

0.877 0.903 0.984

ABCD 0.823 0.723

0.879 0.903 0.984

ABCDE 0.753 0.777

0.966 0.862 0.982

A: No estrategies B: Orchestrated attackC: Camouflage behind good behaviorsD: Malicious spiesE: Camouflage behind judgments

ET: EigenTrust FmF: Fans Minus FreaksSR: Spectral RankingNR: Negative Ranking

PTNN: Non-Negative PropagationPTAR: Action-Reaction PropagationPT: PolarityTrust

PolarityTrust

outperforms

the baselines

in terms of

demotion of

malicious

Evaluation Results: Slashdot Zoo dataset

PolarityTrust

nDCG 0.310

0.460 0.479 0.477 0.593

0.570 0.588

Similar performance

with a real-

world dataset

Evaluation Results: Trolling Slashdot

PolarityTrust

Threats

A 0.310

0.570 0.588

AB 0.308

0.570 0.588

ABC 0.311

0.570 0.588

ABCD 0.370

0.580 0.570 0.586

ABCDE 0.370

0.580 0.574 0.588

A: No estrategies B: Orchestrated attackC: Camouflage behind good behaviorsD: Malicious spiesE: Camouflage behind judgments

Significant

improvement

in the demotion of

malicious

Evaluation Include a set of sources of distrust

In Slashdot Zoo Dataset:

Sources of trust: CmdrTaco and friends

Sources of distrust: 5 random foes of No_More_Trolls

Many possible methods to choose the sources of distrust

PolarityTrust

Evaluation Results: Sources os trust and distrust

PolarityTrust

Sources of Trust Sources of Trust & Distrust

Threats

PTNN PTAR PT PTNN PTAR PT

A 0.593

0.846 0.790 0.846

AB 0.593

0. 846 0.790 0.846

ABC 0.593

0.846 0.790 0.846

ABCD 0.580

0.775 0.739 0.782

ABCDE 0.580

0.774 0.741 0.781

A: No estrategies B: Orchestrated attackC: Camouflage behind good behaviors

D: Malicious spiesE: Camouflage behind judgments

Improvement in

the demotion of

malicious users

with the sources

of distrust

Detection of

Dishonest

Behaviors

Web Spam

Detection

Trust &

Reputation

in Social

Networks

State-of-

Conclusio

PolaritySp

State-of-

PolarityTru

Roadmap

RemarksFuture

Conclusions

Final Remarks

Development of two systems for the detection of dishonest behaviors in on-line networks

Web Spam Detection: PolaritySpam Trust and Reputation: PolarityTrust

Propagation of some a-priori information

Web Spam: Textual content of the web pages Trust and Reputation: Trust and distrust sources

Conclusions

Final Remarks

Web Spam Detection

Unlike existent approaches, include content-based knowledge into a link-based technique

Unsupervised methods for the selection of sources

Propagate information of the sources through the network

Two simple metrics improve state-of-the-art methods

Conclusions

Final Remarks

Trust and Reputation in social networks

Negative links improve the discriminative ability of TRS’s

Propagation estrategies to deal with different attacks against a TRS

Non-Negative propagation Action-Reaction propagation

Interrelated scores modeling the transitivity of trust and distrust

Flexible to be adapted to different situations and threats

Conclusions

Future Work

PolaritySpam Applicability of more content-based metrics

Aditional methods for the selection of sources Propagation ability of each source

Infer negative relations between web pages According to their textual content Apply similar propagation schemas as in

PolarityTrust

Conclusions

Future Work PolarityTrust

Study other possible attacks Playbook sequences (omniscience of the attackers) Analyze the casuistry of the different social networks

Selection of sources of trust and distrust Link-based methods

Study other contexts with positive and negative relations:

Trending topics Authorities in the blogosphere

Conclusions

Future Work Both techniques

Study of the parallelization of both algorithms Many works on the parallelization of PageRank Saving time and memory

Detection of Spam on the social networks Spam messages and spam user accounts

Recommender Systems NLP and Opinion Mining techniques in a link-based

system Use the positive and negative information

Curriculum Vitae

Academic and Research milestones 2006: Degree on Computer Science

2006: Funded Student in the Itálica research group

2008: Master of Advances Studies: “STR: A graph-based tagger generator”

2010: Research stay at the University of Glasgow IR Group (Dr. Iadh Ounis and Dr. Craig Macdonald)

Curriculum Vitae

26 contributions to conferences and journals 5 JCR 10 International Conferences 2 CORE B 4 CORE C 4 ISI Proceedings 3 Lecture Notes in Computer Sciences 3 CiteSeer Venue Impact Ratings

Proyectos de investigación

Curriculum Vitae86

System Combination Methods

TexRank for Tagging

PolarityRank

PolaritySpam

PolarityTrust

Web Spam DetectionImproving a

Tagger Generator in IE

Contributions related to the thesis

National Conf.International Conf.JCR

Curriculum Vitae87

System Combination Methods

TexRank for Tagging

Improving a Tagger Generator in IE

Bootstrapping Applied to a Corpus Generation Task, EUROCAST 2007

TextRank como motor de aprendizaje en tareas de etiquetado, SEPLN 2006

Improving the Performance of a Tagger Generator in an Information Extraction Application, Journal of Universal Computer Science (2007)

Curriculum Vitae88

STR: A Graph-based Tagging Technique, International Journal on Artificial Intelligence Tools (2011)

Curriculum Vitae89

A Knowledge-Rich Approach to Featured-based Opinion Extraction from Product Reviews, SMUC 2010 (CIKM 2010)

Combining Textual Content and Hyperlinks in Web Spam Detection, NLDB 2010

Contributions related to the thesisPolarityRank

Web Spam Detection

Curriculum Vitae90

PolarityTrust: Measuring Trust and Reputation in Social Networks, ITA 2011PolaritySpam: Propagating Content-based Information Through a Web Graph to Detect Web Spam, International Journal of Innovative Computing, Information and Control (2012)

PolaritySpam

PolarityTrust

DETECTION OF DISHONEST BEHAVIORS IN ON-LINE NETWORKS USING GRAPH-BASED RANKING TECHNIQUES

Francisco Javier Ortega Rodríguez

Supervised by

Prof. Dr. José Antonio Troyano Jiménez

PhD Thesis presentation

Technology

Transcript of PhD Thesis presentation

Graduate Study Manual - Northwestern University · presentation and speaking. MS Thesis oral defense; PhD Prospectus oral exam; PhD Thesis oral defense; Presentations in seminars

My PhD thesis defense presentation

Olga Ivina PhD thesis presentation short

PhD thesis presentation

PhD Thesis Presentation 9Oct07 Final

Mulatu PhD Thesis Oral presentation

PhD thesis presentation Statistical Learning Theory: A PAC ...

PhD thesis presentation 2012

Presentation | PhD Thesis Strategy Implementation

PhD Thesis Presentation Mubashir

Parolini PhD Thesis - core.ac.uk fileParolini PhD Thesis - core.ac.uk

PhD Thesis

PhD Thesis Presentation, 1

Motion Planning for Multiple Autonomous Vehicles, PhD Thesis Defense Presentation

Najmul Hoda PhD Thesis Defence Presentation

PhD Thesis Presentation UAM for LinkedIn

PhD Thesis presentation - Vladimir Gutierrez - Spain 2014DECEMBER

My PhD thesis presentation slides

Final Phd Thesis Presentation

Thesis Examination Reportijbssnet.com/journals/Vol_8_No_10_October_2017/22.pdfKeywords: PhD thesis examination report, writing problems, research fundamentals‟ problems, presentation