Distributed Classification in Peer-to-Peer Networks

32
Distributed Classification in Peer- to-Peer Networks Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi Shi Institute of Computing Technology, Chinese Academy of Sciences Presentation by: Satya Bulusu

description

Distributed Classification in Peer-to-Peer Networks. Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi Shi. Institute of Computing Technology, Chinese Academy of Sciences. Presentation by: Satya Bulusu. Overview. Introduction Building Local Classifiers Distributed Plurality Voting - PowerPoint PPT Presentation

Transcript of Distributed Classification in Peer-to-Peer Networks

Page 1: Distributed Classification in Peer-to-Peer Networks

Distributed Classification in Peer-to-Peer Networks

Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi ShiInstitute of Computing Technology, Chinese Academy of Sciences

Presentation by: Satya Bulusu

Page 2: Distributed Classification in Peer-to-Peer Networks

03/27/2008

OverviewIntroduction

• Building Local Classifiers

• Distributed Plurality Voting

• Experimental Results

• Related Works

• Summary

Page 3: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Research Motivation• Widespread use of P2P networks and sensor networks• Data to be analyzed are distributed on nodes of these

large-scale dynamic networks• Traditional distributed data mining algorithms must

be extended to fit this new environment

• Motivating Examples - P2P anti-spam networks - Automatic organization of web documents in

P2P environments• A distributed classification algorithm is critical

in these applications.

Page 4: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Research Motivation contd…• New Challenges - highly decentralized peers, do not have the

notion of clients and servers - including hundreds or thousands of nodes, impossible global synchronization - frequent topology changes caused by frequent failure and recovery of peers

• Algorithm Requirements - scalability, decentralized in-network processing - communication efficient, local synchronism - fault-tolerance

Page 5: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Problem Formulation• Given:

A connected topology graph Each peer owns its local training data for

classification Local neighborhood change is informed to each peer

real-timely

• Find: Classification paradigm in this setting Including how to train and use a global classifier

• Objective: Scalability, communication-efficient, decentralized in-

network processing, fault-tolerance

• Constraints: Each peer can only communicate with its immediate

neighbors The network topology changes dynamically

( , )G U Eu U

Page 6: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Contributions from this paper• An algorithm to build an ensemble classifier for

distributed classification in P2P networks by plurality voting on all the local classifiers – Adapt the training paradigm of pasting bites

for building local classifiers– An algorithm of (restrictive) Distributed

Plurality Voting (DPV) to combine the decisions of local classifiers

Correctness Optimality

• Extensive Experimental Evaluation– Communication overhead and convergence

time of DPV– Accuracy comparison with centralized

classification

Page 7: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Overview• Introduction

Building Local Classifiers

• Distributed Plurality Voting

• Experimental Results

• Related Works

• Summary

Page 8: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Building Local Classifiers

• Pasting Bites by Breiman [JML’99]– Generating small bites of the data by

importance sampling based on the out-of-bag error of classifiers built so far

– Stopping criteria: difference of errors between two successive iteration is below a threshold

– Voting uniformly all the classifiers

• The more data on a local node, the more classifiers generated on it, the more votes it owns.

Page 9: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Overview• Introduction

• Building Local Classifiers

Distributed Plurality Voting

• Experimental Results

• Related Works

• Summary

Page 10: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Problem Formulation Of DPV• Given:

A group of peers in a graph would like to agree on one of options.

Each peer conveys its preference by initializing a voting vector , where is the number of votes on the i-th option.

Local neighborhood change is informed to each peer real-timely• Find:

The option with the largest number of votes over all peers:

• Objective: Scalability, communication-efficient, decentralized in-

network processing, fault-tolerance• Constraints:

Each peer can only communicate with its immediate neighbors

The network topology changes dynamically

u U

U ( , )G U Ed

u dP [ ]uP i

Page 11: Distributed Classification in Peer-to-Peer Networks

03/27/2008

An Example Of DPV

The third option is the answer.

Page 12: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Comparison Between DPV and Distributed Majority Voting (DMV, by Wolff et al. [TSMC’04])• DMV Given:

A group of peers in a graph Each peer conveys its preference by

kainitializing a 2-tuple , where stands for the number of the votes for certain option and stands for the number of the total vote on this peer.

The majority ratio • DMV Find:

Check whether the voting proportion of the specified option is above :

• DMV Converted to DPV: Replacing the 2-tuple on each peer

with the voting vector

u UU ( , )G U E

Page 13: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Comparison Between DPV and DMV contd…

• DPV vs. DMV DPV is a multi-values function while DMV is a binary

predicate. DMV can be solved by converting it to DPV. However, DMV can only solve 2-option DPV problems.

For a d-option DPV problem, pairwise comparisons among all d options must be performed by DMV for times (Multiple Choice Voting [TSMC’04]).

DPV finds the maximally supported option directly, and thus saves a lot of communication overhead and the time for convergence.

• DPV is the general form of DMV

Page 14: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Challenges for DPV

• No central server to add all voting vectors, Only communication between immediate neighbors

• Dynamic change of not only the network topology but also the local voting vectors

• Supporting not only one-shot query, but also continuous monitor the current voting result according to the latest network status

Page 15: Distributed Classification in Peer-to-Peer Networks

03/27/2008

DPV Protocol Overview• Assumption:

– it includes a mechanism to maintain an un-directional spanning tree for the dynamic P2P network. The protocol performs on this tree (duplicate insensitive).

– A node is informed of changes in the status of adjacent nodes.

• Protocol Overview Each node performs the same algorithm independently Specify how nodes initialize and react under different

situations: a message received, neighboring node detached or joined, the local voting vector changed

When the node status changes under the above situation, the node notifies this change to the other neighbors only if the condition for sending messages satisfies.

To guarantee that each node in the network converges toward the correct plurality

Page 16: Distributed Classification in Peer-to-Peer Networks

03/27/2008

The Condition for Sending Messages contd…

(5,2,1)+(2,0,0)=(7,2,1) 7-2=5 7-1=6 (8,6,1)+(2,0,0)=(10,6,1) 10-4=4 10-1=94<5 9>6The differences between the votes of maximally voted option and any other option decrease.

Message Sent

Page 17: Distributed Classification in Peer-to-Peer Networks

03/27/2008

The Condition for Sending Messages

(5,2,1)+(2,0,0)=(7,2,1) 7-2=5 7-1=6 (8,4,1)+(2,0,0)=(10,4,1) 10-4=6 10-1=96>5 9>6The differences between the votes of maximally voted option and all other options do not decrease.

No Message Sent

Page 18: Distributed Classification in Peer-to-Peer Networks

03/27/2008

The Correctness of DPV Protocol

All the nodes converge to the same result.

The difference between the actual votes of maximally voted option and any other option is not smaller than what the protocol have sent.

Then, all the nodes converge to the right result.

Page 19: Distributed Classification in Peer-to-Peer Networks

03/27/2008

The Optimality of DPV Protocol

C1 is more restrictive than C2, iff, for any input case if C1 is

true then C2 is true.

C1 is strictly more restrictive than C2, iff, C1 is more restrictive than C2 and there at least exists an input case such that C1

is false and C2 is true.

is the most restrictive condition for sending messages to keep the correctness of the DPV protocol.

It is the condition, which is the most difficult to satisfy. In this sense, it guarantees the optimality in communication overhead.

Page 20: Distributed Classification in Peer-to-Peer Networks

03/27/2008

The Extension of DPV Protocol

Restrictive Distributed Plurality Voting:

It can be used in a classification ensemble in a restrictive manner by leaving out some uncertain instances.

output the maximally voted option whose proportion to all the votes is above a user-specified threshold.

The new condition for sending messages is based on the spirit of .

Page 21: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Overview• Introduction

• Building Local Classifiers

• Distributed Plurality Voting

Experimental Results

• Related Works

• Summary

Page 22: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Accuracy of P2P Classification

Data: covtype (581012*54, 7 classes) from the UCI database, distributed onto 500 nodes

Page 23: Distributed Classification in Peer-to-Peer Networks

03/27/2008

The Performance of DPV Protocol

Experimental Parameters

Difference types of network topology: Power-law Graph, Random Graph, GridNumber of nodes: 500, 1000, 2000, 4000, 8000, 160007-option DPV problems

Experimental Metrics

The average communication overhead for each nodeThe convergence time of the protocol for one-shot query

Page 24: Distributed Classification in Peer-to-Peer Networks

03/27/2008

The Performance of DPV Protocol contd…

DPV0 vs. RANK (Multiple Choice Voting)500 nodesAveraging 2000 instances of 7-option plurality voting problems

a and b are the largest and second largest options, respectively.

Page 25: Distributed Classification in Peer-to-Peer Networks

03/27/2008 Ping Luo KDD 07

The Performance of DPV Protocol contd…

The Scalability of DPV0

Different number of nodes vs. communication overhead of each node

Page 26: Distributed Classification in Peer-to-Peer Networks

03/27/2008Ping Luo KDD 07

The Performance of DPV Protocol (4)

The Local Optimality of DPV0Communication overhead and convergence time under different conditions for sending messages

Page 27: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Overview• Introduction

• Building Local Classifiers

• Distributed Plurality Voting

• Experimental Results

Related Works

• Summary

Page 28: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Related Work - Ensemble Classifiers• Model Combination: (weighted) voting, meta-learning• For Centralized Data

– applying different learning algorithms with heterogeneous models

– applying a single learning algorithm to different versions of the data

Bagging: random sampling with replacement Boosting: re-weighting of the mis-classified training

examples Pasting Bites: generating small bites of the data by

importance sampling based on the quality of classifiers built so far

• For Distributed Data– distributed boosting by Lazarevic et al. [sigkdd’01]– distributed approach to pasting small bites by Chawla et

al. [JMLR’04], which uniformly votes hundreds or thousands of classifiers built on all distributed data sites

Page 29: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Related Work - P2P Data Mining• Primitive Aggregates

– Average– Count, Sum– Max, Min– Distributed Majority Voting by Wolff et al. [TSMC’04]

• P2P Data Mining Algorithms– P2P Association Rule Mining by Wolff et al.

[TSMC’04]– P2P K-means clustering by Datta et al. [SDM’06]– P2P L2 threshold monitor by Wolff et al. [SDM’06]– Outlier detection in wireless sensor networks by

Branch et al. [ICDCS’06]– A classification framework in P2P networks by

Siersdorfer et al. [ECIR’06] Limitations: local classifiers’ propagations, experiments

on 16 peers, only focusing on the accuracy issue, without involving any dynamism of P2P networks.

Page 30: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Overview• Introduction

• Building Local Classifiers

• Distributed Plurality Voting

• Experimental Results

• Related Works

Summary

Page 31: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Summary

Proposed an ensemble paradigm for distributed classification in P2P networksFormalized a generalized Distributed Plurality Voting (DPV) protocol for P2P networksThe property of DPV0

supporting both one-shot query and continuous monitortheoretical local optimality in terms of communication overheadoutperforms alternative approachesscale up to large networks

Page 32: Distributed Classification in Peer-to-Peer Networks

03/27/2008

Q. & A.

Acknowledgement