Dynamics of Peer-to-Peer Networks or Who is Going to be The Next Pop Star? Yuval Shavitt School of...

Post on 01-Apr-2015


Dynamics of Peer-to-Peer Networks or

Who is Going to be The Next Pop Star?

Yuval Shavitt
School of Electrical Engineering

shavitt@eng.tau.ac.il
http://www.eng.tau.ac.il/~shavitt

Credits

Talk is based on the papers:

• Static and dynamic characterization of the Gnutella network [Shaked-Gish, S, Tankel, IPTPS 2007]

• How to predict the next pop star? [Koenigstein, S, Tankel, KDD 2008]

What are Peer-to-Peer Networks?

• The common computing paradigm is client-server
– Server waits for requests (on a known port)
– Client sends a request
– Server serves the client
– Examples: WWW, FTP, SMTP (e-mail), ...

• Peer-to-peer networks:
– Each end-point is both client and server

[Figure: clients and a server]

The Gnutella Network

• Gnutella: the most popular file-sharing network on the Internet

• According to the Digital Music News Research Group, it held a 40% market share in Q4 2007

• Limewire: the most popular file-sharing client in the world; it dominates the Gnutella network

The Gnutella Protocol

• Originally: a flat peer-to-peer distributed protocol
– Churn caused instability

• Today: a 2-level tiered system
– Stable nodes are promoted to become ultrapeers
– Queries carry an out-of-band (OOB) address: the originator’s address or, in most cases when the client is firewalled, the ultrapeer’s address

Locating the Origin IP address

IP resolution process:

• Detect the ultrapeer (U.P.) IP
• Discard queries with more than 2 hops
• Discard queries with 2 hops and the same IP
• Intercept queries with 2 hops and different IPs

[Figure: peers, ultrapeers (UP) and the listener]

Cancels the bias for rare queries

Introduces bias against firewalled clients
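The filtering rules of the IP resolution process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Query` record and its field names are hypothetical, "same IP" is read as "the OOB address equals the IP of the ultrapeer that forwarded the query", and queries with fewer than 2 hops, which the slide does not mention, are simply dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical record of one query as seen by the listener."""
    text: str
    hops: int           # hop count of the query when it reaches the listener
    oob_ip: str         # out-of-band (OOB) address carried in the query
    forwarder_ip: str   # IP of the ultrapeer that forwarded the query


def classify(query: Query) -> str:
    """Return 'intercept' or 'discard' according to the slide's rules."""
    if query.hops > 2:
        return "discard"      # rule: more than 2 hops
    if query.hops == 2 and query.oob_ip == query.forwarder_ip:
        return "discard"      # rule: 2 hops and same IP (assumed: OOB == ultrapeer IP)
    if query.hops == 2:
        return "intercept"    # rule: 2 hops and different IPs
    return "discard"          # fewer than 2 hops: not covered by the slide (assumption)
```

For example, `classify(Query("soulja boy", 2, "192.0.2.10", "198.51.100.7"))` returns `"intercept"`, and the OOB address is then taken as the origin IP for geo-identification.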

Data Sets

• First study:
– Jul 2006 - Nov 2006
– 665,000,000 world-wide geo-identified queries

• Second study:
– Oct 2006 – Jul 2007, Sundays only
– 310,000,000 USA geo-identified queries

• A network crawl of 24 hours
– 1.2M users
– 533,000 different songs

Largest studies ever performed in length and depth

Query Classification in Gnutella

[Figure: pie chart of query categories]

• Music (68.11%)
• Adult (22.01%)
• Movie (4.1%)
• TV (1.7%)
• Unknown (1.67%)
• Japanese Anime/Comic (1.37%)
• Software (0.54%)
• File Suffix (0.26%)
• Spam (0.23%)

2nd

Top Countries

Queries Per Day

Queries Per Hour Per User

Top Queries (constant)

Top Volatile Queries

Temporal Ranking Drift

How to Predict an Artist’s Success?

Noam Koenigstein, Y. Shavitt, and Tomer Tankel. Spotting Out Emerging Artists Using Geo-Aware Analysis of P2P Query Strings. The 2008 ACM SIGKDD Conference, August 2008, Las Vegas, NV, USA.

The Word of Mouth Effect

A successful innovation: formation of adopter clusters around early adopters

An unsuccessful product: a uniform spatial distribution

The divergence can be used to predict a new product’s success probability [Garber et al., Marketing Science 2004]

The divergence

• When measured against the uniform distribution, the maximum is achieved when P is a delta function.
– True for both Kullback-Leibler and Jensen-Shannon
– This is the case when emerging artists are considered

• Non-uniform distribution of potential adopters:
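As a minimal sketch of the two divergence measures named on this slide, the Python below computes the Kullback-Leibler and Jensen-Shannon divergences between discrete distributions given as lists of probabilities. The function names are illustrative, and the reference distribution `q` is a plain parameter: passing the uniform distribution reproduces the first bullet, while a non-uniform reference could be passed instead. How the talk defines that non-uniform reference is not recoverable from the transcript.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With a uniform reference over N locations, `kl_divergence(p, uniform)` equals log N minus the entropy of `p`, so it peaks at log N when all the mass of `p` sits on a single location.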

Party Like a Rockstar in 2007

• Week 6: The string “party like a rockstar” is detected by the algorithm
• Week 8: Atlanta’s popularity chart (Feb 18th)
• Week 15: Atlanta-based Shop Boyz sign a contract with Universal Recordings
• Week 18: The song first enters the Billboard Hot 100 (80th position)
• Week 23: Reached 2nd position on the Billboard Hot 100

Ranked only 10,156 on the global chart

[Figure: “Party Like a Rockstar”. KL divergence (axis 0 to 3) and popularity (axis 0.00E+00 to 8.00E-02) by week number (2007), weeks 1 to 23.]

[Figure: Shop Boyz related queries in February 2007]

Shop Boyz Popularity and Divergence in 2007

Soulja Boy

[Figure: KL divergence (axis 0 to 0.8) and popularity (axis 0.00E+00 to 6.00E-02) by week number (2007), weeks 1 to 29.]

• Detected by our algorithm already in 2006

• The string “soulja boy” entered the “Atlanta queries top 100” already in October 2006

• Entered the Bubbling Under R&B/Hip-Hop Singles chart on the 23rd of June 2007

• Later ranked first in the following Billboard charts: Hot 100, Hot Rap Tracks, Hot Videoclip, Hot RingMasters and Hot Ringtones

Yung Berg

• Active in LA

• Week 2: Entered LA top 100

• Week 15: First appeared on the Billboard charts

• Week 32: Reached 18 on the Billboard Top 100

[Figure: KL divergence (axis 0 to 4.5) and popularity (axis 0.00E+00 to 1.60E-02) by week number (2007), weeks 1 to 29.]

Madonna

The Detection Algorithm

• Input: A list of geo-identified P2P query strings

• Output: A list of locally popular query strings with a high probability of becoming globally popular

• Build local and global popularity charts

• Local popularity is detected using local and global popularity thresholds

• Look for local popularity growth trends from week to week

• Filtering: non-music-related content and already familiar artists are characterized by a uniform distribution
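The steps above can be sketched roughly as follows. This is not the authors' implementation: the threshold values `local_top` and `global_top`, the two-week comparison used as the "growth trend", and the `known` exclusion set are all assumptions made for illustration. Each week is a list of `(location, query_string)` pairs.

```python
from collections import Counter


def rank_chart(counts):
    """Map each query string to its rank (1 = most queried); ties break alphabetically."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {query: i + 1 for i, (query, _) in enumerate(ordered)}


def detect(prev_week, curr_week, city, local_top=100, global_top=1000,
           known=frozenset()):
    """Return strings that are popular in `city` this week, not yet globally
    popular, rising locally since last week, and not in `known`."""
    local_prev = rank_chart(Counter(q for loc, q in prev_week if loc == city))
    local_curr = rank_chart(Counter(q for loc, q in curr_week if loc == city))
    global_curr = rank_chart(Counter(q for _, q in curr_week))

    hits = []
    for query, rank in local_curr.items():
        locally_popular = rank <= local_top
        globally_popular = global_curr[query] <= global_top
        # A string absent from last week's local chart counts as rising.
        rising = rank < local_prev.get(query, float("inf"))
        if locally_popular and not globally_popular and rising and query not in known:
            hits.append(query)
    return sorted(hits)
```

A string such as “party like a rockstar” would be returned for Atlanta in a week where it sits high on the Atlanta chart, has climbed since the previous week, and is still far down the global chart.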

Local Popularity

• Not all queries are “products”, thus divergence is not effective (e.g., rare typos)

• Detection is based on local popularity:

ATPL - All Times Popular List

• Initialization: All the strings that reached global popularity in 2006

• Weekly aggregation

• Filters non-volatile strings:
– Adult related, e.g., “porn”
– Well-established artists, e.g., “madonna”, “avril lavigne”
– Movies, software, etc.
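One possible reading of the ATPL is a growing set of strings that have ever been globally popular, used to drop candidates that are not new. The sketch below follows that reading; the class name, the method names and the `global_top` cut-off are hypothetical, and "weekly aggregation" is assumed to mean adding each week's globally popular strings to the set.

```python
class AllTimesPopularList:
    """Set of query strings that have reached global popularity at some point."""

    def __init__(self, initial_strings):
        # Initialization: strings that reached global popularity in 2006.
        self._strings = set(initial_strings)

    def aggregate_week(self, global_chart, global_top=1000):
        """Add this week's globally popular strings.

        `global_chart` is a list of query strings ordered from most to
        least popular; only the first `global_top` entries are added.
        """
        self._strings.update(global_chart[:global_top])

    def drop_known(self, candidates):
        """Keep only candidates that never appeared on the list."""
        return [c for c in candidates if c not in self._strings]
```

Under this reading, strings such as “porn” or “madonna” are filtered out because they were already globally popular when the list was initialized.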

Algorithm's Flow

Detection Time

Local Threshold

Local Threshold

Manual inspection of the Atlanta data

Correlation Between Billboard and downloads

Correlation Measurements

• Modified time series correlation

• P2P correlation with the Billboard:

Finding The Optimal Time Shift
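A simple way to look for a time shift between two weekly series is to correlate them at several lags and keep the best one. The sketch below uses plain Pearson correlation, which is an assumption: the talk mentions a modified time series correlation whose formula is not in the transcript. The function names and the `max_shift` parameter are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_shift(p2p, billboard, max_shift):
    """Pair p2p[t] with billboard[t + shift] for shift = 0..max_shift and
    return the (shift, correlation) with the highest correlation."""
    best = (0, float("-inf"))
    for shift in range(max_shift + 1):
        later = billboard[shift:]
        n = min(len(p2p), len(later))
        if n < 2:
            break  # too few overlapping weeks to correlate
        r = pearson(p2p[:n], later[:n])
        if r > best[1]:
            best = (shift, r)
    return best
```

A positive best shift means the P2P series leads the Billboard series by that many weeks.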

Prediction Results

• Example: When a song enters the Billboard, will it reach the “top 20”?

• Precision: 89%, Recall: 80%

• On average, songs pass the threshold 2.83 weeks before reaching top Billboard rank

• More details: Koenigstein, Shavitt, and Zilberman, AdMIRe 2009
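For reference, precision and recall for the “top 20” question can be computed as below. This is a generic sketch with a hypothetical function name; it does not reproduce the 89% and 80% figures, which come from the authors' data.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the true set.

    predicted: songs the predictor says will reach the top 20
    actual:    songs that really reached the top 20
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Precision is the share of predicted hits that did reach the top 20; recall is the share of actual top-20 songs that were predicted.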

Summary

• Following activity on the Internet can help us detect trends before they are visible
– P2P networks
– Social networks
– Blogs
– Talk-backs
– Searches

• More at http://www.eng.tau.ac.il/~shavitt