Crawling Gnutella Network, by Samer Al-Kiswany

Post on 15-Jan-2016

1

Crawling Gnutella Network

By: Samer Al-Kiswany

2

Roadmap

EECE 411

• Introduction
• Gnutella network structure
• Gnutella protocol overview
• Gnutella crawling protocol
  • Crawling topology information
  • Crawling node content

3

Introduction

The Gnutella network is a decentralized peer-to-peer system for file sharing.

• Originally created by Justin Frankel of Nullsoft
• Large scale: today up to 4M nodes, 1000 TB of data, 100M files
• Fast growth in its early stages: it grew more than 50-fold during the first half of 2001 (and 50-fold again from 2001 to 2006)
• Self-organizing network
• Open, simple, and flexible protocol

5

Gnutella Network Structure

Gnutella Protocol 0.6 uses a two-tier architecture of ultrapeers and leaves.

[Figure: two-tier Gnutella topology; labels: Ultrapeers, Leaves]

7

Basic Primitives for File Sharing

• Join: How do I begin participating?
• Publish: How do I advertise my file(s)?
• Search: How do I find a file?
• Fetch: How do I retrieve a file?

8

Gnutella Protocol Overview

• Join: on startup, the client contacts one or more ultrapeer nodes
• Publish: not needed
• Search:
  • Ask the ultrapeer node
  • The ultrapeer propagates the query to other ultrapeers and returns the answers
• Fetch: get the file directly from the peer (HTTP)

10

Crawling a Gnutella node

When crawling a node, we are interested in two main pieces of information:

• Whom is the node connected to? (topology information)
  The Gnutella protocol calls this “Crawling/Communicating Network Topology Information”
• What files is the node sharing with others?
  The Gnutella protocol calls this “Browsing Host”

11

Crawling Topology Information

Gnutella protocol 0.6 supports crawling of network topology information!

[Figure: the crawler sends a "Topo crawl" request to a node in the Gnutella network and receives "Topo information" in return]

Topology information:
- Ultrapeers
- Leaves

12

Crawling Topology Information

Topo crawl (crawler to node):

    GNUTELLA CONNECT/0.6
    User-Agent: LimeWire (crawl)
    X-Ultrapeer: False
    Query-Routing: 0.1
    Crawler: 0.1

Topo information (node to crawler):

    GNUTELLA/0.6 200 OK
    User-Agent: BearShare
    Leaves: 127.0.0.1:6346,127.0.0.2:6346
    Peers: 127.0.0.4:6346,127.0.0.5:6346

Handshake completion (crawler to node):

    GNUTELLA/0.6 200 OK
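The topology crawl exchange above can be sketched in Java roughly as follows. This is a minimal sketch, not the course's Crawler class: the class and method names (`TopoCrawlSketch`, `buildCrawlRequest`, `parseHeaders`, `parseAddressList`) are hypothetical, and it assumes CRLF line endings with a blank line closing the header block. Header lookup is case-sensitive here for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
class TopoCrawlSketch {

    // Builds the crawl handshake shown on the slide.
    // Assumption: each line ends with CRLF and an empty line ends the headers.
    static String buildCrawlRequest() {
        return "GNUTELLA CONNECT/0.6\r\n"
             + "User-Agent: LimeWire (crawl)\r\n"
             + "X-Ultrapeer: False\r\n"
             + "Query-Routing: 0.1\r\n"
             + "Crawler: 0.1\r\n"
             + "\r\n";
    }

    // Splits the node's reply into header name -> value pairs.
    // The first line (the status line) is skipped.
    static Map<String, String> parseHeaders(String response) {
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = response.split("\r?\n");
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim(),
                            lines[i].substring(colon + 1).trim());
            }
        }
        return headers;
    }

    // Turns a comma-separated "ip:port,ip:port" value into a list of addresses.
    // A missing header (null) yields an empty list.
    static List<String> parseAddressList(String value) {
        List<String> result = new ArrayList<>();
        if (value == null) {
            return result;
        }
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

A caller would write `buildCrawlRequest()` to the socket, read the reply up to the blank line, and then pass the `Leaves` and `Peers` header values to `parseAddressList`.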

13

Browsing Node Content

[Figure: the crawler sends a "Browse Host" request to a node in the Gnutella network and receives the list of files]

14

Browsing Node Content

Browse Host (crawler to node):

    GET / HTTP/1.1
    Host: Crawler_IP:PORT
    User-Agent: UBCECE
    Accept: application/x-gnutella-packets
    Connection: close

List of files (node to crawler), returned as Query Hit messages:

    HTTP/1.1 200 OK
    Server: LimeWire/x.y
    Content-Type: application/x-gnutella-packets
    Connection: close

    <List of files>
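As a rough illustration of the Browse Host exchange, the sketch below builds the request and checks whether the reply announces Gnutella packets before any Query Hit parsing is attempted. The names (`BrowseHostSketch`, `buildRequest`, `bodyOffset`, `isQueryHitBody`) are hypothetical, and the Content-Type check is an assumption about how a crawler might tell a Query Hit reply from an HTML one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only.
class BrowseHostSketch {
    static final String GNUTELLA_PACKETS = "application/x-gnutella-packets";

    // Builds the Browse Host request from the slide. The caller chooses the
    // address placed in the Host header.
    static String buildRequest(String hostIp, int port) {
        return "GET / HTTP/1.1\r\n"
             + "Host: " + hostIp + ":" + port + "\r\n"
             + "User-Agent: UBCECE\r\n"
             + "Accept: " + GNUTELLA_PACKETS + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Returns the index of the first body byte (just past CRLF CRLF),
    // or -1 if the header block is incomplete.
    static int bodyOffset(byte[] response) {
        for (int i = 0; i + 3 < response.length; i++) {
            if (response[i] == '\r' && response[i + 1] == '\n'
                    && response[i + 2] == '\r' && response[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    // True if the Content-Type header announces Gnutella packets,
    // i.e. the body should hold Query Hit messages rather than HTML.
    static boolean isQueryHitBody(byte[] response) {
        int end = bodyOffset(response);
        if (end < 0) {
            return false;
        }
        String head = new String(response, 0, end, StandardCharsets.ISO_8859_1);
        for (String line : head.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Content-Type")) {
                String value = line.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
                return value.startsWith(GNUTELLA_PACKETS);
            }
        }
        return false;
    }
}
```

The response is handled as bytes because the body is binary; only the header block is decoded as text.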

15

Query Hit Parsing

A Query Hit message has the layout:

    1 | 2 | A B C D E F | 3

1: Gnutella message header. Important field: message length.
2: Query Hit header. Important field: number of files.
A-F: list of shared files. Each entry includes the file name and size.
3: Other Gnutella protocol fields.

The HTTP response may contain more than one Query Hit message, one after another:

    Query Hit Message (1 2 A B C D E F 3), Query Hit Message (1 2 A B C D E F 3), Query Hit Message (1 ...)
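The parsing steps above might look like the following sketch. The slide names only the parts of the message, so the byte offsets here are assumptions taken from the Gnutella v0.4 descriptor format: a 23-byte header with the type byte at offset 16 and a little-endian payload length at offset 19, then a Query Hit payload that starts with the hit count (1 byte), port (2), IP (4), and speed (4), where each result is an index (4), a size (4), a NUL-terminated name, and an extension block closed by a second NUL. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for illustration; offsets are assumptions (see lead-in).
class QueryHitSketch {
    static final int HEADER_LEN = 23;     // part 1: GUID(16) type(1) TTL(1) hops(1) length(4)
    static final int QUERY_HIT = 0x81;    // payload type of a Query Hit
    static final int HIT_PREFIX_LEN = 11; // part 2: count(1) port(2) IP(4) speed(4)

    static final class SharedFile {
        final String name;
        final long size;
        SharedFile(String name, long size) { this.name = name; this.size = size; }
    }

    // Walks every message in the HTTP body and collects the shared files.
    static List<SharedFile> parse(byte[] body) {
        List<SharedFile> files = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        while (pos + HEADER_LEN <= body.length) {
            int type = body[pos + 16] & 0xFF;
            int payloadLen = buf.getInt(pos + 19);  // message length from the header
            int start = pos + HEADER_LEN;
            if (payloadLen < 0 || payloadLen > body.length - start) {
                break;                              // truncated or corrupt message
            }
            if (type == QUERY_HIT && payloadLen >= HIT_PREFIX_LEN) {
                parseResults(body, buf, start, start + payloadLen, files);
            }
            pos = start + payloadLen;               // next message follows directly
        }
        return files;
    }

    private static void parseResults(byte[] body, ByteBuffer buf, int start, int end,
                                     List<SharedFile> out) {
        int hits = body[start] & 0xFF;              // number of files
        int p = start + HIT_PREFIX_LEN;
        for (int i = 0; i < hits && p + 8 <= end; i++) {
            long size = buf.getInt(p + 4) & 0xFFFFFFFFL; // skip index, read unsigned size
            p += 8;
            int nameStart = p;
            while (p < end && body[p] != 0) p++;    // file name up to the first NUL
            String name = new String(body, nameStart, p - nameStart, StandardCharsets.UTF_8);
            p++;
            while (p < end && body[p] != 0) p++;    // skip the extension block
            p++;
            out.add(new SharedFile(name, size));
        }
    }
}
```

Messages of other types are skipped by their declared length, and the trailing fields of each Query Hit (part 3) are ignored.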

16

Limitations: does this always work?

Topology crawling:

• Topology crawling is not supported by some Gnutella protocol v0.4 implementations

Host browsing:

• Some Gnutella node implementations (BearShare, for instance) return the list of files as HTML rather than as Query Hit messages

18

Single Gnutella-Node Crawler

A proof-of-concept implementation of a single Gnutella-node crawler is available at:
http://www.ece.ubc.ca/~samera/TA/project/sgnc.html

The main class that implements the crawling protocol is the Crawler class:

• crawlpeers(ip_address, port)
• parsePeers(byte[])
• listFiles(ip_address, port)
• processQueryHit(byte[])

19

Project Phase II

• Implement a single-node Gnutella network crawler
• Report:
  • The active leaf nodes
  • Information regarding the “agent” (i.e., the implementation: LimeWire, BearShare, etc.)
  • The domain name corresponding to the node IP address
• Avoid cycles!
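One common way to avoid cycles is to remember every address already seen and to queue only new ones. The sketch below assumes a breadth-first traversal; `CycleFreeCrawlSketch`, `crawl`, and the `neighborsOf` function (a stand-in for a real topology crawl of one node) are hypothetical names, and `maxNodes` is an added safety limit not mentioned on the slide.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical traversal for illustration only.
class CycleFreeCrawlSketch {

    // Breadth-first crawl starting at seed. neighborsOf stands in for a real
    // topology crawl that returns the leaves and peers of one node.
    // Returns every address discovered, in discovery order.
    static Set<String> crawl(String seed,
                             Function<String, List<String>> neighborsOf,
                             int maxNodes) {
        Set<String> seen = new LinkedHashSet<>();  // addresses already queued or crawled
        Deque<String> toCrawl = new ArrayDeque<>();
        seen.add(seed);
        toCrawl.add(seed);
        while (!toCrawl.isEmpty()) {
            String node = toCrawl.poll();
            for (String next : neighborsOf.apply(node)) {
                if (seen.size() >= maxNodes) {
                    return seen;                   // safety limit reached
                }
                if (seen.add(next)) {              // add() is false for a repeat
                    toCrawl.add(next);
                }
            }
        }
        return seen;
    }
}
```

Because an address is added to `seen` before it is queued, a node reachable along several paths is crawled only once. For the domain name in the report, `InetAddress.getByName(ip).getCanonicalHostName()` from `java.net` is one option.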

20

Project Phase III

• Implement a master/worker crawler with Java NIO sockets.

[Figure: a master keeps two lists, "Crawled" and "To be crawled". It sends each worker a request ("Crawl the following list: …"), and the workers, which crawl the Gnutella network, send back results (peer IPs, statistics).]

Problems? (Hint: failures)
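One possible way for the master to keep the two lists and cope with a failed worker is sketched below. It is only an illustration under stated assumptions: the names (`MasterStateSketch`, `assign`, `complete`, `workerFailed`) are hypothetical, failure detection (for example a timeout) is left to the caller, the class is not thread-safe, and the NIO networking is omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical master-side bookkeeping for illustration only.
class MasterStateSketch {
    private final Deque<String> toCrawl = new ArrayDeque<>();  // "To be crawled"
    private final Set<String> crawled = new HashSet<>();       // "Crawled"
    private final Set<String> known = new HashSet<>();         // every address ever queued
    private final Map<String, List<String>> inFlight = new HashMap<>(); // worker -> batch

    // Queues an address unless it has been seen before.
    void addSeed(String address) {
        if (known.add(address)) {
            toCrawl.add(address);
        }
    }

    // Hands up to batchSize addresses to a worker and remembers the assignment.
    List<String> assign(String workerId, int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !toCrawl.isEmpty()) {
            batch.add(toCrawl.poll());
        }
        if (!batch.isEmpty()) {
            inFlight.computeIfAbsent(workerId, k -> new ArrayList<>()).addAll(batch);
        }
        return batch;
    }

    // The worker reports back: its batch is done and new peers are queued.
    void complete(String workerId, List<String> discovered) {
        List<String> batch = inFlight.remove(workerId);
        if (batch != null) {
            crawled.addAll(batch);
        }
        for (String address : discovered) {
            addSeed(address);
        }
    }

    // The worker is presumed dead: its unfinished batch goes back on the queue.
    void workerFailed(String workerId) {
        List<String> batch = inFlight.remove(workerId);
        if (batch != null) {
            toCrawl.addAll(batch);
        }
    }

    int pending() { return toCrawl.size(); }
    int crawledCount() { return crawled.size(); }
}
```

Tracking in-flight batches per worker is what lets the master reassign work after a failure instead of losing it.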

21

Project Phase III

• Implement a master/worker crawler with Java NIO sockets.

• Adopt primary/backup replication for the master

[Figure: a primary master and a backup master, with the "Crawled" and "To be crawled" lists, in front of the Gnutella network; an X marks a failed master]

22

Previous Years' Ideas – Part I

Programming languages / frameworks / protocols
• Java (the vast majority)
• Scala
• Apache MINA framework
• Java RMI
• Jython
• XML-RPC
• SQL
• Python/Perl/Shell/cron jobs

Architecture
• Master/worker (the majority)
• Hierarchical

23

Previous Years' Ideas – Part II

Design choices
• NIO at both master and workers
• Careful load balancing
• Keep the workers always busy
• Bootstrapping new workers if old workers fail

Additional bells and whistles
• GUI manager
• Statistics in real time through a GUI and a web page
• Graphviz

24

References

• Single Gnutella-Node Crawler: http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
• Gnutella crawling protocol: http://www.ece.ubc.ca/~samera/TA/project/Gnuttela-Protocol.html

Other references:
• http://gnutella-specs.rakjar.de/index.php/Main_Page
• www.limewire.com

25

Thank you

www.ece.ubc.ca/~samera