Crawling Gnutella Network By: Samer Al-Kiswany

Date posted: 15-Jan-2016


Roadmap

EECE 411

• Introduction
• Gnutella network structure
• Gnutella protocol overview
• Gnutella crawling protocol
    • Crawling topology information
    • Crawling node content


Introduction

Gnutella is a decentralized peer-to-peer system for file sharing.

• Originally created by Justin Frankel of Nullsoft
• Large scale: today up to 4M nodes, 1000 TB of data, 100M files
• Fast growth in its early stages: the network grew more than 50 times during the first half of 2001 (and 50 times again from 2001 to 2006)
• Self-organizing network
• Open, simple, and flexible protocol


Gnutella Network Structure

Gnutella Protocol 0.6

Two-tier architecture of ultrapeers and leaves

[Figure: two-tier topology; labels: Ultrapeers, Leaves]


Basic Primitives for File Sharing

• Join: How do I begin participating?
• Publish: How do I advertise my file(s)?
• Search: How do I find a file?
• Fetch: How do I retrieve a file?


Gnutella Protocol Overview

• Join: on startup, the client contacts one or more ultrapeer nodes
• Publish: not needed
• Search: ask the ultrapeer node; the ultrapeer propagates the query to other ultrapeers and returns the answers
• Fetch: get the file directly from the peer (HTTP)


Crawling a Gnutella node

By crawling, we are interested in two main pieces of information:

• With whom is the node connected? (topology information)
    Gnutella protocol term: "Crawling/Communicating Network Topology Information"
• What files is the node sharing with others?
    Gnutella protocol term: "Browsing Host"


Crawling Topology Information

Gnutella protocol 0.6 supports crawling of network topology information.

[Figure: the crawler sends a "Topo crawl" request to a node in the Gnutella network and receives "Topo information" back]

Topology information:

• Ultrapeers
• Leaves


Crawling Topology Information

Topo crawl (crawler to node):

    GNUTELLA CONNECT/0.6
    User-Agent: LimeWire (crawl)
    X-Ultrapeer: False
    Query-Routing: 0.1
    Crawler: 0.1

Topo information (node to crawler):

    GNUTELLA/0.6 200 OK
    User-Agent: BearShare
    Leaves: 127.0.0.1:6346,127.0.0.2:6346
    Peers: 127.0.0.4:6346,127.0.0.5:6346

Acknowledgement (crawler to node):

    GNUTELLA/0.6 200 OK
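The topology exchange above can be sketched in Java as follows. This is a minimal illustration and not the course's Crawler class: the class and method names (TopoCrawlSketch, buildCrawlRequest, parseHeaders, parseAddressList) are hypothetical, and the CRLF line endings and the blank line that ends the header block are assumptions based on the usual Gnutella 0.6 handshake conventions. The header names and values come from the exchange shown on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the Crawler class from the reference implementation.
final class TopoCrawlSketch {

    // Opening handshake of the crawler, using the header values from the slide.
    // Assumption: lines end with CRLF and an empty line ends the header block.
    static String buildCrawlRequest() {
        return "GNUTELLA CONNECT/0.6\r\n"
             + "User-Agent: LimeWire (crawl)\r\n"
             + "X-Ultrapeer: False\r\n"
             + "Query-Routing: 0.1\r\n"
             + "Crawler: 0.1\r\n"
             + "\r\n";
    }

    // Splits a handshake response into header name -> value (names lower-cased).
    static Map<String, String> parseHeaders(String response) {
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = response.split("\r?\n");
        for (int i = 1; i < lines.length; i++) {   // line 0 is the status line
            String line = lines[i];
            if (line.isEmpty()) {
                break;                              // end of the header block
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;                           // skip malformed lines
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            headers.put(name, line.substring(colon + 1).trim());
        }
        return headers;
    }

    // Turns "ip:port,ip:port" into a list; a missing header yields an empty list.
    static List<String> parseAddressList(String value) {
        List<String> out = new ArrayList<>();
        if (value == null) {
            return out;
        }
        for (String part : value.split(",")) {
            String addr = part.trim();
            if (!addr.isEmpty()) {
                out.add(addr);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String response = "GNUTELLA/0.6 200 OK\r\n"
                + "User-Agent: BearShare\r\n"
                + "Leaves: 127.0.0.1:6346,127.0.0.2:6346\r\n"
                + "Peers: 127.0.0.4:6346,127.0.0.5:6346\r\n\r\n";
        Map<String, String> h = parseHeaders(response);
        System.out.println("agent  = " + h.get("user-agent"));
        System.out.println("leaves = " + parseAddressList(h.get("leaves")));
        System.out.println("peers  = " + parseAddressList(h.get("peers")));
    }
}
```

In a real crawler the request string would be written to a socket connected to the node, and the response read up to the first empty line before being passed to parseHeaders. The User-Agent value of the response also gives the "agent" of the crawled node.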


Browsing Node Content

[Figure: the crawler sends a "Browse Host" request to a node in the Gnutella network and receives a list of files back]


Browsing Node Content

Browse Host (crawler to node):

    GET / HTTP/1.1
    Host: Crawler_IP:PORT
    User-Agent: UBCECE
    Accept: application/x-gnutella-packets
    Connection: close

List of files (node to crawler), returned as Query Hit messages:

    HTTP/1.1 200 OK
    Server: LimeWire/x.y
    Content-Type: application/x-gnutella-packets
    Connection: close

    <List of files>
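A minimal Java sketch of the Browse Host request above, plus two small checks on the response. The names (BrowseHostSketch, buildBrowseRequest, bodyOffset, isGnutellaPackets) are hypothetical, and CRLF line endings are assumed as in ordinary HTTP/1.1. The Content-Type check is there because a crawler needs to know whether the body really holds Gnutella packets before trying to parse Query Hit messages.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not the Crawler class from the reference implementation.
final class BrowseHostSketch {

    // Builds the request from the slide; hostHeader is the "IP:PORT" value.
    static String buildBrowseRequest(String hostHeader, String userAgent) {
        return "GET / HTTP/1.1\r\n"
             + "Host: " + hostHeader + "\r\n"
             + "User-Agent: " + userAgent + "\r\n"
             + "Accept: application/x-gnutella-packets\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Index of the first body byte (just past the empty line), or -1 if absent.
    static int bodyOffset(byte[] response) {
        for (int i = 0; i + 3 < response.length; i++) {
            if (response[i] == '\r' && response[i + 1] == '\n'
                    && response[i + 2] == '\r' && response[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    // True when the response declares a body of Gnutella packets.
    static boolean isGnutellaPackets(byte[] response) {
        int end = bodyOffset(response);
        if (end < 0) {
            return false;
        }
        String head = new String(response, 0, end, StandardCharsets.ISO_8859_1);
        for (String line : head.split("\r\n")) {
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("content-type:")) {
                return lower.contains("application/x-gnutella-packets");
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The host value below is a placeholder address, not a real node.
        System.out.print(buildBrowseRequest("192.0.2.10:6346", "UBCECE"));
    }
}
```

The bytes from bodyOffset onward are what a Query Hit parser would consume.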


Query Hit Parsing

Query Hit message layout: 1 | 2 | A B C D E F | 3

1 – Gnutella message header
    important field: message length
2 – Query Hit header
    important field: number of files
A–F – list of shared files
    each entry includes the file name and size
3 – Other Gnutella protocol fields

The HTTP response may contain more than one Query Hit message, one after another:

    [1 2 A B C D E F 3] [1 2 A B C D E F 3] [1 ...]
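The parsing steps above can be sketched in Java as follows. The byte offsets are assumptions taken from the commonly documented Gnutella 0.4 message format and should be checked against the protocol reference: a 23-byte message header (16-byte GUID, 1-byte type with 0x81 for Query Hit, TTL, hops, 4-byte little-endian payload length); a Query Hit header of 11 bytes (1-byte file count, port, IP, speed); then per file a 4-byte index, a 4-byte little-endian size, a null-terminated name, and a second null-terminated extension field. The class and method names are hypothetical and unrelated to processQueryHit in the reference implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser; field offsets are assumptions stated in the lead-in.
final class QueryHitSketch {
    static final int HEADER_LEN = 23;     // part "1" of the layout
    static final int QUERY_HIT = 0x81;    // payload type of a Query Hit

    static final class SharedFile {
        final String name;
        final long size;
        SharedFile(String name, long size) { this.name = name; this.size = size; }
    }

    // Walks every message in the HTTP body and collects the shared files.
    static List<SharedFile> parseAll(byte[] body) {
        List<SharedFile> files = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        while (body.length - pos >= HEADER_LEN) {
            int type = body[pos + 16] & 0xFF;
            long payloadLen = buf.getInt(pos + 19) & 0xFFFFFFFFL; // message length
            int start = pos + HEADER_LEN;
            if (payloadLen > body.length - start) {
                break;                                            // truncated message
            }
            int end = start + (int) payloadLen;
            if (type == QUERY_HIT) {
                parseHits(body, buf, start, end, files);
            }
            pos = end;                                            // next message
        }
        return files;
    }

    private static void parseHits(byte[] body, ByteBuffer buf, int start, int end,
                                  List<SharedFile> out) {
        if (end - start < 11) {
            return;
        }
        int count = body[start] & 0xFF;   // part "2": number of files
        int p = start + 11;               // skip count(1) + port(2) + IP(4) + speed(4)
        for (int i = 0; i < count && p + 8 < end; i++) {
            long size = buf.getInt(p + 4) & 0xFFFFFFFFL;          // skip file index(4)
            p += 8;
            int nameEnd = indexOfZero(body, p, end);
            if (nameEnd < 0) {
                return;
            }
            out.add(new SharedFile(
                    new String(body, p, nameEnd - p, StandardCharsets.UTF_8), size));
            int extEnd = indexOfZero(body, nameEnd + 1, end);     // extension field
            if (extEnd < 0) {
                return;
            }
            p = extEnd + 1;
        }
        // Whatever remains before 'end' is part "3" and is ignored here.
    }

    private static int indexOfZero(byte[] b, int from, int to) {
        for (int i = from; i < to; i++) {
            if (b[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(new byte[0]).size()); // empty body, no files
    }
}
```

Using the message length from the header to jump to the next message is what lets one HTTP body carry several Query Hit messages; the parser never needs to understand the trailing fields to stay aligned.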


Limitations: does this always work?

Topology crawling:

• Topology crawling is not supported by some Gnutella protocol v0.4 implementations.

Host browsing:

• Some Gnutella node implementations (BearShare, for instance) return the list of files as HTML rather than as Query Hit messages.


Single Gnutella-Node Crawler

A proof-of-concept implementation of a single Gnutella-node crawler.

Available at: http://www.ece.ubc.ca/~samera/TA/project/sgnc.html

The main class that implements the crawling protocol is the Crawler class:

• crawlpeers(ip_address, port)
• parsePeers(byte[] )
• listFiles(ip_address, port)
• processQueryHit(byte[] )


Project Phase II

• Implement a single-node Gnutella network crawler
• Report:
    • The active leaf nodes
    • Information regarding the "agent" (i.e., the implementation: LimeWire, BearShare, etc.)
    • The domain name corresponding to the node's IP address
• Avoid cycles!
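One common way to avoid cycles is to keep a set of addresses that have already been discovered and to enqueue a node only the first time it is seen. The sketch below is a generic breadth-first traversal, not the project's required design: the class name, the neighborsOf callback (a stand-in for a real topology crawl of one node), and the maxNodes cap are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical traversal helper; the network access is abstracted away.
final class CycleFreeCrawlSketch {

    // Visits each "ip:port" at most once, starting from the seed.
    // neighborsOf stands in for a topology crawl of a single node.
    static List<String> crawl(String seed,
                              Function<String, List<String>> neighborsOf,
                              int maxNodes) {
        Set<String> seen = new HashSet<>();         // every address ever enqueued
        Deque<String> frontier = new ArrayDeque<>(); // the "to be crawled" queue
        List<String> crawled = new ArrayList<>();    // visit order
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty() && crawled.size() < maxNodes) {
            String node = frontier.poll();
            crawled.add(node);
            for (String next : neighborsOf.apply(node)) {
                if (seen.add(next)) {                // false if already known
                    frontier.add(next);
                }
            }
        }
        return crawled;
    }

    public static void main(String[] args) {
        // Toy graph with a cycle: A -> B, C; B -> A, C; C -> A.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("A", Arrays.asList("B", "C"));
        graph.put("B", Arrays.asList("A", "C"));
        graph.put("C", Arrays.asList("A"));
        List<String> order = crawl("A",
                n -> graph.getOrDefault(n, Collections.<String>emptyList()), 100);
        System.out.println(order);
    }
}
```

For the domain-name part of the report, the standard library call InetAddress.getByName(ip).getCanonicalHostName() performs a reverse lookup; it falls back to the textual IP address when no name can be resolved.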


Project Phase III

• Implement a master/worker crawler with Java NIO sockets.

[Figure: a primary master hands out work ("Crawl the following list: …") and receives results (peer IPs, statistics) from crawling the Gnutella network; the master keeps two lists, "Crawled" and "To be Crawled"]

Problems? (Hint: failures)


Project Phase III

• Implement a master/worker crawler with Java NIO sockets.
• Adopt primary/backup replication for the master

[Figure: the Gnutella network, a primary master with its "Crawled" and "To be Crawled" lists, and a backup master; an X marks a failure]


Previous Years' Ideas – Part I

Programming languages / frameworks / protocols

• Java (the vast majority)
• Scala
• Apache MINA framework
• Java RMI
• Jython
• XML-RPC
• SQL
• Python/Perl/Shell/cron jobs

Architecture

• Master/worker (the majority)
• Hierarchical


Previous Years' Ideas – Part II

Design choices

• NIO at both master and workers
• Careful load balancing
• Keep the workers always busy
• Bootstrapping new workers if old workers fail

Additional bells and whistles

• GUI manager
• Statistics in real time through GUI and web page
• Graphviz


References

• Single Gnutella-Node Crawler: http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
• Gnutella crawling protocol: http://www.ece.ubc.ca/~samera/TA/project/Gnuttela-Protocol.html

Other references:

• http://gnutella-specs.rakjar.de/index.php/Main_Page
• www.limewire.com


Thank you

www.ece.ubc.ca/~samera