Crawling Gnutella Network. By: Samer Al-Kiswany.
Date posted: 15-Jan-2016
1
Crawling Gnutella Network
By:
Samer Al-Kiswany
2
Roadmap
EECE 411
• Introduction
• Gnutella network structure
• Gnutella protocol overview
• Gnutella crawling protocol
  • Crawling topology information
  • Crawling node content
3
Introduction
The Gnutella network is a decentralized peer-to-peer system for file sharing.
• Originally created by Justin Frankel of Nullsoft
• Large scale: today up to 4M nodes, 1000 TB of data, 100M files
• Fast growth in its early stages: it grew more than 50 times during the first half of 2001 (and 50 times again from 2001 to 2006)
• Self-organizing network
• Open, simple, and flexible protocol
5
Gnutella Network Structure
Gnutella protocol 0.6 uses a two-tier architecture of ultrapeers and leaves.
[Figure: two-tier network, with nodes labeled "Ultrapeers" and "Leaves"]
7
Basic Primitives for File Sharing
• Join: How do I begin participating?
• Publish: How do I advertise my file(s)?
• Search: How do I find a file?
• Fetch: How do I retrieve a file?
8
Gnutella Protocol Overview
• Join: on startup, the client contacts one or more ultrapeer nodes
• Publish: not needed
• Search: ask the ultrapeer node; the ultrapeer propagates the query to other ultrapeers and returns the answers
• Fetch: get the file directly from the peer (HTTP)
10
Crawling a Gnutella node
When crawling a node, we are interested in two main pieces of information:
• With whom is the node connected? (topology information)
  Gnutella protocol term: "Crawling/Communicating Network Topology Information"
• What files is the node sharing with others?
  Gnutella protocol term: "Browsing Host"
11
Crawling Topology Information
Gnutella protocol 0.6 supports crawling of network topology information.
[Figure: crawler and Gnutella network; request labeled "Topo crawl", response labeled "Topo information". Topology information: ultrapeers and leaves.]
12
Crawling Topology Information

Topo crawl (request sent by the crawler):
    GNUTELLA CONNECT/0.6
    User-Agent: LimeWire (crawl)
    X-Ultrapeer: False
    Query-Routing: 0.1
    Crawler: 0.1

Topo information (response from the node):
    GNUTELLA/0.6 200 OK
    User-Agent: BearShare
    Leaves: 127.0.0.1:6346,127.0.0.2:6346
    Peers: 127.0.0.4:6346,127.0.0.5:6346

Final message of the exchange:
    GNUTELLA/0.6 200 OK
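The handshake on this slide can be exercised with a few lines of Java. This is a minimal sketch, not the course's Crawler class: the name TopoCrawlSketch and its methods are hypothetical, and it assumes that handshake lines end in CRLF and that the header block ends with an empty line. It builds the crawler request shown on the slide and pulls the Leaves and Peers address lists out of a response.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of any Gnutella library.
class TopoCrawlSketch {

    // The crawl request from the slide, assuming CRLF line endings
    // and an empty line that closes the header block.
    static String buildRequest() {
        return "GNUTELLA CONNECT/0.6\r\n"
             + "User-Agent: LimeWire (crawl)\r\n"
             + "X-Ultrapeer: False\r\n"
             + "Query-Routing: 0.1\r\n"
             + "Crawler: 0.1\r\n"
             + "\r\n";
    }

    // Parses "Name: value" lines after the status line into a map
    // keyed by lower-cased header name.
    static Map<String, String> parseHeaders(String response) {
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = response.split("\r?\n");
        for (int i = 1; i < lines.length; i++) { // index 0 is the status line
            String line = lines[i];
            if (line.isEmpty()) {
                break; // end of the header block
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a header line; skip it
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            headers.put(name, line.substring(colon + 1).trim());
        }
        return headers;
    }

    // Splits a comma-separated "ip:port" list such as the Leaves or Peers header.
    static List<String> addresses(Map<String, String> headers, String headerName) {
        List<String> result = new ArrayList<>();
        String value = headers.get(headerName.toLowerCase(Locale.ROOT));
        if (value == null) {
            return result; // header absent: the node reported nothing
        }
        for (String part : value.split(",")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                result.add(address);
            }
        }
        return result;
    }
}
```

A crawler would write buildRequest() to the node's socket, read the response up to the empty line, and feed the "Leaves" and "Peers" lists from addresses(...) into its list of nodes still to crawl.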
13
Browsing Node Content
[Figure: crawler and Gnutella network; request labeled "Browse Host", response labeled "List of files"]
14
Browsing Node Content

Browse Host (request sent by the crawler):
    GET / HTTP/1.1
    Host: Crawler_IP:PORT
    User-Agent: UBCECE
    Accept: application/x-gnutella-packets
    Connection: close

List of files (response from the node, carried as Query Hit messages):
    HTTP/1.1 200 OK
    Server: LimeWire/x.y
    Content-Type: application/x-gnutella-packets
    Connection: close

    <List of files>
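The Browse Host exchange on this slide is plain HTTP, so it can be sketched in Java without any Gnutella-specific library. The class BrowseHostSketch and its method names are hypothetical, and the value placed in the Host header is left to the caller because the slide only shows the placeholder Crawler_IP:PORT. The sketch builds the request, finds where the HTTP headers end so the body can be handed to a Query Hit parser, and checks the Content-Type, since a node may answer with something other than Gnutella packets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not the course's Crawler class.
class BrowseHostSketch {

    // Builds the Browse Host request from the slide.
    // hostField is the text to place in the Host header (an assumption of this sketch).
    static byte[] buildRequest(String hostField) {
        String request = "GET / HTTP/1.1\r\n"
                + "Host: " + hostField + "\r\n"
                + "User-Agent: UBCECE\r\n"
                + "Accept: application/x-gnutella-packets\r\n"
                + "Connection: close\r\n"
                + "\r\n";
        return request.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns the index of the first body byte (just past CRLF CRLF),
    // or -1 if the header block is incomplete.
    static int bodyOffset(byte[] response) {
        for (int i = 0; i + 3 < response.length; i++) {
            if (response[i] == '\r' && response[i + 1] == '\n'
                    && response[i + 2] == '\r' && response[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    // True when the response declares application/x-gnutella-packets,
    // i.e. the body is expected to hold Query Hit messages.
    static boolean isGnutellaPackets(byte[] response) {
        int end = bodyOffset(response);
        if (end < 0) {
            return false;
        }
        String head = new String(response, 0, end, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        for (String line : head.split("\r\n")) {
            if (line.startsWith("content-type:")) {
                return line.contains("application/x-gnutella-packets");
            }
        }
        return false;
    }
}
```

Because the request asks for "Connection: close", a crawler can simply read the socket until end of stream and then split headers from body with bodyOffset(...).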
15
Query Hit Parsing
Query Hit message layout: | 1 | 2 | A | B | C | D | E | F | 3 |
• 1: Gnutella message header. Important field: message length.
• 2: Query Hit header. Important field: number of files.
• A-F: list of shared files. Each entry includes the file name and size.
• 3: Other Gnutella protocol fields.
The HTTP response may contain more than one Query Hit message.
[Figure: an HTTP response carrying several consecutive Query Hit messages, each with the layout 1 2 A B C D E F 3]
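The parsing steps on this slide can be sketched as follows. The byte offsets are assumptions taken from the commonly documented Gnutella message layout and should be checked against the specification: a 23-byte header (16-byte message ID, 1-byte type with 0x81 for Query Hit, 1-byte TTL, 1-byte hops, 4-byte little-endian payload length), then a Query Hit payload that starts with a 1-byte file count, 2-byte port, 4-byte IP address, and 4-byte speed, followed by one entry per file (4-byte index, 4-byte size, NUL-terminated name, NUL-terminated extension block). The class and field names are hypothetical. The loop also handles a body that holds several Query Hit messages back to back.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for illustration; offsets are assumptions (see lead-in).
class QueryHitSketch {
    static final int HEADER_LEN = 23;   // part "1" on the slide
    static final int QUERY_HIT = 0x81;  // assumed type code for Query Hit
    static final int HIT_HEADER_LEN = 11; // part "2": count(1) port(2) ip(4) speed(4)

    static final class SharedFile {
        final String name;
        final long size;
        SharedFile(String name, long size) { this.name = name; this.size = size; }
    }

    // Walks every Gnutella message in the body and collects the files
    // listed in each Query Hit.
    static List<SharedFile> parse(byte[] body) {
        List<SharedFile> files = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        int start = 0;
        while (body.length - start >= HEADER_LEN) {
            int type = body[start + 16] & 0xFF;
            int payloadLen = buf.getInt(start + 19); // the "message length" field
            if (payloadLen < 0 || payloadLen > body.length - start - HEADER_LEN) {
                break; // truncated or corrupt message: stop rather than guess
            }
            int end = start + HEADER_LEN + payloadLen;
            if (type == QUERY_HIT && payloadLen >= HIT_HEADER_LEN) {
                int count = body[start + HEADER_LEN] & 0xFF; // "number of files"
                int pos = start + HEADER_LEN + HIT_HEADER_LEN;
                for (int i = 0; i < count && pos + 8 < end; i++) {
                    long size = buf.getInt(pos + 4) & 0xFFFFFFFFL; // unsigned 32-bit
                    pos += 8; // skip file index and file size
                    int nameEnd = indexOfNul(body, pos, end);
                    if (nameEnd < 0) break;
                    String name = new String(body, pos, nameEnd - pos, StandardCharsets.UTF_8);
                    int extEnd = indexOfNul(body, nameEnd + 1, end);
                    if (extEnd < 0) break;
                    files.add(new SharedFile(name, size));
                    pos = extEnd + 1; // next file entry
                }
                // Anything left before "end" is part "3" (other protocol fields); ignored here.
            }
            start = end; // move on to the next message in the body
        }
        return files;
    }

    // Index of the first zero byte in [from, to), or -1 if none.
    private static int indexOfNul(byte[] data, int from, int to) {
        for (int i = from; i < to; i++) {
            if (data[i] == 0) return i;
        }
        return -1;
    }
}
```

Relying on the payload length from the header, rather than on the file entries, is what lets the parser skip cleanly to the next Query Hit message even when it does not understand the trailing fields.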
16
Limitations - Does this always work ?
Topology crawling:
• Topology information crawling is not supported by some Gnutella protocol v0.4 implementations.
Host browsing:
• Some Gnutella node implementations (BearShare, for instance) return the list of files as HTML instead of responding with Query Hit messages.
18
Single Gnutella-Node Crawler
A proof-of-concept implementation of a single Gnutella-node crawler is available at:
http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
The main class that implements the crawling protocol is the Crawler class:
• crawlpeers(ip_address, port)
• parsePeers(byte[])
• listFiles(ip_address, port)
• processQueryHit(byte[])
19
Project Phase II
• Implement a single-node Gnutella network crawler
• Report:
  • The active leaf nodes
  • Information regarding the "agent" (i.e., the implementation: LimeWire, BearShare, etc.)
  • The domain name corresponding to the node IP address
• Avoid cycles!
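One common way to avoid cycles is to remember every address already seen and enqueue a neighbor only the first time it appears. The sketch below assumes nothing about the course's crawler: the class name is hypothetical, and the neighborsOf function stands in for whatever code actually contacts a node and returns its peers and leaves.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: breadth-first crawl that visits each address once.
class CycleFreeCrawlSketch {

    // seed: the first "ip:port" to crawl.
    // neighborsOf: stand-in for the real topology crawl of one node.
    // Returns the addresses in the order they were crawled.
    static List<String> crawl(String seed, Function<String, List<String>> neighborsOf) {
        Set<String> seen = new HashSet<>();        // every address ever enqueued
        Deque<String> toCrawl = new ArrayDeque<>(); // addresses waiting to be crawled
        List<String> crawled = new ArrayList<>();

        seen.add(seed);
        toCrawl.add(seed);
        while (!toCrawl.isEmpty()) {
            String node = toCrawl.poll();
            crawled.add(node);
            for (String neighbor : neighborsOf.apply(node)) {
                // Set.add returns false for an address already seen,
                // which is what breaks cycles such as A -> B -> A.
                if (seen.add(neighbor)) {
                    toCrawl.add(neighbor);
                }
            }
        }
        return crawled;
    }
}
```

Marking an address as seen when it is enqueued, rather than when it is crawled, also keeps the same address from sitting in the queue more than once.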
20
Project Phase III
• Implement a master/worker crawler with Java NIO sockets.
[Figure: master (primary) holding "Crawled" and "To be Crawled" lists; request labeled "Crawl the following list: …"; reply labeled "Results: peers IPs, statistics"; Gnutella network]
Problems? (Hint: failures)
21
Project Phase III
• Implement a master/worker crawler with Java NIO sockets.
• Adopt primary/backup replication for the manager.
[Figure: primary master holding "Crawled" and "To be Crawled" lists, a backup master, and the Gnutella network; an X marks a failure]
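The "Crawled" and "To be Crawled" lists, together with the hint about failures, suggest the kind of bookkeeping a master needs. The following is a sketch under stated assumptions, not a prescribed design: the class MasterStateSketch, its methods, and the idea of tracking which worker holds which address are all hypothetical. It shows one way to hand out work and to put a failed worker's addresses back on the queue. It leaves out networking, NIO, and the replication of this state to a backup.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical master-side bookkeeping for a master/worker crawler.
class MasterStateSketch {
    private final Set<String> known = new HashSet<>();            // every address seen so far
    private final Deque<String> toCrawl = new ArrayDeque<>();     // "To be Crawled"
    private final Map<String, String> inProgress = new HashMap<>(); // address -> worker id
    private final Set<String> crawled = new HashSet<>();          // "Crawled"

    // Queues addresses that have never been seen before.
    void addDiscovered(Collection<String> addresses) {
        for (String address : addresses) {
            if (known.add(address)) {
                toCrawl.add(address);
            }
        }
    }

    // Hands the next address to a worker, or returns null if none is waiting.
    String assign(String workerId) {
        String address = toCrawl.poll();
        if (address != null) {
            inProgress.put(address, workerId);
        }
        return address;
    }

    // Records a worker's result and queues the nodes it discovered.
    void complete(String address, Collection<String> discovered) {
        if (inProgress.remove(address) != null) {
            crawled.add(address);
        }
        addDiscovered(discovered);
    }

    // Puts every address held by a failed worker back at the front of the queue.
    void workerFailed(String workerId) {
        Iterator<Map.Entry<String, String>> it = inProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (entry.getValue().equals(workerId)) {
                toCrawl.addFirst(entry.getKey());
                it.remove();
            }
        }
    }

    int pendingCount() { return toCrawl.size(); }
    int crawledCount() { return crawled.size(); }
}
```

With primary/backup replication, this is the state the backup would need in order to take over; how and when it is copied is a design decision the sketch does not address.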
22
Previous Years' Ideas – Part I
Programming languages / frameworks / protocols
• Java (the vast majority)
• Scala
• Apache MINA framework
• Java RMI
• Jython
• XML-RPC
• SQL
• Python/Perl/Shell/cron jobs
Architecture
• Master/worker (the majority)
• Hierarchical
23
Previous Years' Ideas – Part II
Design choices
• NIO at both master and workers
• Careful load balancing
• Keep the workers always busy
• Bootstrapping new workers if old workers fail
Additional bells and whistles
• GUI manager
• Statistics in real time through GUI and web page
• Graphviz
24
References
• Single Gnutella-Node Crawler: http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
• Gnutella Crawling protocol: http://www.ece.ubc.ca/~samera/TA/project/Gnuttela-Protocol.html
Other references:
• http://gnutella-specs.rakjar.de/index.php/Main_Page
• www.limewire.com
25
Thank you
www.ece.ubc.ca/~samera