Middleware and Distributed Systems

Peer-to-Peer Systems

Martin v. Löwis

Freitag, 12. Februar 2010

Peer-to-Peer Systems (P2P)

• Concept of a decentralized large-scale distributed system

• Large number of networked computers (peers)

• Each peer has equivalent capabilities and responsibilities, merging the roles of client and server

• Data distribution over participants, no central authority

• Avoids limitations of pure client/server in terms of scalability

• Increased interest with file-sharing applications (1999)

• First peer-based systems long before the Internet

• USENET (1979), FidoNet BBS message exchange system (1984)


Usenet

• Started 1979, after introduction of UUCP in UNIX v7

• Tom Truscott and Jim Ellis, grad students at Duke University

• Reading and posting of articles in distributed newsgroups

• User subscription to hierarchically ordered newsgroups

• Synchronization of client application with local news server, local server synchronizes with other news-feeds

• Era of slow networks: batch transfer of messages, typically twice a day

• Flooding to all servers that have not seen the new message yet

• Messages expire from a group after some time (by now, web archives keep them)

• Also binary transport (uuencode, Base64 / MIME encoding) - alt.binaries

• NNTP for TCP transport (1985), message format similar to eMail (RFC 850, 1983)


Characteristics Of P2P

• Placement of data objects across many hosts

• Balancing of access load, techniques for search and retrieval of data

• Each participating machine contributes resources

• Volatile and non-exclusive availability of nodes

• Nodes routinely disappear, cheat, or fail

• Better scalability for large numbers of objects, due to distributed storage

• Routes and object references can be replicated, tolerating failures of nodes

• Complexity and runtime behavior of modern large-scale P2P systems still under research (P2P crawlers)


Routing Overlays

• Routing overlay: Own network over another set of networks

• Addresses its own nodes on top of existing network nodes

• Overlay network provides full-meshed connectivity graph to application

• Unstructured P2P Overlay

• Peers build random graph starting from boot peer

• Flooding or random graph walk, supports content-based lookup

• Two-tier approach: Unstructured super-peers, with connected leaf peers

• Examples: Gnutella, eDonkey, FastTrack, Kazaa, Skype(?)

• Structured P2P Overlay: Assign keys to data items and build graph that maps each key to a particular node


File Sharing With Unstructured P2P Overlays

• First Generation File Sharing

• Napster

• Central index, distributed data

• Consideration of hops between peers

• Second Generation File Sharing

• Freenet, Gnutella, Kazaa, BitTorrent

• No central entity

• Improved anonymity

• Super-peer concepts

[Figure: Napster architecture - peers connected to a central Napster index server; 1. Location request from a peer to the index server, 2. List of peers returned, 3. File transfer directly between peers, 4. Index update sent to the server.]


Gnutella

• Justin Frankel and Tom Pepper, 2000

• Simple spreading of search queries over all peers

• Initial neighbor from external source (built-in, IRC, gWebCache, ...)

• First request for working addresses from other peers

• Discovery of new peers by TTL-restricted multi-hop ping messages

• Pong message contains IP and port number for further connections

• Pong travels the original overlay path back (by cached message ID on intermediaries)

• Each message typically sent to all known neighbor peers

• Descriptor ID (to avoid cycles), TTL, Hops field (TTL_i + Hops_i = TTL_0), payload; see the sketch after this list

• Periodic one-hop ping messages to all connected peers, support for “bye” message
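The descriptor header and the TTL/Hops bookkeeping can be illustrated with a minimal sketch (Python; the field layout follows the Gnutella v0.4 descriptor header, but the payload type codes and the little-endian length field are stated from memory and should be treated as assumptions):

```python
# Sketch of a Gnutella v0.4 descriptor header and hop-by-hop forwarding.
# Layout: 16-byte descriptor ID, 1-byte payload type, 1-byte TTL,
# 1-byte hops, 4-byte payload length (byte order assumed little-endian).
import struct
import uuid

PING, PONG, QUERY, QUERY_HIT, PUSH = 0x00, 0x01, 0x80, 0x81, 0x40  # assumed codes

def make_descriptor(payload_type, ttl, payload=b""):
    descriptor_id = uuid.uuid4().bytes  # unique ID, cached by peers to avoid cycles
    return struct.pack("<16sBBBI", descriptor_id, payload_type, ttl, 0, len(payload)) + payload

def forward(descriptor):
    """Decrement TTL, increment hops; returns None when the TTL is used up."""
    did, ptype, ttl, hops, length = struct.unpack("<16sBBBI", descriptor[:23])
    if ttl <= 1:
        return None
    # invariant from the slide: TTL_i + Hops_i stays equal to the initial TTL_0
    return struct.pack("<16sBBBI", did, ptype, ttl - 1, hops + 1, length) + descriptor[23:]
```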


Gnutella Network

• Based on TCP connections between peers; typically long search duration and high network load

• Remote peers might only have open Gnutella port -> push request message

• Super peers make up the overlay, usually have permanent internet connection

• Leaf peers have intermittent connectivity, using super peers as proxies


Gnutella

• Discovery can be enhanced by Pong caching

• Queries work similarly to discovery pings; newer versions send responses directly

• Upon receiving a query, the peer checks whether local content matches it

• Data transfer outside of overlay protocol

• Lower and upper limit on the number of peer connections

• Peer is in connecting state, connected state or full state

• Dynamic querying

• Only gather enough results to satisfy the user (50-200), by starting with low TTL queries

• Rare matches: many peers visited, but only a low result count


BitTorrent Protocol

• Bram Cohen, 2001

• Protocol for distributing files

• Content identified by announce URL, defined in metadata (.torrent) file

• Torrent files available from Indexer web sites

• Downloaders (peers) upload to each other, distribution starts with first downloader that has the complete file

• Tracker: HTTP/HTTPS server providing list of peers for announce URL

• Subject to shutdown in recent copyright lawsuits

• Metainfo files (torrents)

• No focus on content localization, but on efficient content delivery instead


BitTorrent Tracker Protocol

• Torrent file

• Announce URL(s) for tracker(s)

• Suggested file name, file length; piece size (typically 512kB) and piece count

• SHA1 hash values of all pieces

• Tracker HTTP GET request parameters

• Hash of the torrent file's info section (info_hash)

• Own (randomly chosen) peer ID, including a tag for the type of client software; IP and port (6881-6889) the downloader listens on; optional client key

• Uploaded / downloaded / left bytes for the file(s)

• Number of peers requested from the tracker (default 50)

• Event: Started, completed, stopped
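A hedged sketch of building such an announce request (Python; the parameter names follow the common tracker HTTP convention, while the announce URL, info data and peer ID are placeholders supplied by the caller):

```python
# Sketch of a BitTorrent tracker announce over HTTP GET.
import hashlib
import urllib.parse
import urllib.request

def announce(announce_url, bencoded_info, peer_id, port,
             uploaded, downloaded, left, event="started"):
    params = {
        "info_hash": hashlib.sha1(bencoded_info).digest(),  # SHA1 of the torrent's info data
        "peer_id": peer_id,                                  # 20 bytes, includes client tag
        "port": port,                                        # listening port, e.g. 6881-6889
        "uploaded": uploaded,
        "downloaded": downloaded,
        "left": left,
        "numwant": 50,                                       # number of peers requested
        "event": event,                                      # started / completed / stopped
    }
    url = announce_url + "?" + urllib.parse.urlencode(params)
    return urllib.request.urlopen(url).read()                # bencoded tracker response
```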


BitTorrent Tracker Protocol

• Tracker response:

• Human-readable error, or list of peer IDs and IP addresses

• Interval telling the client how long to wait between subsequent requests

• Number of peers with completed file (seeders)

• Number of peers with incomplete file (leechers)

• Number of peers is relevant to protocol overhead, since notification of downloaded pieces is sent to all peers (-> typically not more than 25 peers)

• Peers report status to tracker every 30 minutes, or on status change

• If peer set size falls below limit (~20), tracker is contacted again

• DHT extension - peer acts as tracker, based on Kademlia DHT (UDP)


BitTorrent Peer Protocol

• Clients maintain state information for each peer

• choked - client requests will not be answered until unchoke notification

• interested - remote peer has signaled interest in blocks, and will start requesting after an unchoke

• The client also needs to track its own interest in the peer's pieces, and whether it has choked the remote peer

• Each peer connection starts as „choked“ and „not interested“

• Download of piece from peer: client claims interest and is „not choked“

• Upload of piece: peer is „interested“, and client is not choking him

• Client should always notify peers about interest, even in choked state
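A minimal sketch of this per-peer state and the two rules derived from it (Python; the flag names am_choking / am_interested / peer_choking / peer_interested are conventional shorthand, not taken from the slides):

```python
# Per-peer connection state and the resulting transfer rules.
from dataclasses import dataclass

@dataclass
class PeerState:
    am_choking: bool = True        # we choke the remote peer (initial state)
    am_interested: bool = False    # we want blocks from the remote peer
    peer_choking: bool = True      # the remote peer chokes us (initial state)
    peer_interested: bool = False  # the remote peer wants blocks from us

    def can_download(self) -> bool:
        # we may request blocks once we are interested and no longer choked
        return self.am_interested and not self.peer_choking

    def can_upload(self) -> bool:
        # we answer requests once the peer is interested and we have unchoked it
        return self.peer_interested and not self.am_choking
```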


Peer Wire Protocol

• TCP connection, starts with handshake message from both sides

• Human-readable protocol header, hash of torrent file, peer ID

• Handshake for non-served torrent results in connection dropping

• Trackers send out handshake messages without peerID for NAT-checking

• Protocol messages

• <length prefix><message id><payload>

• keep-alive message: typically connection drop after 2 minutes

• choke, unchoke, interested, not interested messages

• have message: 4-byte index for downloaded and verified piece

• Suppression of HAVE messages for pieces the peer already has
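The <length prefix><message id><payload> framing can be sketched as follows (Python; the numeric message IDs are the commonly documented ones and should be checked against the protocol specification):

```python
# Peer wire protocol framing: 4-byte big-endian length prefix, 1-byte message ID, payload.
import struct

CHOKE, UNCHOKE, INTERESTED, NOT_INTERESTED, HAVE, BITFIELD, REQUEST, PIECE, CANCEL, PORT = range(10)

def encode(message_id=None, payload=b""):
    if message_id is None:                      # keep-alive: length prefix 0, no ID, no payload
        return struct.pack(">I", 0)
    return struct.pack(">IB", 1 + len(payload), message_id) + payload

keep_alive = encode()
have = encode(HAVE, struct.pack(">I", 3))                        # 4-byte piece index
request = encode(REQUEST, struct.pack(">III", 3, 0, 16 * 1024))  # piece index, begin, length
```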


Peer Wire Protocol

• bitfield message: Optional after handshake, bitmask for available pieces

• request (piece index, begin, length) message: Request block of data from specified piece

• Connections are typically closed on overly large requests (exact limit under discussion)

• Typically 16kB - 32 kB requests, latency vs. slow lines

• piece message: Requested payload, with index, begin, and length

• cancel message: cancels request for data block

• port message: Port number of the peer's DHT tracker, to include in own routing table


Choking Algorithm

• Avoid problems with TCP congestion control in case of many connections

• Cap number of simultaneous transfers, while reciprocating peers that allow downloading

• Un-choke three of the interested peers by best download rate

• Non-interested peers with better rates are also un-choked, so upload can start immediately once they become interested

• If client has complete file, use upload rate instead to decide

• Find out if unused peers might behave better

• Optimistic un-choking: Pick one peer regardless of download rate

• Avoid fibrillation with minimum delay between choke and un-choke (10s)

• Free riders are penalized
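A simplified sketch of the unchoke decision under these rules (Python; the Peer record is hypothetical, and real clients add the 10 s anti-fibrillation delay, snubbing checks and a periodic rotation of the optimistic unchoke):

```python
# Choking sketch: unchoke the three best interested peers, plus one optimistic unchoke.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Peer:
    name: str
    interested: bool
    download_rate: float  # rate at which we download from this peer
    upload_rate: float    # rate at which we upload to this peer (used when seeding)

def select_unchoked(peers, have_complete_file):
    # rank by download rate; a seed has nothing to download, so it ranks by upload rate
    rate = (lambda p: p.upload_rate) if have_complete_file else (lambda p: p.download_rate)
    regular = sorted([p for p in peers if p.interested], key=rate, reverse=True)[:3]
    rest = [p for p in peers if p not in regular]
    optimistic = {random.choice(rest)} if rest else set()  # picked regardless of rate
    return set(regular) | optimistic
```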


Rarest First Algorithm

• Downloaders should receive pieces from peers in random order, to avoid partitioning of file content (random first algorithm)

• Might lead to unbalanced distribution of pieces

• Rarest first algorithm: Each peer maintains list of number of copies for each piece in available peer set

• Peer selects next piece to download from rarest pieces

• Not used at the very beginning: the first piece is chosen at random, so the peer quickly completes a piece it can offer to others

• Always prioritize requests for blocks of the same piece

• End Game Mode: Last blocks usually come in very slowly

• Last requests are sent to all peers in the set
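A minimal sketch of rarest-first selection (Python; peer bitfields are modeled here as sets of piece indices, which is an assumption of this sketch):

```python
# Rarest-first: count piece availability in the peer set, request a missing piece
# with the lowest count (ties broken randomly).
import random
from collections import Counter

def pick_piece(own_pieces, peer_bitfields):
    availability = Counter()
    for bitfield in peer_bitfields:
        availability.update(bitfield)
    candidates = [p for p in availability if p not in own_pieces]
    if not candidates:
        return None
    rarest = min(availability[p] for p in candidates)
    return random.choice([p for p in candidates if availability[p] == rarest])

# Example: we own piece 0; piece 2 is offered by only one peer, so it is picked.
print(pick_piece({0}, [{0, 1}, {1, 2}, {0, 1}]))
```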


Other P2P File Sharing Issues

• Anti-Snubbing - avoid being choked by nearly all peers

• If a peer sends no data for 1 minute, uploads to that peer are stopped (except via optimistic unchoke)

• Results in more than one optimistic unchoke with limited peer list

• Encryption features in client applications

• Avoid traffic shaping by ISPs for P2P traffic

• By now, 10%-80% of Internet traffic is attributed to P2P file sharing (depending on the information source)

• Anti-leech strategies

• Credit point system in eDonkey

• Special BitTorrent trackers that enforce a minimum upload rate


Structured P2P Overlay

• Provides subject-based lookup, instead of content-based lookup

• Map peer and data identifiers to the same logical ID space -> peers get responsibility for their related data

• Key-based routing of client requests to an object through a sequence of nodes

• Knowledge about replica location and ‘nearest’ valid object [Plaxton97]

• Hash value as typical opaque object identifier

• High-level APIs: Distributed Hash Table (DHT) and Distributed Object Location and Routing (DOLR)

• Examples: Pastry, Chord, CAN

• Applications: Digital library, object location in MMOG, spam filtering


Distributed Hash Table (DHT)

• Node makes new data (object) available, together with objectID

• Overlay must replicate and store data, to be reachable by all clients

• Replicas stored at all nodes responsible for this objectID

• Client submits request for particular objectID

• Overlay routes the request to the nearest replica

• Client requests removal of data identified by objectID

• Overlay must remove associated data from responsible nodes

• Nodes may join or leave

• Overlay must re-arrange responsibilities for data replicas

• Example: Pastry communication library


Distributed Object Location and Routing (DOLR)

• Objects can be stored anywhere, DOLR layer must maintain mapping between objectID and replica node addresses

• Replication location decision outside of the routing protocol

• Node makes new objectID available

• Overlay must recognize this node as responsible for data-derived objectID

• Node wants to send a request to n replicas of the object identified by objectID

• Overlay forwards request to responsible node(s)

• Example: Tapestry communication framework

• Overlay behavior can be implemented with DHT approach


Programming Interfaces

• Distributed Hash Table Overlay

• put(objectID, data)

• remove(objectID)

• value=get(objectID)

• Distributed Object Location And Routing Overlay

• publish(objectID)

• unpublish(objectID)

• sendToObject(msg, objectID, n)
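The two interfaces written out as abstract classes (a sketch; the method names follow the slide, the parameter types are assumptions):

```python
# DHT vs. DOLR programming interfaces as abstract base classes.
from abc import ABC, abstractmethod

class DHT(ABC):
    @abstractmethod
    def put(self, object_id: bytes, data: bytes) -> None: ...   # store replicas of the data
    @abstractmethod
    def remove(self, object_id: bytes) -> None: ...             # remove data from responsible nodes
    @abstractmethod
    def get(self, object_id: bytes) -> bytes: ...               # fetch from the nearest replica

class DOLR(ABC):
    @abstractmethod
    def publish(self, object_id: bytes) -> None: ...            # announce a locally stored object
    @abstractmethod
    def unpublish(self, object_id: bytes) -> None: ...          # withdraw the announcement
    @abstractmethod
    def send_to_object(self, msg: bytes, object_id: bytes, n: int) -> None: ...  # reach n replicas
```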


Pastry

• Since 2001, base framework for several P2P applications (Antony Rowstron - Microsoft Research, Peter Druschel - Rice University)

• Each node gets nodeID from strong hash function, based on join time and physical identifier (e.g. IP address or public key)

• Hashing spreads numerically adjacent nodeIds across unrelated physical nodes (avalanche effect), which helps fault tolerance

• Subject-based routing: Route message to peer with nodeId that is numerically closest to the given subject (==destination id) of the message

• Final peer is responsible to handle the message content

• Frameworks differ in proximity metric for message subject and nodeId

• Prefix routing with 128bit IDs in ring overlay

• Routing of message in O(log N) steps, routing table creation in O(log N)

• Routing scheme typically implemented on UDP without acknowledgements


Pastry Application Interface

• Pastry exports:

• nodeId = pastryInit(Credentials) : Local node joins Pastry network

• route(msg, key) : Route given message to nodeId which is numerically closest to key

• send(msg, IP address) : Send message to specified node through Pastry

• Application exports:

• deliver(msg, key) : Message received for local node (by route or send)

• forward(msg, key, nextId) : Called before forwarding to next node, application can terminate message or change next node

• newLeafs(leafSet) : Called whenever leaf set changes
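The application-side callbacks as a small sketch (Python; method and parameter names follow the slide, the types are assumptions):

```python
# Skeleton of a Pastry application: the overlay calls these methods back.
class PastryApplication:
    def deliver(self, msg, key):
        """A message for which the local node is numerically closest arrived
        (triggered by route or send)."""

    def forward(self, msg, key, next_id):
        """Called before forwarding; the application may modify the message,
        change next_id, or return None to terminate the message."""
        return msg, key, next_id

    def new_leafs(self, leaf_set):
        """The local node's leaf set has changed."""
```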


[Figure: State of a hypothetical Pastry node with nodeId 10233102 (b = 2, L = 8; all numbers in base 4), showing its routing table, neighborhood set, and leaf set. The top row of the routing table is row zero; the shaded cell in each row marks the corresponding digit of the present node's nodeId, and each entry is shown as common prefix with 10233102 - next digit - rest of nodeId. Associated IP addresses are not shown. From Rowstron & Druschel.]

Pastry Routing Information Example

• 16bit nodeIds, b=2, L=8

• Entry syntax: common prefix with 10233102 - next digit - rest of nodeId

• Shaded cell shows the corresponding digit of the present node's nodeId

• Rows are managed when nodes join or leave

• Circular ID space: lower neighbor of ID 0 is ID 2^16 - 1

(C) Rowstron & Druschel


Pastry Routing Information

• Each node maintains routing table, neighborhood set and leaf set

• IDs as hexadecimal values, one row per prefix length

• Entries in row n share the first n digits with the node's ID, but differ in the next digit

• Entry contains one of the possible IP addresses matching the according prefix length, under consideration of network proximity (might be empty)

• Length of row (2^b - 1) depends on configuration parameter b, trade-off between routing table size and maximum number of hops

• Neighborhood set contains nodeIds and IP addresses of closest nodes

• Normally not used, good for locality properties

• Leaf node set contains L/2 numerically closest smaller and L/2 larger nodeIDs
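A back-of-the-envelope calculation of these sizes (Python; the formulas, roughly log_{2^b} N populated rows with 2^b - 1 entries each and about as many routing hops as rows, follow the Pastry paper):

```python
# Approximate Pastry routing table size and hop count for parameter b and N nodes.
import math

def routing_table_stats(b, n_nodes):
    rows = round(math.log(n_nodes, 2 ** b))  # populated rows ~ expected routing hops
    entries = rows * (2 ** b - 1)            # entries per row times populated rows
    return entries, rows

print(routing_table_stats(4, 10 ** 6))  # (75, 5):  ~75 entries, ~5 hops
print(routing_table_stats(4, 10 ** 9))  # (105, 7): ~105 entries, ~7 hops
```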


Routing Algorithm in Pastry

• Incoming message for node

• Check if destination key falls in the range of the leaf set, then forward directly to destination node

• Forward message to a node that shares a common prefix with the key by at least one more digit

• If entry is empty or node not reachable, forward to node which shares same prefix length as current node, and is numerically closer to destination key

• Best-possible destination is reached if leaf set has no better candidate

• Routing always converges, since each step takes message to a node with longer prefix share, or smaller numerical distance
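A self-contained toy version of this forwarding decision (Python; nodeIds are hexadecimal strings of equal length, the routing table is keyed by (prefix length, next digit), and the leaf-set range check ignores the circular ID space for brevity):

```python
# Toy Pastry routing step: leaf set first, then routing table, then numerically
# closer fallback with at least the same shared prefix length.
def shared_prefix(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(node_id, key, leaf_set, routing_table):
    num = lambda x: int(x, 16)
    nodes = set(leaf_set) | {node_id}
    # 1. key within the leaf set range: deliver to the numerically closest node
    if min(map(num, nodes)) <= num(key) <= max(map(num, nodes)):
        return min(nodes, key=lambda n: abs(num(n) - num(key)))
    # 2. routing table entry that shares one more digit with the key
    l = shared_prefix(node_id, key)
    entry = routing_table.get((l, key[l]))
    if entry is not None:
        return entry
    # 3. rare case: any known node with at least the same shared prefix length
    #    that is numerically closer to the key than the present node
    known = nodes | set(routing_table.values())
    candidates = [n for n in known if shared_prefix(n, key) >= l
                  and abs(num(n) - num(key)) < abs(num(node_id) - num(key))]
    return min(candidates, key=lambda n: abs(num(n) - num(key))) if candidates else node_id
```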


Pastry Node Arrival

• New node X knows nearby Pastry node A by some mechanism (e.g. multicast)

• Node asks A to route special join message with ID of X as destination

• Routed to node Z, which is numerically closest to X

• All nodes on the path send their state tables back to X

• Neighborhood of A is initial neighborhood of X, due to proximity promise

• Leaf set of Z is initial leaf set of X

• Row zero in routing table is independent of own ID -> take from A

• B has valuable row for prefix length 1, C for length 2, ...

• Resulting information forwarded to leaf set, routing entries and neighborhood

• Data exchange with timestamps, to detect in-between changes


Pastry Node Departure

• Neighbor detects failed node in the leaf set

• Asks live node with largest index on side of the failed node for its leaf set, which partially overlaps with present node‘s leaf set

• From the new entries, a live node is added to the present node's leaf set

• Each node repairs its leaf set lazily; this works unless L/2 nodes fail simultaneously

• Unlikely event due to demanded diversity of nodes with adjacent numbers

• Failed node in the routing table does not stop routing, but entry must be replaced

• Ask other nodes in the same row (or in other rows) for an entry with the corresponding prefix

• Periodic check of neighborhood, in case ask other neighbors for their values and add the one with the shortest distance


PAST

• PAST: Distributed replicating file system, based on Pastry

• fileId as hash of file name, client certificate and random salt

• File certificate: fileId, file content hash, creation date

• File and certificate routed via Pastry, with fileId as destination

• Closest node accepts responsibility after certificate checking

• Forwards insert request to other closest nodes

• Lookup finds nearest replica due to proximity consideration of Pastry

• Replica diversion: Balance remaining free space in the leaf set - allows choosing members other than the nearest ones in the leaf set

• File diversion: Balance storage space across the nodeId space - vary the salt if insertion fails


Tapestry Overlay Network

• 2001, Zhao et al.

• Nodes and application endpoints have IDs

• 160bit values, evenly distributed, e.g. by using same hash algorithm

• Every message contains application-specific identifier (similar to port number)

• One large Tapestry network is encouraged, since efficiency increases with size

• DOLR approach, routing of messages to endpoints by opaque identifiers

• PublishObject (objectID, application ID) - best effort, no confirmation

• UnpublishObject (objectID, application ID) - best effort

• RouteToObject(objectID, application ID) - route message to object

• RouteToNode(Node, application ID, exact destination match)


Routing and Object Location

• Each identifier is mapped to a live node (the identifier's root)

• If node ID is the same as the identifier, this one becomes the root

• Each node maintains a table of outgoing neighbor links

• Common matching prefix, higher levels match more digits, increasing prefix size from hop to hop

• Again similar to classless inter-domain routing (CIDR) for IP addresses

• Non-existent IDs are mapped to some live node (‚close‘ digit, surrogate routing)

• Backup links with same prefix as neighbor link
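Surrogate routing for missing entries can be sketched as a simplified NEXTHOP (Python; the routing-table layout and the "try the wanted digit, then the next ones, wrapping around" tie-break are assumptions of this sketch):

```python
# Simplified surrogate routing: at routing level `level`, prefer the neighbor
# whose next digit matches the destination; if that entry is empty, fall back
# to a "close" digit that does have an entry.
HEX_DIGITS = "0123456789abcdef"

def next_hop(level, dest_id, routing_table, own_id):
    wanted = HEX_DIGITS.index(dest_id[level])
    for offset in range(len(HEX_DIGITS)):
        digit = HEX_DIGITS[(wanted + offset) % len(HEX_DIGITS)]
        entry = routing_table.get((level, digit))
        if entry is not None:
            return entry
    return own_id  # no usable entry left: the local node acts as the ID's root
```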

[Figures from Zhao et al., "Tapestry: A Resilient Global-Scale Overlay for Service Deployment": Fig. 1 - Tapestry routing mesh from the perspective of a single node; outgoing neighbor links point to nodes with a common matching prefix, and higher-level entries match more digits. Fig. 2 - path of a message through a Tapestry mesh. Fig. 3 - pseudocode for NEXTHOP, which locates the next hop towards an identifier's root (surrogate routing maps non-existent IDs to a live node). Fig. 4 - object publish example: publish messages route towards the root, depositing a location pointer for the object at each hop.]


Tapestry Object Publication

• Each identifier has a root node

• Participants publish objects by periodically routing ‚publish‘ messages towards the root node

• Each node along the path stores object key and publisher IP

• Each replica is announced in the same way - nodes store ordered replica list based on own network latency to publisher

• Objects are located by routing a message towards the root node

• Each node along the path checks mapping and redirects accordingly

• Paths from nearby nodes heading towards the same root converge

• Client locates object by routing a message towards the root node

• Each node on path checks for cached pointer and re-directs to server
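A toy sketch of publish and locate along one routing path towards the object's root (Python; soft-state refresh and the latency-ordered replica lists are omitted, and the per-hop pointer tables are modeled as plain dicts):

```python
# Publish drops an (objectID -> server) pointer at every hop towards the root;
# locate follows a path towards the same root and stops at the first pointer.
def publish(path_to_root, object_id, server_address):
    for pointer_table in path_to_root:             # one pointer table per node on the path
        pointer_table[object_id] = server_address  # store a location pointer, not the object

def locate(path_to_root, object_id):
    for pointer_table in path_to_root:
        if object_id in pointer_table:             # converging paths make cached pointers
            return pointer_table[object_id]        # cut the lookup short before the root
    return None
```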



Object Announcement


Overlay Management in Tapestry

• Node insertion

• Multicast message to all nodes sharing the same prefix

• May take over to be root node for some objects

• Node departure

• Transmits replacement node for each level

• Node failure is handled by backup links on other nodes
