Distributed Hash Tables

Ken Birman, Cornell University. CS5410, Fall 2008.




What is a Distributed Hash Table (DHT)?
- Exactly that: a service, distributed over multiple machines, with hash table semantics
  - Put(key, value), Value = Get(key)
- Designed to work in a peer-to-peer (P2P) environment
  - No central control
  - Nodes under different administrative control
- But of course it can also operate in an infrastructure sense

More specifically
- Hash table semantics: Put(key, value), Value = Get(key)
- Key is a single flat string
- Limited semantics compared to keyword search
- Put() causes the value to be stored at one (or more) peer(s)
- Get() retrieves the value from a peer
- Put() and Get() are accomplished with unicast routed messages
  - In other words, it scales
- Other API calls support applications, like notification when neighbors come and go

P2P environment
- Nodes come and go at will (possibly quite frequently, every few minutes)
- Nodes have heterogeneous capacities
  - Bandwidth, processing, and storage
- Nodes may behave badly
  - Promise to do something (store a file) and not do it (free-loaders)
  - Attack the system

Several flavors, each with variants
- Tapestry (Berkeley)
  - Based on Plaxton trees, similar to hypercube routing
  - The first* DHT
  - Complex and hard to maintain (hard to understand too!)
- CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge)
  - Second wave of DHTs (contemporary with, and independent of, each other)
- *Landmark Routing, 1988, used a form of DHT called Assured Destination Binding (ADB)

Basics of all DHTs
- Goal is to build some structured overlay network with the following characteristics:
  - Node IDs can be mapped to the hash key space
  - Given a hash key as a destination address, you can route through the network to a given node
  - Always route to the same node no matter where you start from
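To make the "node IDs map to the hash key space" idea concrete, here is a minimal sketch, assuming SHA-1 and a 7-bit ring (0..127) to match the toy example that follows; Chord really does use SHA-1, but the helper name ring_id is ours.

```python
import hashlib

M = 7  # bits in the ID space; 2**7 = 128 positions (0..127), matching the toy ring below

def ring_id(name: str, m: int = M) -> int:
    """Map a node name or an object key onto the circular ID space [0, 2**m)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

# Nodes and keys live in the same ID space, so a key doubles as a routable address.
print(ring_id("node-A"), ring_id("foo.htm"))
```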

Simple example (doesn't scale)
- Circular number space 0 to 127, with nodes at IDs 13, 33, 58, 81, 97, 111, and 127
- Routing rule is to move counter-clockwise until current node ID >= key, and last hop node ID < key

Example: key = 42
- Obviously you will route to node 58, no matter where you start
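A sketch of that routing rule on the example ring; successor_node is an illustrative helper, not something from the slides.

```python
# Toy ring from the figure: ID space 0..127, nodes at these positions.
NODES = [13, 33, 58, 81, 97, 111, 127]

def successor_node(key: int, nodes=NODES) -> int:
    """Return the node responsible for `key`: the first node with ID >= key,
    wrapping around to the lowest node ID if the key is past the last node."""
    for node_id in sorted(nodes):
        if node_id >= key:
            return node_id
    return min(nodes)  # wrap around the ring

assert successor_node(42) == 58   # the slide's example: key 42 routes to node 58
```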

Building any DHT
- Newcomer always starts with at least one known member
- Newcomer searches for "self" in the network
  - Hash key = newcomer's node ID
  - Search results in a node in the vicinity where the newcomer needs to be
- Links are added/removed to satisfy the properties of the network
- Objects that now hash to the new node are transferred to the new node
- [Figure: newcomer 24 joins the example ring (nodes 13, 33, 58, ..., 127)]
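A rough sketch of that join sequence on the toy ring; the object-transfer rule and the data layout here are invented for illustration, not taken from any particular DHT.

```python
# Hypothetical state: node ID -> set of object keys currently stored there.
ring = {13: set(), 33: {17, 30}, 58: {42, 50}, 81: set(), 97: {93}, 111: set(), 127: set()}

def successor(key, nodes):
    nodes = sorted(nodes)
    return next((n for n in nodes if n >= key), nodes[0])

def join(ring, newcomer):
    """Newcomer searches for its own ID, lands next to its future neighbor,
    then takes over the objects that now hash to it."""
    neighbor = successor(newcomer, ring.keys())              # search for "self"
    moved = {k for k in ring[neighbor] if k <= newcomer}     # keys the newcomer now owns
    ring[newcomer] = moved
    ring[neighbor] -= moved                                  # transfer objects
    return ring

join(ring, 24)                 # node 24 joins between 13 and 33 and takes over key 17
print(ring[24], ring[33])
```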

Insertion/lookup for any DHT
- Hash the name of the object to produce a key
  - Well-known way to do this
- Use the key as a destination address to route through the network
  - Routes to the target node
- Insert the object, or retrieve the object, at the target node
- [Figure: foo.htm hashes to key 93, which routes to node 97 on the example ring]
- (A code sketch of this put/get path appears below, after this group of slides)

Properties of most DHTs
- Memory requirements grow (something like) logarithmically with N
- Routing path length grows (something like) logarithmically with N
- Cost of adding or removing a node grows (something like) logarithmically with N
- Has caching, replication, etc.

DHT issues
- Resilience to failures
- Load balance
  - Heterogeneity
  - Number of objects at each node
  - Routing hot spots
  - Lookup hot spots
- Locality (performance issue)
- Churn (performance and correctness issue)
- Security

We're going to look at four DHTs
- At varying levels of detail
- CAN (Content Addressable Network): ACIRI (now ICIR)
- Chord: MIT
- Kelips: Cornell
- Pastry: Rice/Microsoft Cambridge

Things we're going to look at
- What is the structure?
- How does routing work in the structure?
- How does it deal with node departures?
- How does it scale?
- How does it deal with locality?
- What are the security issues?
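The put/get sketch referred to above reuses the toy successor routing; key_of, route, put, and get are illustrative names, and a real DHT would route hop by hop rather than compute the target directly.

```python
import hashlib

ring = {13: {}, 33: {}, 58: {}, 81: {}, 97: {}, 111: {}, 127: {}}   # node -> local hash table

def key_of(name: str) -> int:
    """Hash the object's name to produce a key in the 0..127 ID space."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % 128

def route(key: int) -> int:
    """Use the key as a destination address: it lands on the key's successor node."""
    nodes = sorted(ring)
    return next((n for n in nodes if n >= key), nodes[0])

def put(name: str, value) -> None:
    ring[route(key_of(name))][name] = value      # insert at the target node

def get(name: str):
    return ring[route(key_of(name))].get(name)   # retrieve from the target node

put("foo.htm", "<html>...</html>")
print(get("foo.htm"))
```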

CAN
- CAN structure is a Cartesian coordinate space in a D-dimensional torus
- (CAN graphics care of Santashil PalChaudhuri, Rice Univ)
- Simple example in two dimensions
  - Note: the torus wraps on the top and sides
- Each node in a CAN network occupies a square in the space
  - With relatively uniform square sizes
- [Figure: nodes 1 through 4 each claiming a square of the 2-D space]

Neighbors in a CAN network
- A neighbor is a node that:
  - Overlaps in d-1 dimensions
  - Abuts along one dimension
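A small sketch of that neighbor test, treating a zone as one (low, high) interval per dimension; the representation is ours, and torus wrap-around is ignored for brevity.

```python
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def abuts(a, b):
    return a[1] == b[0] or b[1] == a[0]

def are_neighbors(zone_a, zone_b):
    """CAN neighbors: spans overlap in d-1 dimensions and abut along exactly one."""
    abut_dims = sum(abuts(a, b) for a, b in zip(zone_a, zone_b))
    overlap_dims = sum(overlaps(a, b) for a, b in zip(zone_a, zone_b))
    return abut_dims == 1 and overlap_dims == len(zone_a) - 1

# Two zones in a 2-D CAN: they share an edge along x and overlap along y.
print(are_neighbors([(0.0, 0.5), (0.0, 0.5)], [(0.5, 1.0), (0.0, 0.5)]))   # True
```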

Route to neighbors closer to the target
- d-dimensional space, n zones
  - A zone is the space occupied by a square in one dimension
- Avg. route path length: (d/4)(n^(1/d))
- Number of neighbors = O(d)
- Tunable (vary d or n)
- Can factor proximity into the route decision
- [Figure: zones Z1, Z2, Z3, Z4, ..., Zn; routing from (x,y) toward (a,b)]

Chord uses a circular ID space
- (Chord slides care of Robert Morris, MIT)
- Successor: the node with the next highest ID
- [Figure: circular ID space with nodes N10, N32, N60, N80, N100; keys K5, K10 map to N10; K11, K30 to N32; K33, K40, K52 to N60; K65, K70 to N80; K100 to N100]

Basic Lookup
- [Figure: nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; N32 asks "Where is key 50?" and the answer is "Key 50 is at N60"]
- Lookups find the ID's predecessor
- Correct if successors are correct

Successor Lists Ensure Robust Lookup
- Each node remembers r successors
- Lookup can skip over dead nodes to find blocks
- Periodic check of successor and predecessor links
- [Figure: with r = 3, N5 remembers 10, 20, 32; N10: 20, 32, 40; N20: 32, 40, 60; N32: 40, 60, 80; N40: 60, 80, 99; N60: 80, 99, 110; N80: 99, 110, 5; N99: 110, 5, 10; N110: 5, 10, 20]
- All r successors have to fail before we have a problem
- The list ensures we find the actual current successor
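A sketch of how an r-entry successor list lets a lookup skip dead nodes, using the node IDs from the figure; the liveness set is a stand-in for real failure detection.

```python
NODES = [5, 10, 20, 32, 40, 60, 80, 99, 110]
R = 3   # each node remembers r = 3 successors

def successor_list(n):
    i = NODES.index(n)
    return [NODES[(i + k) % len(NODES)] for k in range(1, R + 1)]

def next_hop(n, alive):
    """Forward to the first live successor; only if all r successors are dead do we fail."""
    for s in successor_list(n):
        if s in alive:
            return s
    raise RuntimeError(f"all {R} successors of N{n} have failed")

alive = set(NODES) - {40}     # suppose N40 has crashed
print(next_hop(32, alive))    # N32 skips the dead N40 and forwards to N60
```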

Chord Finger Table Accelerates Lookups
- [Figure: node N80 with fingers spanning 1/8, 1/16, 1/32, 1/64, and 1/128 of the ring]
- To build finger tables, a new node searches for the key value of each finger
- To do it efficiently, new nodes obtain their successor's finger table and use it as a hint to optimize the search
- Log(N) fingers mean we don't have to track all nodes, and finger info is easy to maintain

Chord lookups take O(log N) hops
- [Figure: N32 resolves Lookup(K19) in a few finger hops across nodes N5, N10, N20, N32, N60, N80, N99, N110]
- Note that fingers point to the first relevant node

Drill down on Chord reliability
- Interested in maintaining a correct routing table (successors, predecessors, and fingers)
- Primary invariant: correctness of successor pointers
  - Fingers, while important for performance, do not have to be exactly correct for routing to work
  - The algorithm is to get closer to the target, and successor nodes always do this

Maintaining successor pointers
- Periodically run the stabilize algorithm
  - Finds the successor's predecessor
  - Repair if this isn't self
- This algorithm is also run at join
- Eventually routing will repair itself
- Fix_finger is also run periodically, for a randomly selected finger

Example: initially, 25 wants to join a correct ring (between 20 and 30)
- 25 finds its successor, and tells the successor (30) of itself
- 20 runs stabilize:
  - 20 asks 30 for 30's predecessor
  - 30 returns 25
  - 20 tells 25 of itself

Example: this time, 28 also joins, before 20 runs stabilize
- 28 finds its successor, and tells the successor (30) of itself
- 20 runs stabilize:
  - 20 asks 30 for 30's predecessor
  - 30 returns 28
  - 20 tells 28 of itself
- 25 runs stabilize, then 20 runs stabilize
- After these rounds, the pointers among 20, 25, 28, and 30 are again consistent
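The stabilize/notify exchange just walked through, written as a sketch; the structure follows the Chord paper's pseudocode, but this Python framing (Node, in_open_interval) is ours and omits finger maintenance and RPC details.

```python
class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self      # a brand-new ring starts with a single node
        self.predecessor = None

def in_open_interval(x, a, b):
    """True if x lies strictly between a and b going around the ring."""
    return (a < x < b) if a < b else (x > a or x < b)

def notify(n, candidate):
    """`candidate` believes it might be n's predecessor."""
    if n.predecessor is None or in_open_interval(candidate.id, n.predecessor.id, n.id):
        n.predecessor = candidate

def stabilize(n):
    """Run periodically: ask the successor for its predecessor and repair links."""
    x = n.successor.predecessor
    if x is not None and in_open_interval(x.id, n.id, n.successor.id):
        n.successor = x
    notify(n.successor, n)

def join(n, existing):
    n.predecessor = None
    n.successor = existing         # in full Chord this would be existing.find_successor(n.id)
    notify(n.successor, n)         # the newcomer "tells its successor of itself"

# Replay the slide's first scenario: correct ring {20, 30}, then 25 joins.
n20, n30 = Node(20), Node(30)
n20.successor, n30.successor = n30, n20
n20.predecessor, n30.predecessor = n30, n20
n25 = Node(25)
join(n25, n30)        # 25 finds its successor (30) and notifies it
stabilize(n20)        # 20 learns of 25 and tells 25 of itself
print(n20.successor.id, n25.predecessor.id)    # -> 25 20
```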

Pastry also uses a circular number space
- The difference is in how the fingers are created
- Pastry uses prefix-match overlap rather than binary splitting
- This gives more flexibility in neighbor selection
- [Figure: routing a message with key d46a1c from node 65a1fc; each hop (nodes such as d13da3, d4213f, d462ba, d467c4, d471f1) shares a progressively longer prefix with the key]
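A sketch of the prefix-match step behind that route; the (row, next digit) routing-table layout follows the Pastry idea, but the table contents and helper names here are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def next_hop(node_id: str, key: str, routing_table: dict) -> str:
    """Forward to a node whose ID shares a longer prefix with the key.
    routing_table maps (row, next_digit) -> node ID, where row is the
    length of the prefix that node_id already shares with the key."""
    row = shared_prefix_len(node_id, key)
    if row == len(key):
        return node_id                                   # we already are the key's root
    return routing_table.get((row, key[row]), node_id)   # no entry: fall back (e.g., leaf set)

# Hypothetical table at node 65a1fc; the digits of key d46a1c drive the lookup.
table = {(0, "d"): "d13da3"}
print(next_hop("65a1fc", "d46a1c", table))   # first hop toward a node starting with "d"
```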

Pastry routing table (for node 65a1fc)
- [Figure: the routing table at node 65a1fc, with one row per length of shared prefix and one column per next digit]

Pastry nodes also have a leaf set of immediate neighbors up and down the ring
- Similar to Chord's list of successors

Pastry join
- X = new node, A = bootstrap node, Z = node nearest to X's ID
- A finds Z for X
- In the process, A, Z, and all nodes on the path send their state tables to X
- X settles on its own table, possibly after contacting other nodes
- X tells everyone who needs to know about itself
- The Pastry paper (18th IFIP/ACM, Nov 2001) doesn't give enough information to understand how concurrent joins work

Pastry leave
- Noticed by leaf-set neighbors when the leaving node doesn't respond
  - Neighbors ask the highest and lowest nodes in the leaf set for a new leaf set
- Noticed by routing neighbors when a message forward fails
  - They can immediately route to another neighbor
  - Fix the entry by asking another neighbor in the same row for its neighbor
  - If this fails, ask somebody a level up

[Figure: for instance, this neighbor in the routing table fails]

Ask other neighbors
- Try asking some neighbor in the same row for its 655x entry
- If it doesn't have one, try asking some neighbor in the row below, etc.

CAN, Chord, Pastry differences
- CAN, Chord, and Pastry have deep similarities
- Some (important???) differences exist
  - CAN nodes tend to know of multiple nodes that allow equal progress
    - Can therefore use additional criteria (RTT) to pick the next hop
  - Pastry allows greater choice of neighbor
    - Can thus use additional criteria (RTT) to pick neighbors
  - In contrast, Chord has more determinism
    - Harder for an attacker to manipulate the system?

Security issues
- In many P2P systems, members may be malicious
- If peers are untrusted, all content must be signed to detect forged content
  - Requires a certificate authority
  - Like we discussed in the secure web services talk
  - This is not hard, so we can assume at least this level of security

Security issues: Sybil attack
- Attacker pretends to be multiple systems
- If it surrounds a node on the circle, it can potentially arrange to capture all traffic
- Or if not this, at least cause a lot of trouble by being many nodes
- Chord requires a node ID to be the SHA-1 hash of its IP address
  - But to deal with load balance issues, a Chord variant allows nodes to replicate themselves
  - So a central authority must hand out node IDs and certificates to go with them
  - Not P2P in the Gnutella sense

General security rules
- Check things that can be checked
  - Invariants, such as the successor list in Chord
- Minimize invariants, maximize randomness
  - Hard for an attacker to exploit randomness
- Avoid any single dependencies
  - Allow multiple paths through the network
  - Allow content to be placed at multiple nodes
- But all this is expensive

Load balancing
- Query hotspots: a given object is popular
  - Cache at neighbors of the hotspot, neighbors of neighbors, etc.
  - Classic caching issues
- Routing hotspot: a node is on many paths
  - Of the three, Pastry seems most likely to have this problem, because neighbor selection is more flexible (and based on proximity)
  - This doesn't seem adequately studied

Load balancing
- Heterogeneity (variance in bandwidth or node capacity)
- Poor distribution in entries due to hash function inaccuracies
- One class of solution is to allow each node to be multiple virtual nodes
  - Higher-capacity nodes virtualize more often
  - But security makes this harder to do
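One way to picture the virtual-node idea: each physical node hashes itself onto the ring several times, so a higher-capacity node can claim proportionally more of the key space. A minimal consistent-hashing sketch, with invented node names and capacities:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

def build_ring(capacities: dict) -> list:
    """capacities: physical node name -> number of virtual nodes (more for bigger machines)."""
    ring = []
    for node, vnodes in capacities.items():
        for i in range(vnodes):
            ring.append((h(f"{node}#{i}"), node))   # one ring position per virtual node
    return sorted(ring)

def owner(ring: list, key: str) -> str:
    """Route a key to the physical node behind the next virtual node around the ring."""
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, h(key)) % len(ring)
    return ring[i][1]

ring = build_ring({"small-node": 5, "big-node": 20})   # the bigger node virtualizes more often
print(owner(ring, "foo.htm"))
```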

Chord node virtualization
- 10K nodes, 1M objects
- 20 virtual nodes per node gives much better load balance, but each node then requires ~400 neighbors!

Primary concern: churn
- Churn: nodes joining and leaving frequently
- A join or leave requires a change in some number of links
- Those changes depend on correct routing tables in other nodes
  - The cost of a change is higher if routing tables are not correct
  - In Chord, ~6% of lookups fail if there are three failures per stabilization period
- But as more changes occur, the probability of incorrect routing tables increases

Control traffic load generated by churn
- Chord and Pastry appear to deal with churn differently
- Chord: join involves some immediate work, but repair is done periodically
  - Extra load is only due to join messages
- Pastry: join and leave involve immediate repair of all affected nodes' tables
  - Routing tables are repaired more quickly, but the cost of each join/leave goes up with the frequency of joins/leaves
  - Scales quadratically with the number of changes???
  - Can result in network meltdown???

Kelips takes a different approach
- Network partitioned into √N affinity groups
- The hash of a node's ID determines which affinity group the node is in
- Each node knows:
  - One or more nodes in each group
  - All objects and nodes in its own group
- But this knowledge is soft state, spread through peer-to-peer gossip (epidemic multicast)!

Kelips
- [Figure: affinity groups 0, 1, 2, ..., √N-1; peer membership through a consistent hash; roughly √N members per affinity group]
- Affinity group view at node 110 (110 knows about the other members of its group, 230 and 30):

    id    hbeat   rtt
    30    234     90 ms
    230   322     30 ms

- Contacts at node 110 (202 is a contact for 110 in group 2):

    group   contact node
    2       202

- Resource tuples, replicated cheaply by the gossip protocol:

    resource    info
    cnn.com     110

- cnn.com maps to group 2, so 110 tells group 2 to route inquiries about cnn.com to it
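A sketch of the mapping that both node IDs and resource names get when there are k ≈ √N affinity groups; the modulo construction here is illustrative (Kelips itself uses a consistent hash).

```python
import hashlib
import math

def affinity_group(name: str, k: int) -> int:
    """Map a node ID or a resource name (e.g. "cnn.com") to one of k affinity groups."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % k

N = 10_000                       # rough system size
k = round(math.sqrt(N))          # about sqrt(N) affinity groups
print(affinity_group("node-110", k), affinity_group("cnn.com", k))
```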

How it works
- Kelips is entirely gossip based!
  - Gossip about membership
  - Gossip to replicate and repair data
  - Gossip about "last heard from" time, used to discard failed nodes
- The gossip channel uses fixed bandwidth: a fixed rate and packets of limited size

Gossip 101
- Suppose that I know something
- I'm sitting next to Fred, and I tell him
  - Now 2 of us know
- Later, he tells Mimi and I tell Anne
  - Now 4
- This is an example of a push epidemic
- Push-pull occurs if we exchange data
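A toy round-by-round simulation of the push epidemic described above: every node that already knows the rumor tells one random peer per round, and with high probability everyone knows it after roughly log(system size) rounds.

```python
import random

def push_gossip_rounds(n: int, seed: int = 0) -> int:
    """Simulate push gossip over n nodes; return the rounds until everyone is 'infected'."""
    random.seed(seed)
    infected = {0}                       # one node starts out knowing something
    rounds = 0
    while len(infected) < n:
        for _ in list(infected):         # each infected node pushes to one random peer
            infected.add(random.randrange(n))
        rounds += 1
    return rounds

print(push_gossip_rounds(1000))          # grows roughly logarithmically with n
```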

Gossip scales very nicely
- Participants' loads are independent of system size
- Network load is linear in system size
- Information spreads in log(system size) time
- [Figure: % infected rises from 0.0 to 1.0 over time]

Gossip in distributed systems
- We can gossip about membership
  - Need a bootstrap mechanism, but then discuss failures, new members
- Gossip to repair faults in replicated data
  - "I have 6 updates from Charlie"
- If we aren't in a hurry, gossip to replicate data too

Gossip about membership
- Start with a bootstrap protocol
  - For example, processes go to some web site and it lists a dozen nodes where the system has been stable for a long time
  - Pick one at random
- Then track "processes I've heard from recently" and "processes other people have heard from recently"
- Use push gossip to spread the word

Gossip about membership
- Until messages get full, everyone will know when everyone else last sent a message
  - With a delay of log(N) gossip rounds
- But messages will have bounded size
  - Perhaps 8K bytes
  - Then use some form of prioritization to decide what to omit, but never send more, or larger, messages
- Thus: load has a fixed, constant upper bound, except on the network itself, which usually has infinite capacity

Back to Kelips: quick reminder
- [Figure: affinity groups 0 through √N-1, the affinity group view at node 110 (members 230 and 30 with heartbeats and RTTs), and contact pointers such as 202 for group 2, as shown earlier]

How Kelips works
- Gossip about everything
- Heuristic to pick contacts: periodically ping contacts to check liveness and RTT; swap so-so ones for better ones

[Figure: node 175 is a contact for node 102 in some affinity group, with RTT 235 ms; from the gossip data stream, node 102 notices node 19, with RTT 6 ms: "Hmm... Node 19 looks like a much better contact in affinity group 2"]

Replication makes it robust
- Kelips should work even during disruptive episodes
  - After all, tuples are replicated to ~√N nodes
- Query k nodes concurrently to overcome isolated crashes; this also reduces the risk that very recent data could be missed
- We often overlook the importance of showing that systems work while recovering from a disruption
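A sketch of that concurrent query, asking k members of the resource's affinity group in parallel and taking the first useful reply; query_replica and the in-memory store are placeholders for a real RPC.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_replica(member: str, resource: str):
    """Placeholder for a real RPC asking one affinity-group member for a resource tuple."""
    fake_store = {"node-110": {"cnn.com": "110"}, "node-230": {"cnn.com": "110"}}
    return fake_store.get(member, {}).get(resource)

def lookup(members, resource, k=3):
    """Query k group members in parallel and return the first non-empty reply."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(query_replica, m, resource) for m in members[:k]]
        for fut in as_completed(futures):
            result = fut.result()
            if result is not None:
                return result
    return None

print(lookup(["node-30", "node-110", "node-230"], "cnn.com"))   # survives a silent node-30
```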

Chord can malfunction if the network partitions
- [Figure: a Chord ring (IDs 0-255) split by a transient network partition between Europe and the USA, forming two separate rings]

... so, who cares?
- Chord lookups can fail, and Chord suffers from high overheads when nodes churn
  - Loads surge just when things are already disrupted, quite often because of those loads
- And we can't predict how long Chord might remain disrupted once it gets that way
- Worst-case scenario: Chord can become inconsistent and stay that way

The Fine Print
- The scenario you have been shown is of low probability. In all likelihood, Chord would repair itself after any partitioning failure that might really arise. Caveat emptor and all that.

Control traffic load generated by churn

    Kelips:  None
    Chord:   O(changes)
    Pastry:  O(changes x nodes)?

Take-aways?
- It is surprisingly easy to superimpose a hash-table lookup onto a potentially huge distributed system!
- We've seen three O(log N) solutions and one O(1) solution (but Kelips needed more space)

Sample applications?
- Peer-to-peer file sharing
- Amazon uses a DHT for the shopping cart
- CoDNS: a better version of DNS