Fateme Shirazi, Spring 2010
Statistical structures for Internet-scale data management
Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum

Outline
Introduction
Background: Hash sketches
Computing aggregates and building histograms
Implementation
Results
Conclusion
Peer-to-Peer (P2P)
File sharing in overlay networks
Millions of users (peers) provide storage and bandwidth for searching and fetching files
Motivation
In P2P file sharing, the total number of (unique) documents shared by the users is often needed
Distributed P2P search engines need to evaluate the significance of keywords
the ratio of indexed documents containing each keyword to the total number of indexed documents
Motivation
Internet-scale information retrieval systems need a method to deduce the rank/score of data items
Sensor networks need methods to compute aggregates
Traditionally, query optimizers rely on histograms over stored data to estimate the size of intermediate results
Overview Sketch
A large number of nodes form the system's infrastructure
They contribute and/or store data items, and are involved in operations such as computing synopses and building histograms
In general, queries do not affect all nodes
Aggregation functions are computed over data sets defined dynamically by a filter predicate of the query
Problem Formulation
Relevant data items are stored in unpredictable ways in a subset of all nodes
A large number of different data sets expected to exist, stored at (perhaps overlapping) subsets of the network
Relevant queries and synopses may be built and used over any of these data sets

Computational Model
Data stored in the P2P network is structured in relations
Each relation R consists of (k+l) attributes or columns: R(a1, ..., ak, b1, ..., bl)
The tuple identifier is either one of the tuple's attributes, or is calculated otherwise (e.g. as a combination of its attributes)
Outline
Introduction
Background: Hash sketches
Computing aggregates and building histograms
Experimental setup
Results
Conclusion
Distributed Hash Tables
A family of structured P2P network overlays exposing a hash-table-like interface (lookup service)
Examples of DHTs include Chord, Kademlia, Pastry, CAN
Any node can efficiently retrieve the value associated with a given key
Chord
Nodes are assigned identifiers from a circular ID space, computed as the hash of the node's IP address
The node-ID space is partitioned among the nodes, so that each node is responsible for a well-defined set (arc) of identifiers
Each item is also assigned a unique identifier from the same ID space
Each item is stored at the node whose ID is closest to the item's ID

Hash Sketches
A technique for estimating the number of distinct items in a data set D stored in a database
Useful for application domains that need to count distinct elements:
Approximate query answering in very large databases
Data mining on the Internet graph
Stream processing
Hash Sketches
A hash sketch consists of a bit vector B[] of length L
To estimate the number n of distinct elements in D, ρ(h(d)) is applied to all d ∈ D and the results are recorded in the bitmap vector B[0 ... L-1], where ρ(x) is the position of the least-significant 1-bit of x
[Figure: hash-sketch insertions. Each data item d1, ..., dn is hashed to an L-bit pseudo-random number h(d); bit ρ(h(d)) of the bit vector B (bits b0 ... bL-1, LSB to MSB) is set to 1. Partially copied from slides of the author.]

Hash Sketches
Since h() distributes values uniformly over [0, 2^L):
P(ρ(h(d)) = k) = 2^-(k+1)
If R = position of the least-significant 0-bit in B, then 2^R ~ n
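The estimation rule above can be sketched in a few lines of Python. This is a minimal single-sketch illustration; SHA-1 as the uniform hash, L = 32, and the standard Flajolet-Martin correction factor φ ≈ 0.77351 are assumed details, not taken from the slides:

```python
import hashlib

L = 32  # length of the bit vector B

def rho(x: int) -> int:
    # Position of the least-significant 1-bit of x.
    return (x & -x).bit_length() - 1 if x else L - 1

def h(item) -> int:
    # Uniform L-bit hash (SHA-1 truncated to 32 bits; an assumed choice).
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")

def insert(B: list, item) -> None:
    # Record the item: set bit rho(h(item)) in the sketch.
    B[rho(h(item))] = 1

def estimate(B: list) -> float:
    R = B.index(0)             # least-significant 0-bit in B
    return (2 ** R) / 0.77351  # 2^R ~ n, scaled by the assumed FM factor

B = [0] * L
for d in ["d1", "d2", "d3", "d2"]:  # duplicates do not change the sketch
    insert(B, d)
```

Note that inserting a duplicate sets an already-set bit, so the sketch depends only on the set of distinct items, which is exactly why it estimates distinct counts.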
[Figure: example with four items d1 ... d4; bits 0 and 1 of B are set, so R = 2 and |D| ~ 2^2 = 4. Partially copied from slides of the author.]

Distributing Data Synopses
(1) The conservative but popular rendezvous-based approach
(2) The decentralized DHS (Distributed Hash Sketches) approach, in which no node has any sort of special functionality
[Figure: mapping DHS bits to DHT nodes. Each bit position (Bit 0, Bit 1, Bit 2, Bit 3, ...) is mapped to an arc of the ring of nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56. Copied from slides of the author.]

[Figure: DHS counting. The counting node probes arcs by bit position: bits > 3 not set; bit 2 not set, retrying; bit 2 not set; bit 1 not set, retrying; bit 1 set! Copied from slides of the author.]

Outline
Introduction
Background: Hash sketches
Computing aggregates and building histograms
Experimental setup
Results
Conclusion
Computing Aggregates
COUNT-DISTINCT: estimation of the number of (distinct) items in a multi-set
COUNT: adding the tuple IDs to the corresponding synopsis, instead of the values of the column in question
SUM: each node locally computes the sum of the values of the column over the tuples it stores, and populates a local hash sketch
AVG: estimate the SUM and COUNT of the column, then take their ratio

COUNT-DISTINCT
Both rendezvous-based hash sketches and DHS are applicable to estimating the number of (distinct) items in a multiset
Assume we want to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system
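The difference between COUNT and COUNT-DISTINCT can be illustrated with a toy sketch: COUNT inserts tuple IDs (which are always distinct), COUNT-DISTINCT inserts the column values. A hedged sketch, with SHA-1 as the hash and φ ≈ 0.77351 as the correction factor (both assumed details), and AVG then following as the ratio of the SUM and COUNT estimates:

```python
import hashlib

L = 32

def rho(x: int) -> int:
    # Position of the least-significant 1-bit of x.
    return (x & -x).bit_length() - 1 if x else L - 1

def h(item) -> int:
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")

def sketch(items) -> list:
    # Build a hash sketch over a stream of items.
    B = [0] * L
    for it in items:
        B[rho(h(it))] = 1
    return B

def estimate(B: list) -> float:
    return 2 ** B.index(0) / 0.77351

# Four tuples, but only three distinct values in the column.
rows = [("id1", 10), ("id2", 10), ("id3", 20), ("id4", 30)]
count_distinct = estimate(sketch(v for _, v in rows))  # sketches column values
count = estimate(sketch(tid for tid, _ in rows))       # sketches (unique) tuple IDs
```

Because duplicates leave the sketch unchanged, sketching the tuple IDs makes every row count exactly once, which turns the distinct-count estimator into a plain COUNT estimator.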
Counting with the Rendezvous Approach
Nodes first compute a rendezvous ID (e.g. by hashing the attribute name: h(attr1) = 47)
Then each node computes the synopsis locally and sends it to the node whose ID is closest to the above ID (the rendezvous node)
The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis
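The combining step is just a bitwise OR of the incoming bit vectors; a minimal sketch (the vector length and the list-of-bits representation are illustrative choices):

```python
L = 8  # bit-vector length (illustrative)

def merge(local_sketches: list) -> list:
    # The rendezvous node ORs all local synopses into the global one.
    out = [0] * L
    for s in local_sketches:
        out = [a | b for a, b in zip(out, s)]
    return out

node_a = [1, 1, 0, 0, 0, 0, 0, 0]
node_b = [1, 0, 1, 0, 0, 0, 0, 0]
global_sketch = merge([node_a, node_b])  # → [1, 1, 1, 0, 0, 0, 0, 0]
```

OR-merging works because a bit is set in the global sketch iff some item anywhere in the network set it, exactly as if all items had been inserted into one sketch.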
Interested nodes can then acquire the global synopsis by querying the rendezvous node

[Figures: Steps 1, 2, and 3 of the rendezvous approach]
Counting with DHS
In the DHS-based case, nodes storing tuples of R insert them into the DHS by:
(1) Nodes hash their tuples and compute ρ(hash) for each tuple
(2) For each tuple, nodes send a set-to-1 message to a random ID in the corresponding arc
(3) Counting consists of probing random nodes in the arcs corresponding to increasing bit positions until a 0-bit is found

[Figures: Steps 1, 2, and 3 of the DHS approach]
Histograms
The most common technique used by commercial databases as a statistical summary
An approximation of the distribution of values in base relations.
For a given attribute/column, a histogram is a grouping of attribute values into buckets

[Figure: example histograms over Salary and Age attributes]

Constructing Histogram Types
Equi-Width histograms
The most basic histogram variant
Partitions the attribute value domain into cells (buckets) of equal spread
Assigns to each bucket the number of tuples whose attribute value falls within it
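A minimal equi-width construction, assuming the value domain [lo, hi) is known in advance:

```python
def equi_width(values, lo, hi, buckets):
    # Partition [lo, hi) into `buckets` cells of equal spread and
    # count the tuples whose attribute value falls in each cell.
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        i = min(int((v - lo) / width), buckets - 1)  # clamp v == hi edge
        counts[i] += 1
    return counts

equi_width([1, 2, 3, 7, 8, 9], 0, 10, 2)  # → [3, 3]
```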
Other Histogram Types
Average Shifted Equi-Width histograms (ASH)
Consist of several EWHs with different starting positions in the value space
The frequency of each value in a bucket is computed as the average of the estimations given by the individual histograms
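The averaging step can be sketched as follows; the particular offsetting scheme (grids shifted by a fraction of the bucket width) is an illustrative choice, not taken from the slides:

```python
def ash(values, lo, hi, buckets, shifts):
    # Average Shifted Histogram: average the frequency estimates of
    # `shifts` equi-width grids whose origins are offset by width/shifts.
    width = (hi - lo) / buckets

    def frequency(v):
        total = 0.0
        for s in range(shifts):
            origin = lo - s * width / shifts
            i = int((v - origin) // width)  # bucket of v in the s-th grid
            # frequency estimate of v under this grid: its bucket's count
            total += sum(1 for u in values if int((u - origin) // width) == i)
        return total / shifts

    return frequency
```

Averaging over shifted grids smooths out the arbitrary placement of bucket boundaries, which is the point of the ASH variant.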
Equi-Depth histograms
In an Equi-Depth histogram all buckets have equal frequencies but not (necessarily) equal spreads
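Equi-depth bucket boundaries can be read off the sorted data at equal-frequency quantiles; a minimal sketch:

```python
def equi_depth_bounds(values, buckets):
    # Bucket boundaries so every bucket holds (roughly) the same number
    # of tuples; the spreads vary with the data distribution.
    s = sorted(values)
    n = len(s)
    return [s[b * n // buckets] for b in range(1, buckets)]

equi_depth_bounds(list(range(100)), 4)  # → [25, 50, 75]
```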
Outline
Introduction
Background: Hash sketches
Computing aggregates and building histograms
Implementation
Results
Conclusion
Implementation
1. Generating the workload
2. Populating the network with peers
3. Randomly assigning data tuples from the base data to nodes in the overlay
4. Inserting all nodes into the P2P overlay
5. Selecting random nodes, reconstructing histograms, and computing aggregates

Measures of Interest
We consider:
(1) The fairness of the load distribution across nodes in the network
(2)The accuracy of the estimation itself
(3) The number of hops needed to perform the estimation
To show the trade-off of scalability vs. performance/load distribution between the DHS and rendezvous-based approaches
Fairness
To compute fairness, the load on any given node is measured as the insertion/query/probe hits on that node
i.e. the number of times this node is the target of insertion/query/probe operations
Several metrics are used, specifically:
The Gini Coefficient
The Fairness Index
Maximum and total loads for the DHS- and rendezvous-based approaches
The Gini Coefficient
The mean of the absolute difference of every possible pair of loads, normalized
Takes values in the interval [0, 1), where a GC value of 0.0 is the best possible state, with 1.0 being the worst
The Gini Coefficient roughly represents the amount of imbalance in the system
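The definition above translates directly into code; this minimal version normalizes the mean absolute pairwise difference by twice the mean, the standard formulation:

```python
def gini(loads):
    # Mean absolute difference over all ordered pairs, normalized by
    # twice the mean load; 0.0 = perfectly even, toward 1.0 = one node takes all.
    n = len(loads)
    mean = sum(loads) / n
    diff = sum(abs(a - b) for a in loads for b in loads)
    return diff / (2 * n * n * mean)

gini([1, 1, 1, 1])  # → 0.0  (perfectly balanced load)
gini([0, 0, 0, 4])  # → 0.75 (one node carries all the load)
```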
[Figure: Gini = A/(A+B), where A is the area between the line of equality and the Lorenz curve, and B the area under the Lorenz curve]

Estimation Error
The mean error of the estimation is reported
Computed as the percentage by which the distributed estimation differs from the aggregate computed in a centralized manner
(i.e. as if all data was stored on a single host)

Hop-count Costs
The per-node average hop count for inserting all tuples into the distributed synopsis is measured and shown. The per-node hop-count costs are higher for the DHS-based approach.
Outline
Introduction
Background
Computing aggregates and building histograms
Implementation
Results
Conclusion
Results
The hop-count efficiency and the accuracy of rendezvous-based hash sketches and of DHS are measured
Initially, single-attribute relations are created, with integer values in the interval [0, 1000)
following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) or a shuffled Zipf distribution with θ equal to 0.7, 1.0, and 1.2
Total Query Load (Node Hits) over Time
[Figure]

Load Distribution
The extra hop-count cost of the DHS-based approach pays back when it comes to load-distribution fairness
The load on a node is the number of times it is visited (a.k.a. node hits) during data insertion and/or query processing
Gini Coefficient
[Figures: rendezvous approach vs. DHS approach]
Evolution of the Gini Coefficient
In the rendezvous-based approach a single node has all the query load
The DHS-based approaches converge to a GC of about 0.5, which equals the GC value of the distribution of the distances between consecutive nodes in the ID space
These are thus the best respective values achievable by any algorithm using randomized assignment of items to nodes

Evolution of the Gini Coefficient
[Figure]
Error for Computing the COUNT Aggregate
[Figures: rendezvous approach vs. DHS approach]
In both cases, the error is due to the use of hash sketches. Both approaches exhibit the same average error. As expected, the higher the number of bitmaps in the synopsis, the better the accuracy.
Insertion Hop Count
[Figures: rendezvous approach vs. DHS approach]
The insertion hop-count cost for all aggregates
Hop-count costs are higher for the DHS-based approach by approximately 8, for both the insertion and query cases
Outline
Introduction
Background: Hash sketches
Computing aggregates and building histograms
Experimental setup
Results
Conclusion
Conclusion
A framework for distributed statistical synopses for Internet-scale networks such as P2P systems
Extending centralized settings techniques towards distributed settings
Developing DHT based higher-level synopses like Equi-Width, ASH, and Equi-Depth histograms
Conclusion
A fully distributed cardinality estimator, providing scalability, efficiency, and accuracy
Constructed efficiently and scaling well with growing network size, while maintaining high accuracy
Providing a trade-off between accuracy and construction/maintenance costs
Totally balanced (access and maintenance) load on nodes
Future Research
Examining auto-tuning capabilities for the histogram inference engine
Integrating it with Internet- scale query processing systems
Looking into implementations for other types of synopses, aggregates, and histogram variants
Finally, using these tools for approximate query answering
Thank you