Efficiency Tricks for Hashing and Blooming in Streaming Algorithms
The Data Streaming Problem
M.Zhanikeev -- [email protected] -- Efficiency Tricks for Hashing and Blooming in Streaming Algorithms -- http://bit.do/marat140516 -- 2/23...
Data Streaming Problem
• based on traditional Information Theory 01 02
• but a new formulation altogether 04
• data streaming: processes input in real time (no storage), creating space-efficient sketches on the output
• an alternative to database, indexing, offline-processing, and similar technologies
01 C.Shannon "A Mathematical Theory of Communication" The Bell System Tech.J (1948)
02 D.MacKay "Information Theory, Inference, and Learning Algorithms" Cambridge UniPress (2003)
04 S.Muthukrishnan "Data Streams: Algorithms and Applications" Theoretical Comp.Science (2005)
Data Streaming Problems
• fast hashing 08
• efficient blooming 09 10
• space-efficient streaming algorithms
[Figure labels: Data Streaming, Bloom Filter, Fast Hashing, Other Uses, Other Types of Hashing]
08 D.Lemire+1 "Strongly Universal String Hashing is Fast" Cornell Techreport (2013)
09 F.Putze+2 "Cache- Hash- and Space-Efficient Bloom Filters" JEA Journal (2009)
10 A.Kirsch+1 "Less Hashing, Same Performance: Building a Better Bloom Filter" Inderscience (2007)
Hashing and Blooming
Hashing Technology
• perfect hashing
• minimal perfect hashing
◦ applied to blooming, but relatively inefficient 11
• universal hashing ← this is the one we use
◦ but many efficiency tricks
◦ bitwise fast hashing, etc. 12
11 G.Antichi+4 "Blooming Trees for Minimal Perfect Hashing" GLOBECOM (2008)
12 F.Bonomi+4 "Bloom Filters via d-Left Hashing and Dynamic Bit Reassignment" 44th ACCCC (2006)
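To make the universal hashing bullet concrete, here is a minimal Python sketch of one well-known universal family, the Carter-Wegman construction h(x) = ((a·x + b) mod p) mod m. The slides do not say which family the author uses, so the construction, the name `make_universal_hash`, and the choice of prime are assumptions made for illustration only.

```python
import random

# Assumption: keys are non-negative integers smaller than this prime (2^61 - 1).
MERSENNE_PRIME = (1 << 61) - 1


def make_universal_hash(m, rng=random):
    """Draw one function h(x) = ((a*x + b) mod p) mod m at random.

    `m` is the number of buckets; `rng` is any object with `randrange`
    (the `random` module itself or a seeded `random.Random`).
    """
    a = rng.randrange(1, MERSENNE_PRIME)  # a must be non-zero
    b = rng.randrange(0, MERSENNE_PRIME)

    def h(x):
        return ((a * x + b) % MERSENNE_PRIME) % m

    return h


if __name__ == "__main__":
    h = make_universal_hash(64, random.Random(7))
    print([h(x) for x in range(5)])  # five bucket indices in [0, 64)
```

The point of drawing `a` and `b` at random is that, for any two distinct keys, the chance of a collision over the random draw is about 1/m, regardless of how the keys themselves are distributed.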
Hashing Quality Metrics
• uniform distribution
• avalanche condition
◦ change in one bit on the input changes about half of the bits on the output
• no partial correlation
◦ hard to achieve: head and tail bits have different qualities in common algorithms like CRC24, etc.
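The avalanche condition can be checked empirically. The sketch below flips one random input bit and counts how many output bits change; a ratio near 0.5 is what the condition asks for. The 32-bit hash taken from BLAKE2b and the byte-sum "weak" hash are stand-ins chosen for illustration, not functions named in the slides.

```python
import hashlib
import random


def hash32(data: bytes) -> int:
    """32-bit hash taken from BLAKE2b (a stand-in for a well-mixing hash)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")


def weak_hash32(data: bytes) -> int:
    """Deliberately poor hash: the sum of the bytes."""
    return sum(data) & 0xFFFFFFFF


def avalanche_ratio(hash_fn, key_len=8, trials=2000, out_bits=32, seed=1):
    """Average fraction of output bits that flip when one input bit is flipped."""
    rng = random.Random(seed)
    flipped = 0
    for _ in range(trials):
        key = bytearray(rng.getrandbits(8) for _ in range(key_len))
        before = hash_fn(bytes(key))
        bit = rng.randrange(key_len * 8)
        key[bit // 8] ^= 1 << (bit % 8)  # flip exactly one input bit
        after = hash_fn(bytes(key))
        flipped += bin(before ^ after).count("1")
    return flipped / (trials * out_bits)


if __name__ == "__main__":
    print("blake2b-based:", avalanche_ratio(hash32))
    print("byte sum     :", avalanche_ratio(weak_hash32))
```

The same loop can be restricted to the head or tail bits of the output to look for the partial correlation mentioned in the last bullet.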
Blooming Quality Metrics
• True Positives are reliable (no false negatives), but False Positives are also possible
• an answer to the question of "have you seen this before?"
• time it takes to "fill up" a bloom structure -- useless afterwards
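A minimal Bloom filter sketch in Python illustrates the three metrics above: membership never misses an inserted item, may report an item that was never inserted, and the filter degrades as its bits fill up. The class name, the default sizes, and the use of two BLAKE2b-derived values to generate the k positions (in the spirit of the "less hashing" idea in reference 10) are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Bit array of size m probed at k positions per item."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        digest = hashlib.blake2b(str(item).encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force an odd step
        # Derive k positions from two base hashes: h1 + i * h2 (mod m).
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # "Have you seen this before?" True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

    def fill_ratio(self):
        """Fraction of bits set; near 1.0 the filter answers True for everything."""
        return sum(self.bits) / self.m


if __name__ == "__main__":
    bf = BloomFilter()
    for i in range(100):
        bf.add(i)
    print(42 in bf, bf.fill_ratio())
```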
Bloom Filter Types
• stop additions filter
• deletion filter
• counting filter
• .... very active research 09 10
• .... reality: most of them are inefficient!
09 F.Putze+2 "Cache- Hash- and Space-Efficient Bloom Filters" JEA Journal (2009)
10 A.Kirsch+1 "Less Hashing, Same Performance: Building a Better Bloom Filter" Inderscience (2007)
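As one illustration of the filter types listed above, a counting filter replaces each bit with a small counter so that items can be removed again. The sketch below is a generic textbook-style version, not the author's implementation; the class name, sizes, and salted BLAKE2b hashing are assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with one counter per slot, which allows deletion."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _positions(self, item):
        data = str(item).encode()
        # k salted hashes, one per position.
        return [
            int.from_bytes(
                hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest(),
                "big",
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item):
        # Only decrement for items that appear to be present,
        # otherwise counters of other items would be corrupted.
        if item in self:
            for pos in self._positions(item):
                self.counts[pos] -= 1

    def __contains__(self, item):
        return all(self.counts[pos] > 0 for pos in self._positions(item))
```

The price of deletion is space: each slot now needs several bits instead of one, which is one reason such variants can be less efficient than the plain filter.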
Efficiency
Efficiency (1) : Hash/Bloom
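Efficiency (2) : Hash/Bloom
• how many hash functions k? (m bits in the filter, n inserted items)
$$k = \ln 2 \cdot \frac{m}{n} \approx 0.69\,\frac{m}{n}$$
• the "fill-up" rate -- when it becomes useless; p is the probability that a given bit is still 0 after n insertions
$$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$
• FP probability (the last step holds at the optimal k)
$$p_{FP} = (1 - p)^k \approx \left(1 - e^{-kn/m}\right)^k \approx \frac{1}{2^k}$$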
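The three formulas can be evaluated numerically to see how they fit together. The following Python sketch uses hypothetical sizes (m = 10,000 bits, n = 1,000 items); the function names are illustrative.

```python
import math


def optimal_k(m, n):
    """k = ln2 * m / n (left as a real number; round it in practice)."""
    return math.log(2) * m / n


def zero_bit_probability(m, n, k):
    """p = (1 - 1/m)^(k*n): chance that a given bit is still 0."""
    return (1 - 1 / m) ** (k * n)


def false_positive_probability(m, n, k):
    """p_FP = (1 - p)^k: all k probed bits happen to be set."""
    return (1 - zero_bit_probability(m, n, k)) ** k


if __name__ == "__main__":
    m, n = 10_000, 1_000  # hypothetical filter size and item count
    k = optimal_k(m, n)
    print("optimal k       :", k)
    print("p (bit still 0) :", zero_bit_probability(m, n, k))
    print("p_FP            :", false_positive_probability(m, n, k))
    print("1 / 2^k         :", 0.5 ** k)
```

At the optimal k about half of the bits are set, which is why the false positive probability collapses to roughly 1/2^k.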
Efficiency (2) : Hotspot Input
• most data today follows a hotspot distribution, modeled as an SB process
Efficiency (3) : DLL and Collisions
• a practical alternative to perfect hashing
• catch and resolve collisions using a sideways DLL (doubly linked list)
• hotspots: move changed items to the top of the DLL
• common in C/C++
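The bullets above can be sketched as a hash table whose buckets are doubly linked lists, with updated items moved to the head of their list. The slides note this is common in C/C++; Python is used here only for brevity, and the class and method names are hypothetical.

```python
class Node:
    """One entry in a bucket's doubly linked list."""
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None


class ChainedTable:
    """Hash table that resolves collisions with a per-bucket DLL."""

    def __init__(self, buckets=8):
        self.heads = [None] * buckets

    def _find(self, idx, key):
        node = self.heads[idx]
        while node is not None and node.key != key:
            node = node.next
        return node

    def _move_to_front(self, idx, node):
        if self.heads[idx] is node:
            return
        # Unlink the node (it is not the head, so node.prev exists) ...
        node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        # ... and relink it as the new head of the bucket.
        node.prev = None
        node.next = self.heads[idx]
        self.heads[idx].prev = node
        self.heads[idx] = node

    def put(self, key, value):
        idx = hash(key) % len(self.heads)
        node = self._find(idx, key)
        if node is None:
            # New key: insert at the head of the chain.
            node = Node(key, value)
            node.next = self.heads[idx]
            if node.next is not None:
                node.next.prev = node
            self.heads[idx] = node
        else:
            # Changed item: update and move to the top (hotspot trick).
            node.value = value
            self._move_to_front(idx, node)

    def get(self, key, default=None):
        node = self._find(hash(key) % len(self.heads), key)
        return default if node is None else node.value

    def chain(self, idx):
        """Keys of one bucket, head first (for inspection)."""
        keys, node = [], self.heads[idx]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys
```

With hotspot input, the frequently changed keys stay near the head of each chain, so the average lookup walks only a few nodes even when collisions are common.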
Data Streaming Examples
Examples (1) : Heavy Hitters
Objective
Finding Heavy Hitters in a hotspot-distributed input.
• find k most frequently accessed items in a list.
• good algorithms can be found in 04
04 S.Muthukrishnan "Data Streams: Algorithms and Applications" Theoretical Comp.Science (2005)
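As one example of the kind of algorithm surveyed in 04, the sketch below uses the Misra-Gries frequent-items summary. It is a swapped-in, well-known technique, not necessarily the one the author has in mind, and the parameter name `capacity` is an assumption. With `capacity` counters, any item that occurs more than n / (capacity + 1) times in a stream of n items is guaranteed to remain in the result; the counts themselves are underestimates.

```python
def misra_gries(stream, capacity):
    """Return candidate heavy hitters using at most `capacity` counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
    print(misra_gries(stream, capacity=3))
```

Memory use depends only on `capacity`, not on the length of the stream, which is what makes the summary usable in a streaming setting.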
Examples (2) : Superspreaders
Objective
Superspreaders: detect items which access or are accessed by exceedingly many other items.
• computer viruses, botnets, etc.
• one source, many destinations
• short lifespan
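A rough sketch of the "one source, many destinations" case: a Bloom filter remembers which (source, destination) pairs were already seen, so each source's fan-out counter grows only on new destinations. This is an illustrative combination, not the author's algorithm; `PairFilter`, `superspreaders`, and the threshold are hypothetical, and the per-source dictionary is kept exact for simplicity rather than for space efficiency.

```python
import hashlib


class PairFilter:
    """Small Bloom filter over (source, destination) pairs."""

    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, src, dst):
        digest = hashlib.blake2b(f"{src}->{dst}".encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add_if_new(self, src, dst):
        """Set the pair's bits; return True if the pair was not seen before."""
        new = False
        for pos in self._positions(src, dst):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                new = True
                self.bits[byte] |= mask
        return new


def superspreaders(flows, threshold):
    """Sources that contacted at least `threshold` distinct destinations."""
    seen = PairFilter()
    fanout = {}
    for src, dst in flows:
        if seen.add_if_new(src, dst):
            fanout[src] = fanout.get(src, 0) + 1
    return {src for src, count in fanout.items() if count >= threshold}


if __name__ == "__main__":
    flows = [("bot", f"host{i}") for i in range(50)] + [("pc", "srv")] * 200
    print(superspreaders(flows, threshold=20))
```

A false positive in the filter can only hide a new destination, so fan-out is slightly undercounted, never overcounted. Because superspreaders have a short lifespan, the filter and counters would be reset periodically in practice.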
Examples (3) : M2M Patterns
Objective
M2M patterns: A more generic case of heavy hitters and superspreaders, but in this definition the patterns are not known in advance.
• m2m communication patterns
• space efficiency is important
• selective filtering -- pick only interesting m2m units
The Why : Practical Application - BigData
BigData: Today
05 K.Shvachko "HDFS Scalability: the Limits to Growth" the Magazine of USENIX (2012)
BigData Replay (new)
BigData on Multicore
That’s all, thank you ...
[01] C.Shannon (1948) "A Mathematical Theory of Communication" The Bell System Tech.J
[02] D.MacKay (2003) "Information Theory, Inference, and Learning Algorithms" Cambridge UniPress
[03] A.Konheim (2010) "Hashing in Computer Science: Fifty Years of Slicing and Dicing" Wiley
[04] S.Muthukrishnan (2005) "Data Streams: Algorithms and Applications" Theoretical Comp.Science
[05] K.Shvachko (2012) "HDFS Scalability: the Limits to Growth" the Magazine of USENIX
[06] S.Heinz+2 (2002) "Burst Tries: A Fast, Efficient Data Structure for String Keys" ACM TOIS
[07] M.Ramakrishna+1 (1997) "Performance in Practice of String Hashing Functions" 5th ICDSAA
[08] D.Lemire+1 (2013) "Strongly Universal String Hashing is Fast" Cornell Techreport
[09] F.Putze+2 (2009) "Cache- Hash- and Space-Efficient Bloom Filters" JEA Journal
[10] A.Kirsch+1 (2007) "Less Hashing, Same Performance: Building a Better Bloom Filter" Inderscience
[11] G.Antichi+4 (2008) "Blooming Trees for Minimal Perfect Hashing" GLOBECOM
[12] F.Bonomi+4 (2006) "Bloom Filters via d-Left Hashing and Dynamic Bit Reassignment" 44th ACCCC