Algorithms and data structures for big data, what's next? Paolo Ferragina, University of Pisa.
Is Big Data a buzzword?
“Big Data” vs “Grid Computing”
VLDB has existed since 1992
Big data, big impact !
Big data are everywhere !
[Proc. OSDI 2006] NoSQL systems: HyperTable, Cassandra, Hadoop, Cosmos
From macro to micro-users
Energy is related to time and memory accesses in an intricate manner, so the issue of "algorithms + memory levels" is key for everyday users, not only for the big players
Our driving moral...
Big steps come from theory
... but do NOT forget practice ;-)
Our running example
(String-)Dictionary Problem
Given a dictionary D of K strings, of total length N, store them in a way that efficiently supports prefix searches for a pattern P.
Exact search → Hashing
(Compacted) Trie
[Figure: compacted trie over the dictionary {systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo}; edges carry substrings such as "stile", "zyg", "etic", nodes store the offsets of their branching characters]
[Fredkin, CACM 1960]
Performance:
• Search ≈ O(|P|) time
• Space ≈ O(N)
It dominated the string-matching scene in the '80s and '90s; the best known variant is the Suffix Tree.
Software engineers objected:
• Search: random memory accesses
• Space: pointers + strings
Lexicographic search: P = systo
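The trie search just described can be sketched in a few lines of Python. This is an illustrative, non-compacted trie (a compacted trie would also merge unary paths into single edges to reach the O(N) space bound); class and method names are mine, not the slides':

```python
# A minimal (non-compacted) trie supporting the prefix searches described
# above; a compacted trie would merge unary paths into single edges.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for w in words:
            node = self.root
            for c in w:
                node = node.children.setdefault(c, TrieNode())
            node.is_word = True

    def prefix_search(self, prefix):
        """Return all dictionary strings starting with `prefix`:
        O(|P|) time to reach the locus node, then list its subtree."""
        node = self.root
        for c in prefix:                      # O(|P|) descent
            if c not in node.children:
                return []
            node = node.children[c]
        out, stack = [], [(node, prefix)]
        while stack:                          # enumerate the subtree
            n, s = stack.pop()
            if n.is_word:
                out.append(s)
            for c, child in n.children.items():
                stack.append((child, s + c))
        return sorted(out)

D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
t = Trie(D)
print(t.prefix_search("syz"))  # ['syzygetic', 'syzygial', 'syzygy']
```

Note that the search for P = systo correctly returns nothing, since no dictionary string has that prefix.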
Timeline: theory and practice...
'60: Trie → '70s-'80s: Suffix Tree → '90s: ?
What about Software Engineers??
What did systems implement?
They used the compacted trie, of course, but with two other concerns because of the large data.
1st issue: space concern
Original URL list:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Front-coded version (shared-prefix length + differing suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
(space ≈ 45%)
0 http://checkmate.com/All/Natural/Washcloth.html
...
systile syzygetic syzygial syzygy … → systile 2,zygetic 5,ial 5,y
Front Coding
2nd issue: disk memory
[Figure: two-level memory model, CPU + internal memory on one side, a disk (tracks) accessed in blocks of B items on the other]
2 main features:
• Seek time: I/Os are costly
• Blocked access: B items per I/O
So we count I/Os. And strings may be arbitrarily long.
Why are strings challenging?
2-level indexing:
• Disk: buckets of front-coded strings, e.g. ….0 systile 2 zygetic 5 ial 5 y 0 szaibelyite 2 czecin 2 omo….
• Internal memory: a compacted trie on a sample (e.g. systile, szaibelyite)
One main limitation: the sampling rate and the lengths of the sampled strings, i.e. a trade-off between speed and space (because of the bucket size).
2 advantages:
• Search ≈ typically 1 disk access
• Space ≈ front-coding over buckets
This is the (Prefix) B-tree.
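A minimal sketch of the 2-level indexing scheme, assuming in-memory lists stand in for disk buckets (names and the bucket size are illustrative):

```python
# 2-level indexing: only the first string of each "disk" bucket is kept in
# internal memory; search binary-searches the samples, then scans one
# bucket (typically a single disk access).
import bisect

class TwoLevelIndex:
    def __init__(self, sorted_strings, bucket_size):
        # buckets live "on disk"; one sampled string per bucket stays in RAM
        self.buckets = [sorted_strings[i:i + bucket_size]
                        for i in range(0, len(sorted_strings), bucket_size)]
        self.samples = [b[0] for b in self.buckets]

    def search(self, pattern):
        # locate the bucket whose sample precedes the pattern...
        i = max(bisect.bisect_right(self.samples, pattern) - 1, 0)
        out = [s for s in self.buckets[i] if s.startswith(pattern)]
        # ...and scan following buckets only while matches spill over
        i += 1
        while i < len(self.buckets) and self.samples[i].startswith(pattern):
            out.extend(s for s in self.buckets[i] if s.startswith(pattern))
            i += 1
        return out

D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
idx = TwoLevelIndex(D, bucket_size=4)
print(idx.search("syz"))  # ['syzygetic', 'syzygial', 'syzygy']
```

The limitation on the slide is visible here: how well this works depends on the sampling rate (one sample per bucket) and on how long the sampled strings are, since they must all fit in internal memory.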
Timeline: theory and practice...
'60: Trie → '70s-'80s: Suffix Tree → '90s: 2-level indexing
Space + hierarchical memory: do we need to trade space for I/Os?
1995: the String B-tree
An old idea: Patricia Trie
….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
[Figure: Patricia trie over the same dictionary; each node stores only the offset of its branching character, and each edge only its label]
[Morrison, J.ACM 1968]
Disk
A new (lexicographic) search
….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
[Figure: the Patricia trie again, now with one-character edge labels only]
Search(P), e.g. the lexicographic search for P = syzytea:
• Phase 1: tree navigation (blind, comparing only the branching characters)
• Phase 2: fetch one string from disk and compute its LCP with P
• Phase 3: tree navigation again, down to the lexicographic position of P
Only 1 string is checked on disk
Trie space ≈ #strings, NOT their total length
[Ferragina-Grossi, J.ACM 1999]
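Phases 1 and 2 of the blind search can be sketched as follows; this is a toy Patricia trie, with class and function names of my own choosing, and strings padded with a "$" terminator so that none is a prefix of another:

```python
# A toy Patricia trie: each node stores only the offset of its branching
# character and, per edge, only the edge's first character, so the trie's
# size depends on the number of strings, not on their total length.

class PTNode:
    def __init__(self, offset, leaf=None):
        self.offset = offset     # position of the branching character
        self.children = {}       # first char of edge -> child node
        self.leaf = leaf         # one string stored below this node

def build(strings, depth=0):
    """Build a Patricia trie over sorted, distinct, $-terminated strings."""
    if len(strings) == 1:
        return PTNode(len(strings[0]), leaf=strings[0])
    # advance to the first position where the group of strings differs
    while len({s[depth] for s in strings}) == 1:
        depth += 1
    node = PTNode(depth, leaf=strings[0])
    i = 0
    while i < len(strings):
        j = i
        while j < len(strings) and strings[j][depth] == strings[i][depth]:
            j += 1
        node.children[strings[i][depth]] = build(strings[i:j], depth + 1)
        i = j
    return node

def blind_search(root, pattern):
    """Phase 1: descend comparing ONLY branching characters. Phase 2:
    fetch one string from 'disk' and compute its LCP with the pattern."""
    node = root
    while node.children:
        c = pattern[node.offset] if node.offset < len(pattern) else None
        node = node.children.get(c) or next(iter(node.children.values()))
    candidate = node.leaf     # the single string checked on disk
    lcp = 0
    while (lcp < min(len(pattern), len(candidate))
           and pattern[lcp] == candidate[lcp]):
        lcp += 1
    return candidate, lcp     # phase 3 re-navigates using this LCP

D = sorted(s + "$" for s in ["systile", "syzygetic", "syzygial", "syzygy",
                             "szaibelyite", "szczecin", "szomo"])
root = build(D)
cand, lcp = blind_search(root, "syzytea$")
print(cand, lcp)  # syzygetic$ 4
```

Note the "blind" behaviour on P = syzytea: phase 1 happily follows the 'e' branch even though P diverged earlier, yet the LCP computed in phase 2 against that single string is enough to locate P's lexicographic position in phase 3.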
Disk
The String B-tree
[Figure: the String B-tree, a B-tree over string identifiers where each node is equipped with a Patricia Trie (PT) over its own strings]
Search(P):
• O(logB K) levels, and checking 1 string costs O(p/B) I/Os, hence O((p/B) logB K) I/Os to find the lexicographic position of P
• plus O(occ/B) I/Os to report the occurrences
It is dynamic...
[Ferragina-Grossi, J.ACM 1999]
> 15 US-patents cite it !!
Knuth, vol. 3, p. 489: "elegant"
I/O-aware algorithms & data structures
[CACM 1988]
[2006]
Huge literature !!
I/Os were the main concern
Timeline: theory and practice...
'60: Trie → '70s-'80s: Suffix Tree → '90s: 2-level indexing → 1995: String B-tree → 1999: ?
CPU registers → L1 cache → L2 cache → RAM → HD → network
Cache-oblivious solutions, aka parameter-free algorithms and data structures: anywhere, anytime, anyway... I/O-optimal!!
Not just 2 memory levels
Timeline: theory and practice...
'60: Trie → '70s-'80s: Suffix Tree → '90s: 2-level indexing → 1995: String B-tree → 1999: Cache-oblivious data structures, Compressed data structures
Space: not just 2 memory levels
Can we “automate” and “guarantee” the process ?
A challenging question [Ken Church, AT&T, 1995]: software engineers use "squeezing heuristics" that compress data and still support fast access to them.
Aka: Compressed self-indexes
Opportunistic Data Structures with Applications
P. Ferragina, G. Manzini
Space for text + index = the space of the compressed text only (≈ Hk); query/decompression time theoretically (quasi-)optimal.
...now, J.ACM 2005
The big (unconscious) step...
[Burrows-Wheeler, 1994]
Given the text T = mississippi#, form all of its cyclic rotations and sort the rows lexicographically; bwt(T) is the last column:

#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i

Highly compressible, but…
T → bwt(T); bzip2 = BWT + other simple compressors
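The transform itself fits in a few lines. This is a naive sketch that materializes and sorts all rotations; real constructions use suffix sorting instead:

```python
# The Burrows-Wheeler Transform as described above: sort all cyclic
# rotations of T and take the last column.

def bwt(text):
    assert text.endswith("#")   # unique, lexicographically smallest end marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last):
    """Invert by repeatedly prepending `last` and re-sorting (naive)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("#"))

T = "mississippi#"
L = bwt(T)
print(L)                    # ipssm#pissii
assert inverse_bwt(L) == T  # the transform is invertible
```

The output clusters equal characters together ("pssm", "ssii"), which is exactly why a BWT followed by simple compressors (as in bzip2) compresses so well.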
From practice to theory...
FM-index = BWT is searchable
...or Suffix Array is compressible
• Space = |T| Hk + o(|T|) bits
• Search(P) = O(p + occ · polylog(|T|))
Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]
[Ferragina-Manzini, IEEE Focs ‘00]
sa(T) and bwt(T):
12 #mississipp i
11 i#mississip p
 8 ippi#missis s
 5 issippi#mis s
 2 ississippi# m
 1 mississippi #
10 pi#mississi p
 9 ppi#mississ i
 7 sippi#missi s
 4 sissippi#mi s
 6 ssippi#miss i
 3 ssissippi#m i
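The claim that "BWT is searchable" rests on backward search, the core of the FM-index. A bare-bones sketch, with a naive O(n)-time rank where the real FM-index uses compressed rank structures to reach the stated bounds:

```python
# Backward search over bwt(T): count the occurrences of a pattern using
# only the BWT (via rank queries), never touching T itself.

def backward_search(last, pattern):
    first = sorted(last)                      # first column of the matrix
    # C[c] = number of characters in T smaller than c
    C = {c: first.index(c) for c in set(last)}
    rank = lambda c, i: last[:i].count(c)     # naive rank (O(n) here)
    lo, hi = 0, len(last)                     # current suffix-array range
    for c in reversed(pattern):               # process P right to left
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo                            # number of occurrences

L = "ipssm#pissii"                            # bwt("mississippi#")
print(backward_search(L, "ssi"))   # 2 occurrences
print(backward_search(L, "iss"))   # 2 occurrences
```

Each character of P costs O(1) rank queries, giving the O(p + occ · polylog) search time quoted above once rank is supported in compressed space.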
Compressed & Searchable data formats
After our paper in FOCS 2000 (about texts), we nowadays find compressed indexes for: trees, labeled trees and graphs, functions, integer sets, geometry, images, ...
From theory to practice…
December 2003
ACM J. on Experimental Algorithmics, 2009
> 10³× faster than Smith-Waterman
> 10²× faster than SOAP & Maq
What about the Web? [Ferragina-Manzini, ACM WSDM 2010]
An XML excerpt:
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
IEEE FOCS 2005 · WWW 2006 · J. ACM 2009 · US Patent 2012
A tree interpretation
XML document exploration → tree navigation
XML document search → labeled subpath searches
The XBW transform
XBW Transform: some performance figures
[Figure: number of searches per second on larger and larger datasets; Xerces is better on smaller files but worse on larger files, and uses 10× the space]
Where we are nowadays
'60: Trie → '70s-'80s: Suffix Tree → '90s: 2-level indexing → 1995: String B-tree → 1999: Cache-oblivious data structures, Compressed data structures
Something is known... yet very preliminary
Lower Bounds derived from Geometry
Text search = 2d Range Search
New food for research..
[E. Gal, S. Toledo. ACM Comp. Surv., 2005]
[Ajwani et al, WEA 2009]
Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling
Self-adjusting or weighted design: operation times depend on some (un)known distribution.
Challenge: no pointers; self-adjusting (performance) vs compression (space).
[Ferragina et al, ESA 2011]
40 GB, about $100
The energy challenge
IEEE Computer, 2007
Browsing a web site

JavaScript framework: Prototype   | Dojo        | jQuery (the most used!)
Chrome:               best choice | best choice | 1.5%
Firefox:              2.5%        | 4.8%        | 4.3%
IE:                   10.2%       | 8.5%        | 11%
Yet today, it is a problem...
Apple is still working on the battery life problem: "The recent iOS software update addressed many of the battery issues that some customers experienced on their iOS 5 devices. We continue to investigate a few remaining issues." (Nov 2011, wired.com)
"Windows 8's power hygiene: the scheduler will ignore unused software" (Feb 2012, MSDN)
Energy-aware Algo+Ds ?
Locality pays off
Memory-level impacts
I/Os and compression are obviously important, BUT here there is a new twist:
MIPS per Watt? Battery life!!
Who cares whether your application:
1. is y% slower than optimal, but more energy efficient?
2. takes x% more space than optimal, but more energy efficient?
Idea: multi-objective optimization in data-structure design, approached in a principled way.
A preliminary step
Took inspiration from BigTable (Google), ...
Design a compressed storage scheme that can trade, in a principled way, space vs decompression time [vs energy efficiency].
Requirements: gzip-like compression [like Snappy or LZ4 by Google].
Goal: fix the space occupancy, then find the best compression that achieves that space and minimizes the decompression time (or vice versa).
Example: with [abrac] already parsed, adabra → (a) (d) (abra) → <2,1> <0,d> <7,4>
(copy back, new char, copy back)
A preliminary step...
Modeled as a Constrained Shortest Path problem:
• Nodes: one per character of the text to be compressed
• Edges: single characters or copy-back substrings
• 2 edge weights: decompression time (t) and compressed space (c)
NP-hard in general; this special case is POLY: O(n³). But n is huge, and m might be n².
LZ-parsing = a path from node 1 to node 12.
We solved it heuristically (Lagrangian Dual) and provably (Path Swap).
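The parsing-as-shortest-path view can be sketched as a DP over this DAG. The cost model below (1 byte per literal, 3 bytes per copy) is a toy assumption of mine, and this single-weight version omits the second weight (decompression time) that makes the slides' problem constrained and hard:

```python
# One node per text position, an edge per admissible literal or copy-back,
# and a forward DP picking the parse of minimum total cost.

def lz_parse(text, lit_cost=1, copy_cost=3, min_copy=2):
    n = len(text)
    best = [float("inf")] * (n + 1)   # best[i] = min cost to encode text[:i]
    back = [None] * (n + 1)           # chosen edge into node i
    best[0] = 0
    for i in range(n):
        # literal edge i -> i+1
        if best[i] + lit_cost < best[i + 1]:
            best[i + 1] = best[i] + lit_cost
            back[i + 1] = (i, "lit", text[i])
        # copy-back edges i -> i+length, one per earlier match
        for j in range(i):
            l = 0
            while i + l < n and text[j + l] == text[i + l]:
                l += 1
            for length in range(min_copy, l + 1):
                if best[i] + copy_cost < best[i + length]:
                    best[i + length] = best[i] + copy_cost
                    back[i + length] = (i, "copy", (i - j, length))
    # recover the shortest path n <- ... <- 0
    parse, i = [], n
    while i > 0:
        prev, kind, payload = back[i]
        parse.append((kind, payload))
        i = prev
    return parse[::-1], best[n]

parse, cost = lz_parse("abracadabra")
print(parse, cost)
```

On "abracadabra" the DP chooses seven literals followed by the copy <7,4> for the final "abra", matching the flavor of the example above; changing the two cost constants changes the optimal path, which is exactly the space/time trade-off being optimized.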
A preliminary step...
String Matching: RAM model, character comparisons and time
1990s, Data Bases: hierarchical memories and I/Os
2000s, Data Compression: space reduction in indexes and entropy space-bounds
Graph Theory: space reduction in compressors
Optimization: multi-objective design and joules
2010s, Computational Geometry: lower bounds on indexes, new upper bounds on I/Os and entropy
Nowadays…
We mainly commented:
A quote to conclude
"The distance between theory and practice is closer in theory than in practice" [Y. Matias, Google]
Big steps come from theory
... but do NOT forget practice ;-)
That’s all !