Algorithms and data structures for big data, what’s next? Paolo Ferragina, University of Pisa.

Page 1:

Algorithms and data structures for big data, what’s next?

Paolo Ferragina, University of Pisa

Page 2:

Is Big Data a buzzword?

Page 3:

“Big Data” vs “Grid Computing”

Page 4:

VLDB has existed since 1992

Page 5:

Big data, big impact!

Page 6:

Big data are everywhere!

Page 7:

[Proc. OSDI 2006] NoSQL

HyperTable, Cassandra, Hadoop, Cosmos

Page 8:
Page 9:
Page 10:

From macro to micro-users

Energy is related to time and memory accesses in an intricate manner, so the issue “algo + memory levels” is key for everyday users, not only for big players

Page 11:

Our driving moral...

Big steps come from theory

... but do NOT forget practice ;-)

Page 12:

Our running example

Page 13:

(String-)Dictionary Problem

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P.

Exact search → Hashing
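As a baseline for the problem statement above, here is a minimal sketch of prefix search by binary search over a sorted array of strings. It is not one of the structures discussed in the talk; the function name `prefix_range` is hypothetical, and the sketch assumes the last character of P is not the largest possible code point.

```python
from bisect import bisect_left

def prefix_range(sorted_strings, pattern):
    """Return (lo, hi) so that sorted_strings[lo:hi] are the strings prefixed by pattern."""
    lo = bisect_left(sorted_strings, pattern)
    if not pattern:
        return 0, len(sorted_strings)
    # Smallest string that is larger than every string having `pattern` as a prefix:
    # increment the last character of the pattern (assumes it is not the maximum code point).
    upper = pattern[:-1] + chr(ord(pattern[-1]) + 1)
    hi = bisect_left(sorted_strings, upper)
    return lo, hi

D = sorted(["systile", "syzygetic", "syzygial", "syzygy",
            "szaibelyite", "szczecin", "szomo"])
lo, hi = prefix_range(D, "syzyg")
matches = D[lo:hi]
```

Each binary-search step compares up to |P| characters, so the cost is about O(|P| log K) character comparisons, which hashing cannot offer for prefixes at all.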

Page 14:

(Compacted) Trie

[Fredkin, CACM 1960]

[Figure: compacted trie built over the strings below; internal nodes store string depths, edges carry substring labels such as "stile", "zyg", "etic", "ial", and an edge label can be stored as a triple such as (2; 3,5)]

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

Performance:
• Search ≈ O(|P|) time
• Space ≈ O(N)

Dominated the string-matching scene in the ’80s-’90s

The best known is the Suffix Tree

Software engineers objected:
• Search: random memory accesses
• Space: pointers + strings

Lexicographic search: P = systo
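The O(|P|) navigation can be illustrated with a plain, uncompacted trie; this is a simplification of the compacted trie on the slide (one node per character rather than per branching point), and the names `TrieNode`, `build_trie` and `prefix_search` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_end = False   # True if a dictionary string ends at this node

def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root

def prefix_search(root, pattern):
    """Walk down |P| edges, then list the strings in the reached subtree in sorted order."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    results = []

    def collect(n, acc):
        if n.is_end:
            results.append(pattern + "".join(acc))
        for ch in sorted(n.children):
            acc.append(ch)
            collect(n.children[ch], acc)
            acc.pop()

    collect(node, [])
    return results

root = build_trie(["systile", "syzygetic", "syzygial", "syzygy",
                   "szaibelyite", "szczecin", "szomo"])
```

Every child lookup is a pointer dereference to an unrelated memory location, which is exactly the "random memory accesses" objection listed above.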

Page 15:

Timeline: theory and practice...

’60: Trie
’70-’80: Suffix Tree
’90: What about Software Engineers??

Page 16:

What did systems implement?

They used the compacted trie, of course, but with 2 additional concerns because of large data

Page 17:

1st issue: space concern

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html...

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html

3345%

0 http://checkmate.com/All/Natural/Washcloth.html...

Front coding

systile syzygetic syzygial syzygy …. 2,zygetic 5,ial 5,y
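Front coding as used above can be sketched in a few lines: each string of a sorted list is stored as the length of the prefix it shares with its predecessor, followed by the remaining suffix. The function names `front_encode` and `front_decode` are hypothetical, and pairs are kept as Python tuples rather than packed bytes.

```python
def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length with the previous string, remaining suffix)."""
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded

def front_decode(encoded):
    """Rebuild the strings left to right; each one needs its predecessor."""
    out, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
codes = front_encode(words)
```

Decoding is inherently sequential, so in practice the list is cut into buckets that each restart from a fully written string.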

Page 18:

2nd issue: disk memory

[Figure: CPU with internal memory, connected to a disk; data is read from disk tracks in blocks of B items]

2 main features:
• Seek time = I/Os are costly
• Blocked access = B items per I/O

Count I/Os

Why are strings challenging? Strings may be arbitrarily long

Page 19:

2-level indexing

Disk: ….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo…. (front-coded buckets of size B)

Internal memory: CT (compacted trie) on a sample of the strings: systile, szaibelyite

2 advantages:
• Search ≈ typically 1 disk access
• Space ≈ Front-coding over buckets

One main limitation:
• Sampling rate & lengths of sampled strings
• Trade-off between speed and space (because of bucket size)

(Prefix) B-tree
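A minimal sketch of the 2-level scheme, under simplifying assumptions: the in-memory sample is a sorted Python list searched by binary search (the slide uses a compacted trie on the sample), and the "disk" is a list of front-coded buckets. All function names are hypothetical.

```python
from bisect import bisect_right

def front_encode(strings):
    encoded, prev = [], ""
    for s in strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded

def front_decode(encoded):
    out, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

def build_two_level(sorted_strings, bucket_size):
    """Sample = first string of each bucket (kept in memory); buckets = front-coded (on disk)."""
    starts = range(0, len(sorted_strings), bucket_size)
    sample = [sorted_strings[i] for i in starts]
    buckets = [front_encode(sorted_strings[i:i + bucket_size]) for i in starts]
    return sample, buckets

def lookup(sample, buckets, key):
    b = bisect_right(sample, key) - 1          # in-memory step: pick the candidate bucket
    if b < 0:
        return False
    return key in front_decode(buckets[b])     # simulated single disk access

D = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]
sample, buckets = build_two_level(D, bucket_size=4)
```

A larger `bucket_size` shrinks the in-memory sample and improves front coding, but makes the decoded bucket scan longer, which is the speed versus space trade-off named above.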

Page 20:

Timeline: theory and practice...

’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing (Space + Hierarchical Memory)
1995: String B-tree

Do we need to trade space for I/Os?

Page 21:

An old idea: Patricia Trie

[Morrison, J.ACM 1968]

[Figure: compacted trie over the strings below, which reside on disk]

Disk: ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

Page 22:

A new (lexicographic) search

[Ferragina-Grossi, J.ACM 1999]

[Figure: Patricia trie over the strings below; each edge keeps only the first character of its label (s, z, e, i, y, a, c, o) and each node keeps its string depth]

Disk: ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

Search(P):
• Phase 1: tree navigation
• Phase 2: compute LCP
• Phase 3: tree navigation

Lexicographic search: P = syzytea

Result: lexicographic position of P

Only 1 string is checked on disk

Trie space ≈ #strings, NOT their length

Page 23:

The String B-tree

[Ferragina-Grossi, J.ACM 1999]

[Figure: String B-tree; every node stores a Patricia trie (PT) over its string pointers. String pointers per level, from the bottom level up to the root:]

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
29 2 26 13 20 25 6 18 3 14 21 23
29 13 20 18 3 23

Search(P):
• O((p/B) log_B K) I/Os to find the lexicographic position of P: O(log_B K) levels, and checking 1 string per level = O(p/B) I/Os
• + O(occ/B) I/Os

It is dynamic...

> 15 US patents cite it!!

Knuth, vol. 3, p. 489: “elegant”

Page 24:

I/O-aware algorithms & data structures

[CACM 1988]

[2006]

Huge literature!!

I/Os were the main concern

Page 25:

Timeline: theory and practice...

’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious solutions, aka parameter-free algo+ds

[Figure: memory hierarchy: CPU registers, L1 and L2 cache, RAM, HD, net]

Not just 2 memory levels

Anywhere, anytime, anyway... I/O-optimal!!

Page 26:

Timeline: theory and practice...

’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999:
• Not just 2 memory levels: Cache-oblivious data structures
• Space: Compressed data structures

Page 27:

A challenging question [Ken Church, AT&T 1995]

Software Engineers use “squeezing heuristics” that compress data and still support fast access to them

Can we “automate” and “guarantee” the process?

Page 28:

Aka: Compressed self-indexes

Opportunistic Data Structures with Applications
P. Ferragina, G. Manzini
...now, J.ACM 2005

• Space for text + index ≈ space for compressed text only (≈ H_k)
• Query/Decompression time theoretically (quasi-)optimal

Page 29:

The big (unconscious) step...

[Burrows-Wheeler, 1994]

Let us be given a text T = mississippi#

All rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (first character, middle, last character):

# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

Highly compressible, but…

Page 30:

The big (unconscious) step...

[Burrows-Wheeler, 1994]

[Figure: the sorted rotation matrix of T = mississippi#, as on the previous page; its last column is bwt(T) = ipssm#pissii]

bzip2 = BWT + other simple compressors
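The transform on these two pages can be reproduced with a naive sketch that literally sorts all rotations; real implementations build a suffix array instead. The inverse assumes the terminator `#` occurs exactly once and is the last character of T. Function names are hypothetical.

```python
def bwt(text):
    """Last column of the sorted rotation matrix (naive: materializes all rotations)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, terminator="#"):
    """Invert the transform using a stable sort of the last column."""
    n = len(last)
    # order[k] = position in `last` of the character that is k-th in the first column.
    order = sorted(range(n), key=lambda i: last[i])
    k = last.index(terminator)   # the row equal to T itself ends with the terminator
    out = []
    for _ in range(n):
        k = order[k]
        out.append(last[k])
    return "".join(out)

T = "mississippi#"
L = bwt(T)
```

In `L`, equal characters cluster into runs (for example `ss` and `ii`), which is what the simple compressors that follow the BWT in bzip2 exploit.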

Page 31:

From practice to theory...

FM-index = BWT is searchable
...or Suffix Array is compressible

[Ferragina-Manzini, IEEE FOCS ’00]

• Space = |T| H_k + o(|T|) bits
• Search(P) = O(p + occ * polylog(|T|))

Nowadays tons of papers: theory & experiments [Navarro-Mäkinen, ACM Comp. Surveys 2007]

sa(T)  sorted rotations  bwt(T)
12     #mississipp       i
11     i#mississip       p
 8     ippi#missis       s
 5     issippi#mis       s
 2     ississippi#       m
 1     mississippi       #
10     pi#mississi       p
 9     ppi#mississ       i
 7     sippi#missi       s
 4     sissippi#mi       s
 6     ssippi#miss       i
 3     ssissippi#m       i
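Why the BWT is searchable can be seen from backward search, sketched below. This is an uncompressed illustration only: the `occ` table and the suffix array are stored in full, so it does not achieve the |T| H_k space bound. Positions are 0-based (the slide's sa(T) is 1-based), and the names `build_fm` and `backward_search` are hypothetical.

```python
def build_fm(text):
    """Return (sa, C, occ) for a text ending with a unique smallest terminator."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    last = "".join(text[i - 1] for i in sa)      # bwt(T); text[-1] wraps around
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {c: [0] * (n + 1) for c in C}
    for i, ch in enumerate(last):
        for c in C:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ

def backward_search(pattern, sa, C, occ):
    """Scan P right to left, narrowing the range of sorted rotations prefixed by the suffix read so far."""
    lo, hi = 0, len(sa)
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[k] for k in range(lo, hi))

sa, C, occ = build_fm("mississippi#")
hits = backward_search("ssi", sa, C, occ)
```

The loop performs one step per pattern character, matching the O(p) term of the search bound; the compressed index replaces `occ` and `sa` with succinct counterparts, which is where the polylog factor per occurrence comes from.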

Page 32:

Compressed & Searchable data formats

After our paper in FOCS 2000, about texts

We find nowadays compressed indexes for:
• Trees
• Labeled trees and graphs
• Functions
• Integer Sets
• Geometry
• Images
• ...

Page 33:

From theory to practice…

December 2003

Page 34:

ACM J. on Experimental Algorithmics, 2009

Page 35:

> 10^3 times faster than Smith-Waterman

> 10^2 times faster than SOAP & Maq

Page 36:

What about the Web?

[Ferragina-Manzini, ACM WSDM 2010]

Page 37:

An XML excerpt

<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

IEEE FOCS 2005, WWW 2006, J. ACM 2009, US Patent 2012

Page 38:

A tree interpretation

• XML document exploration → Tree navigation
• XML document search → Labeled subpath searches

XBW transform

Page 39:

XBW Transform: Some performance figures

[Plot: number of searches per second, measured on larger and larger datasets]

• Xerces better on smaller files
• Xerces worse on larger files
• Xerces uses 10x space

Page 40:

Where we are nowadays

’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious data structures; Compressed data structures

Something is known... yet very preliminary

Lower Bounds derived from Geometry

Text search = 2d Range Search

Page 41:

New food for research..

Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling
[E. Gal, S. Toledo. ACM Comp. Surv., 2005]
[Ajwani et al, WEA 2009]
40 GB, about $100

Self-adjusting or Weighted design: time of ops depends on some (un/known) distribution
Challenge: no pointers, self-adjust (perf) vs compression (space)
[Ferragina et al, ESA 2011]

Page 42:

The energy challenge

IEEE Computer, 2007

Page 43:

Browsing a web site

            Javascript framework
Browser     Prototype     Dojo          jQuery
Chrome      best choice   best choice   1.5%
Firefox     2.5%          4.8%          4.3%
IE          10.2%         8.5%          11%

[Two callouts on the slide, each reading “The most used!”]

Page 44:
Page 45:

Yet today, it is a problem...

Apple is still working on the battery life problem: “The recent iOS software update addressed many of the battery issues that some customers experienced on their iOS 5 devices. We continue to investigate a few remaining issues.” (Nov 2011, wired.com)

“Windows 8's power hygiene: the scheduler will ignore the unused software” (Feb 2012, MSDN)

Page 46:

Energy-aware Algo+Ds?

Locality pays off

Memory-level impacts

I/Os and compression are obviously important BUT here there is a new twist

Page 47:

MIPS per Watt? Battery life!!

Who cares whether your application:
1. is y% slower than optimal, but it is more energy efficient?
2. takes x% more space than optimal, but it is more energy efficient?

Idea: Multi-objective optimization in data-structure design

Approach in a principled way

Page 48:

A preliminary step

Took inspiration from BigTable (Google), ...

Design a compressed storage scheme that can trade in a principled way between space vs decompression time [vs energy efficiency]

Requirements: gzip-like compression [like Snappy by Google, or LZ4]

Goal: Fix the space occupancy, find the best compression that achieves that space and minimizes the decompression time (or vice versa)

[abrac] adabra -> [abrac] (a) (d) (abra) -> [abrac] <2,1> <0,d> <7,4>

<2,1> = copy back, <0,d> = new char, <7,4> = copy back
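The parsing example above can be checked with a small sketch: a decoder for the two phrase types and a greedy parser that always takes the longest copy and breaks ties towards the closest one. The tuple encoding `(distance, length)` / `(0, char)` mirrors the slide's notation; the function names and the tie-breaking rule are assumptions, not the scheme the talk optimizes.

```python
def lz_decode(dictionary, phrases):
    """Decode phrases that follow an already-known prefix (the dictionary)."""
    out = list(dictionary)
    for a, b in phrases:
        if a == 0:
            out.append(b)              # <0, char>: new character
        else:
            for _ in range(b):         # <distance, length>: copy back, one char at a time
                out.append(out[-a])
    return "".join(out[len(dictionary):])

def lz_greedy_parse(dictionary, text):
    """Greedy parsing: longest match first, closest match on ties (quadratic, for illustration)."""
    buf = dictionary + text
    i, phrases = len(dictionary), []
    while i < len(buf):
        best_len, best_dist = 0, 0
        for start in range(i):
            length = 0
            while i + length < len(buf) and buf[start + length] == buf[i + length]:
                length += 1
            if length > 0 and length >= best_len:
                best_len, best_dist = length, i - start
        if best_len == 0:
            phrases.append((0, buf[i]))
            i += 1
        else:
            phrases.append((best_dist, best_len))
            i += best_len
    return phrases

phrases = lz_greedy_parse("abrac", "adabra")
```

Greedy parsing fixes one choice per position; the goal stated above instead treats the choice of copies as something to optimize, since a far-back copy may be cheap in bits but slow to decode.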

Page 49:

A preliminary step...

Modeled as a Constrained Shortest Path problem:
• Nodes = one per char of the text to be compressed
• Edges = single char or copy-back substrings
• 2 edge weights = decompression time (t) and compressed space (c)

LZ-parsing = Path from 1 to 12

NP-hard in general. This special case is polynomial: O(n^3), but n is huge and m might be n^2

We solved it heuristically (Lagrangian Dual) and provably (Path Swap)
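To make the model concrete, here is a hedged sketch of the generic Lagrangian-relaxation heuristic for a constrained shortest path on a DAG: minimize total time t subject to total space c not exceeding a budget, by minimizing t + λ·c for a multiplier λ found by bisection. This is the textbook idea only, not the authors' algorithm; nodes are numbered 0..n here, and the edge weights in the example are invented for illustration.

```python
def best_path(n, edges, lam):
    """Min of t + lam*c over paths 0 -> n in a DAG whose edges (u, v, t, c) have u < v.
    Returns the (t, c) totals of the chosen path, or None if n is unreachable."""
    INF = float("inf")
    cost = [INF] * (n + 1)
    totals = [None] * (n + 1)
    cost[0], totals[0] = 0.0, (0, 0)
    for u, v, t, c in sorted(edges):           # sorting by source gives a topological order
        candidate = cost[u] + t + lam * c
        if candidate < cost[v]:
            cost[v] = candidate
            totals[v] = (totals[u][0] + t, totals[u][1] + c)
    return totals[n]

def lagrangian_search(n, edges, space_budget, iterations=50):
    """Heuristic: return (t, c) with c <= space_budget, or None if no such path is found."""
    fastest = best_path(n, edges, 0.0)
    if fastest is None:
        return None
    if fastest[1] <= space_budget:
        return fastest
    lo, hi = 0.0, 1.0
    while best_path(n, edges, hi)[1] > space_budget:
        hi *= 2.0                              # penalize space more and more
        if hi > 1e12:
            return None                        # treat the budget as infeasible
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if best_path(n, edges, mid)[1] <= space_budget:
            hi = mid
        else:
            lo = mid
    return best_path(n, edges, hi)

# Hypothetical 3-character text: literals are fast but large, long copies slow but small.
edges = [(0, 1, 1, 9), (1, 2, 1, 9), (2, 3, 1, 9),
         (0, 2, 4, 12), (1, 3, 4, 12), (0, 3, 10, 15)]
result = lagrangian_search(3, edges, space_budget=22)
```

The relaxation can only return paths on the lower convex hull of the (c, t) trade-off, so it may miss the true optimum; that gap is why a heuristic of this kind is usually paired with an exact refinement step.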

Page 50:

A preliminary step...

Page 51:

We mainly commented:

• String Matching: RAM model, char cmp and time
• 1990s: Data Bases: Hierarchical memories and I/Os
• 2000s: Data Compression: Space reduction in indexes and entropy space-bounds
• Graph Theory: Space reduction in compressors
• Optimization: Multi-objective design and joules
• 2010s: Computational Geometry: Lower bounds on indexes; new upper bounds on I/Os, entropy

Nowadays…

Page 52:

A quote to conclude

“The distance between theory and practice is closer in theory than in practice” [Y. Matias, Google]

Big steps come from theory

... but do NOT forget practice ;-)

Page 53:

That’s all!