Web Algorithmics

Dictionary-based compressors

Algorithm’s step:
• Output <dist, len, next-char>
• Advance by len + 1

A buffer “window” has fixed length and moves

[Figure: the window sliding over the text a a c a a c a b c a a a a a a. The dictionary is the set of all substrings starting inside the window. Example codewords: <6,3,a> and <3,4,c>.]
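The encoding step above can be sketched as a greedy longest-match search inside the window. This is a minimal illustration, not the deck's reference implementation: the function name `lz77_encode`, the default window of 6 characters, and the tie-breaking rule (the first longest match found wins) are assumptions made for this example.

```python
def lz77_encode(text, window=6):
    """Greedy LZ77 sketch: emit (dist, len, next_char) triples."""
    out = []
    i, n = 0, len(text)
    while i < n:
        best_dist, best_len = 0, 0
        # Try every start position inside the window (hypothetical brute-force search).
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past the cursor (overlap), but one char must be
            # left over to serve as next_char.
            while i + k < n - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_dist, best_len = i - j, k
        out.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1  # advance by len + 1
    return out


triples = lz77_encode("aacaacabcaaaaaa", window=6)
```

On this text the fourth triple produced is (6, 3, 'a'), the same shape as the <6,3,a> codeword shown in the figure.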

LZ77 Decoding

Decoder keeps the same dictionary window as the encoder. It finds the substring and inserts a copy of it.

What if l > d? (overlap with the text to be compressed)
E.g. seen = abcd, next codeword is (2,9,e)
Simply copy starting at the cursor

for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]

Output is correct: abcdcdcdcdcdce
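A small sketch of the decoder, assuming codewords are (dist, len, next_char) triples; the name `lz77_decode` and the `seen` parameter are hypothetical and only serve this example. Copying one character at a time is what makes the overlapping case (len > dist) work, because each copied character becomes available as a source for the next one.

```python
def lz77_decode(triples, seen=""):
    """Rebuild the text from (dist, len, next_char) triples."""
    out = list(seen)
    for dist, length, next_char in triples:
        cursor = len(out)
        # Char-by-char copy: correct even when the copy overlaps the cursor.
        for i in range(length):
            out.append(out[cursor - dist + i])
        out.append(next_char)
    return "".join(out)


decoded = lz77_decode([(2, 9, "e")], seen="abcd")  # "abcdcdcdcdcdce"
```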

Lempel-Ziv Algorithms

Keep a “dictionary” of recently-seen strings.

The differences are:
• How the dictionary is stored
• How it is extended
• How it is indexed
• How elements are removed

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

No explicit frequency estimation

You find this at: www.gzip.org/zlib/

LZ77 Optimizations used by gzip

LZSS: Output one of the following formats: (0, position, length) or (1, char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so that next match is better

Hash table to speed up searches on triplets

Triples are coded with Huffman’s code
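To try this combination of LZ77-style matching and Huffman coding in practice, one option is Python's standard `zlib` module, which wraps the zlib library mentioned above. The snippet is only a usage sketch; the sample data and the compression level 9 are arbitrary choices for this example.

```python
import zlib

# Highly repetitive input, so the sliding-window matcher finds long copies.
data = b"aacaacabcaaaaaa" * 200

packed = zlib.compress(data, 9)       # level 9 = strongest compression
restored = zlib.decompress(packed)

print(len(data), "->", len(packed), "bytes")
```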


Some special compressors

Spatial vs Temporal Locality

γ-code for integer encoding

$x > 0$ and $\text{Length} = \lfloor \log_2 x \rfloor + 1$

e.g., 9 represented as <000,1001>.

γ-code for x takes $2 \lfloor \log_2 x \rfloor + 1$ bits (i.e. a factor of 2 from optimal)

Layout: (Length − 1) zeros, followed by x in binary

Optimal for $\Pr(x) = 1/(2x^2)$, and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
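The exercise can be checked with a short sketch of the γ-code. The function names `gamma_encode` and `gamma_decode` are made up for this example, and bit strings are represented as Python strings of '0' and '1' purely for readability.

```python
def gamma_encode(x):
    """Gamma code: (Length - 1) zeros followed by x in binary."""
    if x <= 0:
        raise ValueError("gamma code is defined for x > 0 only")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    out = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":      # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out


decoded = gamma_decode("0001000001100110000011101100111")  # [8, 6, 3, 59, 7]
```

Because no codeword is a prefix of another, the decoder never needs separators between consecutive integers.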

Streaming compression

Still, you need to determine and sort all terms… Can we do everything in one pass?

Move-to-Front (MTF):
• As a freq-sorting approximator
• As a caching strategy
• As a compressor

Run-Length-Encoding (RLE): FAX compression

Move to Front Coding

Transforms a char sequence into an integer sequence, that can then be var-length coded

Start with the list of symbols L=[a,b,c,d,…]

For each input symbol s
1) output the position of s in L
2) move s to the front of L
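The two steps above can be sketched as follows. Positions are 0-based here, which is an assumption of this example; the helper names `mtf_encode` and `mtf_decode` are hypothetical, and the list-based implementation costs time proportional to the alphabet size per symbol.

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: output the position of each symbol, then move it to the front."""
    lst = list(alphabet)
    out = []
    for s in text:
        pos = lst.index(s)
        out.append(pos)
        lst.insert(0, lst.pop(pos))
    return out


def mtf_decode(positions, alphabet):
    """Inverse transform: replay the same list updates from the positions."""
    lst = list(alphabet)
    out = []
    for pos in positions:
        s = lst.pop(pos)
        out.append(s)
        lst.insert(0, s)
    return "".join(out)


codes = mtf_encode("aaabbbccc", "abc")  # [0, 0, 0, 1, 0, 0, 2, 0, 0]
```

Runs of equal symbols turn into runs of zeros, which is the temporal locality that a variable-length integer code can then exploit.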

Properties:
• Exploits temporal locality, and it is dynamic
• $X = 1^n 2^n 3^n \dots n^n$ ⇒ Huff = $O(n^2 \log n)$, MTF = $O(n \log n) + n^2$

There is a memory

MTF: how good is it ?

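Encode the integers via γ-coding: $|\gamma(i)| \le 2 \log i + 1$

Put Σ in the front and consider the cost of encoding, where $n_x$ is the number of occurrences of symbol $x$, $p_i^x$ is the position of its $i$-th occurrence, and $N$ is the length of $X$:

$$O(|\Sigma| \log |\Sigma|) + \sum_{x=1}^{|\Sigma|} \sum_{i=2}^{n_x} \left| \gamma\left(p_i^x - p_{i-1}^x\right) \right|$$

By Jensen’s:

$$\le O(|\Sigma| \log |\Sigma|) + \sum_{x=1}^{|\Sigma|} n_x \left[ 2 \log \frac{N}{n_x} + 1 \right]$$

$$= O(|\Sigma| \log |\Sigma|) + N \left[ 2 H_0(X) + 1 \right]$$

$$L_a[\text{mtf}] \le 2 H_0(X) + O(1)$$

Not much worse than Huffman... but it may be far better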

Run Length Encoding (RLE)

If spatial locality is very high, then

abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)

In case of binary strings, just the run lengths and one bit (the first symbol) are needed
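A minimal sketch of run-length encoding on the example above, using `itertools.groupby` from the standard library. The function names `rle_encode` and `rle_decode` are chosen for this illustration only.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each maximal run of equal symbols into a (symbol, length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(text)]


def rle_decode(pairs):
    """Expand (symbol, length) pairs back into the original string."""
    return "".join(symbol * length for symbol, length in pairs)


pairs = rle_encode("abbbaacccca")
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```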

Properties:
• Exploits spatial locality, and it is a dynamic code
• $X = 1^n 2^n 3^n \dots n^n$ ⇒ Huff(X) = $O(n^2 \log n)$ > Rle(X) = $O(n (1 + \log n))$

There is a memory


Burrows-Wheeler Transform

The big (unconscious) step...

[Figure: the sorted rotation matrix of mississippi#, with its first column F and its last column L highlighted.]

The Burrows-Wheeler Transform (1994)

Let us be given a text T = mississippi#

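All cyclic rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i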
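The definition can be turned directly into a few lines of code: build all rotations, sort them, and read off the last column. This is a sketch of the definition only, not an efficient construction; `bwt_naive` is a name invented for this example, and it relies on '#' sorting before the letters, which holds for ASCII.

```python
def bwt_naive(text):
    """BWT by definition: sort all cyclic rotations and take the last column L."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


L = bwt_naive("mississippi#")  # "ipssm#pissii"
```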

A famous example

Much longer...

Compressing L seems promising...

Key observation: L is locally homogeneous, so L is highly compressible

Algorithm Bzip:
1) Move-to-Front coding of L
2) Run-Length coding
3) Statistical coder

Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in (de)compression!

How to compute the BWT ?

SA   BWT matrix    L
12   #mississipp   i
11   i#mississip   p
8    ippi#missis   s
5    issippi#mis   s
2    ississippi#   m
1    mississippi   #
10   pi#mississi   p
9    ppi#mississ   i
7    sippi#missi   s
4    sissippi#mi   s
6    ssippi#miss   i
3    ssissippi#m   i

L[3] = T[ 7 ]

We said that: L[i] precedes F[i] in T

Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?

Input: T = mississippi#

SA   sorted suffixes
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

Elegant but inefficient

Obvious inefficiencies:
• $\Theta(n^2 \log n)$ time in the worst-case
• $\Theta(n \log n)$ cache misses or I/O faults
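The "elegant but inefficient" construction, together with the relation L[i] = T[SA[i]-1], can be sketched as below. Indices are 0-based in this example (the slides count from 1), and the names `suffix_array` and `bwt_from_sa` are hypothetical. Sorting by explicit suffix copies is exactly the quadratic-in-the-worst-case approach criticized above.

```python
def suffix_array(text):
    """Sort suffix start positions by comparing the suffixes themselves (slow)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """L[i] = T[SA[i] - 1]; for SA[i] == 0 the index -1 wraps to the last char."""
    return "".join(text[i - 1] for i in sa)


T = "mississippi#"
SA = suffix_array(T)       # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
L = bwt_from_sa(T, SA)     # "ipssm#pissii"
```

Adding 1 to every entry of `SA` gives the 1-based values 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 used in the slides.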

A useful tool: L → F mapping

[Figure: the sorted rotation matrix of T = .... #, where only the first column F and the last column L are known and the middle of each row is unknown.]

How do we map L’s chars onto F’s chars ?

... Need to distinguish equal chars in F...

Take two equal L’s chars

Rotate rightward their rows

Same relative order !!

The BWT is invertible

[Figure: the sorted rotation matrix with columns F and L known and the middle unknown.]

Two key properties:

1. LF-array maps L’s to F’s chars

2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

InvertBWT(L)

Compute LF[0,n-1];
r = 0; i = n;
while (i > 0) {
  T[i] = L[r];
  r = LF[r]; i--;
}
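A runnable sketch of the backward reconstruction, under two assumptions that are stated here rather than taken from the slides: the text ends with a unique sentinel '#' that is smaller than every other symbol, and indices are 0-based. The LF-array is obtained with a stable sort of the positions of L, which keeps equal chars in the same relative order in F and in L. The name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(L, sentinel="#"):
    """Rebuild T from its BWT column L by walking the LF mapping backward."""
    n = len(L)
    # Stable sort: the k-th char of F is L[order[k]], hence LF[order[k]] = k.
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos

    out = [sentinel]   # the last char of T is known in advance
    r = 0              # row 0 is the rotation that starts with the sentinel
    for _ in range(n - 1):
        out.append(L[r])   # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))


T = inverse_bwt("ipssm#pissii")  # "mississippi#"
```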

An encoding example

T = mississippimississippimississippi

L = ipppssssssmmmii#pppiiissssssiiiiii   (# at 16)

Mtf-list = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000   (nonzero values incremented by 1; alphabet of |Σ|+1 symbols)

RLE0 = 03141041403141410210   (Wheeler’s code, e.g. Bin(6)=110 for a run of 5 zeros)

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...

... plus γ(16), plus the original Mtf-list (i,m,p,s)
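The last two transformations of the example can be reproduced with a short sketch. It assumes, as read off the example, that nonzero MTF values are incremented by 1 so that the symbols 0 and 1 are free to encode runs, and that Wheeler's code writes a run of n zeros as the binary representation of n + 1 without its leading 1 (so a run of 5 zeros, Bin(6) = 110, becomes 10). The function names are invented for this illustration.

```python
def shift_nonzero(mtf):
    """Increment every nonzero MTF value, freeing symbols 0 and 1 for run lengths."""
    return [v + 1 if v > 0 else 0 for v in mtf]


def wheeler_run(n):
    """Wheeler's code for a run of n >= 1 zeros: binary of n + 1 minus its leading 1."""
    return bin(n + 1)[3:]   # bin() yields '0b1...', so skip '0b' and the leading 1


def rle0(shifted):
    """Replace each maximal run of zeros by its Wheeler code; copy other symbols."""
    out = []
    run = 0
    for v in shifted:
        if v == 0:
            run += 1
            continue
        if run:
            out.append(wheeler_run(run))
            run = 0
        out.append(str(v))
    if run:
        out.append(wheeler_run(run))
    return "".join(out)


mtf = [int(c) for c in "020030000030030200300300000100000"]
shifted = shift_nonzero(mtf)
encoded = rle0(shifted)  # "03141041403141410210"
```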

You find this in your Linux distribution
