Fast Parallel Algorithms for Universal Lossless Source Coding

Fast Parallel Algorithms for Universal Lossless Source Coding

Dror Baron

CSL & ECE Department, UIUC

[email protected]

Ph.D. Defense – February 18, 2003

2

Overview• Motivation, applications, and goals• Background:

– Source models– Lossless source coding + universality– Semi-predictive methods

• An O(N) semi-predictive universal encoder• Two-part codes

– Rigorous analysis of their compression quality– Application to parallel compression of Bernoulli sequences

• Parallel semi-predictive (PSP) coding– Achieving a work-efficient algorithm– Theoretical results

• Summary

3

Motivation

• Lossless compression: text files, facsimiles, software executables, medical, financial, etc.

• What do we want in a compression algorithm?– Universality: adaptive to a large class of sources

– Good compression quality

– Speed: low computational complexity

– Simple implementation

– Low memory use

– Sequential vs. offline

4

Why Parallel Compression ?

• Some applications require high data rates:– Compressed pages in virtual memory– Remote archiving + fast communication links– Real-time compression in storage systems– Power reduction for interconnects on a circuit board

• Serial compression is limited by the clock rate

5

Room for Improvement and Goals• Previous Art:

– Serial universal source coding methods have reached the bounds on compression quality [Willems1998,Rissanen1999]

– Parallel source coding algorithms have high complexity and/or poor compression quality

• Naïve parallelization compresses poorly• Parallel dictionary compression [Franszek et. al.1996]• Parallel context tree weighting [Stassen&Tjalkens2001,Willems2000]

• Research Goals: “good” parallel compression algorithm– Work-efficient: O(N/B) time with B computational units– Compression quality: as good as best serial methods (almost!)

6

Main Contributions• BWT-MDL (O(N) universal encoder):

– An O(N) algorithm that achieves Rissanen’s redundancy bounds on best achievable compression

– Combines efficient prefix tree construction with semi-predictive approach to universal coding

• Fast Suffix Sorting (not in this talk):– Core algorithm is very simple (can be implemented in VLSI)– Worst-case complexity O(N log0.5(N)) – Competitive with other suffix sorting methods in practice

• Two-Part Codes: – Rigorous analysis of their compression quality– Application to distributed/parallel compression– Optimal two-part codes

• Parallel Compression Algorithm (not in this talk):– Work-efficient O(N/B) algorithm– Compression loss is roughly B log(N/B) bits

7

Source Models• Binary alphabet X={0,1}, sequence x XN

• Bernoulli Model:• i.i.d. model • p(xi=1)=

• Order-K Markov Model:• Previous K symbols called context• Context-dependent conditional probability for next symbol• More flexible than Bernoulli• Exponentially many states

8

Context Tree Sources

More flexible than BernoulliMore compact than MarkovParticularly good for textWorks for M-ary alphabet

• State = context + conditional probabilities• Example: N=11, x=01011111111

root0

P(xn+1=1|0)=0.8

01

11

P(xn+1=1|01)=0.4

P(xn+1=1|11)=0.9

0

10

1

leaf

internal node

9

Review of Lossless Source Coding

• Stationary ergodic sources• Entropy rate H=limN H(x)/N• Asymptotically, H is the lowest attainable per-symbol

rate

• Arithmetic coding:– Probability assignment p(x)– Coding length l(x)=-log(p(x))+O(1)– Can achieve entropy + O(1) bits

10

Universal Source Coding

• Source statistics are unknownNeed probability assignment p(x)Need to estimate source modelNeed to describe estimated source (explicitly or implicitly)

• Redundancy: excess coding length above entropy

(x)=l(x)-NH

11

Redundancy Bounds

• Rissanen’s bound (K unknown parameters):E[(x)] > (K/2) [log(N)+O(1)]

• Worst-case redundancy for Bernoulli sequences (K=1): (x*)=maxxXN {(x)} 0.5 log(N/2)

• Asymptotically, (x)/N 0• In practice, e.g., text, the number of parameters

scales almost linearly with N Low redundancy is still essential

12

Semi-Predictive Approach• Semi-predictive methods describe x in two phases:

– Phase I: find a “good” tree source structure S and describe it using codelength lS

– Phase II: encode x using S with probability assignment pS(x)

• Phase I: estimate minimum description length (MDL) tree source model S*=arg min {lS –log(pS(x))}

yS*

Encoder

MDL Estimator

x

Phase I Phase II

13

Semi-Predictive Approach - Phase II• Sequential encoding of x given S*

– Determine which state s of S* generated symbol xi

– Assign xi a conditional probability p(xi|s)

– Arithmetic encoding

• p(xi|s) can be based on previously processed portion of x, quantized probability estimates, etc.

ysDetermine s Assign p(xi|s)xip(xi|s) Arithmetic

Encoder

S*

14

Context Trees

node 0

node 01

node 11

root

unique

sentinel

• We will provide an O(N) semi-predictive algorithm by estimating S* using context trees

• Context trees arrange x in a tree

Each node corresponds to sequence of appended arc labels on path to root

Internal nodes correspond to repeating contexts in x

Leaves correspond to unique contexts Sentinel symbol x0=$ makes sure

symbols have different contexts

15

Context Tree Pruning(“To prune or not to prune…”)

• The MDL structure for state s yields the shortest description for symbols generated by s

• When processing state s:– Estimate MDL structures for states

0s and 1s– Decide whether to keep 0s and 1s or

prune them into state s– Base decision on coding lengths

0s 1s

00s 10s

additional structure

s

MDL structure for 1s

sorMDL structure for state s is

16

Phase I with Atomic Context Trees

• Atomic context tree:Arc labels are atomic (single symbol) Internal nodes are not necessarily branchingHas up to O(N2) nodes

• The coding length minimization of Phase I processes each node of the context tree [Nohre94]

• With atomic context trees, the worst-case complexity is at least O(N2) ☹

10

10

10

10

17

Compact Context Trees

• Compact context tree:Arc labels not necessarily atomicInternal node are branchingO(N) nodesCompact representation of the same tree

• Depth-first traversal of compact context tree provides O(N) complexity

• Theorem: Phase I of BWT-MDL requires O(N) operations performed with O(log(N)) bits of precision

1110

10

18

Phase II of BWT-MDL

• We determine the generator state using a novel algorithm that is based on properties of the Burrows Wheeler transform (BWT)

• Theorem: The BWT-MDL encoder requires O(N) operations performed with O(log(N)) bits of precision

• Theorem: [Willems et. al. 2000]: redundancy w.r.t. any tree source S is at most |S|[0.5 log(N)+O(1)] bits

ysDetermine s Assign p(xi|s)xip(xi|s) Arithmetic

Encoder

S*

19

Distributed/Parallel Compression of Bernoulli Sequences

• Splitter partitions x into B blocks x(1),…,x(B) • Encoder j{1,…,B} compresses x(j); it assigns probabilities p(xi(j)=1)= and p(xi(j)=0)=1- The total probability assigned to x is identical to that in a serial compression system

• This structure assumes that is known; our goal is to provide a universal parallel compression algorithm for Bernoulli sequences

y(1)

y(B)

Assign p(xi(1))p(xi(1))

Arithmetic Encoder 1xi(1)x(1)

Encoder 1

Assign p(xi(B))p(xi(B))

Arithmetic Encoder Bxi(B)x(B)

Encoder B

… ……

Splitterx

20

Two-Part Codes• Two-part codes use a semi-predictive approach to describe Bernoulli sequences:

– First part of code:• Determine the maximum likelihood (ML) parameter estimate ML(x)=n1/(n0+n1)• Quantize ML(x) to rk, one of K representation levels• Describe the bin index k with log(K) bits

– Second part of code encodes x using rk

• In distributed systems:– Sequential compressors require O(N) internal communications– Two-part codes need only communicate {n0(j),n1(j)}j{1,…,B}

Requires O(B log(K)) internal communications

k{1,…,K}

yrk

EncoderDetermine ML(x)x ML(x)

Quantizer

21

Jeffreys Two-Part Code• Quantize ML(x)

Bin edges bk=sin2(k/2K)

Representation levels rk=sin2((2k-1)/4K)

• Use K 1.772N0.5 bins• Source description:

– log(K) bits for describing the bin index k– Need –n1 log(ML(x))-n0log(1-ML(x)) for encoding x

b0 b1 bk-1 bk bK

r1 r2

b2

rKrk

22

Redundancy of Jeffreys Code for Bernoulli Sequences

• Redundancy: log(K) bits for

describing k N D(ML(x)||rk) bits for

encoding x using imprecise model

• D(a||b) is Kullback Leibler divergence

• In bin k, l(x)=-ML(x)log(rk )-[1-ML(x)] log(1-rk ) l( ML (x)) is poly-line• Redundancy = log(K)+ l(ML(x))– N H(ML(x)) log(K) + L

Use quantizers that have small L distance between the entropy function and the induced poly-line fit

23

Redundancy Properties• For x s.t. ML(x) is quantized to rk, the worst-case redundancy is

log(K)+N max{D(bk||rk),D(bk-1||rk)}

• D(bk||rk) and D(bk-1||rk) • Largest in initial or end bins• Similar in the middle bins• Difference reduced over wider range of k for larger N (larger K)

Can construct a near-optimal quantizer by modifying the initial and end bins of the Jeffreys quantizer

24

Redundancy Results

• Theorem: The worst-case redundancy of the Jeffreys code is 1.221+O(1/N) bits above Rissanen’s bound

• Theorem: The worst-case redundancy of the optimal two-part code is 1.047+O(1) bits above Rissanen’s bound

25

Parallel Universal Compression for Bernoulli Sequences

• Phase I:• Parallel units (PUs) compute symbol counts for the

B blocks• Coordinating unit (CU) computes and quantizes the

MDL parameter estimate ML(x) and describes k• Phase II: B PUs encode the B blocks based on rk

x(1)

x(B)

ML(x) [j n0(j)] / [j n0(j)+j n1(j)]

n0(B), n1(B)

n0(1), n1(1)

rk

Encoder 1

Encoder B y(B)

y(1)

k

…x(1)

x(B)Determine n0(B) and n1(B)

Determine n0(1) and n1(1)

…

Quantizer

ML(x)

26

Why do we need Parallel Semi-Predictive Coding?

• Naïve parallelization:• Partition x into B blocks• Compress blocks independently• The redundancy for a length-N/B block is O(log(N/B))Total redundancy is O(B log(N/B))

• Rissanen’s bound is O(log(N)) The redundancy with naïve parallelization is excessive!

x(1)

x(B)

y(1)

y(B)

Splitterx

…

Compressor 1

Compressor B

… …

27

Parallel Semi-Predictive (PSP) Concept

• Phase I:• B parallel units (PUs) accumulate statistics (symbol

counts) on the B blocks• Coordinating unit (CU) computes the MDL tree

source estimate S*

• Phase II:-- B PUs compress the B blocks based on S*

y(1)

y(B)

x(1)

x(B)Statistics Accumulator B

x(1)

x(B)

Statistics Accumulator 1

Coordinating Unit

symbol counts B

symbol counts 1

S*

Compressor 1

Compressor B

S*

Phase I Phase II

……

28

Source Description in PSP

• Phase I: the CU describes the structure of S* and the quantized ML parameter estimates {ks}sS*

• Phase II: each of B PUs compresses block x(b) just like Phase II of the (serial) semi-predictive approach

p(xi(b)|s)1{xi(b)=1}rks+1{xi(b)=0}(1-rks

)

sDetermine sxi(b)

Determine p(xi(b)|s)

p(xi(b)|s)y(b)

Arithmetic Encoder

Describe S*

structureDescribe bin indices {ks}sS*

S* {ks}sS*

Coordinating Unit

Parallel unit b

29

Complexity of Phase I• Phase I processes each node of the context tree [Nohre94]• The CU processes the states of a full atomic context tree

of depth-Dmax, where Dmax log(N/B) • Processing a node:

• Internal node: requires O(1) time

• Leaf: CU adds up block symbol counts to compute each symbol count, i.e., ns

=b ns(b), where {0,1}

The CU processes a leaf node in O(B) time With O(N/B) leaves, the aggregate complexity is O(N),

which is excessive

Dmax

2Dmax=O(N/B)

30

Phase I in O(N/B) Time

• We want to compute ns=b ns

(b) faster• An adder tree incurs O(log(B)) delay for adding up B

block symbol counts• Pipelining enables us to generate a result every O(1) time O(N/B) nodes, each requiring O(1) time

31

Phase II in O(N/B) Time

• The challenging part in Phase II is determining s:• Define the context index for a length-Dmax context s

preceding xi(b) as the binary number that represents s • The length-2Dmax generator table g satisfies gj=sS* if

s is a suffix of the context whose context index is j• We can construct g in O(N/B) time (far from trivial!)

• Compute context indices for all symbols of x(b) and determine the generating states via the generator table g

sDetermine sxi(b) Determine

p(xi(b)|s)p(xi(b)|s)

y(b)Arithmetic Encoder

S* {ks}sS*

Parallel unit b

32

Decoder

• An input bus is demultiplexed to multiple units• The MDL source and quantized ML parameters are

reconstructed• The B compressed blocks y(B) are decompressed on

B decoding units

y(B)

y(1)Decoding Unit 1 x(1)

x(B)

…

Decoding Unit B

S*,{ks}sS*Reconstruct S*,{ks}sS*D

EMUX

bus y

……

33

Theoretical Results

• Theorem: With computations performed with 2 log(N) bits of precision defined as O(1) time:– Phase I of PSP approximates the MDL coding length within

O(1) of the true optimum

– The PSP algorithm requires O(N/B) time

• Theorem: The PSP algorithm uses a total of O(N) words of memory = a total of O(N log(N)) bits

• Theorem: The pointwise redundancy of PSP w.r.t. S* is (x) < B[log(N/B)+O(1)]+|S|*[log(N)/2+O(1)]

parallelization overhead

34

Main Contributions• BWT-MDL (O(N) universal encoder):

An O(N) algorithm that achieves Rissanen’s redundancy bounds on best achievable compression

Combines efficient prefix tree construction with semi-predictive approach to universal coding

• Fast Suffix Sorting (not in this talk):– Core algorithm is very simple (can be implemented in VLSI)– Worst-case complexity O(N log0.5(N)) – Competitive with other suffix sorting methods in practice

• Two-Part Codes: Rigorous analysis of their compression quality Application to distributed/parallel compression– Optimal two-part codes

• Parallel Compression Algorithm (not in this talk): Work-efficient O(N/B) algorithm Compression loss is roughly B log(N/B) bits

35

More…

• Results have been extended to |X|-ary alphabet• Future research can concentrate on:

– Processing broader classes of tree sources– Problems in statistical inference

• Universal classification• Channel decoding• Prediction

– Characterize the design space for parallel compression algorithms

36

Generic Phase Iif (s is a leaf) {

Count symbol appearances ns0 and ns

1

MDLs length(ns0, ns

1)} else { /* s is an internal node */

Recursively compute MDL length and counts for 0s and 1s

ns0 n0s

0+n1s0, ns

1 n0s1+n1s

1

MDLs length(ns0, ns

1)

if (MDLs >MDL0s +MDL1s )Keep 0s and 1s

} else { Prune 0s and 1s, keep s

}}

Fast Parallel Algorithms for Universal Lossless Source Coding

Documents

Transcript of Fast Parallel Algorithms for Universal Lossless Source Coding