Prof. Ja-Ling Wu Department of Computer Science and...
Transcript of Prof. Ja-Ling Wu Department of Computer Science and...
![Page 1: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/1.jpg)
Lempel-Ziv Coding
Prof. Ja-Ling Wu
Department of Computer Science and Information EngineeringNational Taiwan University
![Page 2: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/2.jpg)
2
J.Ziv and A.Lempel, “Compression of individual sequences by variable rate coding, “IEEE Trans. Information Theory, 1978.
Algorithm :The source sequence is sequentially parsed into strings that have not appeared so far.For example, 1011010100010…, will be parsed as
1,0,11,0 1,010,00,10,…After every comma, we look along the input sequence until we come to the shortest string that has not been marked off before. Since this is the shortest such string, all its prefixes must have occurred easier. In particular, the string consisting of all but the last bit of this string must have occurred earlier.
![Page 3: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/3.jpg)
3
We code this phrase by giving the location of the prefix and the value of the last bit.Let C(n) be the number of phrases in the parsing of the input n-sequence. We need log C(n) bits to describe the location of the prefix to the phrase and 1 bit to describe the last bit.In the above example, the code sequence is :(000,1), (000,0), (001,1), (010,1), (100,0), (010,0), (001,0)where the first number of each pair gives the index of the prefix and the 2nd number gives the last bit of the phrase.
![Page 4: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/4.jpg)
4
Decoding the coded sequence is straight forward and we can recover the source sequence without error!!Two passes-algorithm :pass-1 : parse the string and calculate C(n)pass-2 : calculate the points and produce the code string
The two-passes algorithm allots an equal number of bits to all the pointers. This is not necessary, since the range of the pointers is smaller at the initial portion of the string.
• T-A. Welch, A technique for high-performance data Compression, Computer, 1984.
• Bell, Cleary, Witlen, Text Compression, Prentice-Hall, 1990.
Dictionary construction
Pattern Matching
![Page 5: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/5.jpg)
5
Definition :A parsing S of a binary string x1, x2,…,xn is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical.The L-Z algorithm described above gives a distinct parsing of the source sequence. Let C(n) denote the number of phrases in the L-Z parsing of a sequence of length n.The compressed sequence (after applying the L-Z algorithm) consists of a list of c(n) pairs of numbers, each pair consisting of a pointer to the previous
occurrence of the prefix of the phrase and the last bitof the phrase. 1
log c(n)
![Page 6: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/6.jpg)
6
the total length of the compressed sequence is : C(n)(log c(n)+1)
we will show that
for a stationary ergodic sequence X1,X2,…,Xn.
( )( ) ( )χHn
nCnC→
+1log)(
![Page 7: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/7.jpg)
7
Lemma 1 (Lempel and Ziv) :The number of phrase c(n) in a distinct parsing of a binary sequence X1,X2,…,Xn satisfies
where εn→ 0 as n → ∞.
( ) ( ) nnnCn log1 ε−
≤
![Page 8: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/8.jpg)
8
Let
: the sum of the lengths of all distinct strings of length less than or equal to k.
C(n) : the no. of phrases in a distinct parsing of a sequence of length nthis number is maximized when all the phrases are as short as possible.
n=nk → all the phrases are of length ≦ k, thus
If nk≦n<nk+1, we write n=nk+Δ, then Δ=n-nk < nk+1-nk = (k+1)2k+1
( ) 2212 1
1+−=⋅= +
=∑ k
k
j
jk kjn
( )11
22222 11
1 −≤
−−
=≤−=≤ ++
=∑ k
nk
nnC kkkKk
j
j
![Page 9: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/9.jpg)
9
Then the parsing into shortest phrases has each of the phases of length≦ k and phrases of the length k+1.
We now bound the size of k for a given n.Let nk ≦ n < nk+1. Then n ≧ nk = (k-1)2k+1+2 ≧ 2k
→ k ≦ log nn ≦ nk+1 = k.2k+2+2 ≦ (k+2)2k+2
≦ [ (log n+2) .2k+2 ]
1+Δ
k( ) ( )
( )1
11111
−≤⇒
−=
−Δ+
≤+Δ
+−
≤+Δ
+=
KnnC
Kn
kn
kKn
knCnC kk
k
2loglog2
+≥+⇒
nnk
![Page 10: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/10.jpg)
10
for all n ≧ 4( ) ( )
( )
( )
( )
( )
( ) ( )( )
∞→→+
=
−=
−≤⇒
−=
⎟⎟⎠
⎞⎜⎜⎝
⎛ +−=
⎟⎟⎠
⎞⎜⎜⎝
⎛ +−≥
⎟⎟⎠
⎞⎜⎜⎝
⎛ ++−=
−+−≥−+=−
nnn
nn
KnnC
n
nnn
nnn
nn
n
nnkk
n
n
n
as 0log
4loglogwhere
log11
log1
loglog
4loglog1
loglog
3log2log1
loglog
32loglog1
32logloglog321
ε
ε
ε
![Page 11: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/11.jpg)
11
Lemma 2 :Let Z be a positive integer valued r.v. with mean μ.Then the entropy H(z) is bound byH(z) ≦ (μ+1) log(μ+1) –μlogμ
pf : H W.
![Page 12: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/12.jpg)
12
Let be a stationary ergodic process with pmfp(x1,x2,…xn). For fixed integer k, defined the kth order Markov approximation to P as
, and the initial state will be part of the specification of Qk.
Since is itself an ergodic process, we have
{ }∞ −∞=iiX
( )( ) ( )( ) ( )( ) jixxxx
xxPxPxxxxQ
jiij
i
n
j
jkjjknKk
≤Δ
Δ
+
=
−−−−−− ∏
,,...,, where
,...,,,...
1
1
101101
( )0
1−− kx( )1−
−n
knn XXP
( )( ) ( )( ) ( )11
1
10121
log
log1,...,,log1
−−
−−
=
−−−−
Δ−→
−=− ∑j
kjjj
kjj
n
j
jkjjknK
XXHXXPE
XXPn
XXXXQn
![Page 13: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/13.jpg)
13
We will bound the rate of the L-Z code by the entropy rate of the k-th order Markov approximation for all k. The entropy rate of the Markov approximationconverges to the entropy rate of the process as k → ∞and this will prove the result.
( )1−−j
kjj XXH
![Page 14: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/14.jpg)
14
S’pose , and s’pose that is parsed into C distinct phrases, y1,y2,…,yc. Let νi be the index of the start of the i-th phrase, i.e., . For each i=1,2,…,c, defines . Thus, Si is the k bits of x preceding yi, of course,
Let Cls be the number of phrases yi with length l and preceding state Si=S for l=1,2,… and , we thenhave
and
( ) ( )n
kn
k xX 11 −−−− = nx1
11 −+= i
ixy iνν1−
−= i
i ki xS νν
( )0
11 −−= kxS
kXs∈
nlC
CC
slls
slls
=
=
∑
∑
,
,
![Page 15: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/15.jpg)
15
Lemma 3 : ( Ziv’s inequality )For any distinct parsing (in particular, the L-Z parsing) of the string x1,x2,…,xn, we have
Note that the right hand side does not depend on Qk.
proof : we write
( ) ∑−≤sl
lslsnk CCsxxxQ,
121 log,...,,log
( ) ( )
( )∏=
=
=C
iii
cnk
syP
syyyQsxxxQ
1
121121 ,...,,,...,,
![Page 16: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/16.jpg)
16
( ) ( )
( )
( )
( )∑ ∑
∑∑
∑ ∑
∑
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛≤
=
=
=
==
==
==
=
slSs
lyiii
lsls
Sslyi
iilssl
ls
sl sslyiii
C
iiink
i
i
ii
ii
syPC
C
syPC
C
syP
syPsxxxQ
, :
:,
, ,:
1121
1log
log1
log
log,...,,logor
Now since the yi are distinct, we have ( )∑==
≤
Sslyi
ii
i
i
syP
:
1
( ) ∑≤⇒sl ls
lsnK CCsxxxQ
,121
1log,...,,log
![Page 17: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/17.jpg)
17
Theorem :Let {Xn} be a stationary ergodic process with entropy rate H(XX), and let C(n) be the number of phrases in a distinct parsing of a sample of length n from this process. Then
with probability 1.
( ) ( ) ( )XHn
nCnCn
≤∞→
logsuplim
![Page 18: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/18.jpg)
18
( )
∑∑
∑
∑
=Π=Π
=Π
−−=
⋅−≤
slsl
slsl
lsls
sl
lsls
ls
lslsnk
Cnl
CC
CC
CCCCC
CCCCsxxxQ
,,
,,
,
121
,1
have we, writting
loglog
log,...,,log:pf
![Page 19: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/19.jpg)
19
We now define r.v.’s U,V, such that Pr(U=l,V=s) = Πl,s
Thus E(U)=n/c, and
Since H(U,V) ≦ H(U) + H(V)and H(V) ≦ log |XX|k = k
From Lemma 2, we have
( ) ( )VUHnCC
nCsxxxQ
n
ccVUCHsxxxQ
nK
nk
,log,...,,log1log),()|,...,,(log
121
121
−≥−⇒
−≤
( ) ( ) ( )
( ) ( ).1log, , Thus
1log1log
log1log1
log1log1
OCn
nCk
nCVUH
nC
nC
Cn
Cn
Cn
Cn
Cn
Cn
EUEUEUEUUH
++≤
⎟⎠⎞
⎜⎝⎛ +⎟
⎠⎞
⎜⎝⎛ ++=
−⎟⎠⎞
⎜⎝⎛ +⎟
⎠⎞
⎜⎝⎛ +=
−++≤
![Page 20: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/20.jpg)
20
For a given n, the maximum of is attained for
the maximum value of C . But from Lemma 1,
Cn
nC log
⎟⎠⎞
⎜⎝⎛ ≤
enC 1for
( )( )
( ) ∞→→
⎟⎟⎠
⎞⎜⎜⎝
⎛≤
+≤
n as 0, thereforeand
loglogloglog
Thus . 11log
VUHnC
nnO
Cn
nC
On
nC
![Page 21: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/21.jpg)
21
Therefore,
where εk(n) →0 as n→∞. Hence, with probability 1,
( ) ( ) ( ) ( )nsxxxQnn
nCnCknk ε+−
≤ 121 ,...,,log1log
( ) ( )( )( )
( ) ( ) .,...,
,...,,log1limlogsuplim
10
0121
∞→→=
−≤
−−
−−∞→∞→
kasXHXXXH
XXXXQnn
nCnC
k
knknn
![Page 22: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/22.jpg)
22
Theorem : Let be a stationary ergodic stochastic process. Let l(X1,X2,…Xn) be the L-Z codeword length associated with X1,X2,…,Xn. Then
pf :
The L-Z code is a universal code, in which, the code does not depend on the distribution of the source!!
( ) ( ) ( )( )( )
( ) ( ) ( ) ( )
( ) 1.y probabilit with ,
logsuplim,...,,suplim
0suplim 1, Lemmaby
1log,...,,
21
21
XHnnC
nnCnC
nXXXl
nnC
nCnCXXXl
n
n
n
n
≤
⎟⎠⎞
⎜⎝⎛ +=⇒
=
+=
∞→∞→
( ) ( ) 1.y probabilitwith ,...,,1lim 21 XHXXXln nn
≤∞→
{ }∞ −∞=iiX
![Page 23: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/23.jpg)
23
Optimal Variable Length-to-Fixed Length CodeAlgorithm :
step 1. 永遠分最大的 nodestep 2. 分到有 2l leaf nodes 則停
Example : Source data input : A B C C B A A A CPA=0.7PB=0.2 code-tree AAAAPC=0.1 AAA 0.343 AAAB
AA 0.49 AAB 0.098 AAACA 0.7 AB 0.14 AAC 0.049
root B 0.2 AC 0.07 C 0.1
![Page 24: Prof. Ja-Ling Wu Department of Computer Science and ...cmlab.csie.ntu.edu.tw/~itct/slide/Data_Compression_Lempel-Ziv_Coding.pdf · 2 J.Ziv and A.Lempel, “Compression of individual](https://reader030.fdocuments.in/reader030/viewer/2022040115/5e8b90aa3b12b70131307d5d/html5/thumbnails/24.jpg)
24
A 1011B 1010C 1001AA 1000AB 0111AC 0110AAA 0101AAB 0100AAC 0011 AAAA 0010AAAB 0001AAAC 0000
A B, C, C, B, A A A C
0111 1001 1001 1010 0000
Tunstall ’77
GIT ph.D. Thesis