Text Algorithms (4AP) Lecture: Compression
Jaak Vilo
2008 fall
Problem
• Compress
– Text
– Images, video, sound, …
• Reduce space, efficient communication, etc…
• Exact compression/decompression
• Lossy compression
• Managing Gigabytes: Compressing and Indexing Documents and Images
• Ian H. Witten, Alistair Moffat, Timothy C. Bell
• Hardcover: 519 pages. Publisher: Morgan Kaufmann; 2nd Revised edition (11 May 1999). Language: English. ISBN‐10: 1558605703
Links
• http://datacompression.info/
• http://en.wikipedia.org/wiki/Data_compression
• Data Compression, Debra A. Lelewer and Daniel S. Hirschberg – http://www.ics.uci.edu/~dan/pubs/DataCompression.html
• Compression FAQ http://www.faqs.org/faqs/compression‐faq/
• Information Theory Primer With an Appendix on Logarithms, by Tom Schneider – http://www.lecb.ncifcrf.gov/~toms/paper/primer/
• http://www.cbloom.com/algs/index.html
Problem
• Information transmission
• Information storage
• The data sizes are huge and growing:
– fax: 1.5 × 10⁶ bit/page
– photo: 2M pixels × 24 bit = 6 MB
– X‐ray image: ~100 MB?
– Microarray scanned image: 30–100 MB
– Tissue‐microarray: hundreds of images, each tens of MB
– Large Hadron Collider (CERN): the device will produce a few petabytes (10¹⁵ bytes) of stored data in a year
– TV (PAL): 2.7 ∙ 10⁸ bit/s
– CD‐sound, super‐audio, DVD, ...
What is it about?
• Elimination of redundancy
• Being able to predict…
• Compression and decompression
– Compression: represent the data in a more compact way
– Decompression: restore the original form
• Lossy and lossless compression
– Lossless: restore an exact copy
– Lossy: restore almost the same information
• Useful when 100% accuracy is not needed
• voice, images, movies, ...
• Decompression is deterministic (the loss happens in the compression phase)
• Can achieve much better compression ratios
Methods covered:
• Code words (Huffman coding)
• Run‐length encoding
• Arithmetic coding
• Lempel‐Ziv family (compress, gzip, zip, pkzip, ...)
• Burrows‐Wheeler family (bzip2)
• Other methods, including images
• Kolmogorov complexity
• Search from compressed texts
[Diagram: Data → Encoder → Compressed data → Decoder → Data, with a Model component feeding both the Encoder and the Decoder]
• Let p_S be the probability of message S
• The information content can be represented in terms of bits
• I(S) = −log₂(p_S) bits
• If p_S = 1, then the information content is 0 (no new information)
– If Pr[s] = 1 then I(s) = 0.
– In other words, I(death) = I(taxes) = 0
• I(heads or tails) = 1 bit – if the coin is fair
• Entropy H is the average information content over all messages S
– H = Σ_S p_S ∙ I(S) = −Σ_S p_S ∙ log₂(p_S) bits
http://en.wikipedia.org/wiki/Information_entropy
• Shannon's experiments with human predictors show an information rate of between 0.6 and 1.3 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.
• The data compression world is all abuzz about Marcus Hutter’s recently announced 50,000 euro prize for record‐breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney, is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros – the prize amount is prorated based on how well you beat the current record.)
• This prize differs considerably from my Million Digit Challenge, which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he’s putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney’s paq8f , the current record holder at 18.3 MB. (Alexander Ratushnyak’s submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)
• So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in a stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence.
• http://marknelson.us/2006/08/24/the‐hutter‐prize/#comment‐293
• No compression can on average achieve better compression than the entropy
• Entropy depends on the model (or choice of symbols)
• Let M = {m_1, ..., m_n} be the set of symbols of model A, and let p(m_i) be the probability of symbol m_i
• The entropy of model A is H(M) = −∑_{i=1..n} p(m_i) ∙ log₂(p(m_i)) bits
• Let the message S = s_1 ... s_k, where every symbol s_i is in the model M. The information content of S under model A is −∑_{i=1..k} log₂ p(s_i) bits
• Every symbol has to have a probability, otherwise it cannot be coded if it is present in the data
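To make these formulas concrete, here is a minimal sketch (Python; not from the slides) that computes a model's entropy and a message's information content, using the 40-symbol example string from a later slide:

```python
import math
from collections import Counter

def entropy(probs):
    """H(M) = -sum_i p(m_i) * log2 p(m_i), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(message, p):
    """-sum_i log2 p(s_i): bits needed to code the whole message under model p."""
    return -sum(math.log2(p[s]) for s in message)

S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
p = {sym: c / len(S) for sym, c in Counter(S).items()}
print(entropy(p.values()))        # about 2.89 bits per symbol
print(information_content(S, p))  # about 116 bits for the 40-symbol message
```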
Static or adaptive
• Static model does not change during the compression
• Adaptive model can be updated during the process
• Symbols not in message cannot have 0‐probability
• Semi‐adaptive model works in 2 stages, off‐line.
• First create the code table, then encode the message with the code table
How to compare compression techniques?
• Compression ratio t/p, where t is the original message length and p is the compressed message length
• In texts ‐ bits per symbol
• The time and memory used for compression
• The time and memory used for decompression
• Error tolerance (e.g. self‐correcting codes)
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• Alphabet of 8
• Length = 40 symbols
• Equal length codewords
• 3‐bit codes: a=000, b=001, c=010, d=011, e=100, f=101, g=110, space=111
• S compressed ‐ 3*40 = 120 bits
Run‐length encoding
• http://michael.dipperstein.com/rle/index.html
• The string:
• "aaaabbcdeeeeefghhhij"
• may be replaced with
• "a4b2c1d1e5f1g1h3i1j1".
• This is not shorter because 1‐letter repeat takes more characters...
• "a3b1cde4fgh2ij"• a3b1cde4fgh2ij
• Now we need to know which characters are followed by run‐length.
• E.g. use escape symbols.
• Or, use the symbol itself ‐ if repeated, then must be followed by run‐length
• "aa2bb0cdee3fghh1ij"
Coding the alphabetically ordered word‐lists
resume → 0resume
retail → 2tail
retain → 5n
retard → 4rd
retire → 3ire

(each word is coded as the length of the prefix shared with the previous word, followed by the differing suffix)
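A sketch of this front coding scheme (Python; the prefix length is written as decimal digits, so it assumes words do not start with digits):

```python
import os.path
import re

def front_encode(words):
    """Sorted word list -> shared-prefix length + differing suffix."""
    out, prev = [], ''
    for w in words:
        k = len(os.path.commonprefix([prev, w]))
        out.append(str(k) + w[k:])
        prev = w
    return out

def front_decode(codes):
    out, prev = [], ''
    for c in codes:
        m = re.match(r'(\d+)(.*)', c)
        prev = prev[:int(m.group(1))] + m.group(2)
        out.append(prev)
    return out

words = ['resume', 'retail', 'retain', 'retard', 'retire']
assert front_encode(words) == ['0resume', '2tail', '5n', '4rd', '3ire']
assert front_decode(front_encode(words)) == words
```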
Coding techniques
• Coding refers to techniques used to encode tokens or symbols.
• Two of the best known coding algorithms are Huffman Coding and Arithmetic Coding.
• Coding algorithms are effective at compressing data when they use fewer bits for high probability symbols and more bits for low probability symbols.
Variable length encoders
• How to use codes of variable length?
• The decoder needs to know how long each codeword is
• Prefix‐free code: no code can be a prefix of another code
• Calculate the frequencies and probabilities of symbols:

symbol  freq  ratio  p(s)
a       2     2/40   0.05
b       3     3/40   0.075
c       4     4/40   0.1
d       5     5/40   0.125
space   5     5/40   0.125
e       6     6/40   0.15
f       7     7/40   0.175
g       8     8/40   0.2
Algorithm: Shannon‐Fano
• Input: probabilities of symbols
• Output: Codewords in prefix free coding
1. Sort symbols by frequency
2. Divide into two groups of (almost) equal total probability
3. The first group gets prefix 0, the other prefix 1
4. Repeat recursively in each group until 1 symbol remains
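A recursive sketch of these four steps (Python; the split point is chosen to make the two halves' total probabilities as equal as possible). On the probabilities of the later example it reproduces the codes shown there:

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs. Returns {symbol: codeword}."""
    symbols = sorted(symbols, key=lambda sp: sp[1], reverse=True)  # step 1
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or '0'
            return
        total, acc = sum(p for _, p in group), 0.0
        best, cut = None, 1
        for i in range(1, len(group)):            # step 2: most equal split
            acc += group[i - 1][1]
            diff = abs(2 * acc - total)
            if best is None or diff < best:
                best, cut = diff, i
        split(group[:cut], prefix + '0')          # step 3: prefixes 0 and 1
        split(group[cut:], prefix + '1')          # step 4: recurse

    split(symbols, '')
    return codes

probs = [('g', .2), ('f', .175), ('e', .15), ('d', .125),
         (' ', .125), ('c', .1), ('b', .075), ('a', .05)]
print(shannon_fano(probs))   # g -> 00, f -> 010, e -> 011, d -> 100, ...
```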
Example 1
a  1/2   0
b  1/4   10
c  1/8   110
d  1/16  1110
e  1/32  11110
f  1/32  11111
Shannon‐Fano
S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
       p(s)   code
g      0.2    00
f      0.175  010
e      0.15   011
d      0.125  100
space  0.125  101
c      0.1    110
b      0.075  1110
a      0.05   1111

(Tree: the root (1) splits into 0.525 {g, f, e} and 0.475 {d, space, c, b, a}; 0.525 splits into 0.2 {g} and 0.325 {f, e}, the latter into 0.175 and 0.15; and so on.)
Shannon‐Fano
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• S in compressed is 117 bits
• 2*4 + 3*4 + 4*3 + 5*3 + 5*3 + 6*3 + 7*3 + 8*2 = 117
• Shannon‐Fano is not always optimal
• Sometimes two groups of equal probability cannot be achieved
• Usually achieves better than H+1 bits per symbol, where H is the entropy.
Huffman code
• Works the opposite way.
• Start from the two least probable symbols and separate them with 0 and 1 (as a suffix)
• Add the probabilities to form a "new symbol" with the combined probability
• Prepend new bits in front of the old ones.
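A heap-based sketch of this bottom-up construction (Python; it repeatedly merges the two least probable entries and prepends a bit to every code inside the merged group):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """freqs: {symbol: frequency}. Returns an optimal prefix-free {symbol: code}."""
    heap = [(w, i, {sym: ''}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)   # least probable group
        w1, _, c1 = heapq.heappop(heap)   # second least probable group
        merged = {s: '0' + c for s, c in c0.items()}   # prepend the new bit
        merged.update({s: '1' + c for s, c in c1.items()})
        tiebreak += 1
        heapq.heappush(heap, (w0 + w1, tiebreak, merged))
    return heap[0][2]

text = 'this is an example of a huffman tree'
codes = huffman_codes(Counter(text))
print(sum(len(codes[ch]) for ch in text), 'bits in total')
```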
Huffman example – "this is an example of a huffman tree"

Char   Freq  Code
space  7     111
a      4     010
e      4     000
f      3     1101
h      2     1010
i      2     1000
m      2     0111
n      2     0010
s      2     1011
t      2     0110
l      1     11001
o      1     00110
p      1     10011
r      1     11000
u      1     00111
x      1     10010
• Huffman coding is optimal when the frequencies of input characters are powers of two. Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties
• (as of November 2001/Jul 2006, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions). http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents_on_arithmetic_coding
Properties of Huffman coding
• Error tolerance is quite good
• In case of a lost, added, or changed single bit, the differences remain local to the place of the error
• Errors usually remain quite local (proof?)
• The code has been shown to be optimal
• It can be shown that the average code length is at most H + p + 0.086, where H is the entropy and p is the probability of the most probable symbol
Move to Front
• Move to Front (MTF), Least Recently Used (LRU)
• Keep a list of last k symbols of S
• Code
– use the code (index) of the symbol
– if in the codebook, move it to the front
– if not in the codebook, insert it at the front and remove the last
• cf. the handling of memory paging
• Other heuristics ...
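A sketch of the basic scheme (Python; the full-alphabet variant, where the codebook is the whole alphabet, so the not-in-codebook escape of the bounded-k LRU variant above is never needed):

```python
def mtf_encode(data, alphabet):
    """Emit the current list position of each symbol, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in data:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(indices, alphabet):
    table = list(alphabet)
    out = []
    for i in indices:
        out.append(table[i])
        table.insert(0, table.pop(i))
    return ''.join(out)

enc = mtf_encode('bananaaa', 'abcdefghijklmnopqrstuvwxyz')
print(enc)   # [1, 1, 13, 1, 1, 1, 0, 0] -- repeats become small numbers
assert mtf_decode(enc, 'abcdefghijklmnopqrstuvwxyz') == 'bananaaa'
```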
Arithmetic (en)coding
• Arithmetic coding is a method for lossless data compression.
• It is a form of entropy encoding, but where other entropy encoding techniques separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0.
• Huffman coding is optimal for character encoding (one character, one code word) and simple to program. Arithmetic coding is better still, since it can allocate fractional bits, but it is more complicated.
• Wikipedia http://en.wikipedia.org/wiki/Arithmetic_encoding
• The entire message is a single floating point number, a fraction n where 0.0 ≤ n < 1.0.
• Every symbol gets a probability based on the model
• Probabilities represent non‐intersecting intervals
• Every text is such an interval
Let P(A)=0.1, P(B)=0.4, P(C)=0.5

A [0, 0.1):   AA [0, 0.01)    AB [0.01, 0.05)  AC [0.05, 0.1)
B [0.1, 0.5): BA [0.1, 0.14)  BB [0.14, 0.3)   BC [0.3, 0.5)
C [0.5, 1):   CA [0.5, 0.55)  CB [0.55, 0.75)  CC [0.75, 1)
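The interval narrowing can be sketched in a few lines (Python; exact Fraction arithmetic dodges the precision problem discussed on the next slide, where real coders use incremental integer arithmetic):

```python
from fractions import Fraction as F

MODEL = {'A': (F(0), F(1, 10)),      # P(A) = 0.1
         'B': (F(1, 10), F(1, 2)),   # P(B) = 0.4
         'C': (F(1, 2), F(1))}       # P(C) = 0.5

def interval(message):
    """Narrow [low, high) once per symbol; any number inside encodes the message."""
    low, high = F(0), F(1)
    for ch in message:
        lo, hi = MODEL[ch]
        low, high = low + (high - low) * lo, low + (high - low) * hi
    return low, high

print(interval('BB'))   # (Fraction(7, 50), Fraction(3, 10)) = [0.14, 0.3), as in the table
```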
• Add an EOF symbol.
• Problem: requires infinite‐precision arithmetic
• Alternative: work blockwise, using integer arithmetic
• Works if the smallest p_i is not too small
• Best compression ratio
• Problems: speed, and error tolerance (a small change has a catastrophic effect)
• Arithmetic coding revisited, by Alistair Moffat, Radford M. Neal, Ian H. Witten – http://portal.acm.org/citation.cfm?id=874788
• Models for arithmetic coding:
• HMM – Hidden Markov Models
• ...
• Context methods: Abrahamson dependency model
• Use the context to maximum, to predict the next symbol
• PPM ‐ Prediction by Partial Matching
• Several contexts, choose best
• Variations
Dictionary based
• Dictionary (symbol table) , list codes
• If not in the dictionary, use an escape
• The usual heuristic searches for the longest repeat
• With fixed table one can search for optimal code
• With adaptive dictionary the optimal coding is NP‐complete
• Quite good for English language texts, for example
Lempel‐Ziv, LZ, LZ‐77
• Use the dictionary to memorise the previously compressed parts
• LZ‐77
• Sliding window of previous text; and text to be compressed
/bbaaabbbaaffacbbaa…./...abbbaabab...
• Lookahead: the longest prefix that begins within the moving window is encoded as [position, length]
• In the example, [5,6]
• Fast! (Commercial software, e.g. PKZIP, Stacker, DoubleSpace, ...)
• Several alternative codes for same string (alternative substrings will match)
• Lempel‐Ziv compression from McGill Univ. http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/
Original LZ77
• Triples [position,length,next char]
• If output [a,b,c], advance by b+1 positions
• For each part of the triple the nr of bits is reserved depending on window length
log(n−f) + log(f) + 8, where n is the window size and f is the lookahead size
• Example: abbbbabbc [0,0,a] [0,0,b] [1,3,a] [3,2,c]
• In this example the match actually overlaps with the lookahead window
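A naive sketch of the triple encoder (Python; real implementations find matches via hashing or suffix structures rather than this quadratic scan). Note how the decoder handles matches that overlap the lookahead:

```python
def lz77_encode(s, window=4096, lookahead=16):
    """Emit (offset, length, next_char) triples; offset counts back from the cursor."""
    i, out = 0, []
    while i < len(s):
        best_off, best_len = 0, 0
        for j in range(i - 1, max(0, i - window) - 1, -1):  # nearest match first
            k = 0
            # the match may run past the cursor, i.e. overlap the lookahead
            while k < lookahead and i + k < len(s) - 1 and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, s[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    s = []
    for off, length, ch in triples:
        for _ in range(length):
            s.append(s[-off])   # copying one char at a time makes overlap work
        s.append(ch)
    return ''.join(s)

enc = lz77_encode('abbbbabbc')
print(enc)   # [(0, 0, 'a'), (0, 0, 'b'), (1, 3, 'a'), (3, 2, 'c')]
assert lz77_decode(enc) == 'abbbbabbc'
```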
LZ‐78
• Dictionary
• Store strings from the already processed part of the message
• The next code is the longest match from the dictionary against the text still to be processed
• LZ78 (Ziv and Lempel 1978)
• First, dictionary is empty, with index 0
• Code [i,c] ‐ refers to dictionary (word u at pos. i) and c is the next symbol
• Add uc to dictionary
• Example: ababcabc → [0,a][0,b][1,b][0,c][3,c]
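A sketch of LZ78 (Python; the dictionary starts with only the empty word at index 0, and each output pair [i, c] adds the new word u·c):

```python
def lz78_encode(s):
    """Emit (index, char) pairs; index refers to a previously seen dictionary word."""
    dictionary = {'': 0}           # index 0 is the empty word
    out, w = [], ''
    for ch in s:
        if w + ch in dictionary:
            w += ch                # keep extending the match
        else:
            out.append((dictionary[w], ch))
            dictionary[w + ch] = len(dictionary)   # add u.c as a new word
            w = ''
    if w:                          # flush a pending match with no following char
        out.append((dictionary[w[:-1]], w[-1]))
    return out

def lz78_decode(pairs):
    words = ['']
    for i, ch in pairs:
        words.append(words[i] + ch)
    return ''.join(words[1:])

enc = lz78_encode('ababcabc')
print(enc)   # [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (3, 'c')]
assert lz78_decode(enc) == 'ababcabc'
```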
LZW
• Code consists of indices only!
• Initially, the dictionary contains every symbol of the alphabet
• Update dictionary like LZ78
• In decoding there is a danger: See abcabca – If abc is in dictionary
– add abca to dictionary
– next is abca, output that code
– But when decoding, after abc it is not known that abca is in the dictionary
• Solution ‐ if the dictionary entry is used immediately after its creation, the 1st and last characters have to match
• Many representations for the dictionary.
• List, hash, sorted list, combination, binary tree, trie, suffix tree, ...
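A sketch of LZW including the tricky decoding case described above (Python; with input 'abababab' the code for 'aba' is used immediately after its creation, exercising the first-equals-last-character rule):

```python
def lzw_encode(s, alphabet):
    """Output is a list of dictionary indices only."""
    d = {ch: i for i, ch in enumerate(alphabet)}
    out, w = [], ''
    for ch in s:
        if w + ch in d:
            w += ch
        else:
            out.append(d[w])
            d[w + ch] = len(d)       # update the dictionary like LZ78
            w = ch
    if w:
        out.append(d[w])
    return out

def lzw_decode(codes, alphabet):
    words = {i: ch for i, ch in enumerate(alphabet)}
    w = words[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in words:
            entry = words[code]
        else:                        # entry was created just now by the encoder:
            entry = w + w[0]         # its first and last characters must match
        out.append(entry)
        words[len(words)] = w + entry[0]
        w = entry
    return ''.join(out)

enc = lzw_encode('abababab', 'ab')
print(enc)   # [0, 1, 2, 4, 1] -- code 4 is used right after being created
assert lzw_decode(enc, 'ab') == 'abababab'
```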
LZJ
• Coding ‐ search for longest prefix.
• Code ‐ address of the trie node
• From the root of the trie, there is a transition on every symbol (like in LZW).
• If out of memory, remove those nodes/branches that have been used only once
• In practice, h=6, dictionary has 8192 nodes
LZFG
• Effective LZ method
• From LZJ
• Create a suffix tree for the window
• Code: the node address plus the number of characters along the edge
• Internal and leaf nodes get different codes
• Short matches are coded directly... (?)
Burrows‐Wheeler
• See FAQ http://www.faqs.org/faqs/compression‐faq/part2/section‐9.html
• The method described in the original paper is really a composite of three different algorithms:
– the block sorting main engine (a lossless, very slightly expansive preprocessor),
– the move‐to‐front coder (a byte‐for‐byte simple, fast, locally adaptive noncompressive coder) and
– a simple statistical compressor (first order Huffman is mentioned as a candidate) eventually doing the compression.
• Of these three methods only the first two are discussed here as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that – with typical input – skews the first order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher order probabilities of the input block (thus making them more even, whitening them) to slack in the lower order statistics. This effect is what is seen in the histogram of the resulting symbol data.
• Please, read the article by Mark Nelson:
• Data Compression with the Burrows‐Wheeler Transform Mark Nelson, Dr. Dobb's Journal September, 1996. http://marknelson.us/1996/09/01/bwt/
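A naive sketch of the block-sorting transform and its inverse (Python; real implementations use suffix-array sorting and a linear-time inverse, and feed the output through MTF and an entropy coder as described above). It assumes '\0' does not occur in the input:

```python
def bwt_encode(s):
    """Sort all rotations of s (terminated by a unique marker); keep the last column."""
    s += '\0'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)

def bwt_decode(last_col):
    """Invert by repeatedly prepending the last column and re-sorting."""
    table = [''] * len(last_col)
    for _ in range(len(last_col)):
        table = sorted(last_col[i] + table[i] for i in range(len(last_col)))
    return next(r for r in table if r.endswith('\0')).rstrip('\0')

enc = bwt_encode('banana')
print(repr(enc))             # 'annb\x00aa' -- equal characters cluster together
assert bwt_decode(enc) == 'banana'
```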
CODE (sorted rotations beginning with "hat", each shown with its preceding character in front – note how the first column clusters into runs of the same letter):
t: hat acts like this:<13><10><1
t: hat buffer to the constructor
t: hat corrupted the heap, or wo
W: hat goes up must come down<13
t: hat happens, it isn't likely
w: hat if you want to dynamicall
t: hat indicates an error.<13><1
t: hat it removes arguments from
t: hat looks like this:<13><10><
t: hat looks something like this
t: hat looks something like this
t: hat once I detect the mangled
Syntactic compression
• Context Free Grammar for representing the syntax tree
• Usually for source code
• Assumption ‐ program is syntactically correct
• Comments
• Features, constants ‐ group by group
Image compression
• Many images, photos, sound, video, ...
Fax group 3
• Black/white, 0/1 code
• Run‐length: 000111001000 → 3,3,2,1,3
• Variable‐length codes for representing run‐lengths.
• Joint Photographic Experts Group, JPEG 2000 – http://www.jpeg.org/jpeg2000/
• Image Compression ‐‐ JPEG from W.B. Pennebaker, J.L. Mitchell, "The JPEG Still Image Data Compression Standard", Van Nostrand Reinhold, 1993.
• Color image, 8 or 12 bits per pixel per color.
• Four modes:
– Sequential Mode
– Lossless Mode
– Progressive Mode
– Hierarchical Mode
• DCT (Discrete Cosine Transform)
• from http://www.utdallas.edu/~aria/mcl/post/
• Lossy signal compression works on the basis of transmitting the "important" signal content, while omitting other parts (Quantization). To perform this quantization effectively, a linear de‐correlating transform is applied to the signal prior to quantization. All existing image and video coding standards use this approach. The most commonly used transform is the Discrete Cosine Transform (DCT) used in JPEG, MPEG‐1, MPEG‐2, H.261 and H.263 and its descendants. For a detailed discussion of the theory behind quantization and justification of the usage of linear transforms, see reference [1] below.
• A brief overview of JPEG compression is as follows. The JPEG encoder partitions the image into 8x8 blocks of pixels. To each of these blocks it applies a 2‐dimensional DCT. The transform matrix is normalized (element‐wise) by an 8x8 quantization matrix and then rounded to the nearest integer. This operation is equivalent to applying different uniform quantizers to different frequency bands of the image. The high‐frequency image content can be quantized more coarsely than the low‐frequency content, due to two factors.
• L9_Compression/lena/
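A sketch of that per-block DCT + quantization step (Python/NumPy; the quantization matrix below is illustrative, not the standard JPEG luminance table):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis: coefficients = C @ block @ C.T."""
    k = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C

# coarser quantization steps for higher frequencies (illustrative values)
Q = 16 + 8 * (np.arange(8)[:, None] + np.arange(8)[None, :])

block = np.random.randint(0, 256, (8, 8)).astype(float) - 128   # level shift
C = dct_matrix()
coeffs = C @ block @ C.T                    # 2-D DCT of one 8x8 block
quantized = np.round(coeffs / Q)            # element-wise normalize and round
restored = C.T @ (quantized * Q) @ C + 128  # dequantize, inverse DCT, undo shift
print(np.count_nonzero(quantized), 'of 64 coefficients remain nonzero')
```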
Vector quantization
• Dictionary method
• 2‐dimensional blocks
Discrete cosine transform
• A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images (where small high‐frequency components can be discarded), to spectral methods for the numerical solution of partial differential equations. The use of cosine rather than sine functions is critical in these applications: for compression, it turns out that cosine functions are much more efficient (as explained below, fewer are needed to approximate a typical signal), whereas for differential equations the cosines express a particular choice of boundary conditions.
• http://en.wikipedia.org/wiki/Discrete_cosine_transform
[Figure: 2‐D DCT (type II) compared to the DFT. For both transforms, the magnitude of the spectrum is on the left and the histogram on the right; both spectra are cropped to 1/4, to zoom in on the behaviour at the lower frequencies. The DCT concentrates most of the power in the lower frequencies.]
• In particular, a DCT is a Fourier‐related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. There are eight standard DCT variants, of which four are common.
• Digital Image Processing: http://www‐ee.uta.edu/dip/
Block Diagram of JPEG Baseline
[Figure from Wallace's JPEG tutorial (1993), via ENEE631 Digital Image Processing (Spring'04), Lec13 – Transf. Coding & JPEG]
RGB Components
• 475 × 330 × 3 = 157 KB; luminance
[Figures from Liu's EE330 (Princeton), via ENEE631 Digital Image Processing (Spring'04), Lec13 – Transf. Coding & JPEG]
Y U V (Y Cb Cr) Components
• Assign more bits to Y, fewer bits to Cb and Cr

JPEG Compression (Q=75%)
• 45 KB, compression ratio ~4:1
[Figures from Liu's EE330 (Princeton), via ENEE631 Lec13 – Transf. Coding & JPEG]
JPEG Compression (Q=75% & 30%)
• 45 KB vs 22 KB
[Figures from Liu's EE330 (Princeton), via ENEE631 Lec13 – Transf. Coding & JPEG]
Uncompressed (100 KB)
• JPEG 75% (18 KB)
• JPEG 50% (12 KB)
• JPEG 30% (9 KB)
• JPEG 10% (5 KB)
[Figures from EE408G slides (created by M. Wu & R. Liu © 2002, UMCP), via ENEE631 Lec13 – Transf. Coding & JPEG]
1.4‐billion‐pixel digital camera
• Monday, November 24, 2008 – http://www.technologyreview.com/computing/21705/page1/
• Giant Camera Tracks Asteroids
• The camera will offer sharper, broader views of the sky.
• The focal plane of each camera contains an almost complete 64 x 64 array of CCD devices, each containing approximately 600 x 600 pixels, for a total of about 1.4 gigapixels. The CCDs themselves employ the innovative technology called "orthogonal transfer", which is described below.
Diagram showing how an OTA chip is made up of 64 OTCCD cells, each of which has 600 x 600 pixels
Fractal compression
• Fractal Compression group at Waterloo – http://links.uwaterloo.ca/fractals.home.html
• A "Hitchhiker's Guide to Fractal Compression" For Beginners – ftp://links.uwaterloo.ca/pub/Fractals/Papers/Waterloo/vr95.pdf
• Encode using fractals.
• Search for regions that, after a simple transformation, are similar to each other.
• Compression ratio 20–80
• Moving Picture Experts Group (http://www.chiariglione.org/mpeg/)
• MPEG Compression: http://www.cs.cf.ac.uk/Dave/Multimedia/node255.html
• The screen is divided into 256 blocks, where changes and movements are tracked
• Only differences from the previous frame are transmitted
• Compression ratio 50–100
• When one compresses files, can we still use fast search techniques without decompressing first?
• Sometimes, yes
• e.g. Udi Manber has developed a method
• Approximate Matching of Run‐Length Compressed Strings – Veli Mäkinen, Gonzalo Navarro, Esko Ukkonen
• We focus on the problem of approximate matching of strings that have been compressed using run‐length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n , compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n+n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.
Kolmogorov (or Algorithmic) complexity
• Kolmogorov, Chaitin, ...
• What is the compressed version of the sequence '1234567891011121314151617181920212223242526...'?
• Every symbol appears almost equally frequently, almost "random" by entropy
• for i=1 to n do print i ;
• Algorithmic complexity (or Kolmogorov complexity) of a string S is the length of the shortest program that reproduces S, often denoted K(S)
• Conditional complexity : K(S|T). Reproduce S given T.
• http://en.wikipedia.org/wiki/Algorithmic_information_theory
• Algorithmic information theory is a field of study which attempts to capture the concept of complexity using tools from theoretical computer science. The chief idea is to define the complexity (or Kolmogorov complexity) of a string as the length of the shortest program which, when run without any input, outputs that string. Strings that can be produced by short programs are considered to be not very complex. This notion is surprisingly deep and can be used to state and prove impossibility results akin to Gödel's incompleteness theorem and Turing's halting problem.
• The field was developed by Andrey Kolmogorov and Gregory Chaitin starting in the late 1960s.
G. J. Chaitin
• http://www.cs.umaine.edu/~chaitin/
• Distance using K.
• d(S,T) = ( K(S|T) + K(T|S) ) / ( K(S) + K(T) )
• We cannot calculate K, but we can approximate it
• E.g. by compression: LZ, BWT, etc.
• d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) )
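A sketch of this approximation (Python, using zlib as the compressor C; this is the normalized compression distance idea in its simplest form):

```python
import zlib

def C(x: bytes) -> int:
    """Approximate K(x) by the compressed length of x."""
    return len(zlib.compress(x, 9))

def d(s: bytes, t: bytes) -> float:
    """d(S,T) = (C(ST) + C(TS)) / (C(S) + C(T)); smaller means more similar."""
    return (C(s + t) + C(t + s)) / (C(s) + C(t))

a = b'the quick brown fox jumps over the lazy dog ' * 20
b = b'the quick brown fox leaps over the lazy cat ' * 20
r = bytes(range(256)) * 4
print(d(a, b))   # related strings compress well together: low ratio
print(d(a, r))   # unrelated data: noticeably higher ratio
```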
• Use of Kolmogorov Distance Identification of Web Page Authorship, Topic and Domain – David Parry (PPT), in OSWIR 2005, 2005 workshop on Open Source Web Information Retrieval
• Informatsioonikaugus (Information Distance) by Mart Sõmermaa, Fall 2003 (in the Data Mining Research seminar) – http://www.egeen.ee/u/vilo/edu/2003‐04/DM_seminar_2003_II/Raport/P06/main.pdf