Parallel Data Compression Utility Jeff Gilchrist November 18, 2003 COMP 5704 Carleton University.


  • Parallel Data Compression Utility. Jeff Gilchrist, November 18, 2003. COMP 5704, Carleton University.

  • Agenda: Introduction; What is data compression?; Compression Algorithms; Burrows-Wheeler Transform (BWT); Parallelizing BWT; Parallel BZIP2; Conclusion; Questions

  • Introduction: A general-purpose compressor that can take advantage of parallel computing should greatly reduce the time required to compress and uncompress files. The goal is to modify the popular BZIP2 compression utility to support parallel processing, in the hope that this will make compression and decompression faster.

  • What is data compression? Compression is used to compact files or data into a smaller form.

    Lossless data compression requires that when the data is uncompressed again, it is identical to the original. Diagram: Original File -> Compressed File -> Uncompressed/Original File
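
    The lossless requirement can be checked with a simple round trip. The sketch below uses Python's standard bz2 module as a stand-in for the BZIP2 utility; the sample text and repeat count are arbitrary assumptions chosen only to give the compressor something repetitive to work with.

```python
import bz2

# Sample input (an assumption for illustration): repetitive text compresses well.
original = b"the frog jumped on the log. " * 100

compressed = bz2.compress(original)    # compact form of the data
restored = bz2.decompress(compressed)  # must be byte-for-byte identical

print(len(original), len(compressed), restored == original)
```

    If the comparison ever printed False, the scheme would not be lossless.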

  • Common compression programs: Some common compression programs used on Unix and Windows are ACE, BZIP2, GZIP, RAR, and ZIP.

  • Compression Algorithms: The two general types of compression algorithms are dictionary-based and statistical. Dictionary algorithms (such as Lempel-Ziv) build dictionaries of strings and replace entire groups of symbols. Statistical algorithms develop models from the statistics of the input data and use those models to control the final output.

  • Dictionary Algorithms: Strings of characters are replaced by tokens to reduce the size of the data. The dictionary contains the strings that the tokens represent.

    Original: the frog jumped on the log.

    Encoded: # fr@ jumped on # l@.

    Dictionary: # = the, @ = og
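
    The token substitution on this slide can be reproduced in a few lines. This is only a toy sketch with a fixed, hand-written dictionary (real dictionary coders such as Lempel-Ziv build theirs from the data); the function names encode and decode are hypothetical, and the sketch assumes the token characters never occur in the input text.

```python
def encode(text, dictionary):
    """Replace each dictionary phrase with its one-character token."""
    for token, phrase in dictionary.items():
        text = text.replace(phrase, token)
    return text


def decode(text, dictionary):
    """Expand each token back into the phrase it stands for."""
    for token, phrase in dictionary.items():
        text = text.replace(token, phrase)
    return text


# Dictionary taken from the slide: '#' stands for "the", '@' for "og".
dictionary = {"#": "the", "@": "og"}
sentence = "the frog jumped on the log."

packed = encode(sentence, dictionary)
print(packed)
print(decode(packed, dictionary) == sentence)
```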

  • Burrows-Wheeler Transform: The BWT is a block-sorting statistical compression algorithm. BWT achieves speeds similar to dictionary-based algorithms. BWT achieves compression performance within a few percent of the best statistical compressors (PPM).

  • BWT Algorithm: BWT works in three stages: sorting, move-to-front, and final compression. The initial sorting stage permutes the input text so that similar contexts are grouped together. The Move-To-Front stage converts local symbol groups into a single global structure. The final compression stage takes advantage of the transformed data to produce efficient compressed output.

  • BWT (Sort): From a string S of N characters, form a matrix of the N cyclic shifts of S. The matrix is sorted lexicographically. A new string L is formed from the last column of the sorted matrix, and I is the index of the row in the sorted matrix that matches S.

    S = abraca, L = caraab, I = 1

    Row 0: aabrac
    Row 1: abraca
    Row 2: acaabr
    Row 3: bracaa
    Row 4: caabra
    Row 5: racaab
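
    The sort stage as described on this slide can be sketched directly. This is a naive illustration that builds every rotation in memory, not the approach a production compressor would take; the function name bwt is hypothetical.

```python
def bwt(s):
    """Naive Burrows-Wheeler sort stage: returns (L, I) for the string s."""
    n = len(s)
    # Build the N cyclic shifts of S, then sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    # L is the last column of the sorted matrix.
    last_column = "".join(row[-1] for row in rotations)
    # I is the index of the row equal to the original string.
    index = rotations.index(s)
    return last_column, index


L, I = bwt("abraca")
print(L, I)
```

    For S = abraca this reproduces the values on the slide, L = caraab and I = 1.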

  • BWT (Move-To-Front): The Move-To-Front step defines a vector of integers R that represents codes for the string L. A list Y is created containing the alphabet of L. R is built by setting R[i] to the number of characters preceding L[i] in Y; L[i] is then moved to the front of Y.

    L = caraab, I = 1

    Y = a, b, c, r; L[0] = c; R[0] = 2
    Y = c, a, b, r; L[1] = a; R[1] = 1
    Y = a, c, b, r
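
    The Move-To-Front rule can be written as a short loop. In this sketch the initial list Y is assumed to be the sorted set of characters in L, which matches the slide's starting list a, b, c, r; the function name move_to_front is hypothetical.

```python
def move_to_front(text):
    """Return the Move-To-Front code vector R for the string text."""
    y = sorted(set(text))      # Y: the alphabet of L, initially in sorted order
    r = []
    for ch in text:
        pos = y.index(ch)      # number of characters preceding ch in Y
        r.append(pos)
        y.insert(0, y.pop(pos))  # move ch to the front of Y
    return r


R = move_to_front("caraab")
print(R)
```

    Runs of the same character in L become runs of zeros in R, which is what makes the final coding stage effective.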

  • BWT (Final Compression): The final R vector, along with I, is then compressed using Huffman or some other coding technique. Each element of R is treated as a separate token to be coded.

    Diagram: R = (2 1 3 1 0 3), I = 1 -> Huffman encoded
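
    As an illustration of the final stage, the sketch below builds a textbook Huffman code for the tokens of R. It is a generic construction under simple assumptions (one static code table, ties broken by insertion order) and does not claim to reproduce how BZIP2 itself encodes its output; the function name huffman_code is hypothetical, and I would have to be stored alongside the bit string.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code {symbol: bit string} from symbol frequencies."""
    freq = Counter(symbols)
    # Heap entries are (weight, tie_breaker, partial code table).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # two lightest subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


R = [2, 1, 3, 1, 0, 3]
code = huffman_code(R)
bits = "".join(code[r] for r in R)
print(code, bits, len(bits))
```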

  • Parallelizing BWT: There are a couple of options for parallelizing BWT. One is to exploit blocks: the data to be compressed is broken into blocks before BWT is run. Each block can be processed independently and can therefore be run in parallel. The blocks are stitched back together at the end.

    Diagram: Data To Compress -> Data 1, Data 2, Data 3 -> (R, I) for each block -> Huffman coding of each block -> Compressed

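  • Parallelizing BWT: Another option targets the Sort stage, where the matrix needs to be sorted lexicographically.

    A parallel sort algorithm could be used to achieve speedup.

    Diagram: the sorted rotation matrix for S = abraca (rows aabrac, abraca, acaabr, bracaa, caabra, racaab) divided among CPU 1, CPU 2, ..., CPU n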
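
    One way to picture a parallel sort of the rotation matrix is to bucket the rotations by first character, sort each bucket on a separate worker, and concatenate the buckets in key order. The slides do not say which parallel sort is intended, so this bucketing scheme, the function name parallel_sorted_rotations, and the workers parameter are all assumptions; in standard CPython, threads sorting pure-Python strings are unlikely to show real speedup, so the sketch shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sorted_rotations(s, workers=2):
    """Sort the cyclic shifts of s by sorting first-character buckets in parallel."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Partition: rows starting with different characters never interleave.
    buckets = {}
    for row in rotations:
        buckets.setdefault(row[0], []).append(row)
    keys = sorted(buckets)
    # Each bucket is sorted independently by a worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_buckets = list(pool.map(sorted, (buckets[k] for k in keys)))
    # Concatenating buckets in key order yields the fully sorted matrix.
    return [row for bucket in sorted_buckets for row in bucket]


matrix = parallel_sorted_rotations("abraca")
print(matrix)
```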

  • BZIP2: BZIP2 is a popular compression utility on Unix, used as a replacement for GZIP. BZIP2 uses the BWT algorithm and is available free with source code. BZIP2 compresses single files, so it is often used with TAR (e.g. kernel-2.4.21.tar.bz2). BZIP2 works in a sequential manner.

  • Parallel BZIP2: Modify BZIP2 to process BWT blocks in parallel. Use the pthread model for SMP parallel computing. Many 2- and 4-CPU machines and Hyper-Threaded P4 machines are available.

    Diagram: Data To Compress is split into blocks Data 1 to Data 4, which are distributed across CPU 1 and CPU 2; each block goes through (R, I) and Huffman coding
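
    The block-parallel idea can be tried out without touching the BZIP2 source. The sketch below is not the planned pthread implementation: it uses Python's standard bz2 module with a thread pool standing in for the pthread workers, compresses each block as its own stream, and concatenates the results. The function name parallel_compress, the block_size and workers values, and the sample data are assumptions for illustration.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_compress(data, block_size=100_000, workers=2):
    """Compress fixed-size blocks independently and stitch the streams together."""
    # Break the input into blocks before any compression is run.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Each block is compressed on its own worker; map keeps the block order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, blocks))
    # Stitch the independently compressed blocks back together.
    return b"".join(streams)


data = b"the frog jumped on the log. " * 20_000
packed = parallel_compress(data)

# bz2.decompress accepts concatenated streams, so the round trip still works.
print(len(data), len(packed), bz2.decompress(packed) == data)
```

    Compressing blocks as separate streams can cost a little compression ratio compared with one sequential run, which is one of the things the planned testing would need to measure.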

  • Conclusion: Parallelizing BZIP2 should provide speedup and increase its performance. Once the code is complete, testing will be performed to see whether this is true and by how much. Compressing and uncompressing large amounts of data (e.g. the Linux kernel source) takes a lot of time, so speeding up the process should be useful for people who have SMP machines.

  • Questions: What data compression algorithm does BZIP2 use? How do the algorithm's speed and compression compare to dictionary and other statistical algorithms? What parallel computing model is being used in the modified BZIP2?