Scientific Data Compression through Wavelet Transformation
Chris Fleizach
CSE 262
Problem Statement
Scientific data is hard to compress with traditional general-purpose encoders like gzip or bzip2, because the repeated patterns they exploit do not exist
If the data can be transformed into a form that retains the information but can be thresholded, it can be compressed
Thresholding removes excessively small values and replaces them with zeros, which are much easier to compress than 64-bit doubles (see the sketch after this list)
Wavelet transforms are well suited for this purpose and have been used in image compression (JPEG 2000)
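As a minimal sketch of the thresholding idea (the function name and buffer layout here are illustrative, not the project's actual code):

#include <cmath>
#include <cstddef>

// Replace coefficients whose magnitude falls below the threshold with
// exact zeros; the long zero runs are what a downstream encoder like
// gzip compresses so well.
void thresholdCoefficients(double* coeffs, std::size_t n, double limit) {
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs(coeffs[i]) < limit)
            coeffs[i] = 0.0;  // a zero compresses far better than a raw double
}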
Why Wavelets?
Wavelet transforms encode more information than other techniques like Fourier transforms: both time and frequency information is preserved.
In practical terms, the transformation is applied at many scales and sizes within the signal. This results in vectors that encode approximation and detail information.
By separating the signal this way, it is easier to threshold and remove information. Thus the data can be compressed.
Wavelets in Compression
The JPEG 2000 standard gave up the discrete cosine transform in favor of a wavelet transform
The FBI uses wavelets to compress fingerprint scans to 1/15th to 1/20th of their original size
Choosing the Right Wavelet: The Transform
Continuous wavelet transform: the integral over the time of the signal multiplied by scaled and shifted versions of the wavelet
Unfortunately, it's slow and generates far too much data. It's also hard to implement.
C(s, \tau) = \int f(t)\, \psi_{s,\tau}(t)\, dt, where \psi_{s,\tau} is the wavelet scaled by s and shifted by \tau
From mathworks.com
Choosing the Right Wavelet: The Transform
The discrete transform - if the scales and positions are chosen based on powers of two, then the transform will be much more efficient and just as accurate.
Then the signal is sent through only two “subband” coders (which get the approximation and the detail data from the signal).
Signal decomposed by low-pass and high-pass filters to get approximation and detail info.
From mathworks.com
Choosing the Right Wavelet: The Decomposition
The signal can be recursively decomposed to get finer detail and a more general approximation.
This is called multi-level decomposition.
A signal can be decomposed as many times as it can be divided in half (e.g., a 1024-sample signal supports ten levels).
Thus, we only have one approximation signal at the end of the process.
From mathworks.com
Choosing the Right Wavelet: The Wavelet
The low-pass and high-pass filters (subband coders) are in reality the wavelet that is used. A wide variety of wavelets have been created over time.
The low pass is called the scaling function
The high pass is the wavelet function
Different wavelets give better results depending on the type of data
[Plots: the detail/high-pass/wavelet function and the approximation/low-pass/scaling function.]
From mathworks.com
Choosing the Right Wavelet: The Wavelet
The wavelets that gave the best results were the biorthogonal wavelets.
These were developed by Daubechies and work around the fact that exact reconstruction is impossible if the same wavelet is used for both directions.
Instead, a reconstruction wavelet and a decomposition wavelet are used that are slightly different.
These are the coefficients of the filters used for convolution.
Actual wavelet and scaling functions (from mathworks.com).
Testing Methodology
To find the best combination of wavelet, decomposition level, and threshold, an exhaustive search was done with Matlab.
A 1000x1000 grid of vorticity data from the Navier-Stokes simulator was first compressed with gzip. This was the baseline file to compare against.
Then each available wavelet in Matlab was tested with 1-, 3-, and 5-level decomposition, in combination with thresholding that removed values smaller than 1x10^-4 to 1x10^-7.
The resulting data was saved, compressed with gzip, and compared against the baseline.
Then the data was reconstructed and the max and average error were measured.
Testing Methodology

Desc         Wavelet  ZeroLimit  OldSize  NewSize  Ratio   MaxError     AvgError     StdDev
MultiLevel5  bior3.1  0.0001     6348259  49242    128.92  3.02992E-05  1.54739E-06  8.9255E-07
MultiLevel5  bior3.3  0.0001     6348259  61604    103.05  2.98889E-05  1.39991E-06  5.65952E-07
MultiLevel5  db5      0.0001     6348259  67950    93.43   4.69813E-05  1.65457E-06  2.89868E-07
MultiLevel5  bior3.5  0.0001     6348259  69646    91.15   3.18938E-05  1.29838E-06  5.99841E-07
MultiLevel5  db4      0.0001     6348259  71276    89.07   4.23476E-05  1.95235E-06  7.95851E-07
MultiLevel5  bior4.4  0.0001     6348259  74540    85.17   7.37695E-05  1.65387E-06  3.5692E-07
MultiLevel5  bior5.5  0.0001     6348259  75502    84.08   6.88371E-05  2.03694E-06  6.00806E-07
MultiLevel5  sym4     0.0001     6348259  77540    81.87   6.60595E-05  1.97454E-06  6.83028E-07
1 Level      sym3     0.0000001  6348259  2529604  2.51    1.18008E-07  7.8547E-09   6.15159E-09
1 Level      bior3.1  0.0000001  6348259  2213431  2.87    1.4574E-07   6.39022E-09  6.91756E-09
MultiLevel5  bior3.5  0.00001    6348259  134704   47.13   2.61954E-06  2.23942E-07  1.20665E-07
MultiLevel5  bior3.1  0.00001    6348259  105967   59.91   3.54416E-06  2.55347E-07  1.28242E-07
MultiLevel5  bior3.7  0.00001    6348259  146749   43.26   2.64505E-06  2.19979E-07  1.1741E-07
MultiLevel5  bior3.3  0.00001    6348259  124831   50.85   3.37968E-06  2.29704E-07  1.1771E-07
MultiLevel5  bior3.9  0.00001    6348259  159372   39.83   2.68436E-06  2.22655E-07  1.10396E-07
MultiLevel3  bior2.2  0.0000001  6348259  4072607  1.56    1.11517E-07  5.36598E-09  3.3909E-09
MultiLevel5  bior2.2  0.0000001  6348259  4075903  1.56    1.12578E-07  5.62053E-09  3.35255E-09
Sample Results
For the application, I chose three of the methodologies, representing high compression/high error, medium compression/medium error, and low compression/low error.
Matlab Functions
Four Matlab functions were written for compression and decompression:
wavecompress (1D) and wavecompress2 (2D)
wavedecompress (1D) and wavedecompress2 (2D)
wavecompress2 - Lossy compression for 2D data
[savings] = WAVECOMPRESS2(x, mode, outputfile) compresses the 2-D data in x using the specified mode and saves it to outputfile. The return value is the compression ratio achieved.
The valid values for mode are:
1 = high compression, high error (uses bior3.1 filter and 1x10^-4 limit)
2 = medium compression, medium error (uses bior3.1 filter and 1x10^-5 limit)
3 = low compression, low error (uses bior5.5 filter and 1x10^-7 limit)
To decompress the data, see: wavedecompress2
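A call might look like the following (an illustrative usage example; the input variable and output filename are made up):

% Compress a 2-D vorticity grid with the high-compression mode
savings = wavecompress2(vorticity, 1, 'vorticity000.wcm');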
Some Pictures
C++ Implementation
With the easy work out of the way, the next phase of the project was a C++ implementation. There were a few reasons for reinventing the wheel:
I wanted to fully understand the process
I could try my hand at some parallel processing
I could have a native 3-D transformation
And Matlab makes my computer very slow
Demo
./wavecomp -c 1 -d 2 vorticity000.dat
./wavedec vorticity000.dat.wcm
Algorithm
The basic algorithm for 1-D multi-level decomposition (sketched below):
1. Convolve the input with the low-pass filter to get the approximation vector.
2. Convolve the input with the high-pass filter to get the detail vector.
3. Set input = approximation, and repeat until the desired decomposition level is reached.
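A minimal C++ sketch of this loop, assuming a simplified convolve-and-downsample helper in the spirit of the convolution code on the next slide (names are illustrative, and the output sizes ignore the filter overhang that real signal extension produces):

#include <cstddef>
#include <vector>

// Convolve with a filter and keep every other sample; out-of-range input
// is treated as zero. Real code extends the signal, so sizes differ slightly.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& filt) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < filt.size() && j <= k; ++j)
            out[k / 2] += filt[j] * in[k - j];
    return out;
}

// Multi-level decomposition: peel off one detail vector per level and
// recurse on the approximation, leaving a single approximation at the end.
void decompose(std::vector<double> input,
               const std::vector<double>& lowPass,
               const std::vector<double>& highPass,
               int levels,
               std::vector<std::vector<double>>& details,
               std::vector<double>& finalApprox) {
    for (int l = 0; l < levels && input.size() >= 2; ++l) {
        details.push_back(convDown(input, highPass));  // step 2: detail
        input = convDown(input, lowPass);              // steps 1 and 3: approx
    }
    finalApprox = input;
}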
Convolution
The convolution step is tricky, though, because the filters use data from before and after a specific point, which makes edges hard to handle.
For signals that aren't sized appropriately, the data must be extended. The most common approaches are periodic extension, symmetric extension, or zero-padding.
The convolution algorithm (out-of-range samples are treated as zero, and writing to output[k/2] performs the downsampling):

for (int k = 0; k < signalSize; k += 2) {
    double sum = 0.0;
    for (int j = 0; j < filterSize; j++)
        if (k - j >= 0)                     // zero-padding at the left edge
            sum += filter[j] * input[k - j];
    output[k / 2] = sum;
}
Implementation
The convolution caused the most problems, as many available libraries didn't seem to do it correctly or assumed the data was periodic or symmetric.
I finally appropriated some code from the PyWavelets project that handled zero-padding extension, determined the appropriate output sizes, and performed the correct convolution along the edges.
2-D transformation
The 2-D transformation proved more challenging in terms of how to store the data and how to decompose it.
1. Convolve each row with the low-pass filter to get the approximation vector, then downsample.
2. Convolve each row with the high-pass filter to get the detail vector, then downsample. Then, for the columns:
   1. Convolve each remaining low-pass column with the low-pass filter
   2. Convolve each remaining low-pass column with the high-pass filter
   3. Convolve each high-pass column with the low-pass filter
   4. Convolve each high-pass column with the high-pass filter
   5. Downsample each result
3. Store the 3 detail matrices and set input = low-pass/low-pass. Repeat for the desired number of levels (a sketch of one level follows below).
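A minimal sketch of one 2-D level (names are illustrative; the helper is the same simplified zero-padded convolve-and-downsample as in the 1-D sketch, and real code must also handle signal extension and odd sizes):

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Simplified 1-D convolve-and-downsample with zero padding, as before.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& f) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < f.size() && j <= k; ++j)
            out[k / 2] += f[j] * in[k - j];
    return out;
}

// Filter every column of m with f; the result is stored transposed
// (one row per original column) purely to keep the sketch short.
static Matrix filterColumns(const Matrix& m, const std::vector<double>& f) {
    Matrix out(m[0].size());
    for (std::size_t c = 0; c < m[0].size(); ++c) {
        std::vector<double> col(m.size());
        for (std::size_t r = 0; r < m.size(); ++r) col[r] = m[r][c];
        out[c] = convDown(col, f);
    }
    return out;
}

// One 2-D decomposition level: rows first, then columns, producing the
// four subbands AA, AD, DA, DD. AA becomes the input to the next level.
void decompose2d(const Matrix& in,
                 const std::vector<double>& lo, const std::vector<double>& hi,
                 Matrix& AA, Matrix& AD, Matrix& DA, Matrix& DD) {
    Matrix rowLo, rowHi;
    for (const auto& row : in) {            // steps 1 and 2: filter the rows
        rowLo.push_back(convDown(row, lo));
        rowHi.push_back(convDown(row, hi));
    }
    AA = filterColumns(rowLo, lo);          // column sub-steps 1-4
    AD = filterColumns(rowLo, hi);
    DA = filterColumns(rowHi, lo);
    DD = filterColumns(rowHi, hi);
}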
Post Transformation
After the data was transformed and stored in an appropriate data structure, an in-memory gzip compression (which was oddly better than bzip2) was applied to the data, and the result was written out in binary format (a sketch of the in-memory step follows below).
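A sketch of the in-memory step using zlib, the library behind gzip (this uses zlib's compress2 stream format rather than the gzip file format, and the function name is illustrative):

#include <vector>
#include <zlib.h>

// Deflate a buffer of transformed coefficients in memory before writing
// the binary output file. Returns an empty vector on failure.
std::vector<unsigned char> deflateBuffer(const std::vector<double>& coeffs) {
    const Bytef* src = reinterpret_cast<const Bytef*>(coeffs.data());
    uLong srcLen = static_cast<uLong>(coeffs.size() * sizeof(double));
    uLongf dstLen = compressBound(srcLen);      // worst-case output size
    std::vector<unsigned char> dst(dstLen);
    // Level 9 = best compression; the thresholded zeros shrink dramatically.
    if (compress2(dst.data(), &dstLen, src, srcLen, 9) != Z_OK)
        return {};
    dst.resize(dstLen);                         // shrink to the actual size
    return dst;
}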
Reconstruction is a separate program that does everything in reverse, except that it uses the reconstruction filters.
There was trouble storing and reading the data back in an appropriate form based on the decomposition structure.
Storing data
[Diagram: storage layout of the decomposition; the detail subbands DD, DA, and AD for levels 1-3 nest around the final approximation AA at level 3, with each successive level stored in the remaining corner.]
2D Results
Data Size   Compression Time  Compression Ratio              Max Error
64x64       0.066 s           2.038                          1.414 x 10^-4
128x128     0.090 s           3.359                          1.43 x 10^-4
256x256     0.161 s           6.775                          1.359 x 10^-4
512x512     0.390 s           16.901                         2.218 x 10^-4
1024x1024   1.474 s           41.624                         7.965 x 10^-5
2500x2500   21.21 s           295.376 (0.33% of original)    4.67 x 10^-5
Results for 2-D vorticity data sets using "high" compression (bior3.1 wavelet, 1x10^-4 threshold).
2500x2500 went from 50,000,016 B to 169,344 B; 1024x1024 went from 8,388,624 B to 201,601 B!
2D Results
[Chart: compression ratio for different 2-D data sizes, rising from 2.038 at 64x64 to 295.376 at 2500x2500.]
2D Results
Data Size   Compression Time  Compression Ratio  Max Error
64x64       0.071 s           0.6978             9.2 x 10^-8
128x128     0.114 s           1.350              8.95 x 10^-8
256x256     0.222 s           2.393              9.27 x 10^-8
512x512     0.601 s           5.678              9.23 x 10^-8
1024x1024   2.133 s           16.11              8.013 x 10^-8
2500x2500   20.33 s           23.03              1.389 x 10^-7
Results for 2-D vorticity data sets using "low" compression (bior5.5 wavelet, which has more coefficients than bior3.1; 1x10^-7 threshold).
64x64 actually increases in size because the decomposition creates matrices whose total size is larger than the original, and the threshold level is too low.
Comparison
Compared to the adaptive subsampling presented in the thesis of Tallat:
Threshold Level  Wavelet Compression Ratio (1024x1024)  Adaptive Subsampling Compression Ratio (1020x1020, best results)
10^-3            82.0091                                 53.84
10^-4            41.6241                                 19.57
10^-5            17.256                                  7.378
2D Pictures
[Images: 128x128 vorticity, original and reconstructed, using high compression.]
2D Pictures
Difference between 128x128 vorticity original and reconstructed
2D Pictures
Plot of the per-row max difference between the original and reconstructed 1024x1024 data.
3-D Data
From: http://taco.poly.edu/WaveletSoftware/standard3D.html
Similar to 2-D, except more complicated.
My implementation (sketched below):
1. Filter and downsample along the Z axis, getting A and D
2. Filter and downsample along the Y axis, getting AA, AD, DA, DD
3. Filter and downsample along the X axis, getting AAA, AAD, ADA, ADD, DAA, DAD, DDA, DDD
4. Take AAA and set it as the input
5. Repeat for the desired number of levels
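A sketch of step 1 for a volume stored as one flat array (the names, layout, and the simplified zero-padded helper are illustrative; steps 2 and 3 repeat the same idea along y and then x):

#include <cstddef>
#include <vector>

// Simplified 1-D convolve-and-downsample with zero padding, as before.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& f) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < f.size() && j <= k; ++j)
            out[k / 2] += f[j] * in[k - j];
    return out;
}

// Step 1: filter and downsample every z-line of an X*Y*Z volume (row-major,
// index (z*Y + y)*X + x), producing the A and D half-volumes.
void decomposeZ(const std::vector<double>& vol, int X, int Y, int Z,
                const std::vector<double>& lo, const std::vector<double>& hi,
                std::vector<double>& A, std::vector<double>& D) {
    A.assign(static_cast<std::size_t>(X) * Y * (Z / 2), 0.0);
    D.assign(static_cast<std::size_t>(X) * Y * (Z / 2), 0.0);
    std::vector<double> line(Z);
    for (int x = 0; x < X; ++x)
        for (int y = 0; y < Y; ++y) {
            for (int z = 0; z < Z; ++z)       // gather one z-line
                line[z] = vol[(static_cast<std::size_t>(z) * Y + y) * X + x];
            std::vector<double> a = convDown(line, lo);
            std::vector<double> d = convDown(line, hi);
            for (int z = 0; z < Z / 2; ++z) { // scatter the halved results
                A[(static_cast<std::size_t>(z) * Y + y) * X + x] = a[z];
                D[(static_cast<std::size_t>(z) * Y + y) * X + x] = d[z];
            }
        }
}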
There was real trouble in trying to store the data in a way that can be reconstructed later.
[Diagram: the eight subbands DDD, DDA, DAD, DAA, ADD, ADA, AAD, AAA arranged as octants of the volume along the x, y, and z axes; the next decomposition level is stored in the AAA octant.]
3-D Data
It's also problematic to get real 3-D data. I took vorticity frames and concatenated them, testing with 64x64 vorticity over 64 frames (so not really 3-D data).

Data       Compression Ratio                    Max Error
64x64x64   3.29635 (2,097,176 B to 636,308 B)   2.53188 x 10^-4
3-D Visual Comparison
[Images: reconstructed and original 64x64x64 vorticity data.]
Parallel Processing
The detail and approximation data can be calculated independently.
XML-RPC was used to send an input vector to one node to find the detail data and to another to find the approximation data. The master node coordinates the sending and receiving.
This led to an enormous slowdown in performance, as expected:
XML-RPC adds a huge overhead
Data is sent one row at a time instead of sending an entire decomposition level, which creates excessive communication
In the 2-D decomposition, when convolving the columns, four operations could be done in parallel, but they are only done two at a time
Performance was not the main goal here; rather, it was a proof of concept (a local sketch of the parallel idea follows below).
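The project distributed the work over XML-RPC between nodes; as a purely local illustration of the independence (not the actual RPC code), the two convolutions could also run on two threads:

#include <cstddef>
#include <thread>
#include <vector>

// Simplified convolve-and-downsample helper, as in the earlier sketches.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& f) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < f.size() && j <= k; ++j)
            out[k / 2] += f[j] * in[k - j];
    return out;
}

// The approximation and detail convolutions read the same input but write
// separate outputs, so they can proceed concurrently without locking.
void decomposeParallel(const std::vector<double>& input,
                       const std::vector<double>& lowPass,
                       const std::vector<double>& highPass,
                       std::vector<double>& approx,
                       std::vector<double>& detail) {
    std::thread a([&] { approx = convDown(input, lowPass); });
    std::thread d([&] { detail = convDown(input, highPass); });
    a.join();
    d.join();
}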
Demo
Master Node:
./wavecomp -c 1 -d 2 -p -m -s 132.239.55.175 132.239.55.174 vorticity001.dat
Slave Node:
./wavecomp -p -s
Parallel Processing Results
Data Size  Running Time
64x64      0.880 s
128x128    2.64 s
256x256    8.76 s
512x512    33.49 s

Results for 2-D vorticity data sets using "high" compression, with parallel processing on two nodes.
[Chart: processing time (s) vs. data size, single vs. parallel; the parallel runs take 0.88 s, 2.64 s, 8.76 s, and 33.49 s at 64 through 512.]
Known Issues
This method is not applicable to all kinds of data. If not enough values are thresholded to zero (because the limit is too low or the wavelet wasn't appropriate), then the size can actually increase (because the decomposition usually creates detail and approximation vectors larger in sum than the original).
My implementation does excessive data copying, which could be eliminated to speed up processing. It comes down to the question of whether the transformations should be done in place (which is tricky because sizes can change).
Conclusion
Lossy compression is applicable for many kinds of data, but the user should have a basic understanding of the thresholding required
Wavelets are a good choice for doing such compression, as evidenced by other applications and these results.
The finer the resolution of the data, the better the compression.
References
The following helped significantly:
Matlab Wavelet Toolbox: http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/wavelet.html
Robi Polikar, Wavelet Tutorial: http://users.rowan.edu/~polikar/WAVELETS/WTtutorial.html
PyWavelets: http://www.pybytes.com/pywavelets/
Geoff Davis, Wavelet Construction Kit: http://www.geoffdavis.net/dartmouth/wavelet/wavelet.html
Wickerhauser, Adapted Wavelet Analysis from Theory to Software