Scientific Data Compression through Wavelet Transformation
Chris Fleizach
CSE 262
Problem Statement
Scientific data is hard to compress with traditional general-purpose encoders like gzip or bzip2, because the repeated patterns they exploit do not exist
If the data can be transformed into a form that retains the information but can be thresholded, it can be compressed
Thresholding removes excessively small values and replaces them with zeros, which are much easier to compress than 64-bit doubles (see the sketch after this list)
Wavelet transforms are well suited for this purpose and have been used in image compression (JPEG 2000)
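As a minimal sketch of the thresholding idea (the function name and buffer layout here are illustrative, not the project's actual code):

#include <cmath>
#include <cstddef>

// Replace coefficients whose magnitude falls below the threshold with
// exact zeros; the long zero runs are what a downstream encoder like
// gzip compresses so well.
void thresholdCoefficients(double* coeffs, std::size_t n, double limit) {
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs(coeffs[i]) < limit)
            coeffs[i] = 0.0;  // a zero compresses far better than a raw double
}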
Why Wavelets?
Wavelet transforms encode more information than other techniques like Fourier transforms: both time and frequency information is preserved.
In practical terms, the transformation is applied at many scales and sizes within the signal. This results in vectors that encode approximation and detail information.
By separating the signal this way, it is easier to threshold and remove information. Thus the data can be compressed.
Wavelets in Compression
The JPEG 2000 standard gave up the discrete cosine transform in favor of a wavelet transform
The FBI uses wavelets to compress fingerprint scans to 1/15th to 1/20th of their original size
Choosing the Right Wavelet: The Transform
Continuous wavelet transform: the integral over the time of the signal multiplied by scaled and shifted versions of the wavelet
Unfortunately, it's slow and generates far too much data. It's also hard to implement.
C(s, \tau) = \int f(t)\, \psi_{s,\tau}(t)\, dt, where \psi_{s,\tau} is the wavelet scaled by s and shifted by \tau
From mathworks.com
Choosing the Right Wavelet: The Transform
The discrete transform - if the scales and positions are chosen based on powers of two, then the transform will be much more efficient and just as accurate.
Then the signal is sent through only two “subband” coders (which get the approximation and the detail data from the signal).
Signal decomposed by low-pass and high-pass filters to get approximation and detail info.
From mathworks.com
Choosing the Right Wavelet: The Decomposition
The signal can be recursively decomposed to get finer detail and a more general approximation.
This is called multi-level decomposition.
A signal can be decomposed as many times as it can be divided in half (e.g., a 1024-sample signal supports ten levels).
Thus, we only have one approximation signal at the end of the process.
From mathworks.com
Choosing the Right Wavelet: The Wavelet
The low-pass and high-pass filters (subband coders) are in reality the wavelet that is used. A wide variety of wavelets have been created over time.
The low pass is called the scaling function
The high pass is the wavelet function
Different wavelets give better results depending on the type of data
[Plots: the detail/high-pass/wavelet function and the approximation/low-pass/scaling function.]
From mathworks.com
Choosing the Right Wavelet: The Wavelet
The wavelets that gave the best results were the biorthogonal wavelets.
These were developed by Daubechies and work around the fact that exact reconstruction is impossible if the same wavelet is used for both directions.
Instead, a reconstruction wavelet and a decomposition wavelet are used that are slightly different.
These are the coefficients of the filters used for convolution.
Actual wavelet and scaling functions (from mathworks.com).
Testing Methodology
To find the best combination of wavelet, decomposition level, and threshold, an exhaustive search was done with Matlab.
A 1000x1000 grid of vorticity data from the Navier-Stokes simulator was first compressed with gzip. This was the baseline file to compare against.
Then each available wavelet in Matlab was tested with 1-, 3-, and 5-level decomposition, in combination with thresholding that removed values smaller than 1x10^-4 to 1x10^-7.
The resulting data was saved, compressed with gzip, and compared against the baseline.
Then the data was reconstructed and the max and average error were measured.
Testing Methodology

Desc         Wavelet  ZeroLimit  OldSize  NewSize  Ratio   MaxError     AvgError     StdDev
MultiLevel5  bior3.1  0.0001     6348259  49242    128.92  3.02992E-05  1.54739E-06  8.9255E-07
MultiLevel5  bior3.3  0.0001     6348259  61604    103.05  2.98889E-05  1.39991E-06  5.65952E-07
MultiLevel5  db5      0.0001     6348259  67950    93.43   4.69813E-05  1.65457E-06  2.89868E-07
MultiLevel5  bior3.5  0.0001     6348259  69646    91.15   3.18938E-05  1.29838E-06  5.99841E-07
MultiLevel5  db4      0.0001     6348259  71276    89.07   4.23476E-05  1.95235E-06  7.95851E-07
MultiLevel5  bior4.4  0.0001     6348259  74540    85.17   7.37695E-05  1.65387E-06  3.5692E-07
MultiLevel5  bior5.5  0.0001     6348259  75502    84.08   6.88371E-05  2.03694E-06  6.00806E-07
MultiLevel5  sym4     0.0001     6348259  77540    81.87   6.60595E-05  1.97454E-06  6.83028E-07
1 Level      sym3     0.0000001  6348259  2529604  2.51    1.18008E-07  7.8547E-09   6.15159E-09
1 Level      bior3.1  0.0000001  6348259  2213431  2.87    1.4574E-07   6.39022E-09  6.91756E-09
MultiLevel5  bior3.5  0.00001    6348259  134704   47.13   2.61954E-06  2.23942E-07  1.20665E-07
MultiLevel5  bior3.1  0.00001    6348259  105967   59.91   3.54416E-06  2.55347E-07  1.28242E-07
MultiLevel5  bior3.7  0.00001    6348259  146749   43.26   2.64505E-06  2.19979E-07  1.1741E-07
MultiLevel5  bior3.3  0.00001    6348259  124831   50.85   3.37968E-06  2.29704E-07  1.1771E-07
MultiLevel5  bior3.9  0.00001    6348259  159372   39.83   2.68436E-06  2.22655E-07  1.10396E-07
MultiLevel3  bior2.2  0.0000001  6348259  4072607  1.56    1.11517E-07  5.36598E-09  3.3909E-09
MultiLevel5  bior2.2  0.0000001  6348259  4075903  1.56    1.12578E-07  5.62053E-09  3.35255E-09
Sample Results
For the application, I chose three of the methodologies, representing high compression/high error, medium compression/medium error, and low compression/low error.
Matlab Functions
Four Matlab functions were written for compression and decompression:
wavecompress (1D) and wavecompress2 (2D)
wavedecompress (1D) and wavedecompress2 (2D)
wavecompress2 - Lossy compression for 2D data
[savings] = WAVECOMPRESS2(x, mode, outputfile) compresses the 2-D data in x using the specified mode and saves it to outputfile. The return value is the compression ratio achieved.
The valid values for mode are:
1 = high compression, high error (uses bior3.1 filter and 1x10^-4 limit)
2 = medium compression, medium error (uses bior3.1 filter and 1x10^-5 limit)
3 = low compression, low error (uses bior5.5 filter and 1x10^-7 limit)
To decompress the data, see: wavedecompress2
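A call might look like the following (an illustrative usage example; the input variable and output filename are made up):

% Compress a 2-D vorticity grid with the high-compression mode
savings = wavecompress2(vorticity, 1, 'vorticity000.wcm');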
Some Pictures
C++ Implementation
With the easy work out of the way, the next phase of the project was a C++ implementation. There were a few reasons for reinventing the wheel:
I wanted to fully understand the process
I could try my hand at some parallel processing
I could have a native 3-D transformation
And Matlab makes my computer very slow
Demo
./wavecomp -c 1 -d 2 vorticity000.dat
./wavedec vorticity000.dat.wcm
Algorithm
The basic algorithm for 1-D multi-level decomposition (sketched below):
1. Convolve the input with the low-pass filter to get the approximation vector.
2. Convolve the input with the high-pass filter to get the detail vector.
3. Set input = approximation, and repeat until the desired decomposition level is reached.
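A minimal C++ sketch of this loop, assuming a simplified convolve-and-downsample helper in the spirit of the convolution code on the next slide (names are illustrative, and the output sizes ignore the filter overhang that real signal extension produces):

#include <cstddef>
#include <vector>

// Convolve with a filter and keep every other sample; out-of-range input
// is treated as zero. Real code extends the signal, so sizes differ slightly.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& filt) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < filt.size() && j <= k; ++j)
            out[k / 2] += filt[j] * in[k - j];
    return out;
}

// Multi-level decomposition: peel off one detail vector per level and
// recurse on the approximation, leaving a single approximation at the end.
void decompose(std::vector<double> input,
               const std::vector<double>& lowPass,
               const std::vector<double>& highPass,
               int levels,
               std::vector<std::vector<double>>& details,
               std::vector<double>& finalApprox) {
    for (int l = 0; l < levels && input.size() >= 2; ++l) {
        details.push_back(convDown(input, highPass));  // step 2: detail
        input = convDown(input, lowPass);              // steps 1 and 3: approx
    }
    finalApprox = input;
}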
Convolution
The convolution step is tricky, though, because the filters use data from before and after a specific point, which makes edges hard to handle.
For signals that aren't sized appropriately, the data must be extended. The most common approaches are periodic extension, symmetric extension, or zero-padding.
The convolution algorithm (out-of-range samples are treated as zero, and writing to output[k/2] performs the downsampling):

for (int k = 0; k < signalSize; k += 2) {
    double sum = 0.0;
    for (int j = 0; j < filterSize; j++)
        if (k - j >= 0)                     // zero-padding at the left edge
            sum += filter[j] * input[k - j];
    output[k / 2] = sum;
}
Implementation
The convolution caused the most problems, as many available libraries didn't seem to do it correctly or assumed the data was periodic or symmetric.
I finally appropriated some code from the PyWavelets project that handled zero-padding extension, determined the appropriate output sizes, and performed the correct convolution along the edges.
2-D transformation
The 2-D transformation proved more challenging in terms of how to store the data and how to decompose it.
1. Convolve each row with the low-pass filter to get the approximation vector, then downsample.
2. Convolve each row with the high-pass filter to get the detail vector, then downsample. Then, for the columns:
   1. Convolve each remaining low-pass column with the low-pass filter
   2. Convolve each remaining low-pass column with the high-pass filter
   3. Convolve each high-pass column with the low-pass filter
   4. Convolve each high-pass column with the high-pass filter
   5. Downsample each result
3. Store the 3 detail matrices and set input = low-pass/low-pass. Repeat for the desired number of levels (a sketch of one level follows below).
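A minimal sketch of one 2-D level (names are illustrative; the helper is the same simplified zero-padded convolve-and-downsample as in the 1-D sketch, and real code must also handle signal extension and odd sizes):

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Simplified 1-D convolve-and-downsample with zero padding, as before.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& f) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < f.size() && j <= k; ++j)
            out[k / 2] += f[j] * in[k - j];
    return out;
}

// Filter every column of m with f; the result is stored transposed
// (one row per original column) purely to keep the sketch short.
static Matrix filterColumns(const Matrix& m, const std::vector<double>& f) {
    Matrix out(m[0].size());
    for (std::size_t c = 0; c < m[0].size(); ++c) {
        std::vector<double> col(m.size());
        for (std::size_t r = 0; r < m.size(); ++r) col[r] = m[r][c];
        out[c] = convDown(col, f);
    }
    return out;
}

// One 2-D decomposition level: rows first, then columns, producing the
// four subbands AA, AD, DA, DD. AA becomes the input to the next level.
void decompose2d(const Matrix& in,
                 const std::vector<double>& lo, const std::vector<double>& hi,
                 Matrix& AA, Matrix& AD, Matrix& DA, Matrix& DD) {
    Matrix rowLo, rowHi;
    for (const auto& row : in) {            // steps 1 and 2: filter the rows
        rowLo.push_back(convDown(row, lo));
        rowHi.push_back(convDown(row, hi));
    }
    AA = filterColumns(rowLo, lo);          // column sub-steps 1-4
    AD = filterColumns(rowLo, hi);
    DA = filterColumns(rowHi, lo);
    DD = filterColumns(rowHi, hi);
}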
Post Transformation
After the data was transformed and stored in an appropriate data structure, an in-memory gzip compression (which was oddly better than bzip2) was applied to the data, and the result was written out in binary format (a sketch of the in-memory step follows below).
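A sketch of the in-memory step using zlib, the library behind gzip (this uses zlib's compress2 stream format rather than the gzip file format, and the function name is illustrative):

#include <vector>
#include <zlib.h>

// Deflate a buffer of transformed coefficients in memory before writing
// the binary output file. Returns an empty vector on failure.
std::vector<unsigned char> deflateBuffer(const std::vector<double>& coeffs) {
    const Bytef* src = reinterpret_cast<const Bytef*>(coeffs.data());
    uLong srcLen = static_cast<uLong>(coeffs.size() * sizeof(double));
    uLongf dstLen = compressBound(srcLen);      // worst-case output size
    std::vector<unsigned char> dst(dstLen);
    // Level 9 = best compression; the thresholded zeros shrink dramatically.
    if (compress2(dst.data(), &dstLen, src, srcLen, 9) != Z_OK)
        return {};
    dst.resize(dstLen);                         // shrink to the actual size
    return dst;
}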
Reconstruction is a separate program that does everything in reverse, except that it uses the reconstruction filters.
There was trouble storing and reading the data back in an appropriate form based on the decomposition structure.
Storing data
[Diagram: storage layout of the decomposition; the detail subbands DD, DA, and AD for levels 1-3 nest around the final approximation AA at level 3, with each successive level stored in the remaining corner.]
2D Results
Data Size   Compression Time  Compression Ratio              Max Error
64x64       0.066 s           2.038                          1.414 x 10^-4
128x128     0.090 s           3.359                          1.43 x 10^-4
256x256     0.161 s           6.775                          1.359 x 10^-4
512x512     0.390 s           16.901                         2.218 x 10^-4
1024x1024   1.474 s           41.624                         7.965 x 10^-5
2500x2500   21.21 s           295.376 (0.33% of original)    4.67 x 10^-5
Results for 2-D vorticity data sets using "high" compression (bior3.1 wavelet, 1x10^-4 threshold).
2500x2500 went from 50,000,016 B to 169,344 B; 1024x1024 went from 8,388,624 B to 201,601 B!
2D Results
[Chart: compression ratio for different 2-D data sizes, rising from 2.038 at 64x64 to 295.376 at 2500x2500.]
2D Results
Data Size   Compression Time  Compression Ratio  Max Error
64x64       0.071 s           0.6978             9.2 x 10^-8
128x128     0.114 s           1.350              8.95 x 10^-8
256x256     0.222 s           2.393              9.27 x 10^-8
512x512     0.601 s           5.678              9.23 x 10^-8
1024x1024   2.133 s           16.11              8.013 x 10^-8
2500x2500   20.33 s           23.03              1.389 x 10^-7
Results for 2-D vorticity data sets using "low" compression (bior5.5 wavelet, which has more coefficients than bior3.1; 1x10^-7 threshold).
64x64 actually increases in size because the decomposition creates matrices whose total size is larger than the original, and the threshold level is too low.
Comparison
Compared to the adaptive subsampling presented in the thesis of Tallat:
Threshold Level  Wavelet Compression Ratio (1024x1024)  Adaptive Subsampling Compression Ratio (1020x1020, best results)
10^-3            82.0091                                 53.84
10^-4            41.6241                                 19.57
10^-5            17.256                                  7.378
2D Pictures
[Images: 128x128 vorticity, original and reconstructed, using high compression.]
2D Pictures
Difference between 128x128 vorticity original and reconstructed
2D Pictures
Plot of the per-row max difference between the original and reconstructed 1024x1024 data.
3-D Data
From: http://taco.poly.edu/WaveletSoftware/standard3D.html
Similar to 2-D, except more complicated.
My implementation (sketched below):
1. Filter and downsample along the Z axis, getting A and D
2. Filter and downsample along the Y axis, getting AA, AD, DA, DD
3. Filter and downsample along the X axis, getting AAA, AAD, ADA, ADD, DAA, DAD, DDA, DDD
4. Take AAA and set it as the input
5. Repeat for the desired number of levels
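A sketch of step 1 for a volume stored as one flat array (the names, layout, and the simplified zero-padded helper are illustrative; steps 2 and 3 repeat the same idea along y and then x):

#include <cstddef>
#include <vector>

// Simplified 1-D convolve-and-downsample with zero padding, as before.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& f) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < f.size() && j <= k; ++j)
            out[k / 2] += f[j] * in[k - j];
    return out;
}

// Step 1: filter and downsample every z-line of an X*Y*Z volume (row-major,
// index (z*Y + y)*X + x), producing the A and D half-volumes.
void decomposeZ(const std::vector<double>& vol, int X, int Y, int Z,
                const std::vector<double>& lo, const std::vector<double>& hi,
                std::vector<double>& A, std::vector<double>& D) {
    A.assign(static_cast<std::size_t>(X) * Y * (Z / 2), 0.0);
    D.assign(static_cast<std::size_t>(X) * Y * (Z / 2), 0.0);
    std::vector<double> line(Z);
    for (int x = 0; x < X; ++x)
        for (int y = 0; y < Y; ++y) {
            for (int z = 0; z < Z; ++z)       // gather one z-line
                line[z] = vol[(static_cast<std::size_t>(z) * Y + y) * X + x];
            std::vector<double> a = convDown(line, lo);
            std::vector<double> d = convDown(line, hi);
            for (int z = 0; z < Z / 2; ++z) { // scatter the halved results
                A[(static_cast<std::size_t>(z) * Y + y) * X + x] = a[z];
                D[(static_cast<std::size_t>(z) * Y + y) * X + x] = d[z];
            }
        }
}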
There was real trouble in trying to store the data in a way that can be reconstructed later.
[Diagram: the eight subbands DDD, DDA, DAD, DAA, ADD, ADA, AAD, AAA arranged as octants of the volume along the x, y, and z axes; the next decomposition level is stored in the AAA octant.]
3-D Data
It's also problematic to get real 3-D data. I took vorticity frames and concatenated them, testing with 64x64 vorticity over 64 frames (so not really 3-D data).

Data       Compression Ratio                    Max Error
64x64x64   3.29635 (2,097,176 B to 636,308 B)   2.53188 x 10^-4
3-D Visual Comparison
[Images: reconstructed and original 64x64x64 vorticity data.]
Parallel Processing
The detail and approximation data can be calculated independently.
XML-RPC was used to send an input vector to one node to find the detail data and to another to find the approximation data. The master node coordinates the sending and receiving.
This led to an enormous slowdown in performance, as expected:
XML-RPC adds a huge overhead
Data is sent one row at a time instead of sending an entire decomposition level, which creates excessive communication
In the 2-D decomposition, when convolving the columns, four operations could be done in parallel, but they are only done two at a time
Performance was not the main goal here; rather, it was a proof of concept (a local sketch of the parallel idea follows below).
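The project distributed the work over XML-RPC between nodes; as a purely local illustration of the independence (not the actual RPC code), the two convolutions could also run on two threads:

#include <cstddef>
#include <thread>
#include <vector>

// Simplified convolve-and-downsample helper, as in the earlier sketches.
static std::vector<double> convDown(const std::vector<double>& in,
                                    const std::vector<double>& f) {
    std::vector<double> out(in.size() / 2, 0.0);
    for (std::size_t k = 0; k + 1 < in.size(); k += 2)
        for (std::size_t j = 0; j < f.size() && j <= k; ++j)
            out[k / 2] += f[j] * in[k - j];
    return out;
}

// The approximation and detail convolutions read the same input but write
// separate outputs, so they can proceed concurrently without locking.
void decomposeParallel(const std::vector<double>& input,
                       const std::vector<double>& lowPass,
                       const std::vector<double>& highPass,
                       std::vector<double>& approx,
                       std::vector<double>& detail) {
    std::thread a([&] { approx = convDown(input, lowPass); });
    std::thread d([&] { detail = convDown(input, highPass); });
    a.join();
    d.join();
}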
Demo
Master Node:
./wavecomp -c 1 -d 2 -p -m -s 132.239.55.175 132.239.55.174 vorticity001.dat
Slave Node:
./wavecomp -p -s
Parallel Processing Results
Data Size  Running Time
64x64      0.880 s
128x128    2.64 s
256x256    8.76 s
512x512    33.49 s

Results for 2-D vorticity data sets using "high" compression, with parallel processing on two nodes.
[Chart: processing time (s) vs. data size, single vs. parallel; the parallel runs take 0.88 s, 2.64 s, 8.76 s, and 33.49 s at 64 through 512.]
Known Issues
This method is not applicable to all kinds of data. If not enough values are thresholded to zero (because the limit is too low or the wavelet wasn't appropriate), then the size can actually increase (because the decomposition usually creates detail and approximation vectors larger in sum than the original).
My implementation does excessive data copying, which could be eliminated to speed up processing. It comes down to the question of whether the transformations should be done in place (which is tricky because sizes can change).
Conclusion
Lossy compression is applicable for many kinds of data, but the user should have a basic understanding of the thresholding required
Wavelets are a good choice for doing such compression, as evidenced by other applications and these results.
The finer the resolution of the data, the better the compression.
References
The following helped significantly:
Matlab Wavelet Toolbox: http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/wavelet.html
Robi Polikar, Wavelet Tutorial: http://users.rowan.edu/~polikar/WAVELETS/WTtutorial.html
PyWavelets: http://www.pybytes.com/pywavelets/
Geoff Davis, Wavelet Construction Kit: http://www.geoffdavis.net/dartmouth/wavelet/wavelet.html
Wickerhauser, Adapted Wavelet Analysis from Theory to Software