Matrix Factorizations for Parallel Integer Transforms
Transcript of Matrix Factorizations for Parallel Integer Transforms
Yiyuan She1,2,3, Pengwei Hao1,2, Yakup Paker2
1Center for Information Science, Peking University
2Queen Mary, University of London
3Department of Statistics, Stanford University
Contents
1. Introduction
2. Point & block factorizations
3. Parallel ERM factorization (PERM)
4. Parallel computational complexity
5. Matrix blocking strategy
6. Conclusions
Why reversible integer transforms?

[Diagram: lossless coding pipeline. Encoding: a B/W image goes through a spatial transform, a color image through a color-space transform, a multi-component image through the MCT; the coefficients are stored or communicated; decoding applies the inverse spatial transform, the inverse color transform, or the IMCT. Each stage asks: Lossless?]
How to implement?
• Wavelet constructions:
– S transform (Blume & Fand, 1989)
– TS transform (Zandi et al., 1995)
– S+P transform (Said & Pearlman, 1996)
• Ladder structure (Bruekers & van den Enden, 1992)
• Lifting scheme (2D, Sweldens, 1996)
• Approximated color transform (Gormish et al., 1997)
• General wavelet transform (2D, Daubechies et al., 1998)
Matrix factorizations
• P. Hao and Q. Shi, "Invertible linear transforms implemented by integer mapping," Science in China, Series E (in Chinese), 2000, 30, pp. 132-141.
• P. Hao and Q. Shi, "Matrix factorizations for reversible integer mapping," IEEE Trans. Signal Processing, 2001, 49, pp. 2314-2324.
• P. Hao and Q. Shi, "Proposal of reversible integer implementation for multiple component transforms," ISO/IEC JTC1/SC29/WG1 N1720, Arles, France, 2000.
• Y. She and P. Hao, "A block TERM factorization of nonsingular uniform block matrices," Science in China, Series E (in Chinese), 2004, 34(2).
Can we make it more efficient?
• Fewer factor matrices
• Less rounding error
• Integer computation
• Parallel computing
How to increase the degree of parallelism?
Elementary reversible structure

[Diagram: ladder structure — the forward step multiplies x by the integer factor j and adds the rounded value [b]; the inverse step subtracts the same [b] and multiplies by 1/j.]

• Integer factor: j
• Flexible rounding: round(), floor(), ceil(), …
• Generalized lifting scheme: for j = 1, it is the same as the ladder structure and the lifting scheme
• Implementation: y = jx + [b] and x = (1/j)·(y − [b])
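As a sketch of the implementation line above (the predictor b is left generic on the slide; b = x1/2 below is a hypothetical choice, and floor() stands in for the flexible rounding):

```python
import math

def erm_forward(x, j=1):
    """Forward elementary reversible step on a pair (x0, x1): y0 = j*x0 + [b]."""
    b = x[1] / 2                       # hypothetical predictor depending only on x1
    return (j * x[0] + math.floor(b), x[1])

def erm_inverse(y, j=1):
    """Exact inverse: recompute the same [b] from the untouched x1 = y1,
    subtract it, and divide by the integer factor j (exact for j = +/-1)."""
    b = y[1] / 2
    return ((y[0] - math.floor(b)) // j, y[1])

x = (3, 5)
assert erm_inverse(erm_forward(x)) == x              # j = 1: lifting/ladder step
assert erm_inverse(erm_forward(x, j=-1), j=-1) == x  # j = -1 also inverts exactly
```

Reversibility holds for any integer inputs because the inverse recomputes the identical rounded value [b].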
Elementary reversible matrix (ERM)
• Diagonal elements: integer factors
• Triangular ERM (TERM): upper TERM, lower TERM
• Single-row ERM (SERM): only one row of off-diagonal nonzeros

S_m = J + e_m s_m^T
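A unit-diagonal SERM stays reversible even with rounding, because the one row it updates does not feed its own predictor. A minimal sketch (the vector s and the row index are hypothetical):

```python
import math

def serm_forward(x, m, s):
    """y = x + e_m * [s^T x]: only entry m changes (requires s[m] == 0)."""
    assert s[m] == 0
    y = list(x)
    y[m] += math.floor(sum(si * xi for si, xi in zip(s, x)))
    return y

def serm_inverse(y, m, s):
    """x = y - e_m * [s^T y]; exact because s[m] = 0 implies s^T y = s^T x."""
    x = list(y)
    x[m] -= math.floor(sum(si * yi for si, yi in zip(s, y)))
    return x

x = [4, -2, 7]
s = [0.5, 0, 0.25]      # one off-diagonal row; s[1] = 0 keeps it reversible
assert serm_inverse(serm_forward(x, 1, s), 1, s) == x
```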
Point factorizations (PLUS)

A = PLUS = P D_R L U S_0 S_1 ⋯ S_{N−1},  if det(P^T A) = det D_R ≠ 0
S_0 = I + e_N s_0^T = I + e_N · [s_1, s_2, …, s_{N−1}, 0]
D_R = Diag(1, 1, …, 1, det(P^T A))
S_m = I + e_m s_m^T
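For intuition, here is a hypothetical 2-by-2 instance of the idea behind such factorizations: a matrix with det A = 1 and a nonzero pivot c factors into three unit-diagonal triangular (ERM-type) matrices:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])            # det(A) = 1, c = A[1,0] != 0
a, b, c, d = A.ravel()

# Unit-diagonal triangular factors, derived by matching entries:
U1 = np.array([[1.0, (a - 1) / c], [0.0, 1.0]])
L  = np.array([[1.0, 0.0],         [c,   1.0]])
U2 = np.array([[1.0, (d - 1) / c], [0.0, 1.0]])

assert np.allclose(U1 @ L @ U2, A)   # A = U1 * L * U2
```

Each factor is a single lifting step, so the whole product maps integers to integers reversibly once rounding is inserted.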
Block factorizations (BLUS)

A = PLUS = P D_R L U S_0 S_1 ⋯ S_{N−1},  if DET(P^T A) = DET(D_R) exists
S_0 = I + e_N s_0^T = I + e_N · [s_1, s_2, …, s_{N−1}, 0]
D_R = Diag(I, I, …, I, DET(P^T A))
S_m = I + e_m s_m^T
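The block case replaces scalar pivots with invertible blocks. A minimal sketch of one block elimination step (the matrix and the 2-by-2 blocking are chosen for illustration):

```python
import numpy as np

A = np.array([[2., 1., 0., 3.],
              [1., 1., 2., 0.],
              [0., 2., 1., 1.],
              [3., 0., 1., 2.]])
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]      # the block pivot A11 must be invertible

inv11 = np.linalg.inv(A11)
I2, Z = np.eye(2), np.zeros((2, 2))
L = np.block([[I2, Z], [A21 @ inv11, I2]])               # block unit lower triangular
U = np.block([[A11, A12], [Z, A22 - A21 @ inv11 @ A12]]) # Schur complement block in U
assert np.allclose(L @ U, A)
```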
Parallel factorizations (PERM)

N_1 = m^(0) →(n^(1)) m^(1) →(n^(2)) m^(2) → ⋯ → m^(K−1) →(n^(K)) m^(K) = N_2

A = P^(1) P^(2) ⋯ P^(K) D (L^(K) U^(K)) ⋯ (L^(1) U^(1)) = P D ∏_{k=K..1} (S_1^(k) S_2^(k) ⋯ S_{n^(k)}^(k))

Two variants: PERM(0) and PERM(1).
Parallel computing PERM(0)

[Diagram: pipeline computing y = Ax — x passes through the factor stages S_1^(1), S_2^(1), S_3^(1), S_4^(1), S_1^(2), S_2^(2), S_3^(2), S_4^(2) and the permutation P to produce y.]
Parallel computing PERM(1)

[Diagram: as above, but with the extra stages S_0^(k) — x passes through S_0^(1), S_1^(1), …, S_4^(1), S_0^(2), S_1^(2), …, S_4^(2) and the permutation P to produce y.]
Parallel multiplication

For p processors implementing the multiplications of n pairs of numbers, the computational time is:

T* = ⌈n/p⌉
Parallel addition

[Diagram: computing one entry of S_1 x — the products s_{1,5} x_5, …, s_{1,16} x_16 are summed by a binary combining tree.]

T+ = ⌈log2 n⌉,                 if n < 2p
T+ = ⌈n/p⌉ + C·⌈log2 p⌉,       if n ≥ 2p
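The two-regime addition model can be expressed directly (the ceilings and the constant C for tree-combining overhead are modeling assumptions):

```python
import math

def parallel_add_time(n, p, C=1):
    """Parallel time to sum n numbers on p processors.

    n <  2p: a pure binary combining tree, ceil(log2 n) steps.
    n >= 2p: ceil(n/p) local additions per processor, then a
             combining tree costing C * ceil(log2 p) steps.
    """
    if n < 2 * p:
        return math.ceil(math.log2(n))
    return math.ceil(n / p) + C * math.ceil(math.log2(p))

assert parallel_add_time(16, 16) == 4        # tree only: log2(16)
assert parallel_add_time(64, 4) == 16 + 2    # 64/4 local steps + log2(4) combines
```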
Computational complexity *

N_1 = m^(0) →(n^(1)) m^(1) →(n^(2)) m^(2) → ⋯ → m^(K−1) →(n^(K)) m^(K) = N_2

For n^(k) m^(k) = m^(k−1), m^(0) = N_1, m^(K) = N_2, the parallel multiplication times are:

T*_PERM(1) = Σ_{k=1..K} (n^(k) + 1) m^(k) (m^(k−1) − m^(k)) / p ≈ Σ_{k=1..K} ((m^(k−1))² − (m^(k))²) / p = (N_1² − N_2²) / p

T*_PERM(0) = Σ_{k=1..K} (N_1 / m^(k−1)) n^(k) m^(k) (m^(k−1) − m^(k)) / p ≈ Σ_{k=1..K} N_1 (m^(k−1) − m^(k)) / p = N_1 (N_1 − N_2) / p

The sums telescope, so the multiplication time is independent of the blocking manner.
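The independence claim comes from telescoping: assuming a per-level multiplication count of (n^(k)+1) m^(k) (m^(k−1) − m^(k)) for PERM(1), the constraint n^(k) m^(k) = m^(k−1) makes each level contribute (m^(k−1))² − (m^(k))², so every chain from N_1 to N_2 gives the same total. A quick numeric check:

```python
def perm1_mult_time(chain, p):
    """Sum (n_k + 1) * m_k * (m_prev - m_k) over the blocking chain, over p,
    with n_k = m_prev / m_k so that n_k * m_k = m_prev."""
    total = 0
    for m_prev, m in zip(chain, chain[1:]):
        n = m_prev // m
        total += (n + 1) * m * (m_prev - m)   # equals m_prev**2 - m**2
    return total / p

N1, N2, p = 64, 1, 4
chains = [[64, 1], [64, 8, 1], [64, 16, 4, 1], [64, 32, 16, 8, 4, 2, 1]]
for chain in chains:
    # every chain telescopes to the same (N1^2 - N2^2) / p
    assert perm1_mult_time(chain, p) == (N1**2 - N2**2) / p
```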
Computational complexity +

N_1 = m^(0) →(n^(1)) m^(1) → ⋯ →(n^(K)) m^(K) = N_2

For n^(k) m^(k) = m^(k−1), m^(0) = N_1, m^(K) = N_2, the parallel addition times are:

T+_PERM(0) = Σ_{k=1..K_p} (N_1 / m^(k−1)) n^(k) m^(k) ((m^(k−1) − m^(k)) / p + C log2 p) + Σ_{k=K_p+1..K} (N_1 / m^(k−1)) n^(k) m^(k) log2 (m^(k−1) − m^(k))

T+_PERM(1) = Σ_{k=1..K_p} (n^(k) + 1) m^(k) ((m^(k−1) − m^(k)) / p + C log2 p) + Σ_{k=K_p+1..K} (n^(k) + 1) m^(k) log2 (m^(k−1) − m^(k))

There is a turning point K_p, where m^(K_p−1) − m^(K_p) is close to but less than 2p.
Blocking strategy

Since the parallel computational time has a turning point (ignoring factors such as communication time), we propose a three-phase blocking strategy:

N_1 = m^(0) →(n^(1)) m^(1) → ⋯ →(n^(K)) m^(K) = N_2

if N ≥ 2p:      N → 2p → ⋯
if 2 ≤ N < 2p:  N → ⋯ → 2
if N ≤ 2:       N → 1
Computational complexity

N_1 = m^(0) →(n^(1)) m^(1) → ⋯ →(n^(K)) m^(K) = N_2

For the full transform (N_1 = N, N_2 = 1), the times reduce to piecewise closed forms in N and p. For multiplications:

T*_PERM(N, p) = f*_1(N, p) ≈ (N/p + 1)(N/p − 1) p,   if p ≤ N/2
T*_PERM(N, p) = f*_2(N, p),                           if N/2 < p < N²/4
T*_PERM(N, p) = f*_3(N, p) = 5 log2 N − 4,            if p ≥ N²/4

The addition time T+_PERM(N, p) has the same three regimes, with an additional C log2 p tree-combining term per factor; for p ≥ N²/4 it is O(log² N).
Complexity comparison

T*_SERM(1)(N, p) = (N + 1)(N − 1) / p
T+_SERM(1)(N, p) = (N + 1)((N − 1) / p + C log2 p)

Operation        Method     p = O(N)      p = O(N²)
Multiplications  SERM(1)    O(N)          O(N)
                 PERM(1)    O(N)          O(log N)
Additions        SERM(1)    O(N log N)    O(N log N)
                 PERM(1)    O(N)          O(log² N)
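The parallel SERM bounds above in code (the ceilings are an assumption; the N + 1 factor applications are inherently sequential, which is why extra processors stop helping SERM):

```python
import math

def serm_mult_time(N, p):
    # N + 1 sequential SERM factors, each ceil((N-1)/p) multiplication steps
    return (N + 1) * math.ceil((N - 1) / p)

def serm_add_time(N, p, C=1):
    # each factor also pays a combining tree of C * ceil(log2 p) addition steps
    return (N + 1) * (math.ceil((N - 1) / p) + C * math.ceil(math.log2(p)))

# Even with p = N**2 processors, multiplications stay O(N):
assert serm_mult_time(64, 64**2) == 65
# Additions keep the O(N log N) tree term:
assert serm_add_time(64, 64) == (64 + 1) * (1 + 6)
```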
PERM vs. parallel SERM

[Figure: computational complexity (log scale, 1 to 10000) vs. number of processors p (1 to 1024) for N = 64, C = 1; curves: PERM multiplications, PERM additions, SERM multiplications, SERM additions.]
PERM vs. parallel SERM

[Figure: relative speedup (PERM over SERM, 0 to 10) vs. number of processors p (1 to 1024) for N = 64, C = 1; curves: PERM/SERM multiplications, PERM/SERM additions.]
Conclusions

For parallel computing:
• Increases the degree of parallelism
• Accommodates more processors

For sequential computing:
• May also be more efficient, with special matrix computation software such as BLAS

Caveat: more factorization levels may result in greater rounding error.