Matrix Factorizations for Parallel Integer Transforms
Transcript of Matrix Factorizations for Parallel Integer Transforms
Yiyuan She1,2,3, Pengwei Hao1,2, Yakup Paker2
1Center for Information Science, Peking University
2Queen Mary, University of London
3Department of Statistics, Stanford University
Contents
1. Introduction
2. Point & block factorizations
3. Parallel ERM factorization (PERM)
4. Parallel computational complexity
5. Matrix blocking strategy
6. Conclusions
Why reversible integer transforms?

[Diagram: lossless coding pipeline. Encoding: a B/W image goes through a spatial transform, a color image through a color-space transform, a multi-component image through the MCT; the coefficients are stored or communicated; decoding applies the inverse spatial transform, the inverse color transform, or the IMCT. Each stage asks: Lossless?]
How to implement?
• Wavelet constructions:
– S transform (Blume & Fand, 1989)
– TS transform (Zandi et al., 1995)
– S+P transform (Said & Pearlman, 1996)
• Ladder structure (Bruekers & van den Enden, 1992)
• Lifting scheme (2D, Sweldens, 1996)
• Approximated color transform (Gormish et al., 1997)
• General wavelet transform (2D, Daubechies et al., 1998)
Matrix factorizations
• P. Hao and Q. Shi, "Invertible linear transforms implemented by integer mapping," Science in China, Series E (in Chinese), 2000, 30, pp. 132-141.
• P. Hao and Q. Shi, "Matrix factorizations for reversible integer mapping," IEEE Trans. Signal Processing, 2001, 49, pp. 2314-2324.
• P. Hao and Q. Shi, "Proposal of reversible integer implementation for multiple component transforms," ISO/IEC JTC1/SC29/WG1 N1720, Arles, France, 2000.
• Y. She and P. Hao, "A block TERM factorization of nonsingular uniform block matrices," Science in China, Series E (in Chinese), 2004, 34(2).
Can we make it more efficient?
• Fewer factor matrices
• Less rounding error
• Integer computation
• Parallel computing
How to increase the degree of parallelism?
Elementary reversible structure

[Diagram: ladder structure — the forward step multiplies x by the integer factor j and adds the rounded value [b]; the inverse step subtracts the same [b] and multiplies by 1/j.]

• Integer factor: j
• Flexible rounding: round(), floor(), ceil(), …
• Generalized lifting scheme: for j = 1, it is the same as the ladder structure and the lifting scheme
• Implementation: y = jx + [b] and x = (1/j)·(y − [b])
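As a sketch of the implementation line above (the predictor b is left generic on the slide; b = x1/2 below is a hypothetical choice, and floor() stands in for the flexible rounding):

```python
import math

def erm_forward(x, j=1):
    """Forward elementary reversible step on a pair (x0, x1): y0 = j*x0 + [b]."""
    b = x[1] / 2                       # hypothetical predictor depending only on x1
    return (j * x[0] + math.floor(b), x[1])

def erm_inverse(y, j=1):
    """Exact inverse: recompute the same [b] from the untouched x1 = y1,
    subtract it, and divide by the integer factor j (exact for j = +/-1)."""
    b = y[1] / 2
    return ((y[0] - math.floor(b)) // j, y[1])

x = (3, 5)
assert erm_inverse(erm_forward(x)) == x              # j = 1: lifting/ladder step
assert erm_inverse(erm_forward(x, j=-1), j=-1) == x  # j = -1 also inverts exactly
```

Reversibility holds for any integer inputs because the inverse recomputes the identical rounded value [b].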
Elementary reversible matrix (ERM)
• Diagonal elements: integer factors
• Triangular ERM (TERM): upper TERM, lower TERM
• Single-row ERM (SERM): only one row of off-diagonal nonzeros

S_m = J + e_m s_m^T
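A unit-diagonal SERM stays reversible even with rounding, because the one row it updates does not feed its own predictor. A minimal sketch (the vector s and the row index are hypothetical):

```python
import math

def serm_forward(x, m, s):
    """y = x + e_m * [s^T x]: only entry m changes (requires s[m] == 0)."""
    assert s[m] == 0
    y = list(x)
    y[m] += math.floor(sum(si * xi for si, xi in zip(s, x)))
    return y

def serm_inverse(y, m, s):
    """x = y - e_m * [s^T y]; exact because s[m] = 0 implies s^T y = s^T x."""
    x = list(y)
    x[m] -= math.floor(sum(si * yi for si, yi in zip(s, y)))
    return x

x = [4, -2, 7]
s = [0.5, 0, 0.25]      # one off-diagonal row; s[1] = 0 keeps it reversible
assert serm_inverse(serm_forward(x, 1, s), 1, s) == x
```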
Point factorizations (PLUS)

A = PLUS = P D_R L U S_0 S_1 ⋯ S_{N−1},  if det(P^T A) = det D_R ≠ 0
S_0 = I + e_N s_0^T = I + e_N · [s_1, s_2, …, s_{N−1}, 0]
D_R = Diag(1, 1, …, 1, det(P^T A))
S_m = I + e_m s_m^T
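For intuition, here is a hypothetical 2-by-2 instance of the idea behind such factorizations: a matrix with det A = 1 and a nonzero pivot c factors into three unit-diagonal triangular (ERM-type) matrices:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])            # det(A) = 1, c = A[1,0] != 0
a, b, c, d = A.ravel()

# Unit-diagonal triangular factors, derived by matching entries:
U1 = np.array([[1.0, (a - 1) / c], [0.0, 1.0]])
L  = np.array([[1.0, 0.0],         [c,   1.0]])
U2 = np.array([[1.0, (d - 1) / c], [0.0, 1.0]])

assert np.allclose(U1 @ L @ U2, A)   # A = U1 * L * U2
```

Each factor is a single lifting step, so the whole product maps integers to integers reversibly once rounding is inserted.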
Block factorizations (BLUS)

A = PLUS = P D_R L U S_0 S_1 ⋯ S_{N−1},  if DET(P^T A) = DET(D_R) exists
S_0 = I + e_N s_0^T = I + e_N · [s_1, s_2, …, s_{N−1}, 0]
D_R = Diag(I, I, …, I, DET(P^T A))
S_m = I + e_m s_m^T
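The block case replaces scalar pivots with invertible blocks. A minimal sketch of one block elimination step (the matrix and the 2-by-2 blocking are chosen for illustration):

```python
import numpy as np

A = np.array([[2., 1., 0., 3.],
              [1., 1., 2., 0.],
              [0., 2., 1., 1.],
              [3., 0., 1., 2.]])
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]      # the block pivot A11 must be invertible

inv11 = np.linalg.inv(A11)
I2, Z = np.eye(2), np.zeros((2, 2))
L = np.block([[I2, Z], [A21 @ inv11, I2]])               # block unit lower triangular
U = np.block([[A11, A12], [Z, A22 - A21 @ inv11 @ A12]]) # Schur complement block in U
assert np.allclose(L @ U, A)
```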
Parallel factorizations (PERM)

N_1 = m^(0) →(n^(1)) m^(1) →(n^(2)) m^(2) → ⋯ → m^(K−1) →(n^(K)) m^(K) = N_2

A = P^(1) P^(2) ⋯ P^(K) D (L^(K) U^(K)) ⋯ (L^(1) U^(1)) = P D ∏_{k=K..1} (S_1^(k) S_2^(k) ⋯ S_{n^(k)}^(k))

Two variants: PERM(0) and PERM(1).
Parallel computing PERM(0)

[Diagram: pipeline computing y = Ax — x passes through the factor stages S_1^(1), S_2^(1), S_3^(1), S_4^(1), S_1^(2), S_2^(2), S_3^(2), S_4^(2) and the permutation P to produce y.]
Parallel computing PERM(1)

[Diagram: as above, but with the extra stages S_0^(k) — x passes through S_0^(1), S_1^(1), …, S_4^(1), S_0^(2), S_1^(2), …, S_4^(2) and the permutation P to produce y.]
Parallel multiplication

For p processors implementing the multiplications of n pairs of numbers, the computational time is:

T* = ⌈n/p⌉
Parallel addition

[Diagram: computing one entry of S_1 x — the products s_{1,5} x_5, …, s_{1,16} x_16 are summed by a binary combining tree.]

T+ = ⌈log2 n⌉,                 if n < 2p
T+ = ⌈n/p⌉ + C·⌈log2 p⌉,       if n ≥ 2p
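The two-regime addition model can be expressed directly (the ceilings and the constant C for tree-combining overhead are modeling assumptions):

```python
import math

def parallel_add_time(n, p, C=1):
    """Parallel time to sum n numbers on p processors.

    n <  2p: a pure binary combining tree, ceil(log2 n) steps.
    n >= 2p: ceil(n/p) local additions per processor, then a
             combining tree costing C * ceil(log2 p) steps.
    """
    if n < 2 * p:
        return math.ceil(math.log2(n))
    return math.ceil(n / p) + C * math.ceil(math.log2(p))

assert parallel_add_time(16, 16) == 4        # tree only: log2(16)
assert parallel_add_time(64, 4) == 16 + 2    # 64/4 local steps + log2(4) combines
```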
Computational complexity *

N_1 = m^(0) →(n^(1)) m^(1) →(n^(2)) m^(2) → ⋯ → m^(K−1) →(n^(K)) m^(K) = N_2

For n^(k) m^(k) = m^(k−1), m^(0) = N_1, m^(K) = N_2, the parallel multiplication times are:

T*_PERM(1) = Σ_{k=1..K} (n^(k) + 1) m^(k) (m^(k−1) − m^(k)) / p ≈ Σ_{k=1..K} ((m^(k−1))² − (m^(k))²) / p = (N_1² − N_2²) / p

T*_PERM(0) = Σ_{k=1..K} (N_1 / m^(k−1)) n^(k) m^(k) (m^(k−1) − m^(k)) / p ≈ Σ_{k=1..K} N_1 (m^(k−1) − m^(k)) / p = N_1 (N_1 − N_2) / p

The sums telescope, so the multiplication time is independent of the blocking manner.
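The independence claim comes from telescoping: assuming a per-level multiplication count of (n^(k)+1) m^(k) (m^(k−1) − m^(k)) for PERM(1), the constraint n^(k) m^(k) = m^(k−1) makes each level contribute (m^(k−1))² − (m^(k))², so every chain from N_1 to N_2 gives the same total. A quick numeric check:

```python
def perm1_mult_time(chain, p):
    """Sum (n_k + 1) * m_k * (m_prev - m_k) over the blocking chain, over p,
    with n_k = m_prev / m_k so that n_k * m_k = m_prev."""
    total = 0
    for m_prev, m in zip(chain, chain[1:]):
        n = m_prev // m
        total += (n + 1) * m * (m_prev - m)   # equals m_prev**2 - m**2
    return total / p

N1, N2, p = 64, 1, 4
chains = [[64, 1], [64, 8, 1], [64, 16, 4, 1], [64, 32, 16, 8, 4, 2, 1]]
for chain in chains:
    # every chain telescopes to the same (N1^2 - N2^2) / p
    assert perm1_mult_time(chain, p) == (N1**2 - N2**2) / p
```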
Computational complexity +

N_1 = m^(0) →(n^(1)) m^(1) → ⋯ →(n^(K)) m^(K) = N_2

For n^(k) m^(k) = m^(k−1), m^(0) = N_1, m^(K) = N_2, the parallel addition times are:

T+_PERM(0) = Σ_{k=1..K_p} (N_1 / m^(k−1)) n^(k) m^(k) ((m^(k−1) − m^(k)) / p + C log2 p) + Σ_{k=K_p+1..K} (N_1 / m^(k−1)) n^(k) m^(k) log2 (m^(k−1) − m^(k))

T+_PERM(1) = Σ_{k=1..K_p} (n^(k) + 1) m^(k) ((m^(k−1) − m^(k)) / p + C log2 p) + Σ_{k=K_p+1..K} (n^(k) + 1) m^(k) log2 (m^(k−1) − m^(k))

There is a turning point K_p, where m^(K_p−1) − m^(K_p) is close to but less than 2p.
Blocking strategy

Since the parallel computational time has a turning point (ignoring factors such as communication time), we propose a three-phase blocking strategy:

N_1 = m^(0) →(n^(1)) m^(1) → ⋯ →(n^(K)) m^(K) = N_2

if N ≥ 2p:      N → 2p → ⋯
if 2 ≤ N < 2p:  N → ⋯ → 2
if N ≤ 2:       N → 1
Computational complexity

N_1 = m^(0) →(n^(1)) m^(1) → ⋯ →(n^(K)) m^(K) = N_2

For the full transform (N_1 = N, N_2 = 1), the times reduce to piecewise closed forms in N and p. For multiplications:

T*_PERM(N, p) = f*_1(N, p) ≈ (N/p + 1)(N/p − 1) p,   if p ≤ N/2
T*_PERM(N, p) = f*_2(N, p),                           if N/2 < p < N²/4
T*_PERM(N, p) = f*_3(N, p) = 5 log2 N − 4,            if p ≥ N²/4

The addition time T+_PERM(N, p) has the same three regimes, with an additional C log2 p tree-combining term per factor; for p ≥ N²/4 it is O(log² N).
Complexity comparison

T*_SERM(1)(N, p) = (N + 1)(N − 1) / p
T+_SERM(1)(N, p) = (N + 1)((N − 1) / p + C log2 p)

Operation        Method     p = O(N)      p = O(N²)
Multiplications  SERM(1)    O(N)          O(N)
                 PERM(1)    O(N)          O(log N)
Additions        SERM(1)    O(N log N)    O(N log N)
                 PERM(1)    O(N)          O(log² N)
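The parallel SERM bounds above in code (the ceilings are an assumption; the N + 1 factor applications are inherently sequential, which is why extra processors stop helping SERM):

```python
import math

def serm_mult_time(N, p):
    # N + 1 sequential SERM factors, each ceil((N-1)/p) multiplication steps
    return (N + 1) * math.ceil((N - 1) / p)

def serm_add_time(N, p, C=1):
    # each factor also pays a combining tree of C * ceil(log2 p) addition steps
    return (N + 1) * (math.ceil((N - 1) / p) + C * math.ceil(math.log2(p)))

# Even with p = N**2 processors, multiplications stay O(N):
assert serm_mult_time(64, 64**2) == 65
# Additions keep the O(N log N) tree term:
assert serm_add_time(64, 64) == (64 + 1) * (1 + 6)
```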
PERM vs. parallel SERM

[Figure: computational complexity (log scale, 1 to 10000) vs. number of processors p (1 to 1024) for N = 64, C = 1; curves: PERM multiplications, PERM additions, SERM multiplications, SERM additions.]
PERM vs. parallel SERM

[Figure: relative speedup (PERM over SERM, 0 to 10) vs. number of processors p (1 to 1024) for N = 64, C = 1; curves: PERM/SERM multiplications, PERM/SERM additions.]
Conclusions

For parallel computing:
• Increases the degree of parallelism
• Accommodates more processors

For sequential computing:
• May also be more efficient, with special matrix computation software such as BLAS

Caveat: more factorization levels may result in greater rounding error.