Relational Query Processing Approach to Compiling Sparse Matrix Codes
1
Relational Query Processing Approach to Compiling Sparse Matrix Codes
Vladimir Kotlyar
Computer Science Department, Cornell University
http://www.cs.cornell.edu/Info/Project/Bernoulli
2
Outline
• Problem statement
– Sparse matrix computations
– Importance of sparse matrix formats
– Difficulties in the development of sparse matrix codes
• State-of-the-art restructuring compiler technology
• Technical approach and experimental results
• Ongoing work and conclusions
3
Sparse Matrices and Their Applications
• Number of non-zeroes per row/column << n
• Often, less than 0.1% non-zero
• Applications:
– Numerical simulations, (non)linear optimization, graph theory, information retrieval, ...
[Figure: a sparse matrix; most entries are zero]
4
Application: numerical simulations
• Fracture mechanics Grand Challenge project:
– Cornell CS + Civil Eng. + other schools
– supported by NSF, NASA, Boeing
• A system of differential equations is
solved over a continuous domain
• Discretized into an algebraic system in variables x(i)
• System of linear equations Ax=b is at the core
• Intuition: A is sparse because the physical interactions are local
[Figure: fracture simulation result (crack.eps, MATLAB)]
5
Application: Authoritative sources on the Web
• Hubs and authorities on the Web
• Graph G=(V,E) of the documents
• A(u,v) = 1 if (u,v) is an edge
• A is sparse!
• Eigenvectors of AAᵀ identify hubs, authorities and their clusters (“communities”) [Kleinberg, Raghavan ’97]
6
Sparse matrix algorithms
• Solution of linear systems
– Direct methods (Gaussian elimination): A = LU
• Impractical for many large-scale problems
• For certain problems: O(n) space, O(n) time
– Iterative methods
• Matrix-vector products: y = Ax
• Triangular system solution: Lx = b
• Incomplete factorizations: A ≈ LU
• Eigenvalue problems:
– Mostly matrix-vector products + dense computations
7
Sparse matrix computations
• “DOANY” -- operations in any order
– Vector ops (dot product, addition, scaling)
– Matrix-vector products
– Rarely used: C = A+B
– Important: C ← A+B, A ← A + UVᵀ
• “DOACROSS” -- dependencies between operations
– Triangular system solution: Lx = b
• More complex applications are built out of the above + dense kernels
• Preprocessing (e.g. storage allocation): “graph theory”
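The DOACROSS dependence in Lx = b can be made concrete with a small sketch: forward substitution with a lower-triangular L in compressed row storage. This is a minimal illustration in plain Python; the array names rowp/colind/vals are hypothetical, not from the toolkit.

```python
# Forward substitution Lx = b, L lower-triangular in CRS:
# row i occupies entries rowp[i] .. rowp[i+1]-1 of colind/vals.
# DOACROSS: x(i) needs x(j) for all earlier j, so rows must run in order.

def solve_lower_crs(rowp, colind, vals, b):
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        diag = None
        for k in range(rowp[i], rowp[i + 1]):
            j = colind[k]
            if j == i:
                diag = vals[k]          # diagonal entry of row i
            else:
                s -= vals[k] * x[j]     # uses results of earlier rows
        x[i] = s / diag
    return x

# L = [[2, 0], [1, 4]], b = [2, 9]
x = solve_lower_crs([0, 1, 3], [0, 0, 1], [2.0, 1.0, 4.0], [2.0, 9.0])
```

Because of this row-to-row dependence, the loop cannot be reordered freely the way a DOANY matrix-vector product can.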
8
Outline
• Problem statement
– Sparse matrix computations
– Sparse Matrix Storage Formats
– Difficulties in the development of sparse matrix codes
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
9
Storing Sparse Matrices
• Compressed formats are essential
– O(nnz) time/space, not O(n²)
– Example: matrix-vector product
• 10M rows/columns, 50 non-zeroes per row
• 5 seconds vs 139 hours on a 200 Mflops computer (assuming huge memory)
• A variety of formats are used in practice– Application/architecture dependent– Different memory usage– Different performance on RISC processors
10
Point formats
[Figure: a 4×4 example matrix stored in two point formats]
• Coordinate: parallel arrays of row indices, column indices, and values, in any order
• Compressed Column Storage: values and row indices stored column by column, plus an array of column start pointers
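The two point formats can be sketched side by side. A minimal illustration in plain Python, using a small made-up matrix (not the one from the figure); both encodings reconstruct the same dense array.

```python
# The same 3x3 matrix in two point formats:
#   [[5, 0, 0],
#    [0, 0, 7],
#    [1, 0, 2]]

# Coordinate (COO): (row, col, val) triples, in any order.
coo = [(0, 0, 5.0), (2, 0, 1.0), (1, 2, 7.0), (2, 2, 2.0)]

# Compressed Column Storage (CCS): column j occupies
# entries colp[j] .. colp[j+1]-1 of rowind/vals.
colp   = [0, 2, 2, 4]
rowind = [0, 2, 1, 2]
vals   = [5.0, 1.0, 7.0, 2.0]

def dense_from_coo(coo, n):
    A = [[0.0] * n for _ in range(n)]
    for i, j, v in coo:
        A[i][j] = v
    return A

def dense_from_ccs(colp, rowind, vals, n):
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(colp[j], colp[j + 1]):
            A[rowind[k]][j] = vals[k]
    return A

same = dense_from_coo(coo, 3) == dense_from_ccs(colp, rowind, vals, 3)
```

Note the trade-off: COO permits enumeration in any order, while CCS gives fast access to whole columns, which is what drives the code-generation choices later.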
11
Block formats
• Block Sparse Column
[Figure: a matrix with dense 2-by-2 blocks, and its block-sparse storage arrays]
• “Natural” for physical problems with several unknowns at each point in space
• Saves storage: 25% for 2-by-2 blocks
• Improves performance on modern RISC processors
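A block-format matrix-vector product can be sketched as follows (a minimal illustration in plain Python, using Block Sparse Row rather than Block Sparse Column for readability; names browp/bcolind/blocks are hypothetical). The inner 2×2 kernel has fixed bounds, which is one reason block formats run faster on RISC processors.

```python
# y = A*x with A in 2-by-2 Block Sparse Row (BSR).
# Block row ib occupies entries browp[ib] .. browp[ib+1]-1;
# bcolind[k] is the block-column of stored block k (a dense 2x2 list).

def bsr_matvec(browp, bcolind, blocks, x, bs=2):
    nb = len(browp) - 1
    y = [0.0] * (nb * bs)
    for ib in range(nb):                       # over block rows
        for k in range(browp[ib], browp[ib + 1]):
            jb = bcolind[k]                    # block column
            for r in range(bs):                # fixed-size dense kernel
                for c in range(bs):
                    y[ib * bs + r] += blocks[k][r][c] * x[jb * bs + c]
    return y

# A = [[1,2,0,0],[3,4,0,0],[0,0,5,0],[0,0,0,6]]: two stored blocks.
y = bsr_matvec([0, 1, 2], [0, 1],
               [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 0.0], [0.0, 6.0]]],
               [1.0, 1.0, 1.0, 1.0])
```

One index per block instead of one per entry is also where the 25% storage saving for 2-by-2 blocks comes from.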
12
Why multiple formats: performance
• Sparse matrix-vector product
• Formats: CRS, Jagged diagonal, BlockSolve
• On IBM RS6000 (66.5 MHz Power2)
• Best format depends on the application (20-70% advantage)
[Chart: Mflops achieved by the CRS, JDIAG, and BlockSolve formats on a set of benchmark matrices]
13
Bottom line
• Sparse matrices are used in a variety of application areas
• Have to be stored in compressed data structures
• Many formats are used in practice
– Different storage/performance characteristics
• Code development is tedious and error-prone
– No random access
– Different code for each format
– Even worse in parallel (many ways to distribute the data)
14
Libraries
• Dense computations: Basic Linear Algebra Subroutines (BLAS)
– Implemented by most computer vendors
– Few formats, easy to parametrize: row/column-major, symmetric/unsymmetric, etc.
• Other computations are built on top of BLAS
• Can we do the same for sparse matrices?
15
Sparse Matrix Libraries
• Sparse Basic Linear Algebra Subroutine (SPBLAS) library [Pozo, Remington @ NIST]
– 13 formats ==> too many combinations of “A op B”
– Some important ops are not supported
– Not extensible
• Coarse-grain solver packages [BlockSolve, Aztec, …]
– Particular class of problems/algorithms (e.g. iterative solution)
– OO approaches: hooks for basic ops (e.g. matrix-vector product)
16
Our goal: generate sparse codes automatically
• Permit user-defined sparse data structures
• Specialize high-level algorithm for sparsity, given the formats
FOR I=1,N
  sum = sum + X(I)*Y(I)

FOR I=1,N such that X(I)≠0 and Y(I)≠0
  sum = sum + X(I)*Y(I)

executable code
17
Input to the compiler
• FOR-loops are sequential
• DO-loops can be executed in any order (“DOANY”)
• Convert dense DO-loops into sparse code:

DO I=1,N; J=1,N
  Y(I) = Y(I) + A(I,J)*X(J)

for(j=0; j<N; j++)
  for(ii=colp(j); ii<colp(j+1); ii++)
    Y(rowind(ii)) = Y(rowind(ii)) + vals(ii)*X(j);
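The generated loop can be mirrored in a small runnable sketch (plain Python, same hypothetical array names as above): the dense 2-deep loop nest specialized for A in compressed column storage, touching only stored non-zeroes, so the cost is O(nnz) rather than O(n²).

```python
# y = A*x with A in compressed column storage (CCS).
def ccs_matvec(colp, rowind, vals, x, n):
    y = [0.0] * n
    for j in range(n):                         # outer loop over columns
        for k in range(colp[j], colp[j + 1]):  # stored entries of column j
            y[rowind[k]] += vals[k] * x[j]
    return y

# A = [[5,0,0],[0,0,7],[1,0,2]], x = [1,1,1]
y = ccs_matvec([0, 2, 2, 4], [0, 2, 1, 2], [5.0, 1.0, 7.0, 2.0],
               [1.0, 1.0, 1.0], 3)
```

Note that the loop order had to become column-outer to match the format, which is exactly the kind of decision the compiler must make.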
18
Outline
• Problem statement
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
19
An example: locality enhancement
• Matrix-vector product, array A stored in column-major order
FOR I=1,N
  FOR J=1,N
    Y(I) = Y(I) + A(I,J)*X(J)      (stride-N access to A)

• Would like to execute the code as:

FOR J=1,N
  FOR I=1,N
    Y(I) = Y(I) + A(I,J)*X(J)      (stride-1 access to A)

• In general?
20
An abstraction: polyhedra
• Loop nests == polyhedra in integer spaces

FOR I=1,N
  FOR J=1,I
    …..

{(i,j) : 1 ≤ j ≤ i ≤ N}

• Transformations = maps between polyhedra
• Used in production and research compilers (SGI, HP, IBM)
21
Caveat
• The polyhedral model is not applicable to sparse computations

FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN Y(I) = Y(I) + A(I,J)*X(J)

{(i,j) : 1 ≤ i ≤ N, 1 ≤ j ≤ N, A(i,j) ≠ 0}

• Not a polyhedron
What is the right formalism?
22
Extensions for sparse matrix code generation
FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN Y(I)=Y(I)+A(I,J)*X(J)

• A is sparse, compressed by column
• Interchange the loops, encapsulate the guard:

FOR J=1,N
  FOR I=1,N such that A(I,J) ≠ 0
    ...

• “Control-centric” approach: transform the loops to match the best access to data [Bik, Wijshoff]
23
Limitations of the control-centric approach
• Requires a well-defined direction of access:

CCS: (J,I) loop order    CRS: (I,J) loop order    COORDINATE: ????
24
Outline
• Problem statement
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
25
Data-centric transformations
• Main idea: concentrate on the data
DO I=…..; J=…..
  ….. A(F(I,J)) …..

• Array access function: <row,column> = F(I,J)
• Example: coordinate storage format
26
Data-centric sparse code generation
• If only a single sparse array:
FOR <row,column,value> in A
  I = row; J = column
  Y(I) = Y(I) + value*X(J)
• For each data structure provide an enumeration method
• What if more than one sparse array?– Need to produce efficient simultaneous enumeration
27
Efficient simultaneous enumeration
DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN sum = sum + X(I)*Y(I)

• Options:
– Enumerate X, search Y: “data-centric on” X
– Enumerate Y, search X: “data-centric on” Y
– Can speed up searching by scattering into a dense vector
– If both sorted: “2-finger” merge
• Best choice depends on how X and Y are stored
• What is the general picture?
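The “2-finger” merge can be sketched directly (a minimal illustration in plain Python, assuming each sparse vector is stored as a sorted index list plus a parallel value list; the names are hypothetical). Each list is scanned once, so the cost is O(nnz(X) + nnz(Y)).

```python
# Sparse dot product by "2-finger" merge of two sorted index lists.
def sparse_dot_merge(xi, xv, yi, yv):
    s, p, q = 0.0, 0, 0
    while p < len(xi) and q < len(yi):
        if xi[p] == yi[q]:          # indices match: both entries non-zero
            s += xv[p] * yv[q]
            p += 1
            q += 1
        elif xi[p] < yi[q]:         # advance the finger that lags behind
            p += 1
        else:
            q += 1
    return s

# x has non-zeroes 2,4,6 at indices 1,3,7; y has 1,1,2 at 3,5,7.
s = sparse_dot_merge([1, 3, 7], [2.0, 4.0, 6.0],
                     [3, 5, 7], [1.0, 1.0, 2.0])
```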
28
An observation
DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN sum = sum + X(I)*Y(I)

• Can view arrays as relations (as in “relational databases”): X(i,x), Y(i,y)
• Have to enumerate solutions to the relational query Join(X(i,x), Y(i,y))
29
Connection to relational queries
• Dot product → Join(X,Y)
• General case?

Dot product            Equi-join
Enumerate/Search       Enumerate/Search
Scatter                Hash join
“2-finger”             (Sort-)Merge join
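The scatter strategy, the analogue of a hash join, can be sketched the same way (plain Python, hypothetical names; a dict stands in for the dense work array or hash table an implementation would scatter into). Unlike the merge, it needs no sorted order on either vector.

```python
# Sparse dot product by "scatter": build a table keyed by index from x,
# then probe it while enumerating y -- the hash-join evaluation strategy.
def sparse_dot_scatter(xi, xv, yi, yv):
    table = dict(zip(xi, xv))       # scatter x into a lookup structure
    return sum(v * table.get(i, 0.0) for i, v in zip(yi, yv))

# Same vectors as the merge example, but the index lists need not be sorted.
s = sparse_dot_scatter([7, 1, 3], [6.0, 2.0, 4.0],
                       [3, 5, 7], [1.0, 1.0, 2.0])
```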
30
From loop nests to relational queries
DO I, J, K, ...
  ….. A(F(I,J,K,...)) ….. B(G(I,J,K,...)) …..
• Arrays are relations (e.g. A(r,c,a))
– Implicitly store zeros and non-zeros
• Integer space of loop variables is a relation, too: Iter(i,j,k,…)
• Access predicate S: relates loop variables and array elements
• Sparsity predicate P: “interesting” combination of zeros/non-zeros
Select(P, Select(S ∧ Bounds, Product(Iter, A, B, …)))
31
Why relational queries?
“[The relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.”
— E.F. Codd (CACM, 1970)
• Want to separate what is to be computed from how
32
Bernoulli Sparse Compilation Toolkit
• BSCT is about 40K lines of ML + 9K lines of C
• Query optimizer at the core
• Extensible: new formats can be added
[Diagram: Input program → Front-end → Query → Optimizer → Plan → Instantiator → Low-level C code; format implementations (CRS, CCS, BRS, Coordinate, ….) supply abstract properties to the Optimizer and macros to the Instantiator]
33
Query optimization: ordering joins
select(a ≠ 0 ∧ x ≠ 0, join(A(i,j,a), X(j,x), Y(i,y)))

A in CCS: Join(Join(A,X), Y)
  FOR J in Join(A,X)
    FOR I in Join(A(*,J), Y)

A in CRS: Join(Join(A,Y), X)
  FOR I in Join(A,Y)
    FOR J in Join(A(I,*), X)
34
Query optimization: implementing joins
FOR I in Join(A,Y)
  FOR J in Join(A(I,*), X)
    …..

FOR I in Merge(A,Y)
  H = scatter(X)
  FOR J in (enumerate A(I,*), search H)
    …..
• Output is called a plan
35
Instantiator: executable code generation
H = scatter(X)
FOR I in Merge(A,Y)
  FOR J in (enumerate A(I,*), search H)
    …..

for(I=0; I<N; I++)
  for(JJ=ROWP(I); JJ<ROWP(I+1); JJ++)
    …..
• Macro expansion
• Open system
36
Summary of the compilation techniques
• Data-centric methodology: walk the data, compute accordingly
• Implementation for sparse arrays
– arrays = relations, loop nests = queries
• Compilation path
– Main steps are independent of data structure implementations
• Parallel code generation
– Ownership, communication sets, ... = relations
• Difference from traditional relational database query optimization
– Selectivity of predicates not an issue; affine joins
37
Experiments
• Sequential
– Kernels from SPBLAS library
– Iterative solution of linear systems
• Parallel
– Iterative solution of linear systems
– Comparison with the BlockSolve library from Argonne NL
– Comparison with the proposed “High-Performance Fortran” standard
38
Setup
• IBM SP-2 at Cornell
• 120 MHz P2SC processor at each node
– Can issue 2 multiply-add instructions per cycle
– Peak performance 480 Mflops
– Much lower on sparse problems: < 100 Mflops
• Benchmark matrices
– From Harwell-Boeing collection
– Synthetic problems
39
Matrix-vector products
• BSR = “Block Sparse Row”
• VBR = “Variable Block Sparse Row”
• BSCT_OPT = some “Dragon Book” optimizations by hand
– Loop invariant removal

[Charts: BSR/MV and VBR/MV matrix-vector product Mflops vs. block size (5–25), comparing LIB (SPBLAS), BSCT, and BSCT_OPT]
40
Solution of Triangular Systems
• Bottom line:
– Can compete with the SPBLAS library (need to implement loop invariant removal :-)

[Charts: BSR/TS and VBR/TS triangular solve Mflops vs. block size (5–25), comparing LIBRARY, BSCT, and BSCT_OPT]
41
Iterative solution of sparse linear systems
• Essential for large-scale simulations
• Preconditioned Conjugate Gradients (PCG) algorithm
– Basic kernels: y = Ax, Lx = b + dense vector ops
• Preprocessing step
– Find M such that M⁻¹A ≈ I
– Incomplete Cholesky factorization (ICC): A ≈ CCᵀ
– Basic kernels: A ← A − uvᵀ, sparse vector scaling
– Cannot be implemented using the SPBLAS library
• Used CCS format (“natural” for ICC)
42
Iterative Solution
• ICC: a lot of “sparse overhead”
• Ongoing investigation (at MathWorks):
– Our compiler-generated ICC is 50-100 times faster than the Matlab implementation!

ICC/PCG performance:
Matrix size    2000   4000   8000   16000
ICC (Mflops)   5.86   4.22   2.9    1.9
PCG (Mflops)   48     47     46     40
43
Iterative solution (cont.)
• Preliminary comparison with IBM ESSL DSRIS
– DSRIS implements PCG (among other things)
• On BCSSTK30; values were set to vary the convergence
• BSCT ICC takes 1.28 secs
• ESSL DSRIS preprocessing (ILU+??) takes ~5.09 secs
• PCG iterations are ~15% faster in ESSL
44
Parallel experiments
• Conjugate Gradient algorithm
– vs BlockSolve library (Argonne NL)
• “Inspector” phase
– Pre-computes what communication needs to occur
– Done once, might be expensive
• “Executor” phase
– “Receive-compute-send-...”
• On Harwell-Boeing matrices
• On synthetic grid problems to understand scalability
45
Executor performance
• Grid problems: problem size per processor is constant
– 135K rows, ~4.6M non-zeroes
• Within 2-4% of the library

[Charts: executor time (seconds) vs. number of processors for BCSSTK32 (2–16 processors) and for grid problems (2–64 processors), comparing Bsolve and BSCT]
46
Inspector overhead
• Ratio of the inspector to a single iteration of the executor
– A problem-independent measure
• “HPF-2” -- new data-parallel Fortran standard
– Lowest-common denominator; inspectors are not scalable

[Charts: inspector-to-iteration ratio vs. number of processors for BCSSTK32 (2–16 processors) and for grid problems (2–64 processors), comparing Bsolve, BSCT, and HPF-2]
47
Experiments: summary
• Sequential
– Competitive with SPBLAS library
• Parallel
– Inspector phase should exploit formats (cf. HPF-2)
48
Outline
• Problem statement
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
50
Ongoing work
• Packaging
– “Library-on-demand”; as a Matlab toolbox
– Completely automatic tool; data structure selection
• Out-of-core computations
• Parallel code generation
– Extend to handle more kernels
• Core of the compiler
– Disjunctive queries, fill
51
Related work - compilers
• Polyhedral model [Lamport ’78, …]
• Sparse compilation [Bik,Wijshoff ‘92]
• Support for sparse computations in HPF-2 [Saltz, Ujaldon et al.]
– Fixed data structures
– Separate compilation path for dense/sparse
• Data-centric blocking [Kodukula,Ahmed,Pingali ‘97]
52
Related work - languages and compilers
• Polyhedral model [Lamport ’78, …]
• Parallelizing compilers/parallel languages
– e.g. HPF, ZPL
• Transformational programming [Gries]
• Software engineering through views [Novak]
• Sparse compilation [Bik,Wijshoff ‘92]
• Support for sparse computations in HPF-2 [Saltz,Ujaldon et al]
• Data-centric blocking [Kodukula,Ahmed,Pingali ‘97]
53
Related work - languages and compilers
• Compilation of dense matrix (regular) codes
– Polyhedral model [Lamport ‘78, ...]
– Data-parallel languages (e.g. HPF, ZPL)
• Compilation of sparse matrix codes
– [Bik, Wijshoff] -- sequential sparse compiler
– [Saltz, Zima, Ujaldon, …] -- irregular computations in HPF-2
– Fixed data structures, not extensible
• Programming with ADTs
– SETL -- automatic data structure selection for set operations
– Transformational systems (e.g. Polya [Gries])
– Software reuse through views [Novak]
54
Related work - databases
• Optimizing loops in DB programming languages [Lieuwen, DeWitt ’90]
• Extensible database systems [Predator, …]
[Diagram: Predator -- an extensible DBMS with an optimizer (OPT) per data type (relations, sequences, images, ...) over shared utilities; analogously, the BSCT compiler sits over per-format utilities (CRS, COORD, BSR, ...)]
55
Conclusions/Contributions
• Sparse matrix computations are widely used in practice
• Code development is tedious and error-prone
• Bernoulli Sparse Compilation Toolkit:
– Arrays as relations, loops as queries
– Compilation as query optimization
• Algebras in optimizing compilers:

Data-flow analysis           Lattice algebra
Dense matrix computations    Polyhedral algebra
Sparse matrix computations   Relational algebra
56
Future work
• Sparse compiler
– Completely automatic system, data structure selection
– Out-of-core computations
• Compilation of signal/image processing applications
– dense matrix computations + fast transforms (e.g. DCT)
– multiple data representations
• Programming languages and databases
– e.g. Java and JDBC
57
Future work
• Sparse compiler
– Completely automatic system, data structure selection
– Out-of-core computations
• Extensible compilation
– Want to support multiple formats, optimizations across basic ops
– Example: signal/image processing
58
Future work
• Application area: signal/image processing
– compositions of transforms, such as FFTs
– multiple data representations
– important to optimize the flow of data through memory hierarchies
– IBM ESSL: FFT at close to peak performance
– algebra: Kronecker products
59
Future interests
[Diagram: “Sparse compiler” at the intersection of Compilers, Computational Science, and Databases; related areas: database programming systems, data mining]