Relational Query Processing Approach to Compiling Sparse Matrix Codes


Transcript of Relational Query Processing Approach to Compiling Sparse Matrix Codes

Page 1: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Relational Query Processing Approach to Compiling Sparse Matrix Codes

Vladimir Kotlyar

Computer Science Department, Cornell University
http://www.cs.cornell.edu/Info/Project/Bernoulli

Page 2: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

– Sparse matrix computations

– Importance of sparse matrix formats

– Difficulties in the development of sparse matrix codes

• State-of-the-art restructuring compiler technology

• Technical approach and experimental results

• Ongoing work and conclusions

Page 3: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Sparse Matrices and Their Applications

• Number of non-zeroes per row/column << n

• Often, less than 0.1% non-zero

• Applications:
– Numerical simulations, (non)linear optimization, graph theory, information retrieval, ...

[Figure: example sparse matrix — most entries zero]

Page 4: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Application: numerical simulations

• Fracture mechanics Grand Challenge project:

– Cornell CS + Civil Eng. + other schools;

– supported by NSF, NASA, Boeing

• A system of differential equations is solved over a continuous domain

• Discretized into an algebraic system in variables x(i)

• System of linear equations Ax=b is at the core

• Intuition: A is sparse because the physical interactions are local

[Figure: crack.eps — fracture mesh (MATLAB)]

Page 5: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Application: Authoritative sources on the Web

• Hubs and authorities on the Web

• Graph G=(V,E) of the documents

• A(u,v) = 1 if (u,v) is an edge

• A is sparse!

• Eigenvectors of AAᵀ identify hubs, authorities and their clusters (“communities”) [Kleinberg, Raghavan ‘97]

[Figure: hubs pointing to authorities in the Web graph]

Page 6: Relational Query Processing Approach to Compiling Sparse Matrix Codes

Sparse matrix algorithms

• Solution of linear systems
– Direct methods (Gaussian elimination): A = LU
  • Impractical for many large-scale problems
  • For certain problems: O(n) space, O(n) time
– Iterative methods
  • Matrix-vector products: y = Ax
  • Triangular system solution: Lx = b
  • Incomplete factorizations: A ≈ LU

• Eigenvalue problems:
– Mostly matrix-vector products + dense computations

Page 7: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Sparse matrix computations

• “DOANY” -- operations in any order
– Vector ops (dot product, addition, scaling)
– Matrix-vector products
– Rarely used: C = A+B
– Important: C ← A+B, A ← A + UV

• “DOACROSS” -- dependencies between operations
– Triangular system solution: Lx = b

• More complex applications are built out of the above + dense kernels

• Preprocessing (e.g. storage allocation): “graph theory”

Page 8: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

– Sparse matrix computations

– Sparse Matrix Storage Formats

– Difficulties in the development of sparse matrix codes

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 9: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Storing Sparse Matrices

• Compressed formats are essential
– O(nnz) time/space, not O(n²)
– Example: matrix-vector product
  • 10M rows/columns, 50 non-zeroes/row
  • 5 seconds vs 139 hours on a 200 Mflops computer (assuming huge memory)

• A variety of formats are used in practice
– Application/architecture dependent
– Different memory usage
– Different performance on RISC processors

Page 10: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Point formats

Example 4-by-4 matrix:

a 0 b 0
c 0 d 0
e f 0 g
h i 0 j

Coordinate:
row: 1 1 2 3 2 3 4 3 4 4
col: 3 1 3 4 1 2 4 1 2 1
val: b a d g c f j e i h

Compressed Column Storage:
colp:   1 5 7 9 11
rowind: 1 2 3 4 3 4 1 2 3 4
vals:   a c e h f i b d g j
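The two point formats above can be sketched in executable form (a minimal Python sketch; 0-based indices, unlike the 1-based arrays on the slide; function and variable names are illustrative):

```python
# Convert coordinate format (three parallel arrays) to Compressed
# Column Storage (CCS): colp[j]..colp[j+1] delimits column j's
# entries in rowind/vals.

def coo_to_ccs(n, rows, cols, vals):
    # Sort entries by column, then by row within a column.
    order = sorted(range(len(vals)), key=lambda k: (cols[k], rows[k]))
    rowind = [rows[k] for k in order]
    ccs_vals = [vals[k] for k in order]
    # colp[j+1] first counts column j's entries; a prefix sum then
    # turns counts into starting positions.
    colp = [0] * (n + 1)
    for c in cols:
        colp[c + 1] += 1
    for j in range(n):
        colp[j + 1] += colp[j]
    return colp, rowind, ccs_vals

# The 4-by-4 example matrix with entries a..j (0-based indices):
rows = [0, 0, 1, 2, 1, 2, 3, 2, 3, 3]
cols = [2, 0, 2, 3, 0, 1, 3, 0, 1, 0]
vals = ['b', 'a', 'd', 'g', 'c', 'f', 'j', 'e', 'i', 'h']
colp, rowind, ccs_vals = coo_to_ccs(4, rows, cols, vals)
print(colp)      # → [0, 4, 6, 8, 10]
print(ccs_vals)  # → ['a', 'c', 'e', 'h', 'f', 'i', 'b', 'd', 'g', 'j']
```

Note how the coordinate form stores entries in arbitrary order, while CCS fixes a column-major enumeration order.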

Page 11: Relational Query Processing Approach to Compiling Sparse Matrix Codes

Block formats

[Example: 6-by-6 matrix with entries a…t grouped into 2-by-2 blocks]

• Block Sparse Column

• “Natural” for physical problems with several unknowns at each point in space

• Saves storage: 25% for 2-by-2 blocks

• Improves performance on modern RISC processors

bcolp:   1 3 4 5
browind: 1 3 2 1
values:  a e b f x… h… c… (each non-zero block stored as a dense 2-by-2 array)
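A matrix-vector product over such a blocked layout can be sketched as follows (Python; the names bcolp/browind/bvals and the example data are illustrative assumptions, not taken from the slide — only one index per block is stored, which is where the storage saving comes from):

```python
def bsc_matvec(n, b, bcolp, browind, bvals, x):
    """y = A*x for a block-sparse-column matrix with fixed b-by-b blocks."""
    y = [0.0] * n
    for J in range(n // b):                      # block column J
        for k in range(bcolp[J], bcolp[J + 1]):  # stored blocks of column J
            I = browind[k]                       # block row index
            blk = bvals[k]                       # dense b*b block, row-major
            for r in range(b):
                for c in range(b):
                    y[I * b + r] += blk[r * b + c] * x[J * b + c]
    return y

# 4-by-4 example: non-zero 2-by-2 blocks at block positions (0,0) and (1,1).
bcolp   = [0, 1, 2]
browind = [0, 1]
bvals   = [[1, 2, 3, 4], [5, 0, 0, 6]]
y = bsc_matvec(4, 2, bcolp, browind, bvals, [1, 1, 1, 1])
print(y)  # → [3.0, 7.0, 5.0, 6.0]
```

The two inner loops over a dense b-by-b block are also what gives blocked formats their RISC-friendly, unit-stride inner kernels.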

Page 12: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Why multiple formats: performance

• Sparse matrix-vector product

• Formats: CRS, Jagged diagonal, BlockSolve

• On IBM RS6000 (66.5 MHz Power2)

• Best format depends on the application (20-70% advantage)

[Chart: Mflops for CRS, JDIAG, and Bsolve formats]

Page 13: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Bottom line

• Sparse matrices are used in a variety of application areas

• Have to be stored in compressed data structures

• Many formats are used in practice
– Different storage/performance characteristics

• Code development is tedious and error-prone
– No random access
– Different code for each format
– Even worse in parallel (many ways to distribute the data)

Page 14: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Libraries

• Dense computations: Basic Linear Algebra Subroutines (BLAS)
– Implemented by most computer vendors
– Few formats, easy to parametrize: row/column-major, symmetric/unsymmetric, etc.

• Other computations are built on top of BLAS

• Can we do the same for sparse matrices?

Page 15: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Sparse Matrix Libraries

• Sparse Basic Linear Algebra Subroutine (SPBLAS) library [Pozo, Remington @ NIST]
– 13 formats ==> too many combinations of “A op B”
– Some important ops are not supported
– Not extensible

• Coarse-grain solver packages [BlockSolve, Aztec, …]
– Particular class of problems/algorithms (e.g. iterative solution)
– OO approaches: hooks for basic ops (e.g. matrix-vector product)

Page 16: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Our goal: generate sparse codes automatically

• Permit user-defined sparse data structures

• Specialize high-level algorithm for sparsity, given the formats

FOR I=1,N
  sum = sum + X(I)*Y(I)

⇓

FOR I=1,N such that X(I)≠0 and Y(I)≠0
  sum = sum + X(I)*Y(I)

⇓ executable code

Page 17: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Input to the compiler

• FOR-loops are sequential

• DO-loops can be executed in any order (“DOANY”)

• Convert dense DO-loops into sparse code:

DO I=1,N; J=1,N
  Y(I) = Y(I) + A(I,J)*X(J)

⇓

for (j = 0; j < N; j++)
  for (ii = colp(j); ii < colp(j+1); ii++)
    Y(rowind(ii)) = Y(rowind(ii)) + vals(ii)*X(j);
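The generated CCS loop can be mirrored directly (a minimal Python sketch, 0-based, with numeric values substituted for the letters of the earlier 4-by-4 example):

```python
def ccs_matvec(n, colp, rowind, vals, x):
    # Enumerate column by column, exactly as in the generated code:
    # colp[j]..colp[j+1] delimits the stored entries of column j.
    y = [0.0] * n
    for j in range(n):
        for ii in range(colp[j], colp[j + 1]):
            y[rowind[ii]] += vals[ii] * x[j]
    return y

# CCS arrays for the earlier example with a=1, b=2, ..., j=10.
colp   = [0, 4, 6, 8, 10]
rowind = [0, 1, 2, 3, 2, 3, 0, 1, 2, 3]
vals   = [1, 3, 5, 8, 6, 9, 2, 4, 7, 10]
result = ccs_matvec(4, colp, rowind, vals, [1, 1, 1, 1])
print(result)  # → [3.0, 7.0, 18.0, 27.0]
```

With x all ones, the result is just the row sums of the matrix, which makes the sketch easy to check by hand.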

Page 18: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 19: Relational Query Processing Approach to Compiling Sparse Matrix Codes


An example: locality enhancement

• Matrix-vector product, array A stored in column-major order

FOR I=1,N
  FOR J=1,N
    Y(I) = Y(I) + A(I,J)*X(J)      (stride-N access to A)

• Would like to execute the code as:

FOR J=1,N
  FOR I=1,N
    Y(I) = Y(I) + A(I,J)*X(J)      (stride-1 access to A)

• In general?

Page 20: Relational Query Processing Approach to Compiling Sparse Matrix Codes

An abstraction: polyhedra

• Loop nests == polyhedra in integer spaces

FOR I=1,N
  FOR J=1,I
    …

  { (i,j) : 1 ≤ i ≤ N, 1 ≤ j ≤ i }

• Transformations

• Used in production and research compilers (SGI, HP, IBM)

[Figure: iteration-space polyhedron and its transformed image in the (i,j) plane]

Page 21: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Caveat

• The polyhedral model is not applicable to sparse computations

FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN
      Y(I) = Y(I) + A(I,J)*X(J)

  { (i,j) : 1 ≤ i ≤ N, 1 ≤ j ≤ N, A(i,j) ≠ 0 }

• Not a polyhedron

What is the right formalism?

Page 22: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Extensions for sparse matrix code generation

FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN
      Y(I) = Y(I) + A(I,J)*X(J)

• A is sparse, compressed by column

• Interchange the loops, encapsulate the guard

FOR J=1,N
  FOR I=1,N such that A(I,J) ≠ 0

...

• “Control-centric” approach: transform the loops to match the best access to data [Bik, Wijshoff]

Page 23: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Limitations of the control-centric approach

• Requires well-defined direction of access

CCS → (J,I) loop order
CRS → (I,J) loop order
COORDINATE → ????

Page 24: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 25: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Data-centric transformations

• Main idea: concentrate on the data

DO I=…; J=…
  ….. A(F(I,J)) …..

• Array access function: <row,column> = F(I,J)

• Example: coordinate storage format

[Figure: coord.fig — the coordinate storage data structure]

Page 26: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Data-centric sparse code generation

• If only a single sparse array:

FOR <row,column,value> in A
  I = row; J = column
  Y(I) = Y(I) + value*X(J)

• For each data structure provide an enumeration method

• What if more than one sparse array?
– Need to produce efficient simultaneous enumeration
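The single-sparse-array case above, specialized to coordinate storage, might look like this (a Python sketch; 0-based, with illustrative numeric values — the point is that A's entries can be walked in whatever order they are stored):

```python
def coo_matvec(n, rows, cols, vals, x):
    # Data-centric enumeration: one pass over A's stored triples,
    # binding I=row, J=column, and accumulating into Y.
    y = [0.0] * n
    for i, j, a in zip(rows, cols, vals):
        y[i] += a * x[j]
    return y

# Coordinate arrays of a 4-by-4 example matrix (entries in no particular order).
rows = [0, 0, 1, 2, 1, 2, 3, 2, 3, 3]
cols = [2, 0, 2, 3, 0, 1, 3, 0, 1, 0]
vals = [2, 1, 4, 7, 3, 6, 10, 5, 9, 8]
result = coo_matvec(4, rows, cols, vals, [1, 1, 1, 1])
print(result)  # → [3.0, 7.0, 18.0, 27.0]
```

Because y = Ax is a “DOANY” computation, the enumeration order of the triples does not matter, so any storage format with an enumeration method will do.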

Page 27: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Efficient simultaneous enumeration

DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
    sum = sum + X(I)*Y(I)

• Options:
– Enumerate X, search Y: “data-centric on” X
– Enumerate Y, search X: “data-centric on” Y
– Can speed up searching by scattering into a dense vector
– If both sorted: “2-finger” merge

• Best choice depends on how X and Y are stored

• What is the general picture?

[Figure: dot2.fig — enumeration strategies for the sparse dot product]
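Two of the options above can be sketched concretely (Python; sparse vectors represented as sorted index/value lists, an illustrative choice):

```python
def dot_merge(xi, xv, yi, yv):
    # "2-finger" merge: both index lists sorted ascending; advance
    # whichever finger points at the smaller index.
    s, p, q = 0.0, 0, 0
    while p < len(xi) and q < len(yi):
        if xi[p] == yi[q]:
            s += xv[p] * yv[q]
            p += 1
            q += 1
        elif xi[p] < yi[q]:
            p += 1
        else:
            q += 1
    return s

def dot_scatter(xi, xv, yi, yv, n):
    # Scatter X into a dense workspace, then enumerate Y and
    # "search" by direct indexing (O(1) per lookup).
    w = [0.0] * n
    for i, v in zip(xi, xv):
        w[i] = v
    return sum(w[i] * v for i, v in zip(yi, yv))

xi, xv = [0, 2, 5], [1.0, 2.0, 3.0]
yi, yv = [2, 3, 5], [4.0, 5.0, 6.0]
print(dot_merge(xi, xv, yi, yv))       # → 26.0
print(dot_scatter(xi, xv, yi, yv, 6))  # → 26.0
```

The merge needs both operands sorted; the scatter trades O(n) workspace for constant-time searches, which is exactly the trade-off the compiler must weigh per format.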

Page 28: Relational Query Processing Approach to Compiling Sparse Matrix Codes


An observation

DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
    sum = sum + X(I)*Y(I)

• Can view arrays as relations (as in “relational databases”)

X(i,x) Y(i,y)

• Have to enumerate solutions to the relational query

Join(X(i,x), Y(i,y))

[Figure: dot-relations.fig — sparse vectors viewed as relations]

Page 29: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Connection to relational queries

• Dot product → Join(X,Y)

• General case?

Dot product ↔ Equi-join
Enumerate/Search ↔ Enumerate/Search
Scatter ↔ Hash join
“2-finger” ↔ (Sort) Merge join

Page 30: Relational Query Processing Approach to Compiling Sparse Matrix Codes


From loop nests to relational queries

DO I, J, K, ...
  ….. A(F(I,J,K,...)) ….. B(G(I,J,K,...)) …..

• Arrays are relations (e.g. A(r,c,a))

– Implicitly store zeros and non-zeros

• Integer space of loop variables is a relation, too: Iter(i,j,k,…)

• Access predicate S: relates loop variables and array elements

• Sparsity predicate P: “interesting” combination of zeros/non-zeros

Select(P, Select(S ∧ Bounds, Product(Iter, A, B, …)))

Page 31: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Why relational queries?

[Relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other

E. F. Codd (CACM, 1970)

• Want to separate what is to be computed from how

Page 32: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Bernoulli Sparse Compilation Toolkit

• BSCT is about 40K lines of ML + 9K lines of C

• Query optimizer at the core

• Extensible: new formats can be added

[Diagram: Input program → Front-end → Query → Optimizer → Plan → Instantiator → Low-level C code. A library of formats (CRS, CCS, BRS, Coordinate, …) supplies abstract properties to the Optimizer and macros to the Instantiator.]

Page 33: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Query optimization: ordering joins

select(a≠0 ∧ x≠0, join(A(i,j,a), X(j,x), Y(i,y)))

A in CCS: Join(Join(A,X), Y)
  FOR J ∈ Join(A,X)
    FOR I ∈ Join(A(*,J), Y)

A in CRS: Join(Join(A,Y), X)
  FOR I ∈ Join(A,Y)
    FOR J ∈ Join(A(I,*), X)
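The CCS plan (outer join of A's columns with X, inner enumeration of each column A(*,J) updating Y) can be sketched as follows (Python; representing the sparse vector x as a dict is an illustrative assumption):

```python
def spmv_plan_ccs(n, colp, rowind, vals, x):
    # Join(Join(A, X), Y) for A in CCS: enumerate only the columns J
    # with x[J] != 0, then the stored entries of column J,
    # accumulating into the (sparse) result Y.
    y = {}
    for j, xj in x.items():                    # FOR J in Join(A, X)
        for ii in range(colp[j], colp[j + 1]): # FOR I in Join(A(*,J), Y)
            i = rowind[ii]
            y[i] = y.get(i, 0.0) + vals[ii] * xj
    return y

# CCS arrays of the earlier 4-by-4 example (values 1..10);
# x has non-zeros only at positions 0 and 2.
colp   = [0, 4, 6, 8, 10]
rowind = [0, 1, 2, 3, 2, 3, 0, 1, 2, 3]
vals   = [1, 3, 5, 8, 6, 9, 2, 4, 7, 10]
y = spmv_plan_ccs(4, colp, rowind, vals, {0: 1.0, 2: 1.0})
print(y)  # → {0: 3.0, 1: 7.0, 2: 5.0, 3: 8.0}
```

With A in CRS the roles of the joins flip: the outer join runs over rows of A and Y, and X is searched in the inner loop, which is why the optimizer must pick the join order from the storage format.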

Page 34: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Query optimization: implementing joins

FOR I ∈ Join(A,Y)
  FOR J ∈ Join(A(I,*), X)
    …

⇓

H = scatter(X)
FOR I ∈ Merge(A,Y)
  FOR J ∈ enumerate A(I,*), search H
    …

• Output is called a plan

Page 35: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Instantiator: executable code generation

H = scatter X
FOR I ∈ Merge(A,Y)
  FOR J ∈ enumerate A(I,*), search H
    …

⇓

…
for (I = 0; I < N; I++)
  for (JJ = ROWP(I); JJ < ROWP(I+1); JJ++)
    …

• Macro expansion

• Open system

Page 36: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Summary of the compilation techniques

• Data-centric methodology: walk the data, compute accordingly

• Implementation for sparse arrays
– arrays = relations, loop nests = queries

• Compilation path
– Main steps are independent of data structure implementations

• Parallel code generation
– Ownership, communication sets, ... = relations

• Difference from traditional relational database query optimization
– Selectivity of predicates not an issue; affine joins

Page 37: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Experiments

• Sequential
– Kernels from SPBLAS library
– Iterative solution of linear systems

• Parallel
– Iterative solution of linear systems
– Comparison with the BlockSolve library from Argonne NL
– Comparison with the proposed “High-Performance Fortran” standard

Page 38: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Setup

• IBM SP-2 at Cornell

• 120 MHz P2SC processor at each node
– Can issue 2 multiply-add instructions per cycle
– Peak performance 480 Mflops
– Much lower on sparse problems: < 100 Mflops

• Benchmark matrices
– From Harwell-Boeing collection
– Synthetic problems

Page 39: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Matrix-vector products

• BSR = “Block Sparse Row”

• VBR = “Variable Block Sparse Row”

• BSCT_OPT = Some “Dragon Book” optimizations by hand
– Loop invariant removal

[Charts: BSR/MV and VBR/MV — Mflops vs block size (5 to 25) for LIB, BSCT, BSCT_OPT]

Page 40: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Solution of Triangular Systems

• Bottom line:
– Can compete with the SPBLAS library (need to implement loop invariant removal :-)

[Charts: BSR/TS and VBR/TS — Mflops vs block size (5 to 25) for LIB, BSCT, BSCT_OPT]

Page 41: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Iterative solution of sparse linear systems

• Essential for large-scale simulations

• Preconditioned Conjugate Gradients (PCG) algorithm
– Basic kernels: y = Ax, Lx = b + dense vector ops

• Preprocessing step
– Find M such that M⁻¹A ≈ I
– Incomplete Cholesky factorization (ICC): A ≈ CCᵀ
– Basic kernels: A ← A − uvᵀ, sparse vector scaling
– Can not be implemented using the SPBLAS library

• Used CCS format (“natural” for ICC)

Page 42: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Iterative Solution

• ICC: a lot of “sparse overhead”

• Ongoing investigation (at MathWorks):
– Our compiler-generated ICC is 50-100 times faster than the Matlab implementation!!

[Chart: ICC/PCG — Mflops vs matrix size]
Matrix size:  2000  4000  8000  16000
ICC (Mflops): 5.86  4.22  2.9   1.9
PCG (Mflops): 48    47    46    40

Page 43: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Iterative solution (cont.)

• Preliminary comparison with IBM ESSL DSRIS
– DSRIS implements PCG (among other things)

• On BCSSTK30; have set values to vary the convergence

• BSCT ICC takes 1.28 secs

• ESSL DSRIS preprocessing (ILU+??) takes ~5.09 secs

• PCG iterations are ~15% faster in ESSL

Page 44: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Parallel experiments

• Conjugate Gradient algorithm

– vs BlockSolve library (Argonne NL)

• “Inspector” phase

– Pre-computes what communication needs to occur

– Done once, might be expensive

• “Executor” phase

– “Receive-compute-send-...”

• On Harwell-Boeing matrices

• On synthetic grid problems to understand scalability

Page 45: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Executor performance

• Grid problems: problem size per processor is constant
– 135K rows, ~4.6M non-zeroes

• Within 2-4% of the library

[Charts: Executor time (seconds) vs number of processors for Bsolve and BSCT — BCSSTK32 (2-16 processors) and grid problems (2-64 processors)]

Page 46: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Inspector overhead

• Ratio of the inspector to a single iteration of the executor
– A problem-independent measure

• “HPF-2” -- new data-parallel Fortran standard
– Lowest-common denominator; inspectors are not scalable

[Charts: Inspector-to-executor ratio vs number of processors for Bsolve, BSCT, and HPF-2 — BCSSTK32 (2-16 processors) and grid problems (2-64 processors)]

Page 47: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Experiments: summary

• Sequential
– Competitive with SPBLAS library

• Parallel
– Inspector phase should exploit formats (cf. HPF-2)

Page 48: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Outline

• Problem statement

• State-of-the-art restructuring compiler technology

• Technical approach and experiments

• Ongoing work and conclusions

Page 49: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Ongoing work

• Packaging
– “Library-on-demand”; as a Matlab toolbox

• Parallel code generation
– Extend to handle more kernels

• Core of the compiler
– Disjunctive queries, fill

Page 50: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Ongoing work

• Packaging
– “Library-on-demand”; as a Matlab toolbox
– Completely automatic tool; data structure selection

• Out-of-core computations

• Parallel code generation
– Extend to handle more kernels

• Core of the compiler
– Disjunctive queries, fill

Page 51: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Related work - compilers

• Polyhedral model [Lamport ’78, …]

• Sparse compilation [Bik,Wijshoff ‘92]

• Support for sparse computations in HPF-2 [Saltz, Ujaldon et al]
– Fixed data structures
– Separate compilation path for dense/sparse

• Data-centric blocking [Kodukula,Ahmed,Pingali ‘97]

Page 52: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Related work - languages and compilers

• Polyhedral model [Lamport ’78, …]

• Parallelizing compilers/parallel languages

– e.g. HPF, ZPL

• Transformational programming [Gries]

• Software engineering through views [Novak]

• Sparse compilation [Bik,Wijshoff ‘92]

• Support for sparse computations in HPF-2 [Saltz,Ujaldon et al]

• Data-centric blocking [Kodukula,Ahmed,Pingali ‘97]

Page 53: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Related work - languages and compilers

• Compilation of dense matrix (regular) codes
– Polyhedral model [Lamport ‘78, ...]
– Data-parallel languages (e.g. HPF, ZPL)

• Compilation of sparse matrix codes
– [Bik, Wijshoff] -- sequential sparse compiler
– [Saltz, Zima, Ujaldon, …] -- irregular computations in HPF-2
– Fixed data structures, not extensible

• Programming with ADTs
– SETL -- automatic data structure selection for set operations
– Transformational systems (e.g. Polya [Gries])
– Software reuse through views [Novak]

Page 54: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Related work - databases

• Optimizing loops in DB programming languages [Lieuwen, DeWitt ‘90]

• Extensible database systems [Predator, …]

[Diagram: Predator — per-type query optimizers (Relations, Sequences, Images, …) built over common utilities; BSCT — a single compiler over sparse formats (CRS, COORD, BSR, …) built over common utilities]

Page 55: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Conclusions/Contributions

• Sparse matrix computations are widely used in practice

• Code development is tedious and error-prone

• Bernoulli Sparse Compilation Toolkit:
– Arrays as relations, loops as queries
– Compilation as query optimization

• Algebras in optimizing compilers

Data-flow analysis → Lattice algebra
Dense matrix computations → Polyhedral algebra
Sparse matrix computations → Relational algebra

Page 56: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future work

• Sparse compiler
– Completely automatic system, data structure selection
– Out-of-core computations

• Compilation of signal/image processing applications
– dense matrix computations + fast transforms (e.g. DCT)
– multiple data representations

• Programming languages and databases
– e.g. Java and JDBC

Page 57: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future work

• Sparse compiler
– Completely automatic system, data structure selection
– Out-of-core computations

• Extensible compilation
– Want to support multiple formats, optimizations across basic ops
– Example: signal/image processing

Page 58: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future work

• Application area: signal/image processing

– compositions of transforms, such as FFTs

– multiple data representations

– important to optimize the flow of data through memory hierarchies

– IBM ESSL: FFT at close to peak performance

– algebra: Kronecker products

Page 59: Relational Query Processing Approach to Compiling Sparse Matrix Codes


Future interests

[Diagram: the sparse compiler sits at the intersection of Compilers, Databases, and Computational Science; related interests: database programming systems, data mining]